[07:53:03] * arturo online
[08:53:24] i've upgraded grafana.wmcloud.org following the upgrade of grafana.wikimedia.org yesterday, please let me know if you see any issues
[08:56:10] trove apparently does not let you specify an IPv6 host for a user T393760
[08:56:10] T393760: trove: Unable to create user with IPv6 address as host - https://phabricator.wikimedia.org/T393760
[09:41:57] * arturo brb
[09:56:19] where does https://github.com/toolforge/tf-infra-test live these days? the git repo is shown as archived
[09:56:32] ah, https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tf-infra-test
[09:57:39] dhinus: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tf-infra-test/-/merge_requests/3
[09:58:51] taavi: thanks, +1d
[09:59:06] merged
[09:59:09] we could upgrade the openstack provider as well
[09:59:11] do i need to manually deploy it somewhere?
[09:59:53] hmm good question. I think it can be pulled from the internet?
[10:00:05] ah no you mean your change?
[10:00:15] yes, let me find the hostname
[10:00:50] root@tf-bastion.tofuinfratest.eqiad1.wikimedia.cloud
[10:02:08] done
[10:02:20] thanks
[10:04:05] the cronjob will run in 2 hours, we can do a manual test run first with -target to check if the proxy works fine
[10:08:43] ha the cronjob was likely to fail because we need "tofu init -upgrade" to fetch the new version of the provider, and the cronjob script does "tofu init"
[10:10:50] * taavi does that manually
[10:10:59] already done :)
[10:11:03] ah
[10:11:26] I'm tempted to do a full run of the script to check if everything works (except magnum likely)
[10:11:33] fine by me
[10:11:46] ok I'll start it
[10:18:05] tofu is working fine, but magnum has the usual issues, 3 heat stacks are in "CREATE_FAILED"
[10:21:00] curl timing out similar to the other day
[10:24:28] I added the heat logs to T392031
[10:24:29] T392031: openstack magnum (or heat) resource leak - https://phabricator.wikimedia.org/T392031
[10:41:38] hmpf I tried deleting the cluster but it is now stuck in "DELETE_IN_PROGRESS"
[10:45:05] ok it completed the deletion eventually
[11:21:55] I'm struggling with T393686
[11:21:56] T393686: tofu-provisioning: factorize gitlab pipeline logic - https://phabricator.wikimedia.org/T393686
[11:32:34] I don't know if it's known or announced, when I do "ssh login.toolforge.org" it fails with "Connection closed by 185.15.56.62 port 22" which the IP resolves to "instance-tools-bastion-13.tools.wmcloud.org."
[11:32:57] my apologies if I missed an announcement
[11:33:32] (also the IP doesn't resolve, it's the other way around obvs. need more coffee)
[11:35:22] Amir1: T393732
[11:35:23] T393732: Toolforge bastion sssd/LDAP flakiness (May 2025) - https://phabricator.wikimedia.org/T393732
[11:35:43] Amir1: maybe try `dev.toolforge.org` meanwhile
[11:35:51] thanks
[11:36:27] now I forgot why I wanted to login
[11:36:53] (I should change my name to Droi
[11:36:58] *Dori
[11:39:57] this time I don't see anything weird running on bastion-12
[11:40:09] we're probably due for a reboot of that machine at some point
[12:21:31] taavi: another user report in #-cloud, we may want to reboot now?
[12:24:12] the system doesn't respond to my ssh
[12:24:18] I'll just force reboot it
[12:26:44] from the console log, I see the oomkiller working
[12:26:46] https://www.irccloud.com/pastebin/hEpLvfYg/
[12:27:20] uid=56519(tools.iran-national-library
[12:28:32] was using 1.33GB of memory that 7zz
[12:28:42] isn't that something that systemd should have prevented?
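
Side note on the cgroups question above: systemd can enforce a per-user memory cap through a drop-in on the user-.slice template, which is the kind of guard that would have contained a runaway 7zz inside that user's slice instead of letting it exhaust the bastion. The sketch below is illustrative only: the drop-in path is standard systemd, but the limit values are invented and this is not the bastions' actual Puppet-managed configuration.

# Illustration only -- example values, not the real Toolforge bastion config.
# A drop-in on the user-.slice template applies to every user-UID.slice, so the
# OOM killer acts within the offending user's slice rather than system-wide.
sudo mkdir -p /etc/systemd/system/user-.slice.d
sudo tee /etc/systemd/system/user-.slice.d/50-memory.conf <<'EOF' >/dev/null
[Slice]
MemoryMax=1G
MemorySwapMax=0
TasksMax=512
EOF
sudo systemctl daemon-reload
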
[12:30:40] oh, right tools-bastion-13 is login.t.o
[12:31:16] so I guess both bastions are rebooted now :-S
[12:32:07] arturo: yeah, I would have expected the cgroups we have to prevent a single user from hammering all the resources
[12:46:16] arturo: remind me why the network tests service user needs toolforge admin access?
[12:46:41] taavi: it verifies access to wiki-replicas and a few other toolforge specific things
[12:46:56] more than "needs" is "nice to have"
[12:46:58] why does that need admin level access?
[12:47:22] I don't think it needs admin-level privs
[12:48:10] ok, let's try removing that and see what breaks?
[12:49:18] sure
[12:49:35] removed srv-networktests from https://toolsadmin.wikimedia.org/tools/id/admin
[12:49:46] are you running the cookbook or should I?
[12:49:53] please do
[12:50:27] doing
[12:57:26] seemingly tools-bastion-13 is already broken again
[12:57:55] by broken I mean showing the same sssd issues we saw earlier
[13:28:28] arturo: what do you think about trying to move bastion-13 (login.) to the codfw ldap replicas?
[13:51:19] In theory that supports failover doesn't it? So you could add codfw replica as the second choice (or even as the first with eqiad as the fallback)
[13:51:29] I don't remember why we don't do that already, maybe it doesn't work :/
[14:21:58] taavi: I don't have a strong opinion
[14:22:01] it could work!
[14:36:26] taavi: do you remember what's up with `docker-registry.svc.toolforge.org` vs `docker-registry.tools.wmflabs.org` vs `docker-registry.tools.wmcloud.org`?
[14:36:36] in particular, which one should I use for a new image
[14:37:16] i think docker-registry.tools.wmcloud.org is not used anywhere and should be removed
[14:37:37] docker-registry.svc.toolforge.org is what new things should be using, but most old things still use docker-registry.tools.wmflabs.org because changing that name is hard
[14:37:41] iirc
[14:39:26] https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/21
[14:39:39] when I access https://docker-registry.svc.toolforge.org/ I get a 400
[15:03:31] andrewbogott: if you have spare time, T393760 looks like an upstream bug
[15:03:31] T393760: trove: Unable to create user with IPv6 address as host - https://phabricator.wikimedia.org/T393760
[15:04:04] I'm already looking at a different upstream openstack bug but I'll save that one for later!
[17:03:17] predictably, it was a 'past Andrew' bug and not an 'upstream openstack' bug
[17:05:41] can I get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143871 ?
[17:15:07] done
[17:17:20] thanks!
[17:18:26] actually the () are redundant, let me fix that
[17:23:01] andrewbogott: just realized that we used to have clouddb1021, is that gonna cause troubles?
[17:23:05] I'll mention that in the task
[17:23:14] it's still in netbox
[17:23:28] Best to not ever re-use old hostnames even if the hosts are 100% gone.
[17:23:44] so we should change T393733
[17:23:45] T393733: Q4:rack/setup/install clouddb102[1-4] - https://phabricator.wikimedia.org/T393733
[17:23:47] So I guess those will need to be 2022-2025
[17:23:48] yeah
[17:24:00] should I comment and let dcops update the task?
[17:24:22] nah, you can update it. Those boxes are still on order aren't they?
[17:24:59] So no one is invested in the names yet
[17:28:38] yep not arrived yet AFAIU
[17:30:47] task updated, I'll send a follow-up patch
[17:31:20] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143876
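
On the LDAP failover question from [13:51:19]: sssd's LDAP provider does support listing several servers and failing over between them in order, so a codfw replica could in principle be added as a second (or first) choice for bastion-13. A minimal sketch follows, assuming sssd is the identity backend on the bastions; the hostnames are placeholders, not the real replica names, and the exact config layout on the bastions may differ.

# Sketch only -- placeholder hostnames. sssd tries ldap_uri entries left to right
# and fails over when a server stops answering; ldap_backup_uri is an alternative
# way to express the fallback list.
sudo grep -n '^ldap_uri' /etc/sssd/sssd.conf
# e.g. ldap_uri = ldap://ldap-replica-codfw.example.org, ldap://ldap-replica-eqiad.example.org
sudo sss_cache -E            # invalidate cached entries after editing the server list
sudo systemctl restart sssd
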