[08:22:52] * taavi looking at syslog-server-audit02
[08:24:50] quick review? https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/862
[08:25:22] taavi: +1d, and accidentally merged
[08:25:37] whoops, I'll deploy then, ty
[08:25:59] 👍 /me goes to get coffee
[11:08:10] dhinus: seems like something in the new toolsdb replica process needs updating to no longer use the vlan/legacy network
[12:23:13] more loki collection rollout: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/864
[12:25:37] taavi: ack, thanks for noticing. maybe it's the cookbook wmcs.vps.create_instance_with_prefix? I'll have a look
[12:33:21] hmm, I managed to use gerritlab to create a stack in gitlab, but it didn't add the stack links in the MR description: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/59
[12:34:16] taavi: are you using https://gitlab.wikimedia.org/repos/releng/gerritlab or something else for your gitlab MRs?
[12:36:19] if you all start using different MRs for many small patches at once, we might want to look into skipping CI steps and other optimizations; otherwise, for repos that generate images/packages, it will generate one for each stacked patch (potentially reaching the point where the image generated for the first patch gets pushed out by the images from the later patches)
[12:37:55] dhinus: https://github.com/yaoyuannnn/gerritlab is what I use. didn't know we had a fork these days
[12:38:44] dcaro: good point, I'm just experimenting at this stage, so I would wait before changing the CI. on tofu repos it might actually be useful to run the CI separately, so you can break down "tofu apply" into multiple steps and reduce risk.
[12:39:12] taavi: ack, I assumed the fork had some tweak for our GitLab instance, but it looks like the upstream works better :)
[12:48:38] using the upstream https://github.com/yaoyuannnn/gerritlab worked fine!
[12:48:54] can I get a review for https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/60 and https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/61?
[12:55:44] taavi: re: vlan/legacy, the cookbook create_instance_with_prefix just reuses the same network as the previous host...
[12:55:52] I think we can manually migrate tools-db-6 to the new network
[12:56:10] i wonder whether it's worth adding a special case to the cookbook for this
[12:56:44] I think we'll create more tools-db-X in the future, so manually fixing tools-db-6 should be enough. maybe also tools-db-4 for consistency.
[12:57:04] then the cookbook will do the right thing for the future tools-db-X
[12:57:23] how do I manually migrate a host? do I need to shut it down?
[12:59:04] if you want to send a patch for the cookbook I'm not against it, I'm just lazy :) (also I don't have much time today)
[12:59:29] I'm actually not sure if we can migrate an existing VM
[12:59:45] ah, I thought you and arturo did migrate other ones
[12:59:51] but then I'm misremembering
[13:00:37] in theory you need to remove the old port and then create a new one in the new network, but I'm not sure if the DNS record automation, for example, would work with that
[13:00:41] is it an issue if tools-db remains on vlan/legacy for a few more months?
[13:01:12] I think I'm happy as long as no Trixie-based VMs will get created in vlan/legacy
[13:01:24] sgtm. I'll open a task anyway.
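(Editor's note: the port swap sketched at 13:00:37 would look roughly like the following with the openstack CLI. This is an untested sketch only: the network name is a placeholder, the instance may need to be stopped first, and the DNS record automation concern raised above is not addressed here.)

```
# Hedged sketch of the manual migration discussed above: swap the instance's
# vlan/legacy port for a port in the new network. Untested; names are placeholders.
SERVER=tools-db-6            # the instance from the discussion
NEW_NET='<new-network>'      # placeholder for the non-vlan/legacy network name

# find the port currently attached to the instance
OLD_PORT=$(openstack port list --server "$SERVER" -f value -c ID)

# create a replacement port in the new network, then swap them
NEW_PORT=$(openstack port create --network "$NEW_NET" "${SERVER}-port" -f value -c id)
openstack server remove port "$SERVER" "$OLD_PORT"
openstack server add port "$SERVER" "$NEW_PORT"
```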
[13:07:54] Could someone look at this project request and comment or +1? https://phabricator.wikimedia.org/T398254
[13:08:20] (seems valid to me)
[13:10:52] taavi: T398625
[13:10:53] T398625: [wmcs-cookbooks] create_instance_with_prefix should not use vlan/legacy - https://phabricator.wikimedia.org/T398625
[13:10:58] not sure it was what you had in mind, feel free to edit
[13:11:11] it is, thanks
[13:20:40] today I'm seeing this error multiple times while running "tofu plan" from the CI: "Error while installing terraform-provider-openstack/openstack v3.0.0". maybe it's just the opentofu registry having some issues... retrying seems to work
[13:31:18] hmm, I have not seen that before
[13:34:33] once more: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/jobs/554616
[13:34:58] │ net/http: TLS handshake timeout
[13:35:06] hmm... that's not nice, it's not even at the http level
[13:35:34] it's from github directly
[13:35:47] is github down, is there a cloud vps network issue, or is there some other issue with that runner? :/
[13:36:46] or any intermediate thing in between :)
[13:38:32] the original artifact url works ok from my laptop (https://github.com/terraform-provider-openstack/terraform-provider-openstack/releases/download/v3.0.0/terraform-provider-openstack_3.0.0_linux_amd64.zip), but the one resolved there gets me a 401 (still connects TLS though)
[13:38:39] hmm, looking at the build history, all of the failures were on a single runner, runner-1031.gitlab-runners.eqiad1.wikimedia.cloud
[13:38:49] so that might suggest the last option
[13:40:32] curling on the runner looks ok, maybe the k8s/container network is botched?
[13:41:10] I need a +1 for this stupid typo :) https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/62
[13:41:37] per https://phabricator.wikimedia.org/T398628 that's a brand new runner, and also one of the first in the new dualstack network
[13:41:56] hmm, let me force ip4/6
[13:42:35] ip6 does not get to the server
[13:42:39] https://www.irccloud.com/pastebin/Dlt17Sjv/
[13:42:46] maybe it's trying to use ip6?
[13:42:53] slightly different failure, on the same runner: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/jobs/554642
[13:43:29] that one is ip4, and the issue is the response `│ dial tcp 172.16.16.6:5668: connect: connection refused`
[13:43:34] as in, port not open?
[13:44:19] that port is open to the entire world though
[13:45:25] Be warned that the gitlab runners have extra-weird network setups, see https://phabricator.wikimedia.org/T397888
[13:45:48] like, blocked outbound ports I think
[13:45:49] so something is returning the RST packet that shows up as "connection refused" instead :/
[13:46:09] I suspect this is an issue for the gitlab folks to look at, and not us :/
[13:46:10] hmm... yep, a firewall would do that
[13:47:09] I don't see the 5668 port on that list of allowed ports, so it's probably being blocked, yep. +1 for gitlab folks giving it a look
[13:48:13] reminds me that I want to eventually move that API to the normal 443 port :-)
[13:48:19] I think I even have a patch for that somewhere
[13:49:35] xd
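(Editor's note: the triage above — forcing each address family, then reading the instant "connection refused" as an active reject — can be reproduced roughly like this. A hedged sketch assuming curl and netcat are available on the runner; the URL, host, and port are the ones quoted in the log.)

```
# Force each address family separately to see which path is broken.
URL=https://github.com/terraform-provider-openstack/terraform-provider-openstack/releases/download/v3.0.0/terraform-provider-openstack_3.0.0_linux_amd64.zip
curl -4 -sS -I -o /dev/null -w 'ipv4: %{http_code}\n' "$URL"   # worked from the runner
curl -6 -sS -I -o /dev/null -w 'ipv6: %{http_code}\n' "$URL"   # did not reach the server

# An immediate "connection refused" means something actively sent a RST
# (here, likely a firewall); a silent drop would hang until the timeout.
nc -vz -w 5 172.16.16.6 5668
```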
Zigo updated the neutron packages in the osbpo, I'm guessing there's a mistake in his build someplace but I don't understand what it is yet...
[14:25:57] you can see the wreckage on cloudnet2005-dev.codfw.wmnet
[14:26:09] systemd-timesyncd : Depends: systemd but it is not going to be installed
[14:26:13] but of course systemd is installed
[14:26:29] how did you trigger that?
[14:26:51] puppet
[14:26:51] or apt install systemd-timesyncd
[14:27:28] maybe it needs a newer version?
[14:27:30] https://www.irccloud.com/pastebin/d73FGS9h/
[14:27:36] I don't think this is osbpo specific
[14:27:44] is it happening everywhere?
[14:28:14] `apt show systemd-timesyncd` has `Depends: libsystemd-shared (= 252.38-1~deb12u1)`, but we have `252.36-1~deb12u1` installed
[14:29:15] now the next question is why doesn't apt try to update systemd
[14:29:24] some pins?
[14:29:43] well, the second question is why is systemd-timesyncd not installed
[14:30:03] hmmm
[14:30:12] is it required for anything if we don't use any timer?
[14:30:20] timesyncd is the ntp client
[14:30:23] so yes, we need it
[14:30:40] hmm... then the issue would have happened during install/reimage, right?
[14:30:42] I ran a cookbook on that host which iirc does 'apt-get upgrade'
[14:30:43] the apt history log file says something ran `apt-get dist-upgrade -y --allow-downgrades -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold` on the box, which removed it
[14:30:51] oh
[14:32:17] well, I still don't get why it chose to remove systemd-timesyncd over upgrading systemd
[14:32:39] so I should be able to just install systemd-timesyncd=252.36-1~deb12u1 and make it happy
[14:32:43] hmm... can the pinning do so? did it remove timesyncd?
[14:32:44] since that's the state of e.g. cloudcontrol1007
[14:33:21] nope, that says it needs to downgrade libsystemd0 and won't
[14:33:54] no... it fails because it depends on a version of libsystemd0 that is already installed
[14:35:41] Am I... confused?
[14:35:44] https://www.irccloud.com/pastebin/OApZmGcx/
[14:37:47] I fixed this with `sudo apt-get install systemd=252.38-1~deb12u1`, i.e. bringing systemd forward instead of going back
[14:38:15] tbh a cookbook doing an unattended `dist-upgrade` might not be the best idea
[14:39:00] would you advise just 'upgrade'? Or should I change it to actually enumerate all the packages of interest? (That might be 100 packages)
[14:40:16] I think just `upgrade` should be fine here
[14:40:27] ok, I'll change the cookbook
[14:40:35] the man page says dist-upgrade "[...] will attempt to upgrade the most important packages at the expense of less important ones if necessary" :D
[14:40:53] I also tried to install systemd but I guess I needed to specify the version. Thank you for sorting!
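(Editor's note: a sketch of the apt forensics behind the diagnosis above, on a Debian bookworm host; the package versions are the ones quoted in the log, and the exact commands are the editor's reconstruction, not necessarily what was run.)

```
# Simulate the failing install to see the unmet dependency chain (-s makes no changes).
apt-get install -s systemd-timesyncd

# Compare installed vs candidate versions of the suspects, and check for pins.
apt-cache policy systemd systemd-timesyncd libsystemd-shared libsystemd0

# Find what removed the package: dist-upgrade may remove packages to resolve
# conflicts, plain upgrade never does.
grep -B3 'Remove:' /var/log/apt/history.log

# The actual fix from the log: move systemd forward instead of downgrading.
sudo apt-get install systemd=252.38-1~deb12u1
```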
[14:43:34] fyi, added a note on how to resolve a paws uuid -> mediawiki account in the wiki (from the chat history, thanks bd80.8!) https://wikitech.wikimedia.org/wiki/PAWS/Admin#Translate_from_PAWS_uuid_to_SUL_account_name
[14:44:15] thanks dcaro !
[14:44:20] taavi, https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1166214/1/cookbooks/wmcs/openstack/cloudcontrol/upgrade_openstack_node.py
[14:44:33] ^ totally untested
[14:54:45] ok, the patched/upgraded metadata agent just doesn't start at all :(
[15:49:19] oops taavi, I told that user to use the project request ticket to request the project closure
[15:49:29] I don't think we have a different process for project removal
[15:49:50] context is at the bottom of https://phabricator.wikimedia.org/T398405
[15:50:07] yeah, but I just want a separate ticket instead of re-purposing the original one, which is ancient at this point
[15:52:55] that's fine with me, I just feel bad for the user following my instructions
[16:40:55] dcaro: I'm afraid the .deb pipeline is broken with an error that points to the multiarch work: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/jobs/554797
[16:41:02] do you want to have a look, or want me to do it?
[16:41:21] oh no! let me have a quick look
[16:41:32] I did build the components-cli though
[16:44:28] this should be the fix: gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/66
[16:44:37] (I think, it's at least "a" fix xd)
[16:44:51] +1
[16:45:23] ok, merged, can you try rebuilding?
[16:45:32] yeah, doing
[16:46:12] oh right, gitlab caches the manifest when you retry. one second
[16:47:42] yep, much better: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/jobs/554813
[16:47:45] thanks!!
[16:48:03] (side note, I wonder why the weird escaping in `+ [[ amd64 != \a\m\d\6\4 ]]`)
[16:51:02] I think that's the `set -x` interpreting the line, but yep, no idea why it shows it with escapes instead of quotes
[17:11:18] * dcaro off
[17:11:33] cya! o/
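(Editor's note: the escaping puzzled over at 16:48 is standard bash xtrace behavior: inside `[[ ]]`, a quoted right-hand side of `!=` is matched as a literal string rather than a glob pattern, and `set -x` marks that by backslash-escaping each character instead of printing the quotes. A minimal demonstration, assuming the CI script does a quoted comparison along these lines:)

```
set -x
arch=amd64                                   # hypothetical stand-in for the CI's arch variable
[[ $arch != "amd64" ]] && echo "not amd64"
# trace output: + [[ amd64 != \a\m\d\6\4 ]]
set +x
```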