[08:04:17] fyi, I'm reimaging cloudcephosd1006, I'll try to upgrade to quincy
[08:26:58] hmpf.... partman fails :/
[08:27:08] https://www.irccloud.com/pastebin/WfS8zhxG/
[08:27:12] that rings a bell
[08:34:11] o/
[08:34:20] seems like I've missed some fun while away
[08:34:53] yep xd, are you back today?
[08:35:44] mostly yeah
[08:35:54] welcome back :)
[08:37:05] ceph had another hiccup when upgrading, it seems that ceph quincy + bookworm just starts using memory until everything crashes
[08:38:21] that sounds not great :D
[08:40:18] how was debconf?
[08:42:32] great overall, although it's a relatively long conference so I got quite exhausted towards the end
[08:44:34] welcome back taavi!
[08:44:53] dcaro: did cloudcephosd1006 behave even worse after last night's reimage?
[08:45:27] not worse, but just similar, it actually looked ok disk-usage-wise
[08:45:33] but it started using RAM and did not stop
[08:45:44] I thought on the previous reimages, it took 1 day or so before crashing
[08:45:52] this time it took only a few hours
[08:46:29] maybe rebalancing uses more ram?
[08:48:19] https://phabricator.wikimedia.org/T399858#11022860
[08:48:44] there was an interesting drop around 20:00 UTC
[08:48:51] then it spiked again
[08:50:51] that might have been me killing things
[08:51:01] I took it out of the pool during the night
[08:53:00] I had to restart a few of the osds that were taking >10G to let the drain finish before hitting the ram issues
[08:57:34] that makes sense
[10:00:32] I was able to reimage (with a lot of pain xd), I think the next step is to try to upgrade to quincy on bullseye
[11:38:25] fyi, we are moving the cables of a couple ceph nodes and a cloudvirt, I acked the ceph warning alert, should be quick and painless
[11:46:41] finished :), there's a couple kernel error tasks I'll close, and the ceph cluster registered a couple slow heartbeats, that will clear in a bit (it retains them for 5m or so)
[11:59:39] nice!
[12:29:54] can I get a +1 on https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/882 ? it's a temporary quota bump to get a user unstuck until we stop doing dryRun creation of jobs
[12:32:47] xd, calendar got confused, and it says I declined, but shows I did not, so I'm not sure now
[12:32:50] https://usercontent.irccloud-cdn.com/file/XGnEVxUm/image.png
[13:02:36] LOL I see the same thing, declined message, but green check mark!
[13:10:09] yeah that happens when you don't clear the note
[13:29:45] dcaro: you're moving 2006 to bullseye/quincy?
[13:31:10] s/2006/1006/
[13:33:56] only bullseye for now
[13:34:10] adding it to the cluster, we might want to start the quincy upgrade with codfw
[13:34:22] ok
[13:34:48] Yesterday dhinus and I talked about the option of trying bookworm+quincy as an experiment, to confirm that we have a path forward with bookworm.
[13:35:34] but we can try that in codfw1 first.
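A rough sketch of the kind of restart mentioned at 08:53:00 above, i.e. restarting any ceph-osd daemon whose resident memory has grown past ~10 GiB. The threshold and the stock ceph-osd@<id> unit naming are assumptions, not a record of the exact commands that were run:

```bash
#!/bin/bash
# Sketch: restart ceph-osd daemons using more than ~10 GiB of resident memory.
# Assumes the standard ceph-osd@<id>.service units; the threshold is illustrative.
threshold_kb=$((10 * 1024 * 1024))  # 10 GiB in kB (ps reports RSS in kB)

ps -C ceph-osd -o rss=,args= | while read -r rss args; do
    if [ "$rss" -gt "$threshold_kb" ]; then
        # the OSD id follows the --id flag on the daemon command line
        id=$(printf '%s\n' "$args" | sed -n 's/.*--id \([0-9]\+\).*/\1/p')
        echo "restarting ceph-osd@${id} (rss ${rss} kB)"
        systemctl restart "ceph-osd@${id}.service"
    fi
done
```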
[13:40:40] from what I've seen, there's no bookworm quincy packages
[13:41:05] oh, true
[13:41:19] I wrote https://gerrit.wikimedia.org/r/c/operations/puppet/+/1171279 last night but I have no idea if it will work
[13:41:54] I think it might create the repo, but be empty
[13:42:01] but we can try :)
[13:42:25] yes, that'll just create an empty component
[13:42:28] https://www.irccloud.com/pastebin/AbrYgkkg/
[13:43:09] I assume we also need to copy the packages somehow
[13:43:16] wb taavi
[13:43:19] o/
[13:43:42] upstream there's only bullseye https://download.ceph.com/debian-quincy/dists/
[13:44:00] oh, ok :/ I was assuming that the lack of the package was an us problem not an upstream one.
[13:44:17] if there were packages, you would then need a corresponding block in `modules/aptrepo/files/updates` (which then needs to be added to the `Updates:` list in distributions-wikimedia)
[13:44:59] meanwhile, there's this https://gerrit.wikimedia.org/r/c/operations/puppet/+/1171285
[13:45:27] which, I can't really tell why the pcc doesn't like it; maybe because that host was reimaged and puppetdb isn't populated yet?
[13:46:18] taavi, can I derail you with a totally different thing? Regarding project-proxy certs
[13:46:32] yes!
[13:46:44] I built a new acme-chief host (to support ipv6) and it looked 100% happy but it actually installed snakeoil certs rather than real certs.
[13:47:01] I think that the pcc error is just missing the import from puppetdb
[13:47:02] [ 2025-07-21T21:55:52 ] CRITICAL: Unexpected error running run_host: Unable to find fact file for: cloudcephosd2006-dev.codfw.wmnet under directory /var/lib/catalog-differ/puppet
[13:47:19] > for: cloudcephosd2006-dev.codfw.wmnet
[13:47:24] I haven't gone back to see if that was just a transitional thing while acme caught up... wondering if you understand/would anticipate that
[13:47:26] that space at the front is really suspicious looking
[13:48:21] andrewbogott: which hosts?
[13:48:41] old acme host is project-proxy-acme-chief-02.project-proxy.eqiad1.wikimedia.cloud (which has not been working for a while)
[13:48:46] new is project-proxy-acme-chief-03.project-proxy.eqiad1.wikimedia.cloud
[13:49:06] also, by "built a new acme-chief host" do you mean going through the hiera dance to add it as passive and sync over the certs, or just added it and hoped for the best?
[13:49:32] see, you have already explained the problem :)
[13:49:51] Since the old certs were almost expired anyway, I expected it to make fresh certs and for that to be fine.
[13:49:55] Can you tell me more?
[13:52:34] looking at the logs of -03, it failed to issue certs due to an issue when talking to lets encrypt, which matches with https://letsencrypt.status.io/pages/incident/55957a99e800baa4470002da/687e8d62b8a4e804fad85799
[13:52:34] taavi: right again, removed the leading spaces from the `Hosts` line in the commit message and the pcc worked (that or it was able to import that node in the meantime)
[13:52:34] https://puppet-compiler.wmflabs.org/output/1171285/7079/
[13:53:37] dcaro: I had hosts:..hostname for one of them? I just now checked for vertical alignment but maybe my eyes deceived me
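As a side note on the snakeoil-certificate symptom discussed above, a quick way to see which certificate a proxy is actually serving is an openssl probe along these lines (the hostname below is a placeholder, not one from the conversation):

```bash
# Show issuer, subject and validity dates of the certificate being served;
# a self-signed "snakeoil" cert stands out immediately in the issuer line.
# proxy.example.wmcloud.org is a placeholder hostname.
echo | openssl s_client -connect proxy.example.wmcloud.org:443 \
    -servername proxy.example.wmcloud.org 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates
```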
[13:53:58] just vertical, the ` Hosts: ...`
[13:54:31] And for some reason that broke some hosts but not others
[13:54:36] probably the grep/sed/regex it uses is a bit flaky
[13:54:40] it was the first of the list
[13:54:48] we can try reverting the change xd
[13:55:13] eh, I'm not that curious
[13:55:53] taavi: does that mean that all of our discussion yesterday about ipv4 and ipv6 was the wrong track and it was just that letsencrypt was down?
[13:55:55] * andrewbogott checks 02
[13:56:28] 02 is still saying 'Unable to validate challenge' sometimes
[13:56:36] 02 is logging "DNS server 2a02:ec80:a000:4000::2 (ACMEChallengeValidation.UNKNOWN) failed to validate challenge" which sounds very much like a v6 issue :/
[13:57:30] -03 was logging "Service busy; retry later."
[13:57:52] ok
[13:58:04] So... maybe 03 will work now if I reactivate it.
[13:58:11] But I will save that for when you're not in a meeting
[13:59:43] I changed `profile::acme_chief::active` to -03 and it started issuing new certificates now
[13:59:58] cool
[14:00:31] give it some time to issue certs (and the certs to age enough for clients with an out-of-sync clock) and then we can switch `acmechief_host`
[14:02:09] in general, when changing acme-chief hosts you should add the new host to `profile::acme_chief::passive` first, to get the existing certificates synced to the new host first
[14:02:50] otherwise you won't be able to take advantage of acme-chief's auto-aging feature for the first certs issued by the new host, and risk serving the snakeoil certificates if something goes wrong
[14:09:56] thanks taavi!
[14:18:19] andrewbogott: I just realized that if project-proxy-acme-chief-02 fails to issue certs because it doesn't have v6 connectivity, then the same issue is going to appear on the other existing acme-chief servers sooner or later
[14:18:41] yep
[14:18:51] I'll make a task to go through the whole list
[14:20:46] T400163
[14:20:47] T400163: Rebuild all cloud-vps acme-chief hosts - https://phabricator.wikimedia.org/T400163
[14:21:53] Seems to be only 6 hosts total
[14:23:02] I had a feeling we'd seen this before, and indeed: T245937
[14:23:07] T245937: tools-acme-chief-01 is attempting to validate DNS challenge against cloud authdns IPv6 addresses - https://phabricator.wikimedia.org/T245937
[14:23:16] so there's a workaround of setting `authdns_servers:` to v4 addresses if needed
[14:24:39] that's probably worth doing for the projects that we don't actively manage (just traffic I guess)
[14:25:29] or if we want to procrastinate a few weeks and only rebuild them to trixie once that is out
[14:26:13] (I have a terrible habit of not doing anything when I could wait some time for $THING to happen that'll make the original task easier, only to then find the next $THING to stall the original project on)
[14:29:45] That's very efficient, especially if we never actually do the work at all
[14:30:59] In this case, I'm definitely happy to skip the rebuild if there's a one-line fix
[14:35:08] huh, upgrading cloudcephosd2004-dev to quincy has done something very strange and bad to the systemd units
[16:44:36] * dhinus off
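Regarding the systemd-unit weirdness on cloudcephosd2004-dev mentioned at 14:35:08, a first look would usually be something like the following generic sketch (not commands or output from that host):

```bash
# List all ceph-related units and unit files, and pull recent OSD logs,
# to see what state the quincy upgrade left the units in on this host.
systemctl list-units 'ceph*' --all --no-pager
systemctl list-unit-files 'ceph*' --no-pager
journalctl -u 'ceph-osd@*' --since today --no-pager | tail -n 50
```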