[07:53:09] morning!
[08:17:04] o/
[09:49:42] is https://grafana-rw.wikimedia.org down for everyone or just me?
[09:50:30] hmm works in a private window, cookie issue then
[09:50:56] works again :)
[10:03:53] xd
[11:46:45] cloudinfra-cloudvps-puppetserver-1 is down, looking
[11:46:47] (from alert)
[11:50:31] console opened up, but it's barely responsive
[11:50:41] I think I'll just reboot it
[11:53:28] ok, vm up and running again
[11:56:40] it seems it got out of memory, the first suspicious log I see is about the certmanager user failing to login
[12:08:40] right before it deployed new puppet code (cron sync)
[13:03:24] * dcaro paged looking
[13:03:40] project proxy down
[13:04:27] oh, went away
[13:06:53] everything looks ok now, let me look at the logs
[13:18:43] proxy-5 logs show nothing interesting that I could see, maybe the issue was not on the cloudvps side but the prometheus/network side
[13:22:03] logs show nothing on proxy-6 either :/
[13:46:21] the conntrack issue is more mysterious than I thought, the max value set in /etc/sysctl.d/70-nova_conntrack.conf is not loaded correctly
[13:46:40] does this ring any bell? Couldn't write '33554432' to 'net/netfilter/nf_conntrack_max'
[13:46:48] more details at T399050
[13:46:50] T399050: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050
[13:47:45] hmm.... weird, maybe it's hitting some limit on the kernel side? as in, maybe there's some kernel compilation option needed or something?
[13:48:17] I //think// the kernel module is not loaded yet
[13:48:25] but it is loaded now if I check with lsmod
[13:48:58] oh, interesting
[13:49:07] (just saw the task link)
[13:52:43] OK, it's time for my weekly "Why doesn't my silence actually silence things?" question:
[13:52:56] https://usercontent.irccloud-cdn.com/file/4WSTQzd3/Screenshot%202025-07-10%20at%208.51.56%E2%80%AFAM.png
[13:53:16] xd
[13:53:30] what did I do wrong this time? Forget how a regex works?
[13:53:39] andrewbogott: we could use it as an icebreaker game instead of redactle :P
[13:53:47] yeah :/
[13:53:54] what is it that you wanted to silence?
[13:54:03] regex golf was a thing no?
[13:54:04] xd
[13:54:15] I'm reimaging cloudcephosd1xxx nodes
[13:54:18] https://alf.nu/RegexGolf?world=regex&level=r00
[13:54:21] and want it to not email me about them being down
[13:54:32] oh, the email might be different
[13:54:38] oooh
[13:54:55] (not 100% sure)
[13:54:59] they still come from alert manager don't they?
[13:55:13] I think they come from icinga?
[13:55:15] (tbh if it's just emailing /me/ that seems fine, but I assume that everyone else has a few dozen of those messages now)
[13:55:21] if the sender is nagios@alert1002.wikimedia.org
[13:55:22] I got a bunch tonight
[13:55:46] I downtimed them in icinga too but it looks like that expired. Let me try that...
[13:56:40] So does that mean that my alert manager downtime is actually correct?
[13:56:46] andrewbogott: if you use the reimage cookbook, that removes any silence after it reboots :)
[13:57:08] shouldn't it also silence before reimaging though?
[13:57:21] I'm doing firmware upgrades too, but all on cumin1003 so it should have the power to prevent emails...
[13:57:25] true that
[13:58:00] the alertmanager one looks ok to me, there would be the 'ceph cluster warning' that might not match (as it does not have the instance)
[13:58:47] oh, good point. I guess that's why when you did it you just used .*ceph.* as your regex
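For reference, a silence like the one discussed above can also be created from the CLI with amtool; a minimal sketch only — the alertmanager URL, duration, and host/alert patterns below are illustrative assumptions, not the exact silence that was used:

```
# Sketch: silence alerts for the OSD hosts being reimaged by matching on the
# instance label (host pattern and duration are made up for illustration).
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --comment='cloudcephosd reimages' \
  --duration=8h \
  'instance=~"cloudcephosd1.*"'

# Alerts that carry no instance label (e.g. a cluster-wide ceph warning) need
# their own matcher, for example on the alert name:
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --comment='cloudcephosd reimages' \
  --duration=8h \
  'alertname=~".*[Cc]eph.*"'
```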
[13:59:20] that was filtering on the service label yep
[13:59:44] I'm not sure though if the host down alert has the service=ceph set
[14:02:26] tools-puppetserver-01 is alerting again :/
[14:02:41] let me look, I did something that should not have broken anything
[14:24:18] hmm, it seems it does not rebase anymore, it cherry-picks, so having manually cherry-picked a commit will end up in an error (it did not before, as the rebase would skip it if it's already applied)
[14:24:34] good to know :/
[14:24:38] sorry for the noise xd
[14:24:43] np
[14:34:07] the conntrack issue is probably a repeat of the old T136094, that one was fixed by adding a dependency to "class ferm" in puppet. but cloudvirts don't use ferm, so I wonder how they ever worked :D
[14:34:08] T136094: Race condition in setting net.netfilter.nf_conntrack_tcp_timeout_time_wait - https://phabricator.wikimedia.org/T136094
[14:34:13] maybe we just got lucky with the previous reboots
[14:35:15] we don't want ferm in cloudvirts, right? I will send a patch to replicate the fix for cloudvirt hosts
[14:38:53] it's installed at least
[14:38:59] https://www.irccloud.com/pastebin/w5WGVFKk/
[14:39:08] wait, that's a cloudcontrol
[14:39:38] yep, installed in cloudcontrols, not installed in cloudvirts
[14:39:40] yep, no ferm in cloudvirts
[14:39:55] andrewbogott: do you remember why?
[14:40:23] it clashes with the firewall rules managed by OpenStack to provide access to the VMs
[14:40:29] because neutron manages the network there, it's creating and destroying connections willy-nilly
[14:41:06] ack
[14:41:49] moritzm: I'll copy your fix from T136094 and apply it to cloudvirts... other non-ferm hosts (if they exist) will still be broken though
[14:41:50] T136094: Race condition in setting net.netfilter.nf_conntrack_tcp_timeout_time_wait - https://phabricator.wikimedia.org/T136094
[14:42:13] unless you can think of a better place to put the fix
[14:42:27] fyi `systemctl restart systemd-sysctl.service` also reloads the values
[14:43:14] I also wonder what is loading the module nf_conntrack if not ferm
[14:43:16] dhinus: I don't remember the finer details, it has been eight years, but please add me as reviewer and I'll refresh my memory when reviewing
[14:44:25] ok thanks :)
[14:47:01] /etc/modprobe.d/options-nf_conntrack.conf seems to load it, added by puppet
[14:48:36] hmm, is a line with "options" enough to load the module?
[14:54:03] good point, not sure
[14:55:37] I created T399212
[14:55:37] T399212: nf_conntrack_max is not set at boot in cloudvirts - https://phabricator.wikimedia.org/T399212
[14:59:25] I suspect it might be openvswitch
[14:59:36] I was writing the same :)
[14:59:45] https://www.irccloud.com/pastebin/6EbvnAxB/
[15:00:20] so handled outside the modprobe stuff probably
[15:31:45] dcaro: can you tell what's going on with cloudcephosd1006? I reimaged it yesterday, it was working until a few minutes ago. I rebooted and now...
[15:31:48] https://www.irccloud.com/pastebin/diPgN0pA/
[15:32:05] hmm, let me check
[15:33:11] The network seems to be working for that host otherwise...
[15:34:27] andrewbogott: I stopped, reset-failed, and started it
[15:34:35] my guess is that the network started after the osd tried to start
[15:34:42] (one guess xd at least)
[15:34:46] hm, that happened twice in a row
[15:34:59] might be a race condition
[15:35:06] guess I'll keep an eye out on other OSDs. Planning to decom that one soon anyway.
[15:35:07] thank you!
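A rough sketch of the manual recovery described above (stop, reset-failed, start), with osd 40 as in this incident; the list-units and `ceph -s` checks are just the usual way to confirm, not commands quoted from the log:

```
# Clear the failed state of the OSD unit that lost the race with the network
# and start it again (osd id 40 is the one from this incident).
systemctl list-units 'ceph-osd@*' --state=failed --no-pager
systemctl stop ceph-osd@40.service
systemctl reset-failed ceph-osd@40.service
systemctl start ceph-osd@40.service
# From a mon/admin node, confirm the OSDs are back up and in:
ceph -s
```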
[15:37:24] it seems to be coming up
[15:37:33] was there any change or something?
[15:38:05] I just restarted all the others
[15:38:28] The big change was a reimage to bookworm yesterday. Just now, all I did was reset all failed and then restart
[15:39:05] as in the services only?
[15:39:08] no reboot?
[15:39:10] right
[15:39:18] well, wait
[15:39:24] it seems it rebooted
[15:39:26] before I pinged you I already tried rebooting the system twice
[15:39:47] did it fail after the reimage? or did it start failing out of the blue?
[15:40:00] I reimaged and activated yesterday...
[15:40:04] it worked fine for about 12 hours
[15:40:09] (I see that one of the interfaces took almost a minute to start)
[15:40:13] https://www.irccloud.com/pastebin/FbNgSyfG/
[15:40:24] then all the osds switched to 'down' and the console was frozen
[15:40:29] so I rebooted
[15:40:32] and then saw what you saw
[15:40:36] so then rebooted again, no improvement
[15:40:39] then I pinged you
[15:40:41] I think that's the whole story
[15:41:33] hmmm
[15:41:37] yeah
[15:41:53] I was figuring oom killed, since I'd seen that before on bookworm. But not this time
[15:42:09] so it failed because of something else probably, there's a crash in the cluster crash log saying it failed to respond to heartbeats (only the osd 40 though)
[15:42:33] annoying if the reboot is not bringing everything up though
[15:42:41] yeah
[15:42:48] well, if it's just that one node that's fine, can just decom
[15:43:14] I'll take a break from reimaging to make sure this doesn't happen elsewhere
[15:43:52] yep, there was a race condition at boot
[15:44:01] the osd tried to start before the interface was ready
[15:44:02] Jul 10 15:29:15 cloudcephosd1006 systemd[1]: ceph-osd@40.service: Failed with result 'exit-code'.
[15:44:16] (that's a bit before the interface was ready)
[15:44:39] maybe the systemd dependency is on having internet or something, and once any interface comes up the osd tries to come up
[15:44:50] (another guess xd)
[15:44:58] but that does not explain the first failure :/
[15:45:28] the cluster crash log is from today
[15:45:40] `root@cloudcephmon1006:~# ceph crash info 2025-07-10T15:13:04.678521Z_8e802088-26a6-42a2-8ff4-f5a7581074c6`
[15:46:10] so that does not explain the error yesterday either :/
[15:46:31] yeah, it crashed just now while I was reimaging a different node so of course I ignored the alert
[15:46:45] hahahaha xd coincidences
[15:47:53] hmm. it seems to have been working somewhat during the night though
[15:48:28] I think it was working fine until 15:13
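For reference, the crash being discussed can be listed and read back with ceph's crash module; a small sketch (the crash id is the one pasted a few lines above, and `ceph crash archive` is shown only as the usual follow-up once a crash has been triaged):

```
# List crashes the cluster has recorded but not yet archived
ceph crash ls-new
# Full report for the one pasted above
ceph crash info 2025-07-10T15:13:04.678521Z_8e802088-26a6-42a2-8ff4-f5a7581074c6
# Once triaged, archive it so it no longer counts as a recent crash
ceph crash archive <crash-id>
```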
[15:48:29] when did this happen? `then all the osds switched to 'down' and the console was frozen`
[15:48:52] oh, that's actually the timestamp of the crash in the cluster, that makes sense
[15:49:03] (as in it matches, I mean)
[15:49:03] not sure exactly but probably 15:13, it was while I was watching this meeting
[15:49:07] yeah
[15:49:45] the last log before the first reboot in
[15:49:47] *is
[15:49:47] Jul 10 15:07:47 cloudcephosd1006 ceph-osd[16930]: 2025-07-10T15:07:47.392+0000 7f5a1b7fe6c0 1 osd.40 pg_epoch: 72588459 pg[8.c85( v 72588440'1062760>
[15:49:57] there's nothing after that timestamp from the osd service
[15:50:23] and from the whole log it is
[15:50:24] Jul 10 15:11:13 cloudcephosd1006 ceph-osd[17078]: 2025-07-10T15:08:16.584+0000 7f45c0ff96c0 -1 osd.41 72588464 heartbeat_check: no reply from 10.64.2>
[15:50:36] so it stopped writing to disk at all
[15:50:39] that's not good :/
[15:50:54] there's
[15:50:54] btw I just updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167708, hopefully understanding what you suggested
[15:50:55] Jul 10 15:07:57 cloudcephosd1006 kernel: md: resync of RAID array md1
[15:51:28] hm, what does 'resync of RAID' mean?
[15:52:08] not sure, my guess is that the raid got messed up at some point
[15:52:58] oh, did you want me to specify vlan_id /and/ interface name, in case some future scenario uses both?
[15:53:50] yep :)
[15:53:56] ok, one more try...
[15:54:08] that probably simplifies the puppet code
[15:54:09] xd
[15:54:10] okok
[15:54:29] the OS raid misbehaving would match the "nothing written to the logs" issue
[15:54:31] https://www.irccloud.com/pastebin/tNOHwGF0/
[15:55:12] from what I can read, the resync happens when the raid has a failure/issue and starts recovering
[15:57:27] https://www.irccloud.com/pastebin/V8oDK3VR/
[15:57:35] md1 is pending resync
[15:58:29] what is that md1?
[15:59:43] md2 is the one that holds the os volumes (pvs shows)
[15:59:49] md0 is boot
[16:01:07] md0 and md1 are (I thought) raided together to make the os drive
[16:01:30] oh now, wait...
[16:01:36] that's some artifact of the partman recipe...
[16:01:55] md1 is swap
[16:02:03] md1 is swap
[16:02:04] yep
[16:02:08] https://www.irccloud.com/pastebin/Nbd8e51l/
[16:06:07] hmm.... so yep, currently my best guess is that the md2 raid somehow had an issue and everything got stuck, then the reboot brought the network up slightly late, enough that the osd gave up trying to start before the network was completely up, and that made the osds fail
[16:06:58] and somehow md1 is still not fully recovered (missing a resync)
[16:07:22] now, this being a software raid, it might be a software issue too, so not sure it's hardware
[16:09:54] is md1 still /trying/ to recover? I can also just reimage everything again...
[16:10:52] I think it might
[16:11:08] https://www.irccloud.com/pastebin/ObKkTpCg/
[16:11:13] it's pending the resync
[16:12:11] So maybe it will sort itself out...
[16:12:22] btw I updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167708 again
[16:13:04] that's what I meant yep :)
[16:13:06] thanks
[16:13:28] ok, probably I will merge and see what breaks, after lunch
[16:15:25] xd
[16:15:44] I might not be around though, just fyi. have to go grocery shopping and such
[16:16:03] smart values look ok to me also on the devices
[16:16:13] Well, it's for a new node so the likely scenario is just that that node doesn't work.
[16:17:48] hmm... I suspect that the swap is read only until the sync finishes
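A quick sketch of how the RAID state discussed above can be checked on the host (array names are the ones from this thread: md0 is /boot, md1 is swap, md2 holds the OS volume group):

```
# Overall state of the software RAID arrays (clean / resyncing / resync pending)
cat /proc/mdstat
# Per-array detail, including whether a resync is pending or in progress
mdadm --detail /dev/md1
mdadm --detail /dev/md2
```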
[16:18:00] https://www.irccloud.com/pastebin/XgHnrVUL/
[16:19:45] it seems it will become read-write, and resync when the first write comes in
[16:19:48] weird
[16:19:48] It seems like if we need the swap partition then something terrible is already happening
[16:20:05] manually forced it just in case
[16:20:08] mdadm --readwrite /dev/md1
[16:20:14] it started resyncing
[16:20:54] swap is useful even if you don't 'need it' badly, but yep, if you need it badly it's probably not good already
[16:22:59] it finished resyncing
[16:23:01] well
[16:23:50] now that we (really you) know what happened, does that make you think that I should proceed with reimages, or still be nervous about reimages? There's nothing obviously tied to the upgrade in that unless it's a bookworm bug
[16:24:14] I'm kinda annoyed about the inability for the reboot to come up cleanly
[16:24:44] maybe it will now that the raid is repaired...
[16:24:46] want me to try?
[16:24:46] the reimage for now feels ok yep, if you want we can wait until monday or something to see if it repeats
[16:25:12] I suspect it might not be the raid that slows it down, but sure let's try :)
[16:25:39] ok, there it goes!
[16:25:42] btw, how much faster are the reimages now?
[16:26:06] topranks: can you think of any risk to merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167708 other than it not working for that already not-in-service node?
[16:26:43] dcaro: joining the cluster after the reimage only takes a minute or two, as opposed to I guess all day before. so that's better!
[16:27:02] Unfortunately the reimage process seems to require a disk wipe and firmware upgrade so it's still pretty tedious.
[16:27:31] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Ceph#Upgrading_OSD_nodes
[16:28:40] that's a great improvement :)
[16:29:04] it is! It's not like we didn't have to do the firmware upgrade before
[16:29:07] network is still not up
[16:29:12] https://www.irccloud.com/pastebin/BuE7knUu/
[16:29:26] so it wasn't the raid
[16:29:31] yep
[16:30:00] maybe it needs a reseat?
[16:30:15] hmm... the interface is down, is that because of a switch/interface config mismatch or something?
[16:30:19] now it came up
[16:30:35] https://www.irccloud.com/pastebin/V7BT8Kmb/
[16:30:35] topranks: now we have two questions :)
[16:30:51] I'll get those osds online in the meantime
[16:31:42] andrewbogott: no, I don't think there is much risk. the way the patch was done it should not affect any existing hosts
[16:31:55] let me see if I can see what's wrong though
[16:32:06] which host was it you were working on?
[16:32:14] topranks: so these are two different topics, sorry
[16:32:18] the patch is for 1048 but not merged yet
[16:32:27] ok yep np
[16:32:29] the definitely unrelated issue with the nic is cloudcephosd1006
[16:32:40] which takes like 100 years for the second nic to come up
[16:32:47] after a reboot
[16:33:08] ens3f1np1
[16:33:58] it's up now
[16:34:00] no idea tbh
[16:34:15] want me to reboot it so you can see it misbehave?
[16:34:21] it's not working now actually
[16:34:30] ok, there you go :)
[16:34:35] it's just the universe telling us to use the primary link for both of these :)
[16:34:45] xd
[16:35:02] ok, I'll get someone in the dc to reseat and then will decom if it continues to misbehave
[16:36:52] thanks for looking topranks
[16:37:19] ohh.... I think it's puppet
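A small sketch of the kind of check that helps here: compare what ifupdown is configured to bring up with what is actually live on the host (both interface names are the ones that show up for cloudcephosd1006 in this log):

```
# Live state of every NIC and the addresses actually assigned
ip -br link show
ip -br addr show
# What the ifupdown config references (old PCIe-based name vs new ACPI-based name)
grep -n 'enp175s0f1np1\|ens3f1np1' /etc/network/interfaces
```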
[16:37:47] ok actually it is working fine now
[16:37:55] I forgot the switch/gateway IP on those vlans was .254
[16:38:02] my ping for .1 failed, but that's expected
[16:38:12] dcaro: ?
[16:38:13] Interface::Ip[osd-cluster-ip]/Exec[ip addr add 192.168.4.6/24 dev enp175s0f1np1]/returns) executed successfully (corrective)
[16:38:16] https://www.irccloud.com/pastebin/ONM0w68Z/
[16:38:22] that happened right before the interface went up
[16:38:30] so yeah, right now it's working, as to why it takes a long time to init I don't know
[16:38:48] the int name is showing as 'ens3f1np1'
[16:39:23] oooh I bet the interface changed names when I reimaged
[16:39:38] * andrewbogott wonders why that didn't happen on everything he reimaged
[16:39:59] andrewbogott: it may still work, and indeed it has worked
[16:40:01] how does that puppet command work?
[16:40:05] not sure if it explains the delay
[16:40:22] the name it shows now is based on the acpi ID, and the old name is based on the PCIe location
[16:40:35] stupid systemd / udev "predictable" naming scheme, let's not even go there
[16:41:11] maybe this is not being picked up on boot then
[16:41:16] https://www.irccloud.com/pastebin/1hYJPOHr/
[16:41:25] and it's not until puppet runs that it sets it correctly
[16:42:24] https://www.irccloud.com/pastebin/aPiVnoho/
[16:42:58] ^^ the system recognizes the old name as an 'altname'. I'm not sure if that means we can just use the old name as before or not
[16:43:03] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167905
[16:43:15] andrewbogott: one sec, let me try something, one last reboot xd
[16:43:16] For the three hosts I already moved to bookworm
[16:43:18] * dcaro curious
[16:43:20] ok!
[16:44:10] andrewbogott: okok, reboot, I changed the name in /etc/network/interfaces manually
[16:44:50] it's rebooting
[16:44:51] ok, good idea for now
[16:45:13] we should change it anyhow :), but that might give us some closure on the "why it took so long to start"
[16:46:46] yeah, if indeed it starts quicker now
[16:47:09] but yes, best to change either way, less confusing
[16:48:08] so annoying though
[16:48:21] yep, it did
[16:48:23] :)
[16:48:38] it did - you mean the int came up quicker this time?
[16:48:48] and the osds seem to be coming up correctly
[16:48:50] yep
[16:49:10] https://www.irccloud.com/pastebin/MolN6H24/
[16:49:13] no delay
[16:49:37] ok.
[16:49:58] so now the question is whether that puppet change works properly after the fact or if I need to make the puppet change /before/ a reimage
[16:50:06] it should work well I think
[16:50:09] * andrewbogott prepares to find out
[16:51:38] gtg though, page me if anything goes out of control xd
[16:52:51] ok!
[16:53:29] doesn't seem to have caused any harm
[16:58:47] this puppet diff is interesting
[16:58:50] https://www.irccloud.com/pastebin/GfGVucUe/
[16:59:02] it seems we were not setting the jumbo frames correctly?
[17:00:19] maybe we can add a check that the interface is what's configured to the cookbooks somewhere
[17:01:47] it's a good idea, trying to think of how we'd do that...
[17:02:37] o/ cya on monday
[17:02:59] * andrewbogott waves
[17:47:59] andrewbogott: I forgot about the jumbos
[17:48:16] we best hold off on the vlan interface for the new hosts for now, I think that's a blocker
[17:48:25] topranks: for 1048-1051 you mean?
[17:48:28] yeah
[17:48:33] ok. I'm still just trying to get them to reimage anyway
[17:48:52] yeah.
I don't think we can have the sub-interface with a higher MTU than the physical
[17:49:07] and we can't change the MTU on the parent, or at least not without a lot of consideration
[17:49:41] Can we just enable jumbo frames for both 'interfaces'?
[17:49:56] oh, that's "we can't change the MTU on the parent, or at least not without a lot of consideration"
[17:50:57] topranks: will you drop some notes on T395910?
[17:50:58] T395910: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910
[17:56:21] but a brief one for now, will need to discuss and do more tests tomorrow
[17:58:14] sure
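To illustrate the MTU constraint discussed above: a VLAN sub-interface cannot be given a larger MTU than its parent, so jumbo frames on the sub-interface imply raising the parent first. A sketch only — the NIC name is the one from this thread and the VLAN id is made up:

```
# The kernel rejects a VLAN MTU larger than the parent's (names/ids illustrative)
ip link set dev enp175s0f1np1 mtu 1500
ip link add link enp175s0f1np1 name vlan1105 type vlan id 1105
ip link set dev vlan1105 mtu 9000    # fails while the parent is still at 1500
# Raising the parent first is the step that needs the careful consideration
# mentioned above, since it affects everything else riding on that NIC:
ip link set dev enp175s0f1np1 mtu 9000
ip link set dev vlan1105 mtu 9000    # accepted now
```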