[08:45:05] we have an issue with tools-proxy-9 apparently, many tools are unresponsive
[08:51:43] I'm declaring an incident
[08:53:07] incident doc: https://docs.google.com/document/d/1CLY_iZyXDTyJEl4fKYeU1aRSNsheO9-TZcjyW9wFyEk/edit?tab=t.0
[08:53:28] dcaro if you're around you might be quicker than me at identifying the problem
[08:56:09] it looks like things are working now
[09:22:12] the incident is resolved. follow-up investigation at T399261
[09:22:14] T399261: Widespread instances down in project deployment-prep - https://phabricator.wikimedia.org/T399261
[09:22:28] wrong task :D
[09:22:35] correct task: T399281
[09:22:35] T399281: 2025-07-11 Toolforge tools not responding, proxy issue - https://phabricator.wikimedia.org/T399281
[10:08:32] there are some alerts in cloudcontrol1006: designate_floating_ip_ptr_records_updater.service is in failed status
[10:08:45] the logs show "designateclient.exceptions.Unknown: timeout"
[10:14:00] I think there is some ongoing Ceph instability, the dashboard at https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health shows 2 OSDs out, 8 OSDs down
[10:22:11] and now "CephSlowOps: Ceph cluster in eqiad has 847 slow ops"
[10:23:12] toolforge is down again
[10:24:40] and back up
[10:27:44] and down again
[10:27:44] RESOLVED: CephSlowOp
[10:33:06] things look better again but Ceph is still in HEALTH_WARN
[10:33:23] I pasted the output of ceph health detail to T399281
[10:33:23] T399281: 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281
[10:38:09] there was another spike of Cloud VPS vms reporting down, but things are now back to normal
[10:39:08] I'm not understanding what is the root cause of the Ceph instability
[10:39:35] * dhinus lunch, back later
[10:51:26] topranks: can you check if you can see anything network-wise that could explain some Ceph instability since this morning?
[10:51:28] T399281
[10:51:29] T399281: 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281
[10:51:57] I'm especially puzzled by the SSH alerts on multiple cloudcephosd hosts
[10:52:10] * topranks looking
[10:52:14] * dhinus lunch, for real this time :)
[10:55:05] checking a basic graph I have based on the cloud host ping checks from previous I see a few issues for cloudceph nodes only
[10:55:08] https://grafana.wikimedia.org/goto/cSmIOesNg
[10:55:22] which may be expected, but no general thing is jumping out
[10:55:28] I'll check a few more things
[10:55:32] it is still ongoing I take it?
[10:57:07] kind of
[10:57:12] not right now
[10:57:22] but it's been going up and down a few times
[10:58:21] I pasted some graphs in the task
[10:58:50] a.ndrew was also doing some upgrades on those hosts yesterday so it might be related (or not)
[11:14:11] yeah ok
[11:14:22] nothing jumping out in the graphs or logs that would suggest a network problem
[11:14:42] only thing in the graphs that looks different to normal is all the cloudrabbit* nodes are less spikey since about 8am this morning
[11:14:45] https://grafana.wikimedia.org/goto/-qYOFesNg
[11:15:04] could be totally unrelated still all of them seeing decent level of traffic
[11:17:53] I'm still half asleep and haven't read the backscroll, but my emails suggest that ceph pacific + bookworm + ceph traffic is a bad combination.
[11:18:34] So probably the fix for this is for me to downgrade those hosts back to bullseye. I can start doing that shortly, unless someone has a different diagnosis
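For reference, the Ceph checks mentioned above (OSDs down/out, health detail, slow ops) are plain Ceph CLI. A minimal sketch, assuming a shell on a mon/admin host with the client keyring; the OSD id used below is illustrative, not taken from this log:

    # overall cluster status, including HEALTH_WARN details and slow-ops counters
    sudo ceph -s
    sudo ceph health detail

    # how many OSDs are up/in vs down/out
    sudo ceph osd stat

    # list only the OSDs currently marked down, with their CRUSH location (i.e. which host they live on)
    sudo ceph osd tree down

    # map a single OSD id back to its host (osd.42 is an example)
    sudo ceph osd find 42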
[11:19:02] dhinus: ^
[11:20:19] I don't think we can just depool the affected nodes pre-emptively without causing a whole lot of rebalancing which might make things worse
[11:23:14] The bookworm hosts are 1006-1008,1035-1037 <- does that correspond to the failures?
[11:43:06] andrewbogott: 1035-1037 seem to match
[11:43:36] yeah. I thought it was specifically the md1 rebuild but 1036 seems to have flapped many times during the night...
[11:43:46] and I only see log messages about md1 once.
[11:44:17] I'm already in process of rolling 1037 back to bullseye; do you want to help me dig in the logs on 1035/1036 and see what you can see?
[11:45:33] Oh, one other data point: I upgraded ssd fw on 1035-1037 but did /not/ upgrade firmware on 1006-1008 (although I did upgrade those three to bookworm as well).
[11:45:48] but I also upgraded ssd fw on 1038-1041
[11:46:09] So pretty sure that still points to bookworm unless you've seen issues on 1038-1041
[11:49:50] dhinus: on cloudcephosd1036, look at time stamp '2025-07-11T10:27:42.396143+00:00' and below. That's one of the storms beginning.
[11:50:07] It /looks/ like a network issue since there are kafka complaints at the same time as ceph complaints...
[11:52:26] I'm quite confused, they look like network issues to me, but the network could also be a consequence and not the root cause
[11:52:34] yeah
[11:52:57] good catch about md1, did not see that
[11:53:09] If it is a 'network' issue it would still have to be caused by system misbehavior, not literally the network
[11:53:25] but of course if a different OSD is having trouble that might look like timeouts on any other OSD
[11:57:06] thanks topranks btw for looking into this! I think there's nothing pointing to the network being the root cause so far
[11:57:12] * andrewbogott agrees
[11:58:26] ok np guys
[11:58:56] I think downgrading 1035-1037 is probably a reasonable thing to try. wdyt?
[11:59:12] I posted an update on the task with some details of what I could see
[12:00:38] I definitely think downgrading is the next step. Then potentially we can go to the next version of ceph on bullseye before doing OS upgrades so we avoid this exact combination.
[12:00:46] But it would be nice to actually know what's wrong :/
[12:02:40] The sw raid rebuild corresponds to some but not all of the lockups...
[12:02:47] it's hard for me to blame the raid thing on ceph
[12:03:39] but why does 1036 keep flapping then? It should already be done with the raid rebuild and it doesn't seem to be repeating that
[12:14:35] btw the reimage is going badly because the partman phase fails. Typically I'd get around that by reformatting the OS drives so it has to rebuild the raid but I'm not yet sure how to do that with the controller on this generation of hosts...
[12:31:32] bah, is there /any/ partition or raid management tool on the debian installer CLI? Do I really have to write a custom partman script?
[12:36:33] no idea :( I'm on my phone atm, I can have a look in 30 mins, but I have limited knowledge of the partitioning and raid setup
[12:39:07] I see that toolsdb is also alerting now, don't worry about it I'll fix it later
[12:39:23] ok, thanks
[12:39:26] are you off today, officially?
[13:04:57] andrewbogott: no I'm working, and I'm back at my laptop now :)
[13:05:47] ok! I'm enlisting Ben to help with the partman issue above
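On the partman/raid question above: one workaround worth noting is to stop the auto-assembled software-RAID arrays and wipe their superblocks from the installer shell, so that partman can recreate them from scratch. This is a sketch only; it assumes mdadm is available in the debian-installer environment (it normally is once the RAID components have loaded), the md and member-device names are examples, and it does not answer the hardware-controller question from 12:14:

    # from the installer shell (e.g. "Execute a shell" / virtual console 2)
    cat /proc/mdstat                     # see which md arrays were auto-assembled
    mdadm --stop /dev/md0                # stop each assembled array
    mdadm --stop /dev/md1
    mdadm --zero-superblock /dev/sda2    # wipe RAID metadata from each member partition
    mdadm --zero-superblock /dev/sdb2
    # last-resort fallback: clobber the start of the OS disks so they look blank
    dd if=/dev/zero of=/dev/sda bs=1M count=10
    dd if=/dev/zero of=/dev/sdb bs=1M count=10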
[13:05:56] +1
[13:06:04] I'll look at toolsdb
[13:06:08] And should probably step away for a few and eat something so my brain doesn't seize up
[13:11:11] please do! things seem relatively stable right now
[13:11:20] I've just restarted toolsdb replication and I think it's working
[13:14:50] toolsdb replication is back in sync
[14:20:17] dhinus: can you log in to /any/ cloud-vps hosts right now?
[14:20:23] andrewbogott: let me try
[14:20:28] there's a report in Slack about "I was trying to create an object storage container and am getting upstream request timeout errors."
[14:20:52] hmm I can't ssh to the tools-db host I used one hour ago
[14:20:58] so I guess it's a "no"
[14:21:43] Maybe the bastion is busted? I'll restart it
[14:21:49] https://openstack-browser.toolforge.org/ works, so this isn't a total network outage
[14:22:24] I'm reopening the incident
[14:22:35] no idea if this is related
[14:24:44] I rebooted bastion-r and
[14:24:46] [/sbin/fsck.ext4 (1) -- /dev/sda1] fsck.ext4 -a -C0 /dev/sda1
[14:24:46] /dev/sda1: recovering journal
[14:24:57] so something bad has happened filesystemwise.
[14:25:10] things seem able to recover, but... this is not consistent with how I would expect ceph to fail.
[14:25:22] Unless it just caused a hard crash on vms that messed up journaling
[14:25:30] there wasn't ever a ceph message implying data loss was there?
[14:27:36] dhinus: my thought is... get bastionr working so that cumin works so that I can do some cloud-wide tests to see what VMs are locked up by surprise
[14:29:13] didn't see anything implying data loss so far
[14:29:23] I'm updating the incident doc timeline at https://docs.google.com/document/d/1CLY_iZyXDTyJEl4fKYeU1aRSNsheO9-TZcjyW9wFyEk/edit?tab=t.0
[14:29:37] things seem to get worse after 14:12 UTC
[14:30:19] 1007 is having a thing now.
[14:30:31] So we have two nodes down at once, which is probably what causes actual user-facing effects.
[14:42:34] dhinus: cumin works on VMs again, can you do some searching and rebooting of frozen toolforge nodes?
[14:43:20] you mean the usual "D state" frozen?
[14:43:29] or other types of frozen?
[14:43:57] probably they'll just appear as down in alertmanager
[14:44:08] or you can see which ones time out for cumin. They're, like, all the way down.
[14:44:18] at least, there are lots on deployment-prep in that state
[14:44:54] maybe they'll recover on their own but that's not obvious to me if they are
[14:47:17] ok
[14:47:53] thx
[14:52:55] are you seeing many, many unreachable hosts?
[14:53:24] 30% in project tools
[14:53:43] there's also this nice graph https://grafana.wmcloud.org/d/bfcWngjVk/taavi-cloud-vps-issue-detection-tests
[14:54:57] I wish I understood better what's happening with those down nodes. Ceph should've just been a little slow during the flaps, right?
[14:55:03] It should make a VM lock up
[14:55:11] *shouldn't*
[14:55:32] this morning they always self-recovered, but now they're not
[14:55:39] probably because we had 2 ceph nodes down together
[14:55:46] yeah
[14:56:03] once we get 1036 back healthy I'm pretty sure this particular case won't happen again
[14:56:46] dhinus: a useful thing would be a way to enumerate VM IDs from the list of unresponsive hosts.
[14:56:50] cumin is not very usable or maybe I'm not using the right flags
[14:57:07] I can't think of a /good/ way to do that, but it's possible with a lot of searches
[14:58:44] are you having any luck getting those stuck hosts unstuck?
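On enumerating VM IDs for the unresponsive hosts: one possible approach, assuming the openstack CLI with admin credentials loaded on a cloudcontrol host; the project, hostname, and file name below are placeholders:

    # look up the Nova instance UUID for a given (unqualified) hostname;
    # --name is a filter, so check it matched exactly one server before rebooting
    openstack server list --project tools --name tools-k8s-worker-42 -f value -c ID

    # hard-reboot a stuck instance by UUID
    openstack server reboot --hard <uuid-from-above>

    # or loop over a prepared list of stuck hostnames (stuck-hosts.txt is a placeholder)
    while read -r host; do
      id=$(openstack server list --project tools --name "${host%%.*}" -f value -c ID)
      [ -n "$id" ] && openstack server reboot --hard "$id"
    done < stuck-hosts.txt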
[14:58:59] dhinus: I'm doing things like "sudo cumin O{project:deployment-prep} 'true' --timeout 10 --force "
[14:59:13] and figuring out that the timeout VMs are stuck
[15:01:30] my cumin command is completely stuck so I don't even have a list of hosts
[15:01:52] I'm also doing IC and trying to update the timeline so I have limited bandwidth :)
[15:02:51] I'm running on cloudcumin1001 and it's working reliably
[15:03:03] what command are you running?
[15:03:19] sudo cumin O{project:deployment-prep} 'true' --timeout 10 --force
[15:03:20] just now
[15:03:25] seems to be working
[15:03:41] cloudcumin itself works, but my attempt of "cumin 'O{project:tools}' id" was never returning
[15:03:50] of course I was missing "--timeout 10" :)
[15:04:41] yeah, otherwise the stuck VMs make you wait forever
[15:05:23] then what should we do with the ones that fail?
[15:06:30] I'm not sure. The first few I tried came back after a hard reboot...
[15:44:09] dhinus: do you have a google doc or just the phab task?
[15:44:18] incident doc linked from the task
[15:44:40] found it?
[15:50:36] yep
[15:51:32] I'm officially handing off IC to you then :)
[15:52:43] if you have a handy list of bookworm-upgraded hosts, I would add it to the top of the doc
[15:54:11] ok
[15:54:28] If I'm both repairing and IC then I will lean on the repairing bit
[15:54:43] makes sense
[16:23:40] * dhinus offline
[17:12:55] I am still reimaging things but I believe the incident to be over and the current Ceph state should prevent remaining issues from being visible to users.
[17:14:37] perfect timing, stashbot
[18:47:05] I am now going to try to leave ceph alone. Everything is back to Bullseye and ceph is doing a bit of rebalancing but seems otherwise happy. The volume that vanished from cloudcephosd1013 is concerning but seems unrelated to the rest of what's happening so it can be investigated later.
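One operational note on the reimage/rebalance tail end: when OSD hosts are taken down deliberately, a common pattern is to set the noout flag so Ceph doesn't start rebalancing while the host is away, then watch recovery once it is back. Whether that was done here isn't stated in the log; this is only a hedged sketch using standard Ceph CLI:

    # before taking an OSD host down for reimage: stop Ceph from marking its OSDs out
    sudo ceph osd set noout

    # ...reimage / reboot the host...

    # once the host and its OSDs are back up, return to normal behaviour
    sudo ceph osd unset noout

    # follow recovery/rebalancing until the cluster is back to HEALTH_OK
    watch -n 30 sudo ceph -s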