[08:00:35] morning! seemed like a busy weekend, reading
[08:27:12] morning! most info about Friday's outage is in the incident doc https://docs.google.com/document/d/1CLY_iZyXDTyJEl4fKYeU1aRSNsheO9-TZcjyW9wFyEk/edit?tab=t.0#heading=h.95p2g5d67t9q
[08:27:59] I'll create a draft incident report at https://wikitech.wikimedia.org/wiki/Incident_status
[08:28:10] 👍 thanks for taking care of the incident!
[08:30:26] I did not do much apart from looking at graphs and logs :D
[08:31:07] I imagined it could be related to the upgrade, but I didn't immediately identify the problem was ONLY with the bookworm hosts
[09:00:11] weird issue
[09:50:45] I checked the Ceph and Linux kernel bug trackers in Debian (given both Ceph and the Linux kernel were the stock Bookworm packages) and there are no similar reports, so it's probably down to some particular config/setting that only we're using
[10:43:59] thanks! strange :/
[11:34:23] vps project request: https://phabricator.wikimedia.org/T399418
[11:34:23] project quota increase: https://phabricator.wikimedia.org/T398638
[11:34:48] These requests look valid but need a +1. Anyone around to approve?
[12:44:25] done
[13:01:11] dhinus: it was entirely upgrade-related, as soon as I rolled those hosts back to bullseye everything got happy again.
[13:01:39] moritzm: thank you for looking! My surely less-intensive googling didn't turn up much either.
[13:02:01] did we change any other config? (ex. network)
[13:03:07] andrewbogott: yep, what I meant is that it took me a few hours before realizing the failing hosts were the upgraded ones
[13:03:09] not unless it was a hidden effect of the OS update. A few of the hosts renamed the NICs with the new OS but not all of them
[13:03:41] dhinus: oh yeah. possibly because you didn't know they had been upgraded, which I could have communicated better.
[13:03:46] until then I only knew something was wrong, and that there was an upgrade, but I couldn't see the host correlation until you woke up :)
[13:04:15] I was not sure how many hosts were upgraded yeah, but in retrospect I could have easily checked the versions
[13:04:59] One other thing we can investigate: codfw1dev has been running that combination (pacific+bookworm) for a couple of weeks without any apparent issues. We should probably check and make sure that there /really/ aren't issues. If there are and they just aren't as severe because of low traffic that would be reassuring.
[13:05:32] for cloudcephosd1013 it seems like a hard drive failure
[13:05:36] https://www.irccloud.com/pastebin/Jk5IyAfc/
[13:05:42] is there a task for it?
[13:05:43] traffic in codfw is pretty low, but I would still expect to see some diff in the graphs (temperature, cpu, etc.)
[13:05:53] dcaro: only if alertmonitor created one.
[13:06:12] And we already have hardware racked to replace it, so there's no real followup needed there.
[13:06:39] okok
[13:07:05] I guess that shows that our 5-year replacement cadence is just right :p
[13:07:22] dcaro: I think there's only T399366 which I'm not sure is related
[13:07:22] T399366: KernelErrors Server cloudcephosd1013 logged kernel errors - https://phabricator.wikimedia.org/T399366
[13:07:35] probably worth creating a task about the hard drive failure
[13:07:48] kinda matches
[13:07:53] (those are kernel errors)
[13:07:55] it's interesting I think it started failing mid-incident :P
[13:08:16] I did see initially 8 OSDs down in grafana when one host was not responding
[13:08:19] hmmm.... that's interesting yep
[13:08:20] then they became 9 :P
[13:08:35] might have been the cause of stuck operations
[13:08:43] *slow
[13:08:55] dhinus: I think those kernel errors are me rebooting the host to see if that got the drive to reappear.
[13:09:12] but yeah, timing is very suspicious!
[13:09:40] There was probably some excited rebalancing during the incident, which probably slightly increased load on that drive, but nothing it shouldn't have handled
[13:10:16] I am not finding kernel errors matching that datetime though...
[13:11:27] https://usercontent.irccloud-cdn.com/file/4DU1otMW/image.png
[13:11:29] scary xd
[13:11:45] not that datetime, just the logs themselves
[13:12:33] yes those match a dead hard drive...and it was Friday, but why did phaultfinder create that task on Saturday? :)
[13:13:07] hmmm.... it also seemed to have had network issues at some point
[13:13:07] heartbeat_check: no reply from 10.64.151.7:6826 osd.254 since back 2025-07-11T14:02:30
[13:13:37] the graph in prometheus for kernel errors starts on Friday as expected
[13:13:49] for some reason the task was only created later, maybe prometheus could not read the stats?
[13:14:10] hmm but the graph shows the stat was recorded
[13:16:13] Alertmanager figured we had enough to worry about already
[13:16:20] LOL
[13:17:51] the errors I was showing for the hard drive are from before the reboot
[13:19:00] yes, I'm just confused why alertmanager didn't notify us, the emails and IRC messages are also on Saturday
[13:20:46] We try to filter out certain kernel errors from alerting, maybe the filter is too broad
[13:25:59] the prometheus value is already after the filtering, and the graph shows it starts on Jul 11 (Friday)
[13:28:29] maybe some silence was in place?
[13:30:27] oh! Yeah, there was a silence for hostname = cloudcephosd.* during the bullseye upgrades.
[13:31:03] I'm not finding it in the "expired" silences, but that would explain the alert triggering late
[13:31:09] if it expired on Saturday
[13:31:35] ah yep found it!
[13:33:28] mystery solved
[15:54:40] I published an initial draft of the incident report: https://wikitech.wikimedia.org/wiki/Incidents/2025-07-11_WMCS_Ceph_issues_causing_Toolforge_and_Cloud_VPS_failures
[16:05:22] thanks!
[16:20:12] dhinus: do you know how to handle https://alerts.wikimedia.org/?q=team%3Dwmcs&q=alertname%3DProjectProxyMainProxyCertificateExpiry ? I want to add a runbook, but I'm not sure what the solution is, the cert should be provisioned by acme-chief right?
[16:26:38] dcaro: no, but I checked and there are 12 days before it actually expires
[16:29:43] I added a note in the alert about the time that's left xd (it was not saying anything before)
[16:30:25] thanks that looks much better :)
[16:37:03] a user in Slack is reporting they're still having issues with object storage since the outage on Friday
[16:37:20] (#engineering-all Slack channel)
[16:39:08] I see, they are using swift
[16:48:32] T399481
[16:48:32] T399481: Swift container endpoints are unavailable - https://phabricator.wikimedia.org/T399481
[17:08:42] * dhinus offline
[17:20:47] \me off
[17:20:50] cya tomorrow!
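
On the 13:04:15 point that the failing hosts could have been identified earlier by checking versions: a minimal Python sketch that prints each cloudcephosd host's Debian codename over SSH, making upgraded (bookworm) vs. not-yet-upgraded (bullseye) hosts easy to spot. The host names and domain here are assumptions, not the actual fleet list, and this is not the team's real tooling.

#!/usr/bin/env python3
"""Sketch: print the Debian codename of each Ceph OSD host over SSH."""
import subprocess

# Assumed naming scheme and domain; adjust to the real fleet.
HOSTS = [f"cloudcephosd10{n:02d}.eqiad.wmnet" for n in range(1, 14)]

for host in HOSTS:
    try:
        result = subprocess.run(
            ["ssh", "-o", "ConnectTimeout=5", host, "cat /etc/os-release"],
            capture_output=True, text=True, check=True, timeout=20,
        )
        # Pull VERSION_CODENAME (e.g. "bullseye" or "bookworm") out of os-release.
        codename = next(
            (line.split("=", 1)[1].strip('"')
             for line in result.stdout.splitlines()
             if line.startswith("VERSION_CODENAME=")),
            "unknown",
        )
    except (subprocess.SubprocessError, OSError) as exc:
        codename = f"unreachable ({exc})"
    print(f"{host}: {codename}")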
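
On the 13:28-13:31 silence mystery: a minimal Python sketch that lists Alertmanager silences whose matchers mention cloudcephosd, including recently expired ones, to see when a silence like hostname=~"cloudcephosd.*" actually ended. The Alertmanager URL is a placeholder assumption; the /api/v2/silences endpoint itself is the standard Alertmanager v2 API (expired silences stay visible until they are garbage-collected).

#!/usr/bin/env python3
"""Sketch: show active/pending/expired silences matching cloudcephosd."""
import json
import urllib.request

ALERTMANAGER = "http://localhost:9093"  # assumed; point at the real Alertmanager

with urllib.request.urlopen(f"{ALERTMANAGER}/api/v2/silences") as resp:
    silences = json.load(resp)

for s in silences:
    matchers = ", ".join(
        f'{m["name"]}{"=~" if m["isRegex"] else "="}{m["value"]}'
        for m in s["matchers"]
    )
    if "cloudcephosd" not in matchers:
        continue
    # state is one of "active", "pending" or "expired"
    print(f'{s["status"]["state"]:8} {s["startsAt"]} -> {s["endsAt"]}  {matchers}')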
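
On the 16:20 ProjectProxyMainProxyCertificateExpiry question: a minimal Python sketch that connects to the project proxy and prints how many days the served certificate has left, roughly the "time that's left" figure added to the alert note. The proxy hostname below is a placeholder assumption, not the real proxy name.

#!/usr/bin/env python3
"""Sketch: report days until the TLS certificate served by a host expires."""
import socket
import ssl
from datetime import datetime, timezone

HOST = "proxy.example.wmcloud.org"  # placeholder; use the real project proxy
PORT = 443

ctx = ssl.create_default_context()
with socket.create_connection((HOST, PORT), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()

# notAfter is a string like "Jul 28 12:00:00 2025 GMT"; convert it to a datetime.
not_after = datetime.fromtimestamp(
    ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
)
days_left = (not_after - datetime.now(timezone.utc)).days
print(f"{HOST}: certificate expires {not_after:%Y-%m-%d} ({days_left} days left)")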