[08:22:54] <taavi>	 clouddb1013@s3 is now depooled, I assume you want to have a look at the crash before restarting replication?
[08:35:08] <federico3>	 taavi: I'm taking a look at it 
[08:35:19] <taavi>	 thank you
[09:50:01] <elukey>	 hey folks, not sure if you saw https://phabricator.wikimedia.org/T420041
[09:50:22] <dhinus>	 ah interesting!
[09:50:25] <elukey>	 db1253 was depooled due to an unexpected hw freeze/error
[09:50:45] <dhinus>	 I just opened T420177
[09:50:45] <stashbot>	 T420177: clouddb1013 crashed after the upgrade to mariadb 10.11.16 - https://phabricator.wikimedia.org/T420177
[09:52:30] <dhinus>	 no, 1253 is still on 10.11.13, so probably unrelated
[10:14:08] <federico3>	 also db1253 was not pinging and not responding to SSH, sounds pretty different
[10:14:42] <jynus>	 federico3: check if it is network, like the other one last week
[10:14:53] <jynus>	 it was apparently a loose cable
[10:19:34] <dhinus>	 federico3: yes sorry, I saw "crashed" and I thought "oh maybe that's the same thing" but of course it was a totally different one :D
[10:21:53] <federico3>	 jynus: do you know if we have alerts on switches for interfaces going down?
[10:23:17] <jynus>	 not on single ports, but it is easy to track just by connecting to the management interface and seeing if it is up or not
[10:24:22] <jynus>	 we have the ping, but that can mean network down or crash
[10:24:29] <federico3>	 do we have logs from the switch e.g. in logstash?
[10:25:00] <jynus>	 I am sure we have logs, I don't think we have access to them
[10:30:10] <jynus>	 there, it took me 1 command to know that host is up
[10:30:25] <jynus>	 The iDRAC controller cannot communicate with the power management firmware due to an problem with the interface to the power management engine or with the power management engine itself. The system may operate in a performance degraded state.
[10:30:32] <jynus>	 at 2026-03-13 17:53:52
[10:31:05] <jynus>	 The Intel Management Engine has encountered a Exception Event.
[10:31:48] <jynus>	 some cpu weirdness it seems
[10:32:30] <jynus>	 someone reset it twice
[10:33:26] <jynus>	 seems to be ok now, but I would paste the logs to the ticket and ask dcops to have a look, upgrade bios, evaluate
[10:34:19] <federico3>	 on the network side and in past tasks for this host I'm not seeing anything suspicious
[10:34:27] <jynus>	 yeah, all good there
[10:44:22] <jynus>	 the boot was clean also, showed no errors
[11:23:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:27:40] <Emperor>	 just rebooted that host, it'll sort itself out.
[11:33:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:45:48] <jynus>	 I confirm there are some changes that are in the latest man page but not on the wiki
[11:45:58] <jynus>	 fixing
[12:41:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on backup1015:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:44:19] <Amir1>	 Emperor: have you rebooted all frontends? I want to know whether to wait for them to happen or I can start the next round now
[12:55:55] <Emperor>	 Amir1: it's in-progress (downtiming is being a bit flaky)
[12:56:47] <Amir1>	 noted, let me know once you're done
[12:56:59] <Emperor>	 ack
[14:43:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: swift_dispersion_stats_lowlatency.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:48:40] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: swift_dispersion_stats_lowlatency.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:51:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on backup1015:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:18:40] <jinxer-wm>	 FIRING: DiskSpace: Disk space backup1010:9100:/srv/objectstorage 3.995% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=backup1010 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace