[01:57:10] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:57:10] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:11:12] The script is pooling back es2026 while it's alerting
[07:28:04] https://www.irccloud.com/pastebin/x4r8bDOy/
[07:29:23] the script waits for icinga to go green before pooling it back in, but in the past we saw transient failures of wmf_auto_restart_prometheus-mysqld-exporter.service
[07:30:01] that unit is still failed on that host
[07:30:25] since 21:22:07 on 1st Sep
[07:31:04] yes, I'm aware, I'm trying to understand the timing:
[07:31:08] https://www.irccloud.com/pastebin/riauVhyB/
[07:34:39] that paste has several different services in it
[07:35:38] I think the key log line is likely
[07:35:49] Sep 01 21:22:07 es2026 wmf-auto-restart[2602033]: INFO: 2025-09-01 21:22:07,538 : Service prometheus-mysqld-exporter not present or not running
[07:37:19] (indeed that exporter stopped at 13:56:52 on 1st Sep and didn't start again until 05:21:14 today)
[07:37:49] yes, I'm looking at when the spicerack check for icinga passed
[08:51:55] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2049:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:04:41] ?!
[10:00:13] Amir1: the schema change on s3 is done, can I start s7?
[10:06:21] sigh, something is wrong with ms-be1083 and grub is trying to boot from the wrong md device(!)
[10:19:08] federico3: go for it
[10:21:50] ok
[11:49:58] I think the switchover script is using the old version on cumin1003. It is quite slow (which is a known issue with the old version of wmfmariadbpy)
[11:50:57] I can't stop it now, but sigh
[12:52:10] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2049:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:52:26] :-/
[13:06:44] federico3: please take a look. It's the host you're provisioning ^
[13:07:29] !log install libpython3.9-dbg python3.9-dbg on ms-fe2016 for debugging
[13:07:29] Emperor: Not expecting to hear !log here
[13:07:39] oops, ECHAN
[13:09:56] Amir1: yes, I'm looking at the restart script and I added a bit of logic to temporarily disable the timer during the host cloning. I suspect there is a race between the stop-start of mariadb during clones and the timer-based run of wmf_auto_restart, although I need a bit more context on why that auto-restart script exists
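(As a rough illustration of the guard described above, not the actual cookbook code: stop the auto-restart timer for the duration of the clone so its timer-driven run cannot race with the mariadb stop/start, then start it again afterwards. The timer unit name and the clone_host() step are assumptions.)

    import subprocess

    # Assumed timer unit name; the real unit on the host may differ.
    TIMER = "wmf_auto_restart_prometheus-mysqld-exporter.timer"

    def remote(host, *cmd):
        # Illustrative only: run a command on the target host over ssh.
        subprocess.run(["ssh", host, *cmd], check=True)

    def clone_with_timer_paused(host, clone_host):
        """Pause the wmf_auto_restart timer while mariadb is stopped and
        restarted for cloning, so the two cannot race; restart the timer
        even if the clone fails."""
        remote(host, "sudo", "systemctl", "stop", TIMER)
        try:
            clone_host(host)  # hypothetical: the actual provisioning/clone step
        finally:
            remote(host, "sudo", "systemctl", "start", TIMER)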
[13:12:37] it appears the timer is set to run at different times across hosts, e.g. OnCalendar=Mon,Tue,Wed,Thu,Fri *-*-* 17:42:00 on es2049
[13:21:55] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2049:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:15] Amir1: when you have a second https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184092/1 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184091/1
[14:52:15] I just came out of a meeting. Give me a bit
[18:26:01] Amir1: you've been starting the schema changes after the eqiad master flips: if you prefer, I can run the schema changes after you do the flips
[18:28:05] Nah, these are fine, we just need to keep track of where the current schema change is still missing. While updating stuff I realized we need to do a switchover of s8 in eqiad
[18:28:16] i.e. staying on top of the schema change blockers
[18:29:32] that way I can prioritize what to switch over next.
[18:35:46] speaking of which, codfw s7 is done except the DC master, can I run the change on the DC master now?
[18:37:06] which schema change? https://phabricator.wikimedia.org/T401906
[18:37:12] I'd say do eqiad on s7 first
[18:37:22] we do replicas first (both DCs), then DC masters
[18:38:39] you mean we want to follow strictly: 1) codfw replicas 2) eqiad replicas 3) codfw DC masters 4) eqiad masters
[18:39:21] yeah, I usually combine 1 and 2 (at least from the third section onwards) but yeah
[18:39:59] my usual chain: s6 codfw, s5 codfw, s6 eqiad, s5 eqiad, s3 all, s2 all, s7 all, s4 all...
[18:40:15] (all = only replicas)
[18:40:49] ah that's different
[18:42:38] I just run it without the --dc argument, but for you let's do all codfw replicas first
[19:26:42] Amir1: do you know if in the past new es* nodes were manually repartitioned after receiving them? The new es* hosts all have unused space in LVM, but perhaps that was by design given that the RO sections are not going to grow?
[21:08:28] It should mirror exactly how the old host is set up. It will grow due to logs, etc. Definitely better safe than sorry
[22:02:08] Amir1: is this a bug in the provisioning, or do we receive hosts with this configuration and repartition them manually?
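(For reference, a minimal sketch of the rollout order agreed above — all codfw replicas, then eqiad replicas, then codfw DC masters, then eqiad masters. The run_schema_change() helper and the section list are hypothetical, not the actual tooling.)

    # Example section order only; the real list depends on the schema change.
    SECTIONS = ["s2", "s3", "s4", "s5", "s6", "s7"]

    def roll_out(run_schema_change):
        # 1) codfw replicas, 2) eqiad replicas.
        for dc in ("codfw", "eqiad"):
            for section in SECTIONS:
                run_schema_change(section, dc=dc, target="replicas")
        # 3) codfw DC masters, 4) eqiad masters, only after all replicas are done.
        for dc in ("codfw", "eqiad"):
            for section in SECTIONS:
                run_schema_change(section, dc=dc, target="master")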