[01:57:10] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:57:10] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:11:12] The script is pooling back es2026 while it's alerting
[07:28:04] https://www.irccloud.com/pastebin/x4r8bDOy/
[07:29:23] the script waits for icinga to go green before pooling it back in, but in the past we saw transient failures of wmf_auto_restart_prometheus-mysqld-exporter.service
[07:30:01] that unit is still failed on that host
[07:30:25] since 21:22:07 on 1st Sep
[07:31:04] yes, I'm aware, I'm trying to understand the timing:
[07:31:08] https://www.irccloud.com/pastebin/riauVhyB/
[07:34:39] that paste has several different services in it
[07:35:38] I think the key log line is likely
[07:35:49] Sep 01 21:22:07 es2026 wmf-auto-restart[2602033]: INFO: 2025-09-01 21:22:07,538 : Service prometheus-mysqld-exporter not present or not running
[07:37:19] (indeed that exporter stopped at 13:56:52 on 1st Sep and didn't start again until 05:21:14 today)
[07:37:49] yes, I'm looking at when the spicerack check for icinga passed
[08:51:55] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2049:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:04:41] ?!
[10:00:13] Amir1: the schema change on s3 is done, can I start s7?
[10:06:21] sigh, something is wrong with ms-be1083 and grub is trying to boot from the wrong md device(!)
[10:19:08] federico3: go for it
[10:21:50] ok
[11:49:58] I think the switchover script is using the old version on cumin1003. It is quite slow (which is a known issue with the old version of wmfmariadbpy)
[11:50:57] I can't stop it now, but sigh
[12:52:10] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2049:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:52:26] :-/
[13:06:44] federico3: please take a look. It's the host you're provisioning ^
[13:07:29] !log install libpython3.9-dbg python3.9-dbg on ms-fe2016 for debugging
[13:07:29] Emperor: Not expecting to hear !log here
[13:07:39] oops, ECHAN
[13:09:56] Amir1: yes, I'm looking at the restart script and I added a bit of logic to temporarily disable the timer during the host cloning. I suspect there is a race between the stop-start of mariadb during clones and the timer-based run of wmf_auto_restart, although I need a bit more context on why that auto-restart script exists
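(As a rough illustration of the guard described above, not the actual cookbook code: stop the auto-restart timer for the duration of the clone so its timer-driven run cannot race with the mariadb stop/start, then start it again afterwards. The timer unit name and the clone_host() step are assumptions.)

    import subprocess

    # Assumed timer unit name; the real unit on the host may differ.
    TIMER = "wmf_auto_restart_prometheus-mysqld-exporter.timer"

    def remote(host, *cmd):
        # Illustrative only: run a command on the target host over ssh.
        subprocess.run(["ssh", host, *cmd], check=True)

    def clone_with_timer_paused(host, clone_host):
        """Pause the wmf_auto_restart timer while mariadb is stopped and
        restarted for cloning, so the two cannot race; restart the timer
        even if the clone fails."""
        remote(host, "sudo", "systemctl", "stop", TIMER)
        try:
            clone_host(host)  # hypothetical: the actual provisioning/clone step
        finally:
            remote(host, "sudo", "systemctl", "start", TIMER)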
[13:12:37] it appears the timer is set to run at different times across hosts, e.g. OnCalendar=Mon,Tue,Wed,Thu,Fri *-*-* 17:42:00 on es2049
[13:21:55] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2049:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:15] Amir1: when you have a second https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184092/1 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184091/1
[14:52:15] I just came out of a meeting. Give me a bit
[18:26:01] Amir1: you've been starting the schema changes after the eqiad master flips: if you prefer, I can run the schema changes after you do the flips
[18:28:05] Nah, these are fine, we just need to keep track of where the current schema change is still missing. While updating stuff I realized we need to do a switchover of s8 in eqiad
[18:28:16] i.e. staying on top of the schema change blockers
[18:29:32] that way I can prioritize what to switch over next.
[18:35:46] speaking of which, codfw s7 is done except the DC master, can I run the change on the DC master now?
[18:37:06] which schema change? https://phabricator.wikimedia.org/T401906
[18:37:12] I'd say do eqiad on s7 first
[18:37:22] we do replicas first (both DCs), then DC masters
[18:38:39] you mean we want to follow strictly: 1) codfw replicas 2) eqiad replicas 3) codfw DC masters 4) eqiad masters
[18:39:21] yeah, I usually combine 1 and 2 (at least from the third section onwards) but yeah
[18:39:59] my usual chain: s6 codfw, s5 codfw, s6 eqiad, s5 eqiad, s3 all, s2 all, s7 all, s4 all...
[18:40:15] (all = only replicas)
[18:40:49] ah that's different
[18:42:38] I just run it without the --dc argument, but for you let's do all codfw replicas first
[19:26:42] Amir1: do you know if in the past new es* nodes were manually repartitioned after receiving them? The new es* hosts all have unused space in LVM, but perhaps that was by design given that the RO sections are not going to grow?
[21:08:28] It should mirror exactly how the old host is set up. It will grow due to logs, etc. Definitely better safe than sorry
[22:02:08] Amir1: is this a bug in the provisioning, or do we receive hosts with this configuration and repartition them manually?
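(For reference, a minimal sketch of the rollout order agreed above — all codfw replicas, then eqiad replicas, then codfw DC masters, then eqiad masters. The run_schema_change() helper and the section list are hypothetical, not the actual tooling.)

    # Example section order only; the real list depends on the schema change.
    SECTIONS = ["s2", "s3", "s4", "s5", "s6", "s7"]

    def roll_out(run_schema_change):
        # 1) codfw replicas, 2) eqiad replicas.
        for dc in ("codfw", "eqiad"):
            for section in SECTIONS:
                run_schema_change(section, dc=dc, target="replicas")
        # 3) codfw DC masters, 4) eqiad masters, only after all replicas are done.
        for dc in ("codfw", "eqiad"):
            for section in SECTIONS:
                run_schema_change(section, dc=dc, target="master")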