[05:27:27] x1 has been switched
[06:00:24] thanks, starting the switchover in s4
[06:20:40] https://gerrit.wikimedia.org/r/c/operations/dns/+/1286411 needed a rebase
[06:22:00] ok
[06:23:12] and now it's failing tests due to an unrelated ipaddr
[06:31:26] other than that the switchover is done
[07:53:58] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db1217:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:56:33] marostegui: ^-- looks like an expired downtiem?
[07:56:38] time
[07:56:47] mmmm somewhat, I will fix it now
[07:57:33] ta
[07:58:58] RESOLVED: SystemdUnitFailed: prometheus-mysqld-exporter.service on db1217:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:00:34] federico3: In Zarcillo, most of s1 shows as running an older kernel, but in the May reboots most have been marked as done - is that all correct?
[12:00:49] checking
[12:01:17] you mean in https://zarcillo.wikimedia.org/ui/sections#s1 ?
[12:01:42] Yes
[12:02:39] the flag is relative between hosts in the same section: most likely some other hosts have trixie and a newer MariaDB. The purpose is to use this info to select a preferred candidate (in case there are more than one candidate)
[12:07:09] OK, although in the view that shows right now, none of the servers used to flag the others are ones that would be used (i.e. backups, cloudb* etc)
[12:08:43] Is it possible to exclude such servers from the comparison, or is it preferred to still use them when cross-checking for that flag?
[12:12:06] .. or is that the case and it is actually the sanitarium master? The uptime looked like reboots were recent across most, hence why questioning this
[12:13:05] maybe it's db1169?
[12:13:49] if you click on the hostname you can see the kernel version and MariaDB version
[12:24:23] Yes, I think it was the sanitarium master
[12:27:10] I'm referring to the third in the list https://usercontent.irccloud-cdn.com/file/6np9465A/image.png
[12:31:32] Sorry, yes - transposed the last 2 digits
[13:08:12] db2185 upgraded ok, can I upgrade db1215 now? It will give orchestrator and zarcillo a bit of downtime @marostegui @cezmunsta
[13:13:57] federico3: what effect does that have on a running reimage?
[13:15:22] on an already running reimage using major-upgrade? nothing
[13:15:46] What about when it tries to repool?
[13:17:41] tbh I think that any planned Zarcillo downtime needs to be in a maintenance window
[13:21:49] also no effect on repool, zarcillo is not involved in that
[13:26:00] imeta = fetch_host_instance_from_zarcillo(self.args.instance) ?
[13:35:32] federico3: the 2 that I have running are both now doing a slow repool. So, do you have a rough idea of the downtime?
[13:36:31] the downtime for zarcillo? maybe some 30 mins, it's not going to impact the repool
[13:38:39] Well, it would have thrown an exception in run that I would have needed to watch out for if that repool had not yet started, wouldn't it? I would wait for Manuel to confirm anyway
[13:48:04] Is there any reason not to reverse replication rather than taking it down whilst primary?
[13:53:02] the repool does not interact with zarcillo's api
[13:54:08] reversing the replication would not help at this time, we don't have automated failover between pri/replica yet
[14:01:35] are these showing the interaction? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/mysql/pool.py#74 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/mysql/pool.py#466 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/mysql/major-upgrade.py#275
[14:08:12] ah, that's for the pc sections, you are right!
[14:08:17] what are you pooling?
[14:10:11] anyhow we can schedule a maintenance window for tomorrow morning anyways
[14:32:30] It was 2 nodes in s1, one from each DC .. they have both finished now anyway