[05:27:27] <marostegui>	 x1 has been switched
[06:00:24] <federico3>	 thanks, starting the switchover in s4
[06:20:40] <federico3>	 https://gerrit.wikimedia.org/r/c/operations/dns/+/1286411 needed a rebase
[06:22:00] <marostegui>	 ok
[06:23:12] <federico3>	 and now it's failing tests due to an unrelated ipaddr
[06:31:26] <federico3>	 other than that the switchover is done
[07:53:58] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db1217:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:56:33] <Emperor>	 marostegui: ^-- looks like an expired downtime?
[07:56:47] <marostegui>	 mmmm somewhat, I will fix it now
[07:57:33] <Emperor>	 ta
[07:58:58] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus-mysqld-exporter.service on db1217:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:00:34] <cezmunsta>	 federico3: In Zarcillo, most of s1 shows as running an older kernel, but in the May reboots most have been marked as done - is that all correct?
[12:00:49] <federico3>	 checking
[12:01:17] <federico3>	 you mean in https://zarcillo.wikimedia.org/ui/sections#s1 ?
[12:01:42] <cezmunsta>	 Yes
[12:02:39] <federico3>	 the flag is relative between hosts in the same section: most likely some other hosts have trixie and a newer MariaDB. The purpose is to use this info to select a preferred candidate (in case there is more than one candidate)
[12:07:09] <cezmunsta>	 OK, although in the view that shows right now, none of the servers used to flag the others are ones that would be used (i.e. backups, cloudb* etc) 
[12:08:43] <cezmunsta>	 Is it possible to exclude such servers from the comparison, or is it preferred to still use them when cross-checking for that flag?
[12:12:06] <cezmunsta>	 .. or is that already the case and it is actually the sanitarium master? The uptimes suggested that most hosts were rebooted recently, hence the question
[12:13:05] <federico3>	 maybe it's db1169?
[12:13:49] <federico3>	 if you click on the hostname you can see the kernel version and MariaDB version
[12:24:23] <cezmunsta>	 Yes, I think it was the sanitarium master
[12:27:10] <federico3>	 I'm referring to the third in the list https://usercontent.irccloud-cdn.com/file/6np9465A/image.png
[12:31:32] <cezmunsta>	 Sorry, yes - transposed the last 2 digits 
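The "relative flag" behaviour discussed above can be illustrated with a minimal sketch. This is not Zarcillo's actual code: the `Host` fields, the role names, the host names, the kernel versions and the `excluded_roles` parameter are all hypothetical, chosen only to show how a single newer host (for example a backup host) can cause every other host in the section to be flagged, and how excluding such hosts from the comparison would change the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str     # hypothetical host name
    kernel: tuple  # kernel version as a tuple of ints, e.g. (6, 1, 0)
    role: str     # hypothetical role label: "replica", "backup", ...


def flag_older_kernel(hosts, excluded_roles=frozenset()):
    """Return the names of hosts running a kernel older than the newest
    kernel found among the reference hosts of the same section.

    Hosts whose role is in excluded_roles are not used as a reference.
    """
    reference = [h for h in hosts if h.role not in excluded_roles]
    if not reference:
        return set()
    newest = max(h.kernel for h in reference)
    return {h.name for h in hosts if h.kernel < newest}


# Made-up section: two replicas on the same kernel, one backup host on a newer one.
section = [
    Host("replica-a", (6, 1, 0), "replica"),
    Host("replica-b", (6, 1, 0), "replica"),
    Host("backup-a", (6, 12, 0), "backup"),
]

# Comparing against every host: both replicas are flagged because of the backup host.
flagged_all = flag_older_kernel(section)

# Excluding backup hosts from the reference set: nothing is flagged.
flagged_filtered = flag_older_kernel(section, excluded_roles={"backup"})
```

Whether excluding such hosts is desirable depends on what the flag is meant to convey; the sketch only shows that the choice of reference set changes which hosts appear outdated.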
[13:08:12] <federico3>	 db2185 upgraded ok, can I upgrade db1215 now? It will give orchestrator and zarcillo a bit of downtime @marostegui @cezmunsta 
[13:13:57] <cezmunsta>	 federico3: what effect does that have on a running  reimage?
[13:15:22] <federico3>	 on an already running reimage using major-upgrade? nothing
[13:15:46] <cezmunsta>	 What about when it tries to repool?
[13:17:41] <cezmunsta>	 tbh I think that any planned Zarcillo downtime needs to be in a maintenance window
[13:21:49] <federico3>	 also no effect on repool, zarcillo is not involved in that
[13:26:00] <cezmunsta>	 imeta = fetch_host_instance_from_zarcillo(self.args.instance) ?
[13:35:32] <cezmunsta>	 federico3: the 2 that I have running are both now doing a slow repool. So, do you have a rough idea of the downtime?
[13:36:31] <federico3>	 the downtime for zarcillo? maybe some 30 mins, it's not going to impact the repool
[13:38:39] <cezmunsta>	 Well, it would have thrown an exception in run that I would have needed to watch out for if that repool had not yet started, wouldn't it? I would wait for Manuel to confirm anyway
[13:48:04] <cezmunsta>	 Is there any reason not to reverse replication rather than taking it down whilst primary?
[13:53:02] <federico3>	 the repool does not interact with zarcillo's api
[13:54:08] <federico3>	 reversing the replication would not help at this time, we don't have automated failover between pri/replica yet
[14:01:35] <cezmunsta>	 are these showing the interaction? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/mysql/pool.py#74 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/mysql/pool.py#466 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/mysql/major-upgrade.py#275 
[14:08:12] <federico3>	 ah, that's for the pc sections, you are right!
[14:08:17] <federico3>	 what are you pooling?
[14:10:11] <federico3>	 anyhow, we can schedule a maintenance window for tomorrow morning
[14:32:30] <cezmunsta>	 It was 2 nodes in s1, one from each DC .. they have both finished now anyway