[09:39:16] Amir1: Is there anything I need to do for https://phabricator.wikimedia.org/T387032 or is it all repooled and checked?
[10:00:44] Amir1: I am going to move x1 back to ROW until https://phabricator.wikimedia.org/T385645 is unstalled
[11:01:18] marostegui: regarding the pc issue, I need to figure out why they went down; I will dig a bit more.
[11:02:11] marostegui: oh, I forgot to unstall it last week :( the code has definitely reached production
[11:13:28] Amir1: Ah, so I can go ahead?
[11:14:29] yup
[11:14:36] excellent
[11:14:47] would you mind updating the task so it is recorded there?
[11:16:00] I did unstall it
[11:28:23] PROBLEM - MariaDB sustained replica lag on es7 on es1035 is CRITICAL: 380.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1035&var-port=9104
[11:29:25] FIRING: SystemdUnitFailed: ferm.service on es1035:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:34:23] RECOVERY - MariaDB sustained replica lag on es7 on es1035 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1035&var-port=9104
[13:28:30] do we still need to support Buster aka Python 3.7 in conftool / dbctl?
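[Editor's note: the lag alerts above compare the observed replication lag against warning (5 s) and critical (10 s) thresholds, as in "(C)10 ge (W)5". A minimal Python sketch of that threshold logic, purely illustrative (this is not the actual alerting check):]

```python
def lag_status(lag_seconds: float, warn: float = 5.0, crit: float = 10.0) -> str:
    """Classify replication lag the way the alert text reads:
    CRITICAL if lag >= crit, WARNING if lag >= warn, else OK.
    Thresholds mirror the '(C)10 ge (W)5' format in the alerts above."""
    if lag_seconds >= crit:
        return "CRITICAL"
    if lag_seconds >= warn:
        return "WARNING"
    return "OK"

# Values taken from the es1035 alerts above:
print(lag_status(380.8))  # CRITICAL (380.8 ge 10)
print(lag_status(0))      # OK
```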
[13:36:00] in conftool yes, until the last Buster hosts are gone
[13:36:14] hopefully just a few more months
[13:38:32] filter https://debmonitor.wikimedia.org/packages/python3-conftool by deb10 if you want to see which ones they are
[13:43:12] > hopefully just a few more months ---> famous last words :D
[13:56:25] PROBLEM - MariaDB sustained replica lag on s7 on db1181 is CRITICAL: 195.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1181&var-port=9104
[13:57:35] elukey: did homer say those words?
[13:59:25] RECOVERY - MariaDB sustained replica lag on s7 on db1181 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1181&var-port=9104
[14:00:40] marostegui: you wouldn't ask if you used a bit of cumin in your last recipe
[14:01:02] elukey: I was in the kitchen using a cookbook, sorry
[14:47:39] PROBLEM - MariaDB sustained replica lag on es6 on es1036 is CRITICAL: 171.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1036&var-port=9104
[14:48:25] FIRING: SystemdUnitFailed: ferm.service on es1036:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:49:06] odd that the ferm.service failure corresponds with lag again
[14:49:33] Emperor: I think it is because the host was just rebooted and puppet didn't run
[14:50:38] correct
[14:50:39] RECOVERY - MariaDB sustained replica lag on es6 on es1036 is OK: (C)10 ge (W)5 ge 4.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1036&var-port=9104
[14:50:45] but why?
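[Editor's note: "support Buster aka Python 3.7" means conftool has to keep working on Debian Buster's Python 3.7 interpreter. A hedged sketch of the kind of early interpreter guard a tool in that position could carry; the 3.7 floor comes from the discussion above, everything else is illustrative and not actual conftool code:]

```python
import sys

MIN_PYTHON = (3, 7)  # Debian Buster ships Python 3.7

def check_python() -> None:
    """Fail early with a clear message on unsupported interpreters,
    instead of crashing later on incompatible syntax or stdlib use."""
    if sys.version_info < MIN_PYTHON:
        raise SystemExit(
            "need Python %d.%d+, found %s"
            % (MIN_PYTHON[0], MIN_PYTHON[1], sys.version.split()[0])
        )

check_python()
print("interpreter OK")
```

The flip side of such a floor is that new syntax (e.g. Python 3.10's `match` or `X | Y` type unions) stays off-limits until the last Buster host is gone.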
[14:51:02] last puppet run 13 minutes ago, uptime 6m
[14:51:19] PROBLEM - MariaDB sustained replica lag on s4 on db2219 is CRITICAL: 95.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2219&var-port=9104
[14:53:07] federico3: FYI, for the cookbooks repo a C+2 on Gerrit is enough to merge; Gerrit will then run CI and perform a gate-and-submit of the patch. See https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Deployment for more detail ;)
[14:53:19] RECOVERY - MariaDB sustained replica lag on s4 on db2219 is OK: (C)10 ge (W)5 ge 1.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2219&var-port=9104
[14:53:25] RESOLVED: SystemdUnitFailed: ferm.service on es1036:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:54:15] I thought the reboot cookbook downtimed the host?
[16:11:35] es7 is having issues at the same moment I was doing a switchover
[16:11:40] Writes are disabled there anyway
[16:11:44] And it is the new master that is having issues
[16:13:20] :(
[16:13:26] let me know if you need a hand
[16:13:29] is it the "things are getting locked" issue that has happened on past switchovers?
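[Editor's note: "last puppet run 13 minutes ago, uptime 6m" is the tell: the last puppet run predates the current boot, so ferm came up unconfigured after the reboot. A small illustrative helper capturing that comparison (hypothetical code, not part of any Wikimedia tooling):]

```python
import time

def puppet_ran_since_boot(last_run_epoch: float, uptime_seconds: float,
                          now=None) -> bool:
    """True if the last puppet run happened after the current boot.

    In the es1036 case above: last run 13 minutes ago but uptime only
    6 minutes, so puppet had NOT yet run on the fresh boot and
    ferm.service was still in a failed state.
    """
    now = time.time() if now is None else now
    boot_epoch = now - uptime_seconds
    return last_run_epoch >= boot_epoch

# The reported numbers: last puppet run 13 min ago, uptime 6 min.
now = 1_000_000.0
print(puppet_ran_since_boot(now - 13 * 60, 6 * 60, now=now))  # False
```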
[16:13:44] not sure what you mean, jaime
[16:13:50] I am going to revert the change
[16:13:55] The host is unresponsive
[16:14:12] on past switchovers, after 10.6, I think processes got weird, probably related to semisync
[16:14:36] not always, just sometimes
[16:17:46] I am reverting the switch
[16:17:50] it is all safe as everything is RO
[16:17:57] yeah
[16:18:07] let me know if I can help too in any way
[16:18:12] thanks guys
[16:20:20] done
[16:21:27] I am going to open es7 back up
[16:23:54] PROBLEM - MariaDB sustained replica lag on s8 on db2195 is CRITICAL: 44.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2195&var-port=9104
[16:25:56] RECOVERY - MariaDB sustained replica lag on s8 on db2195 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2195&var-port=9104
[16:57:20] Emperor: o/ is it ok if I reboot ms-be2075 to check one thing?
[16:57:26] it seems stuck and not in any OS
[17:00:01] elukey: you may need to liaise with JennH, cf T382707
[17:00:02] T382707: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707
[17:00:31] FTAOD, I have no objection, but I don't want to disrupt Jenn's work
[17:01:07] yep yep
[17:02:30] I am investigating why the reimage didn't work
[22:42:25] FIRING: [3x] SystemdUnitFailed: confd_prometheus_metrics.service on backup1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
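[Editor's note: the suspected semisync hangs during switchover at 16:14 can be inspected on a MariaDB primary via the `Rpl_semi_sync_*` status variables (from `SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync%'`). A hedged sketch of a heuristic over that output: the two variable names are real MariaDB semisync status variables, but the "stuck" heuristic itself is an illustrative assumption, not an established diagnostic:]

```python
def semisync_looks_stuck(status: dict) -> bool:
    """Heuristic: semisync is enabled on the primary AND sessions are
    blocked waiting for replica ACKs.

    'Rpl_semi_sync_master_status' (ON/OFF) and
    'Rpl_semi_sync_master_wait_sessions' are standard MariaDB status
    variables; treating any nonzero wait count as suspicious is an
    illustrative simplification.
    """
    enabled = status.get("Rpl_semi_sync_master_status") == "ON"
    waiting = int(status.get("Rpl_semi_sync_master_wait_sessions", 0))
    return enabled and waiting > 0

# Hypothetical snapshot resembling a primary blocked on ACKs:
sample = {
    "Rpl_semi_sync_master_status": "ON",
    "Rpl_semi_sync_master_wait_sessions": "3",
}
print(semisync_looks_stuck(sample))  # True
```

In a real investigation one would also look at `Rpl_semi_sync_master_timefunc_failures`, the configured ACK timeout, and the processlist, rather than a single counter.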