[09:39:16] Amir1: Is there anything I need to do for https://phabricator.wikimedia.org/T387032 or is it all repooled and checked?
[10:00:44] Amir1: I am going to move x1 back to ROW until https://phabricator.wikimedia.org/T385645 is unstalled
[11:01:18] marostegui: regarding the pc issue, I need to figure out why they went down; I will dig a bit more.
[11:02:11] marostegui: oh, I forgot to unstall it last week :( the code has definitely reached production
[11:13:28] Amir1: Ah, so I can go ahead?
[11:14:29] yup
[11:14:36] excellent
[11:14:47] would you mind updating the task so it is recorded there?
[11:16:00] I did unstall it
[11:28:23] PROBLEM - MariaDB sustained replica lag on es7 on es1035 is CRITICAL: 380.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1035&var-port=9104
[11:29:25] FIRING: SystemdUnitFailed: ferm.service on es1035:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:34:23] RECOVERY - MariaDB sustained replica lag on es7 on es1035 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1035&var-port=9104
[13:28:30] do we still need to support Buster aka Python 3.7 in conftool / dbctl?
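[Editor's note: the lag alerts above compare the observed replication lag against warning (5 s) and critical (10 s) thresholds, as in "(C)10 ge (W)5". A minimal Python sketch of that threshold logic, purely illustrative (this is not the actual alerting check):]

```python
def lag_status(lag_seconds: float, warn: float = 5.0, crit: float = 10.0) -> str:
    """Classify replication lag the way the alert text reads:
    CRITICAL if lag >= crit, WARNING if lag >= warn, else OK.
    Thresholds mirror the '(C)10 ge (W)5' format in the alerts above."""
    if lag_seconds >= crit:
        return "CRITICAL"
    if lag_seconds >= warn:
        return "WARNING"
    return "OK"

# Values taken from the es1035 alerts above:
print(lag_status(380.8))  # CRITICAL (380.8 ge 10)
print(lag_status(0))      # OK
```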
[13:36:00] in conftool yes, until the last Buster hosts are gone
[13:36:14] hopefully just a few more months
[13:38:32] filter https://debmonitor.wikimedia.org/packages/python3-conftool by deb10 if you want to see which ones they are
[13:43:12] > hopefully just a few more months ---> famous last words :D
[13:56:25] PROBLEM - MariaDB sustained replica lag on s7 on db1181 is CRITICAL: 195.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1181&var-port=9104
[13:57:35] elukey: did homer say those words?
[13:59:25] RECOVERY - MariaDB sustained replica lag on s7 on db1181 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1181&var-port=9104
[14:00:40] marostegui: you wouldn't ask if you used a bit of cumin in your last recipe
[14:01:02] elukey: I was in the kitchen using a cookbook, sorry
[14:47:39] PROBLEM - MariaDB sustained replica lag on es6 on es1036 is CRITICAL: 171.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1036&var-port=9104
[14:48:25] FIRING: SystemdUnitFailed: ferm.service on es1036:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:49:06] odd that the ferm.service failure corresponds with lag again
[14:49:33] Emperor: I think it is because the host was just rebooted and puppet didn't run
[14:50:38] correct
[14:50:39] RECOVERY - MariaDB sustained replica lag on es6 on es1036 is OK: (C)10 ge (W)5 ge 4.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1036&var-port=9104
[14:50:45] but why?
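[Editor's note: "support Buster aka Python 3.7" means conftool has to keep working on Debian Buster's Python 3.7 interpreter. A hedged sketch of the kind of early interpreter guard a tool in that position could carry; the 3.7 floor comes from the discussion above, everything else is illustrative and not actual conftool code:]

```python
import sys

MIN_PYTHON = (3, 7)  # Debian Buster ships Python 3.7

def check_python() -> None:
    """Fail early with a clear message on unsupported interpreters,
    instead of crashing later on incompatible syntax or stdlib use."""
    if sys.version_info < MIN_PYTHON:
        raise SystemExit(
            "need Python %d.%d+, found %s"
            % (MIN_PYTHON[0], MIN_PYTHON[1], sys.version.split()[0])
        )

check_python()
print("interpreter OK")
```

The flip side of such a floor is that new syntax (e.g. Python 3.10's `match` or `X | Y` type unions) stays off-limits until the last Buster host is gone.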
[14:51:02] last puppet run 13 minutes ago, uptime 6m
[14:51:19] PROBLEM - MariaDB sustained replica lag on s4 on db2219 is CRITICAL: 95.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2219&var-port=9104
[14:53:07] federico3: FYI, for the cookbooks repo a C+2 on Gerrit is enough to merge; Gerrit will then run CI and perform a gate-and-submit of the patch. See https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Deployment for more detail ;)
[14:53:19] RECOVERY - MariaDB sustained replica lag on s4 on db2219 is OK: (C)10 ge (W)5 ge 1.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2219&var-port=9104
[14:53:25] RESOLVED: SystemdUnitFailed: ferm.service on es1036:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:54:15] I thought the reboot cookbook downtimed the host?
[16:11:35] es7 is having issues at the same moment I was doing a switchover
[16:11:40] Writes are disabled there anyway
[16:11:44] And it is the new master that is having issues
[16:13:20] :(
[16:13:26] let me know if you need a hand
[16:13:29] is it the "things are getting locked" issue that has happened on past switchovers?
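[Editor's note: "last puppet run 13 minutes ago, uptime 6m" is the tell: the last puppet run predates the current boot, so ferm came up unconfigured after the reboot. A small illustrative helper capturing that comparison (hypothetical code, not part of any Wikimedia tooling):]

```python
import time

def puppet_ran_since_boot(last_run_epoch: float, uptime_seconds: float,
                          now=None) -> bool:
    """True if the last puppet run happened after the current boot.

    In the es1036 case above: last run 13 minutes ago but uptime only
    6 minutes, so puppet had NOT yet run on the fresh boot and
    ferm.service was still in a failed state.
    """
    now = time.time() if now is None else now
    boot_epoch = now - uptime_seconds
    return last_run_epoch >= boot_epoch

# The reported numbers: last puppet run 13 min ago, uptime 6 min.
now = 1_000_000.0
print(puppet_ran_since_boot(now - 13 * 60, 6 * 60, now=now))  # False
```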
[16:13:44] not sure what you mean, jaime
[16:13:50] I am going to revert the change
[16:13:55] The host is unresponsive
[16:14:12] on past switchovers, after 10.6, I think processes got weird, probably related to semisync
[16:14:36] not always, just sometimes
[16:17:46] I am reverting the switch
[16:17:50] it is all safe as everything is RO
[16:17:57] yeah
[16:18:07] let me know if I can help too in any way
[16:18:12] thanks guys
[16:20:20] done
[16:21:27] I am going to open es7 back up
[16:23:54] PROBLEM - MariaDB sustained replica lag on s8 on db2195 is CRITICAL: 44.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2195&var-port=9104
[16:25:56] RECOVERY - MariaDB sustained replica lag on s8 on db2195 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2195&var-port=9104
[16:57:20] Emperor: o/ is it ok if I reboot ms-be2075 to check one thing?
[16:57:26] it seems stuck and not in any OS
[17:00:01] elukey: you may need to liaise with JennH, cf T382707
[17:00:02] T382707: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707
[17:00:31] FTAOD, I have no objection, but I don't want to disrupt Jenn's work
[17:01:07] yep yep
[17:02:30] I am investigating why the reimage didn't work
[22:42:25] FIRING: [3x] SystemdUnitFailed: confd_prometheus_metrics.service on backup1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
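[Editor's note: the suspected semisync hangs during switchover at 16:14 can be inspected on a MariaDB primary via the `Rpl_semi_sync_*` status variables (from `SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync%'`). A hedged sketch of a heuristic over that output: the two variable names are real MariaDB semisync status variables, but the "stuck" heuristic itself is an illustrative assumption, not an established diagnostic:]

```python
def semisync_looks_stuck(status: dict) -> bool:
    """Heuristic: semisync is enabled on the primary AND sessions are
    blocked waiting for replica ACKs.

    'Rpl_semi_sync_master_status' (ON/OFF) and
    'Rpl_semi_sync_master_wait_sessions' are standard MariaDB status
    variables; treating any nonzero wait count as suspicious is an
    illustrative simplification.
    """
    enabled = status.get("Rpl_semi_sync_master_status") == "ON"
    waiting = int(status.get("Rpl_semi_sync_master_wait_sessions", 0))
    return enabled and waiting > 0

# Hypothetical snapshot resembling a primary blocked on ACKs:
sample = {
    "Rpl_semi_sync_master_status": "ON",
    "Rpl_semi_sync_master_wait_sessions": "3",
}
print(semisync_looks_stuck(sample))  # True
```

In a real investigation one would also look at `Rpl_semi_sync_master_timefunc_failures`, the configured ACK timeout, and the processlist, rather than a single counter.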