[05:12:23] PROBLEM - MariaDB sustained replica lag on s8 on db2195 is CRITICAL: 14.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2195&var-port=9104
[05:13:23] RECOVERY - MariaDB sustained replica lag on s8 on db2195 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2195&var-port=9104
[05:35:08] Amir1: I fixed it before you got to it :)
[06:37:38] dhinus: Could you upgrade clouddb1014 and clouddb1018 soonish? thanks!
[07:05:25] FIRING: SystemdUnitFailed: ferm.service on es1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:10:25] RESOLVED: SystemdUnitFailed: ferm.service on es1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:02:02] marostegui: on it
[10:02:39] dhinus: <3
[10:13:00] marostegui: I did some digging in the reboot-single cookbook, and I think it will always cause alerts to go off in -operations, because it removes all icinga silences on the host
[10:13:14] I will try using "sudo reboot" instead
[10:13:48] it would be good to have a db-specific reboot cookbook, cc federico3
[10:14:12] the procedure I follow at the moment should work for all db hosts, not just clouddbs https://wikitech.wikimedia.org/wiki/MariaDB/Rebooting_a_host
[10:14:34] what do DBAs do when they reboot a prod db?
[10:15:13] +1 on a cookbook
[10:15:31] We use the upgrade cookbook which does a reboot
[10:15:57] ah I didn't see that one! full cookbook name?
[10:25:43] clouddb1014 is upgraded and repooled
[10:26:31] * dhinus is looking at sre.mysql.upgrade
[10:28:08] I think that cookbook does everything I'm doing manually, but it does not support multi-instance hosts
[11:19:25] dhinus: because it doesn't use spicerack's mysql module that supports multi-instance
[11:20:07] :P
[11:20:35] {{Patches welcome}}
[11:21:16] fun typo btw: https://gerrit.wikimedia.org/g/operations/cookbooks/+/3ff5175ee95c5c05b0450bcdc5bfbe80317c8c28/cookbooks/sre/mysql/upgrade.py#74
[11:21:51] :D
[11:29:10] That's my fault. For half a year after I joined I seriously thought it's spicecrack. Until someone commented laughing emoji on my slide 😁
[11:31:37] LOL
[11:38:40] April 1st CR anyone? rename & reimplement in perl...
[11:53:44] marostegui: clouddb1014 and clouddb1018 are upgraded and rebooted, I will do the 3 remaining ones later
[13:01:06] PROBLEM - MariaDB sustained replica lag on s1 on db1154 is CRITICAL: 116.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13311
[13:02:06] RECOVERY - MariaDB sustained replica lag on s1 on db1154 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13311
[14:01:52] dhinus: thank you!
[15:02:03] marostegui: all clouddbs are now running 10.6.20!
[15:02:17] dhinus: thanks!
[16:41:22] Amir1: https://phabricator.wikimedia.org/T385645#10525207 this is on your radar right?
[16:41:50] marostegui: yeah, I can do two things,
[16:41:58] backport the change or let it ride with the train
[16:42:15] I'm letting it ride with the train (should be deployed by end of next week)
[16:42:27] once that's there, it should be safe to run the schema change again
[16:42:34] if you want me to do it sooner, that's fine too
[16:42:36] Amir1: That is totally fine, I won't deploy it during the summit anyway
[16:42:48] Amir1: No no, no need to rush this
[16:42:48] yeah
[16:42:59] we have so many things on the plate already
[16:43:04] Yeah