[00:28:05] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 93.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[00:35:05] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[00:44:05] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 49.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[00:49:05] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 3.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[00:54:05] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 139 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[01:01:05] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 1.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[01:20:05] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 42.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[01:23:05] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[01:29:05] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 30 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[01:38:05] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 3.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[01:59:07] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 15.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[02:00:07] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 3.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[02:50:07] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 12.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[02:55:07] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[03:37:07] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 27.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[03:44:07] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[06:07:28] T367781 will start on s1 primaries
[06:07:29] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[06:31:57] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:15:11] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 42 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[07:18:11] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[08:40:21] rclone lost a deletion race with an admin (again)
[08:41:57] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:33:21] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 13.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[09:35:19] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[09:49:11] I am going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1081103 but will disable puppet on x1, es6 and es7 primaries beforehand, ok?
[09:49:18] ^ arnaudb
[10:15:17] sorry I missed your message jynus
[10:15:32] ack, ok
[10:20:05] deploying to x1 eqiad
[10:22:24] It takes less than 1 second to stop and restart the service
[10:22:34] so I think it is safe to do it by puppet
[10:22:47] but I will monitor x1 first to see there is nothing wrong with it
[10:23:29] thanks, I'm around, I'll be cooking for a few minutes
[10:23:52] no worries, just giving a heads up in case there is an alert so you are aware of this
[10:23:58] I am just being extra careful
[10:24:56] I think there won't be an issue as this change broke things without us noticing it
[10:25:15] so the fix should be equally easy
[10:25:31] but as I've been wrong in the past about the right fix, I am being careful
[11:45:25] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 69.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[11:51:27] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 1.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[12:23:36] FIRING: PuppetFailure: Puppet has failed on ms-be1058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:44:41] unless someone has seen any errors or alerts on x1, I will reenable puppet on es primaries
[13:01:57] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db2240:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:18:12] RESOLVED: SystemdUnitFailed: prometheus-mysqld-exporter.service on db2240:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:21:57] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db2240:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:33:05] after testing with dry-run, unless there is any concern I would proceed with depooling and then repooling with multiple steps db1185. (I've reduced the sleep from 900 to 60 for the testing)
[15:33:25] arnaudb, Amir1 let me know if I should pick a different one instead
[15:37:06] sounds good to me!
[15:37:10] idem!
[15:37:59] Amir1: if you're still around: https://phabricator.wikimedia.org/T377718#10247019
[15:38:12] thanks!
[15:38:46] let me take a look
[15:39:51] (just for the hotfix part, after I'll re-up a "proper" package)
[16:01:49] depool + slow repool completed, all seems to work fine. Task updates can be seen in T377738, let me know if you want to change the wording, more/less verbose, etc...
[16:02:11] T377738: Create a dbctl depool/pool cookbook - https://phabricator.wikimedia.org/T377738
[16:03:14] last path to test is to check it behaves as expected if there is another pending change in dbctl
[16:04:00] in the meanwhile I'm fine-tuning some small details
[16:34:37] volans: a feature request (if not done already): if you're pooling with four steps or more, please make it only add T#### on only first and the last steps. Otherwise, running something for 200 dbs will lead to 800-1000 comments on the ticket :D
[16:34:58] Auto schema does this after it made a mess
[16:36:13] Amir1: what do you mean by "add T####" ? updating phab?
[16:36:30] in SAL message which automatically triggers a phab comment
[16:37:01] I assume dbctl commit message triggers that SAL
[16:38:02] no SAL message generated by dbctl in this case, but the cookbook is explicitly updating phab, see https://phabricator.wikimedia.org/T377738
[16:38:12] right now is at every step, but I can make it less verbose if you prefer
[16:38:25] and update phab only on first/last
[16:39:25] I guess then we could instead of that add the T### to the cookbook's start/end message and not log anything from the cookbook itself
[16:43:53] yeah, whatever you prefer
[17:02:08] Amir1: do you want the phab update after the first % repooling and after the 100% repooling or are you ok with just updating phab before starting/after ending?
[17:02:27] either way is fine with me
[17:02:31] whatever easier for you
[17:03:06] ack, thx
[18:47:37] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 30 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[18:50:37] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 2.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[19:05:17] dumps time
[19:05:37] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 16.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[19:07:37] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[19:13:37] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 39.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[19:16:37] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
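
Note on the 16:34-17:03 exchange about the depool/pool cookbook: the idea discussed there (repool in several steps, but comment on the task only before starting and after ending) could look roughly like the sketch below. This is not the actual cookbook code and does not use the real dbctl or Spicerack APIs. `repool_in_steps`, `set_weight`, `notify_task`, and the step percentages are hypothetical names and values chosen for illustration. The 900-second default wait is taken from the "sleep from 900 to 60" remark in the log.

```python
from typing import Callable, List


def repool_in_steps(
    host: str,
    percentages: List[int],
    set_weight: Callable[[str, int], None],  # hypothetical: applies a pooled percentage
    notify_task: Callable[[str], None],      # hypothetical: posts one comment on the task
    sleep: Callable[[float], None],          # injected so tests can skip real waiting
    wait_seconds: float = 900,               # assumption: default wait mentioned in the log
) -> None:
    """Repool `host` gradually, commenting on the task only at start and end."""
    if not percentages:
        raise ValueError("at least one repool step is required")

    # One comment before the first step, regardless of how many steps follow.
    notify_task(f"Starting repool of {host} in {len(percentages)} steps")

    for index, percentage in enumerate(percentages):
        set_weight(host, percentage)
        # Wait between steps, but not after the final one.
        if index < len(percentages) - 1:
            sleep(wait_seconds)

    # One comment after the last step.
    notify_task(f"Repool of {host} completed at {percentages[-1]}%")
```

With this shape, a four-step repool produces two task comments per host instead of four, so a run over 200 hosts would leave about 400 comments rather than the 800-1000 mentioned above. Moving the task reference into a single start/end message for the whole run, as suggested at 16:39:25, would reduce that further.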