[00:06:15] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 18.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [00:09:15] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 1.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [00:31:17] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 20.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [00:43:17] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 11.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [00:50:17] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [00:58:17] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 27.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [01:03:17] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 3.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [01:37:53] FIRING: PuppetFailure: Puppet has failed on es1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:37:57] FIRING: PuppetFailure: Puppet has failed on ms-be1082:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:42:49] FIRING: [4x] PuppetFailure: Puppet has failed on db1184:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:42:49] FIRING: [3x] PuppetFailure: Puppet has failed on ms-be1055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:42:53] FIRING: [2x] PuppetFailure: Puppet has failed on db1206:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:43:49] FIRING: PuppetFailure: Puppet has failed on dbprov1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:47:49] FIRING: [8x] PuppetFailure: Puppet has failed on db1170:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:47:53] FIRING: [4x] PuppetFailure: Puppet has failed on ms-be1055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:47:57] FIRING: PuppetFailure: Puppet has failed on aqs1017:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:48:01] FIRING: [7x] 
PuppetFailure: Puppet has failed on db1205:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:48:49] FIRING: [2x] PuppetFailure: Puppet has failed on dbprov1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:49:49] FIRING: PuppetFailure: Puppet has failed on dbprov1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:52:05] FIRING: PuppetFailure: Puppet has failed on thanos-fe1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:52:17] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 14.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [01:52:49] FIRING: [8x] PuppetFailure: Puppet has failed on ms-be1055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:52:53] FIRING: [14x] PuppetFailure: Puppet has failed on db1160:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:52:57] FIRING: [3x] PuppetFailure: Puppet has failed on aqs1013:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:53:01] FIRING: [8x] PuppetFailure: Puppet has failed on db1205:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:53:49] FIRING: [4x] PuppetFailure: Puppet has failed on backup1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:54:17] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 2.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [01:57:49] FIRING: [13x] PuppetFailure: Puppet has failed on ms-be1051:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:57:49] FIRING: [17x] PuppetFailure: Puppet has failed on db1156:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:57:49] FIRING: [11x] PuppetFailure: Puppet has failed on db1205:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:58:49] FIRING: [5x] PuppetFailure: Puppet has failed on backup1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:02:49] FIRING: [14x] PuppetFailure: Puppet has failed on ms-be1051:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:02:53] FIRING: [5x] PuppetFailure: Puppet has failed on aqs1011:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:02:57] FIRING: [13x] PuppetFailure: Puppet has failed on db1205:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:03:01] FIRING: [20x] PuppetFailure: Puppet has failed on db1156:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:03:49] FIRING: [6x] PuppetFailure: Puppet has failed on backup1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:07:49] FIRING: [14x] PuppetFailure: Puppet has failed on ms-be1051:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:07:53] FIRING: [5x] PuppetFailure: Puppet has failed on aqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:07:57] FIRING: [13x] PuppetFailure: Puppet has failed on db1205:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:08:01] FIRING: [21x] PuppetFailure: Puppet has failed on db1153:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:08:53] FIRING: [6x] PuppetFailure: Puppet has failed on backup1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:09:57] RESOLVED: PuppetFailure: Puppet has failed on dbprov1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:12:49] FIRING: [13x] PuppetFailure: Puppet has failed on ms-be1051:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:12:53] FIRING: [12x] PuppetFailure: Puppet has failed on db1205:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:12:57] FIRING: [21x] PuppetFailure: Puppet has failed on db1153:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:13:49] FIRING: [6x] PuppetFailure: Puppet has failed on backup1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:16:17] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 18 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [02:17:01] RESOLVED: PuppetFailure: Puppet has failed on thanos-fe1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:17:49] FIRING: [5x] PuppetFailure: Puppet has failed on aqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:17:49] FIRING: [11x] PuppetFailure: Puppet has failed on db1205:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:17:49] FIRING: [19x] PuppetFailure: Puppet has failed on db1153:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:18:53] RESOLVED: [6x] PuppetFailure: Puppet has failed on backup1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:20:17] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [02:22:48] RESOLVED: [14x] PuppetFailure: Puppet has failed on db1153:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:22:49] RESOLVED: [10x] PuppetFailure: Puppet has failed on ms-be1051:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:22:49] RESOLVED: [4x] PuppetFailure: Puppet has failed on aqs1013:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:22:53] RESOLVED: [11x] PuppetFailure: Puppet has failed on db1205:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:18:18] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 90.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [07:23:20] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [08:37:24] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 60.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [08:44:24] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [09:32:22] I appreciate it's Friday, but is anyone up for approving https://gerrit.wikimedia.org/r/c/operations/puppet/+/1076158 please? Otherwise puppet is going to be unhappy on the apus nodes until I get back from vacation... 
[09:46:52] +1d
[10:07:32] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 28.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[10:16:34] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[10:28:39] oh the dumps again
[10:38:14] :sadpanda:
[12:14:00] https://phabricator.wikimedia.org/T375652#10182735 I guess I will have to add "checking table schemas/autoincs" to the pre-switchover checks
[12:41:08] es1022 raid seems to have gone back to healthy status, permission to repool it slowly?
[12:44:56] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 25.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[12:47:56] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[12:51:24] I am seeing no objections, so I will start pooling es1022 slowly (I checked disks were in an optimal state already) T375257
[12:51:24] T375257: Degraded RAID on es1022 - https://phabricator.wikimedia.org/T375257
[12:54:03] oh, of course, eqiad es load is very low now, so very little traffic
[13:31:55] Thank you jynus
[13:32:44] as much as possible I would like not to leave things depooled for long
[14:39:13] Amir1: one thing- not urgent, but we found some weirdness when we did the pre-switchover checks on pc1017 / pc2017 (prometheus monitoring complains, lack of ops/events database)
[14:39:46] I fixed it on zarcillo, but didn't have the time to check the other issues
[14:41:22] e.g.: https://grafana.wikimedia.org/goto/O8LWsrRNR?orgId=1
[15:07:46] Thanks
[15:07:52] I'm chipping at stuff
[19:56:38] Hello! I was directed here to discuss the issues I've had with the check_bacula.py script from operations/puppet. I'm preparing patches to send to gerrit soon, but the second one is less trivial and I'm wondering if someone with knowledge of that script could share their opinion for the following:
[19:59:30] basically I'm working on integrating the check_bacula.py script in Tor's infrastructure so that we also get info about our backups. in Bacula.get_job_executions() there's filtering happening on lines that don't start with 'JobId:', however in Tor's infrastructure (the bacula director is running debian bookworm with bacula 9.6.7) the output from bconsole has 'jobid:' instead
[20:00:22] I'm wondering if the output difference depends on the version of bacula, e.g. maybe if wmf's director uses an older/more recent version of bacula
[20:20:43] oh ... never mind the above. I wanted to check in with someone about the approach for changing the code to avoid adding a huge processing cost to that line but I actually found a way to make that change. I'll send the two patches to gerrit in a few minutes
[20:25:34] LeLutin: Hi, bacula is mostly expertise of jynus but he is out for the week. You can ping him on Monday if you want to know more in general
[20:28:50] Amir1: ok! no rush. I've never used gerrit before though... should I send a topic branch with the patches I'm proposing?
[20:29:28] it's nice but not mandatory
[20:29:29] https://www.mediawiki.org/wiki/Gerrit/Tutorial/tl;dr
[20:29:43] this should save you some time (saved mine 11 years ago :D)
[20:30:47] thanks I'll check that page
[20:43:17] changes submitted to gerrit. that wiki page did indeed save me time, thanks :)
[20:43:57] \o/
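
Note on the 'JobId:' vs 'jobid:' question raised at [19:59:30]: one way such a filter could accept both spellings cheaply is to lowercase only the leading characters of each line before comparing. The sketch below is an assumption for illustration, not the code in check_bacula.py and not the patch that was submitted; the function name `filter_job_lines` and the sample strings are hypothetical placeholders, not real bconsole output.

```python
def filter_job_lines(lines, prefix="jobid:"):
    """Keep only the lines that start with `prefix`, ignoring case.

    Hypothetical helper for illustration; not taken from check_bacula.py.
    """
    wanted = prefix.lower()
    width = len(wanted)
    # Lowercase just the first `width` characters of each line, so long
    # lines are never copied in full only to test their prefix.
    return [line for line in lines if line[:width].lower() == wanted]


# Placeholder strings, not real bconsole output.
sample = ["JobId: 101", "jobid: 102", "Name: example-job", ""]
print(filter_job_lines(sample))  # ['JobId: 101', 'jobid: 102']
```

Slicing before lowercasing keeps the per-line cost bounded by the prefix length, which is one plausible answer to the "huge processing cost" worry mentioned at [20:20:43].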