[00:00:23] PROBLEM - MariaDB sustained replica lag on db1146 is CRITICAL: 48.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1146&var-port=13314
[00:02:49] RECOVERY - MariaDB sustained replica lag on db1143 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1143&var-port=9104
[00:05:27] RECOVERY - MariaDB sustained replica lag on db1146 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1146&var-port=13314
[00:13:23] PROBLEM - MariaDB sustained replica lag on db1149 is CRITICAL: 41 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1149&var-port=9104
[00:22:03] RECOVERY - MariaDB sustained replica lag on db1149 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1149&var-port=9104
[00:23:57] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[00:32:19] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[00:35:41] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[00:56:41] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[01:01:57] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 5.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[01:19:39] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[01:48:07] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 14 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[01:53:15] PROBLEM - MariaDB sustained replica lag on db1142 is CRITICAL: 6.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1142&var-port=9104
[01:54:15] RECOVERY - MariaDB sustained replica lag on db1142 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1142&var-port=9104
[01:55:29] PROBLEM - MariaDB sustained replica lag on db1148 is CRITICAL: 10 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1148&var-port=9104
[02:03:25] RECOVERY - MariaDB sustained replica lag on db1148 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1148&var-port=9104
[02:05:33] PROBLEM - MariaDB sustained replica lag on db1121 is CRITICAL: 18.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1121&var-port=9104
[02:06:15] PROBLEM - MariaDB sustained replica lag on db1141 is CRITICAL: 20.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1141&var-port=9104
[02:06:49] RECOVERY - MariaDB sustained replica lag on db1121 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1121&var-port=9104
[02:09:15] PROBLEM - MariaDB sustained replica lag on db1143 is CRITICAL: 19.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1143&var-port=9104
[02:13:27] RECOVERY - MariaDB sustained replica lag on db1141 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1141&var-port=9104
[02:13:27] PROBLEM - MariaDB sustained replica lag on db1147 is CRITICAL: 19.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1147&var-port=9104
[02:13:37] RECOVERY - MariaDB sustained replica lag on db1143 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1143&var-port=9104
[02:13:45] PROBLEM - MariaDB sustained replica lag on db1149 is CRITICAL: 19 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1149&var-port=9104
[02:14:59] RECOVERY - MariaDB sustained replica lag on db1147 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1147&var-port=9104
[02:15:19] RECOVERY - MariaDB sustained replica lag on db1149 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1149&var-port=9104
[02:15:47] PROBLEM - MariaDB sustained replica lag on db1142 is CRITICAL: 18.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1142&var-port=9104
[02:19:03] RECOVERY - MariaDB sustained replica lag on db1142 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1142&var-port=9104
[02:27:17] PROBLEM - MariaDB sustained replica lag on db1121 is CRITICAL: 7.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1121&var-port=9104
[02:28:43] PROBLEM - MariaDB sustained replica lag on db1138 is CRITICAL: 7.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1138&var-port=9104
[02:29:05] PROBLEM - MariaDB sustained replica lag on db1142 is CRITICAL: 7.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1142&var-port=9104
[02:30:25] RECOVERY - MariaDB sustained replica lag on db1138 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1138&var-port=9104
[02:32:23] RECOVERY - MariaDB sustained replica lag on db1142 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1142&var-port=9104
[02:33:55] RECOVERY - MariaDB sustained replica lag on db1121 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1121&var-port=9104
[02:38:47] PROBLEM - MariaDB sustained replica lag on db1138 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1138&var-port=9104
[02:40:27] RECOVERY - MariaDB sustained replica lag on db1138 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1138&var-port=9104
[02:45:03] PROBLEM - MariaDB sustained replica lag on db1147 is CRITICAL: 4.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1147&var-port=9104
[02:48:23] RECOVERY - MariaDB sustained replica lag on db1147 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1147&var-port=9104
[02:52:11] PROBLEM - MariaDB sustained replica lag on db1138 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1138&var-port=9104
[02:53:49] RECOVERY - MariaDB sustained replica lag on db1138 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1138&var-port=9104
[02:57:29] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[03:26:01] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[03:47:55] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[06:00:52] 10DBA, 10Data-Persistence, 10User-Kormat: orchestrator: Select backend database solution - https://phabricator.wikimedia.org/T266003 (10Marostegui) db2093 now hosts the orchestrator database. @Kormat the only pending thing is to decide what to do with monitoring and `read_only` right? As right now `read_only...
[06:06:43] 10DBA: Monitor the growth of CheckUser tables at large wikis - https://phabricator.wikimedia.org/T265344 (10Marostegui) @Huji 20MB for ruwiki means around 1GB per year at current growth (assuming it keeps growing the same rate). That is perfectly acceptable. However, we do need to do this same exercise for the b...
[06:23:17] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 9 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[06:26:51] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[07:31:01] 10DBA, 10Commons, 10Operations, 10Platform Engineering, and 2 others: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10Marostegui) Another spike yesterday on DELETEs {F32415982} Checking binlogs from 22:17 to 22:22...
[08:08:14] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10Marostegui)
[08:08:39] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10Marostegui) Per my chat with Chris, updating the rack location from A2 to A1 and from C2 to C3
[08:11:21] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 8.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[08:19:51] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[08:26:33] 10DBA, 10Commons, 10Operations, 10Platform Engineering, and 2 others: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10Marostegui) Another spike from 08:05 to 08:06 and this is what the binlog shows (number of state...
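The two T266432 comments above (07:31 and 08:26) mention checking the binlogs around the DELETE spikes. A minimal sketch of looking at the same data from a SQL session follows; this is only an illustration, not the workflow actually used (most likely mysqlbinlog on the host itself), and the binlog file name and offsets are placeholders, not values from the log.

-- List the available binlog files and their sizes, then pick the one
-- covering the spike window:
SHOW BINARY LOGS;
-- Peek at the events in that file ('<binlog-file>' is a placeholder;
-- position 4 is the conventional start of a binlog):
SHOW BINLOG EVENTS IN '<binlog-file>' FROM 4 LIMIT 200;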
[08:36:45] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[08:38:25] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[09:03:29] 10DBA, 10Data-Persistence, 10Patch-For-Review, 10User-Kormat: orchestrator: Select backend database solution - https://phabricator.wikimedia.org/T266003 (10Kormat) @Marostegui: correct. I've sent a CR to change the `read_only` status for now.
[09:04:32] Morning
[09:04:56] agreed
[09:15:19] I have the (hopefully) final bout with the dentist today so I'll be offline between 11:30 and 12:30.
[09:15:45] 🤭
[09:16:19] 🦷
[09:17:01] It's all fun and games until somebody loses a tooth
[09:17:37] dentist + dc switchover on the same day, what can possibly go wrong
[09:21:29] What was that IRC quote site again?
[09:21:53] https://bash.toolforge.org/random ?
[09:22:04] Yes!
[09:23:05] Added :)
[09:25:32] 10DBA, 10Patch-For-Review: Populating orchestrator metadata on a per-server basis - https://phabricator.wikimedia.org/T266485 (10Kormat) We might be able to re-use the heartbeat table for this. ` select shard from heartbeat.heartbeat order by ts desc limit 1 `
[09:32:43] sobanski: bonus effect of our decision to iterate on the phab workflows: i no longer need to have the doc open
[09:38:29] \o/
[09:38:49] Side note, I need to update the doc now as it no longer reflects reality
[10:01:17] 10DBA, 10Patch-For-Review: Populating orchestrator metadata on a per-server basis - https://phabricator.wikimedia.org/T266485 (10Marostegui) Forgetting the existing hosts: ` root@dborch1001:~# orchestrator-client -c forget-cluster -alias pc1007 root@dborch1001:~# ` And cleaning up the aliases related table...
[11:24:37] 10DBA, 10Commons, 10Operations, 10Platform Engineering, and 2 others: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10matthiasmullie) At first sight, these DB operations seem to make sense: bots are in the process...
[11:25:48] 10DBA, 10Data-Persistence, 10Operations, 10Release-Engineering-Team-TODO, and 2 others: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (10Kormat) I just came across this: https://github.com/openark/orchestrator/blob/master/docs/ci-env.md#run-orchestrator-with-envir...
[11:27:20] 10DBA, 10Commons, 10Operations, 10Platform Engineering, and 2 others: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10jcrespo) Independently of the source of the issue, could these regenerations be throttled/rate l...
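As an aside on the heartbeat idea from the 09:25 comment on T266485: the pasted query returns the shard of the most recent heartbeat row, and the same table can also give a rough lag figure. A minimal sketch follows, assuming the heartbeat.heartbeat layout implied by that query (a shard column plus a pt-heartbeat ts value that MariaDB can cast to DATETIME, written in UTC); only the first statement is confirmed by the log.

-- Which shard does this server report itself in?
SELECT shard
FROM heartbeat.heartbeat
ORDER BY ts DESC
LIMIT 1;

-- Rough replication delay in seconds, from the freshest heartbeat row
-- (assumption: pt-heartbeat timestamps are UTC and castable to DATETIME):
SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) AS lag_seconds
FROM heartbeat.heartbeat;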
[12:10:41] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 72.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[12:13:52] i've just removed myself from the sre-data-persistence group and made lukasz the owner
[12:15:09] PROBLEM - MariaDB sustained replica lag on db1147 is CRITICAL: 13 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1147&var-port=9104
[12:15:49] PROBLEM - MariaDB sustained replica lag on db1121 is CRITICAL: 15 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1121&var-port=9104
[12:17:37] RECOVERY - MariaDB sustained replica lag on db1121 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1121&var-port=9104
[12:20:29] RECOVERY - MariaDB sustained replica lag on db1147 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1147&var-port=9104
[12:33:49] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[12:56:53] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[13:04:01] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[13:39:29] so in the end what was the outcome of the read only thingy, will it be done automatically?
[13:40:58] which read only thingy?
[13:41:14] downtime of checks
[13:41:38] masters will be downtimed by me and puppet will be run as part of the dc switch
[13:41:48] ok
[13:43:15] marostegui: the downtime can be pretty short, just a few minutes
[13:43:27] yep
[13:43:30] it will be very short
[14:13:14] marostegui: what was s3? real issue or alert issue?
[14:13:28] volans: a host was a bit overloaded with the initial spike
[14:13:30] seems ok now
[14:14:12] the problem is I don't know if it is cause or consequence
[14:14:32] when php idles, it keeps connections open, which can cause bad latency
[14:17:33] on slow queries I don't see anything significant, but who knows how mediawiki responds after switching the gtid coords
[14:20:41] test-s4 should change its read only policy, but not sure exactly how
[14:39:40] could this have been low enough to cause the issues? https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=13&orgId=1&var-server=db1123&var-port=9104&from=1603805944495&to=1603809544495 we should compare with other masters
[14:40:49] nah, s7 had the same low hit ratio and wasn't as impacted
[14:40:52] the issues I saw on s3 were related to a slave that peaked on connections
[14:41:03] do you have it located?
[14:41:13] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1603805944495&to=1603809544495&var-server=db1075&var-port=9104
[14:41:17] thanks
[14:41:31] indeed
[14:41:39] at 5K connections, the query killer kicks in
[14:42:18] you can see it in action (kill) here: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=10&orgId=1&from=1603805944495&to=1603809544495&var-server=db1075&var-port=9104
[14:43:29] strangely, buffer pool wasn't the issue there
[15:29:36] I am going to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/636686 to correct a leftover from the temporary otrs setup
[15:29:46] only alerting now that eqiad is active
[15:33:54] puppet doesn't need to run on icinga, at least not for read only changes, from what I can see on db1077
[15:35:19] kormat: do you want me to do the inverse change to: https://gerrit.wikimedia.org/r/c/operations/puppet/+/636686 on db2093?
[15:40:48] oh, it is already there, but not merged yet: https://gerrit.wikimedia.org/r/c/operations/puppet/+/636609/
[15:42:08] gerrit isn't loading for me
[15:42:18] works for me
[15:42:58] looks like routing issues
[15:44:02] our net or your isp?
[15:44:16] not sure yet.
[15:44:31] cloudflare?
[16:08:40] 10DBA, 10MediaWiki-Categories, 10Patch-Needs-Improvement: Increase size of categorylinks.cl_collation column - https://phabricator.wikimedia.org/T158724 (10kaldari) >I wonder why you are phrasing this as a response to what I wrote? Personally I agree that increasing the length of the column is a bad idea. I...
[16:15:31] 10DBA, 10AbuseFilter, 10MediaWiki-Change-tagging, 10Patch-Needs-Improvement: Cannot add a previously used change tag to an abuse filter - https://phabricator.wikimedia.org/T173917 (10Daimona) a:05Daimona→03None
[16:33:01] Hi all. It would seem that today at 14:09:54 UTC, replication lost its mind on ToolsDB. I've struggled to recover it (since I was aiming to fail it over today), but so far that hasn't been doable...probably partly because writes are still carrying on at the master. I was wondering if any of you had time to offer some advice (like just start over from a full dump) or whatever :) I can make a task if nobody is available.
[16:37:27] bstorm: hey. i'm around, but not sure if i'm all that much use
[16:38:11] Well a second set of eyes wouldn't hurt no matter what. :)
[16:38:25] bstorm: which kind of mind-loss is replication suffering from?
[16:38:51] ToolsDB is the cloud read-write database (mariadb) that we share with everyone in Toolforge, so it gets a lot of activity from a lot of volunteer programmers.
[16:39:00] Let me find the first failure...
[16:39:16] Last_Error: Could not execute Update_rows_v1 event on table s54518__mw.online; Can't find record in 'online', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log log.221984, end_log_pos 27719835
[16:39:28] This was quite surprising.
[16:40:14] I tried simply filtering the table, but that didn't help. There was always some row that couldn't be fixed. Usually when replication is messed up, it's because the master crashed and is hopefully read-only.
[16:40:20] That didn't happen
[16:40:55] oh crud. ok, data inconsistency
[16:40:55] So I'm a bit unsure how to proceed. Filtering a lot of tables didn't get me very far. I also tried changing the log and position since the old log it was pointed at is rolled over now.
[16:40:59] Yeah
[16:41:47] I dislike this service. Too many users (who are not all professionals) hitting one database server. However, it's what we have given cloud users to write to.
[16:41:51] which machine is this, and where is it replicating _from_?
[16:42:41] This is a VM on a dedicated hypervisor named clouddb1002.clouddb-services.eqiad1.wikimedia.cloud
[16:42:47] it replicates from clouddb1001.clouddb-services.eqiad1.wikimedia.cloud
[16:42:56] hey bstorm I am on my phone, I was about to drive
[16:42:59] I'm not sure you are in the cloud project for that yet! I can add you.
[16:43:08] Hello marostegui
[16:43:16] bstorm: But it does look like that slave is totally broken?
[16:43:17] This silly thing broke again, but at least it's "just" the replica
[16:43:21] ok. so this sort of failure indicates there has been some drift between the master and the replica;
[16:43:26] To me so far, yeah
[16:43:33] the 'simplest' solution would be to reclone the replica
[16:43:35] This happened just after 1400 UTC today
[16:43:41] how large is the dataset?
[16:44:10] It's TBs and has hundreds of users to coordinate with (or just tell, we are going read-only because that's how it has to be lol)
[16:44:21] bstorm: So my advice would be to reclone it, but if you need replication running for something else, you can either skip more tables or just use the idempotent flag for that replica, which will ignore all errors, but you need to rebuild that replica, that is the only possible solution
[16:44:22] * kormat winces
[16:44:22] Lemme check...
[16:44:54] bstorm: is it always the same table failing?
[16:44:55] Thank you marostegui, that's what I suspected, but I wanted an expert opinion. :)
[16:44:58] No
[16:45:07] At least always the same database?
[16:45:09] It's different tables as I filter things
[16:45:11] nope
[16:45:15] different dbs
[16:45:18] That looks very bad then
[16:45:20] ee
[16:45:41] bstorm: My worry is...any idea how that corruption could have happened?
[16:46:21] No. We did see toolforge tools throwing 504 errors this morning, which was odd. And now this. I was about to do our first scheduled failover for maintenance.
[16:46:38] How was that replica built in the first place?
[16:47:05] kormat: you can see the various tool users and how much space is in their DBs here https://tool-db-usage.toolforge.org/
[16:47:25] Aw, linkwatcher....long time no see
[16:47:31] marostegui: I may need to look back at the ticket from the physical server crashing to see it
[16:47:46] the physical server crashed?
[16:47:47] I may have stood it up and just started replicating? lol
[16:48:01] bstorm: I am asking 'cause, even if you rebuild the replica, it would be nice to try to identify what could have caused the data drift (a crash would explain it of course)
[16:48:03] that could explain things
[16:48:05] kormat: That was over a year ago when I built these
[16:48:14] They are both VMs now
[16:48:14] ah
[16:48:29] any unclean reboots of the replica VM recently?
[16:48:32] Yeah, I'll make a ticket and record findings.
[16:48:52] kormat: nah, we treat this pair like treasured pets.
[16:48:56] That replica is definitely not writable right?
[16:48:57] right :)
[16:49:20] marostegui: checking to be sure!
[16:50:02] uhhh
[16:50:05] https://www.irccloud.com/pastebin/zUeBMUrr/
[16:50:21] what the hell?
[16:50:45] bstorm: So someone could be writing directly to the replica? That would explain data drifts
[16:50:52] That would
[16:51:23] the dns should not have changed. I have a patch up for the failover, but I hadn't merged it unless someone else did
[16:51:48] Well, that gives me a lot to look into.
[16:51:52] Thank you!
[16:52:37] I'll try to figure out the actual state of things.
[16:52:40] bstorm: The most common issues could be: 1) crashes on the replica 2) Data already drifted when it was cloned 3) Someone wrote directly into that replica
[16:53:09] The replica is on a hypervisor that was reimaged, but I could have sworn the config told it to come up read-only. Clearly, I should not swear to such things :)
[16:53:24] bstorm: replication just broke all of a sudden?
[16:53:36] Yes at 1400UTC
[16:54:02] Now I'm worried that my DNS patch may have been merged by someone.
[16:54:13] I'll check that
[16:54:14] I would check the 3 cases I wrote above and see if you find something related
[16:54:25] Check past crashes if you can too
[16:54:33] Nope
[16:54:38] The DNS didn't change
[16:54:52] I have no idea why anyone would write to this, but apparently they would be able to
[16:54:59] Are any of those tables myisam by any chance?
[16:55:09] I do hope not...
[16:55:16] But I'll check
[16:55:20] I remember we had myisam on the old tools hosts
[16:55:31] We quite possibly still do.
[16:56:07] I'll make a ticket and not keep you all past dinner time. I think I have some things to pursue now. Also, I'm setting this to readonly
[16:56:27] Well, after I make sure that DNS hasn't magically changed without my patch
[16:56:33] haha
[16:56:36] Good luck!
[16:56:40] Thanks!
[22:12:33] 10DBA, 10Community-Tech, 10Expiring-Watchlist-Items: Watchlist Expiry: Release plan [rough schedule] - https://phabricator.wikimedia.org/T261005 (10ifried) @Marostegui Thank you for your help so far in the release process! Now that it has been a few weeks, we would like to proceed with enabling the feature o...
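To close the loop on the ToolsDB thread above: the statements below are a hedged sketch of the stop-gap options mentioned at 16:44 (filter the broken tables, or run the replica in idempotent mode until it can be recloned), plus the writability and MyISAM checks from 16:49-16:55. Statement names are standard MariaDB; the ignore pattern is only an example, and whether replication filters can be changed at runtime rather than in my.cnf depends on the MariaDB version running on clouddb1002.

-- Run on the broken replica (clouddb1002), not the master.
STOP SLAVE;

-- Option 1 from 16:44: skip the table(s) that keep failing
-- (pattern below is an example based on the failing database, not a recommendation):
SET GLOBAL replicate_wild_ignore_table = 's54518__mw.%';

-- Option 2 from 16:44: ignore row-not-found / duplicate-key errors wholesale:
SET GLOBAL slave_exec_mode = 'IDEMPOTENT';

START SLAVE;

-- The writability check from 16:49, and making the replica read-only (16:56):
SHOW GLOBAL VARIABLES LIKE 'read_only';
SET GLOBAL read_only = ON;

-- The MyISAM question from 16:54: list any remaining MyISAM tables.
SELECT table_schema, table_name
FROM information_schema.tables
WHERE engine = 'MyISAM'
  AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');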