[00:20:28] DBA, Wikimedia-Rdbms, Tracking-Neverending: Database replication lag issues (tracking) - https://phabricator.wikimedia.org/T3268 (aaron)
[06:57:06] m5 replication on codfw has been stopped for 8 hours, does anyone know why?
[07:00:18] jynus: ~8h matches https://gerrit.wikimedia.org/r/c/operations/puppet/+/674724
[07:00:42] yeah, but I just see now that it crashed
[07:04:31] DBA, Wikimedia-Mailing-lists: db2135 crashed - https://phabricator.wikimedia.org/T278408 (jcrespo)
[07:14:09] jynus: oh crap. I had to delete and recreate some tables that had the wrong charset, is it possible for that to crash mysql?
[07:18:12] it crashed after "CREATE UNIQUE INDEX ix_mailinglist_list_id ON mailinglist (list_id)"
[07:18:33] at 2021-03-24 22:26:53
[07:18:59] DBA, Wikimedia-Mailing-lists: db2135 crashed - https://phabricator.wikimedia.org/T278408 (Legoktm) oh crap, it probably is my fault. I had to delete and recreate some tables with the wrong charset (T277286#6944044) - I wasn't aware that would or could even crash mysql. And I should have noticed the m5 al...
[07:19:32] DBA, Wikimedia-Mailing-lists: db2135 crashed - https://phabricator.wikimedia.org/T278408 (jcrespo) it crashed after "CREATE UNIQUE INDEX ix_mailinglist_list_id ON mailinglist (list_id)" at 2021-03-24 22:26:53
[07:19:58] really the whole thing was a disaster on my part. I expected that I would be manually creating the database tables, and as soon as I opened the firewall, the services immediately connected and applied the schema to create the tables
[07:21:41] why would it crash after the index creation?
[07:21:54] we still don't know what caused it
[07:22:30] it could be just that activity triggered other issue- mysqls shouldn't crash because unprivileged user actions
[07:22:52] ack, would you like me to help investigate or anything?
[07:29:02] o/
[07:29:12] I'm around too, let me know if I can do anything
[07:44:42] Error 'Duplicate key name 'ix_mailinglist_list_id'' on query. Default database: 'testmailman3'. Query: 'CREATE UNIQUE INDEX ix_mailinglist_list_id ON mailinglist (list_id)'
[07:48:48] there is one thing you could do, Amir1- please if you could run your script checking data structure diffs between tables on these 2 dbs?
[07:49:09] I am not sure they are the same after crash
[07:50:07] I can try
[07:51:56] DBA, SRE, Wikimedia-Mailing-lists: db2135 crashed - https://phabricator.wikimedia.org/T278408 (jcrespo) I restarted the host to check for hw errors. After upgrade and restart, I ran into: ` Error 'Duplicate key name 'ix_mailinglist_list_id'' on query. Default database: 'testmailman3'. Query: 'CREATE...
[08:27:19] kormat, I think dborch was already restarted for reboot, right? https://phabricator.wikimedia.org/T273278 (pending to be reported)
[09:04:25] jynus: yes, done back on feb 25th. i'll add it to the task.
[10:00:28] Data-Persistence-Backup, SRE: Setup an Offsite backup infrastructure - https://phabricator.wikimedia.org/T85278 (jcrespo) a:jcrespo→None
[10:01:56] Data-Persistence-Backup, SRE: Setup an Offsite backup infrastructure - https://phabricator.wikimedia.org/T85278 (jcrespo) Open→Resolved a:jcrespo This is technically resolved, for both regular backups and database ones. However, I may create a new task at some point to remove the offsite copy...
[10:05:31] Data-Persistence-Backup: Care needed with mariabackup versions - https://phabricator.wikimedia.org/T253959 (jcrespo) p:Medium→Low Due to my comments on https://gerrit.wikimedia.org/r/599343 I think this is something we could improve from wmfbackups, but it is not a high priority for now (should not b...
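(Editor's note on the 07:48:48 request for a structure diff between the two databases: the sketch below is only an illustration of what such a comparison might look like, not the script mentioned in the channel. It assumes each host's schema has already been dumped into a plain mapping of table name to column definitions; the function name `diff_schemas` and the sample definitions are hypothetical.)

```python
def diff_schemas(left, right):
    """Compare two schema snapshots and return readable differences.

    Each snapshot maps table name -> {column name -> definition string}.
    How the snapshots are collected is out of scope for this sketch.
    """
    diffs = []
    for table in sorted(set(left) | set(right)):
        if table not in left:
            diffs.append(f"{table}: missing on left")
            continue
        if table not in right:
            diffs.append(f"{table}: missing on right")
            continue
        lcols, rcols = left[table], right[table]
        # Walk the union of column names so additions and removals both show up.
        for col in sorted(set(lcols) | set(rcols)):
            if lcols.get(col) != rcols.get(col):
                diffs.append(
                    f"{table}.{col}: {lcols.get(col)!r} != {rcols.get(col)!r}"
                )
    return diffs


# Hypothetical snapshots; the column definitions are made up for illustration.
primary = {"mailinglist": {"list_id": "varchar(255) utf8mb4"}}
replica = {"mailinglist": {"list_id": "varchar(255) latin1"}, "extra": {}}

result = diff_schemas(primary, replica)
```

(With the sample data above, `result` reports the table that exists on only one side and the column whose definition differs. Index definitions could be compared the same way by adding them to each snapshot.)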
[13:36:33] DBA, SRE, Wikimedia-Mailing-lists: db2135 crashed - https://phabricator.wikimedia.org/T278408 (LSobanski) p:Triage→Medium
[14:03:52] DBA, SRE, Wikimedia-Mailing-lists: db2135 crashed - https://phabricator.wikimedia.org/T278408 (Kormat) This looks like https://jira.mariadb.org/browse/MDEV-23019, which was fixed in 10.4.14. The server was running 10.4.13 when the crash occurred. The server is now running 10.4.18.
[15:19:50] DBA, Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) Another idea that may not be feasible: Would it be possible to move the event produce...
[16:06:48] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[16:11:06] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[16:41:54] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[16:46:43] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[17:01:53] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[17:12:47] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
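(Editor's note on the pc2008 alerts: the PROBLEM and RECOVERY lines show a critical threshold of 2 and a warning threshold of 1, with "2 ge 2" already counting as CRITICAL, so the comparison appears to be inclusive. The sketch below reproduces only that threshold logic. The function name `classify_lag` is hypothetical, and the real check is described as "sustained", which presumably involves a time window that this sketch ignores.)

```python
def classify_lag(lag, warning=1.0, critical=2.0):
    """Map a replica lag value to an alert state.

    Thresholds default to the (W)1 and (C)2 values visible in the alert
    text. "ge" is read as greater-than-or-equal, matching "2 ge 2" being
    reported as CRITICAL.
    """
    if lag >= critical:
        return "CRITICAL"
    if lag >= warning:
        return "WARNING"
    return "OK"


# Values taken from the alert lines in the log above.
states = [classify_lag(v) for v in (2.4, 0, 2.6, 0.8, 2, 0.2)]
```

(Applied to the six values reported between 16:06:48 and 17:12:47, this alternates between CRITICAL and OK, which matches the PROBLEM/RECOVERY sequence in the channel.)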