[05:23:23] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[05:30:47] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[06:53:33] DBA: QPS rate of change alarming - https://phabricator.wikimedia.org/T281833 (Ladsgroup) Isn't it fixed now?
[07:18:02] DBA, Sustainability (Incident Followup): Introduce alarming to monitor mediawiki databases QPS rate of change - https://phabricator.wikimedia.org/T281833 (jcrespo)
[10:19:35] DBA, DiscussionTools, OWC2020, Editing-team (FY2020-21 Kanban Board), Patch-For-Review: DBA review: conversation subscriptions - https://phabricator.wikimedia.org/T263817 (LSobanski) @ppelberg The deployment steps are self-service, as outlined in https://wikitech.wikimedia.org/wiki/Creating_ne...
[10:28:22] DBA: Update DB read_only alert to represent correct state - https://phabricator.wikimedia.org/T277174 (LSobanski)
[10:28:26] DBA, observability, Epic: Improve database alerting (tracking) - https://phabricator.wikimedia.org/T172492 (LSobanski)
[10:28:57] DBA, Icinga, observability, Sustainability (Incident Followup): Make primary DB masters page on HOST DOWN alert - https://phabricator.wikimedia.org/T233684 (LSobanski)
[10:29:01] DBA, observability, Epic: Improve database alerting (tracking) - https://phabricator.wikimedia.org/T172492 (LSobanski)
[10:29:21] DBA, observability, Epic: Improve database alerting (tracking) - https://phabricator.wikimedia.org/T172492 (LSobanski)
[10:29:23] DBA, Sustainability (Incident Followup): Introduce alarming to monitor mediawiki databases QPS rate of change - https://phabricator.wikimedia.org/T281833 (LSobanski)
[10:29:54] DBA, observability, Epic: Improve database alerting (tracking) - https://phabricator.wikimedia.org/T172492 (LSobanski)
[10:31:35] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 3.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[10:32:01] DBA, DiscussionTools, Editing-team (FY2020-21 Kanban Board), Performance-Team (Radar): Post-deployment: evaluate impact on site performance - https://phabricator.wikimedia.org/T280606 (LSobanski)
[10:38:59] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[10:47:31] DBA, Icinga, observability, Sustainability (Incident Followup): Make primary DB masters page on HOST DOWN alert - https://phabricator.wikimedia.org/T233684 (jcrespo) There is some interaction between this and T252679 (although they are technically separate tickets). T252679 would solve this by no...
[10:48:47] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[10:53:45] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[10:55:43] DBA: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and replicas - https://phabricator.wikimedia.org/T282857 (LSobanski)
[10:58:47] DBA: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and replicas - https://phabricator.wikimedia.org/T282857 (LSobanski) p:Triage→Medium Blocked until T104459 is completed.
[11:00:52] DBA, Datasets-General-or-Unknown, Patch-For-Review, Sustainability (Incident Followup), WorkType-NewFunctionality: Detect object, schema and data drifts between mediawiki HEAD, production masters and replicas - https://phabricator.wikimedia.org/T104459 (LSobanski)
[11:01:34] DBA, Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (akosiaris) Drive by comments by yours truly: * Do we have estimations (or even better hard data...
[11:02:24] DBA: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and replicas - https://phabricator.wikimedia.org/T282857 (LSobanski)
[11:03:06] DBA, Datasets-General-or-Unknown, Patch-For-Review, Sustainability (Incident Followup), WorkType-NewFunctionality: Detect object, schema and data drifts between mediawiki HEAD, production masters and replicas - https://phabricator.wikimedia.org/T104459 (LSobanski) I updated the task to limit...
[11:33:46] DBA, Data-Services, Projects-Cleanup: Drop DB tables for now-deleted fixcopyrightwiki from production - https://phabricator.wikimedia.org/T246055 (LSobanski) p:Medium→Low
[11:34:25] DBA, Sustainability (Incident Followup): Introduce alarming to monitor mediawiki databases QPS rate of change - https://phabricator.wikimedia.org/T281833 (jcrespo) I've made a test dashboard mostly for learning purposes, but it shows it could have generated an alarm, maybe if we smooth out QPS over longer...
[11:34:31] I've made a think, sobanski: https://grafana.wikimedia.org/d/GpL5R8CGz/mysql-query-rate?viewPanel=9&orgId=1&from=1613215837147&to=1620991837147&var-site=eqiad&var-group=core&var-shard=es1&var-shard=es2&var-shard=es3&var-shard=es4&var-shard=es5&var-role=All 0:-)
[11:34:35] *thing
[11:58:47] DBA, Sustainability (Incident Followup): Introduce alarming to monitor mediawiki databases QPS rate of change - https://phabricator.wikimedia.org/T281833 (jcrespo) Adding @Krinkle as he was involved in the ES issue, and has helped us a lot in the past with MySQL graphs. This is very much WIP, but wanted...
[11:58:55] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[12:08:17] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[13:40:43] jynus: hi, can I ask a favor? Can you check binlog of a master and say what percentage of write queries is on "module_deps" table?
[13:41:13] in beta is around 40% and I could reproduce the bug in production
[14:04:54] let me see
[14:05:34] https://phabricator.wikimedia.org/T247028
[14:07:21] I think the problem is worse than that
[14:07:35] there is something underlying going really wrong
[14:08:03] these write queries should happen like once every hour...
[14:12:17] globally I don't see anything unusual: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&from=1620396674097&orgId=1&to=1621001474097&var-server=db1163&var-port=9104
[14:12:49] row writes per server are 200-500/s
[14:14:54] I think this has been going on for a really long time and no one noticed
[14:15:32] well, I noticed and reported, and was told it was "normal" until the feature was migrated away
[14:15:48] once it is not causing db issues, not something we really case
[14:16:27] *care
[14:16:39] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[14:20:34] and it really doesn't seem to be a problem, at least on enwiki
[14:21:00] of the latest 2037624 binlog events, only 21 were related to module_deps
[14:21:33] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[14:21:36] I can check other hosts
[14:22:15] but maybe it is true that 40% of writes on beta are those, but I guess there is normally not a lot of user edits there?
[14:22:32] what other sections should I check, s7?
[14:22:35] s3?
[14:29:30] less than 0.01% I am getting in all I am checking
[14:30:16] which is really way less than I expected due to T247028
[14:30:16] T247028: Database 'INSERT' query rate doubled (module_deps regression?) - https://phabricator.wikimedia.org/T247028
[14:33:49] this is the percentage in the last 12 hours
[14:34:25] maybe it happens often at deploy time? or beta is in a more "debug" mode? But I really don't see any problem on production
[15:08:43] oh okay
[15:08:45] Thanks
[15:09:35] it's reading a lot on master though, I can see that. Will fix that separately
[15:09:59] it's interesting it's the case in beta
[15:13:18] still it shouldn't write this much in beta
[15:13:33] that's good it's not writing this much in production
[16:13:09] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[16:22:53] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[16:25:09] DBA, Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) > Do we have estimations (or even better hard data) as to the number of missed events?...
[16:45:10] DBA, DiscussionTools, OWC2020, Editing-team (FY2020-21 Kanban Board), Patch-For-Review: DBA review: conversation subscriptions - https://phabricator.wikimedia.org/T263817 (ppelberg) Open→Resolved >>! In T263817#7087636, @LSobanski wrote: > @ppelberg The deployment steps are self-servi...
[16:45:24] DBA, DiscussionTools, OWC2020, Editing-team (FY2020-21 Kanban Board), Patch-For-Review: DBA review: conversation subscriptions - https://phabricator.wikimedia.org/T263817 (ppelberg)
[16:50:51] Data-Persistence-Backup, SRE, netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (jcrespo) This is not very urgent, but I am generating backups from eqiad to codfw at 173Mbps, which takes...
[17:14:59] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[17:19:57] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[18:48:13] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[18:52:11] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[19:22:27] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[19:28:55] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[20:42:50] DBA, Sustainability (Incident Followup): Introduce alerting to monitor mediawiki databases QPS rate of change - https://phabricator.wikimedia.org/T281833 (Krinkle)
[20:43:23] DBA, Performance-Team, Sustainability (Incident Followup): Introduce alerting to monitor mediawiki databases QPS rate of change - https://phabricator.wikimedia.org/T281833 (Krinkle)
[21:58:11] DBA, Internet-Archive, MediaWiki-extensions-Translate: Translate syntax version update and translation-aware transclusion lost - https://phabricator.wikimedia.org/T282905 (Tacsipacsi)