[01:01:14] PROBLEM - 5-minute average replication lag is over 2s on db1095 is CRITICAL: 383 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1095&var-port=13313&var-dc=eqiad+prometheus/ops
[03:28:02] RECOVERY - 5-minute average replication lag is over 2s on db1095 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1095&var-port=13313&var-dc=eqiad+prometheus/ops
[04:34:59] 10DBA, 10Cloud-Services, 10MW-1.35-notes (1.35.0-wmf.36; 2020-06-09), 10Platform Team Initiatives (MCR Schema Migration), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui) s7 eqiad progress [] labsdb...
[05:24:25] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1119.eqiad.wmnet'] ` The log can be found in `/var/log/wmf...
[05:57:47] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1119.eqiad.wmnet'] ` and were **ALL** successful.
[06:21:26] 10DBA, 10observability, 10Sustainability (Incident Followup): Monitor swap/memory usage on databases - https://phabricator.wikimedia.org/T172490 (10jcrespo) p:05Medium→03High dbstore1004 was low on memory - so low that puppet runs were failing (although mysql wasn't, due to its low OOM killer precedence)....
[06:22:56] I can confirm there is an m2 dump dated 2020-08-04--00-00-01
[06:23:06] I will create a snapshot just in case
[06:23:08] sweet
[06:25:10] 10DBA: Review revision table and make sure that the PK is always rev_id - https://phabricator.wikimedia.org/T259524 (10Marostegui)
[06:37:12] I have stopped replication on db1117; if you need to start it back you can do it, but I was hoping that would make it finish before 8
[06:37:31] I was going to start moving slaves, but I can wait for it
[06:37:34] no problem
[07:18:48] still running the snapshot, jynus?
[07:19:01] yeah
[07:20:31] ok
[07:23:53] not sure it will finish on time
[07:24:03] 10DBA: Review revision table and make sure that the PK is always rev_id - https://phabricator.wikimedia.org/T259524 (10Marostegui)
[07:24:06] you can run the move - it will ignore db1117
[07:24:15] and move it later
[07:24:32] ok
[07:24:59] it has only copied 300 GB out of 500
[07:41:44] no errors, right?
[07:42:14] nope
[07:43:10] if replication is stopped, the scripts basically cannot detect the replica
[07:59:32] 1m to launch
[07:59:53] yeaaah
[08:00:36] 🥘
[08:07:05] 10DBA, 10OTRS, 10Recommendation-API, 10Research, 10Patch-For-Review: Upgrade m2 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T257540 (10Marostegui) Failover done successfully
[08:10:24] http://mystery.knightlab.com/ thought this audience might appreciate it
[08:17:26] 🕵️‍♀️
[08:26:24] jynus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/450046 - this says "Upgrade check_mariadb.py to the latest WMFMariaDB version". do you know where that latest version was? it's not in wmfmariadbpy
[08:28:25] you are asking me about a 2-year-old CR :-/
[08:29:19] jynus: there's a delta between the version of WMFMariaDB embedded in check_mariadb.py in the puppet repo, and the version in wmfmariadbpy
[08:29:38] and i'm trying to figure out how to resolve this
[08:29:42] I trust you, not saying you are wrong :-)
[08:30:51] could it be I was confused into thinking https://gerrit.wikimedia.org/r/c/operations/software/wmfmariadbpy/+/449185 was merged?
[08:31:28] ahhh!
[08:31:31] yes, that makes sense
[08:31:50] honestly, as I said, check_mariadb.py was merged in a rush
[08:31:52] ok perfect. that gives me enough to work with
[08:32:01] because we found a widespread issue with read_only
[08:32:12] so it was merged despite being half-done
[08:32:25] * kormat nods
[08:32:29] 10DBA, 10OTRS, 10Recommendation-API, 10Research, 10Patch-For-Review: Upgrade m2 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T257540 (10Marostegui) 05Open→03Resolved Resolving this - db1132 will be moved to m3, which will be tracked at T253217
[08:32:32] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10Marostegui)
[08:32:39] so maybe (again, it was a long time ago) I was improving wmfmariadbpy at my own pace
[08:32:59] then I saw the issue and merged the development version instead of the last patch
[08:33:38] that patch made the library slightly cleaner, but I couldn't merge it because it changed the API slightly
[08:34:21] my gerrit workboard is full of nice ideas that were never merged
[08:35:52] 10DBA: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589 (10Marostegui)
[08:36:02] 10DBA: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589 (10Marostegui) p:05Triage→03Medium
[08:48:35] 10DBA, 10Patch-For-Review: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2134.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202008040848_m...
[09:27:26] 10DBA: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2134.codfw.wmnet'] ` and were **ALL** successful.
[09:30:32] 10DBA: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589 (10Marostegui)
[09:44:18] we've extended the retention of some librenms data that's stored in the db in https://gerrit.wikimedia.org/r/c/operations/puppet/+/618235 - nothing too shocking, JFYI
[09:46:22] that's ok, its db is tiny
[09:47:53] neat!
[10:05:31] marostegui: i see your plan to prevent any maintenance from happening is to fill up all the time before the dc switchover with meetings. sneaky.
[10:05:44] you are welcome!
[10:35:41] 10DBA: Review revision table and make sure that the PK is always rev_id - https://phabricator.wikimedia.org/T259524 (10Marostegui)
[10:37:45] 10DBA: Review revision table and make sure that the PK is always rev_id - https://phabricator.wikimedia.org/T259524 (10Marostegui) 05Open→03Resolved This is all done ` # /home/marostegui/section s7 | while read host port; do mysql.py -h$host:$port frwiktionary -e "show create table revision\G" | grep PRIMARY...
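The truncated shell one-liner above is the whole audit: walk every replica in the section, dump `SHOW CREATE TABLE revision`, and grep for the PRIMARY KEY line. Below is a minimal Python sketch of the same idea, under stated assumptions: `section_hosts()` is a hypothetical stand-in for the `section` helper used in the one-liner, and the pymysql connection details (user, credentials) are illustrative rather than how the real mysql.py wrapper connects.

```python
#!/usr/bin/env python3
"""Audit that the revision table PK is rev_id on every host in a section.

Sketch only: section_hosts() is a hypothetical placeholder for the `section`
helper used in the one-liner above; connection details are illustrative.
"""
import pymysql


def section_hosts(section):
    """Yield (host, port) pairs for a section; replace with a real inventory lookup."""
    yield from [("db1095.eqiad.wmnet", 13313)]  # illustrative only


def revision_pk(host, port, wiki="frwiktionary"):
    """Return the PRIMARY KEY line from SHOW CREATE TABLE revision on one host."""
    conn = pymysql.connect(host=host, port=port, user="root", database=wiki)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW CREATE TABLE revision")
            ddl = cur.fetchone()[1]
    finally:
        conn.close()
    # Mirrors the `grep PRIMARY` in the shell one-liner.
    return next((line.strip() for line in ddl.splitlines() if "PRIMARY KEY" in line),
                "NO PRIMARY KEY")


if __name__ == "__main__":
    for host, port in section_hosts("s7"):
        pk = revision_pk(host, port)
        status = "OK" if "(`rev_id`)" in pk else "DRIFT"
        print(f"{host}:{port} {status} {pk}")
```

Run against a section, any host whose PK line is not on rev_id (or missing entirely) is printed as DRIFT, which is the condition the task above was resolving as clean.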
[10:37:50] 10DBA, 10Cloud-Services, 10MW-1.35-notes (1.35.0-wmf.36; 2020-06-09), 10Platform Team Initiatives (MCR Schema Migration), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui)
[10:55:17] 10DBA, 10Phabricator: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589 (10Marostegui) @mmodell would you be available on **Thursday 13th at 05:00 AM UTC** for failing over the phabricator master? (I can do it earlier if it is easier for you) Ideally, we should set phabricator into...
[10:58:16] 10DBA, 10Phabricator: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589 (10mmodell) @marostegui sure I can do that.
[10:59:00] 10DBA, 10Phabricator: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589 (10Marostegui) Thank you - I will block that maintenance window on the deployments page and send you a google calendar invite.
[10:59:08] 10DBA, 10Phabricator: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589 (10mmodell) @Marostegui: in case you ever need to do it without me, it's documented here: https://wikitech.wikimedia.org/wiki/Phabricator#read-only_mode_/_restarting_mariadb
[11:00:10] 10DBA, 10Phabricator: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589 (10Marostegui) Oh, great. Thank you. I can do it myself for learning purposes, but if you are around to support it, just in case, it would be appreciated!
[11:46:56] 10DBA: Code and production differs on s3 on pagelinks table - https://phabricator.wikimedia.org/T259238 (10Marostegui)
[11:54:47] 10DBA: Code and production differs on s3 on pagelinks table - https://phabricator.wikimedia.org/T259238 (10Marostegui) eqiad progress [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1004 [] db1124 [] db1123 [] db1112 [x] db1095 [] db1078 [] db1075
[12:14:33] 10DBA: Code and production differs on s3 on pagelinks table - https://phabricator.wikimedia.org/T259238 (10Marostegui)
[12:14:41] 10DBA: Code and production differs on s3 on pagelinks table - https://phabricator.wikimedia.org/T259238 (10Marostegui) 05Open→03Resolved All done
[12:14:56] Amir1: whenever you have time, can you run the drifts script? I want to make sure we have nothing else left apart from MCR
[12:16:51] Sure!
[12:35:29] 10DBA, 10Operations: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) p:05High→03Medium
[14:16:00] issues on es1020?
[14:16:16] what kind?
[14:16:34] 5-minute average replication lag is over 2s CRITICAL 2020-08-04 14:14:15 0d 0h 3m 40s 2/5 4.6 ge 2
[14:17:09] that's the new alert, no?
[14:17:20] there is a weird spike there
[14:17:24] I don't see anything wrong with it now
[14:17:38] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1020&var-port=9104&from=1596549245122&to=1596550615588
[14:17:59] weird
[14:18:38] are dumps running?
[14:18:42] not now
[14:19:08] other replicas seem ok (except for an increase of traffic after the issue, due to it being depooled)
[14:19:20] that suggests not a traffic issue, but a server issue
[14:19:20] yeah, the rest including the master seem ok
[14:19:37] they are running
[14:19:41] | 185009772 | wikiadmin | 10.64.16.16:58758 | wikidatawiki | Query | 0 | Statistics | SELECT /* ExternalStoreDB::fetchBlob dumpsgen@snapsh... */ blob_text FROM `blobs_cluster26` WHE | 0.000 |
[14:19:41] with server, could be hw, network, etc.
[14:19:48] maybe that caused a small glitch
[14:19:49] oh, they are?
[14:20:02] I thought they had finished a long time ago
[14:20:13] oh, I see
[14:20:18] you meant the xmldumps
[14:20:22] yeah
[14:20:23] I was thinking about the backups
[14:20:29] backups were finished
[14:20:52] but those would distribute evenly among replicas
[14:21:03] wouldn't fit, at least not by itself
[14:21:15] they are hitting es1020 now, so maybe they caused some overload
[14:21:17] don't know
[14:21:33] I am going to try to debug to rule out host/net issues
[14:21:37] I can take care of that
[14:21:46] thanks
[14:22:21] the point is es very rarely creates lag - unlike metadata
[14:22:25] they are just light inserts
[14:22:31] I know
[14:24:45] (I am talking to myself) lag increased suddenly - which suggests a sudden complete stall or network loss
[14:25:37] 14:12 -> 14:13, then recovered very quickly
[14:26:12] metrics took 4 seconds to scrape
[14:26:38] InnoDB Data pending reads: 6
[14:27:17] there is a spike of writes after that, but I think that is just replication catching up
[14:28:02] there is however an increase in kill commands before the issue (stall leading to the query killer?)
[14:28:51] although the kills don't seem that rare before now
[14:31:24] nothing immediately relevant on sw logs
[14:37:29] and nothing on hw
[14:38:19] I am going to check the binlog as a last effort
[14:48:00] see nothing
[14:51:22] SELECT /* ExternalStoreDB::fetchBlob */ blob_text FROM `blobs_cluster26` WHERE blob_id = ? LIMIT 1
[14:51:46] the only interesting thing I see is catching 4 queries on show processlist taking 3 seconds to run at 14:11:45
[14:51:53] so there was a stall
[14:52:30] problem is, while we have the metrics, they don't have the granularity needed to catch it, as it took less than a minute
[15:14:13] kormat: I have not tested it, but the patch seems ok; I'm just not sure I understood what you want to do about packaging? Your latest patch didn't answer my question - unless the answer is "IDK yet" 0:-)
[15:14:55] I am guessing you will propose something in a separate patch, I think?
[15:18:02] there needs to be a new package that's useful for non-cumin hosts. check_health.py will go into that new package
[15:18:09] i'll create that in a new CR
[15:18:33] ok, I understand now
[15:18:42] don't worry, take your time
[15:18:57] I just didn't get the last comment
[15:19:27] grand
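The es1020 investigation above runs into the usual granularity problem: the stall lasted well under a minute, so the per-minute metrics scrape barely registers it, and the clearest evidence was a handful of 3-second queries caught live in show processlist. Below is a hedged sketch of that kind of high-frequency processlist poll; it is not the actual check_mariadb.py/check_health.py logic, and the host, credentials and 3-second threshold are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Poll SHOW FULL PROCESSLIST once a second and report queries stuck longer
than a threshold - a crude detector for stalls too short to show up clearly
in a 1-minute metrics scrape. Host, credentials and threshold are illustrative."""
import time

import pymysql

HOST = "es1020.eqiad.wmnet"  # illustrative target
STUCK_AFTER = 3              # seconds, roughly the 3s queries seen at 14:11:45


def poll_once(conn):
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW FULL PROCESSLIST")
        for row in cur.fetchall():
            # Skip idle connections and replication/system threads.
            if row["Command"] in ("Sleep", "Binlog Dump", "Daemon"):
                continue
            if row["Time"] and row["Time"] >= STUCK_AFTER:
                info = (row["Info"] or "")[:80]
                print(f"{time.strftime('%H:%M:%S')} stuck {row['Time']}s "
                      f"id={row['Id']} state={row['State']!r} {info}")


def main():
    # Credentials/socket handling omitted; assumes access similar to a local .my.cnf.
    conn = pymysql.connect(host=HOST, user="root")
    try:
        while True:
            poll_once(conn)
            time.sleep(1)
    finally:
        conn.close()


if __name__ == "__main__":
    main()
```

Left running during a suspect window, something like this gives per-second evidence of a stall (which statements were stuck, and in which state) that the per-minute graphs discussed above can only show as a single spike.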