[02:44:17] PROBLEM - 5-minute average replication lag is over 2s on db1089 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1089&var-port=9104&var-dc=eqiad+prometheus/ops
[02:46:13] RECOVERY - 5-minute average replication lag is over 2s on db1089 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1089&var-port=9104&var-dc=eqiad+prometheus/ops
[04:55:59] DBA, Phabricator: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589 (Marostegui) @mmodell Thursday is a non deployment day, so let's move this to Tuesday 18th if that's ok? So that would be Tuesday 18th at 05:00 AM UTC?
[07:04:01] PROBLEM - 5-minute average replication lag is over 2s on db2098 is CRITICAL: 4984 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2098&var-port=13313&var-dc=codfw+prometheus/ops
[07:11:45] RECOVERY - 5-minute average replication lag is over 2s on db2098 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2098&var-port=13313&var-dc=codfw+prometheus/ops
[07:20:54] DBA, MediaWiki-extensions-OAuthRateLimiter, Patch-For-Review, Platform Team Initiatives (API Gateway), and 3 others: Review request for a new database table for OAuthRateLimiter - https://phabricator.wikimedia.org/T258711 (Naike) @Pchelolo - Do you know what this task is blocked on your workboard...
[08:08:24] DBA, Patch-For-Review: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['dbproxy1019.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202008100808_maro...
[08:45:02] DBA, Patch-For-Review: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1019.eqiad.wmnet'] ` and were **ALL** successful.
[09:27:13] DBA, Patch-For-Review: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (Marostegui)
[09:28:10] DBA, Patch-For-Review: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (Marostegui) Open→Resolved All dbproxies upgraded to Buster and HAproxy 1.8.19-1
[09:28:12] DBA, Epic, Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (Marostegui)
[09:47:21] Blocked-on-schema-change, User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (Kormat) p:Triage→Medium a:Kormat
[10:54:54] DBA: Compare a few tables per section before the switchover - https://phabricator.wikimedia.org/T260042 (Marostegui)
[11:25:33] marostegui: hey, let me know when you're around
[11:25:47] Amir1: o/
[11:26:55] marostegui: niiice, so first, I ran it again, it broke a lot (it seems that sometimes even simple queries time out) but nothing showed up besides mcr and that one thing in master of s8
[11:27:17] really? that's for the important drifts only, right?
[11:27:24] yup
[11:28:06] in one of the runs, some drifts showed up in s3 but it got overridden by the next run.
I have the idea of running it for s3
[11:28:10] all wikis
[11:28:25] but that's going to take some time to finish :D
[11:28:58] hahaha
[11:28:59] yeah
[11:29:03] so those are really good news
[11:29:36] indeed
[11:29:52] I don't know why the system times out quite often though :(
[11:30:15] I retry it but in general it makes it really slow
[11:31:13] like this command
[11:31:18] Running: timeout 6 sql testcommonswiki -- -h db1121 -e "SHOW INDEX FROM change_tag;"
[11:32:08] I note only show index gets the timeout (and it's all random)
[11:33:07] marostegui: I also finished the code that checks for db drifts with abstract schema. More than 25% of the tables are already abstracted \o/
[11:33:37] oh nice!
[11:33:42] how long did it take?
[11:33:44] for that 25%?
[11:35:35] quite fast, it's more accurate too because I have everything in json and don't need to parse sql
[11:37:27] I think it would get hit by these weird timeouts and will be slow once we move more tables to it because that part of the code is shared between the two :D
[11:37:58] marostegui: one last thing, it would be great if you answer this question: https://phabricator.wikimedia.org/T42626#3801703
[11:38:19] 2017 XDDD
[11:38:22] binary means more schema changes and work for you but less space used in the db
[11:38:23] Sure, I will take a look
[11:38:26] yup :D
[11:38:39] I recall answering something about it in maybe other task
[11:38:42] I will double check
[11:38:49] Thanks <3
[11:39:07] I thought we were using varbinary though
[11:39:09] for a long time
[11:45:48] We use varbinary in 17 places and binary in 13 places
[11:46:30] varbinary makes sense for title and other types of custom strings but this is guaranteed to be 14 all the time
[12:23:33] marostegui: started it on all of s3 so we can find all of the drifts, it's really fast so far
[12:24:24] thanks for all this work
[12:37:10] I wish I could help more :( You have so much to do
[13:58:52] DBA, MediaWiki-extensions-OAuthRateLimiter,
Patch-For-Review, Platform Team Initiatives (API Gateway), and 3 others: Review request for a new database table for OAuthRateLimiter - https://phabricator.wikimedia.org/T258711 (Pchelolo) >>! In T258711#6371839, @Naike wrote: > @Pchelolo - Do you know...
[14:08:50] dbstore1003 WARN Memory 92% used. Largest process: mysqld (2855) = 43.1%
[14:38:05] jynus: ?? you left a comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/619291 saying i should lower-case the alert description, i did so, and.. you uploaded a patch to my CR re-capitalizing them?
[14:38:27] yes, sorry, I overrode your patch by accident
[14:38:40] revert or I will
[14:40:31] one second, I am only behind by 12900 commits
[14:40:50] eventual casing™
[14:41:27] kormat: should be back to your patch 2
[14:44:04] fixed now, including committer
[14:44:08] kormat: there is something wrong with the patch
[14:44:26] another patch I sent was voted +1
[14:44:37] jynus: ci is broken
[14:45:00] see discussion in #-sre
[14:45:54] ah, sorry, I didn't see it
[14:47:59] [16:40:31] one second, I am only behind by 12900 commits hahahahah
[14:48:10] <3
[14:48:54] not kidding "Your branch is behind 'origin/production' by 12905 commits, and can be fast-forwarded."
[14:51:32] XDDDDDDDDDDDDD
[14:51:34] DBA, MediaWiki-extensions-OAuthRateLimiter, Patch-For-Review, Platform Team Initiatives (API Gateway), and 3 others: Review request for a new database table for OAuthRateLimiter - https://phabricator.wikimedia.org/T258711 (Marostegui) >>! In T258711#6372976, @Pchelolo wrote: >>>! In T258711#63718...
[15:00:16] DBA, MediaWiki-extensions-OAuthRateLimiter, Patch-For-Review, Platform Team Initiatives (API Gateway), and 3 others: Review request for a new database table for OAuthRateLimiter - https://phabricator.wikimedia.org/T258711 (Reedy) And it's also fine to create the table well in advance of the exten...
[15:15:43] marostegui: so far only checked 40 wikis in s3 and already found several drifts that are limited to a small number of wikis only
[15:15:47] https://www.irccloud.com/pastebin/gxhThlB9/
[15:16:03] I will make a ticket once it's all done
[15:23:13] wow that is a lot better than expected, thanks a lot Amir1
[15:23:21] Yeah, if you can place a ticket, that'll be appreciated
[15:24:25] Sure!
[15:25:15] marostegui: what terrible thing are you doing to s8 in codfw?
[15:25:48] kormat: MCR :)
[15:26:01] check !log from earlier today in operations
[15:27:13] ah hah :)
[15:28:20] i think i'll wait until that's done before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/619291, as otherwise it's far too painful to recreate the downtimes
[15:29:26] marostegui: fancy. Can you tell what's the revision table size before and after. I'm dying to know :D
[15:29:42] yeah, I will paste it on the ticket
[15:30:09] it is 245G now
[15:30:18] so we'll see how much it shrinks to
[15:31:36] 245G is wrong, we have a new unit, it's 1/4th of a wb_terms
[15:32:45] (I wanted to say rest in peace for wb_terms but I actually hope it burns in hell)
[15:33:35] hahahaha
[16:17:28] DBA, DC-Ops, Operations, ops-eqiad: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (wiki_willy) a:Cmjohnson
[16:19:04] DBA, articlequality-modeling, Scoring-platform-team (Current), artificial-intelligence: [Discuss] Hosting the monthly article quality dataset on labsDB - https://phabricator.wikimedia.org/T146718 (mforns)
[16:56:22] DBA, observability, Patch-For-Review, Sustainability (Incident Followup): Monitor swap/memory usage on databases - https://phabricator.wikimedia.org/T172490 (jcrespo) Open→Resolved a:jcrespo We settled for now on memory usage. We send a warning when we get to 90% usage and a critical...
[16:56:24] DBA, observability, Epic: Improve database alerting (tracking) - https://phabricator.wikimedia.org/T172492 (jcrespo)
[18:11:31] DBA, Patch-For-Review: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (Cmjohnson)
[18:28:58] DBA, Patch-For-Review: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (Cmjohnson)
[21:17:27] DBA, Phabricator: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589 (mmodell) @marostegui: That works for me.
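
The retry-on-timeout behaviour mentioned between [11:29:52] and [11:32:08] (a command run under a 6 second limit and retried when it times out) could look roughly like the sketch below. It is only an illustration under assumptions: the log does not show how the drift checker retries, so the function name `run_with_retries`, the attempt count and the backoff are hypothetical, and it uses Python's own `subprocess` timeout instead of the `timeout 6` wrapper seen in the log.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, timeout_s=6.0, attempts=3, backoff_s=0.5):
    """Run cmd, retrying only when it exceeds timeout_s.

    Hypothetical helper: name, attempts and backoff_s are assumptions,
    not taken from the drift-checking tool discussed in the log.
    Returns the CompletedProcess of the first attempt that finishes in time.
    Raises subprocess.TimeoutExpired if every attempt times out.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # when the command runs longer than timeout_s.
            return subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired as exc:
            last_error = exc
            if attempt < attempts:
                # Wait a little longer before each further attempt.
                time.sleep(backoff_s * attempt)
    raise last_error


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; in the log the command was
    # sql testcommonswiki -- -h db1121 -e "SHOW INDEX FROM change_tag;"
    result = run_with_retries([sys.executable, "-c", "print('ok')"])
    print(result.stdout.strip())
```

Retrying only on a timeout, and not on a non-zero exit code, keeps real query errors visible instead of hiding them behind repeated attempts. As noted at [11:30:15], each retry still adds up to the full timeout to the run, which is why frequent timeouts make the whole check slow.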