[00:45:03] DBA, MediaWiki-API, Patch-For-Review, Wikimedia-production-error: Certain ApiQueryRecentChanges::run api query is too slow, slowing down dewiki - https://phabricator.wikimedia.org/T149077 (Krinkle)
[00:45:19] DBA, MediaWiki-API, Patch-For-Review, Wikimedia-production-error: Certain ApiQueryRecentChanges::run api query is too slow, slowing down dewiki - https://phabricator.wikimedia.org/T149077 (Krinkle) Might be a duplicate of {T101502}.
[05:46:55] Blocked-on-schema-change, DBA, Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 (Marostegui)
[05:49:05] Blocked-on-schema-change, DBA, Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 (Marostegui) codfw s3 mediawikiwiki progress: [] dbstore2002 [] db2094 [] db2074 [] db2057 [] db2050 [] db2043 [] db2036
[05:55:34] Blocked-on-schema-change, DBA, Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 (Marostegui)
[05:56:33] Blocked-on-schema-change, DBA, Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 (Marostegui) Open→Resolved This is all done - one less drift between code and production!
[05:56:41] DBA, Datasets-General-or-Unknown, Patch-For-Review, Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (Marostegui)
[06:29:28] DBA, Operations, Wikidata, Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Marostegui) These are the last 24h: https://logstash.wikimedia.org/goto/cd0af28f39b7ad679b9d1e1130636fdf Errors are almost...
[06:36:16] Blocked-on-schema-change, MediaWiki-Change-tagging, MediaWiki-Database, Wikidata-Campsite, User-Ladsgroup: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (Marostegui) I have altered db2070 (enwiki) and I will leave it like that for a few days before...
[06:38:51] Blocked-on-schema-change, MediaWiki-Change-tagging, MediaWiki-Database, Wikidata-Campsite, User-Ladsgroup: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (Marostegui) s1 progress: [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore2002 [] dbstore1...
[06:39:54] Blocked-on-schema-change, MediaWiki-Change-tagging, MediaWiki-Database, Wikidata-Campsite, User-Ladsgroup: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (Marostegui)
[06:59:06] DBA, JADE, Operations, Scoring-platform-team (Current), User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (jcrespo) That looks like really bad performance. Not only does it scan the revision table from top to bottom (>200GB of data)...
[07:28:16] what do you think? https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/461528
[07:30:12] DBA, JADE, Operations, Scoring-platform-team (Current), User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (Bawolff) >>! In T202596#4600586, @jcrespo wrote: > That looks like really bad performance. Not only does it scan the revision...
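For context on T203709: the change boils down to adding a ct_tag_id-leading index to the change_tag table, applied host by host (db2070 first, then the rest of s1). A minimal sketch of such an online alter on MariaDB; the index name and covering column order here are illustrative assumptions, not the exact production definition:

```
-- Hedged sketch of a T203709-style alter; the index name and covering
-- columns are illustrative, not the exact production patch.
-- ALGORITHM=INPLACE / LOCK=NONE keeps the table writable during the build.
ALTER TABLE change_tag
    ADD INDEX change_tag_tag_id (ct_tag_id, ct_rc_id, ct_rev_id, ct_log_id),
    ALGORITHM=INPLACE, LOCK=NONE;
```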
[07:38:32] let me see
[07:42:41] DBA, JADE, Operations, Scoring-platform-team (Current), User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (jcrespo) @Bawolff That new query you propose makes no sense to me - it just selects the first 100 revisions every singl...
[07:49:20] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui)
[07:52:08] I'm starting to compress the s2 tables on dbstore2002
[07:52:13] I'll be around
[07:52:24] banyek: cool, remember to !log it
[07:52:27] when I leave, I'll stop the process
[07:52:37] banyek: there is no need to stop it, I would say
[07:53:33] kk
[07:54:28] I said for me to be around when he runs it
[07:54:37] to monitor the space left
[07:54:43] *him
[07:54:52] Ah, true, there are space issues
[07:54:55] I forgot that
[07:55:12] I forgot it was not just a simple alter table
[07:55:13] if we run out of space, a well-behaved innodb should stop writing
[07:55:28] a bad one will corrupt all our backup instances
[07:55:32] well, 50% of them
[07:55:40] yeah, I forgot there are space issues on that host
[07:56:44] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui) I have deployed the schema change on db2066 (s5) - and I will leave it like that for some hours to make sure nothing breaks befor...
[07:57:41] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui) s5 progress: [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore2001 [] dbstore1002 [x] db2094 [] db2089 [] db2084 [] db2075 [x...
[07:57:54] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui)
[08:01:53] I set the downtime of the MariaDB s2 replication checks (IO/LAG/SQL) until Monday
[08:02:02] cool
[08:02:13] great
[08:02:21] so I asked to monitor it; if space goes down and we get more comfortable
[08:02:32] it may need less attention
[08:02:46] but I am not 100% sure what the trend will be
[08:03:09] maybe we should introduce a sleep 10 between executions for an easier kill?
[08:04:01] BTW as a reminder, I won't be around tomorrow
[08:04:23] enjoy your day off! :)
[08:46:57] Tomorrow I'll have a shorter day than usual: 11am - 5pm office hours.
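The compression being run on dbstore2002 is an InnoDB rebuild into compressed row format, one table at a time. A minimal sketch of a single step, assuming innodb_file_per_table and an illustrative table name; the real run loops over all s2 tables, which is where the `sleep 10` between executions suggested above would sit:

```
-- Hedged sketch of one compression step (table name is illustrative).
-- The rebuild writes a full new copy of the table before swapping it in,
-- which is why free disk space on the host has to be watched.
ALTER TABLE itwiki.revision
    ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
```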
[08:51:31] DBA, Patch-For-Review: dbstore2002 s2 crashed - https://phabricator.wikimedia.org/T204593 (Banyek) The tables are being compressed
[08:56:40] DBA, Patch-For-Review: dbstore2002 s2 crashed - https://phabricator.wikimedia.org/T204593 (jcrespo) Looking good so far: https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=18&fullscreen&orgId=1&var-server=dbstore2002&var-datasource=codfw%20prometheus%2Fops&from=now-3h&to=now-1m
[08:57:18] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (jcrespo) a: Marostegui
[08:58:02] DBA, Data-Services, Patch-For-Review: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983 (jcrespo) a: Banyek
[08:58:40] cool ^
[08:58:59] I was reviewing the "In progress" column
[08:59:09] in general everything in progress should be owned
[08:59:30] you can move it to next or backlog if you will not work on that
[09:01:30] I am checking some random tables on dbstore2002 and I think some of them might not be compressed, e.g. commonswiki.templatelinks is 301G there; I am sure that is not compressed
[09:01:43] banyek: ^ might be worth checking tables on s4 as we can probably gain some space there too
[09:02:19] don't confuse him too much, we'll file a task and check that another time :-)
[09:02:36] ;)
[09:02:47] e.g. "check other compressible tables on dbstores"
[09:03:13] Will create a task for that, yep
[09:04:14] banyek: dbstores are some of our slowest dbs, but the compression will be needed on the new hosts too
[09:04:25] so everything we do in advance will help in the future
[09:05:47] I did a check of all tasks in progress so it could represent reality better
[09:06:08] thanks
[09:31:52] DBA, Patch-For-Review: Reclone db1114 (s1 api) from another API host - https://phabricator.wikimedia.org/T203565 (Banyek) A quick recap: a) The original issue (the different API hosts have different query plans) is solved. b) A new issue emerged: the API hosts have really bad performance with the query `A...
[09:38:17] DBA, Patch-For-Review: Reclone db1114 (s1 api) from another API host - https://phabricator.wikimedia.org/T203565 (Marostegui) Thanks for the summary! I agree that the best way is probably to go for option `c`. a) Would show lots of uncertainties, it might fix this query, but what could happen wi...
[09:40:33] DBA, Operations, ops-codfw: Several issues with mgmt interfaces on es200X hosts - https://phabricator.wikimedia.org/T204928 (jcrespo)
[09:40:46] DBA, Operations, ops-codfw: Several issues with mgmt interfaces on es200X hosts - https://phabricator.wikimedia.org/T204928 (jcrespo) p: Triage→Lowest
[09:46:02] DBA: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 (Marostegui)
[09:46:30] DBA: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 (Banyek) a: Banyek
[09:46:54] DBA: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 (Marostegui) p: Triage→Normal
[09:49:46] DBA, Operations, ops-codfw: Several issues with mgmt interfaces on es200X hosts - https://phabricator.wikimedia.org/T204928 (jcrespo)
[11:02:26] it is always the same - it is clear what we want to do, we just don't have the time :-)
[11:02:59] (to do it all)
[11:07:23] I'm stopping the compression on dbstore2002 b/c I leave for about an hour soon.
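A sketch of how the follow-up check in T204930 could find remaining uncompressed tables (like commonswiki.templatelinks above): ask information_schema for large InnoDB tables whose row format is not Compressed. The ordering and LIMIT are arbitrary choices for illustration:

```
-- Hedged sketch: list the biggest uncompressed InnoDB tables on a dbstore.
SELECT table_schema, table_name, row_format,
       ROUND((data_length + index_length) / POW(1024, 3), 1) AS size_gib
FROM information_schema.tables
WHERE engine = 'InnoDB'
  AND row_format <> 'Compressed'
ORDER BY data_length + index_length DESC
LIMIT 20;
```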
I'll restart it when I'm back
[11:08:00] (so far we got ~54 GB of free disk space back)
[11:08:05] *dbstore2002
[11:08:37] Cool, I am sure we will get even more once https://phabricator.wikimedia.org/T204930 is also done
[11:08:40] leave it on
[11:08:47] I can keep an eye on it
[11:08:56] banyek: ^
[11:09:16] and it seems to be going better than I expected
[11:09:22] I already claimed it ;)
[11:09:26] (no sudden drops)
[11:09:32] yeah, you keep it claimed
[11:09:44] but I think it is now less dangerous to leave it unattended
[11:09:56] than when it was at 7%
[11:11:06] it shows a steady climb https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=18&fullscreen&orgId=1&var-server=dbstore2002&var-datasource=codfw%20prometheus%2Fops&from=now-6h&to=now-1m
[11:11:17] and not a saw tooth as I expected
[11:11:29] but up to you
[11:17:58] DBA, Operations, ops-codfw: Several issues with mgmt interfaces on es200X hosts - https://phabricator.wikimedia.org/T204928 (jcrespo)
[11:20:40] DBA, Operations, ops-codfw: Issues with mgmt interface on es2001 host - https://phabricator.wikimedia.org/T204928 (jcrespo)
[12:12:00] I resumed it
[12:14:50] DBA, Patch-For-Review: Reclone db1114 (s1 api) from another API host - https://phabricator.wikimedia.org/T203565 (Banyek) I created a bug in the MediaWiki-Database board regarding this: T204926
[12:34:38] ^ that will need way more context, and then show actual examples from the logs, then add it to #wikimedia-production-errors
[13:33:44] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui) Confirmed this is backwards compatible. I fully altered s5 on codfw and nothing broke on eqiad (which is the situation we will ha...
[13:50:43] DBA, Operations, Epic, Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (Banyek)
[13:50:46] DBA, Patch-For-Review: Reclone db1114 (s1 api) from another API host - https://phabricator.wikimedia.org/T203565 (Banyek) Resolved→Open I re-open this ticket until the query plans are fixed
[13:50:58] DBA, Patch-For-Review, User-Banyek: Reclone db1114 (s1 api) from another API host - https://phabricator.wikimedia.org/T203565 (Banyek)
[13:51:58] DBA, User-Banyek: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 (Banyek)
[13:52:21] DBA, Patch-For-Review, User-Banyek: Reclone db1114 (s1 api) from another API host - https://phabricator.wikimedia.org/T203565 (Marostegui) Do all the API hosts have the same query plan then? Or do we have to tweak https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/461360/ ?
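On the backwards-compatibility claim in T204006: the alter is additive, so codfw replicas can run the new schema while the eqiad master still runs the old one, because old code simply never touches the new column. A minimal sketch of the kind of statement involved, assuming the ipb_sitewide flag that Partial Blocks introduced; the actual patch may add further columns or tables and use different defaults:

```
-- Hedged sketch of an additive Partial Blocks alter; the column
-- definition is an assumption, not the exact production patch.
ALTER TABLE ipblocks
    ADD COLUMN ipb_sitewide TINYINT(1) NOT NULL DEFAULT 1;
```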
[13:53:09] DBA, Patch-For-Review, User-Banyek: Reclone db1114 (s1 api) from another API host - https://phabricator.wikimedia.org/T203565 (Banyek) all of them have the same (wrong) query plan
[13:54:00] DBA, User-Banyek: db1118 mysql process crashed (mysql 8.0 test host) - https://phabricator.wikimedia.org/T204594 (Banyek)
[13:54:23] DBA, Data-Services, Patch-For-Review, User-Banyek: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983 (Banyek)
[13:59:19] DBA, Operations, Wikidata, Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Marostegui) 300 errors in the last 24h, I think we are good?
[14:25:10] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui)
[14:26:44] DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (Marostegui) As agreed during our meeting today, we will combine these failovers with the next hardware purchase for eqiad (I will link the task here once I create it)
[15:00:20] I'm leaving for today (I'll check back later) but I'll keep the compression running. I am pretty sure there won't be disk space issues with s2
[15:31:29] DBA: Create Icinga alerts on backup generation failure - https://phabricator.wikimedia.org/T203969 (jcrespo) I have a first quick and dirty script to check for backup freshness:
```
root@db1115:~$ sudo -u nagios python3 check_mariadb_backups.py -s 's9' -d 'eqiad' -f 100
usage: check_mariadb_backups.py [-h]...
```
[16:13:38] DBA, MediaWiki-API, Patch-For-Review, Wikimedia-production-error: Certain ApiQueryRecentChanges::run api query is too slow, slowing down dewiki - https://phabricator.wikimedia.org/T149077 (Anomie) Perhaps. If so, this task seems the more useful of the two. The discussion on T101502 seems focused...
[21:06:24] DBA, Patch-For-Review: dbstore2002 s2 crashed - https://phabricator.wikimedia.org/T204593 (Banyek) A quick note before I forget: before the compression began, we had 470G of free space on that host.
[23:34:38] DBA, Operations, Wikidata, Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Smalyshev) p: High→Normal
[23:53:37] DBA, JADE, Operations, Patch-For-Review, and 2 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (awight) Thanks for all the attention given to this, and apologies for thinking that the namespace condition would behave the same in...
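The "same (wrong) query plan" finding on T203565 comes from running EXPLAIN for the offending statement on each API replica and comparing which index the optimizer picks. A sketch of that check with a deliberately simplified stand-in query; the real ApiQueryRecentChanges SQL carries many more joins and conditions:

```
-- Hedged sketch: run on every API host and diff the chosen key/rows;
-- the query below is illustrative, not the actual production statement.
EXPLAIN
SELECT rc_id, rc_timestamp, rc_title
FROM recentchanges
WHERE rc_namespace = 0
  AND rc_timestamp >= '20180925000000'
ORDER BY rc_timestamp DESC
LIMIT 100;
```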