[00:58:49] 10DBA, 10MediaWiki-API, 10MediaWiki-Database, 05MW-1.29-release-notes, and 3 others: ApiQueryExtLinksUsage::run query has crazy limit - https://phabricator.wikimedia.org/T59176#3136015 (10Krinkle)
[06:16:15] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: db1057 does not react to powercycle/powerdown/powerup commands - https://phabricator.wikimedia.org/T160435#3136180 (10Marostegui) 05Open>03Resolved a:03Marostegui We thankfully saved the data before reimaging/rebooting it, it is more about the ser...
[07:04:01] 10DBA, 13Patch-For-Review: Remove partitions from enwiktionary.templatelinks in s2 - https://phabricator.wikimedia.org/T154097#3136212 (10Marostegui) 05Open>03Resolved a:03Marostegui db2017 (codfw) is now done: ``` root@neodymium:~# mysql --skip-ssl -hdb2017.codfw.wmnet enwiktionary -e "show create table...
[07:11:02] 10DBA, 10MediaWiki-Database, 13Patch-For-Review, 07PostgreSQL, 07Schema-change: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#3136229 (10Marostegui) db1089 from s1 is done and all the UNIQUE keys have been converted to PK (apart fro...
[07:32:40] 10DBA: Defragment s4: db1091, db1084, db1081, d1059 and probably the rest - https://phabricator.wikimedia.org/T161088#3136254 (10Marostegui)
[08:16:25] 10DBA, 10MediaWiki-Database, 13Patch-For-Review, 07PostgreSQL, 07Schema-change: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#3136340 (10Marostegui) s5 sum up: dewiki Tables IN USE without PK but with UNIQUE: ``` categorylinks UN...
[08:17:26] 10DBA: Defragment s4: db1091, db1084, db1081, d1059 and probably the rest - https://phabricator.wikimedia.org/T161088#3136342 (10Marostegui) This task is not strictly blocking T17441, but it would be nice to get it done at some point
[09:01:12] marostegui, I am sorry
[09:01:42] I am not sure pt-table-checksum is useful on ES with ROW binary logging
[09:01:42] for what?
[09:02:12] oh!
[09:02:17] and not even to warm up the tables on the slaves
[09:02:27] yeah :(
[09:02:29] did you check the binary logging
[09:02:38] No, I didn't
[09:02:39] to see if the values are directly written
[09:02:52] or the actual queries are run?
[09:03:12] I realized last night while dreaming
[09:03:27] and that is why I needed a script in the first place
[09:03:47] I only checked the wmf_checksum table, indeed
[09:04:07] but you are right, it might be 100% useless
[09:04:07] what time did you run it?
[09:04:16] I can check quickly
[09:04:57] it is currently running
[09:04:57] es1011
[09:04:57] I didn't even think about it
[09:06:11] what is the name of the table?
[09:06:28] testwiki.__wmf_checksum
[09:06:45] testwiki.__wmf_checksums
[09:13:05] yeah, confirmed not useful
[09:14:03] it is not doing anything then?
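What "checking the binary logging" amounts to here, as a minimal sketch: run the statements below on the master (es1011 in this case) to see whether the checksum statements were logged as SQL text or as literal row values. The binlog file name is a placeholder.

```
-- Which replication format is in effect:
SHOW VARIABLES LIKE 'binlog_format';  -- STATEMENT, ROW, or MIXED

-- Peek at how recent events were logged; under ROW the checksum results
-- appear as row-image events rather than the original REPLACE...SELECT.
-- (mysqlbinlog --base64-output=decode-rows -vv decodes them fully.)
SHOW BINLOG EVENTS IN 'es1011-bin.000123' LIMIT 20;
```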
[09:14:18] https://phabricator.wikimedia.org/P5147
[09:14:34] it just copies the master values
[09:15:27] and only heats the checksum table
[09:15:31] right
[09:15:34] except on the master
[09:15:47] :(
[09:15:52] I didn't even think about its binlog
[09:16:07] I changed it to ROW
[09:16:16] because it works better, especially here
[09:16:22] I just assumed it was like production for some reason
[09:16:35] where it is all updates by PK
[09:16:43] and short transactions
[09:16:52] yeah, it makes sense there
[09:17:42] 10DBA: Run pt-table-checksum on es2 - https://phabricator.wikimedia.org/T161510#3136446 (10Marostegui) 05Open>03Invalid Jaime had a revelation and realised that it is running ROW based replication and the check isn't doing anything but copying the master values, so there will never be any differences.
[09:18:42] more like "Jaime screwed it up"
[09:19:01] haha no way
[09:19:06] I should've checked too!
[09:19:08] we can still do a script, which was the original intention
[09:19:29] and needed anyway for core
[09:19:39] yeah, and if it has a PK it is easier
[09:22:22] please delete the table so there is no garbage around
[09:22:56] done
[09:25:15] jynus: awesome: re mysqld_exporter, I think we can also selectively enable additional stats via hieradata for specific servers while we're testing
[09:26:00] I need to change the yaml with the servers first
[09:26:10] querying puppetdb
[09:29:25] just making sure, they're not dependent changes though?
[09:31:25] no
[09:31:33] but one is higher priority
[09:32:09] I do not want to maintain 20 different db lists
[09:33:27] yeah that's fair enough, lmk if you need help with it or have questions, I did some other puppetdb -> prometheus integrations already
[09:35:00] godog, if you have a CR link that would make my life easier
[09:35:10] of previous work related to that
[09:39:14] jynus: yup, a simple example is https://gerrit.wikimedia.org/r/#/c/341535/ for pdus, and its fixup at https://gerrit.wikimedia.org/r/#/c/344953/
[09:39:43] jynus: also in ops.pp there are a few defines, e.g. prometheus::class_config, you can use as an example
[09:42:36] thanks, that is exactly what I needed
[09:53:22] np! happy it helps
[09:56:15] 10DBA, 06Labs, 10Tool-Labs: labsdb1001 and labsdb1003 short on available space - https://phabricator.wikimedia.org/T132431#3136514 (10jcrespo) 14% now maybe enough until decommission?
[10:01:49] 10DBA, 06Labs, 10Tool-Labs: u3532__ (=marcmiquel) table using 64G on labsdb1001 and 108 GB on labsdb1003 - https://phabricator.wikimedia.org/T133322#3136515 (10jcrespo) @chasemp @madhuvishy @bd808 The user @marcmiquel didn't answer after 2 weeks notice, maybe we should consider dropping his 100GB database co...
[10:06:43] 10DBA, 06Labs, 10Tool-Labs: u3532__ (=marcmiquel) table using 64G on labsdb1001 and 108 GB on labsdb1003 - https://phabricator.wikimedia.org/T133322#3136519 (10marcmiquel) You can delete the databases. I am sorry for not replying, I did not see the e-mail. Thanks.
[10:19:55] 10DBA, 10MediaWiki-Database, 13Patch-For-Review, 07PostgreSQL, 07Schema-change: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#3136541 (10Marostegui) The used tables from wikidata in db1092 that have a UNIQUE have been converted to...
[10:22:42] 10DBA, 06Labs, 10Tool-Labs: u3532__ (=marcmiquel) table using 64G on labsdb1001 and 108 GB on labsdb1003 - https://phabricator.wikimedia.org/T133322#3136543 (10jcrespo) 05Open>03Resolved Done.
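The mechanism behind the [09:14]–[09:17] exchange, sketched with an illustrative statement (the checksum table layout and column list are assumptions, not pt-table-checksum's exact query): the tool checksums each chunk with a REPLACE ... SELECT. Under STATEMENT-based replication the whole statement re-executes on every replica, so each replica computes a checksum from its own data; under ROW-based replication only the resulting row images replicate, so every replica stores the master's checksum verbatim and no difference can ever surface.

```
-- Under binlog_format=STATEMENT this statement is re-executed on each
-- replica, so this_crc reflects the replica's own data. Under ROW, the
-- computed (this_cnt, this_crc) values are shipped as row images and the
-- replica's checksum table is just a copy of the master's.
REPLACE INTO testwiki.__wmf_checksums (db, tbl, chunk, this_cnt, this_crc)
SELECT 'testwiki', 'text', 1, COUNT(*),
       LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', old_id, old_text, old_flags)) AS UNSIGNED)), 10, 16))
FROM text
WHERE old_id BETWEEN 1 AND 1000;
```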
[10:22:44] 10DBA, 06Labs, 10Tool-Labs: labsdb1001 and labsdb1003 short on available space - https://phabricator.wikimedia.org/T132431#3136545 (10jcrespo)
[11:20:53] 10DBA, 07Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3136663 (10jcrespo) a:03jcrespo
[11:21:30] 10DBA, 07Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3117278 (10jcrespo) p:05Triage>03High
[12:51:03] 10DBA, 10MediaWiki-Database, 13Patch-For-Review, 07PostgreSQL, 07Schema-change: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#3136752 (10Marostegui) The status of s4, commons: Tables in use with UNIQUE key but not PK: ``` category...
[12:58:15] ^should we tell people what we are doing or should we wait until someone comments?
[12:58:46] so far we are only turning unique keys into pk
[12:58:57] yeah, but still
[12:58:58] I would guess people are fine with that?
[12:59:08] how can we spread the word more?
[12:59:09] if there is any FORCE(index) that wouldn't work
[12:59:24] because now it should be FORCE(PRIMARY)
[12:59:28] Maybe we can also comment on the rampant ticket?
[12:59:32] nah
[12:59:36] so the subscribers can take a look at that other ticket?
[12:59:40] keep doing the good work
[12:59:44] I was going to comment there
[12:59:55] about the SOS ticket
[13:00:04] "This is blocked by this issue"
[13:00:32] ah, good yes
[13:02:27] 10DBA, 07Epic: Meta DBA ticket for the DC switchover - https://phabricator.wikimedia.org/T155099#3136761 (10jcrespo)
[13:02:30] 10DBA, 10MediaWiki-API: Database query error (internal_api_error_DBQueryError) while getting list=allrevisions - https://phabricator.wikimedia.org/T123557#3136762 (10jcrespo)
[13:02:33] 10DBA, 13Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3136759 (10jcrespo) 05Open>03stalled @Catrope Mediawiki "official" indexes do not seem to work on enwiki. Please advise how to proceed because this task is blocked...
[13:03:50] 10DBA, 13Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3136767 (10jcrespo)
[13:08:07] quick question- which PK did you choose for change_tags?
[13:08:29] I left change_tag and tag_summary out of the equation
[13:08:37] ok
[13:08:57] if you have a suggestion, I am all ears
[13:09:10] but it says: oldimage, querycache, querycachetwo and user_newtalk are the only ones without?
[13:09:16] Basically all the int columns are DEFAULT NULL, so I didn't like that much
[13:09:23] ah
[13:09:24] true
[13:09:48] I am amending the comment, thanks for pointing that out :)
[13:10:11] do you want me to have a look at it, then?
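The UNIQUE-to-PK conversions being rolled out here reduce to two DDL patterns: promoting an existing UNIQUE key, and, for tables with no usable natural key, adding a surrogate key (the auto_increment suggestion that follows just below). A hedged sketch; the index and column names are illustrative, not the exact production DDL:

```
-- Pattern 1: promote an existing UNIQUE index to the PRIMARY KEY
-- ('old_id' is an assumed index name):
ALTER TABLE text
  DROP INDEX old_id,
  ADD PRIMARY KEY (old_id);

-- Pattern 2: add a surrogate auto_increment key where no natural key
-- exists ('utn_id' is a hypothetical column):
ALTER TABLE user_newtalk
  ADD COLUMN utn_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  ADD PRIMARY KEY (utn_id);
```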
[13:10:30] sure, if you have time for it
[13:11:00] My reasoning was to advance as much as possible and leave all the hard tables for the end, as they are probably the same across all the shards
[13:11:14] but the ones with or without PK are not the same in all the shards and/or schemas
[13:11:17] so it is a mess
[13:11:38] there are tickets already at T146571 T146568 T146585 T146586 T146591 T146570
[13:11:38] T146586: Add a primary key to querycachetwo - https://phabricator.wikimedia.org/T146586
[13:11:38] T146585: Add a primary key to user_newtalk - https://phabricator.wikimedia.org/T146585
[13:11:38] T146568: Give oldimage a primary key - https://phabricator.wikimedia.org/T146568
[13:11:38] T146570: Give user_properties a primary key - https://phabricator.wikimedia.org/T146570
[13:11:38] I would say: change_tags, tags_summary and oldimage are the hard ones really
[13:11:38] T146571: Give querycache a primary key - https://phabricator.wikimedia.org/T146571
[13:11:39] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591
[13:12:04] auto_increment :-)
[13:12:57] ah nice tickets
[13:13:03] lol
[13:13:19] but some tables, like l10n_cache, aren't in use
[13:13:24] (at least on the shards I saw)
[13:13:24] it is
[13:13:30] just once a month
[13:13:31] you sure?
[13:13:36] -rw-rw---- 1 mysql mysql 48M May 6 2014 l10n_cache.ibd
[13:13:38] that is commons
[13:13:40] ah
[13:13:47] then it is on a file
[13:14:01] I can alter it, there is no problem with that, it is tiny
[13:14:17] we can merge paladox's patch https://gerrit.wikimedia.org/r/#/c/318553
[13:15:13] yeah, that one seems fine
[13:16:01] I can try to alter that table and see if we get duplicates
[13:16:05] also: https://gerrit.wikimedia.org/r/#/c/318547/5/maintenance/tables.sql
[13:16:48] yeah, that one has the same issue as change_tags, hard to add right now :)
[13:16:52] remember when I broke it :)
[13:17:04] that is why we are doing it
[13:17:14] now, read-only during the failover
[13:17:32] yeah
[13:17:55] I think we'd be down to oldimage, change_tags and tags_summary
[13:18:02] failover -> PKs -> ROW -> galera -> no SPOF
[13:18:10] it is easy
[13:18:22] * marostegui looking at 2018
[13:21:14] 10DBA, 10MediaWiki-Database, 13Patch-For-Review, 07PostgreSQL, 07Schema-change: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#3136790 (10Marostegui) >>! In T17441#3136541, @Marostegui wrote: > The used tables from wikidata in db1092...
[13:43:35] 10DBA, 06Operations, 10ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3136845 (10Marostegui)
[13:44:37] 10DBA, 06Operations, 10ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3136638 (10Marostegui) @Cmjohnson feel free to replace the disk when you can. Thanks!
[14:37:49] jynus: if you have a minute, do you mind a quick logical review of https://gerrit.wikimedia.org/r/#/c/345045/4/switchdc/stages/t09_tendril.py ?
[14:38:21] just FYI, get_db_remote is just doing a query to puppetdb to get the selected master hosts
[14:38:46] saw it before
[14:38:57] looks good, but I would need to test it to +1
[14:39:19] it can be tested now, so I would literally do it
[14:39:29] I was waiting for your ok to do it
[14:44:06] * RoanKattouw finishes reading up on all the discussion on the various revision index tasks
[14:45:09] jynus: So you say T132416 is blocked on T159319, could you tell me in more detail what it's blocked on and what you want me to do / tell you / think about?
[14:45:10] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[14:46:00] I see there's a patch against MW for IGNORE INDEX(PRIMARY) which I'd be happy to merge if you can tell me that there isn't a DB in prod any more where PRIMARY is actually equal to the desired index (because IIRC there used to be some)
[14:46:33] jynus: if the only way to test is in prod, I think it is easier to merge; I can test it right away in both ways and we're good
[14:46:43] volans, ok to me
[14:46:45] and fix anything if we're not :D
[14:47:23] RoanKattouw please comment on the ticket
[14:47:28] OK will do
[14:47:35] however, that patch will not fix the issue fully
[14:47:37] thanks!
[14:48:11] and it is actually a deadlock- we will not continue doing the alter if the queries are not ok
[14:50:07] 10DBA, 13Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3137035 (10Catrope) >>! In T132416#3136759, @jcrespo wrote: > @Catrope Mediawiki "official" indexes do not seem to work on enwiki. Please advise how to proceed because...
[14:52:19] 10DBA, 13Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3137056 (10jcrespo) > Is it blocked on anything else other than merging the patch attached to that task? The whole task- those issues only appeared after the alters to...
[15:04:56] RoanKattouw, despite what may happen- most of T132416 is done
[15:04:56] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[15:05:07] Yeah, I saw
[15:05:11] but we stopped when enwiki started to misbehave
[15:05:23] Great work, thanks for picking that up, it looks like it was tons of work
[15:05:29] I think it has outliers, like pages with many revisions that are also hit very frequently
[15:05:37] I merged your patch BTW
[15:05:46] unlike commons and wikidatawiki
[15:05:56] did it work for the use case?
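The reason index hints come up both here and in the FORCE(index) worry at [12:59]: MySQL resolves FORCE/IGNORE INDEX hints by name, so a query hinting an index that the PK conversion dropped or renamed fails outright instead of just picking a different plan. A sketch with an assumed index name:

```
-- If the unique index 'rev_id' was promoted to the PRIMARY KEY, a query
-- still hinting the old name breaks hard (and replication with it, if the
-- statement replicates):
SELECT rev_id, rev_timestamp
FROM revision FORCE INDEX (rev_id)  -- assumed pre-conversion index name
WHERE rev_page = 1234
ORDER BY rev_timestamp DESC LIMIT 10;
-- ERROR 1176 (42000): Key 'rev_id' doesn't exist in table 'revision'
-- Hence the rewrite to FORCE INDEX (PRIMARY), or IGNORE INDEX (PRIMARY)
-- where the PK is the plan to avoid.
```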
[15:06:10] there was a link that was broken consistently
[15:06:57] (don't write it here unless fixed, that is why the task is still private)
[15:07:17] It's impossible to test locally, so we'll find out when that patch hits group0
[15:07:24] ah, see
[15:07:27] Which should be in 1-2 hours
[15:07:28] merged but not deployed
[15:07:29] sorry
[15:07:30] Yeah
[15:07:40] not accustomed to the mediawiki workflow
[15:07:47] that is why I needed product involved
[15:07:48] No worries; I could cherry-pick it, but releng is going to deploy it for me along with the rest of last week's changes in an hour
[15:07:57] no, absolutely no problem
[15:08:00] So I might as well wait for that :)
[15:08:07] I was worried because it was DOS-worthy
[15:08:10] and got no attention
[15:08:27] Also thanks for your comment at the bottom of that task, that explains to me what was left to do (which I couldn't easily find reading through the comment history)
[15:08:29] I'll get on fixing that one
[15:08:49] thank you a lot
[15:08:53] that means a lot to us
[15:08:56] <3
[15:09:26] we find problems that affect reliability and do not expect immediate attention
[15:09:36] but at least happy to have someone working on it
[15:18:09] marostegui, https://phabricator.wikimedia.org/P5150
[15:18:24] oh nice!
[15:18:41] you have officially forked pt-table-checksum then :)
[15:19:01] reimplemented a poor man's version, based on WMFMariaDB.py
[15:19:44] Yesterday I realised that the text table has a unique index, so I have converted it to a pk on the testing hosts
[15:19:57] yep
[15:20:15] which is actually a nice autoinc column so a pretty easy win there
[15:20:25] still bad on the master
[15:20:33] yeah :(
[15:20:36] and it may be even worse if the same index name is assumed
[15:21:00] what do you mean?
[15:22:45] the query works by forcing the usage of a key- if it has a different name, queries may break
[15:22:50] and replication with it
[15:23:30] aaah
[15:23:46] hopefully not :|
[15:23:47] it may be doing some checking- but who knows
[15:23:56] look at the issues on the special slaves
[15:24:09] yeah
[15:24:13] i hate those!
[15:24:33] we just added some extra queries to them!
[15:25:02] you think we will ever get rid of them?
[15:25:07] sure
[15:25:09] and have all the slaves with the same structure?
[15:25:14] I got the promise that MCR refactoring
[15:25:15] 10DBA, 10Analytics-EventLogging, 06Analytics-Kanban, 10ImageMetrics, 13Patch-For-Review: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#3137202 (10Nuria) mysql:research@analytics-store.eqiad.wmnet [log]> show tables like 'Image...
[15:25:19] will fix that
[15:25:22] 10DBA, 10Analytics-EventLogging, 06Analytics-Kanban, 10ImageMetrics, 13Patch-For-Review: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#3137203 (10Nuria) 05Open>03Resolved
[15:25:25] 10DBA, 07Epic, 07Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3137204 (10Nuria)
[15:25:56] oh really? niiiiice
[15:26:04] marostegui, the promise
[15:26:06] not the fix
[15:26:08] haha
[15:26:20] let me dream!
[15:26:55] marostegui, do you trust me running that on s2 slaves?
https://phabricator.wikimedia.org/P5150#27408
[15:27:23] I would start with codfw XD
[15:27:26] he he
[15:27:54] technically no write is done here
[15:28:08] so it is actually safer than pt-table-checksum
[15:28:37] ah, and you are doing only 1000 rows
[15:28:41] yep
[15:28:51] that was the whole point of the automation
[15:29:04] I would still go for codfw, not because I do not trust you, I tend to be that careful :p
[15:29:06] and I do not even query all rows, I just generate a hash, like pt-t-c
[15:30:32] that is nice
[15:30:43] pt-table-checksum-wfm
[15:31:28] nah
[15:31:33] wmf-table-checksum
[15:32:15] is there any table that has failed the check recently?
[15:32:28] failed as in...?
[15:32:33] differences?
[15:32:34] different data
[15:32:35] yes
[15:32:39] I think so, let me check
[15:32:56] I fixed most of s2, if not all
[15:33:51] https://phabricator.wikimedia.org/T160509#3127532 and https://phabricator.wikimedia.org/T160509#3110389
[15:34:18] I will run it on db2046
[15:34:26] it is of course stupid there
[15:34:36] but I need to test it really works :-)
[15:34:42] but at least you can test it
[15:34:44] yeah, that
[15:35:04] This is more for es, where we cannot use the right tool
[15:36:00] it is definitely too early to touch that
[15:37:01] I wonder why a UNIQUE key isn't enough for pt-table-checksum to consider it safe to check
[15:37:14] I do not think it is the unique
[15:37:34] either it was not identified correctly
[15:37:43] or it had the same issue as revision
[15:37:49] that you worked to minimize
[15:38:05] I just knew it was problematic and generated that list
[15:39:06] ah maybe, maybe
[15:39:09] could be
[15:44:09] 27 QPS is not _that much load_: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2046&from=now-1h&to=now
[15:47:09] haha not at all, I think it can handle more :p
[15:48:44] doesn't work very well for sparse values :-(
[15:56:24] 10DBA, 06Operations, 10ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3137367 (10Cmjohnson) Disk has been ordered through Dell Create Service Request: Service Tag 3JG3K02 Confirmed: Request 946035459 was successfully submitted.
[15:57:50] 10DBA, 06Operations, 10ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3137369 (10Marostegui) Thanks!
[16:03:22] 10DBA, 10Wikidata, 03Wikidata-Sprint: Wikibase\Repo\Store\Sql\SqlEntitiesWithoutTermFinder::getEntitiesWithoutTerm can take 19 hours to execute and it is run by the web requests user - https://phabricator.wikimedia.org/T160887#3137398 (10jcrespo) This is ongoing right now, for example: ``` db1082 307726822 w...
[16:22:17] 10DBA, 10Wikidata, 03Wikidata-Sprint: Wikibase\Repo\Store\Sql\SqlEntitiesWithoutTermFinder::getEntitiesWithoutTerm can take 19 hours to execute and it is run by the web requests user - https://phabricator.wikimedia.org/T160887#3137487 (10daniel) Looks like we wrote the code, but failed to deploy the necessar...
[16:31:45] $ python3 compare.py dbstore1002 db2046.codfw.wmnet ruwiki geo_tags gt_id --step=10000 -> No differences found. mmm
[16:51:12] 10DBA, 06Labs, 10Tool-Labs: labsdb1001 and labsdb1003 short on available space - https://phabricator.wikimedia.org/T132431#3137674 (10jcrespo) 05Open>03Resolved a:03jcrespo For now.
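The compare.py runs above walk the table in PK ranges (--step=10000) and compare a per-chunk hash between the two hosts, read-only, in the spirit of pt-table-checksum. Roughly the kind of query it would issue per chunk on each host — a sketch, not the actual P5150 code; the aggregate and column list are assumptions:

```
-- One chunk of ruwiki.geo_tags, matching the ranges reported in T160509:
SELECT COUNT(*) AS cnt,
       LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', gt_id, gt_page_id, gt_primary)) AS UNSIGNED)), 10, 16)) AS crc
FROM geo_tags
WHERE gt_id BETWEEN 81260001 AND 81270000;
-- The (cnt, crc) pairs from the two hosts are compared; a mismatch is
-- reported as "Rows are different WHERE gt_id BETWEEN ...".
```

This also explains the "doesn't work very well for sparse values" complaint at [15:48:44]: fixed-width BETWEEN ranges over a sparse key produce many empty or near-empty chunks.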
[16:55:08] 10DBA, 10Wikidata, 03Wikidata-Sprint: [Task] Remove "all" option for Special:EntitiesWithout*" - https://phabricator.wikimedia.org/T161631#3137680 (10Lydia_Pintscher)
[17:33:44] 10DBA, 10Wikidata, 13Patch-For-Review, 03Wikidata-Sprint: [Task] Remove "all" option for Special:EntitiesWithout*" - https://phabricator.wikimedia.org/T161631#3137845 (10daniel)
[17:38:26] 10DBA: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509#3101348 (10jcrespo) I think it finally works: ``` $ python3 compare.py db2046.codfw.wmnet dbstore1002 ruwiki geo_tags gt_id --step=10000 Rows are different WHERE gt_id BETWEEN 81260001 AND 81270000 Rows are different WHERE gt_id BETWEE...
[18:08:58] 10DBA, 13Patch-For-Review: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509#3137981 (10Marostegui) >>! In T160509#3137853, @jcrespo wrote: > I think it finally works: > ``` > $ python3 compare.py db2046.codfw.wmnet dbstore1002 ruwiki geo_tags gt_id --step=10000 > Rows are different WHERE...
[23:24:48] 10DBA, 10MediaWiki-Database, 06Performance-Team, 07Availability: wfWaitForSlaves in JobRunner can massively slow down run rate if just a single slave is lagged - https://phabricator.wikimedia.org/T95799#3138856 (10aaron) a:05aaron>03None