[05:07:26] DBA, Operations, Wikidata, Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Marostegui) >>! In T202764#4595562, @Smalyshev wrote: > Looking at logstash: https://logstash.wikimedia.org/goto/39a6fe9ed...
[06:13:46] Blocked-on-schema-change, DBA, Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 (Marostegui)
[06:13:49] DBA, Patch-For-Review: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638 (Marostegui) Open>Resolved
[06:43:13] I am around if you want to merge things
[06:49:32] DBA: db1118 mysql process crashed (mysql 8.0 test host) - https://phabricator.wikimedia.org/T204594 (jcrespo) My suggestion for root cause is that mysql works well as long as gtid is not enabled on its master (it was enabled around some days of the crash). Replicating from a non-gtid enabled master worked we...
[07:08:54] jynus: Yeah! Let's merge stuff
[07:09:23] can you give me 3 minutes to finish a reboot?
[07:09:27] of course!
[07:09:33] i will disable puppet meanwhile
[07:09:38] thanks
[07:17:26] I am temporarily enabling puppet on db1109, I need it for the reboot
[07:20:01] cool
[07:21:39] check this when you have a chance: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/461288/
[07:25:58] DBA: db1118 mysql process crashed (mysql 8.0 test host) - https://phabricator.wikimedia.org/T204594 (Marostegui) There is also the fact that I executed an ALTER table on the DC master for this host (db1067) with replication. It was done early that day: ``` 17th Sept 05:28 marostegui: Deploy schema change on...
[07:27:33] DBA: db1118 mysql process crashed (mysql 8.0 test host) - https://phabricator.wikimedia.org/T204594 (jcrespo) > There is also the fact that I executed an ALTER table It would fit in my theory because it would be (maybe) the first event with GTID that the host received in a long time.
[07:28:58] DBA: db1118 mysql process crashed (mysql 8.0 test host) - https://phabricator.wikimedia.org/T204594 (Marostegui) >>! In T204594#4597123, @jcrespo wrote: >> There is also the fact that I executed an ALTER table > It would fit in my theory because it would be (maybe) the first event with GTID that the host rec...
[07:33:04] good morning, I'm in the process of setting up a replacement Cumin server (cumin2001.codfw.wmnet which will eventually replace sarin). The new servers are based on stretch and we won't immediately remove the old servers as we'll switch back to eqiad with the proven switchdc install on jessie
[07:33:18] As the Cumin servers are also used as DB root clients, this needs a grant merged (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/460323/) and some testing on your end that all works fine with stretch (there should be no issues), but no hurry, this is just a FYI and can wait a few more weeks
[07:34:31] moritzm: We'd need to merge and deploy the grant everywhere, indeed
[07:34:38] moritzm: see my comment
[07:38:15] jynus: the firewall rules were merged via the network constants which happened in 1a60d042045e9e20ebc28f01c47385761565d2b2 and the root profile was added in 32e1d5b0024ae1482e0201b67cf86dc8d19046c4 (as otherwise Puppet failed to run when applying the cluster management role), so I think these should all be in place already
[07:38:20] I'll comment on the task
[07:38:32] moritzm: ok, I didn't see that
[07:39:00] yeah, Riccardo reviewed those, as you were all knee-deep in master reimages I didn't add you as reviewers
[07:40:22] one thing I forgot to mention, there's also an eqiad replacement forthcoming (next week, Riccardo will use the OS install as a training for the new hires), so best to wait to bundle both grants
[07:41:11] that is good information- we should wait as long as the old hosts are available
[07:41:28] and until both new ones aren't
[07:42:02] ack
[09:03:04] marostegui: with this we have a backup dashboard so we can check bottlenecks easier: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=MariaDB+read+only+s1
[09:03:17] e.g. if queries or connections get slow
[09:03:33] ah nice
[09:03:44] * marostegui bookmarking that
[09:04:06] when I have time, read only will become "mariadb health"
[09:04:45] also those units need work- days instead of seconds, ms instead of seconds
[09:06:16] not sure if you didn't disable puppet on es* hosts on purpose or not, but check works as expected there
[09:06:48] I checked on icinga files yep
[09:07:02] e.g. https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=MariaDB+read+only+es2
[09:11:57] so do we enable puppet everywhere?
[09:12:12] Yeah, I am doing some final checks
[09:12:15] Give me some minutes
[09:12:41] ok
[09:16:56] Enabling puppet
[09:22:50] DBA, Patch-For-Review: Make sure multi-instance slaves page - https://phabricator.wikimedia.org/T200509 (Marostegui)
[09:23:29] DBA, Epic: Meta ticket: Migrate multi-source database hosts to multi-instance - https://phabricator.wikimedia.org/T159423 (Marostegui)
[09:23:33] DBA, Patch-For-Review: Make sure multi-instance slaves page - https://phabricator.wikimedia.org/T200509 (Marostegui) Open>Resolved a:Marostegui This is done
[09:48:04] This is nice: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=MariaDB+read+only+s2
[09:48:26] I can see that the connection latency of db2063 is the highest
[09:48:54] so I can tweak connection weights a bit
[10:17:30] Blocked-on-schema-change, DBA, Anti-Harassment: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui) I will try to get this deployed on eqiad before we switch back to it.
[10:25:07] so we have a few hosts reimaged to 10.1 that had old grants
[10:25:11] *had
[10:25:22] probably the first reimaged
[10:25:49] I think it was only 3 of them
[10:27:07] Blocked-on-schema-change, DBA, Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 (Marostegui)
[10:27:48] DBA, Patch-For-Review: Reclone db1114 (s1 api) from another API host - https://phabricator.wikimedia.org/T203565 (Banyek) Open>Resolved The root cause of the different query plans was the different schemas on the hosts. ``` SHOW CREATE TABLE enwiki.revision: [...] KEY `rev_timestamp` (`rev_times...
[10:27:50] DBA, Operations, Epic, Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (Banyek)
[10:28:45] DBA, Patch-For-Review: Reclone db1114 (s1 api) from another API host - https://phabricator.wikimedia.org/T203565 (Marostegui) Good catch. Can you open a task to get that checked and fixed across all the hosts where this difference exists?
[10:31:21] DBA, Patch-For-Review: Reclone db1114 (s1 api) from another API host - https://phabricator.wikimedia.org/T203565 (Marostegui) It is not entirely clear to me what the differences are? Which hosts did you compare?
[12:07:14] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, wikidata-tech-focus: wikibase: synchronize schema on production with what is created on install - https://phabricator.wikimedia.org/T85414 (Addshore) So I just diffed the tables and indexes on production wikidatawiki vs my local fresh instal...
[12:07:56] marostegui: ^^ looks like there are a couple of other differences between the indexes on wb_terms in the code and in production....
[12:10:52] let's sync later on prioritizing pending dc switch tasks, I will be for now working on ep_ dumps
[12:37:31] addshore: not surprising :(
[12:43:11] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, wikidata-tech-focus: wikibase: synchronize schema on production with what is created on install - https://phabricator.wikimedia.org/T85414 (Marostegui) A quick check about unused indexes on that table reports that only `wb_terms_entity_id` a...
[12:49:07] marostegui: so, re your last comment
[12:49:12] so tmp1 is now being listed as used?
[12:49:57] addshore: yep, also remember that deleting it caused a massive outage :)
[12:50:10] yup ;)
[12:50:28] im wondering why term_language is missing from those 2 indexes
[12:50:59] but I guess the easy thing to do will be to sync the code to not have term_language in those indexes, and then at least those 2 line up with what the index is doing, even if they have different names
[12:51:20] My guess is that they were reconverted to wb_terms_text and wb_terms_search_key and code was never updated
[12:51:39] Reconverted in production I mean
[12:51:43] is renaming indexes trivial in production? or not?
[12:51:47] Nope
[12:51:50] okay ;)
[12:51:51] We need to drop and create
[12:51:56] ewww
[12:52:00] I know :-(
[12:52:10] right, I'll probably just file 3 sub tasks to bring the code in line with prod then
[12:52:14] and then we can close the task :)
[12:52:23] Yeah, that makes more sense
[12:52:33] I might leave the names different
[12:52:36] And at some point I guess we need to rename tmp1 to something more meaningful
[12:52:42] but, they are essentially the same
[12:52:54] marostegui: well, the actual plan is to just kill the table
[12:53:13] If I were you I would name them exactly to what we have in production, just to avoid confusion in 6 months time when we would be like: why are these names different? :-)
[12:53:15] of course there isn't a timeline for that yet, but im actively slowly working on setting the ball in motion
[12:53:18] addshore: Ah true!! Forgot that :)
[12:53:46] but the goal is to just stop using it, while not touching it, as touching it might cause more issues like the tmp1 incident
[12:53:56] so, just slowly migrate away from it
[12:54:15] yeah, I will cry once we can kill that table XD
[12:54:20] :)
[12:56:13] Blocked-on-schema-change, DBA, Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 (Marostegui)
[12:56:45] Blocked-on-schema-change, DBA, Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 (Marostegui)
[12:57:16] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, wikidata-tech-focus: wikibase: synchronize schema on production with what is created on install - https://phabricator.wikimedia.org/T85414 (Addshore)
[12:58:15] DBA, monitoring, Patch-For-Review, Wikimedia-Incident: Monitor read_only on all databases, make it page on masters - https://phabricator.wikimedia.org/T172489 (Marostegui) @jcrespo is this done after today's merge or still missing things?
[13:08:10] DBA, monitoring, Patch-For-Review, Wikimedia-Incident: Monitor read_only on all databases, make it page on masters - https://phabricator.wikimedia.org/T172489 (jcrespo) Open>stalled The initial scope is not fulfilled: > Monitor read_only on all databases, make it page on masters It has...
[13:08:21] DBA, monitoring, Patch-For-Review, Wikimedia-Incident: Monitor read_only on all databases, make it page on masters - https://phabricator.wikimedia.org/T172489 (jcrespo) p:Triage>Normal
[13:21:34] Blocked-on-schema-change, DBA, Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 (Marostegui)
[13:36:30] Blocked-on-schema-change, DBA, Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 (Marostegui)
[13:43:17] DBA, Patch-For-Review: dbstore2002 s2 crashed - https://phabricator.wikimedia.org/T204593 (Banyek) a:Banyek
[13:43:30] jynus: I have a question about "section" and the order it shows the servers
[13:43:54] yes?
[13:43:55] no rush or urgent at all
[13:44:05] we can order it in any way we want
[13:44:13] Yeah, but I was wondering about one thing
[13:44:17] it is a question of giving it the right sql
[13:44:47] ./section s1 shows the master in the correct order, last line for eqiad master and the last codfw host is codfw master, but for x1, it is not the case
[13:48:33] Blocked-on-schema-change, DBA, Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 (Marostegui) codfw x1 enwiki progress: [] dbstore2002 [] db2069 [] db2034 [] db2033
[13:49:11] that is because it just orders by string
[13:49:20] we can create a masters table
[13:49:28] and order alphabetically
[13:49:32] but masters last
[13:49:41] but then we have to remember to update it
[13:51:19] yeah, it is not a big deal, I was just wondering :)
[13:52:06] we can do it (you can do it!)
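The ordering idea discussed above (alphabetical, but masters last) can be sketched as a two-part sort key. This is only an illustration in Python: the real `section` tool is described as being driven by SQL and its implementation is not shown in this log. The function name `order_hosts` is hypothetical, and the choice of db2033 as the master is an assumption made purely for the example.

```python
def order_hosts(hosts, masters):
    """Sort host names alphabetically, but place any masters at the end.

    `masters` plays the role of the "masters table" mentioned in the chat;
    it has to be kept up to date by hand, as noted there.
    """
    master_set = set(masters)
    # Tuples sort element by element: False (replica) sorts before True (master),
    # and ties are broken by plain string comparison of the host name.
    return sorted(hosts, key=lambda host: (host in master_set, host))


# Host names taken from the x1 progress list above; the master is assumed.
x1_hosts = ["dbstore2002", "db2069", "db2034", "db2033"]
print(order_hosts(x1_hosts, masters=["db2033"]))
```

With a plain string sort db2033 would come first; with the extra boolean in the key it moves to the last position while the other hosts stay in alphabetical order.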
[13:52:27] maybe even it is there already
[13:53:26] there is a masters table
[14:10:31] marostegui: https://phabricator.wikimedia.org/P7568
[14:30:43] DBA, Patch-For-Review: dbstore2002 s2 crashed - https://phabricator.wikimedia.org/T204593 (Banyek) This oneliner will comress the tables from the direction of the smallest towards the largest: ``` mysql -BN -S /run/mysqld/mysqld.s2.sock -e "SELECT table_schema, table_name FROM information_Schema.tables W...
[14:32:14] DBA, Patch-For-Review: dbstore2002 s2 crashed - https://phabricator.wikimedia.org/T204593 (Marostegui) >>! In T204593#4597992, @Banyek wrote: > This oneliner will comress the tables from the direction of the smallest towards the largest: > ``` > mysql -BN -S /run/mysqld/mysqld.s2.sock -e "SELECT table_sc...
[14:39:15] DBA, Patch-For-Review: dbstore2002 s2 crashed - https://phabricator.wikimedia.org/T204593 (Banyek) and stop replication beforehand! ``` mysql -BN -S /run/mysqld/mysqld.s2.sock -e "STOP SLAVE"; mysql -BN -S /run/mysqld/mysqld.s2.sock -e "SELECT table_schema, table_name FROM information_Schema.tables WHER...
[14:45:32] Blocked-on-schema-change, DBA, Anti-Harassment: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (aezell) Thanks!
[15:07:23] DBA, Data-Services, Datasets-General-or-Unknown: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802 (jcrespo) a:jcrespo>None So public dumps are ready: ``` root@labsdb1010:/srv/tmp/ep_dumps$ rm *; mysql -BN --skip-ssl -e "select table_s...
[15:15:27] DBA, Operations, Wikidata, Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Krinkle) a:Smalyshev Considering this done from our team, but keeping open for now because the WDQS issues appear unre...
[15:50:54] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, Wikidata-Campsite: Create wb_terms_entity_id wb_terms index for Wikibase on install and upgrade - https://phabricator.wikimedia.org/T204836 (Addshore) p:Triage>Normal
[15:50:58] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, Wikidata-Campsite: Make Wikibase wb_terms term_text index the same as wb_terms_text in WMF production - https://phabricator.wikimedia.org/T204837 (Addshore) p:Triage>Normal
[15:51:01] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, Wikidata-Campsite: Make Wikibase wb_terms term_search_key index the same as wb_term_search_key in WMF production - https://phabricator.wikimedia.org/T204838 (Addshore) p:Triage>Normal
[15:52:23] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, wikidata-tech-focus: wikibase: synchronize schema on production with what is created on install - https://phabricator.wikimedia.org/T85414 (Addshore)
[15:52:57] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, wikidata-tech-focus: wikibase: synchronize schema on production with what is created on install - https://phabricator.wikimedia.org/T85414 (Addshore) All of the sub tasks have been created. Once all complete we should compare the production...
[15:53:24] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, Wikidata-Campsite: Make Wikibase wb_terms term_search_key index the same as wb_term_search_key in WMF production - https://phabricator.wikimedia.org/T204838 (jcrespo) ^I am assuming this doesn't have yet actionables for us (yet) just on the...
[15:54:20] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, Wikidata-Campsite: Make Wikibase wb_terms term_search_key index the same as wb_term_search_key in WMF production - https://phabricator.wikimedia.org/T204838 (Addshore) >>! In T204838#4598351, @jcrespo wrote: > ^I am assuming this doesn't hav...
[16:00:02] addshore: I was just justifying/asking my blocked external column movement
[16:00:31] jynus: always better to check more than check less :)
[16:00:33] sometimes it is unclear then added to a ticket what people except from us, just wanted to clarify that we will oversee
[16:01:00] but not start doing anything there, just on the radar
[16:01:11] s/then/when/
[16:01:23] s/except/expect/
[16:01:29] I have to change my keyboard
[16:01:36] :D
[16:01:40] or my brain, one of the 2
[17:26:39] DBA, JADE, Operations, Scoring-platform-team (Current), User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (awight) @jcrespo When you have a minute, I'd like to hear your opinion on calculated field joins, e.g. {P7570} I don'...
[17:49:48] DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (Marostegui) I just sync'ed with @ayounsi about the network maintenance. It is still blocked on the cables. row A: If cables arrive on time, they are expecting to do maintenance on row A (no servers) later this week or early next...