[05:51:28] 10DBA, 10Patch-For-Review: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275#4115645 (10Marostegui) s8: db2079
[06:18:41] 10DBA, 10Patch-For-Review: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275#4115657 (10Marostegui)
[06:37:24] good morning :)
[06:37:29] hey!
[06:38:03] so, the alter table failed with this error: ERROR 1799 (HY000): Creating index 'PRIMARY' required more than 'innodb_online_alter_log_max_size' bytes of modification log. Please try again.
[06:38:07] aaaaah
[06:38:09] :(
[06:38:11] I can fix that for you
[06:38:15] but you need to run the alter again
[06:38:30] sure
[06:38:45] db2083, right?
[06:38:49] yup
[06:39:03] should be fixed now
[06:39:30] Cool
[06:40:25] I'm actually writing to the table atm (replacing term_search_key with '')
[06:40:30] yeah, i saw that
[06:40:31] it will finish soon
[06:40:42] I saw the alter running for 24h at least, no?
[06:41:08] no, it just finished and I ran another batch smaller
[06:41:17] aaah cool cool
[06:41:24] so it failed after how many hours?
[06:41:43] the old one was replacing 150M (took 10 hours), this one will take around 5 hours-ish
[06:41:55] No, I mean the one dropping the column
[06:41:57] the alter didn't give a time to me
[06:42:08] it :/
[06:42:21] ah because of the error indeed
[06:44:15] yeah it was at leat 24 hours
[06:44:19] *least
[06:44:45] https://gerrit.wikimedia.org/r/#/c/424300/
[06:45:49] Can we deploy this?
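[Editor's note] The ERROR 1799 above is MySQL/MariaDB's ER_INNODB_ONLINE_LOG_TOO_BIG: an online ALTER keeps concurrent DML in a temporary modification log, and aborts if that log outgrows innodb_online_alter_log_max_size (default 128 MiB). The usual remedy, as done in the chat, is to raise the variable and re-run the ALTER. A rough sketch of that retry logic, not the actual tooling used here; run_alter and set_global are hypothetical stand-ins for calls through a real client connection:

```python
# Sketch: re-run an online ALTER, growing innodb_online_alter_log_max_size
# on each ER_INNODB_ONLINE_LOG_TOO_BIG failure. Helper names are made up;
# only the error code and variable name come from the MySQL docs.

ER_INNODB_ONLINE_LOG_TOO_BIG = 1799
DEFAULT_LOG_MAX_SIZE = 128 * 1024 * 1024  # server default: 128 MiB

def alter_with_retry(run_alter, set_global,
                     start=DEFAULT_LOG_MAX_SIZE, cap=8 * 1024 ** 3):
    """Retry the ALTER, doubling the online alter log size on error 1799."""
    size = start
    while True:
        set_global("innodb_online_alter_log_max_size", size)
        err = run_alter()  # stand-in: returns None on success, else an error code
        if err is None:
            return size    # the size that let the ALTER finish
        if err != ER_INNODB_ONLINE_LOG_TOO_BIG or size >= cap:
            raise RuntimeError(f"ALTER failed with error {err}")
        size *= 2          # grow the modification log and try again
```

On a busy table (the chat mentions a concurrent job rewriting term_search_key) it is often cheaper to just set the variable generously up front than to pay for a failed multi-hour ALTER.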
https://giphy.com/gifs/please-simba-lion-king-PK5CQPd6rCF3y
[06:46:19] sure
[06:46:21] give me a sec
[06:50:59] merged and puppet ran on terbium
[06:51:02] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#4115680 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db2092.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reim...
[06:51:08] marostegui: can you make the /var/log/wikidata/deleteAutoPatrolLogs.log file in terbium? writeable/readable by the $::mediawiki::users::web ?
[06:51:12] in terbium
[06:51:25] I'd go with 775 :D
[06:51:25] yep
[06:51:57] done
[06:52:44] Thank you!
[06:52:48] I keep monitoring this
[06:52:57] cheers!
[06:54:19] Right now, the logging table in commons wiki growth ten times slower and wikidatawiki 100 times. All other wikis will get the change by the next train \o/ I'm already deleting rows everywhere but it takes some time
[06:55:40] nice!!!!!
[06:55:51] let us know when we can optimize logging, specially on big wikis
[06:56:53] yeah sure
[07:28:02] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#4115738 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db2092.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reim...
[07:35:16] I wonder if dbproxy1011.yaml and dbproxy1010.yaml should have the same format?
[07:35:26] ?
[07:35:57] https://github.com/wikimedia/puppet/blob/production/hieradata/hosts/dbproxy1010.yaml https://github.com/wikimedia/puppet/blob/production/hieradata/hosts/dbproxy1011.yaml
[07:36:39] master can only contain 2 hosts
[07:36:49] and weight doesn't make sense
[07:37:10] yaml hashes cannot be ordered
[07:37:21] but feel free to spend time on that, I will not
[07:37:32] Yeah, I was wondering if we can make them more alike
[07:37:46] send a proposal
[07:37:53] as right now depooling labsdb1009 is different thatn depooling labsdb1011
[07:37:56] with code, of course!
[07:38:08] not a big deal, but just wondering :)
[07:38:21] I say that if you spend the time, I am ok
[07:38:27] :)
[07:38:38] (and it is correct)
[07:39:29] I would prefer if sanitarium puppet was fixed first
[07:39:32] I was answering an email from brooke with some questions and realised that it could be confusing, that's all :)
[07:39:34] which is part of the goal
[07:39:55] sanitarium_multiinstance
[07:46:07] 10DBA, 10Collaboration-Team-Triage, 10Operations, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#4115753 (10jcrespo) That last suggestion looks like a blocker to me, at least to check it before doin...
[07:48:19] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#4115766 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2092.codfw.wmnet'] ``` and were **ALL** successful.
[07:55:08] the deleteAutoPatrolLogs works fine for wikidata, btw. you can delete /var/log/wikidata/rebuildTermSqlIndex.log* to save some space on terbium
[07:55:40] it is 0 bytes :)
[07:55:55] Ah, it has multple
[07:55:56] I see
[07:56:09] I will nuke them!
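[Editor's note] The "yaml hashes cannot be ordered" remark above refers to the YAML spec: a mapping is an unordered set of key/value pairs, so a plain hieradata hash cannot encode, say, master-first ordering. JSON objects have the same semantics, so the point can be illustrated with the stdlib json module (PyYAML is not assumed to be available); the host names are taken from the chat:

```python
# Two mappings that differ only in key order are the same value: order
# carries no meaning, which is why ordering intent needs a list (or an
# explicit field such as the "master" key mentioned above) rather than
# relying on how the hash happens to be written in the .yaml file.
import json

a = json.loads('{"labsdb1009": {"weight": 1}, "labsdb1011": {"weight": 1}}')
b = json.loads('{"labsdb1011": {"weight": 1}, "labsdb1009": {"weight": 1}}')
assert a == b  # equal regardless of key order
```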
[07:56:43] \o/
[09:28:08] The ALTER TABLE has began
[09:28:14] cool
[12:22:18] Running the script from terbium on commonswiki gives me "SLOW TIMER" notice (21 seconds on each batch), should I stop it?
[12:25:01] yes, make the batch smaller
[12:25:39] they are supposed to take no more than 1 second
[12:27:21] I think mediawiki also will kill writes longe than 3-5 seconds
[12:30:09] The writes are fast, finding them is hard
[12:31:11] if they are on the same transaction && server, there is not much of a difference- if they are on separate transactions/servers, then no problem
[12:32:01] Made the batch half, let's see what happens. The script uses a replica to get log_id s to delete and use another transaction to delete
[12:32:08] AND the selects are using the "slow" queries relplicas
[12:32:36] which are not slower, they are for slow queries like those
[12:36:21] noted
[12:48:41] Nothing worked, will do commons later
[13:21:01] 10DBA, 10Operations, 10ops-eqiad: Rack and setup 8 new eqiad DBs - https://phabricator.wikimedia.org/T191792#4116638 (10Marostegui) p:05Triage>03Normal
[13:54:31] 10DBA, 10Gerrit, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532#4116752 (10Marostegui)
[13:54:34] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#4116747 (10Marostegui) 05Open>03Resolved
[13:55:10] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3689635 (10Marostegui) All the hosts are now pooled.
[13:55:32] marostegui hi does that mean gerrit2001 can connect to the db now ? :)
[13:55:44] ?
[13:56:08] marostegui that task was a subtask of https://phabricator.wikimedia.org/T176532#4116752
[13:56:33] paladox: we still don't have proxies
[13:56:38] oh ok
[13:57:01] marostegui is there a task for the proxies?
[13:57:38] No, we first need to order them and all that
[13:58:38] oh ok
[13:58:48] We are working out eqiad proxies first, as those have more priority :)
[14:03:39] paladox: I gave you a date in the past about when we plan to handle that
[14:03:46] don't you remember?
[14:04:19] yep, i thought that, that task was part of it. But wasen't.
[14:04:33] asking more time will not make it happen faster :-)
[14:04:37] *times
[14:04:55] technically it is
[14:05:09] but that is only for the underlying servers, we still need the proxies
[14:06:12] ok
[14:27:01] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10Epic: Make wb_terms table fancy - https://phabricator.wikimedia.org/T188992#4116879 (10Lucas_Werkmeister_WMDE)
[14:54:39] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4117113 (10Papaul) p:05Triage>03Normal
[14:58:30] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4117116 (10Papaul) @Marostegui which disk you want to replace first? 1 or 4 ?
[14:58:58] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4117117 (10Marostegui) Oh, I didn't know there were two of them broken.
Let me check
[14:59:07] I prefer the failed one first
[14:59:12] (4)
[14:59:19] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4117118 (10Marostegui) @Papaul disk #4
[14:59:30] Yeah, I thought the #1 failed already
[14:59:41] But it is still as predicrtive faulure
[15:05:23] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4117150 (10Papaul) a:05Papaul>03Marostegui complete
[15:08:31] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4117168 (10Marostegui) Thanks ``` physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, Rebuilding) ``` Once this is completed we can replace #1
[16:28:20] marostegui: https://gerrit.wikimedia.org/r/#/c/425087/
[16:29:20] oh nice!!
[16:29:20] (not ready yet, but just FYI)
[16:29:25] So cool!
[16:29:28] I will take a look :)
[16:42:07] jynus: what day/time works for you to talk over the m5 things from T189542?
[16:42:07] T189542: Update updatequerypages::cronjob and refreshlinks::cronjob now that silver no longer has a database - https://phabricator.wikimedia.org/T189542
[16:43:01] any time not too late for us
[16:43:06] (CEST)
[16:43:13] I think that andrew and I can talk it all out with you.
[16:43:23] * bd808 looks at calendar
[16:43:31] when do you start your day normally?
[16:43:49] I try to start ~15:00 UTC
[16:44:37] 30 minutes tomorrow at 15:30 UTC?
[16:45:12] that collides with the WMCS team meeting. My mornings are a mess this week apparently.
[16:45:22] ok, another day, another hour?
[16:46:01] 15:30 on Thursday would work for me and it looks like Andrew too
[16:46:09] ok, sending invite
[16:47:45] see you
[16:48:01] thanks jynus
[17:58:03] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4117864 (10Marostegui) a:05Marostegui>03Papaul @Papaul please change disk #1 as disk #4 got rebuilt finely ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay...
[18:49:53] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#4117968 (10Krinkle)
[18:52:37] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#4117973 (10Krinkle)
[19:03:24] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4118008 (10Marostegui)
[19:04:54] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4114784 (10Marostegui) I see the disk already being rebuilt Thanks @Papaul ``` physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuilding) ```
[19:13:40] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4118043 (10Papaul) @Marostegui no problem
[20:16:42] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4118365 (10Marostegui) @Papaul this disk failed, was it an used one? Can you try another one?
``` physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Failed) ``` Thanks
[20:17:21] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4118381 (10Marostegui)
[20:21:20] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4118408 (10Papaul) Ok will do once back at the DC in the AM
[20:27:41] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4118426 (10Marostegui) Thanks!
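[Editor's note] Throughout the db2069 RAID saga above, the rebuild/failure state is read off HP controller lines such as "physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuilding)". A small parser for exactly that line shape, useful for turning the pasted controller output into a drive-to-status map; this is a sketch based only on the snippets in this log, not on the full output grammar of the HP utility:

```python
# Sketch: map each physicaldrive id (e.g. "1I:1:1") to the last
# comma-separated field inside its parentheses, which in these snippets is
# the status: OK, Rebuilding, Failed, etc.
import re

DRIVE_RE = re.compile(r"physicaldrive (\S+) \(([^)]*), ([^,)]+)\)")

def drive_states(output):
    """Parse controller output into {drive_id: status}."""
    return {m.group(1): m.group(3).strip()
            for m in DRIVE_RE.finditer(output)}
```

Something like this could back a quick check that a hot-swapped disk actually entered Rebuilding (as checked manually at 19:04:54) instead of going straight to Failed (as happened at 20:16:42 with the first replacement for bay 1).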