[05:51:28] 10DBA, 10Patch-For-Review: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275#4115645 (10Marostegui) s8: db2079
[06:18:41] 10DBA, 10Patch-For-Review: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275#4115657 (10Marostegui)
[06:37:24] good morning :)
[06:37:29] hey!
[06:38:03] so, the alter table failed with this error: ERROR 1799 (HY000): Creating index 'PRIMARY' required more than 'innodb_online_alter_log_max_size' bytes of modification log. Please try again.
[06:38:07] aaaaah
[06:38:09] :(
[06:38:11] I can fix that for you
[06:38:15] but you need to run the alter again
[06:38:30] sure
[06:38:45] db2083, right?
[06:38:49] yup
[06:39:03] should be fixed now
[06:39:30] Cool
[06:40:25] I'm actually writing to the table atm (replacing term_search_key with '')
[06:40:30] yeah, i saw that
[06:40:31] it will finish soon
[06:40:42] I saw the alter running for 24h at least, no?
[06:41:08] no, it just finished and I ran another batch smaller
[06:41:17] aaah cool cool
[06:41:24] so it failed after how many hours?
[06:41:43] the old one was replacing 150M (took 10 hours), this one will take around 5 hours-ish
[06:41:55] No, I mean the one dropping the column
[06:41:57] the alter didn't give a time to me
[06:42:08] it :/
[06:42:21] ah because of the error indeed
[06:44:15] yeah it was at leat 24 hours
[06:44:19] *least
[06:44:45] https://gerrit.wikimedia.org/r/#/c/424300/
[06:45:49] Can we deploy this?
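[Editor's note] The ERROR 1799 above is MySQL/MariaDB's ER_INNODB_ONLINE_LOG_TOO_BIG: an online ALTER keeps concurrent DML in a temporary modification log, and aborts if that log outgrows innodb_online_alter_log_max_size (default 128 MiB). The usual remedy, as done in the chat, is to raise the variable and re-run the ALTER. A rough sketch of that retry logic, not the actual tooling used here; run_alter and set_global are hypothetical stand-ins for calls through a real client connection:

```python
# Sketch: re-run an online ALTER, growing innodb_online_alter_log_max_size
# on each ER_INNODB_ONLINE_LOG_TOO_BIG failure. Helper names are made up;
# only the error code and variable name come from the MySQL docs.

ER_INNODB_ONLINE_LOG_TOO_BIG = 1799
DEFAULT_LOG_MAX_SIZE = 128 * 1024 * 1024  # server default: 128 MiB

def alter_with_retry(run_alter, set_global,
                     start=DEFAULT_LOG_MAX_SIZE, cap=8 * 1024 ** 3):
    """Retry the ALTER, doubling the online alter log size on error 1799."""
    size = start
    while True:
        set_global("innodb_online_alter_log_max_size", size)
        err = run_alter()  # stand-in: returns None on success, else an error code
        if err is None:
            return size    # the size that let the ALTER finish
        if err != ER_INNODB_ONLINE_LOG_TOO_BIG or size >= cap:
            raise RuntimeError(f"ALTER failed with error {err}")
        size *= 2          # grow the modification log and try again
```

On a busy table (the chat mentions a concurrent job rewriting term_search_key) it is often cheaper to just set the variable generously up front than to pay for a failed multi-hour ALTER.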
https://giphy.com/gifs/please-simba-lion-king-PK5CQPd6rCF3y
[06:46:19] sure
[06:46:21] give me a sec
[06:50:59] merged and puppet ran on terbium
[06:51:02] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#4115680 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db2092.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reim...
[06:51:08] marostegui: can you make the /var/log/wikidata/deleteAutoPatrolLogs.log file in terbium? writeable/readable by the $::mediawiki::users::web ?
[06:51:12] in terbium
[06:51:25] I'd go with 775 :D
[06:51:25] yep
[06:51:57] done
[06:52:44] Thank you!
[06:52:48] I keep monitoring this
[06:52:57] cheers!
[06:54:19] Right now, the logging table in commons wiki growth ten times slower and wikidatawiki 100 times. All other wikis will get the change by the next train \o/ I'm already deleting rows everywhere but it takes some time
[06:55:40] nice!!!!!
[06:55:51] let us know when we can optimize logging, specially on big wikis
[06:56:53] yeah sure
[07:28:02] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#4115738 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db2092.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reim...
[07:35:16] I wonder if dbproxy1011.yaml and dbproxy1010.yaml should have the same format?
[07:35:26] ?
[07:35:57] https://github.com/wikimedia/puppet/blob/production/hieradata/hosts/dbproxy1010.yaml https://github.com/wikimedia/puppet/blob/production/hieradata/hosts/dbproxy1011.yaml
[07:36:39] master can only contain 2 hosts
[07:36:49] and weight doesn't make sense
[07:37:10] yaml hashes cannot be ordered
[07:37:21] but feel free to spend time on that, I will not
[07:37:32] Yeah, I was wondering if we can make them more alike
[07:37:46] send a proposal
[07:37:53] as right now depooling labsdb1009 is different thatn depooling labsdb1011
[07:37:56] with code, of course!
[07:38:08] not a big deal, but just wondering :)
[07:38:21] I say that if you spend the time, I am ok
[07:38:27] :)
[07:38:38] (and it is correct)
[07:39:29] I would prefer if sanitarium puppet was fixed first
[07:39:32] I was answering an email from brooke with some questions and realised that it could be confusing, that's all :)
[07:39:34] which is part of the goal
[07:39:55] sanitarium_multiinstance
[07:46:07] 10DBA, 10Collaboration-Team-Triage, 10Operations, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#4115753 (10jcrespo) That last suggestion looks like a blocker to me, at least to check it before doin...
[07:48:19] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#4115766 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2092.codfw.wmnet'] ``` and were **ALL** successful.
[07:55:08] the deleteAutoPatrolLogs works fine for wikidata, btw. you can delete /var/log/wikidata/rebuildTermSqlIndex.log* to save some space on terbium
[07:55:40] it is 0 bytes :)
[07:55:55] Ah, it has multple
[07:55:56] I see
[07:56:09] I will nuke them!
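[Editor's note] The "yaml hashes cannot be ordered" remark above refers to the YAML spec: a mapping is an unordered set of key/value pairs, so a plain hieradata hash cannot encode, say, master-first ordering. JSON objects have the same semantics, so the point can be illustrated with the stdlib json module (PyYAML is not assumed to be available); the host names are taken from the chat:

```python
# Two mappings that differ only in key order are the same value: order
# carries no meaning, which is why ordering intent needs a list (or an
# explicit field such as the "master" key mentioned above) rather than
# relying on how the hash happens to be written in the .yaml file.
import json

a = json.loads('{"labsdb1009": {"weight": 1}, "labsdb1011": {"weight": 1}}')
b = json.loads('{"labsdb1011": {"weight": 1}, "labsdb1009": {"weight": 1}}')
assert a == b  # equal regardless of key order
```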
[07:56:43] \o/
[09:28:08] The ALTER TABLE has began
[09:28:14] cool
[12:22:18] Running the script from terbium on commonswiki gives me "SLOW TIMER" notice (21 seconds on each batch), should I stop it?
[12:25:01] yes, make the batch smaller
[12:25:39] they are supposed to take no more than 1 second
[12:27:21] I think mediawiki also will kill writes longe than 3-5 seconds
[12:30:09] The writes are fast, finding them is hard
[12:31:11] if they are on the same transaction && server, there is not much of a difference- if they are on separate transactions/servers, then no problem
[12:32:01] Made the batch half, let's see what happens. The script uses a replica to get log_id s to delete and use another transaction to delete
[12:32:08] AND the selects are using the "slow" queries relplicas
[12:32:36] which are not slower, they are for slow queries like those
[12:36:21] noted
[12:48:41] Nothing worked, will do commons later
[13:21:01] 10DBA, 10Operations, 10ops-eqiad: Rack and setup 8 new eqiad DBs - https://phabricator.wikimedia.org/T191792#4116638 (10Marostegui) p:05Triage>03Normal
[13:54:31] 10DBA, 10Gerrit, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532#4116752 (10Marostegui)
[13:54:34] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#4116747 (10Marostegui) 05Open>03Resolved
[13:55:10] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3689635 (10Marostegui) All the hosts are now pooled.
[13:55:32] marostegui hi does that mean gerrit2001 can connect to the db now ? :)
[13:55:44] ?
[13:56:08] marostegui that task was a subtask of https://phabricator.wikimedia.org/T176532#4116752
[13:56:33] paladox: we still don't have proxies
[13:56:38] oh ok
[13:57:01] marostegui is there a task for the proxies?
[13:57:38] No, we first need to order them and all that
[13:58:38] oh ok
[13:58:48] We are working out eqiad proxies first, as those have more priority :)
[14:03:39] paladox: I gave you a date in the past about when we plan to handle that
[14:03:46] don't you remember?
[14:04:19] yep, i thought that, that task was part of it. But wasen't.
[14:04:33] asking more time will not make it happen faster :-)
[14:04:37] *times
[14:04:55] technically it is
[14:05:09] but that is only for the underlying servers, we still need the proxies
[14:06:12] ok
[14:27:01] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10Epic: Make wb_terms table fancy - https://phabricator.wikimedia.org/T188992#4116879 (10Lucas_Werkmeister_WMDE)
[14:54:39] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4117113 (10Papaul) p:05Triage>03Normal
[14:58:30] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4117116 (10Papaul) @Marostegui which disk you want to replace first? 1 or 4 ?
[14:58:58] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4117117 (10Marostegui) Oh, I didn't know there were two of them broken.
Let me check
[14:59:07] I prefer the failed one first
[14:59:12] (4)
[14:59:19] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4117118 (10Marostegui) @Papaul disk #4
[14:59:30] Yeah, I thought the #1 failed already
[14:59:41] But it is still as predicrtive faulure
[15:05:23] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4117150 (10Papaul) a:05Papaul>03Marostegui complete
[15:08:31] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4117168 (10Marostegui) Thanks ``` physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, Rebuilding) ``` Once this is completed we can replace #1
[16:28:20] marostegui: https://gerrit.wikimedia.org/r/#/c/425087/
[16:29:20] oh nice!!
[16:29:20] (not ready yet, but just FYI)
[16:29:25] So cool!
[16:29:28] I will take a look :)
[16:42:07] jynus: what day/time works for you to talk over the m5 things from T189542?
[16:42:07] T189542: Update updatequerypages::cronjob and refreshlinks::cronjob now that silver no longer has a database - https://phabricator.wikimedia.org/T189542
[16:43:01] any time not too late for us
[16:43:06] (CEST)
[16:43:13] I think that andrew and I can talk it all out with you.
[16:43:23] * bd808 looks at calendar
[16:43:31] when do you start your day normally?
[16:43:49] I try to start ~15:00 UTC
[16:44:37] 30 minutes tomorrow at 15:30 UTC?
[16:45:12] that collides with the WMCS team meeting. My mornings are a mess this week apparently.
[16:45:22] ok, another day, another hour?
[16:46:01] 15:30 on Thursday would work for me and it looks like Andrew too
[16:46:09] ok, sending invite
[16:47:45] see you
[16:48:01] thanks jynus
[17:58:03] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4117864 (10Marostegui) a:05Marostegui>03Papaul @Papaul please change disk #1 as disk #4 got rebuilt finely ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay...
[18:49:53] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#4117968 (10Krinkle)
[18:52:37] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#4117973 (10Krinkle)
[19:03:24] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4118008 (10Marostegui)
[19:04:54] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4114784 (10Marostegui) I see the disk already being rebuilt Thanks @Papaul ``` physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuilding) ```
[19:13:40] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4118043 (10Papaul) @Marostegui no problem
[20:16:42] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4118365 (10Marostegui) @Papaul this disk failed, was it an used one? Can you try another one?
``` physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Failed) ``` Thanks
[20:17:21] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4118381 (10Marostegui)
[20:21:20] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4118408 (10Papaul) Ok will do once back at the DC in the AM
[20:27:41] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2069 - https://phabricator.wikimedia.org/T191720#4118426 (10Marostegui) Thanks!
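[Editor's note] Throughout the db2069 RAID saga above, the rebuild/failure state is read off HP controller lines such as "physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuilding)". A small parser for exactly that line shape, useful for turning the pasted controller output into a drive-to-status map; this is a sketch based only on the snippets in this log, not on the full output grammar of the HP utility:

```python
# Sketch: map each physicaldrive id (e.g. "1I:1:1") to the last
# comma-separated field inside its parentheses, which in these snippets is
# the status: OK, Rebuilding, Failed, etc.
import re

DRIVE_RE = re.compile(r"physicaldrive (\S+) \(([^)]*), ([^,)]+)\)")

def drive_states(output):
    """Parse controller output into {drive_id: status}."""
    return {m.group(1): m.group(3).strip()
            for m in DRIVE_RE.finditer(output)}
```

Something like this could back a quick check that a hot-swapped disk actually entered Rebuilding (as checked manually at 19:04:54) instead of going straight to Failed (as happened at 20:16:42 with the first replacement for bay 1).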