[06:22:54] DBA, Operations, ops-eqiad: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160#3874338 (Marostegui)
[06:33:10] DBA, Operations, ops-eqiad: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160#3874356 (Marostegui) ``` Time: Fri Nov 24 23:39:07 2017 Event Description: Battery started charging Time: Fri Nov 24 23:46:42 2017 Event Description: Battery charge complete Time: Sun Nov 26 08:04:47 20...
[06:41:15] DBA, Operations, ops-eqiad: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160#3874359 (Marostegui) After the manual relearn: ``` ˜/icinga-wm 7:37> RECOVERY - MegaRAID on db1059 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy ``` Don't know for how long it will last
[06:43:00] DBA, Operations, ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3874361 (Marostegui) This host failed again and recovered itself: ``` 03:16 < icinga-wm> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, cu...
[06:45:06] DBA, Operations, ops-eqiad: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160#3874362 (Marostegui) p:Triage>Normal
[07:03:49] DBA, Operations, ops-eqiad, Patch-For-Review: Rack and setup db1113 and db1114 - https://phabricator.wikimedia.org/T182896#3874373 (Marostegui) a:Cmjohnson>Marostegui
[07:07:34] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3874375 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1113.eqiad.wmnet', 'db111...
[07:11:04] DBA, Operations, ops-eqiad: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160#3874377 (Marostegui) a:Cmjohnson ``` PROBLEM - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough ``` We should replace the BBU
[07:24:16] DBA, Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3874384 (Marostegui) a:Marostegui>None
[07:43:42] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3874390 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1113.eqiad.wmnet', 'db1114.eqiad.wmnet'] ``` and were **ALL** successful.
[07:55:38] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3874396 (Marostegui) I put the wrong task ID, it was meant to be T182896 Sorry!
[07:56:26] DBA, Operations, ops-eqiad, Patch-For-Review: Rack and setup db1113 and db1114 - https://phabricator.wikimedia.org/T182896#3837888 (Marostegui)
[08:20:41] DBA: Productionize 2 new eqiad database servers - https://phabricator.wikimedia.org/T184161#3874417 (Marostegui) p:Triage>Normal
[08:21:53] DBA: Productionize 2 new eqiad database servers - https://phabricator.wikimedia.org/T184161#3874432 (jcrespo) Maybe solve T182896 if we leave this open?
[08:23:22] DBA, Patch-For-Review: Productionize 2 new eqiad database servers - https://phabricator.wikimedia.org/T184161#3874448 (Marostegui) >>! In T184161#3874432, @jcrespo wrote: > Maybe solve T182896 if we leave this open?
hehe yeah, I was about to :)
[08:24:29] DBA, Operations, ops-eqiad, Patch-For-Review: Rack and setup db1113 and db1114 - https://phabricator.wikimedia.org/T182896#3874449 (Marostegui) Open>Resolved
[08:48:50] DBA, Data-Services: Discrepancies with logging table on different wikis - https://phabricator.wikimedia.org/T71127#3874511 (jcrespo) As far as I can see, mediawiki doesn't define `log_title_time` nor `log_title_type_time` ( see https://phabricator.wikimedia.org/source/mediawiki/browse/master/maintenance/...
[08:57:34] DBA, Data-Services: Discrepancies with logging table on different wikis - https://phabricator.wikimedia.org/T71127#3874526 (jcrespo) It is important to separate production from wikireplicas- wikireplicas can have additional indexes, production should not- or we should patch mediawiki first.
[13:51:32] see my comment on https://gerrit.wikimedia.org/r/401436 I just remembered that
[13:52:25] but it is a hard dependency?
[13:52:27] I did https://gerrit.wikimedia.org/r/#/c/391198/ in the past, but I will abandon it
[13:52:32] it is
[13:52:36] :(
[13:52:51] should we CC Amir1? maybe he can help
[13:52:57] CC why?
[13:53:08] maybe he can help
[13:53:15] as he was going to help during the failover
[13:53:19] we just need to add the extra things of 391198 into your patch
[13:53:30] deploy at the same time
[13:54:03] ok
[13:54:28] except the db-common
[13:54:37] just the dblist and the noc stuff
[14:06:50] i have sent a new patch
[14:07:02] trying to understand why jenkins doesn't like it
[14:11:48] see my comment
[14:12:06] yeah, that is what I was actually checking
[14:12:24] $this->assertEquals( $linkDestination, '../../../dblists/' .
$fname ); fails
[14:13:00] they may have changed files to links since I uploaded my patch
[14:14:01] Actually no, see how I linked it, too: https://gerrit.wikimedia.org/r/#/c/391198/5/docroot/noc/conf/s8.dblist
[14:16:36] yeah, that was it, pretty ugly :_(
[14:22:20] DBA, Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#3875191 (chasemp) What were the old query killer thresholds? We could start with those and add 50% to see how it goes.
[14:25:33] DBA, Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#3875209 (jcrespo) > What were the old query killer thresholds It depended on the load, at times it was as low as 1 hour- when on high load-, at other times it was 4 hours. The whole po...
[14:29:59] DBA, Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#3875216 (chasemp) yep understood, the way I'm interpreting is that there is no objective right answer here. At first blush I would be inclined to take queues from that. Web query kill...
[14:34:38] DBA, Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#3875255 (jcrespo) > Web query killer at 1h and analytics at 4h and we see how it goes. I've implemented that, unpuppetized waiting for your feedback. When happy, I will puppetize and s...
[14:39:46] is the event the one called: wmf_labs_slow_duplicates ?
[14:50:11] the event?
[14:50:17] I don't understand
[14:50:51] oh, there are indeed events, but I do not know how/if they worked
[14:51:01] I want to move to pt-kill anyway
[14:51:15] like pt-heartbeat, processes that are more observable
[14:51:26] yeah, I was referring to the events, if you were configuring them for this
[14:51:31] yeah, pt-kill for the win
[14:51:43] I have not touched them
[14:51:55] and I do not know if they are installed at all
[14:52:03] I know they are on the software repo
[14:52:18] yeah, I thought you were using them to set the query limits :)
[14:58:01] Blocked-on-schema-change, DBA, Data-Services, Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3875292 (Marostegui) s4 master finished
[14:58:10] Blocked-on-schema-change, DBA, Data-Services, Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3875294 (Marostegui)
[15:27:25] DBA, Operations, Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3875352 (Marostegui) Open>Resolved >>! In T180788#3854799, @Marostegui wrote: > This has been all set. > Servers replicate between each other (db1111 being the master). > They con...
[15:55:34] I could have accidentally deleted db2029 from tendril
[15:56:03] as long as it is only db2029 :)
[15:56:33] I was missing a whole branch from tendril/dbtree
[15:57:16] upgrading multi-instance hosts is trickier than it looks
[15:57:23] especially for kernel updates
[15:59:43] yeah, it is a massive pain XD
[15:59:49] because we are not used to it
[16:00:17] a normal mysql update is actually not that painful
[16:00:27] you can upgrade just one of the 2 instances
[18:30:29] DBA, Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#3876045 (madhuvishy) @jcrespo Sounds great! Let's puppetize and tweak later if needed.
Thank you :)
[18:58:59] DBA, Collaboration-Team-Triage, Notifications, Schema-change: Review new Echo table for user group expiration - https://phabricator.wikimedia.org/T168107#3876192 (MaxSem) stalled>Invalid Not pursuing this anymore.
[19:04:49] DBA, Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#3876215 (chasemp) >>! In T183983#3876045, @madhuvishy wrote: > @jcrespo Sounds great! Let's puppetize and tweak later if needed. Thank you :) +1
[19:26:10] DBA, Operations, ops-codfw: db2054: Disk with predictive failure - https://phabricator.wikimedia.org/T183887#3876328 (Papaul) a:Papaul>Marostegui Disk replacement complete
[19:34:18] DBA, Operations, ops-codfw: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T184210#3876339 (Peachey88)
[19:42:27] DBA, Operations, ops-codfw: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T184210#3876353 (Marostegui) Open>declined Thanks - this is because we replaced a disk which was on predicted failure: T183887