[05:45:04] DBA, Operations, ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3317260 (Marostegui) This went back to faulty again: ``` BatteryType: BBU Battery State: Unknown Battery backup charge time : 0 hours ``` Raid went back to WriteThrough: ``` Default Cache...
[05:54:03] DBA, Operations, ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3317261 (Marostegui) And it is back: ``` 05:51 < icinga-wm> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy Default Cache Policy: WriteBack, Read...
[05:55:58] Blocked-on-schema-change, DBA, Patch-For-Review: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278#3317265 (Marostegui)
[07:00:09] DBA, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3317309 (elukey) >>! In T166141#3315357, @jcrespo wrote: > Not really, we have almost decided the goals for Q1, and they are all quite urgent and for hardware that ha...
[07:10:47] DBA, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3317327 (Marostegui) >>! In T166141#3317309, @elukey wrote: >>>! In T166141#3315357, @jcrespo wrote: >> Not really, we have almost decided the goals for Q1, and they...
[07:12:46] DBA, Operations, ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3317328 (Marostegui) And again: ``` ˜/icinga-wm 9:11> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough ```
[08:02:46] I stopped slave on db2035 so it doesn't get the alters replicated, so you can reimage it now if you like
[08:03:22] did you request a learning cycle on db1016?
[08:03:41] I will do db2035 now, then
[08:03:43] yes
[08:03:48] cool
[08:04:02] Worked early in the morning, we will see if it works again
[08:14:25] DBA, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3317379 (elukey) Sure I am concerned too, this is why I asked if it was possible to order the hardware as soon as possible to be ready to work on it by the end of Q1 :)
[09:02:06] DBA, Operations, ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3317454 (Marostegui) db1075 the master is done - the whole shard is completed.
[09:02:23] DBA, Operations, ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3317455 (Marostegui) ^ Wrong ticket
[09:02:56] Blocked-on-schema-change, DBA, Patch-For-Review: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278#3317458 (Marostegui) db1075 the master is done - the whole shard is completed.
[09:03:10] Blocked-on-schema-change, DBA, Patch-For-Review: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278#3317459 (Marostegui)
[09:03:19] Blocked-on-schema-change, DBA, MW-1.28-release (WMF-deploy-2016-08-30_(1.28.0-wmf.17)), MW-1.28-release-notes, Patch-For-Review: Clean up revision UNIQUE indexes - https://phabricator.wikimedia.org/T142725#3317462 (Marostegui)
[09:03:21] Blocked-on-schema-change, DBA, Patch-For-Review: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278#3290974 (Marostegui) Open>Resolved
[09:06:24] DBA, Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3317475 (Marostegui)
[09:06:25] Blocked-on-schema-change, DBA, MW-1.28-release (WMF-deploy-2016-08-30_(1.28.0-wmf.17)), MW-1.28-release-notes, Patch-For-Review: Clean up revision UNIQUE indexes - https://phabricator.wikimedia.org/T142725#3317472 (Marostegui) Open>Resolved a:Marostegui All the shards are now done.
[09:16:34] Blocked-on-schema-change, DBA, MW-1.28-release (WMF-deploy-2016-08-30_(1.28.0-wmf.17)), MW-1.28-release-notes, Patch-For-Review: Clean up revision UNIQUE indexes - https://phabricator.wikimedia.org/T142725#3317495 (Marostegui) Thousands of alters have been done and checked but we could have m...
[10:20:11] DBA, Operations: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#1761444 (Marostegui) I have installed `wmf-mariadb101_10.1.23-1_amd64.deb` on a fresh stretch to play around with it - will get back to you if I see issues!
[10:20:32] DBA, MediaWiki-extensions-WikibaseClient, Wikidata, Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#2825398 (Ladsgroup) Just to clarify, Is [[https://gerrit.wikimedia.org/r/#/c/336542 | gerrit:336542]] is the only op...
[10:27:21] DBA, Analytics, Analytics-EventLogging: db1047 has been restarted - needs another restart - https://phabricator.wikimedia.org/T166452#3317854 (Marostegui) Open>Resolved The scope of this ticket is done - pending is the ALTER table to unify revision so we can run pt-table-checksum for enwiki o...
[10:40:35] DBA, Operations: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#3317984 (jcrespo) I have to package 10.1.24 and fix some things- coming soon.
[11:14:15] DBA, Operations, ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T166853#3318112 (Marostegui) >>! In T166853#3311393, @jcrespo wrote: > This one is also showing the following alarm- > > > ``` > Sensor Type(s) Temperature Status: Critical [Power Unit 2 18-VR P2 = Critical, Po...
[12:18:25] DBA, Operations, ops-codfw: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#3318345 (Marostegui) Open>Resolved Going to close this for now as we had no more crashes lately.
[12:24:22] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure: Rebuild old timestamp format tables - https://phabricator.wikimedia.org/T151607#3318347 (Marostegui) stalled>Resolved Closing this as we are not moving tablespaces anymore. We would need to make sure that this isn't an issue wh...
[12:24:25] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#3318349 (Marostegui)
[12:45:12] DBA, Operations, ops-eqiad: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3318380 (Cmjohnson) @Marostegui The battery is here...let me know when you want to replace
[12:53:14] DBA, Operations, ops-eqiad: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3318391 (Marostegui) @Cmjohnson I will depool the server now and ping you once it is down.
[13:21:52] DBA, Operations, ops-eqiad, Patch-For-Review: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3318469 (Marostegui) Open>Resolved a:Cmjohnson All good now - thanks Chris! ``` Cache Backup Power Source: Batteries Battery/Capacitor...
[13:46:02] DBA, Operations, Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3318555 (jcrespo) Open>stalled
[16:20:24] DBA, Operations, ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T166853#3319296 (Papaul) a:Papaul>Marostegui Disk replacement complete
[16:41:24] DBA, Operations, ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T166853#3319375 (jcrespo) `Rebuilding`, will resolve once it is done.
[18:14:34] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3319781 (jcrespo) I have reloaded dbproxy1011 configuration, heads up in case the wrong db is being pointed at (downtime expired).
[18:17:18] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3319866 (jcrespo) Maybe it didn't expire but got lost, but same thing applies.
[18:40:33] DBA, Operations, ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T166853#3320042 (jcrespo) Open>Resolved
[19:51:55] so downtimes got lost
[19:53:09] on the bright side, codfw mediawiki reimages finished