[05:33:20] DBA: db2034 crash - https://phabricator.wikimedia.org/T137084#2356666 (RobH)
[05:36:48] DBA: db2034 crash - https://phabricator.wikimedia.org/T137084#2356683 (RobH) I've rebooted the host in an attempt to return it back online. This should be flagged into notes for the host history (we don't really have a good way to do that now.) For now I'm setting it to high priority and assigned to @jcre...
[05:39:30] DBA: db2034 crash - https://phabricator.wikimedia.org/T137084#2356685 (RobH) P3211 has the ilom log
[05:44:02] DBA: db2034 crash - https://phabricator.wikimedia.org/T137084#2356686 (RobH) mysql isn't online, but im not sure if its as simple as just manually starting it, or if it has to be manually checked/synced. Since db2034 crashed and wasn't cleanly shut down, I don't want to assume I should just restart the db/m...
[06:26:14] DBA: db2034 crash - https://phabricator.wikimedia.org/T137084#2356666 (jcrespo) p:Triage>High
[06:30:49] DBA, Operations, ops-codfw: db2034 crash - https://phabricator.wikimedia.org/T137084#2356708 (jcrespo) It seems there was a RAID controller failure: > A controller failure event occurred prior to this power-up We had similar issues on T130702. We may need a general upgrade of all machines with simi...
[06:39:13] DBA, Operations, ops-codfw: db2034 degraded RAID - https://phabricator.wikimedia.org/T136583#2356722 (jcrespo) a:jcrespo>Papaul This host crashed today: T137084 due to a RAID controller failure. Are we still sure this was safe? Papaul, could you please follow up with support?
[06:41:21] DBA, Operations, ops-codfw: db2034 crash - https://phabricator.wikimedia.org/T137084#2356726 (jcrespo) This host being down was creating log noise due to health checks (no users affected): https://logstash.wikimedia.org/#dashboard/temp/AVUkao15_LTxu7wl9U3S
[08:20:44] DBA, Performance-Team, Availability, Epic, Patch-For-Review: MASTER_POS_WAIT() alternative that works cross-DC - https://phabricator.wikimedia.org/T135027#2356817 (jcrespo)
[08:20:46] DBA: Change dbstore1001 delayed slave to be a direct slave of the eqiad masters - https://phabricator.wikimedia.org/T133386#2356818 (jcrespo)
[08:20:48] DBA, Operations, Epic: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626#2356819 (jcrespo)
[08:20:52] DBA, MediaWiki-Database, Operations, Performance: Implement GTID replication on MariaDB 10 servers - https://phabricator.wikimedia.org/T133385#2356814 (jcrespo) Open>Resolved a:jcrespo GTID rolled in on all production coredb servers. Resolving now, although it will still be applied to...
[08:20:54] DBA, Availability: Look into Maria 10 parallel-replication - https://phabricator.wikimedia.org/T85266#2356820 (jcrespo)
[08:31:58] DBA, Labs: Wrong page title in labs database replica enwiki page table - https://phabricator.wikimedia.org/T136618#2341449 (jcrespo) After seeing many cases like this, I can conclude that replication to labs breaks whenever there is a page move, an archival or an undeletion. I have not yet clear why, but...
[08:33:54] DBA, Labs: Wrong page title in labs database replica enwiki page table - https://phabricator.wikimedia.org/T136618#2356836 (jcrespo) Of course, I can fix individual reports, although we should setup a more convenient way than a ticket per row with problems.
[09:15:21] DBA, Operations, ops-codfw: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2356875 (jcrespo) Tuesday, whenever you start working and are available (my afternoon)?
[09:22:29] DBA, Analytics: dbstore1002 crashed - https://phabricator.wikimedia.org/T136333#2356887 (jcrespo) Open>Resolved a:jcrespo
[09:24:16] DBA: db1034 was killed 22-05-16 at 14:17:06 - https://phabricator.wikimedia.org/T135944#2356891 (jcrespo) Open>Resolved a:jcrespo
[09:39:31] DBA: Identical EventLogging queries give different results on db1047 and dbstore1002 - https://phabricator.wikimedia.org/T131236#2356947 (jcrespo) Open>Resolved ``` MariaDB db1047 log > SELECT COUNT(*) AS events FROM log.NavigationTiming_14899847 WHERE timestamp like '20151203%'; +--------+ | events...
[12:08:17] Blocked-on-schema-change, DBA, Notifications: Temporary index for Echo backfillReadBundles.php? - https://phabricator.wikimedia.org/T137100#2357217 (Catrope)
[13:09:33] Blocked-on-schema-change, DBA, Notifications: Temporary index for Echo backfillReadBundles.php? - https://phabricator.wikimedia.org/T137100#2357379 (jcrespo) > Is adding a temporary index for this kind of thing recommended / a thing we do? Is something we can do indeed, but please combine it with th...
[13:11:07] DBA, Notifications, Schema-change: Temporary index for Echo backfillReadBundles.php? - https://phabricator.wikimedia.org/T137100#2357382 (jcrespo) (this is not yet a proper DBA request -it is in planning phase-, please create a proposal of several actions to do and re-add the tag when ready)
[15:14:46] DBA, Notifications, Schema-change: Temporary index for Echo backfillReadBundles.php? - https://phabricator.wikimedia.org/T137100#2357613 (Catrope) >>! In T137100#2357379, @jcrespo wrote: >> Is adding a temporary index for this kind of thing recommended / a thing we do? > > So, please try to minimize...
[15:17:16] DBA, Notifications, Schema-change: Temporary index for Echo backfillReadBundles.php? - https://phabricator.wikimedia.org/T137100#2357623 (jcrespo) > I'll line up all the things we talked about in the previous task, but unfortunately we can't fix the over-indexing until after this migration is complet...