[02:19:16] DBA, Operations: db1075 (s3 master) crashed - https://phabricator.wikimedia.org/T233534 (wiki_willy)
[02:20:10] DBA, Operations, ops-eqdfw: db1075 (s3 master) crashed - https://phabricator.wikimedia.org/T233534 (wiki_willy) a:Cmjohnson
[05:00:23] DBA, Operations, ops-eqdfw: db1075 (s3 master) crashed - https://phabricator.wikimedia.org/T233534 (Marostegui)
[05:00:26] DBA, Operations, Patch-For-Review: Switchover s3 primary database master db1075 -> db1078 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 (Marostegui)
[05:06:31] DBA, Operations, ops-eqiad: db1075 (s3 master) crashed - https://phabricator.wikimedia.org/T233534 (Marostegui) p:Triage→High
[05:10:42] DBA, Operations: Batch db1074-db1079 hosts having BBU issues - https://phabricator.wikimedia.org/T233569 (Marostegui)
[05:11:12] DBA, Operations: Batch db1074-db1079 hosts having BBU issues - https://phabricator.wikimedia.org/T233569 (Marostegui)
[05:11:15] DBA, Operations, ops-eqiad: db1075 (s3 master) crashed - https://phabricator.wikimedia.org/T233534 (Marostegui)
[05:13:05] DBA, Operations, Patch-For-Review: Switchover s3 primary database master db1075 -> db1078 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 (Marostegui) db1075 (the current master) crashed yesterday with BBU issues {T233534}. db1078 is also part of the same batch of hosts that have h...
[05:15:16] DBA, Operations, ops-eqiad: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (Marostegui)
[05:16:12] DBA, Operations: Batch db1074-db1079 hosts having BBU issues - https://phabricator.wikimedia.org/T233569 (Marostegui) p:Triage→Normal
[05:21:49] DBA, Operations: Batch db1074-db1079 hosts having BBU issues - https://phabricator.wikimedia.org/T233569 (Marostegui)
[05:22:14] DBA, Operations, Patch-For-Review: Decommission db1066.eqiad.wmnet - https://phabricator.wikimedia.org/T233071 (Marostegui)
[05:23:06] DBA, Operations, ops-eqiad, Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (Marostegui)
[06:20:03] DBA, Wikimedia-Incident: Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366 (Marostegui)
[06:22:58] DBA, Operations, Wikimedia-Incident: Batch db1074-db1079 hosts having BBU issues - https://phabricator.wikimedia.org/T233569 (Marostegui)
[06:29:21] DBA, Operations, ops-eqiad, Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (Marostegui) I am starting to write the Incident Report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20190923-s3_primary_db_master_crashed_-_s3_wikis...
[06:37:09] DBA, Operations, Patch-For-Review: Switchover s3 primary database master db1075 -> db1078 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 (Marostegui) db1123 (current recentchanges, logpager etc) s3 slave is in D8, so thus not affected by the PDU maintenance, so maybe we should fai...
[06:43:52] DBA, Operations, Patch-For-Review: Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 (Marostegui)
[07:40:20] DBA, Operations, Patch-For-Review, Wikimedia-Incident: Batch db1074-db1079 hosts having BBU issues - https://phabricator.wikimedia.org/T233569 (Marostegui)
[07:42:52] DBA, Operations, ops-codfw: db2127 memory issues - https://phabricator.wikimedia.org/T233184 (Marostegui) Open→Resolved a:Papaul HW logs look clean, closing this! Thanks @Papaul for catching this!
[08:11:14] DBA, Core Platform Team, Epic, Tracking-Neverending: Tracking task for mariadb optimizer misbehaviours - https://phabricator.wikimedia.org/T233579 (Marostegui)
[08:12:18] DBA, Core Platform Team, Epic, Tracking-Neverending: Tracking task for mariadb optimizer misbehaviours - https://phabricator.wikimedia.org/T233579 (Marostegui) p:Triage→Normal Please feel free to edit this task to keep adding past issues (I have added the most recent cases I remember)
[08:13:03] DBA, Core Platform Team, Epic, Tracking-Neverending: Tracking task for mariadb optimizer misbehaviours - https://phabricator.wikimedia.org/T233579 (Marostegui)
[08:13:07] DBA, Core Platform Team Legacy (Watching / External), Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review, Performance: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 (Marostegui)
[08:13:10] DBA, MediaWiki-API, Performance, User-Marostegui: list=logevents slow for users with last log action long time ago - https://phabricator.wikimedia.org/T71222 (Marostegui)
[09:00:00] DBA, Core Platform Team, Epic, Tracking-Neverending: Tracking task for mariadb optimizer misbehaviours - https://phabricator.wikimedia.org/T233579 (Aklapper) //[Please do not create more tracking tasks.](https://www.mediawiki.org/wiki/Phabricator/Project_management/Tracking_tasks) Create [project...
[09:10:52] DBA, Core Platform Team, Epic, Tracking-Neverending: Tracking task for mariadb optimizer misbehaviours - https://phabricator.wikimedia.org/T233579 (Marostegui) >>! In T233579#5515112, @Aklapper wrote: > //[Please do not create more tracking tasks.](https://www.mediawiki.org/wiki/Phabricator/Proje...
[10:14:01] if you have a min this afternoon, I'd ask some info about https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/538045/ :)
[13:46:32] DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui)
[13:47:17] DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui)
[13:47:19] Blocked-on-schema-change, DBA, Core Platform Team: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 (Marostegui)
[13:47:27] DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui) p:Triage→Normal
[13:53:00] DBA, Operations, ops-eqiad: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (Marostegui) Any ETA on when the request will be sent to Dell? Thanks!
[14:03:54] DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Anomie) Confirmed the new schema versus tables.sql. > `log_timestamp` varbinary(14) NOT NULL DEFAULT '19700101000000', Apparently we have P8433-style issues with other tables too. tables.sql specifies...
[14:07:35] DBA: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 (Marostegui) >>! In T233625#5516202, @Anomie wrote: > Confirmed the new schema versus tables.sql. > >> `log_timestamp` varbinary(14) NOT NULL DEFAULT '19700101000000', > > Apparently we have P8433-styl...
[14:46:48] DBA: Drop frwiki.archive_save table - https://phabricator.wikimedia.org/T233187 (Marostegui) a:Marostegui Taken a temporary backup of this just in case: ` root@cumin1001:/home/marostegui/T233187# ls -lh archive_save.sql -rw-r--r-- 1 root root 945M Sep 23 14:46 archive_save.sql ` Will drop this table tom...
[18:43:52] DBA, Operations, ops-eqiad, Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (Krenair) >>! In T233534#5514692, @Marostegui wrote: > I am starting to write the Incident Report: https://wikitech.wikimedia.org/wiki/Incident_documentation/201909...
[18:56:12] DBA, Operations, ops-eqiad, Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (Marostegui) >>! In T233534#5517195, @Krenair wrote: >>>! In T233534#5514692, @Marostegui wrote: >> I am starting to write the Incident Report: https://wikitech.wik...
[19:14:49] DBA, Operations, ops-eqiad, Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (Krenair) I'm wondering if an entry should be added under "Where did we get lucky?" along the lines of "I/We noticed this incident before SMS paging begun". Was th...