[05:30:41] DBA, Data-Services: labsdb1009 crashed - https://phabricator.wikimedia.org/T191149#4096387 (Marostegui) Just for the record - replication is broken on a few threads: ``` root@labsdb1009:~# mysql --skip-ssl -e "show all slaves status\G" | egrep "Connection|Seconds|Last" Connection_name: db10...
[06:03:53] DBA, Epic: Meta ticket: Migrate multi-source database hosts to multi-instance - https://phabricator.wikimedia.org/T159423#4096419 (Marostegui)
[06:07:51] DBA, Epic, Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#4096434 (Marostegui)
[06:07:53] DBA: Drop localisation and localisation_file_hash tables, l10nwiki databases too - https://phabricator.wikimedia.org/T119811#4096432 (Marostegui) Open>Resolved Thanks for letting us know. I have removed the `localisation` table and checked across all the servers in all the shards!
[07:07:32] DBA, DC-Ops, Operations, media-storage, ops-codfw: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4094875 (Marostegui) >>! In T191129#4094901, @Volans wrote: > I've agreed with @RobH on IRC that this is not UBN for now for the #dba part. > > Although assessing the situation...
[07:53:17] DBA, Data-Services: labsdb1009 crashed - https://phabricator.wikimedia.org/T191149#4095437 (jcrespo)
[08:08:16] DBA, Operations, ops-codfw: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4096579 (Marostegui)
[08:08:38] DBA, Data-Services, Patch-For-Review: labsdb1009 crashed - https://phabricator.wikimedia.org/T191149#4096598 (jcrespo) a:jcrespo
[08:08:55] DBA, DC-Ops, Operations, media-storage, ops-codfw: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4096600 (Marostegui) I have created: T191193 to track the masters movement
[08:21:21] DBA, Operations, ops-codfw: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4096614 (Peachey88)
[08:33:17] I am going to format /srv on labsdb1009, can you check if there is something there that you would like to keep?
[08:33:24] sure
[08:33:48] checking now
[08:34:03] there are some things at /srv/tmp
[08:34:12] yeah, I am checking those
[08:34:52] I think they can go
[08:35:26] ok, formatting /srv at 1009, make sure you cd into another partition
[08:35:41] done :)
[08:41:33] DBA, Data-Services: labsdb1009 crashed - https://phabricator.wikimedia.org/T191149#4096624 (jcrespo)
[08:57:02] DBA, Data-Services: labsdb1009 crashed - https://phabricator.wikimedia.org/T191149#4096635 (jcrespo) @ayounsi : As part of this emergency recovery, we are copying 7T at ~240MBytes/s from labsdb1011 to labsdb1009, it will take around 8 hours.
[11:39:29] DBA, MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4096772 (Marostegui) >>! In T187089#3964266, @Anomie wrote: > Something else to consider would be to add...
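Editor's note on the 08:57:02 message: the "around 8 hours" figure can be checked with simple arithmetic. The sketch below is only an illustration; the function name `transfer_hours` is hypothetical, and the unit interpretation (decimal versus binary prefixes for "7T" and "MBytes") is an assumption, since the log does not say which was meant.

```python
# Rough transfer-time estimate using the figures quoted in the log
# (7T copied at ~240MBytes/s). The unit interpretation is an assumption.

def transfer_hours(total_bytes: float, bytes_per_second: float) -> float:
    """Return the hours needed to copy total_bytes at a constant rate."""
    return total_bytes / bytes_per_second / 3600


# Decimal prefixes: 7 * 10^12 bytes at 240 * 10^6 bytes/s.
decimal_hours = transfer_hours(7 * 10**12, 240 * 10**6)

# Binary prefixes: 7 * 2^40 bytes at 240 * 2^20 bytes/s.
binary_hours = transfer_hours(7 * 2**40, 240 * 2**20)

print(f"decimal units: {decimal_hours:.1f} h, binary units: {binary_hours:.1f} h")
```

Under either interpretation the result is a little over 8 hours, consistent with the estimate given in the message, assuming the rate stays constant for the whole copy.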
[11:52:09] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#4096777 (Marostegui)
[11:52:25] Blocked-on-schema-change, DBA, MediaWiki-Database, Multi-Content-Revisions, and 3 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#4096778 (Marostegui)
[11:52:39] DBA, MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4096779 (Marostegui)
[11:54:21] DBA, MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#3984786 (Marostegui) s3 eqiad progress: [] labsdb1009 [] labsdb1010 [] labsdb1011 [] dbstore1002 [] db1...
[11:54:31] Blocked-on-schema-change, DBA, MediaWiki-Database, Multi-Content-Revisions, and 3 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#4096781 (Marostegui) s3 eqiad progress: [] labsdb1009 [] labsdb1010 [] labsdb1011 [] db...
[11:54:35] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#4096782 (Marostegui) s3 eqiad progress: [] labsdb1009 [] labsdb1010 [] labsdb1011 [] dbstore1002 [] db1095 [] db1072 []...
[11:54:57] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#4096785 (Marostegui)
[11:55:13] Blocked-on-schema-change, DBA, MediaWiki-Database, Multi-Content-Revisions, and 3 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#4096786 (Marostegui)
[11:55:40] DBA, MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4096787 (Marostegui)
[12:15:35] DBA, Operations, Goal: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4096834 (Marostegui) @jcrespo and I have had an initial discussion about HW and to what extent (pros and cons) we can achieve redundancy for...
[12:37:04] DBA, Data-Services: labsdb1009 crashed - https://phabricator.wikimedia.org/T191149#4095437 (chasemp) Sincerest thanks to you all :)
[12:38:07] DBA, Data-Services: labsdb1009 crashed - https://phabricator.wikimedia.org/T191149#4096849 (jcrespo) @chasemp No user impact -yet- due to the proxy working as intended and failing over to labsdb1010.
[12:53:26] DBA, Epic, Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#4096859 (Krinkle)
[12:55:07] DBA, Epic, Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3252115 (Krinkle) I've attempted to make the table a bit more readable by shortening the labels a bit, and by rephrasing "//Removable: Yes, No, Removed//...
[13:03:10] DBA, Epic, Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#4096887 (jcrespo)
[13:48:31] DBA, Data-Services: labsdb1009 crashed - https://phabricator.wikimedia.org/T191149#4097016 (jcrespo) @JAllemandou Please stop sqoop, we are close to a full service outage.
[14:06:02] DBA, MediaWiki-Database, MediaWiki-Special-pages, Security, Wikimedia-log-errors: Wikimedia\Rdbms\Database::tableName: use of subqueries is not supported this way. - https://phabricator.wikimedia.org/T191116#4097113 (jcrespo) Ok, then lowering priority.
[14:06:04] DBA, MediaWiki-Database, MediaWiki-Special-pages, Security, Wikimedia-log-errors: Wikimedia\Rdbms\Database::tableName: use of subqueries is not supported this way. - https://phabricator.wikimedia.org/T191116#4097114 (jcrespo) p:High>Normal a:aaron>None
[14:06:07] DBA, MediaWiki-Database, MediaWiki-Special-pages, Security, Wikimedia-log-errors: Wikimedia\Rdbms\Database::tableName: use of subqueries is not supported this way. - https://phabricator.wikimedia.org/T191116#4097118 (jcrespo)
[14:06:25] DBA, MediaWiki-Database, MediaWiki-Special-pages, Security, Wikimedia-log-errors: Wikimedia\Rdbms\Database::tableName: use of subqueries is not supported this way. - https://phabricator.wikimedia.org/T191116#4097119 (jcrespo)
[14:07:19] DBA, MediaWiki-Database, MediaWiki-Special-pages, Security, Wikimedia-log-errors: Wikimedia\Rdbms\Database::tableName: use of subqueries is not supported this way. - https://phabricator.wikimedia.org/T191116#4094572 (jcrespo) We can even close it as invalid if you think that is appropriate.
[14:12:30] DBA, MediaWiki-Database, MediaWiki-Special-pages, Security, Wikimedia-log-errors: Wikimedia\Rdbms\Database::tableName: use of subqueries is not supported this way. - https://phabricator.wikimedia.org/T191116#4097132 (Anomie) The deprecated uses still need cleaning up, so it could be kept open...
[14:17:18] DBA, MediaWiki-Database, MediaWiki-Special-pages, Security, Wikimedia-log-errors: Wikimedia\Rdbms\Database::tableName: use of subqueries is not supported this way. - https://phabricator.wikimedia.org/T191116#4097148 (Anomie) Looking through the different stack traces associated with this mess...
[14:23:56] DBA: Drop contest* tables from mediawikiwiki - https://phabricator.wikimedia.org/T186867#4097164 (Marostegui) Open>Resolved a:Marostegui I have dropped the tables. There is a temporary backup on s3 master: `db1075:/srv/tmp/contest_tables`
[14:23:59] DBA, Epic, Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#4097167 (Marostegui)
[14:28:07] DBA: Drop flaggedrevs tables from mediawikiwiki - https://phabricator.wikimedia.org/T186865#4097175 (Marostegui) I have placed a temporary backup on s3 master: `db1075:/srv/tmp/flaggedrevs`
[14:49:19] DBA, DC-Ops, Operations, media-storage, ops-codfw: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4097210 (Papaul) Removed power for 2 minutes and plugged back. Leaving this task open for now to monitor the switch.
[14:50:35] DBA, DC-Ops, Operations, media-storage, ops-codfw: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4097212 (Papaul) p:High>Low
[14:51:04] DBA, DC-Ops, Operations, media-storage, ops-codfw: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4097213 (Marostegui) The servers are reporting the recoveries already :-) Thanks!
[14:52:03] DBA, DC-Ops, Operations, media-storage, ops-codfw: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4097214 (RobH) Bad switch state is the easiest recovery, so that is nice.
[14:52:30] DBA, Operations, ops-codfw: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097215 (Papaul) a:Papaul>Marostegui @Marostegui confirm
[14:54:28] DBA, Operations, ops-codfw: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097218 (Marostegui) Thanks @Papaul - we can schedule one movement per day if that works for you! In order to minimize downtime I would need the future IP of each server before we shut it down so I...
[14:55:20] DBA, Operations, ops-codfw: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097222 (Marostegui) a:Marostegui>Papaul
[14:57:59] DBA, Operations, ops-codfw: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097226 (Papaul) @Marostegui let me know which one you want to start with.
[14:59:32] DBA, Operations, ops-codfw: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097227 (Marostegui)
[15:00:28] DBA, Operations, ops-codfw: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4096579 (Marostegui) Let's go for db2035 if that works for you!
[15:19:10] DBA, Operations, hardware-requests, ops-eqiad, Patch-For-Review: Decommission db1020 - https://phabricator.wikimedia.org/T189773#4097273 (Cmjohnson)
[15:20:52] DBA, MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4097279 (Anomie) >>! In T187089#4096772, @Marostegui wrote: >>>! In T187089#3964266, @Anomie wrote: >> S...
[15:25:21] DBA, Operations, ops-codfw: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097289 (Papaul) new IP address 10.192.16.73
[15:26:02] DBA, Operations, ops-codfw: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097293 (Marostegui) Thanks! I will post here as soon as the server is off
[15:26:29] DBA, Operations, ops-codfw: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097294 (Papaul) new switch port information asw-b1-codfw ge-1/0/15
[15:35:59] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097394 (Marostegui) @Papaul db2035 is now off!
[15:42:07] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097430 (Papaul) old switch information asw-c6-codfw ge-6/0/2
[15:42:47] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097434 (Marostegui) mediawiki config files changed network/interfaces changed dns merged and deployed
[15:47:04] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097458 (Papaul) @robh if switch configuration is not done yet can you please change it from new switch port information asw-b1-codfw ge-1/0/15 to new switch port informat...
[15:55:56] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097505 (Papaul) db2035 was on asw-c6-codfw ge-6/0/2 and now will be on asw-b1-codfw ge-1/0/4
[16:49:05] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097692 (Papaul) moved db2035 in racktables from C6 to B1
[16:50:28] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097693 (Marostegui)
[16:50:34] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097694 (Papaul)
[16:51:11] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097696 (Marostegui) db2035's mysql is back and slaves are reconnecting. I would suggest next server to be db2039.
[17:24:23] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097801 (Papaul) switch port information when ready to move db2039. This is just a note for when we are ready to do the move. db2039 was on asw-c6-codfw ge-6/0/6 and now will...
[18:48:18] DBA, Data-Services, Patch-For-Review: labsdb1009 crashed - https://phabricator.wikimedia.org/T191149#4098185 (jcrespo) Open>Resolved This should be fixed now, labsdb1009 was loaded with a copy of labsdb1011. I also took the time to upgrade kernels and mariadb versions. There will be lag for s...
[20:20:35] DBA, Reading List Service, MW-1.31-release-notes (WMF-deploy-2018-02-27 (1.31.0-wmf.23)), Patch-For-Review, Reading-Infrastructure-Team-Backlog (Kanban): Update duplicate handling in reading lists API - https://phabricator.wikimedia.org/T184680#4098486 (Tgr)
[20:21:47] DBA, Reading List Service, MW-1.31-release-notes (WMF-deploy-2018-02-27 (1.31.0-wmf.23)), Patch-For-Review, Reading-Infrastructure-Team-Backlog (Kanban): Update duplicate handling in reading lists API - https://phabricator.wikimedia.org/T184680#3892212 (Tgr) Open>Resolved Let's call t...
[22:46:06] DBA, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), Patch-For-Review, User-notice, Wikimedia-log-errors: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960#4098911...
[22:46:29] DBA, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), Patch-For-Review, User-notice, Wikimedia-log-errors: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960#4089198...