[04:24:24] DBA, Operations, ops-codfw: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (Marostegui) Great! So what do you have in mind?
[04:42:40] DBA, MediaWiki-Database, MediaWiki-extensions-OATHAuth: Schema change to oathauth_users - https://phabricator.wikimedia.org/T225643 (Marostegui) a:Marostegui
[04:44:55] DBA, MediaWiki-Database, MediaWiki-extensions-OATHAuth: Schema change to oathauth_users - https://phabricator.wikimedia.org/T225643 (Marostegui) I have deployed this change on db1073 for labswiki and labtestwiki just to have it done there in advance to check if something breaks in the next few days.
[04:45:15] DBA, MediaWiki-Database, MediaWiki-extensions-OATHAuth: Schema change to oathauth_users - https://phabricator.wikimedia.org/T225643 (Marostegui) p:Triage→Normal
[05:09:17] DBA, Operations, ops-eqiad: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ayounsi) @Cmjohnson I'm pretty sure they will have to move :( Creating a new vlan and all the supporting config (DHCP, routing, IP allocation, etc) is a non trivial task for a vlan...
[05:12:46] DBA, Operations, ops-eqiad: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (Marostegui) So, to be clear from my side: Out of those 4 hosts, which 2 are on C row and 2 are on D row, we need 2 of them (I don't mind which ones) to be in the same VLAN as dbpro...
[05:16:56] DBA, Operations, ops-eqiad: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ayounsi) Ok good so that's 1018/1019). In which vlan the other 2 (dbproxy1020/1021) need to go then?
[05:18:13] DBA, Operations, ops-eqiad: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (Marostegui) dbproxy1020/1021 can go the same vlans as dbproxy1001-1008 as those will be replacing some of those
[05:22:20] DBA, Operations, ops-eqiad: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ayounsi) dbproxy1001-1008 are in the private vlans across row A-B-C, none in D. Is row D private fine for dbproxy1020/1021 or should they be in private-A/B/C ?
[05:23:53] DBA, Operations, ops-eqiad: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (Marostegui) >>! In T225704#5264272, @ayounsi wrote: > dbproxy1001-1008 are in the private vlans across row A-B-C, none in D. Is row D private fine for dbproxy1020/1021 or should the...
[05:27:55] DBA, Operations, ops-eqiad: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ayounsi) No differences other than the physical row they're in. They will be able to reach the same resources. @Cmjohnson so dbproxy1020/1021 will go in private1-d-eqiad.
[05:28:36] DBA, Operations, ops-eqiad: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (Marostegui) >>! In T225704#5264276, @ayounsi wrote: > No differences other than the physical row they're in. They will be able to reach the same resources. Great! Thanks :-)
[05:40:47] DBA: Replace db1077 with db1112 - https://phabricator.wikimedia.org/T225981 (Marostegui)
[05:41:05] DBA: Replace db1077 with db1112 - https://phabricator.wikimedia.org/T225981 (Marostegui) test-cluster users have been notified that on Thursday the replica will go offline to be changed by db1077.
[05:41:20] DBA: Replace db1077 with db1112 - https://phabricator.wikimedia.org/T225981 (Marostegui) p:Triage→Normal
[05:56:19] DBA, DC-Ops, decommission, Goal: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (Marostegui) Assigning this to myself to indicate I am using dbstore1001 for a few days as storing the content of db1112 (test cluster data) temporarily - once I have fi...
[05:56:28] DBA, DC-Ops, decommission, Goal: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (Marostegui) a:RobH→Marostegui
[06:17:03] DBA, DC-Ops, decommission, Goal: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (jcrespo) Please do not use dbstores, use dbprov instead.
[06:18:15] DBA, DC-Ops, decommission, Goal: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (Marostegui) it is temporary and it won't last more than 2 days, but ok
[06:20:20] DBA, DC-Ops, decommission, Goal: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (Marostegui) a:Marostegui→RobH
[06:20:57] just make sure you use a non 4444-4452 port (although snapshots are not running at the moment)
[06:21:23] I am using transfer.py
[06:21:33] which has a --port option
[06:21:52] This is exactly the reason why I wanted to use dbstore1001 to avoid messing up with the backups
[06:21:59] it doesn't matter
[06:22:05] they are not running today
[06:23:30] yes, but I didn't want to send stuff there to avoid messing up with iops etc, also you mentioned you wanted to keep them clean and all that
[06:23:31] but ok
[06:23:39] I am sending it there now with --port 5555
[06:23:52] I prefer that than to deal with dbstores :-)
[06:25:20] I will be going now, will bring my laptop with me
[06:29:27] good luck!
[07:06:25] DBA: decommission db2039 - https://phabricator.wikimedia.org/T225988 (Marostegui)
[07:06:58] DBA: decommission db2039 - https://phabricator.wikimedia.org/T225988 (Marostegui) a:Marostegui
[07:07:23] DBA: decommission db2039 - https://phabricator.wikimedia.org/T225988 (Marostegui)
[07:07:38] DBA, Patch-For-Review: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533 (Marostegui)
[07:07:40] DBA: decommission db2039 - https://phabricator.wikimedia.org/T225988 (Marostegui)
[07:08:31] before merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/517461/ fleet-wide I'm disabling it manually on a few test hosts of the major roles, it's very unlikely to cause any issues, but better safe than sorry. any DBA recommendations for a db/dbproxy host to pick?
[07:09:00] I assume you need them to have some traffic, no?
[07:10:40] either should be fine I think, can also be a replica in codfw I think
[07:10:53] yeah, I was going to propose that
[07:11:40] let's pick db2054, db2053 and db2111?
[07:12:05] thanks, setting it there
[07:12:11] thanks
[08:27:00] DBA: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533 (Marostegui)
[09:49:04] dumps finished with no issues with the new codebase at 9UTC
[09:49:14] congrats! :)
[09:49:30] well, it was always like this since the beginning
[09:49:58] but as it is the same codebase, clearly the issue is with mariaback and/or remote execution
[09:50:06] *mariabackup
[09:56:29] DBA, Operations: db2084 temporary correctable hardware errors - https://phabricator.wikimedia.org/T225884 (jijiki) p:Triage→Normal @Marostegui are we good to mark this as resolved?
[09:57:06] DBA, Operations: db2084 temporary correctable hardware errors - https://phabricator.wikimedia.org/T225884 (Marostegui) Not yet, I haven't seen more errors but I want to wait until icinga alert clears up, let's give it another 24h
[11:54:02] DBA, Operations: db2084 temporary correctable hardware errors - https://phabricator.wikimedia.org/T225884 (Marostegui) a:Marostegui
[17:54:06] DBA, Operations, Patch-For-Review: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (jcrespo) Testing went as expected: ` root@cumin1001:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move es2001 es2002 Starting preflight checks... [ERROR]: Initial...
[18:14:43] DBA, Operations, Patch-For-Review: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (jcrespo) Reviewed all patches, only commented on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/517363 (but +1ed too). Running a last compare.py just to be sup...
[18:22:01] DBA, Operations, Patch-For-Review: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (Marostegui) Thanks for all the checks! I will depool db1081 early in the morning, good idea :)