[06:25:35] DBA, decommission-hardware: decommission db1088.eqiad.wmnet - https://phabricator.wikimedia.org/T276025 (Marostegui) Host depooled - will leave it depooled till next week before decommissioning it.
[06:27:56] DBA, decommission-hardware, Patch-For-Review: decommission db1088.eqiad.wmnet - https://phabricator.wikimedia.org/T276025 (Marostegui)
[06:28:35] Blocked-on-schema-change, DBA: Schema change to make rc_id unsigned - https://phabricator.wikimedia.org/T276150 (Marostegui) Excellent - will alter a host in eqiad on Monday
[06:29:26] Blocked-on-schema-change, DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (Marostegui) @Ladsgroup this looks good too?
[07:15:46] DBA, Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (Marostegui) Started transfer from db2116 to db2145
[10:12:51] ACKNOWLEDGEMENT - MariaDB sustained replica lag on db1143 is CRITICAL: 21 ge 2 Kormat Investigating https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1143&var-port=9104
[10:16:20] DBA: mariadb: Replication lag monitoring does not support circular replication - https://phabricator.wikimedia.org/T275497 (Kormat) Deployed to s4 without issues. Deploying to s5 now.
[11:15:09] DBA, Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (Marostegui) db2145 is now replicating
[11:30:49] DBA, Patch-For-Review, Performance-Team (Radar): Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (Kormat)
[11:30:54] DBA: mariadb: Replication lag monitoring does not support circular replication - https://phabricator.wikimedia.org/T275497 (Kormat) Open→Resolved Deployment complete.
[11:31:08] DBA, Patch-For-Review, Performance-Team (Radar): Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (Kormat)
[11:31:17] DBA: mariadb: Replication lag monitoring does not support circular replication - https://phabricator.wikimedia.org/T275497 (Kormat) Resolved→Open Ah, that was premature. This still needs to be fixed for the other profiles.
[11:50:07] Data-Persistence-Backup, SRE, Goal: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (jcrespo)
[11:56:13] Blocked-on-schema-change, DBA: Drop default of oldimage.oi_timestamp - https://phabricator.wikimedia.org/T272511 (Marostegui) >>! In T272511#6877940, @Marostegui wrote: > The last pending host is db1123 (s3 master) which I will do tomorrow morning as it will take 15h to complete. This is now running
[11:59:19] Data-Persistence-Backup, SRE, Goal, Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (jcrespo) p:Triage→High
[12:03:40] Data-Persistence-Backup, SRE, Goal: Create a first release of the media backups automation tools - https://phabricator.wikimedia.org/T276445 (jcrespo)
[12:03:50] Data-Persistence-Backup, SRE, Goal: Create a first release of the media backups automation tools - https://phabricator.wikimedia.org/T276445 (jcrespo) p:Triage→High
[12:16:15] would "media backups orchestrator" be confusing? Alternatives? https://gerrit.wikimedia.org/r/c/operations/puppet/+/668380
[12:16:46] jynus: maybe a bit yes, what about "management nodes?"
[12:17:13] would that be confusing with cumin, which I think is called something similar?
[12:17:51] "media backup manager hosts"
[12:20:49] I may change manager with "generation" (vs "storage")
[12:21:35] I indeed things like orchestate and manage are very overused words
[12:21:40] *think
[12:21:45] *think that things
[12:24:02] "backup central nodes"?
[12:24:52] "backup workers" vs "backup storage"
[12:24:54] don't worry, I already got a few candidates, will try to send a new proposal
[12:25:08] with some of the things you proposed
[12:25:25] I like workers
[12:36:37] DBA: Failover m1 master: db1080 -> db1159 - https://phabricator.wikimedia.org/T276448 (Marostegui)
[12:36:48] DBA: Failover m1 master: db1080 -> db1159 - https://phabricator.wikimedia.org/T276448 (Marostegui) p:Triage→Medium
[13:06:41] DBA, Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (Marostegui) db2146 is now replicating
[13:14:39] PROBLEM - MariaDB sustained replica lag on db1141 is CRITICAL: 7 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1141&var-port=9104
[13:15:55] RECOVERY - MariaDB sustained replica lag on db1141 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1141&var-port=9104
[14:26:59] Blocked-on-schema-change, DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (Ladsgroup) oh I made a mistake. This needs to become binary(14) too. I added it to the wrong ticket (https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-7r43urruzwsv4fj/) image table...
[14:27:52] Late to the party but +1 for workers
[14:28:30] Less namespace conflicts and they also sound very serverless, which I hear is what the cool kids use these days
[14:34:46] Blocked-on-schema-change, DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (Marostegui) @Ladsgroup let's include on this ticket then? That makes this task also need a master swap for getting this change applied to the primary masters as changing the datatype isn't something t...
[14:52:51] Blocked-on-schema-change, DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (Ladsgroup) This can be grouped with another schema change instead. Maybe with unsigned for rc_id? It doesn't need to happen at the same time as this change.
[14:53:23] Blocked-on-schema-change, DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (Marostegui) I am fine with that too, yes. Can you edit the other task then?
[14:56:16] Blocked-on-schema-change, DBA: Schema change to make rc_id unsigned - https://phabricator.wikimedia.org/T276150 (Ladsgroup)
[14:56:45] Blocked-on-schema-change, DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (Ladsgroup) Done
[14:56:49] DBA, Patch-For-Review: mariadb: Replication lag monitoring does not support circular replication - https://phabricator.wikimedia.org/T275497 (Kormat)
[14:56:51] DBA, Patch-For-Review, Performance-Team (Radar): Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (Kormat)
[14:57:09] DBA, Patch-For-Review, Performance-Team (Radar): Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (Kormat) a:Kormat→None The puppet changes are now in place.
[14:58:25] marostegui: ^ FYI (particularly that i've finished my work for this task)
[14:58:58] Blocked-on-schema-change, DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (Marostegui) Thanks - I will add that change to the altered host and update that other task (T276150)
[14:59:18] kormat: thanks - I will review what's pending (if anything) and close it if all our work is done
[14:59:23] 👍
[15:00:11] Blocked-on-schema-change, DBA: Schema change to make rc_id unsigned and rc_timestamp VARBINARY - https://phabricator.wikimedia.org/T276150 (Marostegui)
[15:00:43] Blocked-on-schema-change, DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (Marostegui)
[15:05:34] PROBLEM - MariaDB sustained replica lag on db2089 is CRITICAL: 76.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2089&var-port=13316
[15:06:49] Blocked-on-schema-change, DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (Marostegui) Applied also the change about rc_timestamp being binary to already altered host (db2089:3316): ` root@db2089.codfw.wmnet[frwiki]> ALTER TABLE /*_*/recentcha...
[15:10:58] RECOVERY - MariaDB sustained replica lag on db2089 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2089&var-port=13316
[16:07:41] DBA, Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (Marostegui) Interestingly, db2116 was used to clone db2145 and db2146. db2146 came up corrupted Then db2116 got corrupted after getting restarted.... db2145 has no traces of corruption
[18:14:56] DBA, Analytics-Clusters, Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (razzi) Alright, patches are ready, here are the steps I will run to rename and reimage labsdb1012 to clouddb1021. #### Phase 1: Reimage, rename, set to inset...
[18:24:08] DBA, Analytics-Clusters, Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (elukey) > VLAN Type: Analytics (could somebody confirm this for me?) The current labsdb1012 seems to be in the `cloud-support1-a-eqiad` VLAN, I'd leave it l...
[18:32:29] DBA, Analytics-Clusters, Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (elukey) @razzi qq - are you going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/663865 before starting? Also in the plan I'd mention to down...
[18:56:45] Hi data persistence, I'm working on a puppet patch for a new wikireplicas database server clouddb1021, and I'm hoping to confirm that `echo partman/custom/db.cfg` as was added for clouddb1013-clouddb1020 works for clouddb1021 as well, as seen in https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529/2/modules/install_server/files/autoinstall/netboot.cfg. Can somebody confirm?
[18:59:25] DBA, Analytics-Clusters, Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (elukey) @marostegui I see in https://phabricator.wikimedia.org/T260441 that you handled the other hosts, should we just use the `db.cfg` partman config in htt...
[19:03:23] DBA, Analytics-Clusters, Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (Marostegui) Yes, that one should be fine. It will nuke everything but won't touch the raid level or anything else. Even if you don't use a recipe, that should...
[19:32:39] Got an answer for the above in phabricator; looks like db.cfg will do!
[19:47:18] PROBLEM - MariaDB sustained replica lag on db1121 is CRITICAL: 7.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1121&var-port=9104
[19:49:16] RECOVERY - MariaDB sustained replica lag on db1121 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1121&var-port=9104
[21:17:41] DBA, Analytics-Clusters, Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (razzi) @elukey thanks for your comments; I edited the plan comment.