[03:16:05] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[03:18:23] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[04:20:52] DBA, Phabricator: Upgrade mysql on db1132 (phabricator db master) - https://phabricator.wikimedia.org/T279625 (Marostegui) test
[04:22:27] marostegui: I'm ready to do the lists-next upgrade whenever you're free :)
[04:22:48] DBA, Phabricator: Upgrade mysql on db1132 (phabricator db master) - https://phabricator.wikimedia.org/T279625 (Marostegui) Open→Resolved a:Marostegui This was done RO starts: 04:19:51 RO stops: 04:20:18
[04:22:50] DBA: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 (Marostegui)
[04:22:52] legoktm: we can do it now :)
[04:23:05] legoktm: sorry for not answering yesterday, that was pretty late in the EU evening :)
[04:23:21] ok, give me a minute
[04:23:44] and no worries :) I figured as much like a minute after I sent the ping
[04:25:10] DBA: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 (Marostegui)
[04:33:15] DBA, Cognate, ContentTranslation, Growth-Team, and 8 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T281212 (Marostegui)
[04:34:56] DBA, Cognate, ContentTranslation, Growth-Team, and 8 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T281212 (Marostegui) p:Triage→Medium
[04:38:26] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[04:46:35] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) Pooled db1124 with minimal weight for the first time in s7
[04:47:27] DBA, SRE, Wikimedia-Mailing-lists: Upgrade lists-next to bullseye mailman versions - https://phabricator.wikimedia.org/T280887 (Legoktm) Open→Resolved a:Legoktm Upgraded, thanks to @Marostegui for supervising!
[04:58:04] DBA, decommission-hardware, Patch-For-Review: decommission db1077.eqiad.wmnet - https://phabricator.wikimedia.org/T281075 (Marostegui)
[04:58:59] DBA, decommission-hardware, Patch-For-Review: decommission db1077.eqiad.wmnet - https://phabricator.wikimedia.org/T281075 (Marostegui) This is ready for #dc-ops
[05:00:31] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[05:18:31] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) I am automatically pooling db1124 into s7.
[05:18:42] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[05:27:35] Blocked-on-schema-change, DBA: Schema change for renaming new_name_timestamp to rc_new_name_timestamp in recentchanges - https://phabricator.wikimedia.org/T276292 (Marostegui)
[05:32:41] Blocked-on-schema-change, DBA: Schema change for renaming new_name_timestamp to rc_new_name_timestamp in recentchanges - https://phabricator.wikimedia.org/T276292 (Marostegui) s7 eqiad [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1181 [] db1174 [] db1170 [] db1158 [] db1155 [] db1136 []...
[05:40:20] Blocked-on-schema-change, DBA: Schema change for renaming new_name_timestamp to rc_new_name_timestamp in recentchanges - https://phabricator.wikimedia.org/T276292 (Marostegui)
[05:41:03] Blocked-on-schema-change, DBA: Schema change for renaming new_name_timestamp to rc_new_name_timestamp in recentchanges - https://phabricator.wikimedia.org/T276292 (Marostegui)
[05:47:55] Blocked-on-schema-change, DBA: Schema change for renaming new_name_timestamp to rc_new_name_timestamp in recentchanges - https://phabricator.wikimedia.org/T276292 (Marostegui)
[05:52:42] Blocked-on-schema-change, DBA: Schema change for renaming new_name_timestamp to rc_new_name_timestamp in recentchanges - https://phabricator.wikimedia.org/T276292 (Marostegui) s3 eqiad [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1004 [] db1179 [] db1175 [] db1171 [] db1166 [] db1157 [] db1154 [...
[06:03:32] DBA, Patch-For-Review: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[06:13:27] DBA, Analytics-Clusters, Analytics-Kanban, Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (Marostegui) Anything else pending or can this be closed?
[06:14:15] DBA, Analytics-Clusters, Analytics-Kanban, Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (elukey) Open→Resolved
[06:48:44] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[06:50:42] DBA, Patch-For-Review: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[06:53:08] DBA, Patch-For-Review: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[06:54:42] DBA, Patch-For-Review: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[07:03:14] DBA: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 (elukey) @razzi work done to unblock data persistence, also ran `mysql_upgrade` after a chat with Manuel. All good from the Analytics side @Marostegui !
[07:05:04] DBA: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 (Marostegui) Thanks @elukey!
[07:05:11] DBA: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 (Marostegui)
[07:17:01] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) Checking tables on db1167
[07:20:53] Data-Persistence-Backup, Goal: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (jcrespo)
[07:24:58] DBA, Data-Persistence-Backup, Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (jcrespo) >>! In T280492#7034626, @Marostegui wrote: > Excellent, thanks. It will take around a day I'd guess. It finished at ~3am: all yours. Please note I l...
[07:25:30] DBA, Data-Persistence-Backup, Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (Marostegui) Thank you, I will take over it!
[08:11:52] DBA, Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (Marostegui) I have started to clean up s1, but will most likely finish it once the switchover is done tomorrow. There's lots to clean up there
[08:46:15] Data-Persistence-Backup: xtrabackup --prepare hits open_files_limit on buster - https://phabricator.wikimedia.org/T281094 (jcrespo) New package version has been locally installed on dbprov2003. Run looks fine so far: ` [08:43:32]: DEBUG - ['xtrabackup', '--prepare', '--target-dir', '/srv/backups/snapshots/...
[09:42:42] DBA, Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (Marostegui) Regarding s1: I have finished cleaning up eqiad. Tomorrow only deleting the current master's ID would be the only thing left for it. Going to start codfw clean up now.
[09:53:01] I'm going to clean ~12M rows from watchlist table of commons. Don't be alarmed for the high write there
[09:53:10] thanks
[09:54:28] I've reported the backup failure of people1003 at T280989, there is not much else we can do on our side
[09:54:28] T280989: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989
[12:05:55] jynus: can you review this for me? (I just saw k0rmat) is out today: https://gerrit.wikimedia.org/r/c/operations/puppet/+/682881 https://gerrit.wikimedia.org/r/c/operations/dns/+/682882 so I would like to get a review before the switch tomorrow morning :)
[12:06:10] one sec
[12:06:29] no rush
[12:06:30] I am finishing some details about s4, I guess you don't mind me doing it later :-)
[12:06:36] absolutely
[12:06:52] add me a reviewer please so I get a reminder at mail
[12:06:59] yep wilco
[12:21:20] DBA, Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (Marostegui) s1 codfw is now clean
[12:51:00] DBA, SRE-tools: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (jcrespo)
[12:51:26] ^I've filed this based on my struggles to get a one liner working at: P15586
[12:52:16] I thought we had a task for that
[12:52:28] but anyways, +1 to have that
[12:52:40] DBA, SRE, Sustainability (Incident Followup): Collect metricts for Exec_Master_Log_Pos - https://phabricator.wikimedia.org/T281251 (jbond)
[12:53:13] ^ I am not sure I get what that is about
[12:53:17] I think I asked stevie about productionizing section and she mentioned some blocker
[12:53:35] ah, that is just a duplicate of mine
[12:53:41] I will merge it
[12:53:47] as mine is more extended
[12:53:49] +1
[12:54:22] DBA, SRE-tools: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (jcrespo)
[12:54:26] DBA, SRE, Sustainability (Incident Followup): Collect metricts for Exec_Master_Log_Pos - https://phabricator.wikimedia.org/T281251 (jcrespo)
[12:54:54] DBA, SRE-tools: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (jcrespo) Jbond: we already collect those metrics, what we don't have is a way to show them easily.
[12:55:26] DBA, SRE-tools: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (jbond) >>! In T281249#7037932, @jcrespo wrote: > Jbond: we already collect those metrics, what we don't have is a way to show them easily....
[12:56:35] DBA, SRE-tools, Sustainability (Incident Followup): Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (jbond)
[12:58:17] DBA, SRE-tools, Sustainability (Incident Followup): Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (jcrespo) It could have a format similar to the existing db-replication-tree, but it cannot use repl...
[12:58:42] to be fair, I report my own struggles, then you DBAs decide on importance/etc.
[13:00:51] also clouddbs now are show on replica tree as they no longer use multi-source :-) https://phabricator.wikimedia.org/F34429728
[13:00:57] *shown
[13:01:18] I will take a break, and when come backup give you a review and keep fixing the backups :-)
[13:01:34] *come back
[13:01:40] *will
[13:01:45] thanks jynus
[13:02:27] I will probably go soon too, as I started at 6am today and tomorrow will do the same for the switchover preparation
[14:58:42] marostegui: let me know it's okay for me to restart my clean up script
[14:58:56] I'll run it with smaller pace. 1K row per 5sec?
[16:02:33] I think s3 buster backups are fixed, waiting for a last full backup to confirm it
[16:21:38] DBA, Cognate, ContentTranslation, Growth-Team, and 9 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T281212 (Tgr) #growthexperiments also uses x1 (not for enwiki but for a number of others). As far as I'm aware it won't be nontrivially impacted by a short readonly...
[16:54:27] DBA, MediaWiki-extensions-Renameuser: Fix use of DB schema so RenameUser is trivial - https://phabricator.wikimedia.org/T33863 (Izno)
[17:45:47] Data-Persistence-Backup, Goal: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (jcrespo)
[17:46:05] Data-Persistence-Backup, Patch-For-Review: xtrabackup --prepare hits open_files_limit on buster - https://phabricator.wikimedia.org/T281094 (jcrespo) Open→Resolved This is now fixed, I have uploaded v0.5 packages fixing the issue, but I will only have updated for now buster dbprov hosts, as the o...
[17:50:05] marostegui: if you're available, we'd like to do the real mailman3 install/db setup ~6 UTC tomorrow
[17:55:41] DBA, Phabricator, serviceops, User-brennen: Phabricator intermittently slow; db connection failures to m3-master.eqiad.wmnet with "Temporary failure in name resolution" - https://phabricator.wikimedia.org/T279013 (mmodell)
[17:56:19] DBA, Schema-change, Tracking-Neverending: [DO NOT USE] Schema changes for Wikimedia wikis (tracking) [superseded by #Blocked-on-schema-change] - https://phabricator.wikimedia.org/T51188 (Izno)
[17:56:24] DBA, Phabricator, serviceops, User-brennen: Phabricator intermittently slow; db connection failures to m3-master.eqiad.wmnet with "Temporary failure in name resolution" - https://phabricator.wikimedia.org/T279013 (mmodell) I've marked this as blocked by {T171498} because that sounds like the righ...
[17:56:49] DBA, Phabricator, serviceops, User-brennen: Phabricator intermittently slow; db connection failures to m3-master.eqiad.wmnet with "Temporary failure in name resolution" - https://phabricator.wikimedia.org/T279013 (mmodell) p:Triage→Low
[17:57:54] DBA, Schema-change, Tracking-Neverending: [DO NOT USE] Schema changes for Wikimedia wikis (tracking) [superseded by #Blocked-on-schema-change] - https://phabricator.wikimedia.org/T51188 (Izno)
[17:59:18] DBA, Schema-change, Tracking-Neverending: [DO NOT USE] Schema changes for Wikimedia wikis (tracking) [superseded by #Blocked-on-schema-change] - https://phabricator.wikimedia.org/T51188 (Izno)
[18:58:55] I'm importing wikimedia-l to lists-next as the last test before going live tomorrow
[20:42:17] DBA, DiscussionTools, OWC2020, Editing-team (FY2020-21 Kanban Board), Patch-For-Review: DBA review: conversation subscriptions - https://phabricator.wikimedia.org/T263817 (matmarex) Thank you for taking a look! Replies below: >>! In T263817#7033384, @Marostegui wrote: > Let's make the VARCH...
[20:51:14] DBA, DiscussionTools, OWC2020, Editing-team (FY2020-21 Kanban Board), Patch-For-Review: DBA review: conversation subscriptions - https://phabricator.wikimedia.org/T263817 (matmarex) a:matmarex
[21:00:34] DBA, DiscussionTools, OWC2020, Editing-team (FY2020-21 Kanban Board), Patch-For-Review: DBA review: conversation subscriptions - https://phabricator.wikimedia.org/T263817 (DLynch) > The .sql files are generated from the .json file, using the generateSchemaSql.php maintenance script, and as fa...