[00:49:40] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10Sergey.Trofimovsky.SF) The plan overall is to utilize Gitlab's built in backup that (to some point) takes care of the consistency of backups, including...
[00:55:04] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10Sergey.Trofimovsky.SF) Using Gerrit backups as a baseline makes sense. What components are currently included in the hourly Gerrit backups? What is the...
[01:25:58] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[01:26:32] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 9.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[01:27:40] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 14 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[01:37:29] marostegui: aha, I found https://docs.google.com/spreadsheets/d/1QMwENrGC6IKBV5F8boyjr-3V-3HujI9Vo3U52zP5F88/edit#gid=0 - in case that helps
[02:02:34] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[02:04:14] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[02:09:54] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[02:15:18] RECOVERY - MariaDB sustained replica lag on pc2009 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[04:43:08] 10DBA, 10Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (10Marostegui)
[04:58:31] I have truncated db1077 parsercache tables as it was filling up
[04:58:45] db1077 is a testing host, so nothing to worry about
[05:08:44] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1076.eqiad.wmnet - https://phabricator.wikimedia.org/T274752 (10Marostegui)
[05:08:56] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1076.eqiad.wmnet - https://phabricator.wikimedia.org/T274752 (10Marostegui)
[05:13:21] 10DBA, 10GrowthExperiments-MentorDashboard, 10GrowthExperiments-Mentorship, 10Growth-Team (Current Sprint), and 2 others: Create growthexperiments_mentor_mentee database table on extension1 for wikis in growthexperiments.dblist - https://phabricator.wikimedia.org/T278573 (10Marostegui) Confirmed - table is...
[05:14:50] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1076.eqiad.wmnet - https://phabricator.wikimedia.org/T274752 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by root@cumin1001 for hosts: `db1076.eqiad.wmnet` - db1076.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - F...
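The replica-lag alert strings above encode both thresholds inline: `4 ge 2` means the measured lag (4 s) is greater than or equal to the critical threshold (2 s), and the recovery lines print `(C)2 ge (W)1 ge 0.2`, i.e. critical at ≥ 2 s, warning at ≥ 1 s. A minimal sketch of that classification, with the thresholds taken from the log (the function name is made up for illustration):

```python
def classify_lag(lag_seconds: float, warn: float = 1.0, crit: float = 2.0) -> str:
    """Mirror the check's 'value ge threshold' comparison seen in the alerts."""
    if lag_seconds >= crit:
        return "CRITICAL"
    if lag_seconds >= warn:
        return "WARNING"
    return "OK"

# Values seen in the log:
print(classify_lag(4))    # pc2008 at 01:25:58 -> CRITICAL
print(classify_lag(0.2))  # pc2007 recovery at 02:04:14 -> OK
```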
[05:15:10] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1076.eqiad.wmnet - https://phabricator.wikimedia.org/T274752 (10Marostegui) Host ready for DC-Ops!
[05:16:08] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[05:30:37] 10DBA, 10Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (10Marostegui) Pooled db1177 with minimal weight on s8 - will slowly pool it if all goes well.
[06:38:10] 10DBA, 10Patch-For-Review: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 (10Marostegui) Moving this to 10:30 AM UTC as there's a power maintenance scheduled in my building which is supposed to end at 10:00 AM UTC, but just in case...
[08:45:15] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10jcrespo) > What components are currently included in the hourly Gerrit backups? What is the retention policy for build artifacts, build data (logs etc)...
[09:12:05] I was checking m2 for unrelated reasons and I ran into this issue: https://phabricator.wikimedia.org/P15315
[09:15:37] 10DBA, 10Privacy Engineering, 10Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (10LSobanski) @Reedy are we ok to drop `aft_feedback` now?
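"Pooled db1177 with minimal weight ... will slowly pool it" above describes a gradual ramp-up: a freshly productionized replica enters the load balancer with a tiny weight that is raised in steps while the host stays healthy. A hypothetical sketch of such a ramp schedule (the step count and function name are illustrative, not the actual dbctl behavior):

```python
def pooling_steps(target_weight: int, steps: int = 4) -> list[int]:
    """Evenly ramp a replica's weight from a minimal value up to target."""
    return [round(target_weight * i / steps) for i in range(1, steps + 1)]

# A target weight of 100 reached in four increments:
print(pooling_steps(100))  # [25, 50, 75, 100]
```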
[09:26:41] PROBLEM - MariaDB sustained replica lag on db2133 is CRITICAL: 7.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2133&var-port=9104
[09:29:25] RECOVERY - MariaDB sustained replica lag on db2133 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2133&var-port=9104
[09:30:10] jynus: uh? what's that?
[09:30:53] a view created with a user that no longer exists, it looks like
[09:31:03] that's weird
[09:31:14] I don't recall having views on any other place apart from wmcs
[09:31:18] not sure if something you know about or the service owners did, or what
[09:31:42] but maybe there were some ip moves on grants or something
[09:31:58] I was just reviewing backups and ran into it, no more
[09:31:58] Do you mind creating a task so I can investigate?
[09:32:03] of course!
[09:32:07] thanks
[09:32:24] as in, I reported so you do decided how to react :-)
[09:32:36] *decide
[09:32:39] yes, let's create a task so I can investigate some other time
[09:33:02] I happen also to cleaning up the misc db descriptions
[09:33:06] *be
[09:33:17] I think you will like it
[09:33:53] misc descriptions? in wikitech?
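The issue discussed above (P15315) is a view whose DEFINER account no longer exists: the view definition still references a dropped user, so the view errors out on access. A sketch of the detection logic, assuming the view definers (e.g. from `information_schema.VIEWS`) and existing accounts (e.g. from `mysql.user`) have already been collected; the names and data below are made up:

```python
def orphaned_views(view_definers: dict[str, str], accounts: set[str]) -> list[str]:
    """Return the views whose DEFINER is not an existing account."""
    return sorted(view for view, definer in view_definers.items()
                  if definer not in accounts)

# Illustrative data, not the real m2 contents:
views = {"m2_db.some_view": "'olduser'@'10.0.0.1'", "m2_db.ok_view": "'app'@'%'"}
users = {"'app'@'%'", "'root'@'localhost'"}
print(orphaned_views(views, users))  # ['m2_db.some_view']
```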
[09:34:02] yes
[09:34:12] that's very needed, I did some cleanup a few months ago I think
[09:34:17] they were outdated, I will show you in a bit
[09:34:17] thanks for doing it
[09:34:31] I use it a lot because of backups generation
[09:34:47] when checking if everything is backed up
[09:36:59] 10DBA, 10Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (10Marostegui) Automatically pooling db1177 into s8
[09:37:16] 10DBA, 10Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (10Marostegui)
[09:39:52] marostegui, I updated everything that was outdated, and reorganized it a bit: https://wikitech.wikimedia.org/wiki/MariaDB/misc#Sections_description
[09:40:10] it is now not very pretty, but I will convert it to a nice table at a later time
[09:40:34] Thank you for keeping it up to date!
[09:40:44] I removed the owners section
[09:40:48] as it was outdated and duplicated
[09:41:00] and put the owners/people offering to help on the same line
[09:41:41] on a next change I think I could do a table with fields db name | description | owner | actions needed after failover
[09:42:02] yeah, that would be great and maybe also: actions needed before the failover
[09:42:10] lol
[09:42:16] sure
[09:42:18] :)
[09:42:37] for now I was happy that it matched my backups
[09:44:14] 10DBA, 10Patch-For-Review: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 (10Marostegui)
[09:45:35] 10DBA, 10Patch-For-Review: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 (10Marostegui) As pre step, everything moved under the new host.
{F34387980}
[09:45:47] 10DBA, 10Patch-For-Review: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 (10Marostegui)
[09:49:35] 10Data-Persistence-Backup, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (10jcrespo) I just read there is an option for "Skipping tar creation", maybe that could be used to generate a consistent export of files that are increme...
[09:59:24] legoktm: thanks for the doc, so we'll be following that same approach for 2021's switchover?
[10:04:50] 10DBA, 10Patch-For-Review: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 (10Marostegui)
[10:06:29] 10DBA, 10Patch-For-Review: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 (10Marostegui)
[10:06:37] 10DBA, 10Patch-For-Review: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 (10Marostegui) All the pre-steps are done
[10:09:08] 10DBA, 10Patch-For-Review: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 (10Marostegui)
[10:12:32] marostegui: yes
[10:13:03] legoktm: good, for us that means only MW indeed, so pretty much the same we did last year. Thanks!
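The "Skipping tar creation" option mentioned in the 09:49:35 comment refers to the `SKIP=` parameter of GitLab's backup Rake task: with `SKIP=tar` the backup data is left as plain directories rather than packed into one tarball, which fits incremental file-level backup. A sketch of assembling such an invocation (purely illustrative; the exact options should be checked against the GitLab docs for the installed version):

```python
def gitlab_backup_cmd(skip: tuple[str, ...] = ()) -> list[str]:
    """Assemble a gitlab-backup invocation, optionally skipping components."""
    cmd = ["gitlab-backup", "create"]
    if skip:
        cmd.append("SKIP=" + ",".join(skip))
    return cmd

print(gitlab_backup_cmd(("tar",)))  # ['gitlab-backup', 'create', 'SKIP=tar']
```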
[10:34:26] 10DBA, 10Patch-For-Review: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 (10Marostegui)
[10:36:15] 10DBA, 10Patch-For-Review: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 (10Marostegui)
[10:39:27] kormat: https://wikitech.wikimedia.org/w/index.php?title=MariaDB&type=revision&diff=1907980&oldid=1907906
[10:41:49] I think it applied also some master puppet-driven checks
[10:43:48] marostegui: 👍
[10:45:55] 10DBA, 10Patch-For-Review: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 (10Marostegui) Everything looks good, we are running some final checks to ensure backup infra is working fine after the swap. The RO time was around 10 seconds.
[10:47:17] 10DBA, 10decommission-hardware: decommission db1080.eqiad.mnet - https://phabricator.wikimedia.org/T280121 (10Marostegui)
[10:49:58] ah, silly me, cumin doesn't need to reload the db, as it is only written from dbprov* hosts
[10:50:10] 10DBA, 10decommission-hardware: decommission db1080.eqiad.mnet - https://phabricator.wikimedia.org/T280121 (10Marostegui) db1080 is no longer m1 master, it was swapped T276448. Let's give db1159 a week before decommissioning this host.
[10:50:20] 10DBA, 10decommission-hardware: decommission db1080.eqiad.mnet - https://phabricator.wikimedia.org/T280121 (10Marostegui)
[10:50:22] 10DBA, 10Patch-For-Review: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 (10Marostegui)
[10:50:49] 10DBA: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 (10Marostegui)
[10:51:23] I am rerunning manually "snapshot of s7 in codfw", in a couple of hours we will be able to check if it writes its metadata successfully
[10:51:33] good, thanks
[10:51:34] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[11:04:57] 10DBA, 10Patch-For-Review: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 (10Marostegui)
[11:05:31] 10DBA, 10Phabricator: Upgrade mysql on db1132 (phabricator db master) - https://phabricator.wikimedia.org/T279625 (10Marostegui) @mmodell let me know if the above is enough and I will take care of this myself. Thanks!
[11:21:59] I am going to take a long lunch break, ping me on phone if something breaks horribly
[11:22:18] enjoy
[11:24:06] 10Blocked-on-schema-change, 10DBA: Schema change for renaming new_name_timestamp to rc_new_name_timestamp in recentchanges - https://phabricator.wikimedia.org/T276292 (10Marostegui) a:03Marostegui
[11:25:18] 10Blocked-on-schema-change, 10DBA: Schema change for renaming new_name_timestamp to rc_new_name_timestamp in recentchanges - https://phabricator.wikimedia.org/T276292 (10Marostegui)
[11:40:46] 10Blocked-on-schema-change, 10DBA: Schema change for renaming new_name_timestamp to rc_new_name_timestamp in recentchanges - https://phabricator.wikimedia.org/T276292 (10Marostegui) @Ladsgroup I have altered db1096:3316 and will leave it running for a few days, to make sure we have no code forcing the old inde...
[11:42:56] 10Blocked-on-schema-change, 10DBA: Schema change for renaming new_name_timestamp to rc_new_name_timestamp in recentchanges - https://phabricator.wikimedia.org/T276292 (10Marostegui) s6 progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore1005 [] db2141 [] db2129 [] db2124 [] db2117 [] db2114 [] db2097...
[11:43:13] 10Blocked-on-schema-change, 10DBA: Schema change for renaming new_name_timestamp to rc_new_name_timestamp in recentchanges - https://phabricator.wikimedia.org/T276292 (10Marostegui)
[11:43:50] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10MoritzMuehlenhoff)
[11:53:41] moritzm: I am going to push https://gerrit.wikimedia.org/r/679318
[12:04:22] 10DBA: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 (10Marostegui)
[12:08:23] 10DBA: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 (10Marostegui)
[13:23:44] 10Data-Persistence-Backup: Internal APT repository backup - https://phabricator.wikimedia.org/T276220 (10MoritzMuehlenhoff) >>! In T276220#6994952, @jbond wrote:
> As far as i can tell all the necessary data is in `/srv/wikimedia` which is already being backed up via
Indeed, that contains everything we built lo...
[13:25:10] 10DBA: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 (10jcrespo) Backup metadata looking good:
` root@db1159.eqiad.wmnet[dbbackups]> select * FROM backups order by id desc limit 1\G
*************************** 1. row ***************************...
[13:34:38] 10DBA, 10OTRS: OTRS database is "too large" - https://phabricator.wikimedia.org/T138915 (10LSobanski) The future of this task is heavily dependent on the outcome of https://phabricator.wikimedia.org/T275294 and its follow up tasks.
[13:46:45] 10DBA: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 (10Marostegui)
[13:46:50] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[13:46:52] 10DBA: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 (10Marostegui)
[13:46:56] 10DBA: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448 (10Marostegui) 05Open→03Resolved Thanks! Closing this!
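The `dbbackups` query above ("Backup metadata looking good") is how the backup infrastructure is verified after the master swap: the newest row in the `backups` table should be a recent, successfully finished backup. A hedged sketch of that freshness check against an in-memory SQLite stand-in (the column names and status value here are assumptions for illustration, not the real schema):

```python
import sqlite3
from datetime import datetime, timedelta

# Stand-in for the dbbackups.backups table; columns are illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE backups (id INTEGER PRIMARY KEY, section TEXT, "
           "status TEXT, end_date TEXT)")
db.execute("INSERT INTO backups (section, status, end_date) VALUES ('s7', 'finished', ?)",
           (datetime.utcnow().isoformat(sep=" "),))

# Same shape as the query in the log: newest row first.
status, end_date = db.execute(
    "SELECT status, end_date FROM backups ORDER BY id DESC LIMIT 1").fetchone()
fresh = datetime.utcnow() - datetime.fromisoformat(end_date) < timedelta(hours=24)
print(status, fresh)  # finished True
```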
[14:10:00] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[14:14:32] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[14:20:44] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[14:23:00] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[15:28:25] 10DBA, 10cloud-services-team (Kanban): Upgrade mysql on db1128 (m5 db master) - https://phabricator.wikimedia.org/T279657 (10aborrero) That's ok for us! Ping me on IRC and I will:
* downtime labstore1004
* stop puppet
* shutdown maintain-dbusers
[15:30:47] 10DBA, 10cloud-services-team (Kanban): Upgrade mysql on db1128 (m5 db master) - https://phabricator.wikimedia.org/T279657 (10Marostegui) Thanks @arturo - what about Monday 19th at 09:00 AM UTC?
[15:40:37] 10Data-Persistence-Backup: Internal APT repository backup - https://phabricator.wikimedia.org/T276220 (10jcrespo) Feel free -if you find the time- to make some recovery tests, even on paper- it should be easy and doesn't hurt recovering on e.g. /var/tmp/new-dir and check you would be able to recover everything f...
[20:30:40] PROBLEM - MariaDB sustained replica lag on db2133 is CRITICAL: 2.379e+04 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2133&var-port=9104
[20:46:16] RECOVERY - MariaDB sustained replica lag on db2133 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2133&var-port=9104
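Note that the 20:30:40 alert reports the lag in scientific notation: `2.379e+04` seconds is 23,790 seconds, i.e. roughly six and a half hours of replication lag, which a quick conversion makes obvious:

```python
lag = float("2.379e+04")  # value from the db2133 alert above, in seconds
hours, rem = divmod(int(lag), 3600)
print(f"{lag:.0f} s ≈ {hours}h {rem // 60}m")  # 23790 s ≈ 6h 36m
```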