[00:31:01] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 151.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[00:33:37] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 75.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[00:38:43] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[00:41:15] RECOVERY - MariaDB sustained replica lag on pc2009 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[03:11:24] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul)
[03:17:22] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) @jcrespo no IP change, just switch port change
[03:19:31] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul)
[04:44:51] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10jijiki)
[04:49:46] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui)
[04:52:36] 10DBA, 10SRE, 10ops-codfw: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (10Marostegui) p:05Triage→03Medium a:03Papaul @papaul this host is under support, can we get a new disk from DELL? This is the s2 codfw master
[05:23:58] 10DBA, 10DiscussionTools, 10Editing-team, 10Performance-Team, and 2 others: Reduce parser cache retention temporarily for DiscussionTools - https://phabricator.wikimedia.org/T280605 (10Marostegui) @Krinkle this is ready to go whenever you are done with the manual script run: https://gerrit.wikimedia.org/r/...
[05:44:35] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui)
[05:44:51] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui) s7 sanitarium master db1079 has been replaced by db1158
[05:45:42] 10DBA, 10decommission-hardware: decommission db1079.eqiad.wmnet - https://phabricator.wikimedia.org/T282079 (10Marostegui)
[05:46:07] 10DBA, 10decommission-hardware: decommission db1079.eqiad.wmnet - https://phabricator.wikimedia.org/T282079 (10Marostegui) Let's wait a few days to make sure its replacement (db1158) is working fine.
[05:46:34] 10DBA, 10decommission-hardware: decommission db1079.eqiad.wmnet - https://phabricator.wikimedia.org/T282079 (10Marostegui)
[05:46:36] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui)
[05:46:41] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[05:46:56] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[05:57:25] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (10Marostegui)
[05:57:58] 10Blocked-on-schema-change, 10DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (10Marostegui)
[05:58:09] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (10Marostegui)
[05:58:21] 10Blocked-on-schema-change, 10DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (10Marostegui)
[06:00:55] 10Blocked-on-schema-change, 10DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (10Marostegui) s5 progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1161 [] db1154 [] db1150 [] db1145 [] db1144 [] db1130 [] db1113 [] db111...
[06:00:57] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (10Marostegui) s5 progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1161 [] db1154 [] db1150 [] db1145 [] db1144 [] db1130 [] db1113...
[06:01:00] 10Blocked-on-schema-change, 10DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (10Marostegui) s5 progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1161 [] db1154 [] db1150 [] db1145 [] db1144 [] db113...
[06:29:01] 10DBA, 10Orchestrator: Cleanup heartbeat.heartbeat on s8 - https://phabricator.wikimedia.org/T281830 (10Marostegui) 05Open→03Resolved a:03Marostegui This is all clean. Of course, once we switch the master we'll need to remove the old server_id for db1104 (171970645) before adding s8 to orchestrator
[06:29:03] 10DBA, 10Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (10Marostegui)
[06:29:15] 10DBA, 10Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (10Marostegui)
[07:24:10] I am going to stop dbprov2002 at some point today. If you want to provision s1, s2 or s7 in the next few hours, you had better do it now
[07:24:49] (we will be able to switch it on in an emergency, but this is a warning for regular maintenance)
[07:26:09] thanks
[07:26:40] I will ping here again before shutdown, in case there is ongoing activity, around CET midday
[07:27:00] 10DBA, 10Orchestrator: Cleanup heartbeat.heartbeat on s3 - https://phabricator.wikimedia.org/T281827 (10Marostegui) 05Open→03Resolved a:03Marostegui This is all clean. Of course, once we switch the master we'll need to remove the old server_id for db1123 (171978787) before adding s3 to orchestrator
[07:27:02] 10DBA, 10Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (10Marostegui)
[07:27:16] 10DBA, 10Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (10Marostegui)
[07:27:26] 10DBA, 10Orchestrator, 10SRE: Base replication lag detection on heartbeat - https://phabricator.wikimedia.org/T268316 (10Marostegui)
[07:27:28] 10DBA, 10Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (10Marostegui) 05Open→03Resolved a:03Marostegui All done
[07:30:03] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10jcrespo) I am going to remove the db2098 s3 10.1 instance, now that db2139 has been working fine for a while. A last backup of the old instance will be availa...
[08:05:40] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10jcrespo) db2098 s3 should be gone now, and will soon be gone from grafana/prometheus.
[08:31:03] 10DBA, 10Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (10Kormat) `mysqlcheck --all-databases` completed successfully on db2129.
[08:48:19] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui) s8 sanitarium master db1087 has been replaced by db1167
[08:48:30] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui)
[08:49:23] 10DBA, 10decommission-hardware: decommission db1087.eqiad.wmnet - https://phabricator.wikimedia.org/T282093 (10Marostegui)
[08:49:43] 10DBA, 10decommission-hardware: decommission db1087.eqiad.wmnet - https://phabricator.wikimedia.org/T282093 (10Marostegui)
[08:49:45] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui)
[08:49:48] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[08:50:01] 10DBA, 10decommission-hardware: decommission db1087.eqiad.wmnet - https://phabricator.wikimedia.org/T282093 (10Marostegui) Wait a few days to make sure its replacement (db1167) works fine.
[08:50:24] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[08:51:31] 10DBA, 10decommission-hardware: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 (10Marostegui)
[08:52:07] 10DBA, 10decommission-hardware: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 (10Marostegui) a:03Kormat Wait a few days to make sure its replacement (db1165) works fine.
[08:52:41] 10DBA, 10decommission-hardware: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 (10Marostegui)
[08:52:46] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui)
[08:52:51] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[08:53:05] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[08:54:11] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) 05Open→03Resolved All hosts that are scheduled for decommissioning are now ready (but waiting a few days to make sure their repl...
[08:56:07] jynus: just for my own task organization, you are not currently working on setting up the two media backup hosts, right? https://phabricator.wikimedia.org/T275633
[08:56:37] I am, but I have problems getting puppet to work
[08:56:48] ah ok, then I will leave it in-progress
[08:56:56] No rush, it was just for the dashboard organization :)
[08:57:19] you can move the DBA dashboard
[08:57:37] and leave the persistence one pending
[08:57:47] the backup one, I mean
[08:58:03] or in this case, add it
[08:58:06] sure
[08:58:24] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (10Marostegui)
[08:59:05] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (10jcrespo) a:05Marostegui→03jcrespo
[09:40:06] 10Data-Persistence-Backup, 10Analytics-Clusters: Evaluate possible solutions to backup Analytics Hadoop's HDFS data - https://phabricator.wikimedia.org/T277015 (10elukey) @jcrespo quick question - if we want to move forward with this, do we need hardware planned for next fiscal? I know that the use case is ver...
[09:49:21] 10Data-Persistence-Backup, 10Analytics-Clusters: Evaluate possible solutions to backup Analytics Hadoop's HDFS data - https://phabricator.wikimedia.org/T277015 (10jcrespo) > do we need hardware planned for next fiscal Absolutely yes. I thought that was clear, and something you were handling on your own or with my...
[10:12:20] I am going to stop dbprov2002 soon
[10:12:30] as announced
[10:21:34] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10jcrespo) @Papaul could you turn dbprov2002 back on when you finish all needed maintenance? That's all it will need to be back in service. Thank you.
[11:18:55] 10DBA, 10Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db1173.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20210506111...
[11:43:16] 10DBA, 10Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1173.eqiad.wmnet'] ` and were **ALL** successful.
[11:54:26] 10DBA, 10Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (10Kormat)
[11:54:47] 10DBA, 10Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (10Kormat) db1173 (candidate master in eqiad) reimaged to buster, `mysqlcheck --all-databases` running now.
[13:21:00] 10DBA: Switchover s6 from db1131 to db1173 - https://phabricator.wikimedia.org/T282124 (10Kormat)
[13:21:30] \o/
[13:24:48] 10DBA: Switchover s6 from db1131 to db1173 - https://phabricator.wikimedia.org/T282124 (10Kormat)
[13:40:50] 10DBA: Switchover s6 from db1131 to db1173 - https://phabricator.wikimedia.org/T282124 (10Kormat)
[13:57:18] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul)
[14:12:48] 10DBA: Switchover s6 from db1131 to db1173 - https://phabricator.wikimedia.org/T282124 (10Kormat)
[14:14:55] 10DBA: Switchover s6 from db1131 to db1173 - https://phabricator.wikimedia.org/T282124 (10Kormat)
[15:29:49] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul)
[15:31:15] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul)
[15:55:10] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul)
[16:45:39] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul)
[16:58:55] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10RKemper)
[17:27:41] 10DBA, 10DiscussionTools, 10Editing-team, 10Performance-Team, and 2 others: Reduce parser cache retention temporarily for DiscussionTools - https://phabricator.wikimedia.org/T280605 (10Marostegui) Merged the script. Tonight the script will purge everything older than 21 days.
[17:28:05] 10DBA, 10DiscussionTools, 10Editing-team, 10Performance-Team, and 2 others: Reduce parser cache retention temporarily for DiscussionTools - https://phabricator.wikimedia.org/T280605 (10Marostegui)
[17:31:18] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul)
[18:10:19] 10DBA, 10Platform Engineering, 10User-brennen, 10Wikimedia-production-error: Possible uptick in "DBTransactionSizeError: Transaction spent [n] second(s) in writes, exceeding the limit of 3" - https://phabricator.wikimedia.org/T282173 (10brennen)
[19:20:18] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10BBlack)
[19:38:22] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[19:45:46] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[22:11:27] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) @BBlack I had meetings from 12:30 PM to 4 PM so I didn't have the chance to work on the cp nodes. You can re-pool those since I will not be able to get back on those until th...