[00:31:01] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 151.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[00:33:37] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 75.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[00:38:43] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[00:41:15] RECOVERY - MariaDB sustained replica lag on pc2009 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[03:11:24] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul)
[03:17:22] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) @jcrespo no IP change, just switch port change
[03:19:31] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul)
[04:44:51] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10jijiki)
[04:49:46] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui)
[04:52:36] 10DBA, 10SRE, 10ops-codfw: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (10Marostegui) p:05Triage→03Medium a:03Papaul @papaul this host is under support, can we get a new disk from DELL? This is the s2 codfw master
[05:23:58] 10DBA, 10DiscussionTools, 10Editing-team, 10Performance-Team, and 2 others: Reduce parser cache retention temporarily for DiscussionTools - https://phabricator.wikimedia.org/T280605 (10Marostegui) @Krinkle this is ready to go whenever you are done with the manual script run: https://gerrit.wikimedia.org/r/...
[05:44:35] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui)
[05:44:51] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui) s7 sanitarium master db1079 has been replaced by db1158
[05:45:42] 10DBA, 10decommission-hardware: decommission db1079.eqiad.wmnet - https://phabricator.wikimedia.org/T282079 (10Marostegui)
[05:46:07] 10DBA, 10decommission-hardware: decommission db1079.eqiad.wmnet - https://phabricator.wikimedia.org/T282079 (10Marostegui) Let's wait a few days to make sure its replacement (db1158) is working fine.
[05:46:34] 10DBA, 10decommission-hardware: decommission db1079.eqiad.wmnet - https://phabricator.wikimedia.org/T282079 (10Marostegui)
[05:46:36] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui)
[05:46:41] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[05:46:56] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[05:57:25] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (10Marostegui)
[05:57:58] 10Blocked-on-schema-change, 10DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (10Marostegui)
[05:58:09] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (10Marostegui)
[05:58:21] 10Blocked-on-schema-change, 10DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (10Marostegui)
[06:00:55] 10Blocked-on-schema-change, 10DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (10Marostegui) s5 progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1161 [] db1154 [] db1150 [] db1145 [] db1144 [] db1130 [] db1113 [] db111...
[06:00:57] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (10Marostegui) s5 progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1161 [] db1154 [] db1150 [] db1145 [] db1144 [] db1130 [] db1113...
[06:01:00] 10Blocked-on-schema-change, 10DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (10Marostegui) s5 progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1161 [] db1154 [] db1150 [] db1145 [] db1144 [] db113...
[06:29:01] 10DBA, 10Orchestrator: Cleanup heartbeat.heartbeat on s8 - https://phabricator.wikimedia.org/T281830 (10Marostegui) 05Open→03Resolved a:03Marostegui This is all clean. Of course, once we switch the master we'll need to remove the old server_id for db1104 (171970645) before adding s8 to orchestrator
[06:29:03] 10DBA, 10Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (10Marostegui)
[06:29:15] 10DBA, 10Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (10Marostegui)
[07:24:10] I am going to stop dbprov2002 at some point today. If you want to provision s1, s2 or s7 in the next few hours, you had better do it now
[07:24:49] (we will be able to switch it on in an emergency, but this is a warning for regular maintenance)
[07:26:09] thanks
[07:26:40] I will ping here again before shutdown, in case there is ongoing activity, around CET midday
[07:27:00] 10DBA, 10Orchestrator: Cleanup heartbeat.heartbeat on s3 - https://phabricator.wikimedia.org/T281827 (10Marostegui) 05Open→03Resolved a:03Marostegui This is all clean. Of course, once we switch the master we'll need to remove the old server_id for db1123 (171978787) before adding s3 to orchestrator
[07:27:02] 10DBA, 10Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (10Marostegui)
[07:27:16] 10DBA, 10Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (10Marostegui)
[07:27:26] 10DBA, 10Orchestrator, 10SRE: Base replication lag detection on heartbeat - https://phabricator.wikimedia.org/T268316 (10Marostegui)
[07:27:28] 10DBA, 10Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (10Marostegui) 05Open→03Resolved a:03Marostegui All done
[07:30:03] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10jcrespo) I am going to remove the db2098 s3 10.1 instance, now that db2139 has been working fine for a while. A last backup of the old instance will be availa...
[08:05:40] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10jcrespo) db2098 s3 should be gone now, and will soon be gone from grafana/prometheus.
[08:31:03] 10DBA, 10Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (10Kormat) `mysqlcheck --all-databases` completed successfully on db2129.
[08:48:19] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui) s8 sanitarium master db1087 has been replaced by db1167
[08:48:30] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui)
[08:49:23] 10DBA, 10decommission-hardware: decommission db1087.eqiad.wmnet - https://phabricator.wikimedia.org/T282093 (10Marostegui)
[08:49:43] 10DBA, 10decommission-hardware: decommission db1087.eqiad.wmnet - https://phabricator.wikimedia.org/T282093 (10Marostegui)
[08:49:45] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui)
[08:49:48] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[08:50:01] 10DBA, 10decommission-hardware: decommission db1087.eqiad.wmnet - https://phabricator.wikimedia.org/T282093 (10Marostegui) Wait a few days to make sure its replacement (db1167) works fine.
[08:50:24] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[08:51:31] 10DBA, 10decommission-hardware: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 (10Marostegui)
[08:52:07] 10DBA, 10decommission-hardware: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 (10Marostegui) a:03Kormat Wait a few days to make sure its replacement (db1165) works fine.
[08:52:41] 10DBA, 10decommission-hardware: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 (10Marostegui)
[08:52:46] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (10Marostegui)
[08:52:51] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[08:53:05] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[08:54:11] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) 05Open→03Resolved All hosts that are scheduled for decommissioning are now ready (but waiting a few days to make sure their repl...
[08:56:07] jynus: just for my own task organization, you are not currently working on setting up the two media backup hosts, right? https://phabricator.wikimedia.org/T275633
[08:56:37] I am, but I have problems getting puppet to work
[08:56:48] ah ok, then I will leave it in-progress
[08:56:56] No rush, it was just for the dashboard organization :)
[08:57:19] you can move the DBA dashboard
[08:57:37] and leave the persistence one pending
[08:57:47] the backup one, I mean
[08:58:03] or in this case, add it
[08:58:06] sure
[08:58:24] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (10Marostegui)
[08:59:05] 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (10jcrespo) a:05Marostegui→03jcrespo
[09:40:06] 10Data-Persistence-Backup, 10Analytics-Clusters: Evaluate possible solutions to backup Analytics Hadoop's HDFS data - https://phabricator.wikimedia.org/T277015 (10elukey) @jcrespo quick question - if we want to move forward with this, do we need hardware planned for next fiscal? I know that the use case is ver...
[09:49:21] 10Data-Persistence-Backup, 10Analytics-Clusters: Evaluate possible solutions to backup Analytics Hadoop's HDFS data - https://phabricator.wikimedia.org/T277015 (10jcrespo) > do we need hardware planned for next fiscal Absolutely yes. I thought that was clear, and something you were handling on your own or with my...
[10:12:20] I am going to stop dbprov2002 soon
[10:12:30] as announced
[10:21:34] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10jcrespo) @Papaul could you turn dbprov2002 back on when you finish all needed maintenance? That's all it will need to be back in service. Thank you.
[11:18:55] 10DBA, 10Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db1173.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20210506111...
[11:43:16] 10DBA, 10Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1173.eqiad.wmnet'] ` and were **ALL** successful.
[11:54:26] 10DBA, 10Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (10Kormat)
[11:54:47] 10DBA, 10Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (10Kormat) db1173 (candidate master in eqiad) reimaged to buster, `mysqlcheck --all-databases` running now.
[13:21:00] 10DBA: Switchover s6 from db1131 to db1173 - https://phabricator.wikimedia.org/T282124 (10Kormat)
[13:21:30] \o/
[13:24:48] 10DBA: Switchover s6 from db1131 to db1173 - https://phabricator.wikimedia.org/T282124 (10Kormat)
[13:40:50] 10DBA: Switchover s6 from db1131 to db1173 - https://phabricator.wikimedia.org/T282124 (10Kormat)
[13:57:18] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul)
[14:12:48] 10DBA: Switchover s6 from db1131 to db1173 - https://phabricator.wikimedia.org/T282124 (10Kormat)
[14:14:55] 10DBA: Switchover s6 from db1131 to db1173 - https://phabricator.wikimedia.org/T282124 (10Kormat)
[15:29:49] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul)
[15:31:15] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul)
[15:55:10] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul)
[16:45:39] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul)
[16:58:55] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10RKemper)
[17:27:41] 10DBA, 10DiscussionTools, 10Editing-team, 10Performance-Team, and 2 others: Reduce parser cache retention temporarily for DiscussionTools - https://phabricator.wikimedia.org/T280605 (10Marostegui) Merged the script. Tonight the script will purge everything older than 21 days.
[17:28:05] 10DBA, 10DiscussionTools, 10Editing-team, 10Performance-Team, and 2 others: Reduce parser cache retention temporarily for DiscussionTools - https://phabricator.wikimedia.org/T280605 (10Marostegui)
[17:31:18] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul)
[18:10:19] 10DBA, 10Platform Engineering, 10User-brennen, 10Wikimedia-production-error: Possible uptick in "DBTransactionSizeError: Transaction spent [n] second(s) in writes, exceeding the limit of 3" - https://phabricator.wikimedia.org/T282173 (10brennen)
[19:20:18] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10BBlack)
[19:38:22] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[19:45:46] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[22:11:27] 10DBA, 10SRE, 10ops-codfw, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) @BBlack I had meetings from 12:30 PM to 4 PM so I didn't have the chance to work on the cp nodes. You can re-pool those since I will not be able to get back on those until th...