[06:54:25] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1173.eqiad.wmnet'] ` The log ca...
[06:54:55] Blocked-on-schema-change: Schema change for renaming two indexes of site_identifiers - https://phabricator.wikimedia.org/T273361 (Marostegui)
[06:57:45] Blocked-on-schema-change: Schema change for renaming two indexes of site_identifiers - https://phabricator.wikimedia.org/T273361 (Marostegui)
[07:00:03] Blocked-on-schema-change: Schema change for renaming two indexes of site_identifiers - https://phabricator.wikimedia.org/T273361 (Marostegui)
[07:15:32] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1173.eqiad.wmnet'] ` and were **ALL** successful.
[07:17:01] DBA, decommission-hardware: decommission db1094.eqiad.wmnet - https://phabricator.wikimedia.org/T273710 (Marostegui)
[07:17:28] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:17:30] DBA, decommission-hardware: decommission db1094.eqiad.wmnet - https://phabricator.wikimedia.org/T273710 (Marostegui)
[07:17:43] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[08:11:44] Blocked-on-schema-change: Schema change for renaming two indexes of site_identifiers - https://phabricator.wikimedia.org/T273361 (Marostegui)
[08:14:18] Blocked-on-schema-change: Schema change for renaming two indexes of site_identifiers - https://phabricator.wikimedia.org/T273361 (Marostegui)
[08:19:18] DBA, OTRS, Recommendation-API, Research, Performance-Team (Radar): Restart m2 database master (db1107) - https://phabricator.wikimedia.org/T272964 (Marostegui) pre-restart steps are done
[09:07:41] DBA, OTRS, Recommendation-API, Research, Performance-Team (Radar): Restart m2 database master (db1107) - https://phabricator.wikimedia.org/T272964 (Marostegui) This has been delayed a little bit as there's an unrelated incident going on.
[09:40:53] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1173 is now replicating.
[09:46:53] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[10:03:30] DBA, Orchestrator: Add m* and es4/es5 sections to Orchestrator - https://phabricator.wikimedia.org/T272568 (Marostegui)
[10:03:32] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[10:04:04] DBA, OTRS, Recommendation-API, Research, Performance-Team (Radar): Restart m2 database master (db1107) - https://phabricator.wikimedia.org/T272964 (Marostegui) Open→Resolved This was done. RO start: 09:58:05 RO stop: 09:58:48 All the services recovered without human intervention.
[10:04:06] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[10:08:04] DBA, Orchestrator: Add m* and es4/es5 sections to Orchestrator - https://phabricator.wikimedia.org/T272568 (Marostegui)
[10:08:08] DBA, Orchestrator: Add m* and es4/es5 sections to Orchestrator - https://phabricator.wikimedia.org/T272568 (Marostegui) Open→Resolved m2 added to orchestrator. This is all done
[10:08:10] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[10:08:22] marostegui: 👏
[10:08:29] \o/
[10:09:32] a logical recovery of a full section takes 12h: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&var-server=db1171&var-datasource=thanos&var-cluster=mysql&from=1612262579249&to=1612312604455
[10:16:04] DBA, Data-Services: Clean up heartbeat table on clouddb hosts - https://phabricator.wikimedia.org/T273593 (Marostegui) s1 clouddb replicas cleaned: ` mysqlroot@cumin1001:/home/marostegui# for i in clouddb1017:3311 clouddb1013:3311; do echo $i; mysql.py -h$i heartbeat_p -e "select * from heartbeat;" > don...
[10:16:16] DBA, Data-Services: Clean up heartbeat table on clouddb hosts - https://phabricator.wikimedia.org/T273593 (Marostegui)
[10:23:24] "Could not read data from enwiki.blobs_cluster26: Server shutdown in progress"
[10:23:36] now that is probably a reboot, right?
[10:24:05] that's old, right?
[10:24:14] 2021-02-02 06:23:54
[10:24:24] yes, from yesterday
[10:24:30] I did a restart yes
[10:24:36] ok, then all good
[10:24:44] last time we found no explanation for the errors
[10:24:52] this time is a normal thing
[10:26:29] backups of es at low concurrency now take 10-11 hours
[10:26:53] DBA, Data-Services: Clean up heartbeat table on clouddb hosts - https://phabricator.wikimedia.org/T273593 (Marostegui) s3 clouddb replicas cleaned: ` root@cumin1001:/home/marostegui# for i in clouddb1017:3313 clouddb1013:3313; do echo $i; mysql.py -h$i heartbeat_p -e "select * from heartbeat;"; done clou...
[10:27:07] DBA, Data-Services: Clean up heartbeat table on clouddb hosts - https://phabricator.wikimedia.org/T273593 (Marostegui)
[10:40:30] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (jcrespo)
[10:43:38] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (jcrespo)
[10:46:43] DBA, Data-Services: Clean up heartbeat table on clouddb hosts - https://phabricator.wikimedia.org/T273593 (Marostegui) s2 clouddb replicas cleaned: ` root@cumin1001:/home/marostegui# for i in clouddb1014:3312 clouddb1018:3312; do echo $i; mysql.py -h$i heartbeat_p -e "select * from heartbeat;"; done clo...
[10:46:58] DBA, Data-Services: Clean up heartbeat table on clouddb hosts - https://phabricator.wikimedia.org/T273593 (Marostegui)
[11:02:09] DBA, Data-Services: Clean up heartbeat table on clouddb hosts - https://phabricator.wikimedia.org/T273593 (Marostegui) s7 clouddb replicas cleaned: ` root@cumin1001:/home/marostegui# for i in clouddb1014:3317 clouddb1018:3317; do echo $i; mysql.py -h$i heartbeat_p -e "select * from heartbeat;"; done clou...
[11:02:24] DBA, Data-Services: Clean up heartbeat table on clouddb hosts - https://phabricator.wikimedia.org/T273593 (Marostegui)
[11:11:54] DBA, Data-Services: Clean up heartbeat table on clouddb hosts - https://phabricator.wikimedia.org/T273593 (Marostegui) s6 clouddb replicas cleaned: ` root@cumin1001:/home/marostegui# for i in clouddb1015:3316 clouddb1019:3316; do echo $i; mysql.py -h$i heartbeat_p -e "select * from heartbeat;"; done clou...
[11:12:11] DBA, Data-Services: Clean up heartbeat table on clouddb hosts - https://phabricator.wikimedia.org/T273593 (Marostegui)
[11:24:56] DBA, Data-Services: Clean up heartbeat table on clouddb hosts - https://phabricator.wikimedia.org/T273593 (Marostegui) s5 clouddb replicas cleaned: `` root@cumin1001:/home/marostegui# for i in clouddb1016:3315 clouddb1020:3315; do echo $i; mysql.py -h$i heartbeat_p -e "select * from heartbeat;"; done clo...
[11:25:10] DBA, Data-Services: Clean up heartbeat table on clouddb hosts - https://phabricator.wikimedia.org/T273593 (Marostegui)
[11:35:42] DBA, Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (Marostegui)
[11:35:51] DBA, Data-Services: Clean up heartbeat table on clouddb hosts - https://phabricator.wikimedia.org/T273593 (Marostegui)
[11:35:57] DBA, Data-Services: Clean up heartbeat table on clouddb hosts - https://phabricator.wikimedia.org/T273593 (Marostegui) Open→Resolved a:Marostegui s8 clouddb replicas cleaned: ` root@cumin1001:/home/marostegui# for i in clouddb1016:3318 clouddb1020:3318; do echo $i; mysql.py -h$i heartbeat_p -...
[13:28:34] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui) x1: [x] dbstore1005 [] db1137 [] db1120 [] db1103 [x] db1102
[13:28:59] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[13:45:15] DBA, Cognate, ContentTranslation, Growth-Team, and 8 others: Restart x1 database master - https://phabricator.wikimedia.org/T273758 (Marostegui)
[13:45:52] DBA, Cognate, ContentTranslation, Growth-Team, and 8 others: Restart x1 database master - https://phabricator.wikimedia.org/T273758 (Marostegui) p:Triage→Medium
[13:47:48] DBA, Cognate, ContentTranslation, Growth-Team, and 9 others: Restart x1 database master - https://phabricator.wikimedia.org/T273758 (Marostegui)
[13:54:00] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Restart x1 database master - https://phabricator.wikimedia.org/T273758 (Trizek-WMF) Wow, this will happen soon! If I read it correctly, it will disturb ContentTranslation, Flow, Echo (all Echo notifications at en.wp and all X-wiki no...
[13:59:23] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Restart x1 database master - https://phabricator.wikimedia.org/T273758 (Marostegui) >>! In T273758#6800123, @Trizek-WMF wrote: > Wow, this will happen soon! It will happen in 14 days. > > If I read it correctly, it will disturb Con...
[21:06:32] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Cmjohnson)
[21:07:32] Data-Persistence-Backup, SRE, decommission-hardware, ops-eqiad: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (Cmjohnson)
[21:07:45] Data-Persistence-Backup, SRE, decommission-hardware, ops-eqiad: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (Cmjohnson) Both have been removed from rack and netbox updated
[21:07:51] Data-Persistence-Backup, SRE, decommission-hardware, ops-eqiad: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (Cmjohnson) Open→Resolved