[00:33:21] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[00:38:09] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[04:39:08] musikanimal bd808 the reason they are behind is cause we are migrating their sanitarium master to 10.4, and once we reimage the host we need to check their tables (which is an operation that blocks the table)
[04:39:21] The check has finished so I am going to check if there is something there and if not, I will resume replication
[04:39:53] sweet, thanks!
[04:40:22] it is all good, so I have restarted replication
[04:40:36] is there a way for us to see what's broken in the pipeline?
[04:41:16] Tendril's activity monitor would have shown the check table on its master, but if you aren't familiar with the topology it might have been hard to catch
[04:41:32] This is all part of https://phabricator.wikimedia.org/T280492
[04:41:37] and s1 was the last host
[04:41:54] DBA, Data-Persistence-Backup, Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (Marostegui) db1106 came back clean, replication restarted
[04:41:55] gotcha. Good to know. Thanks again!
[04:42:04] musikanimal: not sure if you have access to this: https://orchestrator.wikimedia.org/web/cluster/alias/s1
[04:42:23] but there you can see db1106 being the master for db1154 (sanitarium) which is the master for the wikireplicas
[04:58:08] Blocked-on-schema-change, DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (Marostegui) s4 eqiad [] labsdb1011 [] labsdb1010 [x] dbstore1004 [] db1183 [] db1160 [] db1155 [] db1150 [] db1149 [] db1148 [] db1147 [] db1146 [] db1145 [] d...
[04:58:10] Blocked-on-schema-change, DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (Marostegui) s4 eqiad [] labsdb1011 [] labsdb1010 [x] dbstore1004 [] db1183 [] db1160 [] db1155 [] db1150 [] db1149 [] db1148 [] db1147 [] d...
[04:58:13] Blocked-on-schema-change, DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (Marostegui) s4 eqiad [] labsdb1011 [] labsdb1010 [x] dbstore1004 [] db1183 [] db1160 [] db1155 [] db1150 [] db1149 [] db1148 [] db1147 [] db1146 [] db1...
[05:03:29] DBA, Analytics: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui)
[05:03:35] DBA, Analytics: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) p:Triage→High
[05:07:40] legoktm: I am ready for T282621 whenever you are
[05:07:41] T282621: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621
[05:08:12] marostegui: in an hour right?
[05:08:19] yep
[05:09:25] I will be ready then :)
[05:09:33] * legoktm was just about to go afk and thought I got the UTC conversion wrong :p
[05:09:38] haha
[05:24:55] DBA, Data-Services, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission labsdb1010.eqiad.wmnet - https://phabricator.wikimedia.org/T282523 (Marostegui) Stalled→Open
[05:31:11] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[05:50:22] Blocked-on-schema-change, DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (Marostegui)
[05:50:26] Blocked-on-schema-change, DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (Marostegui)
[05:50:34] Blocked-on-schema-change, DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (Marostegui)
[05:51:18] marostegui: do you have a new size for image table on commons?
[05:52:03] Amir1: yeah, pretty much the same, 311GB compressed
[05:52:08] it took around 20h to get it done
[05:52:19] :(((((((((((((((
[05:53:03] yeah, it is going to take like 2 weeks to alter s4 XD
[05:54:09] metadata of top 100 files on its own is 3GB. I have a patch to fix it *cough*
[05:56:47] :-/
[05:57:47] DBA, Data-Persistence-Backup, Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (Marostegui)
[06:04:26] DBA, SRE, Wikimedia-Mailing-lists, Schema-change, User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (Marostegui) Alters deployed
[06:08:10] DBA, SRE, Wikimedia-Mailing-lists, Schema-change, User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (Legoktm) Open→Resolved Yay, thank you! In conclusion the schema changes themselves took a few seconds and we had about 3 m...
[06:40:39] DBA, Data-Services, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission labsdb1010.eqiad.wmnet - https://phabricator.wikimedia.org/T282523 (Marostegui)
[06:44:01] DBA, Data-Persistence-Backup, Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (Marostegui)
[06:44:20] DBA, cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (Marostegui)
[06:44:22] DBA, Orchestrator, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[06:44:24] DBA, Data-Persistence-Backup, Patch-For-Review: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 (Marostegui) Open→Resolved All sanitarium masters are now running Buster+10.4
[06:44:27] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:46:23] DBA, Data-Services, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission labsdb1010.eqiad.wmnet - https://phabricator.wikimedia.org/T282523 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `labsdb1010.eqiad.wmnet`...
[06:47:15] DBA, Data-Services, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission labsdb1010.eqiad.wmnet - https://phabricator.wikimedia.org/T282523 (Marostegui) a:Marostegui→wiki_willy This is ready for #dc-ops
[06:54:53] DBA: Upgrade s3 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283131 (Marostegui)
[06:57:10] DBA: Upgrade s3 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283131 (Marostegui) Open→Stalled p:Triage→Medium a:Kormat Assigning this to @Kormat as she's done s6 already. This should be stalled for now and only to be done once we are happy with s6's performance/s...
[06:57:14] kormat: ^ congratulations!
[06:58:07] DBA: Upgrade s3 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283131 (Marostegui)
[06:58:34] DBA: Upgrade s3 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283131 (jcrespo)
[06:59:22] DBA: Upgrade s3 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283131 (Marostegui)
[07:00:34] Blocked-on-schema-change, DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (Marostegui)
[07:00:42] Blocked-on-schema-change, DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (Marostegui)
[07:00:50] Blocked-on-schema-change, DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (Marostegui)
[07:07:02] DBA, Patch-For-Review: Upgrade s3 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283131 (jcrespo)
[07:19:58] DBA, Analytics: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) @razzi @Ottomata you can use db1125 to replace this host. Most likely it needs to be renamed to dbstore1006: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging Let me know w...
[07:29:18] Blocked-on-schema-change, DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (Marostegui)
[07:29:43] Blocked-on-schema-change, DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (Marostegui)
[07:30:09] Blocked-on-schema-change, DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (Marostegui)
[08:44:41] marostegui: as i said yesterday, i'm overwhelmed by your generosity ;)
[08:59:26] jynus: is it ok if i assign https://phabricator.wikimedia.org/T280751 to you, as the remaining step is "Cleanup (remove) old backup sources from both DCs"?
[08:59:32] +1
[08:59:49] as mentioned, the plan was to remove it next monday
[09:00:02] cool
[09:00:02] DBA, Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (Kormat) a:Kormat→jcrespo
[09:02:22] DBA, Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (jcrespo)
[09:02:36] DBA, Analytics: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (LSobanski) Timing wise, this should happen before DC switchover happens (likely the week of June 21st) as we'll have our hands full during that time. This makes things tricky as the three weeks before that date...
[09:38:52] hi, we got a page from icinga:
[09:38:53] PROBLEM - DNS on labsdb1010.mgmt is CRITICAL: Domain labsdb1010.mgmt.eqiad.wmnet was not found by the server
[09:39:07] that host is being decommissioned, no?
[09:39:09] arturo: see -operations
[09:39:15] it was decommissioned earlier today
[09:39:19] so not sure why that has showed p
[09:39:23] showed up
[09:39:42] thanks marostegui
[09:41:35] Blocked-on-schema-change, DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (Marostegui)
[09:41:37] Blocked-on-schema-change, DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (Marostegui)
[09:41:39] Blocked-on-schema-change, DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (Marostegui)
[09:50:58] Blocked-on-schema-change, DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (Marostegui) s3 eqiad [] labsdb1011 [x] dbstore1004 [] db1179 [] db1175 [] db1171 [] db1166 [] db1157 [] db1154 [] db1123 [] db1112 [] db1102 [] clouddb...
[09:51:00] Blocked-on-schema-change, DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (Marostegui) s3 eqiad [] labsdb1011 [x] dbstore1004 [] db1179 [] db1175 [] db1171 [] db1166 [] db1157 [] db1154 [] db1123 [] db1112 [] db110...
[09:51:03] Blocked-on-schema-change, DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (Marostegui) s3 eqiad [] labsdb1011 [x] dbstore1004 [] db1179 [] db1175 [] db1171 [] db1166 [] db1157 [] db1154 [] db1123 [] db1112 [] db1102 [] clouddb1021 []...
[11:25:58] Blocked-on-schema-change, DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (Marostegui)
[11:26:12] Blocked-on-schema-change, DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (Marostegui)
[11:26:25] Blocked-on-schema-change, DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (Marostegui)
[13:45:18] DBA, Analytics: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Ottomata) What's the urgency of 85% full? Could it wait until Q2 maybe?
[13:50:56] DBA, Analytics: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) >>! In T283125#7098281, @Ottomata wrote: > What's the urgency of 85% full? Could it wait until Q2 maybe? If it increases 3% more, that means we'll not be able to alter the `image` table anymore. At...
[14:22:18] DBA, MediaWiki-Parser, Parsoid, Performance-Team: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 (Marostegui) @Krinkle did the manual purge on pc1010 finish? If so, are we ok to go ahead and optimize the tables there so we can proc...
[14:44:39] DBA, Analytics: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Ottomata) Ok, so db1125 is available for reimaging now? I'll bring this up in our standup today, and see if we can get to work on it next week or after.
[14:46:38] DBA, Analytics: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) Yes, it can be done anytime.
[15:20:36] DBA, MediaWiki-Parser, Parsoid, Performance-Team: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 (Krinkle) @kormat Yes, the purge has completed on all 255 tables. Go ahead!
[16:15:58] DBA, Data-Services, cloud-services-team (Kanban): Create backups of user tables from decommissioned database servers - https://phabricator.wikimedia.org/T183758 (Urbanecm) Hello, just a friendly reminder, I noticed this is still in the scratch volume. Do we still need three years old backup? Or can...
[16:55:58] DBA, Analytics: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (razzi) a:razzi Thanks for calling this out @Marostegui and offering db1125. I'll get started on the reimage of db1125.
[16:58:30] DBA, Analytics, Analytics-Kanban: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (razzi)
[20:08:58] DBA, Analytics, Analytics-Kanban: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: `db1125.eqiad.wmnet` - db1125.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found physi...