[00:05:08] PROBLEM - MariaDB sustained replica lag on es1022 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1022&var-port=9104
[00:06:24] RECOVERY - MariaDB sustained replica lag on es1022 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1022&var-port=9104
[06:01:03] DBA, wikitech.wikimedia.org, cloud-services-team (Kanban): wikitech database has almost all of its varbinary fields wrong - https://phabricator.wikimedia.org/T269348 (Marostegui) I still think it needs to be tested first on `labtestwiki`: T269348#6737173
[06:05:36] DBA, SRE, ops-eqiad: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 (Marostegui) I have started to repool this host back.
[06:50:55] Blocked-on-schema-change: Schema change for dropping defaults of ipb_timestamp and ipb_expiry - https://phabricator.wikimedia.org/T273358 (Marostegui)
[06:52:50] Blocked-on-schema-change: Schema change for dropping defaults of ipb_timestamp and ipb_expiry - https://phabricator.wikimedia.org/T273358 (Marostegui)
[07:13:39] Blocked-on-schema-change: Schema change for dropping defaults of ipb_timestamp and ipb_expiry - https://phabricator.wikimedia.org/T273358 (Marostegui)
[07:21:03] Blocked-on-schema-change: Schema change for dropping defaults of ipb_timestamp and ipb_expiry - https://phabricator.wikimedia.org/T273358 (Marostegui)
[07:53:59] Blocked-on-schema-change: Schema change for dropping defaults of ipb_timestamp and ipb_expiry - https://phabricator.wikimedia.org/T273358 (Marostegui)
[07:56:19] Blocked-on-schema-change: Schema change for dropping defaults of ipb_timestamp and ipb_expiry - https://phabricator.wikimedia.org/T273358 (Marostegui) s1 (enwiki) needs to be done host by host
[07:57:52] Blocked-on-schema-change: Schema change for dropping defaults of ipb_timestamp and ipb_expiry - https://phabricator.wikimedia.org/T273358 (Marostegui)
[08:00:35] Blocked-on-schema-change: Schema change for dropping defaults of ipb_timestamp and ipb_expiry - https://phabricator.wikimedia.org/T273358 (Marostegui)
[08:01:40] Blocked-on-schema-change: Schema change for dropping defaults of ipb_timestamp and ipb_expiry - https://phabricator.wikimedia.org/T273358 (Marostegui) s1 eqiad [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [x] db1169 [] db1154 [] db1140 [] db1139 [] db1135 [] db1134 [] db1133 [] db...
[08:05:01] Blocked-on-schema-change: Schema change for dropping defaults of ipb_timestamp and ipb_expiry - https://phabricator.wikimedia.org/T273358 (Marostegui)
[08:06:18] Blocked-on-schema-change: Schema change for dropping defaults of ipb_timestamp and ipb_expiry - https://phabricator.wikimedia.org/T273358 (Marostegui) Schema change started on s3, it will take around 15h
[08:29:24] hey dbas
[08:29:34] Phabricator can't reach the mysql database for some reason
[08:29:35] filing a task
[08:29:39] uh?
[08:29:50] checking
[08:30:05] there is some network flapping maybe that is related
[08:30:15] let's check with arzhel
[08:30:16] and of course I can't file a task bah
[08:30:19] yeah
[08:30:38] m3-master.eqiad.wmnet failed with error #2002: Cannot assign requested address.
[08:30:48] the db looks fine
[08:31:13] hashar: let's go to -sre please as arzhel is there
[08:31:50] okkkk
[09:36:42] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1170:3312 and db1170:3317 are now replicating
[09:43:56] Data-Persistence-Backup, SRE, Goal, Patch-For-Review: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (jcrespo) I will open a new task for this issue and add you there. While this is not a blocker for backup generation, it would be for an emergency, and we should...
[10:20:13] Data-Persistence-Backup, SRE: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (jcrespo)
[10:32:06] marostegui, jynus: doublechecking: ok to reboot cumin2001 now from DB/backups perspective?
[10:32:13] yes
[10:32:15] +1 from me
[10:32:17] at least for me
[10:32:37] today is the best day in fact
[10:33:55] thanks, starting in a bit, then
[10:51:44] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[10:53:09] DBA, decommission-hardware: decommission db1075.eqiad.wmnet - https://phabricator.wikimedia.org/T274235 (Marostegui)
[10:53:49] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[11:19:05] moritzm, did you finish reboot?
[11:19:34] (asking because of keyholder alert)
[11:19:55] the regular keyholder is armed, but waiting for Arzhel to enter the keyholder passphrase for Homer
[11:20:02] it's restricted to netops in pwstore
[11:20:37] apart from Homer cumin2001 is good to go
[11:24:20] DBA, decommission-hardware: decommission db1081.eqiad.wmnet - https://phabricator.wikimedia.org/T273040 (Marostegui)
[11:33:01] oh, i didn't know
[11:33:08] Blocked-on-schema-change, DBA: Schema change for dropping defaults of ipb_timestamp and ipb_expiry - https://phabricator.wikimedia.org/T273358 (Marostegui)
[11:36:12] Blocked-on-schema-change: Schema change for renaming name_title_timestamp on archive table - https://phabricator.wikimedia.org/T273359 (Marostegui) p:Triage→Medium Nothing comes up on codesearch regarding forcing this index.
[11:55:59] @ interview for the next hour or so
[12:37:50] DBA: Move more wikis from s3 to s5 - https://phabricator.wikimedia.org/T226950 (Marostegui) Checking in for status of s5 at the moment as it is the default section for new wikis: it has 25 wikis.
[13:06:34] elukey: you aware that db1108:3352 has replication stopped?
[13:07:07] yes yes it was me earlier on, upgrade in progress so I didn't want schema changes to be propagated, but I thought it was downtimed
[13:07:20] ah cool!
[13:07:25] I saw it on tendril :)
[13:30:55] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[13:52:30] DBA, Patch-For-Review, Performance-Team (Radar): Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (CDanis) >>! In T269324#6794983, @Marostegui wrote: >>>! In T269324#6794656, @Krinkle wrote: >>> nightmare […] It all starts with having to depool them via a MW commit, […] >> >> I...
[13:59:56] DBA, Patch-For-Review, Performance-Team (Radar): Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (Marostegui) I will let MW experts answer that. However, as I mentioned throughout the task, I don't want to go into the same data/operational mode we have with parsercache,...
[14:02:34] DBA, Patch-For-Review, Performance-Team (Radar): Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (Kormat) >>! In T269324#6794021, @Marostegui wrote: > - If @Kormat could give some estimations on how much puppet work would be required to make the "inactive" dc master (codfw in t...
[14:05:18] DBA, Patch-For-Review, Performance-Team (Radar): Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (Marostegui) Thanks @Kormat @Krinkle @aaron - let's go for the x1 approach but with local masters being writable then?
[15:34:45] there was a sizeable increase of s1 traffic around 13:30: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=s1&var-role=All&from=1612863245304&to=1612884845304
[15:36:02] nothing bad is going on, but mentioning so you can keep an eye on those servers, especially db1118
[15:36:29] It doesn't look too out of the ordinary https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=s1&var-role=All&from=now-7d&to=now
[17:51:41] Data-Persistence-Backup, SRE: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (jcrespo) Something that may or may not be related, but we will want to correct is that backup2002 is resolved on dns...
[18:44:01] Data-Persistence-Backup, SRE, netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (jcrespo) **backup2002 -> backup1002** (please note this was while large backups were running in the backg...