[06:26:25] federico3: is the progress on https://phabricator.wikimedia.org/T399728 real? Because the work on https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance says something different, and I need to know if I can run my schema change on s1 or s8
[06:27:05] By the way, that change is also safe to run on masters without any switchovers (so once all the replicas are done you can use --dc-masters $DATACENTER)
[06:29:23] marostegui: I'm updating the progress on the task async after the run. The data on https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance is generated by the auto schema script
[06:30:51] BTW s3 on codfw is done, can I start s4?
[06:41:20] also I'm aware of --dc-masters, but I thought of starting with the replicas as they are the biggest/slowest changes to roll out and less risky
[08:14:17] federico3: Yes, that is fine. Also, --dc-masters will ONLY do the dc masters, so always start with the replicas
[08:14:37] What I normally do is the replicas of a section and then --dc-masters, so the full section is finished. But that's just how I approach it
[08:15:44] ok, I can add it to the script right now
[08:15:56] To which script?
[08:16:32] the helper here https://gitlab.wikimedia.org/repos/sre/schema-changes/-/merge_requests/42
[08:16:51] But this is going to be a full wrapper of the schema change
[08:16:51] also, do you want the task to reflect the current/ongoing operation?
[08:17:48] You mean add it to def write_summary, right?
[08:17:56] Not change the logic of how the schema change performs things, right?
[08:18:01] Like a purely reporting thing, am I right?
[08:19:56] I mean 1) add the --dc-masters step here https://gitlab.wikimedia.org/repos/sre/schema-changes/-/merge_requests/42/diffs#7329d389feef6faed22f45ef93afd8d94da66ec0_0_45 after the schema change on the replicas is completed
[08:19:56] 2) tweak write_summary https://gitlab.wikimedia.org/repos/sre/schema-changes/-/merge_requests/42/diffs#7329d389feef6faed22f45ef93afd8d94da66ec0_0_118 to also flag "(ongoing)" so it is reported to phabricator
[08:20:55] (auto_schema and the schema-change scripts remain unchanged)
[08:22:15] I feel we are almost creating a total wrapper of the auto schema
[08:22:38] yes, that's the idea
[08:22:42] Who's idea?
[08:22:45] whose
[08:22:50] mine :D
[08:23:08] Right, but that needs some discussion because we are changing the logic of how we run schema changes
[08:23:12] so we can run it one section at a time and get the reports out
[08:24:08] Sure, but there are some things we need to double check (eg: dc-masters could be critical enough that we need to discuss whether we want it to be fully automated, even with an optional --dc-masters argument)
[08:26:21] maybe I should describe better what it does: I was instructed to apply schema changes one section at a time and one dc at a time, asking for confirmation at each section-dc, and to update the task as I go: the script implements the same manual work that I've been doing. It still asks the user for manual confirmation before doing each step
[08:29:41] to clarify: the script is not making decisions by itself on when to run - if you want we can schedule a meeting and look at it together: I think even a 20-minute meeting looking at the same codebase together is often more efficient than discussing on the PR and IRC
[08:41:15] sure, we can talk about it next week, no problem
[09:13:58] marostegui: can I start the schema change in s4?
[09:14:04] yep
[09:14:50] ok thanks
[22:21:48] FIRING: [8x] MysqlReplicationLagPtHeartbeat: MySQL instance db2186:9104 has too large replication lag (11m 24s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[22:25:52] PROBLEM - MariaDB sustained replica lag on x1 on db2215 is CRITICAL: 628 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2215&var-port=9104
[22:26:48] RESOLVED: [8x] MysqlReplicationLagPtHeartbeat: MySQL instance db2186:9104 has too large replication lag (15m 24s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[22:26:54] PROBLEM - MariaDB sustained replica lag on x1 on db2191 is CRITICAL: 476.5 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2191&var-port=9104
[22:27:26] PROBLEM - MariaDB sustained replica lag on x1 on db2196 is CRITICAL: 51 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2196&var-port=9104
[22:27:54] RECOVERY - MariaDB sustained replica lag on x1 on db2191 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2191&var-port=9104
[22:29:52] RECOVERY - MariaDB sustained replica lag on x1 on db2215 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2215&var-port=9104
[22:30:28] RECOVERY - MariaDB sustained replica lag on x1 on db2196 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2196&var-port=9104
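
A minimal sketch of the per-section flow discussed in the 08:14-08:29 exchange above: replicas first, then --dc-masters, with an operator confirmation before each step and write_summary flagging "(ongoing)" on the task. This is not the actual auto_schema / schema-changes code; run_on_replicas, run_on_dc_masters and the write_summary stub below are illustrative placeholders, and only the --dc-masters flag, the write_summary name and the "(ongoing)" marker come from the conversation.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of the wrapper loop described in the log above."""


def run_on_replicas(section: str, dc: str) -> None:
    # Placeholder: the real helper would apply the schema change to the
    # replicas of `section` in `dc`.
    print(f"applying schema change to {section} replicas in {dc}")


def run_on_dc_masters(section: str, dc: str) -> None:
    # Placeholder: the real helper would re-run the change with --dc-masters,
    # which ONLY touches the DC masters of the section.
    print(f"applying schema change with --dc-masters {dc} on {section}")


def write_summary(section: str, dc: str, status: str) -> None:
    # Placeholder: the real write_summary reports progress to the Phabricator task.
    print(f"{section}/{dc}: {status}")


def confirm(prompt: str) -> bool:
    """Ask the operator before every step; the script never acts on its own."""
    return input(f"{prompt} [y/N] ").strip().lower() == "y"


def run_section(section: str, dc: str) -> None:
    # 1) Replicas first: the slowest part of the rollout and the least risky.
    if confirm(f"Run schema change on {section} replicas in {dc}?"):
        run_on_replicas(section, dc)
        write_summary(section, dc, status="(ongoing)")

    # 2) Then the DC masters, so the section is fully finished
    #    (this particular change is safe on masters without a switchover).
    if confirm(f"Run --dc-masters {dc} for {section}?"):
        run_on_dc_masters(section, dc)
        write_summary(section, dc, status="done")


if __name__ == "__main__":
    for section in ("s1", "s4", "s8"):   # one section at a time
        for dc in ("codfw", "eqiad"):    # one datacenter at a time
            run_section(section, dc)
```

Keeping a confirmation prompt in front of both steps preserves the point made at 08:26:21 and 08:29:41: the wrapper sequences the work and reports progress, but it does not decide by itself when to run.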