[11:18:58] arnaudb: Any objections to doing T374087 right now?
[11:18:59] T374087: Switchover s6 master (db2129 -> db2214) - https://phabricator.wikimedia.org/T374087
[11:24:57] https://phabricator.wikimedia.org/T374087#10120947 \o/
[12:12:17] \o/
[12:12:19] sorry I was AFK
[14:28:56] there is a problem with replication on db1154 (s5 only)
[14:29:11] "Could not execute Write_rows_v1 event on table srwiki.recentchanges; Index for table 'recentchanges' is corrupt; try to repair it"
[14:34:24] db1240 mswiktionary pagelinks 9/5/2024
[14:34:34] I've also had that one on s3
[14:34:44] dhinus: I'll check in a few moments
[14:35:20] arnaudb: thanks, no rush
[14:35:22] those are noted in the tracking gsheet ↑
[14:35:28] https://docs.google.com/spreadsheets/d/1uZFy9BqMUug14h899cU3-m4pnkrARl-ElV3fGbE7AXQ/edit?gid=0#gid=0
[14:37:51] dhinus: table is rebuilding
[14:37:57] replication resumed
[14:38:05] cheers :)
[14:38:36] PROBLEM - MariaDB sustained replica lag on s5 on db1154 is CRITICAL: 2.978e+04 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13315
[14:46:46] let me check
[14:47:28] fixed now already, awesome
[14:48:48] FIRING: MysqlReplicationLag: MySQL instance db1154:13315 has too large replication lag (13m 6s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1154&var-port=13315 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[14:53:36] RECOVERY - MariaDB sustained replica lag on s5 on db1154 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13315
[14:53:48] RESOLVED: MysqlReplicationLag: MySQL instance db1154:13315 has too large replication lag (13m 6s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1154&var-port=13315 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[15:13:35] I've had to disable semi_sync repl on db2225 for it to catch up on its replag after clone, first time it happens right after a clone haha
[15:19:59] annnnd same thing on db2125, obv
[15:30:13] that's weird
[15:30:24] why semi sync
[15:33:23] btw db2187:3312 has had its replication broken for days now (codfw sanitarium)
[15:33:38] checking
[15:33:45] already fixing it
[15:33:52] ack! sry I missed it!
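[Note] The repair flow described above (stop the broken s5 replica on db1154, rebuild the corrupt srwiki.recentchanges table, resume replication, and temporarily drop semi-sync so the freshly cloned db2225/db2125 could catch up on lag) maps roughly to the MariaDB statements below. This is a minimal sketch of the standard approach, not the exact commands run on these hosts; the ALTER TABLE ... FORCE rebuild and the rpl_semi_sync_slave_enabled toggle are assumptions based on stock MariaDB behaviour, and any depooling/repooling steps are omitted.

    -- On the lagging s5 instance of db1154 (port 13315 per the alerts above):
    SHOW SLAVE STATUS\G                         -- Last_SQL_Error carries the corrupt-index message
    STOP SLAVE;                                 -- halt the replication threads before repairing
    ALTER TABLE srwiki.recentchanges FORCE;     -- rebuild the InnoDB table and its indexes in place
    START SLAVE;                                -- resume replication; Seconds_Behind_Master should drain to 0

    -- On a freshly cloned replica such as db2225, temporarily disabling
    -- semi-sync (as mentioned above) so it can catch up would look like:
    SET GLOBAL rpl_semi_sync_slave_enabled = OFF;
    -- (the IO thread may need a STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;
    --  cycle to pick up the change)
    -- ...once Seconds_Behind_Master reaches 0, re-enable it:
    SET GLOBAL rpl_semi_sync_slave_enabled = ON;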
[15:37:05] and db2197 is the same too
[15:37:08] 1d
[15:37:42] checking
[15:37:50] backup source
[15:37:59] (I think, let me double check)
[15:38:03] arnaudb: already fixed
[15:38:09] ah ack
[15:38:10] thanks
[15:38:11] recentchanges corruption on nlwiki
[15:38:14] urgh
[15:38:30] yeah backup source
[16:35:25] FIRING: SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@mgr.moss-be2003.vtjrnj.service on moss-be2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:36:22] That's a false positive from the host being in maintenance mode for network maintenance re T373096
[16:36:23] T373096: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096
[17:10:25] RESOLVED: SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@mgr.moss-be2003.vtjrnj.service on moss-be2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
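[Note] The broken replicas flagged in this exchange (db2187:3312 and db2197) had been stopped for about a day before anyone noticed. A quick manual health check on such a host is sketched below, assuming mysql client access to each instance; the statement and column names are standard MariaDB, while the specific hosts and ports come from the log above.

    -- Connect to the suspect instance (e.g. the :3312 instance on db2187) and check:
    SHOW ALL SLAVES STATUS\G
    -- Slave_IO_Running and Slave_SQL_Running should both be "Yes";
    -- a non-empty Last_SQL_Error (such as the recentchanges index corruption above)
    -- means the SQL thread has stopped and the instance is silently falling behind.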