[04:30:48] FIRING: MysqlReplicationLag: MySQL instance db1153:9104@x2 has too large replication lag (11m 12s). Its replication source is db1152.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1153&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[04:32:48] FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1153:9104 has too large replication lag (13m 13s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1153&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[07:49:04] First hang of 10.11 that we are testing, I will report this to mariadb
[08:12:25] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:20:48] RESOLVED: MysqlReplicationLag: MySQL instance db1153:9104@x2 has too large replication lag (3h 11m 38s). Its replication source is db1152.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1153&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[08:22:48] RESOLVED: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1153:9104 has too large replication lag (2h 42m 36s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1153&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[08:25:36] https://jira.mariadb.org/browse/MDEV-36016
[09:14:09] rclone was another "lost a race with admin deletion".
[09:17:25] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:39:55] PROBLEM - MariaDB sustained replica lag on es6 on es2037 is CRITICAL: 392.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es2037&var-port=9104
[09:45:55] RECOVERY - MariaDB sustained replica lag on es6 on es2037 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es2037&var-port=9104
[10:14:23] replication to db2187:3317 is stopped (codfw sanitarium), I don't see any errors.
[10:15:47] it is me
[10:15:49] rebuilding tables
[10:16:11] cool. Thanks!
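The MysqlReplicationLagPtHeartbeat alerts above are based on pt-heartbeat-style measurement: the primary writes a timestamp row roughly every second, and a replica's lag is the age of the newest row that has replicated over. A minimal sketch of that computation (function names and the timestamps are illustrative, not the actual check):

```python
from datetime import datetime, timezone

def heartbeat_lag_seconds(last_heartbeat: datetime, now: datetime) -> float:
    """pt-heartbeat-style lag: age of the newest heartbeat row seen on the
    replica. Clock skew between primary and replica biases this number
    directly, so both hosts need synchronized clocks."""
    return (now - last_heartbeat).total_seconds()

def format_lag(seconds: float) -> str:
    """Render lag the way the alert text does, e.g. 672 -> '11m 12s'."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}m {secs}s"

# Illustrative values matching the first alert's "(11m 12s)":
now = datetime(2025, 2, 5, 4, 30, 48, tzinfo=timezone.utc)
beat = datetime(2025, 2, 5, 4, 19, 36, tzinfo=timezone.utc)
print(format_lag(heartbeat_lag_seconds(beat, now)))  # 11m 12s
```

Unlike `Seconds_Behind_Master`, this stays meaningful even when the SQL thread is stopped or relaying old events, which is why the alerting uses it.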
[10:29:25] FIRING: SystemdUnitFailed: cassandra-b.service on restbase2035:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:39:25] RESOLVED: SystemdUnitFailed: cassandra-b.service on restbase2035:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:11:25] FIRING: SystemdUnitFailed: ferm.service on es2027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:14:05] computer bought the ferm? :)
[11:16:25] RESOLVED: SystemdUnitFailed: ferm.service on es2027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:44:11] >[{reqId}] {exception_url} Wikimedia\Rdbms\DBQueryError: Error 1034: Index for table 'logging' is corrupt; try to repair it Function: ManualLogEntry::insert Query: INSERT INTO `logging` (log_type,log_action,log_timestamp,log_actor,log_namespace,log_title
[12:44:29] newiki, seemingly, at least
[12:44:40] Amir1: marostegui ^^
[12:44:49] Can dump it in a task
[12:47:28] what host?
[12:47:45] Reedy: if you give me the host I will fix it right now
[12:47:49] (but a task will be good too)
[12:47:51] in person?
[12:47:53] the flight will take a while
[12:48:51] Reedy: We can meet at the DC
[12:48:53] marostegui: I guess s3 master
[12:48:58] uuuuh
[12:48:59] checking
[12:49:05] as it's an insert... and the wiki is on s3
[12:49:15] "why are we storing things in Amazon s3?"
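The triage above (a failing INSERT on a wiki in s3 must have hit the s3 primary) can be sketched as a lookup. The mappings below are illustrative stand-ins; the real wiki-to-section assignment and section primaries live in MediaWiki's database configuration:

```python
# Illustrative mappings only: the real data lives in MediaWiki's db config.
# db2209 as the s3 (codfw) primary comes from the channel discussion.
SECTION_BY_WIKI = {"newiki": "s3"}
PRIMARY_BY_SECTION = {"s3": "db2209"}

def host_for_failed_write(wiki: str) -> str:
    """A write such as ManualLogEntry::insert always executes on the
    section's primary, so a 'corrupt index' error on an INSERT points
    at the primary of the wiki's section, never at a read replica."""
    return PRIMARY_BY_SECTION[SECTION_BY_WIKI[wiki]]

print(host_for_failed_write("newiki"))  # db2209
```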
[12:49:32] ok db2209
[12:49:32] got it
[13:00:01] server crashed nice
[13:00:50] It is back, but I will schedule an emergency switchover
[13:01:18] need help?
[13:01:26] no, it is fine, thanks a lot jynus
[13:05:37] yeah this master is pretty dead
[13:10:35] :(
[13:18:57] Should we add a note to the status page? Not necessarily of an outage, but notifying an unscheduled maintenance?
[13:19:26] so people are aware it has been fixed now
[13:24:59] jynus: sure
[13:25:46] I can do it, some read only time was produced (?)
[13:27:38] yeah, there was read only time
[13:27:52] at least 3-4 minutes
[13:27:59] do you have some approximate timestamps? I don't need the exact, just including the reboots
[13:28:06] let me check
[13:28:09] "for sometime between X and X" is enough
[13:29:43] 13:07:38 to 13:14 UTC
[13:34:18] corrections? https://www.wikimediastatus.net/incidents/sclkxsj73cmz
[13:34:37] +1
[13:34:50] I will remove the seconds on the first, looks unnecessary
[13:34:53] thanks for the help!
[13:35:15] that way if someone reports we can tell: you should have looked at the status page :-D
[14:26:10] <_joe_> I will not be around for the DP meeting today, sorry but I've too much stuff to do and too many meetings this week to manage to do it all before atlanta
[14:27:15] 🥳
[14:31:53] <_joe_> ass :P
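For the status-page note, the impact window follows from the timestamps quoted above ("13:07:38 to 13:14 UTC"); a quick sketch of that arithmetic (helper name is illustrative):

```python
from datetime import datetime

def window_minutes(start: str, end: str) -> float:
    """Length in minutes of a same-day UTC window given as HH:MM:SS."""
    fmt = "%H:%M:%S"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

# 13:07:38 -> 13:14:00 is about 6.4 minutes of impact, consistent with
# the "at least 3-4 minutes" of read-only time reported in channel.
print(round(window_minutes("13:07:38", "13:14:00"), 1))  # 6.4
```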