[00:24:28] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1101181
[00:38:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1101181 (owner: TrainBranchBot)
[00:55:16] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1101181 (owner: TrainBranchBot)
[01:08:34] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1101184
[01:08:34] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1101184 (owner: TrainBranchBot)
[01:10:12] (PS2) LD: T381722:Add abusefilter-access-protected-vars to frwiki EFM and remove it for sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182
[01:26:47] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1101184 (owner: TrainBranchBot)
[02:40:42] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:05:42] FIRING:
[2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:00] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:15:42] FIRING: [5x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:16:00] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:25:42] FIRING: [4x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:24:29] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:38:49] (CR) Pppery: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182 (owner: LD)
[04:39:13] (CR) Pppery: T381722:Add abusefilter-access-protected-vars to frwiki EFM and remove it for sysops (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182 (owner: LD)
[04:39:45] (CR) Pppery: "Welcome to Gerrit. You will need to schedule this patch for deployment in a backport window for it to get reviewed.
See an explanation of " [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182 (owner: LD)
[04:40:23] (CR) Pppery: T381722:Add abusefilter-access-protected-vars to frwiki EFM and remove it for sysops (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182 (owner: LD)
[04:40:25] (CR) CI reject: [V:-1] T381722:Add abusefilter-access-protected-vars to frwiki EFM and remove it for sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182 (owner: LD)
[06:46:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[06:51:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[07:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:25:43] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241208T0800)
[08:24:29] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:48:07] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: Stang)
[08:51:12] (CR) Stang: "Withdrawn, seeking for someone else for the testing" [mediawiki-config] - https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: Stang)
[11:03:30] PROBLEM - MariaDB Replica SQL: s2 on db2197 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: nlwiki.
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:25:43] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:24:29] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:57:09] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:58:42] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS5511/IPv6: Connect - Orange, AS5511/IPv4: Connect - Orange https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:02:09] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:02:54] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1,
dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:05:58] (PS3) LD: T381722:Add abusefilter-access-protected-vars to frwiki EFM and remove it for sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182
[13:07:45] (PS4) LD: Bug:T381722 [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182
[13:10:54] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:13:08] (PS5) LD: frwiki: Add abusefilter-access-protected-vars to EFM, remove it from sysops. [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182
[13:17:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182 (owner: LD)
[13:18:42] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 114, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:46:31] PROBLEM - MariaDB Replica SQL: s2 #page on db2207 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: nlwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:47:36] :sadpanda:
[13:47:42] I'm taking a look
[13:48:29] fixed
[13:48:35] RECOVERY - MariaDB Replica SQL: s2 #page on db2207 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:48:43] it's a replica, so I was going to say "safe to depool & open a ticket", it looks like the index-corruption issue again, so could presumably optimize table...
[13:48:49] but you're quicker than me :)
[13:48:52] thanks
[13:49:31] yeah, rc table of nlwiki
[13:49:37] I fixed it live
[13:49:54] I resolved the VO incident
[13:50:01] I think the whole mariadb is now down?
[13:50:17] https://usercontent.irccloud-cdn.com/file/YiyltktZ/grafik.png
[13:50:23] also what's going on with db2197
[13:50:31] orchestrator thinks db2207 "invalid"
[13:51:09] it's corrupt again
[13:51:12] give me a second
[13:51:33] PROBLEM - MariaDB Replica SQL: s2 #page on db2207 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: nlwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:51:56] Amir1: 2207 paged again, same error.
[13:51:58] I'll ACK
[13:52:19] on it
[13:52:31] if you want me to do anything other than wrangling VO, do shout :)
[13:52:44] it should be fixed now
[13:52:50] I did the force
[13:53:04] 2207 now just a bit lagged
[13:53:04] but if it doesn't fix it or it pages again, we should just depool
[13:53:17] yeah, catching up
[13:53:33] RECOVERY - MariaDB Replica SQL: s2 #page on db2207 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:53:33] 2207 looks good now
[13:53:40] fixed, let's see if it pages again
[13:54:10] do you want to eyeball db2197?
[13:54:31] [I mean, it's Sunday afternoon, but orchestrator thinks it's not replicating]
[13:54:48] I'm doing it too
[13:54:57] it's backup source, that's why it didn't page
[13:55:06] but it's the same db, same table
[13:55:23] Ah, OK. If it's not a quick fix, it can wait 'til Monday...
[13:55:30] RECOVERY - MariaDB Replica SQL: s2 on db2197 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:55:32] nah, already fixed
[13:55:37] ^
[13:55:39] 👍
[13:56:20] Does arnaud.b's google doc need updating to reflect these?
[13:56:53] yeah, that can wait until monday :P
[13:56:56] :)
[13:57:14] OK, back to my Sunday. That's the 3rd time this weekend now /o\
[13:59:55] it would be nice to have weekend oncall
[14:03:28] ^ +1
[14:30:02] <-- my "saying nothing" face
[14:40:43] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:05:43] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:24:29] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:35:10] (PS6) Pppery: frwiki: Add abusefilter-access-protected-vars to EFM, remove it from sysops.
[mediawiki-config] - https://gerrit.wikimedia.org/r/1101182 (https://phabricator.wikimedia.org/T381722) (owner: LD)
[17:39:10] (CR) Dreamy Jazz: [C:-1] "Are we sure this has WMF Legal approval?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182 (https://phabricator.wikimedia.org/T381722) (owner: LD)
[17:59:52] (CR) Zabe: "This patch only touches abusefilter-access-protected-vars, so I do not really see where WMF legal approval should be necessary?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182 (https://phabricator.wikimedia.org/T381722) (owner: LD)
[18:01:11] (CR) Dreamy Jazz: [C:-1] "It is necessary because it would grant access to temporary account IP addresses through the protected variable `user_unnamed_ip`." [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182 (https://phabricator.wikimedia.org/T381722) (owner: LD)
[18:02:16] (CR) Dreamy Jazz: [C:-1] "To clarify: At the moment there is no temporary accounts on frwiki. However, if they are enabled then having the `abusefilter-access-prote" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182 (https://phabricator.wikimedia.org/T381722) (owner: LD)
[18:04:43] (CR) Dreamy Jazz: [C:-1] "Furthermore, the policy doesn't currently give access to local abuse filter maintainer groups.
Instead the access is through the "Patrolle" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182 (https://phabricator.wikimedia.org/T381722) (owner: LD)
[18:06:31] (CR) Dreamy Jazz: [C:-1] "I do see why this patch (or something like it) is needed, but I think it should be passed by WMF Legal to ensure that situations like T380" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182 (https://phabricator.wikimedia.org/T381722) (owner: LD)
[18:09:50] (CR) Zabe: "Ack" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182 (https://phabricator.wikimedia.org/T381722) (owner: LD)
[18:14:26] (CR) Pppery: frwiki: Add abusefilter-access-protected-vars to EFM, remove it from sysops. (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1101182 (https://phabricator.wikimedia.org/T381722) (owner: LD)
[19:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:05:43] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:24:33] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[19:25:33] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[20:24:29] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:04:28] FIRING: [4x] SystemdUnitFailed:
load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:05:43] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable