[00:01:28] RESOLVED: ErrorBudgetBurn: search-update-lag codfw - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:07:32] FIRING: [9x] SLOBudgetBurn: Search update lag is below 95% target in eqiad - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [00:12:32] RESOLVED: [9x] SLOBudgetBurn: Search update lag is below 95% target in eqiad - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [00:53:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1270112 (owner: TrainBranchBot) [01:09:18] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1270159 [01:09:18] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1270159 (owner: TrainBranchBot) [01:19:38] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1270159 (owner: TrainBranchBot) [02:01:10] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:07:29] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 19s) [02:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:16] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:44:07] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [02:44:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1174 (T410589)', diff saved to https://phabricator.wikimedia.org/P90442 and previous config saved to /var/cache/conftool/dbconfig/20260412-024415-ladsgroup.json [02:44:18] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [03:15:15] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:16:15] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:34:01] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal-scholarly_443: Servers wdqs1027.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:53:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:58:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:36:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T410589)', diff saved to 
https://phabricator.wikimedia.org/P90443 and previous config saved to /var/cache/conftool/dbconfig/20260412-063600-ladsgroup.json [06:36:03] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [06:46:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P90444 and previous config saved to /var/cache/conftool/dbconfig/20260412-064608-ladsgroup.json [06:46:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:51:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:56:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P90445 and previous config saved to /var/cache/conftool/dbconfig/20260412-065616-ladsgroup.json [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260412T0700) [07:06:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T410589)', diff saved to https://phabricator.wikimedia.org/P90446 and previous config saved to /var/cache/conftool/dbconfig/20260412-070624-ladsgroup.json [07:06:28] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [07:06:41] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [07:06:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2182 (T410589)', diff saved to https://phabricator.wikimedia.org/P90447 and previous config saved to /var/cache/conftool/dbconfig/20260412-070649-ladsgroup.json [08:53:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:58:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:21:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T410589)', diff saved to https://phabricator.wikimedia.org/P90448 and previous config saved to /var/cache/conftool/dbconfig/20260412-112100-ladsgroup.json [11:21:04] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [11:31:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P90449 and previous config saved to 
/var/cache/conftool/dbconfig/20260412-113108-ladsgroup.json [11:41:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P90450 and previous config saved to /var/cache/conftool/dbconfig/20260412-114116-ladsgroup.json [11:51:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T410589)', diff saved to https://phabricator.wikimedia.org/P90451 and previous config saved to /var/cache/conftool/dbconfig/20260412-115124-ladsgroup.json [11:51:28] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [11:51:41] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [11:51:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1191 (T410589)', diff saved to https://phabricator.wikimedia.org/P90452 and previous config saved to /var/cache/conftool/dbconfig/20260412-115148-ladsgroup.json [12:04:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:05:43] SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11811701 (A_smart_kitten) [12:09:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:53:40] FIRING: [2x] SystemdUnitFailed: 
wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:18:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:23:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:24:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:26:56] hi, going to https://gerrit.wikimedia.org/r/c/mediawiki/extensions/BlogPage/+/1222806 I'm getting upstream connect error or disconnect/reset before headers. 
reset reason: connection timeout [14:27:44] paladox: my speculation is that it may be related to T423027 [14:27:44] T423027: DiskSpace - https://phabricator.wikimedia.org/T423027 [14:27:45] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:27:45] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:27:45] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:27:51] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:27:51] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:27:51] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:27:51] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:27:51] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Unable 
to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:27:53] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:27:53] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:27:53] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:27:53] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:27:53] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:27:53] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:28:06] thanks [14:28:23] np :) [14:29:16] FIRING: [2x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:29:27] A_smart_kitten: I think gerrit being broken is at least UBN! 
[14:30:02] RhinosF1: oki, I can file a separate ticket if you think that'd be good (I just noticed the automatic phaultfinder task on phab earlier) [14:30:17] A_smart_kitten: just raise the automatic task to a UBN [14:30:27] RhinosF1: okay will do [14:30:45] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:30:45] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:30:45] claime, volans: as on-callers ^ [14:31:26] done [14:32:23] Hi [14:32:25] Taking a look [14:32:57] !log cgoubert@dns2004 START - running authdns-update [14:33:37] claime: I think the reason authdns is alerting is gerrit is down per A_smart_kitten's task [14:33:42] yeah [14:33:49] can't run it because gerrit is down [14:34:10] (linking to T423027 just in case it got lost in the botnoise above) [14:34:10] T423027: DiskSpace - https://phabricator.wikimedia.org/T423027 [14:34:16] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:34:16] it's possibly due to it being out of disk space [14:38:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from gerrit.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:39:11] that only just paged? 
[14:39:16] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:27] that should have gone off like 15 minutes ago [14:39:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:39:42] !ack [14:39:42] 7827 (ACKED) [2x] ATSBackendErrorsHigh cache_text sre (gerrit.discovery.wmnet) [14:39:55] actually it should probably paged 4 hours ago when it ran out of disk [14:44:16] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:44:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:44:41] I've extended the root vg so gerrit comes back, but it's probably going to fill up again [14:45:24] I filed https://phabricator.wikimedia.org/T423035 as a follow-up too [14:45:45] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:45:45] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: 
Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:47:43] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:47:43] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:47:43] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:47:49] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:47:49] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:47:49] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:47:51] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:47:51] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:47:51] RECOVERY - 
check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:47:51] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:47:51] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:47:51] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:47:53] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:47:53] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:48:51] RESOLVED: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from gerrit.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:48:56] claime: it seems like it started climbing about midnight UTC [14:49:02] Yes. 
[14:49:09] https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-collaboration-services?orgId=1&from=now-24h&to=now&timezone=browser&var-DS_PROMETHEUS=000000026&var-job=node&var-nodename=gerrit2003&var-node=gerrit2003:9100&var-diskdevices=%5Ba-z%5D%2B%7Cnvme%5B0-9%5D%2Bn%5B0-9%5D%2B%7Cmmcblk%5B0-9%5D%2B&refresh=1m&viewPanel=panel-152 [14:49:16] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:49:24] paladox: any ideas why from your knowledge of gerrit? [14:49:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:57] it's not gerrit. It's probably the new layer that was added. [14:50:03] I think? [14:50:16] paladox: what new layer? [14:50:44] Well we go through the cache proxies to envoy now I think? [14:51:00] It's not the edge cache that made gerrit's disk fill up [14:51:04] paladox: gerrit2003 is out of disk space [14:51:07] well was [14:51:18] it started rapidly rising in disk space about midnight UTC [14:51:37] oh ok [15:00:16] RhinosF1: You shouldn't rename phaultfinder tasks, it won't get updated if it start firing again [15:01:09] claime: ack, do you want me to change back? 
[15:01:19] Nah it's fine, don't worry about it [15:01:25] Just for future ref :) [15:02:21] Ok [15:13:09] claime: fyi, my napkin maths says the extra 20G will literally only take it to 9am Monday UTC so someone might want to make sure they are warned to pick it up literally first thing [15:14:14] That’s 11 my time so I’ll have pinged them by that point thanks for checking [15:14:26] I’ll take a look end of [15:14:39] Shift to see how it moved and if it needs another bump [15:15:13] Cool [15:25:15] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:26:15] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:16] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:53:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:31:55] PROBLEM - Host wikikube-worker1163 is DOWN: PING CRITICAL - Packet loss = 60%, RTA = 4100.38 ms [17:32:57] RECOVERY - Host wikikube-worker1163 is UP: PING OK - Packet 
loss = 0%, RTA = 0.34 ms [19:03:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [19:08:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [20:24:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T410589)', diff saved to https://phabricator.wikimedia.org/P90456 and previous config saved to /var/cache/conftool/dbconfig/20260412-202435-ladsgroup.json [20:24:39] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [20:34:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P90457 and previous config saved to /var/cache/conftool/dbconfig/20260412-203443-ladsgroup.json [20:44:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P90458 and previous config saved to /var/cache/conftool/dbconfig/20260412-204451-ladsgroup.json [20:50:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T410589)', diff saved to https://phabricator.wikimedia.org/P90459 and previous config saved to /var/cache/conftool/dbconfig/20260412-205020-ladsgroup.json [20:50:24] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [20:53:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:55:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T410589)', diff saved to https://phabricator.wikimedia.org/P90460 and previous config saved to /var/cache/conftool/dbconfig/20260412-205500-ladsgroup.json [20:55:17] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [20:55:26] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1194 (T410589)', diff saved to https://phabricator.wikimedia.org/P90461 and previous config saved to /var/cache/conftool/dbconfig/20260412-205525-ladsgroup.json [20:55:29] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [21:00:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P90462 and previous config saved to /var/cache/conftool/dbconfig/20260412-210028-ladsgroup.json [21:10:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P90463 and previous config saved to /var/cache/conftool/dbconfig/20260412-211036-ladsgroup.json [21:20:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T410589)', diff saved to https://phabricator.wikimedia.org/P90464 and previous config saved to /var/cache/conftool/dbconfig/20260412-212043-ladsgroup.json [21:20:48] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [21:21:00] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance [21:51:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST 
secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:01:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency