[00:01:58] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:03:53] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:08:27] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:10:55] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be100[56] - https://phabricator.wikimedia.org/T418901#11810180 (10Jclark-ctr) 05Open→03Resolved These were finished; accidentally put Racking ticket on cookbook reimage was posted on Procurement tic... [00:11:16] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be100[56] - https://phabricator.wikimedia.org/T418901#11810183 (10Jclark-ctr) [00:53:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:58:12] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T419635)', diff saved to https://phabricator.wikimedia.org/P90392 and previous config saved to /var/cache/conftool/dbconfig/20260411-005811-fceratto.json [00:58:16] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [01:00:05] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1269755 (owner: 10TrainBranchBot) [01:01:51] RESOLVED: CoreRouterInterfaceDown: Core router 
interface down - cr2-eqiad:et-1/1/5 (Transport: cr2-codfw:et-0/1/4 (Lumen, 449169461) {#changeme_lumen_patch}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:07:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:09:01] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20260411-010859-fceratto.json [01:09:59] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1270112 [01:09:59] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1270112 (owner: 10TrainBranchBot) [01:19:34] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1270112 (owner: 10TrainBranchBot) [01:19:49] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:19:53] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20260411-011948-fceratto.json [01:20:41] !incidents [01:20:41] 7825 (UNACKED) ProbeDown sre (10.64.16.101 ip4 
phab1004:443 probes/custom http_phabricator_wikimedia_org_ip4 eqiad) [01:20:54] !ack [01:20:54] 7825 (ACKED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_ip4 eqiad) [01:20:57] \o [01:22:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:24:49] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:26:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:30:45] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T419635)', diff saved to https://phabricator.wikimedia.org/P90393 and previous config saved to /var/cache/conftool/dbconfig/20260411-013040-fceratto.json [01:30:49] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [01:31:04] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2236.codfw.wmnet with reason: Maintenance [01:31:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:31:57] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2236 (T419635)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20260411-013151-fceratto.json [01:35:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:44:14] Phab DB down [01:45:46] musikanimal: sometimes you just need to refresh, it's probably an aggressive scraper again [01:47:54] thanks, yeah it eventually worked. Intermittent on my end, but I also saw the alerts above so wanted to give confirmation from the user end [01:49:31] back to artisanal handcrafted html pages for everyone >.> [01:49:38] lol [01:49:39] I know proof-of-work challenges like Anubis were deemed not appropriate for the wikis, but I wonder if that would be acceptable for Phab? 
[01:50:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:50:41] I've done 3 Anubis deployments so far and all worked flawlessly, and issues with scrapers are a thing of the past [01:57:17] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:09:15] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:15] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T410589)', diff saved to https://phabricator.wikimedia.org/P90394 and previous config saved to /var/cache/conftool/dbconfig/20260411-030611-ladsgroup.json [03:06:16] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [03:12:17] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T419635)', diff saved to https://phabricator.wikimedia.org/P90395 and previous config saved to /var/cache/conftool/dbconfig/20260411-031216-fceratto.json [03:12:20] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 
[03:16:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P90396 and previous config saved to /var/cache/conftool/dbconfig/20260411-031620-ladsgroup.json [03:18:48] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:23:06] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P90397 and previous config saved to /var/cache/conftool/dbconfig/20260411-032304-fceratto.json [03:26:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P90398 and previous config saved to /var/cache/conftool/dbconfig/20260411-032628-ladsgroup.json [03:33:54] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P90399 and previous config saved to /var/cache/conftool/dbconfig/20260411-033352-fceratto.json [03:36:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T410589)', diff saved to https://phabricator.wikimedia.org/P90400 and previous config saved to /var/cache/conftool/dbconfig/20260411-033636-ladsgroup.json [03:36:40] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [03:36:53] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [03:37:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2159 (T410589)', diff saved to https://phabricator.wikimedia.org/P90401 and previous 
config saved to /var/cache/conftool/dbconfig/20260411-033701-ladsgroup.json [03:44:42] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T419635)', diff saved to https://phabricator.wikimedia.org/P90402 and previous config saved to /var/cache/conftool/dbconfig/20260411-034441-fceratto.json [03:44:46] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [03:45:01] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2237.codfw.wmnet with reason: Maintenance [03:45:50] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2237 (T419635)', diff saved to https://phabricator.wikimedia.org/P90403 and previous config saved to /var/cache/conftool/dbconfig/20260411-034549-fceratto.json [04:47:26] (03CR) 101F616EMO: "Please note that Wikinews will be closed on the same day. I wonder would that affect how caches are handled." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264575 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO) [04:50:13] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:51:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:51:13] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:52:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:53:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on 
wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:00:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:00:13] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:01:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:01:13] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:29:03] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T419635)', diff saved to https://phabricator.wikimedia.org/P90404 and previous config saved to /var/cache/conftool/dbconfig/20260411-052901-fceratto.json [05:29:06] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [05:39:51] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P90405 and previous config saved to /var/cache/conftool/dbconfig/20260411-053950-fceratto.json [05:47:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:50:39] 
!log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P90406 and previous config saved to /var/cache/conftool/dbconfig/20260411-055038-fceratto.json [06:01:27] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T419635)', diff saved to https://phabricator.wikimedia.org/P90407 and previous config saved to /var/cache/conftool/dbconfig/20260411-060126-fceratto.json [06:01:31] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [06:01:46] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2239.codfw.wmnet with reason: Maintenance [07:10:39] (03CR) 10Anzx: "If that the case you can schedule, but closed doesn't mean wikis gets closed , it will only be locked for editing so logos still may appea" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264575 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO) [07:18:48] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:35:40] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2240.codfw.wmnet with reason: Maintenance [07:36:28] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2240 (T419635)', diff saved to https://phabricator.wikimedia.org/P90408 and previous config saved to /var/cache/conftool/dbconfig/20260411-073627-fceratto.json [07:36:31] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:45:32] FIRING: SLOBudgetBurn: Search update lag is below 95% target in codfw - 
https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [08:50:32] FIRING: [3x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [08:53:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:00:32] FIRING: [5x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:01:31] PROBLEM - Host mr1-magru.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:05:32] FIRING: [9x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:10:32] FIRING: [9x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:15:32] FIRING: [9x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:16:57] RECOVERY - Host mr1-magru.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 117.54 ms [09:18:49] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T419635)', diff saved to https://phabricator.wikimedia.org/P90409 and previous config saved to /var/cache/conftool/dbconfig/20260411-091847-fceratto.json [09:18:52] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:25:32] FIRING: [10x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:29:37] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P90410 and previous config saved to 
/var/cache/conftool/dbconfig/20260411-092936-fceratto.json [09:30:32] FIRING: [11x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:35:32] FIRING: [11x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:38:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:39:03] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:40:26] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P90411 and previous config saved to /var/cache/conftool/dbconfig/20260411-094024-fceratto.json [09:40:32] FIRING: [11x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:45:32] FIRING: [11x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:47:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:48:48] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on 
k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:51:15] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T419635)', diff saved to https://phabricator.wikimedia.org/P90412 and previous config saved to /var/cache/conftool/dbconfig/20260411-095113-fceratto.json [09:51:18] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:51:34] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2245.codfw.wmnet with reason: Maintenance [09:52:22] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2245 (T419635)', diff saved to https://phabricator.wikimedia.org/P90413 and previous config saved to /var/cache/conftool/dbconfig/20260411-095220-fceratto.json [09:55:32] FIRING: [11x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [10:00:32] FIRING: [9x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [10:05:32] RESOLVED: [9x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [11:40:39] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2245 (T419635)', diff saved to https://phabricator.wikimedia.org/P90414 and previous config saved to /var/cache/conftool/dbconfig/20260411-114037-fceratto.json [11:40:42] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:51:27] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2245', diff saved to 
https://phabricator.wikimedia.org/P90415 and previous config saved to /var/cache/conftool/dbconfig/20260411-115126-fceratto.json [12:02:15] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2245', diff saved to https://phabricator.wikimedia.org/P90416 and previous config saved to /var/cache/conftool/dbconfig/20260411-120214-fceratto.json [12:12:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T410589)', diff saved to https://phabricator.wikimedia.org/P90417 and previous config saved to /var/cache/conftool/dbconfig/20260411-121218-ladsgroup.json [12:12:22] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [12:13:04] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2245 (T419635)', diff saved to https://phabricator.wikimedia.org/P90418 and previous config saved to /var/cache/conftool/dbconfig/20260411-121302-fceratto.json [12:13:07] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:13:23] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2246.codfw.wmnet with reason: Maintenance [12:14:11] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2246 (T419635)', diff saved to https://phabricator.wikimedia.org/P90419 and previous config saved to /var/cache/conftool/dbconfig/20260411-121410-fceratto.json [12:22:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P90420 and previous config saved to /var/cache/conftool/dbconfig/20260411-122226-ladsgroup.json [12:32:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P90421 and previous config saved to /var/cache/conftool/dbconfig/20260411-123235-ladsgroup.json [12:40:15] PROBLEM - PyBal backends health 
check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:41:13] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:42:13] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:42:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T410589)', diff saved to https://phabricator.wikimedia.org/P90422 and previous config saved to /var/cache/conftool/dbconfig/20260411-124244-ladsgroup.json [12:42:48] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [12:43:01] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:45:13] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:46:13] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:46:15] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:49:13] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:49:15] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - 
CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:52:13] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:52:15] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:53:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:42:35] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11810791 (10Krd) Please unbreak now. 
[14:06:30] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2246 (T419635)', diff saved to https://phabricator.wikimedia.org/P90423 and previous config saved to /var/cache/conftool/dbconfig/20260411-140628-fceratto.json [14:06:33] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:17:18] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2246', diff saved to https://phabricator.wikimedia.org/P90424 and previous config saved to /var/cache/conftool/dbconfig/20260411-141717-fceratto.json [14:28:07] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2246', diff saved to https://phabricator.wikimedia.org/P90425 and previous config saved to /var/cache/conftool/dbconfig/20260411-142805-fceratto.json [14:38:55] !log fceratto@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db2246 (T419635)', diff saved to https://phabricator.wikimedia.org/P90426 and previous config saved to /var/cache/conftool/dbconfig/20260411-143854-fceratto.json [14:38:58] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:39:15] !log fceratto@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2247.codfw.wmnet with reason: Maintenance [14:40:03] !log fceratto@cumin2002 dbctl commit (dc=all): 'Depooling db2247 (T419635)', diff saved to https://phabricator.wikimedia.org/P90427 and previous config saved to /var/cache/conftool/dbconfig/20260411-144002-fceratto.json [16:09:15] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable