[00:03:31] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:05:31] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:08:31] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:08:31] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:09:31] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:09:31] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:12:31] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:12:31] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:13:31] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:16:31] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, 
wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:17:17] RESOLVED: [2x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:18:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:20:31] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:20:31] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:39:00] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1246360 [00:39:00] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1246360 (owner: TrainBranchBot) [00:51:34] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1246360 (owner: TrainBranchBot) [01:03:16] (CR) Zabe: "We can do this now" [puppet] - https://gerrit.wikimedia.org/r/1239483 (https://phabricator.wikimedia.org/T417492) (owner: Zabe) [01:08:57] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1246368 [01:08:57] (CR) TrainBranchBot: [C:+2] Branch commit for 
wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1246368 (owner: TrainBranchBot) [01:12:53] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [01:15:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:19:28] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:20:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:22:53] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [01:25:40] FIRING: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:28:40] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1246368 (owner: TrainBranchBot) [01:30:40] 
RESOLVED: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:49:13] FIRING: CertAlmostExpired: Certificate for service grafana:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#grafana:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:57:17] FIRING: [2x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:00:44] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [02:02:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:08:23] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:13:45] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 00s) [02:14:25] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:33:23] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:14:25] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:50:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:54:43] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:55:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:59:54] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:15:25] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:18:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:14:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:19:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:19:28] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:19:55] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:24:40] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:49:13] FIRING: CertAlmostExpired: Certificate for service grafana:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#grafana:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:02:32] FIRING: [6x] ProbeDown: Service 
wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:13:23] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:14:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1222.eqiad.wmnet with reason: Maintenance [06:14:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2207.codfw.wmnet with reason: Maintenance [06:15:25] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:18:23] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:20:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2157.codfw.wmnet with reason: Maintenance [06:20:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2157 (T418465)', diff saved to https://phabricator.wikimedia.org/P89271 and previous config saved to /var/cache/conftool/dbconfig/20260301-062047-marostegui.json [06:20:51] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [06:21:00] !log marostegui@cumin1003 
DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1159.eqiad.wmnet with reason: Maintenance [06:21:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T418465)', diff saved to https://phabricator.wikimedia.org/P89272 and previous config saved to /var/cache/conftool/dbconfig/20260301-062108-marostegui.json [06:25:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T418465)', diff saved to https://phabricator.wikimedia.org/P89273 and previous config saved to /var/cache/conftool/dbconfig/20260301-062515-marostegui.json [06:26:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T418465)', diff saved to https://phabricator.wikimedia.org/P89274 and previous config saved to /var/cache/conftool/dbconfig/20260301-062636-marostegui.json [06:26:40] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [06:26:49] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11660957 (Marostegui) >>! In T401966#11659582, @VRiley-WMF wrote: > @Marostegui It seems when I try to run it on dbproxy1028, it trie... 
[06:40:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P89275 and previous config saved to /var/cache/conftool/dbconfig/20260301-064023-marostegui.json [06:41:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P89276 and previous config saved to /var/cache/conftool/dbconfig/20260301-064145-marostegui.json [06:55:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P89277 and previous config saved to /var/cache/conftool/dbconfig/20260301-065531-marostegui.json [06:56:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P89278 and previous config saved to /var/cache/conftool/dbconfig/20260301-065653-marostegui.json [07:10:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T418465)', diff saved to https://phabricator.wikimedia.org/P89279 and previous config saved to /var/cache/conftool/dbconfig/20260301-071040-marostegui.json [07:10:43] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [07:10:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1161.eqiad.wmnet with reason: Maintenance [07:11:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:11:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1161 (T418465)', diff saved to https://phabricator.wikimedia.org/P89280 and previous config saved to 
/var/cache/conftool/dbconfig/20260301-071113-marostegui.json [07:12:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T418465)', diff saved to https://phabricator.wikimedia.org/P89281 and previous config saved to /var/cache/conftool/dbconfig/20260301-071201-marostegui.json [07:12:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2171.codfw.wmnet with reason: Maintenance [07:12:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2171 (T418465)', diff saved to https://phabricator.wikimedia.org/P89282 and previous config saved to /var/cache/conftool/dbconfig/20260301-071226-marostegui.json [07:15:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T418465)', diff saved to https://phabricator.wikimedia.org/P89283 and previous config saved to /var/cache/conftool/dbconfig/20260301-071521-marostegui.json [07:18:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T418465)', diff saved to https://phabricator.wikimedia.org/P89284 and previous config saved to /var/cache/conftool/dbconfig/20260301-071816-marostegui.json [07:18:20] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [07:18:23] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:20:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[07:23:23] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:25:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:30:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P89285 and previous config saved to /var/cache/conftool/dbconfig/20260301-073028-marostegui.json [07:33:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P89286 and previous config saved to /var/cache/conftool/dbconfig/20260301-073324-marostegui.json [07:45:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P89287 and previous config saved to /var/cache/conftool/dbconfig/20260301-074536-marostegui.json [07:48:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P89288 and previous config saved to /var/cache/conftool/dbconfig/20260301-074833-marostegui.json [07:54:43] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:59:54] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260301T0800) [08:00:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T418465)', diff saved to https://phabricator.wikimedia.org/P89289 and previous config saved to /var/cache/conftool/dbconfig/20260301-080044-marostegui.json [08:00:48] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [08:01:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1185.eqiad.wmnet with reason: Maintenance [08:01:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1185 (T418465)', diff saved to https://phabricator.wikimedia.org/P89290 and previous config saved to /var/cache/conftool/dbconfig/20260301-080110-marostegui.json [08:03:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T418465)', diff saved to https://phabricator.wikimedia.org/P89291 and previous config saved to /var/cache/conftool/dbconfig/20260301-080341-marostegui.json [08:03:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2178.codfw.wmnet with reason: Maintenance [08:04:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T418465)', diff saved to https://phabricator.wikimedia.org/P89292 and previous config saved to /var/cache/conftool/dbconfig/20260301-080404-marostegui.json [08:08:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T418465)', diff saved to https://phabricator.wikimedia.org/P89293 and previous config saved to 
/var/cache/conftool/dbconfig/20260301-080838-marostegui.json [08:08:42] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [08:18:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:19:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P89294 and previous config saved to /var/cache/conftool/dbconfig/20260301-081912-marostegui.json [08:23:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P89295 and previous config saved to /var/cache/conftool/dbconfig/20260301-082346-marostegui.json [08:34:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P89296 and previous config saved to /var/cache/conftool/dbconfig/20260301-083420-marostegui.json [08:38:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P89297 and previous config saved to /var/cache/conftool/dbconfig/20260301-083855-marostegui.json [08:49:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T418465)', diff saved to https://phabricator.wikimedia.org/P89298 and previous config saved to /var/cache/conftool/dbconfig/20260301-084928-marostegui.json [08:49:31] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from 
cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [08:49:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1200.eqiad.wmnet with reason: Maintenance [08:49:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1200 (T418465)', diff saved to https://phabricator.wikimedia.org/P89299 and previous config saved to /var/cache/conftool/dbconfig/20260301-084952-marostegui.json [08:52:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T418465)', diff saved to https://phabricator.wikimedia.org/P89300 and previous config saved to /var/cache/conftool/dbconfig/20260301-085246-marostegui.json [08:54:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T418465)', diff saved to https://phabricator.wikimedia.org/P89301 and previous config saved to /var/cache/conftool/dbconfig/20260301-085403-marostegui.json [08:54:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2192.codfw.wmnet with reason: Maintenance [08:54:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2192 (T418465)', diff saved to https://phabricator.wikimedia.org/P89302 and previous config saved to /var/cache/conftool/dbconfig/20260301-085427-marostegui.json [08:59:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T418465)', diff saved to https://phabricator.wikimedia.org/P89303 and previous config saved to /var/cache/conftool/dbconfig/20260301-085907-marostegui.json [08:59:11] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [09:07:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to 
https://phabricator.wikimedia.org/P89304 and previous config saved to /var/cache/conftool/dbconfig/20260301-090754-marostegui.json [09:10:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:14:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P89305 and previous config saved to /var/cache/conftool/dbconfig/20260301-091415-marostegui.json [09:15:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:19:28] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:23:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P89306 and previous config saved to /var/cache/conftool/dbconfig/20260301-092302-marostegui.json [09:29:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P89307 and previous config saved to /var/cache/conftool/dbconfig/20260301-092923-marostegui.json [09:38:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T418465)', diff saved to https://phabricator.wikimedia.org/P89308 and previous config saved to 
/var/cache/conftool/dbconfig/20260301-093810-marostegui.json [09:38:14] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [09:38:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1207.eqiad.wmnet with reason: Maintenance [09:38:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1207 (T418465)', diff saved to https://phabricator.wikimedia.org/P89309 and previous config saved to /var/cache/conftool/dbconfig/20260301-093835-marostegui.json [09:41:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T418465)', diff saved to https://phabricator.wikimedia.org/P89310 and previous config saved to /var/cache/conftool/dbconfig/20260301-094137-marostegui.json [09:44:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T418465)', diff saved to https://phabricator.wikimedia.org/P89311 and previous config saved to /var/cache/conftool/dbconfig/20260301-094432-marostegui.json [09:44:36] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [09:44:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2201.codfw.wmnet with reason: Maintenance [09:48:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2211.codfw.wmnet with reason: Maintenance [09:48:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2211 (T418465)', diff saved to https://phabricator.wikimedia.org/P89312 and previous config saved to /var/cache/conftool/dbconfig/20260301-094847-marostegui.json [09:49:13] FIRING: CertAlmostExpired: Certificate for service grafana:443 is about to expire - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#grafana:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:54:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T418465)', diff saved to https://phabricator.wikimedia.org/P89313 and previous config saved to /var/cache/conftool/dbconfig/20260301-095434-marostegui.json [09:54:38] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [09:56:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P89314 and previous config saved to /var/cache/conftool/dbconfig/20260301-095645-marostegui.json [10:02:32] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:09:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P89315 and previous config saved to /var/cache/conftool/dbconfig/20260301-100942-marostegui.json [10:11:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P89316 and previous config saved to /var/cache/conftool/dbconfig/20260301-101154-marostegui.json [10:24:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P89317 and previous config saved to /var/cache/conftool/dbconfig/20260301-102450-marostegui.json [10:27:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T418465)', diff saved to 
https://phabricator.wikimedia.org/P89318 and previous config saved to /var/cache/conftool/dbconfig/20260301-102702-marostegui.json [10:27:06] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [10:27:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1210.eqiad.wmnet with reason: Maintenance [10:27:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1210 (T418465)', diff saved to https://phabricator.wikimedia.org/P89319 and previous config saved to /var/cache/conftool/dbconfig/20260301-102727-marostegui.json [10:30:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:31:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T418465)', diff saved to https://phabricator.wikimedia.org/P89320 and previous config saved to /var/cache/conftool/dbconfig/20260301-103134-marostegui.json [10:35:25] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:39:53] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [10:39:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T418465)', diff saved to https://phabricator.wikimedia.org/P89321 and previous config saved to 
/var/cache/conftool/dbconfig/20260301-103958-marostegui.json [10:40:02] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [10:40:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2223.codfw.wmnet with reason: Maintenance [10:40:25] FIRING: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:40:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2223 (T418465)', diff saved to https://phabricator.wikimedia.org/P89322 and previous config saved to /var/cache/conftool/dbconfig/20260301-104024-marostegui.json [10:45:25] RESOLVED: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:46:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T418465)', diff saved to https://phabricator.wikimedia.org/P89323 and previous config saved to /var/cache/conftool/dbconfig/20260301-104606-marostegui.json [10:46:10] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [10:46:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P89324 and previous config saved to /var/cache/conftool/dbconfig/20260301-104642-marostegui.json [10:49:53] RECOVERY - Check unit status of 
wdqs-blazegraph-deadlock-check on wdqs2007 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [11:00:47] (03CR) 10Aklapper: "My only thought remaining is wondering whether replacing xhpast with phpast (instead of adding the latter) in .arclint makes any differenc" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221191 (https://phabricator.wikimedia.org/T413531) (owner: 10Pppery) [11:01:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P89325 and previous config saved to /var/cache/conftool/dbconfig/20260301-110114-marostegui.json [11:01:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P89326 and previous config saved to /var/cache/conftool/dbconfig/20260301-110151-marostegui.json [11:16:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P89327 and previous config saved to /var/cache/conftool/dbconfig/20260301-111622-marostegui.json [11:16:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T418465)', diff saved to https://phabricator.wikimedia.org/P89328 and previous config saved to /var/cache/conftool/dbconfig/20260301-111658-marostegui.json [11:17:02] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [11:17:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1216.eqiad.wmnet with reason: Maintenance [11:19:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1245.eqiad.wmnet with reason: 
Maintenance [11:21:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [11:31:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T418465)', diff saved to https://phabricator.wikimedia.org/P89329 and previous config saved to /var/cache/conftool/dbconfig/20260301-113131-marostegui.json [11:31:36] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [11:31:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2228.codfw.wmnet with reason: Maintenance [11:31:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2228 (T418465)', diff saved to https://phabricator.wikimedia.org/P89330 and previous config saved to /var/cache/conftool/dbconfig/20260301-113156-marostegui.json [11:36:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T418465)', diff saved to https://phabricator.wikimedia.org/P89331 and previous config saved to /var/cache/conftool/dbconfig/20260301-113636-marostegui.json [11:36:39] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [11:51:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P89332 and previous config saved to /var/cache/conftool/dbconfig/20260301-115144-marostegui.json [11:54:43] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:59:54] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:06:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P89333 and previous config saved to /var/cache/conftool/dbconfig/20260301-120652-marostegui.json [12:18:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:18:23] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:22:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T418465)', diff saved to https://phabricator.wikimedia.org/P89334 and previous config saved to /var/cache/conftool/dbconfig/20260301-122201-marostegui.json [12:22:04] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [12:23:23] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:19:28] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:40:51] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1246819 [13:49:13] FIRING: CertAlmostExpired: Certificate for service grafana:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#grafana:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:49:23] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - gerritlb6_443: Servers cp1102.eqiad.wmnet, cp1104.eqiad.wmnet, cp1112.eqiad.wmnet, cp1110.eqiad.wmnet, cp1114.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1100.eqiad.wmnet, cp1102.eqiad.wmnet, cp1104.eqiad.wmnet, cp1106.eqiad.wmnet, cp1108.eqiad.wmnet, cp1112.eqiad.wmnet, cp1110.eqiad.wmnet, cp1114.eqiad.wmnet are marked down but poo [13:49:23] tlb6_443: Servers cp1100.eqiad.wmnet, cp1104.eqiad.wmnet, cp1108.eqiad.wmnet, cp1112.eqiad.wmnet, cp1110.eqiad.wmnet, cp1114.eqiad.wmnet are marked down but pooled: gerritlb_443: Servers cp1100.eqiad.wmnet, cp1102.eqiad.wmnet, cp1112.eqiad.wmnet, cp1110.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:49:31] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - gerritlb6_443: Servers cp2041.codfw.wmnet, cp2039.codfw.wmnet, cp2033.codfw.wmnet, cp2027.codfw.wmnet, cp2031.codfw.wmnet are marked down but pooled: textlb_443: Servers cp2035.codfw.wmnet, cp2039.codfw.wmnet, cp2027.codfw.wmnet, cp2031.codfw.wmnet, cp2041.codfw.wmnet, cp2033.codfw.wmnet are marked down but pooled: 
textlb6_443: Servers cp2035.codfw.w [13:49:31] 2039.codfw.wmnet, cp2027.codfw.wmnet, cp2031.codfw.wmnet, cp2041.codfw.wmnet, cp2033.codfw.wmnet are marked down but pooled: gerritlb_443: Servers cp2035.codfw.wmnet, cp2039.codfw.wmnet, cp2029.codfw.wmnet, cp2031.codfw.wmnet, cp2027.codfw.wmnet, cp2033.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:49:31] PROBLEM - PyBal backends health check on lvs2011 is CRITICAL: PYBAL CRITICAL - CRITICAL - gerritlb6_443: Servers cp2035.codfw.wmnet, cp2029.codfw.wmnet, cp2027.codfw.wmnet, cp2041.codfw.wmnet, cp2033.codfw.wmnet are marked down but pooled: textlb_443: Servers cp2035.codfw.wmnet, cp2039.codfw.wmnet, cp2029.codfw.wmnet, cp2027.codfw.wmnet, cp2031.codfw.wmnet, cp2041.codfw.wmnet, cp2033.codfw.wmnet, cp2037.codfw.wmnet are marked down but poo [13:49:31] tlb6_443: Servers cp2035.codfw.wmnet, cp2039.codfw.wmnet, cp2029.codfw.wmnet, cp2037.codfw.wmnet, cp2041.codfw.wmnet, cp2033.codfw.wmnet are marked down but pooled: gerritlb_443: Servers cp2035.codfw.wmnet, cp2039.codfw.wmnet, cp2029.codfw.wmnet, cp2027.codfw.wmnet, cp2041.codfw.wmnet, cp2033.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:49:57] FIRING: [12x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:50:23] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:50:31] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:50:31] RECOVERY - PyBal backends health check on lvs2011 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:50:31] FIRING: 
[2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:50:39] !ack [13:50:40] no value provided for parameter incident and no default available [13:50:40] All incidents are already acked. [13:51:01] I did from the app [13:51:09] k [13:51:30] FIRING: [6x] LibericaUnhealthyRealserverPooled: Liberica service text-httpslb6_443 has 3 unhealthy realservers pooled on lvs5004:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [13:53:23] FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:53:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=drmrs - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:54:44] RESOLVED: JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:54:57] RESOLVED: [18x] ProbeDown: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:54:58] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [13:55:07] !ack [13:55:08] 7513 (ACKED) NELHigh sre (thanos-rule@main tcp.timed_out) [13:55:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:56:30] RESOLVED: [10x] LibericaUnhealthyRealserverPooled: Liberica service text-httpslb6_443 has 3 unhealthy realservers pooled on lvs5004:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [13:57:17] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - gerritlb6_443: Servers cp1108.eqiad.wmnet, cp1112.eqiad.wmnet, cp1110.eqiad.wmnet, cp1114.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1100.eqiad.wmnet, cp1104.eqiad.wmnet, cp1106.eqiad.wmnet, cp1112.eqiad.wmnet, cp1110.eqiad.wmnet, cp1114.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1100.eqiad.wmnet, cp1104.eqiad.w [13:57:17] 1108.eqiad.wmnet, cp1110.eqiad.wmnet, cp1114.eqiad.wmnet are marked down but pooled: gerritlb_443: Servers cp1100.eqiad.wmnet, cp1108.eqiad.wmnet, cp1110.eqiad.wmnet, cp1114.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:57:23] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - gerritlb6_443: Servers cp1100.eqiad.wmnet, cp1112.eqiad.wmnet, cp1110.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1100.eqiad.wmnet, cp1102.eqiad.wmnet, cp1108.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1100.eqiad.wmnet, 
cp1102.eqiad.wmnet, cp1104.eqiad.wmnet, cp1108.eqiad.wmnet, cp1110.eqiad.wmnet, cp1114.eqiad.w [13:57:23] marked down but pooled: gerritlb_443: Servers cp1102.eqiad.wmnet, cp1106.eqiad.wmnet, cp1110.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:57:31] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp2039.codfw.wmnet, cp2029.codfw.wmnet, cp2033.codfw.wmnet are marked down but pooled: textlb6_443: Servers cp2035.codfw.wmnet, cp2029.codfw.wmnet, cp2027.codfw.wmnet, cp2033.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:57:31] PROBLEM - PyBal backends health check on lvs2011 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp2029.codfw.wmnet, cp2037.codfw.wmnet, cp2027.codfw.wmnet, cp2041.codfw.wmnet are marked down but pooled: textlb6_443: Servers cp2029.codfw.wmnet, cp2027.codfw.wmnet, cp2033.codfw.wmnet, cp2041.codfw.wmnet are marked down but pooled: gerritlb_443: Servers cp2039.codfw.wmnet, cp2041.codfw.wmnet, cp2033.codfw.wmnet are marked down [13:57:31] ed https://wikitech.wikimedia.org/wiki/PyBal [13:58:17] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:58:23] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:58:31] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:58:31] RECOVERY - PyBal backends health check on lvs2011 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:58:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=drmrs - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:59:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 3 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [13:59:57] FIRING: [18x] ProbeDown: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:59:58] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:00:27] FIRING: [13x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:01:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=drmrs - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:02:32] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:02:58] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - 
https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:03:45] !ack [14:03:46] 7517 (ACKED) [2x] ProbeDown sre () [14:03:46] 7518 (ACKED) NELHigh sre (thanos-rule@main tcp.timed_out) [14:04:30] RESOLVED: [8x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 2 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [14:04:57] RESOLVED: [13x] ProbeDown: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip6) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:05:28] RESOLVED: [15x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:06:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=drmrs - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:07:58] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:10:36] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists, 13Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#11661302 (10EPIC) @Dzahn I have adde... 
[14:37:12] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [14:37:13] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:38:32] !ack [14:38:33] 7519 (ACKED) VarnishUnavailable global sre (varnish-text thanos-rule@main) [14:38:33] 7520 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule@main) [14:42:12] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [14:42:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [15:05:29] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief2002 is CRITICAL: PROCS CRITICAL: 2 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [15:06:29] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief2002 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [15:17:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[15:17:44] Deployment miscweb-bugzilla in miscweb at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=miscweb&var-deployment=miscweb-bugzilla - ... [15:17:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:32:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [15:32:44] Deployment miscweb-bugzilla in miscweb at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=miscweb&var-deployment=miscweb-bugzilla - ... [15:32:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:54:44] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:59:54] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:08:23] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:18:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:19:05] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists, 13Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#11661379 (10Urbanecm) >>! In T351202... [16:28:53] PROBLEM - ganeti-noded running on ganeti1028 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [16:29:53] RECOVERY - ganeti-noded running on ganeti1028 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [16:33:23] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:36:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2213.codfw.wmnet with reason: Maintenance [16:36:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1230.eqiad.wmnet with reason: Maintenance [16:39:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1158.eqiad.wmnet with reason: Maintenance [16:39:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:39:32] (03CR) 10Pppery: "IIRC I originally did this because the parents were 
adding non-xhpast-parsable code. Which it no longer does, so adding xhpast would proba" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221191 (https://phabricator.wikimedia.org/T413531) (owner: 10Pppery) [16:39:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1158 (T418465)', diff saved to https://phabricator.wikimedia.org/P89335 and previous config saved to /var/cache/conftool/dbconfig/20260301-163938-marostegui.json [16:39:42] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [16:40:15] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2150.codfw.wmnet with reason: Maintenance [16:40:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2150 (T418465)', diff saved to https://phabricator.wikimedia.org/P89336 and previous config saved to /var/cache/conftool/dbconfig/20260301-164022-marostegui.json [16:41:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T418465)', diff saved to https://phabricator.wikimedia.org/P89337 and previous config saved to /var/cache/conftool/dbconfig/20260301-164153-marostegui.json [16:45:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T418465)', diff saved to https://phabricator.wikimedia.org/P89338 and previous config saved to /var/cache/conftool/dbconfig/20260301-164545-marostegui.json [16:45:51] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [16:57:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P89339 and previous config saved to 
/var/cache/conftool/dbconfig/20260301-165701-marostegui.json [17:00:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P89340 and previous config saved to /var/cache/conftool/dbconfig/20260301-170053-marostegui.json [17:12:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P89341 and previous config saved to /var/cache/conftool/dbconfig/20260301-171210-marostegui.json [17:16:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P89342 and previous config saved to /var/cache/conftool/dbconfig/20260301-171602-marostegui.json [17:19:28] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:27:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T418465)', diff saved to https://phabricator.wikimedia.org/P89343 and previous config saved to /var/cache/conftool/dbconfig/20260301-172717-marostegui.json [17:27:21] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [17:27:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance [17:27:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1170 (T418465)', diff saved to https://phabricator.wikimedia.org/P89344 and previous config saved to 
/var/cache/conftool/dbconfig/20260301-172742-marostegui.json [17:31:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T418465)', diff saved to https://phabricator.wikimedia.org/P89345 and previous config saved to /var/cache/conftool/dbconfig/20260301-173110-marostegui.json [17:31:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2159.codfw.wmnet with reason: Maintenance [17:31:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2159 (T418465)', diff saved to https://phabricator.wikimedia.org/P89346 and previous config saved to /var/cache/conftool/dbconfig/20260301-173134-marostegui.json [17:32:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T418465)', diff saved to https://phabricator.wikimedia.org/P89347 and previous config saved to /var/cache/conftool/dbconfig/20260301-173253-marostegui.json [17:32:57] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [17:36:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T418465)', diff saved to https://phabricator.wikimedia.org/P89348 and previous config saved to /var/cache/conftool/dbconfig/20260301-173649-marostegui.json [17:48:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P89349 and previous config saved to /var/cache/conftool/dbconfig/20260301-174802-marostegui.json [17:49:13] FIRING: CertAlmostExpired: Certificate for service grafana:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#grafana:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:51:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to 
https://phabricator.wikimedia.org/P89350 and previous config saved to /var/cache/conftool/dbconfig/20260301-175157-marostegui.json [18:02:32] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:03:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P89351 and previous config saved to /var/cache/conftool/dbconfig/20260301-180310-marostegui.json [18:07:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P89352 and previous config saved to /var/cache/conftool/dbconfig/20260301-180705-marostegui.json [18:18:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T418465)', diff saved to https://phabricator.wikimedia.org/P89353 and previous config saved to /var/cache/conftool/dbconfig/20260301-181818-marostegui.json [18:18:22] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [18:18:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance [18:21:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1174.eqiad.wmnet with reason: Maintenance [18:21:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1174 (T418465)', diff saved to https://phabricator.wikimedia.org/P89354 and previous config saved to /var/cache/conftool/dbconfig/20260301-182153-marostegui.json [18:22:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 
(T418465)', diff saved to https://phabricator.wikimedia.org/P89355 and previous config saved to /var/cache/conftool/dbconfig/20260301-182213-marostegui.json [18:22:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2168.codfw.wmnet with reason: Maintenance [18:22:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2168 (T418465)', diff saved to https://phabricator.wikimedia.org/P89356 and previous config saved to /var/cache/conftool/dbconfig/20260301-182238-marostegui.json [18:24:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T418465)', diff saved to https://phabricator.wikimedia.org/P89357 and previous config saved to /var/cache/conftool/dbconfig/20260301-182409-marostegui.json [18:24:13] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [18:27:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T418465)', diff saved to https://phabricator.wikimedia.org/P89358 and previous config saved to /var/cache/conftool/dbconfig/20260301-182750-marostegui.json [18:39:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P89359 and previous config saved to /var/cache/conftool/dbconfig/20260301-183917-marostegui.json [18:43:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P89360 and previous config saved to /var/cache/conftool/dbconfig/20260301-184259-marostegui.json [18:54:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P89361 and previous config saved to /var/cache/conftool/dbconfig/20260301-185425-marostegui.json [18:58:08] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P89362 and previous config saved to /var/cache/conftool/dbconfig/20260301-185807-marostegui.json [19:09:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T418465)', diff saved to https://phabricator.wikimedia.org/P89363 and previous config saved to /var/cache/conftool/dbconfig/20260301-190934-marostegui.json [19:09:38] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [19:09:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1191.eqiad.wmnet with reason: Maintenance [19:09:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1191 (T418465)', diff saved to https://phabricator.wikimedia.org/P89364 and previous config saved to /var/cache/conftool/dbconfig/20260301-190958-marostegui.json [19:12:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T418465)', diff saved to https://phabricator.wikimedia.org/P89365 and previous config saved to /var/cache/conftool/dbconfig/20260301-191213-marostegui.json [19:13:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T418465)', diff saved to https://phabricator.wikimedia.org/P89366 and previous config saved to /var/cache/conftool/dbconfig/20260301-191315-marostegui.json [19:13:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2182.codfw.wmnet with reason: Maintenance [19:13:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2182 (T418465)', diff saved to https://phabricator.wikimedia.org/P89367 and previous config saved to /var/cache/conftool/dbconfig/20260301-191340-marostegui.json [19:16:25] PROBLEM - Check unit 
status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:18:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T418465)', diff saved to https://phabricator.wikimedia.org/P89368 and previous config saved to /var/cache/conftool/dbconfig/20260301-191858-marostegui.json [19:19:01] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [19:27:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P89369 and previous config saved to /var/cache/conftool/dbconfig/20260301-192721-marostegui.json [19:34:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P89370 and previous config saved to /var/cache/conftool/dbconfig/20260301-193406-marostegui.json [19:42:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P89371 and previous config saved to /var/cache/conftool/dbconfig/20260301-194230-marostegui.json [19:49:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P89372 and previous config saved to /var/cache/conftool/dbconfig/20260301-194914-marostegui.json [19:54:44] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:57:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling 
after maintenance db1191 (T418465)', diff saved to https://phabricator.wikimedia.org/P89373 and previous config saved to /var/cache/conftool/dbconfig/20260301-195738-marostegui.json [19:57:42] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [19:57:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1194.eqiad.wmnet with reason: Maintenance [19:58:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1194 (T418465)', diff saved to https://phabricator.wikimedia.org/P89374 and previous config saved to /var/cache/conftool/dbconfig/20260301-195803-marostegui.json [19:59:54] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:00:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T418465)', diff saved to https://phabricator.wikimedia.org/P89375 and previous config saved to /var/cache/conftool/dbconfig/20260301-200016-marostegui.json [20:04:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T418465)', diff saved to https://phabricator.wikimedia.org/P89376 and previous config saved to /var/cache/conftool/dbconfig/20260301-200422-marostegui.json [20:04:26] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [20:04:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2198.codfw.wmnet with reason: Maintenance [20:08:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 
db2200.codfw.wmnet with reason: Maintenance [20:12:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2208.codfw.wmnet with reason: Maintenance [20:12:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2208 (T418465)', diff saved to https://phabricator.wikimedia.org/P89377 and previous config saved to /var/cache/conftool/dbconfig/20260301-201212-marostegui.json [20:12:15] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [20:15:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P89378 and previous config saved to /var/cache/conftool/dbconfig/20260301-201525-marostegui.json [20:16:25] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:17:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T418465)', diff saved to https://phabricator.wikimedia.org/P89379 and previous config saved to /var/cache/conftool/dbconfig/20260301-201720-marostegui.json [20:17:24] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [20:18:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [20:30:34] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P89380 and previous config saved to /var/cache/conftool/dbconfig/20260301-203033-marostegui.json [20:32:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P89381 and previous config saved to /var/cache/conftool/dbconfig/20260301-203227-marostegui.json [20:45:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T418465)', diff saved to https://phabricator.wikimedia.org/P89382 and previous config saved to /var/cache/conftool/dbconfig/20260301-204541-marostegui.json [20:45:45] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [20:45:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1202.eqiad.wmnet with reason: Maintenance [20:46:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1202 (T418465)', diff saved to https://phabricator.wikimedia.org/P89383 and previous config saved to /var/cache/conftool/dbconfig/20260301-204606-marostegui.json [20:47:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P89384 and previous config saved to /var/cache/conftool/dbconfig/20260301-204736-marostegui.json [20:48:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T418465)', diff saved to https://phabricator.wikimedia.org/P89385 and previous config saved to /var/cache/conftool/dbconfig/20260301-204820-marostegui.json [20:50:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:54:45] (PS1) Zabe: Stop setting MW_USE_CONFIG_SCHEMA [mediawiki-config] - https://gerrit.wikimedia.org/r/1246880 (https://phabricator.wikimedia.org/T304460) [20:55:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:55:37] (CR) CI reject: [V:-1] Stop setting MW_USE_CONFIG_SCHEMA [mediawiki-config] - https://gerrit.wikimedia.org/r/1246880 (https://phabricator.wikimedia.org/T304460) (owner: Zabe) [20:55:40] (PS2) Zabe: Stop setting MW_USE_CONFIG_SCHEMA [mediawiki-config] - https://gerrit.wikimedia.org/r/1246880 (https://phabricator.wikimedia.org/T304460) [21:02:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T418465)', diff saved to https://phabricator.wikimedia.org/P89386 and previous config saved to /var/cache/conftool/dbconfig/20260301-210244-marostegui.json [21:02:48] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [21:03:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2218.codfw.wmnet with reason: Maintenance [21:03:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2218 (T418465)', diff saved to https://phabricator.wikimedia.org/P89387 and previous config saved to /var/cache/conftool/dbconfig/20260301-210309-marostegui.json [21:03:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P89388 and previous config saved to
/var/cache/conftool/dbconfig/20260301-210329-marostegui.json [21:08:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T418465)', diff saved to https://phabricator.wikimedia.org/P89389 and previous config saved to /var/cache/conftool/dbconfig/20260301-210815-marostegui.json [21:08:19] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465