[00:05:09] RESOLVED: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:06:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:21:39] FIRING: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:26:39] FIRING: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:31:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:33:39] FIRING: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:38:39] FIRING: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:38:55] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1241364 [00:38:55] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1241364 (owner: 10TrainBranchBot) [00:43:39] RESOLVED: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:45:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:54:40] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1241364 (owner: 10TrainBranchBot) [01:00:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:01:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:06:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:07:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:08:59] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1241368 [01:09:00] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1241368 (owner: 10TrainBranchBot) [01:12:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:22:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:27:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:32:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:32:45] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1241368 (owner: 10TrainBranchBot) [01:42:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:56:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:00:45] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [02:01:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:01:51] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 01m 06s) [02:08:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:19:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [02:33:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:38:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T415786)', diff saved to https://phabricator.wikimedia.org/P88959 and previous config saved to /var/cache/conftool/dbconfig/20260222-023847-marostegui.json [02:38:52] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [02:53:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P88960 and previous config saved to /var/cache/conftool/dbconfig/20260222-025355-marostegui.json [03:09:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P88961 and previous config saved to /var/cache/conftool/dbconfig/20260222-030904-marostegui.json [03:24:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T415786)', diff saved to https://phabricator.wikimedia.org/P88962 and previous config saved to /var/cache/conftool/dbconfig/20260222-032412-marostegui.json [03:24:18] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [03:24:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2239.codfw.wmnet with reason: Maintenance [03:31:17] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:39:07] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:48:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1247.eqiad.wmnet with reason: Maintenance [04:49:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1247 (T415786)', diff saved to https://phabricator.wikimedia.org/P88963 and previous config saved to /var/cache/conftool/dbconfig/20260222-044859-marostegui.json [04:49:04] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [06:19:23] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:19:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:20:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:46:25] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:46:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:49:23] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:49:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:53:21] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [06:55:23] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:55:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:08:21] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260222T0800) [08:40:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:42:23] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:48:17] FIRING: [2x] ProbeDown: Service wdqs2021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:03:21] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [09:06:19] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:06:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:07:11] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:07:23] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:13:21] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [09:14:11] FIRING: Temperature: Temp issue on wdqs2023:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs2023 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [09:18:17] RESOLVED: [2x] ProbeDown: Service wdqs2021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:19:11] RESOLVED: Temperature: Temp issue on wdqs2023:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs2023 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [09:54:23] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:54:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:55:05] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:56:59] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:08:21] FIRING: [2x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [10:19:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:29:42] 06SRE, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11638658 (10Kredionecsresmi) !! Himbauan Penting jika ingin membatalkan pinjaman KrediOne kami menyarankan hub cs di nomor watsapp 0853-9319-3291 [10:30:12] 06SRE, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11638659 (10Kredionecsresmi) !! Himbauan Penting jika ingin membatalkan pinjaman KrediOne kami menyarankan hub cs di nomor watsapp 0853-9319-3291 [10:31:51] (03PS1) 10Daniel Kinzler: rest-gateway: use rlc claim from cookie with bearer token [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) [10:34:46] 06SRE, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11638669 (10Kredionecsresmi) 05Open→03Resolved a:03Kredionecsresmi !! Himbauan Penting jika ingin membatalkan pinjaman KrediOne kami menyarankan hub cs di... [10:35:24] 06SRE, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11638672 (10Kredionecsresmi) !! Himbauan Penting jika ingin membatalkan pinjaman KrediOne kami menyarankan hub cs di nomor watsapp 0853-9319-3291 [10:41:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [10:42:51] 06SRE, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11638675 (10Peachey88) 05Resolved→03Open a:05Kredionecsresmi→03None [11:01:27] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [11:04:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [11:18:21] FIRING: [2x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:20:59] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:23:59] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1022.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:25:59] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:26:05] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:29:27] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [11:35:57] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [11:49:59] 06SRE, 10Wikimedia-Mailing-lists: Upgrade lists.wikimedia.org to next Mailman/hyperkitty/postorius versions - https://phabricator.wikimedia.org/T286217#11638700 (10GreenReaper) Hyperkitty version 1.3.9 (we appear to be on 1.3.8 now) [adds support for list owners to delete messages from the archive](https://git... [11:53:17] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:58:17] FIRING: [24x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:05:57] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:11:45] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:11:45] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:13:41] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:13:45] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:15:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:20:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:23:17] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:27:15] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2008 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:27:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:28:13] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:28:17] FIRING: [21x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:28:23] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:33:17] FIRING: [18x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:33:21] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:38:17] FIRING: [10x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:43:17] FIRING: [10x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:19:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:41:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2245.codfw.wmnet with reason: Maintenance [14:41:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2245 (T415786)', diff saved to https://phabricator.wikimedia.org/P88964 and previous config saved to /var/cache/conftool/dbconfig/20260222-144110-marostegui.json [14:41:15] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [14:45:43] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Sun 22 Mar 2026 02:10:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [14:48:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:49:23] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:56:36] again? ;( [14:58:42] 06SRE, 10MediaWiki-Uploading, 06ServiceOps new, 10ServiceOps-Mediawiki: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#11638786 (10Aklapper) 05Open→03Stalled @Grand-Duc: Could you please answer the last comment? Thanks in advance! [15:02:13] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:07:13] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:23:17] FIRING: [4x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:30:52] (03PS1) 10DDesouza: Pre-deploy Comparative Reader Research survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241712 (https://phabricator.wikimedia.org/T417829) [15:32:09] (03CR) 10CI reject: [V:04-1] Pre-deploy Comparative Reader Research survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241712 (https://phabricator.wikimedia.org/T417829) (owner: 10DDesouza) [15:32:23] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:33:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:40:52] (03CR) 10Daniel Kinzler: rest-gateway: fix x-wmf-ratelimit-policy in access log (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240753 (https://phabricator.wikimedia.org/T413186) (owner: 10Daniel Kinzler) [15:42:18] (03PS1) 10DDesouza: Pre-deploy Comparative Reader Research survey on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241713 (https://phabricator.wikimedia.org/T417834) [15:43:30] (03CR) 10CI reject: [V:04-1] Pre-deploy Comparative Reader Research survey on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241713 (https://phabricator.wikimedia.org/T417834) (owner: 10DDesouza) [15:43:43] (03PS2) 10DDesouza: Pre-deploy Comparative Reader Research survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241712 (https://phabricator.wikimedia.org/T417829) [15:44:15] (03PS2) 10DDesouza: Pre-deploy Comparative Reader Research survey on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241713 (https://phabricator.wikimedia.org/T417834) [16:08:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:33:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:58:17] RESOLVED: [4x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:50:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T415786)', diff saved to https://phabricator.wikimedia.org/P88965 and previous config saved to /var/cache/conftool/dbconfig/20260222-175016-marostegui.json [17:50:22] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [18:05:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P88966 and previous config saved to /var/cache/conftool/dbconfig/20260222-180524-marostegui.json [18:19:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:20:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P88967 and previous config saved to /var/cache/conftool/dbconfig/20260222-182032-marostegui.json [18:35:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T415786)', diff saved to https://phabricator.wikimedia.org/P88968 and previous config saved to /var/cache/conftool/dbconfig/20260222-183541-marostegui.json [18:35:47] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [18:35:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1248.eqiad.wmnet with reason: Maintenance [18:36:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1248 (T415786)', diff saved to https://phabricator.wikimedia.org/P88969 and previous config saved to /var/cache/conftool/dbconfig/20260222-183605-marostegui.json [18:41:11] (03PS1) 10ArielGlenn: python tests: use type hints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1239529 (owner: 10Daniel Kinzler) [19:15:25] 10SRE-tools, 10Phabricator: Phabricator cli for serviceops - https://phabricator.wikimedia.org/T377311#11638877 (10Aklapper) [19:40:44] 06SRE, 10MediaWiki-Uploading, 06ServiceOps new, 10ServiceOps-Mediawiki: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#11638881 (10Grand-Duc) Ah, sorry, I forgot to get back at this. I made one test: 2026-02-06, [[ https://commons.wik... [21:23:21] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:28:21] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:54:14] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241846 [21:56:47] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241851 [22:18:23] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:18:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:19:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:22:05] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:23:59] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1022.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:27:59] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:28:05] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:33:21] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [22:44:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:47:26] (03PS1) 10Matthias Mullie: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1241870 [22:47:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:48:21] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [22:48:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1241870 (owner: 10Matthias Mullie) [22:49:23] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:49:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:55:23] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:55:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:49:23] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:49:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal