[00:03:54] (PS1) BryanDavis: toolhub: Bump container version to 2026-02-20-232022-production [deployment-charts] - https://gerrit.wikimedia.org/r/1241047 (https://phabricator.wikimedia.org/T372824)
[00:22:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T415786)', diff saved to https://phabricator.wikimedia.org/P88940 and previous config saved to /var/cache/conftool/dbconfig/20260221-002255-marostegui.json
[00:23:01] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[00:30:46] (PS1) Hashar: gerrit: ProxyTimeout shorter than Jetty's idle timeout [puppet] - https://gerrit.wikimedia.org/r/1241048 (https://phabricator.wikimedia.org/T246763)
[00:30:46] (CR) Hashar: "Hi Chris and Valentin" [puppet] - https://gerrit.wikimedia.org/r/1241048 (https://phabricator.wikimedia.org/T246763) (owner: Hashar)
[00:38:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P88941 and previous config saved to /var/cache/conftool/dbconfig/20260221-003804-marostegui.json
[00:39:21] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1241049
[00:39:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1241049 (owner: TrainBranchBot)
[00:47:17] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:48:07] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:52:54] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1241049 (owner: TrainBranchBot)
[00:53:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P88942 and previous config saved to /var/cache/conftool/dbconfig/20260221-005312-marostegui.json
[01:08:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T415786)', diff saved to https://phabricator.wikimedia.org/P88943 and previous config saved to /var/cache/conftool/dbconfig/20260221-010820-marostegui.json
[01:08:25] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[01:08:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance
[01:08:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2236 (T415786)', diff saved to https://phabricator.wikimedia.org/P88944 and previous config saved to /var/cache/conftool/dbconfig/20260221-010845-marostegui.json
[01:09:16] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1241050
[01:09:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1241050 (owner: TrainBranchBot)
[01:33:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:33:43] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1241050 (owner: TrainBranchBot)
[01:35:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[01:40:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[01:59:05] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[02:00:41] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:02:47] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mgmt ip for asw1-23-uslfo - pt1979@cumin2002"
[02:02:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mgmt ip for asw1-23-uslfo - pt1979@cumin2002"
[02:02:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[02:08:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:10:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:13:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [02:14:26] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 44s) [02:16:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:21:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:33:21] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [02:35:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T415786)', diff saved to https://phabricator.wikimedia.org/P88945 and previous config saved to /var/cache/conftool/dbconfig/20260221-023536-marostegui.json [02:35:41] T415786: Update 
imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [02:36:09] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:50:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P88946 and previous config saved to /var/cache/conftool/dbconfig/20260221-025044-marostegui.json [02:57:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:58:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [02:59:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:02:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [03:05:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P88947 and previous config saved to 
/var/cache/conftool/dbconfig/20260221-030552-marostegui.json
[03:13:05] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[03:18:52] pt1979@cumin2002 netbox (PID 185017) is awaiting input
[03:21:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T415786)', diff saved to https://phabricator.wikimedia.org/P88948 and previous config saved to /var/cache/conftool/dbconfig/20260221-032101-marostegui.json
[03:21:06] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[03:21:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[03:21:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1243 (T415786)', diff saved to https://phabricator.wikimedia.org/P88949 and previous config saved to /var/cache/conftool/dbconfig/20260221-032125-marostegui.json
[03:39:43] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mgmt ip for asw1-22-uslfo - pt1979@cumin2002"
[03:39:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mgmt ip for asw1-22-uslfo - pt1979@cumin2002"
[03:39:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:44:05] !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device asw1-22-ulsfo
[03:45:46] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.network.tls (exit_code=97) for network device asw1-22-ulsfo
[03:51:02] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11637925 (Papaul) Initial configuration done on both switches. what left on the switches : - user homer password - sre.network.tls cookbook I will work on this...
[05:19:23] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:19:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:30:23] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:30:25] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:34:41] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:43:21] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:50:03] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: eqiad A/B switch cabling documentation - https://phabricator.wikimedia.org/T418018#11637958 (Papaul) @robh please see below for the diagram requested. Note: On site need to provide the fiber length needed for each connection. I have...
[06:08:11] FIRING: Temperature: Temp issue on wdqs1023:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1023 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[06:13:11] RESOLVED: Temperature: Temp issue on wdqs1023:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1023 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[06:59:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:03:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:07:11] FIRING: Temperature: Temp issue on wdqs2023:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs2023 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[07:12:11] RESOLVED: Temperature: Temp issue on wdqs2023:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs2023 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[08:40:32] (PS1) Majavah: admin: Remove SSH keys for mmiller [puppet] - https://gerrit.wikimedia.org/r/1241058 (https://phabricator.wikimedia.org/T418036)
[08:41:34] (CR) Majavah: [C:+2] admin: Remove SSH keys for mmiller [puppet] - https://gerrit.wikimedia.org/r/1241058 (https://phabricator.wikimedia.org/T418036) (owner: Majavah)
[08:48:33] !log taavi@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging MMiller out of all services on: 2449 hosts
[10:05:17] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:12:25] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:12:25] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:16:15] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:17:15] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55711 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:17:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:18:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[10:28:37] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11638095 (MatthewVernon) @Jhancock.wm these nodes are swift frontends in the ms cluster, so should be ms-fe* not moss-fe* (moss* is a legacy name that should never apply to n...
[12:44:25] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:45:23] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:32:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T415786)', diff saved to https://phabricator.wikimedia.org/P88950 and previous config saved to /var/cache/conftool/dbconfig/20260221-133219-marostegui.json [13:32:25] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [13:34:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:39:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:47:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P88951 and previous config saved to /var/cache/conftool/dbconfig/20260221-134728-marostegui.json [13:53:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:58:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:02:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P88952 and previous config saved to /var/cache/conftool/dbconfig/20260221-140236-marostegui.json [14:17:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T415786)', diff saved to https://phabricator.wikimedia.org/P88953 and previous config saved to /var/cache/conftool/dbconfig/20260221-141744-marostegui.json [14:17:49] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [14:18:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance [14:18:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2237 (T415786)', diff saved to https://phabricator.wikimedia.org/P88954 and previous config saved to /var/cache/conftool/dbconfig/20260221-141809-marostegui.json [14:19:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:40:17] FIRING: ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:00:17] FIRING: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:05:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:09:23] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:09:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are 
marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:10:39] FIRING: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[15:11:59] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:12:05] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:15:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[15:18:46] looks like WDQS is unhappy. Taking a look now
[15:20:39] RESOLVED: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[15:23:21] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:24:41] FIRING: [2x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:28:43] !log bking@cumin2002 START - Cookbook sre.wdqs.restart
[15:33:30] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[15:36:15] (PS1) Zabe: Update documenation to reference config-schema.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1241167
[15:37:50] (PS2) Zabe: Update documenation to reference config-schema.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1241167
[15:38:21] FIRING: [2x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:39:59] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:40:05] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:41:19] !log restart wdqs CODFW in response to huge error rates https://w.wiki/Hw9u
[15:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:21] RESOLVED: [2x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:43:59] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - 
CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:44:05] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:45:59] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:46:05] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:46:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:50:25] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:50:25] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:51:39] FIRING: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:56:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown 
[16:01:39] FIRING: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:06:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:08:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:33:21] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:36:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T415786)', diff saved to https://phabricator.wikimedia.org/P88955 and previous config saved to /var/cache/conftool/dbconfig/20260221-163612-marostegui.json [16:36:17] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [16:39:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru 
(2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:41:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:51:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P88956 and previous config saved to /var/cache/conftool/dbconfig/20260221-165120-marostegui.json [16:51:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:06:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P88957 and previous config saved to /var/cache/conftool/dbconfig/20260221-170628-marostegui.json [17:08:17] FIRING: [7x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:13:17] FIRING: [18x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:17:12] (CR) Hashar: [C:-1] "I am pretty sure that is because Gerrit did not get restarted after the configuration has been updated. After the switch over the Gerrit " [puppet] - https://gerrit.wikimedia.org/r/1240217 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb) [17:21:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T415786)', diff saved to https://phabricator.wikimedia.org/P88958 and previous config saved to /var/cache/conftool/dbconfig/20260221-172135-marostegui.json [17:21:41] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [17:21:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [17:56:25] (CR) Hashar: [C:-1] "If I understand properly this:" [puppet] - https://gerrit.wikimedia.org/r/1238315 (https://phabricator.wikimedia.org/T416912) (owner: Arnaudb) [18:07:41] (PS3) Hashar: Run Gerrit spec tests on Bullseye/Bookworm [puppet] - https://gerrit.wikimedia.org/r/1240703 (owner: Muehlenhoff) [18:08:10] (CR) Hashar: [C:+1] Run Gerrit spec tests on Bullseye/Bookworm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1240703 (owner: Muehlenhoff) [18:19:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:32:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:37:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:08:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:13:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:19:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:24:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:29:49] ops-eqiad, DC-Ops: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T418062 (phaultfinder) NEW [19:36:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:41:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (195.200.68.147) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:46:39] FIRING: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:51:39] FIRING: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:56:39] FIRING: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:01:39] RESOLVED: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:30:07] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:35:00] ops-eqiad, SRE, DC-Ops: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T418062#11638532 (phaultfinder) [20:47:39] FIRING: CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=asw1-b3-magru:9804&var-bgp_group=core&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:52:39] FIRING: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:57:39] FIRING: [3x] 
CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:02:39] FIRING: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:12:39] FIRING: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:17:39] FIRING: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:22:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:27:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:30:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:32:39] RESOLVED: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown 
[21:47:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:52:39] FIRING: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:57:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:13:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:17:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [22:18:39] FIRING: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru 
(2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:19:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:22:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [22:23:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:27:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:32:39] FIRING: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:37:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (195.200.68.147) - group Switch - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:42:39] FIRING: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:47:39] FIRING: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:52:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:54:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:09:39] FIRING: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:14:39] 
FIRING: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:19:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:39:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (195.200.68.147) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:44:39] FIRING: [3x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:49:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and asw1-b3-magru (2a02:ec80:700:fe08::2) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-b3-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:50:09] FIRING: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:55:09] FIRING: [2x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru 
(2a02:ec80:700:fe08::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown