[00:26:41] Traffic, Patch-For-Review: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11716587 (BCornwall)
[02:17:23] RESOLVED: ErrorBudgetBurn: varnish-combined codfw - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:25:23] FIRING: ErrorBudgetBurn: varnish-combined codfw - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:55:00] <_joe_> what happened last night in codfw?
[06:55:16] <_joe_> there seems to have been a big varnish error budget burn
[07:01:09] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: ATS: align ATS and Gerrit Apache timeouts to reenable connection re-use - https://phabricator.wikimedia.org/T417998#11717078 (ABran-WMF) it's merged! I'll resolve {T246763}. Thanks for pointing out that c...
[07:21:40] Traffic, collaboration-services, Gerrit, Release-Engineering-Team: Gerrit: Debug connection re-use on Gerrit's httpd causing Gerrit interface to be very slow - https://phabricator.wikimedia.org/T420189#11717097 (ABran-WMF) I've merged patch with the Jetty timeout alignment made by @hashar in {T2...
[07:23:35] Traffic, collaboration-services, Gerrit, Release-Engineering-Team: Gerrit: Debug connection re-use on Gerrit's httpd causing Gerrit interface to be very slow - https://phabricator.wikimedia.org/T420189#11717100 (ABran-WMF)
[08:25:38] FIRING: ErrorBudgetBurn: varnish-combined codfw - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[10:00:28] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: Gerrit: Debug connection re-use on Gerrit's httpd causing Gerrit interface to be very slow - https://phabricator.wikimedia.org/T420189#11717566 (ABran-WMF) After merging [[ https://gerrit.wikimedia.org/r/c/op...
[12:24:06] netops, Infrastructure-Foundations: esams/magru: 185.71.138.0/24 (wikidough) prefix not advertized - https://phabricator.wikimedia.org/T420342 (ayounsi) NEW p:Triage→High
[12:24:18] netops, Infrastructure-Foundations: esams/magru: 185.71.138.0/24 (wikidough) prefix not advertized - https://phabricator.wikimedia.org/T420342#11718192 (ayounsi)
[12:25:38] FIRING: ErrorBudgetBurn: varnish-combined codfw - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[12:38:02] netops, Traffic, Infrastructure-Foundations, Patch-For-Review: esams/magru: 185.71.138.0/24 (wikidough) prefix not advertized - https://phabricator.wikimedia.org/T420342#11718237 (ayounsi)
[12:43:20] netops, Infrastructure-Foundations: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180#11718283 (taavi)
[12:47:41] netops, Infrastructure-Foundations, ServiceOps new, SRE, Data-Platform-SRE (2026-03-06 - 2026-03-27): Eqiad: lsw1-c2-eqiad BGP maintenance/ Tuesday 17th at 9:30 CDT - https://phabricator.wikimedia.org/T420158#11718297 (cmooney) Open→Declined This won't be required now, we have res...
[12:49:03] netops, Infrastructure-Foundations, ServiceOps new, SRE, Data-Platform-SRE (2026-03-06 - 2026-03-27): Eqiad: lsw1-c7-eqiad BGP maintenance/ Thursday 19th at 10:00 am CDT - https://phabricator.wikimedia.org/T420159#11718300 (cmooney) Open→Declined This won't be needed now, we were...
[13:05:30] Traffic: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825#11718393 (Fabfur)
[13:13:17] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419859#11718437 (ayounsi) Open→Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on...
[13:13:45] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419858#11718441 (ayounsi) Open→Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on...
[13:13:58] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr1-esams:9804) - https://phabricator.wikimedia.org/T419857#11718445 (ayounsi) Open→Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on...
[13:14:04] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr1-esams:9804) - https://phabricator.wikimedia.org/T419856#11718448 (ayounsi) Open→Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on...
[13:14:10] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419855#11718451 (ayounsi) Open→Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on...
[13:14:18] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419854#11718454 (ayounsi) Open→Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on...
[13:43:19] netops, Infrastructure-Foundations, SRE: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351 (cmooney) NEW p:Triage→Medium
[13:43:25] netops, Infrastructure-Foundations, SRE: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351#11718588 (cmooney)
[13:43:30] netops, Infrastructure-Foundations, ServiceOps new, SRE: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11718589 (cmooney)
[13:49:33] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11718638 (Papaul)
[14:09:30] Traffic: Add new Auth DNS IPv6 addresses to ns_group firewall group - https://phabricator.wikimedia.org/T420361 (taavi) NEW
[14:15:23] RESOLVED: ErrorBudgetBurn: varnish-combined codfw - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:49:06] Traffic, DC-Ops, ops-eqsin, SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11718960 (RobH) The distro swap did not fix this host, it will require a mainboard swap via a procurement task (linked in)
[14:56:05] Traffic, DC-Ops, ops-eqsin, SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11719023 (RobH)
[14:56:12] Traffic, DC-Ops, ops-eqsin, SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11719026 (RobH)
[14:56:21] Traffic, DC-Ops, ops-eqsin, SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11719028 (RobH)
[14:57:21] Traffic, SRE: Startup failure for Bird on new durum hosts - https://phabricator.wikimedia.org/T419868#11719032 (MoritzMuehlenhoff) >>! In T419868#11713955, @ssingh wrote: > That's interesting, thanks for debugging. What is weird is that a restart of anycast-healthchecker then should have fixed this in th...
[15:14:09] netops, Infrastructure-Foundations, SRE: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351#11719150 (cmooney) Open→Resolved Ok this work is now complete. Only had to reset the tunnel on `lsw1-d4-eqiad` it w...
[15:15:57] netops, Infrastructure-Foundations, ServiceOps new, SRE: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11719160 (cmooney) p:Medium→Low Ok all vxlan tunnels right now on row c/d leaf switches to ssw1-d1-eqiad and ssw1-d8-eqiad have a valid vxlan tunnel id. So u...
[16:15:00] FIRING: [2x] SystemdUnitFailed: haproxy.service on cp7005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:18:40] RESOLVED: [4x] SystemdUnitFailed: haproxy.service on cp7005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:21:01] ^looking
[16:22:30] tmpfs certs stuff
[16:24:45] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11719697 (Fabfur) Procedure from the traffic perspective should be roughly - Depool ulsfo (around 0900UTC) and wait about 30' for all connections...
[17:15:39] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11720161 (BCornwall)
[17:27:30] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11720244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp3066.esams.wmnet with OS trixie
[17:27:33] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11720245 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp3066.esams.wmnet with OS trixie executed with errors: - cp3066 (**FAIL**) - **The reimage failed, see the cookbook...
[17:29:04] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11720255 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp3066.esams.wmnet with OS trixie
[17:30:02] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11720263 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp3067.esams.wmnet with OS trixie
[18:30:39] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11720532 (ssingh) >>! In T418971#11719697, @Fabfur wrote: > Procedure from the traffic perspective should be roughly > > - Depool ulsfo (around 0...
[18:41:03] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11720556 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp3066.esams.wmnet with OS trixie completed: - cp3066 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pup...
[18:43:09] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11720559 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp3067.esams.wmnet with OS trixie completed: - cp3067 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pup...
[18:55:35] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11720583 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp3068.esams.wmnet with OS trixie
[18:56:30] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11720584 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp3069.esams.wmnet with OS trixie
[19:16:13] Traffic, Patch-For-Review: Add new Auth DNS IPv6 addresses to ns_group firewall group - https://phabricator.wikimedia.org/T420361#11720604 (ssingh) Yes, thanks for reporting, we should probably add it to the definitions. I did a CR.
[19:18:36] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11720608 (BCornwall)
[19:50:35] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11720686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp3068.esams.wmnet with OS trixie completed: - cp3068 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pup...
[19:54:28] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11720697 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp3069.esams.wmnet with OS trixie completed: - cp3069 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pup...
[20:09:14] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11720750 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp3070.esams.wmnet with OS trixie
[20:09:31] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11720751 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp3071.esams.wmnet with OS trixie
[21:05:52] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11720986 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp3071.esams.wmnet with OS trixie completed: - cp3071 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pup...
[21:09:54] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11720998 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp3070.esams.wmnet with OS trixie completed: - cp3070 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pup...
[21:14:43] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11721003 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp3072.esams.wmnet with OS trixie
[21:15:31] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11721005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp3073.esams.wmnet with OS trixie
[22:10:56] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11721140 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp3072.esams.wmnet with OS trixie completed: - cp3072 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pup...
[22:13:28] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11721145 (CDobbins)
[22:15:21] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11721161 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp3073.esams.wmnet with OS trixie completed: - cp3073 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pup...
[22:21:15] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11721182 (CDobbins)
[22:56:45] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11721236 (BCornwall)