[07:10:25] FIRING: [2x] SystemdUnitFailed: haproxy.service on cp3075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:15:25] RESOLVED: [2x] SystemdUnitFailed: haproxy.service on cp3075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:26:25] FIRING: [2x] SystemdUnitFailed: haproxy.service on cp3068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:31:25] RESOLVED: [2x] SystemdUnitFailed: haproxy.service on cp3068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:25] FIRING: [2x] SystemdUnitFailed: haproxy.service on cp3069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:25] RESOLVED: [2x] SystemdUnitFailed: haproxy.service on cp3069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:17:12] Traffic, MW-Interfaces-Team, ServiceOps new, Epic, and 3 others: Epic: Enforce API rate limits (WE5.1.3c) - https://phabricator.wikimedia.org/T412585#11947486 (Clement_Goubert)
[10:50:43] blblack: Well, it turns out, the rest-gateway's no-cache header was misconfigured (added to request and not response), and since I've changed that, the calls are effectively cache misses, even when I stay on the same Origin.
[10:51:09] Now I have no idea what mechanism is used for the revalidation but.. yeah
[10:57:11] netops, DC-Ops, Infrastructure-Foundations, SRE: Decom eqord POP - https://phabricator.wikimedia.org/T427050 (cmooney) NEW p:Triage→Medium
[10:57:40] netops, DC-Ops, Infrastructure-Foundations, SRE: Decom eqord POP - https://phabricator.wikimedia.org/T427050#11947853 (cmooney)
[10:58:59] netops, DC-Ops, Infrastructure-Foundations, SRE: Decom eqord POP - https://phabricator.wikimedia.org/T427050#11947860 (cmooney)
[10:59:05] Traffic, ContentTranslation, LPL Hypothesis, Security-Team, and 6 others: CX dashboard can't load page collections on some wikis (blocked by CORS) - https://phabricator.wikimedia.org/T426323#11947857 (Clement_Goubert) Open→Resolved While investigating with @hnowlan he observed the `ca...
[11:01:25] FIRING: [2x] SystemdUnitFailed: haproxy.service on cp3079:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:06:25] RESOLVED: [2x] SystemdUnitFailed: haproxy.service on cp3079:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:07:47] Traffic, SRE: wiki.openstreetmap.org Commons thumbs rate limit allowance - https://phabricator.wikimedia.org/T423570#11947887 (jcrespo) Open→Resolved I am not seeing any 429 from this source in the last 15 days, so tentatively resolving. Please reopen if you disagree.
[11:39:31] netops, Infrastructure-Foundations: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611#11947995 (cmooney)
[11:40:48] netops, Infrastructure-Foundations: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611#11947997 (cmooney) Everything is more-or-less done here. The eqsin link is still operational, though traffic is flowing via the switches due to OSPF cost. We can leave that one in place for now a...
[11:48:31] netops, Infrastructure-Foundations: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611#11948010 (cmooney)
[11:48:34] netops, Infrastructure-Foundations: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611#11948012 (cmooney)
[11:48:40] netops, Infrastructure-Foundations: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611#11948014 (cmooney)
[11:48:46] netops, Infrastructure-Foundations: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611#11948016 (cmooney)
[11:50:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp3076:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3076&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[11:55:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp3076:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3076&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[12:10:50] Traffic, ContentTranslation, LPL Hypothesis, Security-Team, and 5 others: CX dashboard can't load page collections on some wikis (blocked by CORS) - https://phabricator.wikimedia.org/T426323#11948059 (kostajh) >>! In T426323#11947857, @Clement_Goubert wrote: > While investigating with @hnowla...
[13:21:02] netops, Infrastructure-Foundations: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611#11948291 (cmooney) p:High→Low
[13:33:03] Traffic, Data-Engineering: Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068 (elukey) NEW
[13:34:44] Traffic, Data-Engineering: Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#11948372 (elukey) @JAllemandou I imagine that adding the new JSON field to `webrequest_frontend_{text,upload}` require some upstream changes to the DE ingestion pipeline as well right?...
[13:44:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp3078:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3078&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[13:44:41] damn
[13:46:07] sigh
[13:49:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp3078:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3078&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[15:26:24] Traffic, Patch-For-Review: Synchronize and rotate TCP Fastopen keys for various use-cases - https://phabricator.wikimedia.org/T355446#11948737 (ssingh) Open→In progress
[15:30:42] Traffic, SRE: Startup failure for Bird on new durum hosts - https://phabricator.wikimedia.org/T419868#11948763 (ssingh) Open→Resolved a:ssingh We have done quite a few reimages of durum since then (and reboots) and this issue was not observed. I am taking the liberty to close this as part...
[15:31:29] Traffic: Update South America geo-maps - https://phabricator.wikimedia.org/T387774#11948767 (ssingh) Open→Resolved This has been completed in various iterations of the updates to geo-maps.
[15:33:59] Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154#11948776 (ssingh) The only blocker in this task was the cp hosts for OpenSSL. We have already upgraded them to trixie in T401832, so this task can be resolved. A note has been made for the same in the...
[15:34:19] Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154#11948780 (ssingh)
[15:34:30] Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154#11948782 (ssingh) Stalled→Resolved a:ssingh
[15:36:12] Traffic, SRE: PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - https://phabricator.wikimedia.org/T356951#11948788 (ssingh) Open→Resolved a:ssingh There has been no follow-up to this in a while (and this is on k8s anyway now?) and this task has been open since 2024. I a...
[15:39:15] Traffic, Infrastructure-Foundations, SRE, SRE-tools: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166#11948797 (ssingh) Open→Resolved a:ssingh LVS in core sites will be superseded by Liberica so we are unlikely to spend any time on this. I am taking...
[15:45:12] Traffic, DC-Ops, Infrastructure-Foundations: ulsfo switch work May 2026: Host reimaging - https://phabricator.wikimedia.org/T424686#11948831 (ssingh) Open→Resolved All hosts were reimaged and have been back in production for a while. Resolving.
[15:45:37] Traffic: HTTP 503 error trying to make any edits on Wikipedia - https://phabricator.wikimedia.org/T423991#11948835 (ssingh) @Ergur: Hi. Can you confirm if this is a problem for you still or has resolved?
[15:47:17] Traffic: Provide better error pages for HAProxy - https://phabricator.wikimedia.org/T352291#11948838 (ssingh) I am curious, which error pages are we talking about?
[15:48:20] Traffic, Commons, SRE: Backend fetch failed - https://phabricator.wikimedia.org/T383013#11948842 (ssingh) Open→Resolved a:ssingh It seems like the issue was transient and therefore I am taking the liberty to close this as part of regular task cleanup. Please re-open if desired.
[15:50:02] Traffic, conftool, SRE: confd causes soft lockup when you are tailing a file with -F and the state is updated - https://phabricator.wikimedia.org/T372646#11948872 (ssingh) Open→Resolved a:ssingh No one else has observed this issue and it has been almost two years since this was report...
[15:52:29] Traffic, Data-Platform-SRE: Validate pybal config in CI - https://phabricator.wikimedia.org/T394789#11948908 (ssingh) Open→Declined LVS in core sites will be superseded by Liberica so we are unlikely to spend any time on this. I am taking the liberty to close this as part of regular task cle...
[15:53:35] Traffic: Investigate setting init_on_alloc=0 on cache hosts - https://phabricator.wikimedia.org/T401025#11948922 (ssingh) We never got to this in Q3 or even Q4. Should we plan to do this in Q1 2026?
[15:55:18] Traffic: images are not loading for some users (on the us west coast?) - https://phabricator.wikimedia.org/T425670#11948929 (ssingh) Open→Resolved a:ssingh Boldly resolving for the reasons above: the issue was transient because we responded to it, there is no follow up and there is nothing on...
[15:57:26] Traffic, SRE: Investigate port 80 page in text@esams for Ipv6 - https://phabricator.wikimedia.org/T423667#11948936 (ssingh) Open→Declined This hasn't happened again and it's hard investigating now what caused these two blips. Boldly resolving for this as part of regular task cleanup. If it ha...
[15:58:40] netops, Traffic, Infrastructure-Foundations, SRE: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11948942 (ssingh) @cmooney: We plan to move to Liberica in Q1 or Q2 of APP2026. Do you think we should still consider w...
[16:00:53] Traffic, SRE: ATS automatically restarted due to receiving SIGUSR2 on cp5024 - https://phabricator.wikimedia.org/T344674#11948949 (ssingh) Open→Resolved a:ssingh This hasn't happened in a while (last incident was 2023) and we have run `sre.cdn.roll-reboot` many times since then, so boldly...
[16:04:37] netops, Traffic, Infrastructure-Foundations, SRE: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11948966 (cmooney) >>! In T405630#11948942, @ssingh wrote: > @cmooney: We plan to move to Liberica in Q1 or Q2 of FY202...
[16:05:41] netops, Traffic, Infrastructure-Foundations, SRE: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11948969 (ssingh) Thanks for the update and the explanation, @cmooney!
[16:23:22] Traffic, DC-Ops, decommission-hardware, ops-codfw, and 2 others: Decommision hosts cp2041 - cp2042 - https://phabricator.wikimedia.org/T426828#11949051 (Jhancock.wm) Open→Resolved
[18:10:47] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux - wonky routing with IPv6 RAs and EVPN Anycast GW - https://phabricator.wikimedia.org/T420706#11949332 (cmooney) Nokia have told us they are going to fix this and the patch is scheduled for release 26.7.1 which should be out late July/August.
[18:47:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp3080:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3080&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[18:52:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp3080:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3080&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted