[05:28:02] netops, DBA, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11957928 (Marostegui)
[05:29:21] netops, DBA, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11957930 (Marostegui) @jcrespo FYI db2250 @FCeratto-WMF can you take care of depooling pc2021 and coordinating db2158? cc @CWilliams-WMF
[07:11:29] Traffic, Data-Engineering, Patch-For-Review: Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#11958008 (elukey) Trying to summarize the work to do by layers: haproxy: here https://gitlab.wikimedia.org/repos/sre/haproxykafka/-/merge_requests/106 should be e...
[07:37:14] Traffic, Data-Engineering, Patch-For-Review: Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#11958059 (JAllemandou) > I think the only rule we'd have to follow afterwards is being sure to add new keys to the struct before we start emitting them into kaf...
[07:40:38] Traffic, Data-Engineering, Patch-For-Review: Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#11958061 (JAllemandou) > haproxy: here https://gitlab.wikimedia.org/repos/sre/haproxykafka/-/merge_requests/106 should be enough, like we do with X-Analytics - the...
[07:55:05] Traffic, Data-Engineering, Patch-For-Review: Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#11958158 (elukey) >>! In T427068#11958061, @JAllemandou wrote: >> haproxy: here https://gitlab.wikimedia.org/repos/sre/haproxykafka/-/merge_requests/106 should be...
[08:17:28] netops, DBA, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11958191 (jcrespo) Thanks for the heads up, @Marostegui db2250 needs no special handling or depooling -other than downtiming-, assuming maintenance happens during the day.
[08:40:01] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357 (ayounsi) NEW p:Triage→Medium
[08:41:00] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11958324 (ayounsi)
[08:41:15] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11958339 (ayounsi)
[08:41:18] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11958338 (ayounsi)
[08:43:07] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11958357 (ayounsi)
[08:43:32] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11958374 (ayounsi)
[08:44:59] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11958376 (jcrespo) db2183 will require stopping mediabackups in advance, to prevent losing metadata. I will take care of that. For db2198, db2199, d...
[08:59:30] FIRING: HAProxyRestarted: HAProxy server restarted on cp3066:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3066&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[08:59:30] FIRING: HAProxyRestarted: HAProxy server restarted on cp3074:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3074&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[08:59:44] ^^ me
[09:00:42] fabfur: Good with me merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1293746 soon-ish?
[09:02:08] know very little about this but lgtm
[09:04:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp3066:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3066&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[09:04:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp3074:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3074&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[09:07:57] Traffic, Patch-For-Review: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825#11958503 (Fabfur)
[09:10:40] Traffic, Liberica, Machine-Learning-Team, Prod-Kubernetes, and 2 others: Migrate ML k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420438#11958512 (elukey) @klausman first code changes out for staging, after applying them we'll be able to see if anything weird pops up. Th...
[09:11:49] Traffic, Liberica, Machine-Learning-Team, Prod-Kubernetes, and 2 others: Migrate ML k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420438#11958517 (klausman) >>! In T420438#11958512, @elukey wrote: > @klausman first code changes out for staging, after applying them we'll...
[09:20:55] Traffic, Lift-Wing, ServiceOps new, ServiceOps-SharedInfra, and 2 others: Host Qwen 3.6-27B as an inference service - https://phabricator.wikimedia.org/T425680#11958528 (Clement_Goubert) I've merged the fix and tested it on the cache servers I hit from the outside, `https://api.wikimedia.org/serv...
[10:08:09] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11958744 (jijiki) @ayounsi `mc2055` and `mc-gp2004` are on A4, and that is by accident. `mc-gp2004` is working as a backup in case `mc2055` or any...
[10:16:58] Traffic, Lift-Wing, ServiceOps new, ServiceOps-SharedInfra, and 2 others: Host Qwen 3.6-27B as an inference service - https://phabricator.wikimedia.org/T425680#11958782 (isarantopoulos) works like a charm! thanks a lot @Clement_Goubert 🎉
[10:24:02] Traffic, Data-Engineering, Patch-For-Review: Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#11958828 (Fabfur) So IIUC in this case we have 2 possibilities: 1. Use a new header/variable (x-provenance) that needs to be logged into haproxy standard log form...
[10:27:26] Traffic, Lift-Wing, ServiceOps new, ServiceOps-SharedInfra, and 2 others: Host Qwen 3.6-27B as an inference service - https://phabricator.wikimedia.org/T425680#11958842 (jijiki) p:Triage→Medium
[10:41:38] Traffic, Lift-Wing, ServiceOps new, ServiceOps-SharedInfra, and 2 others: Host Qwen 3.6-27B as an inference service - https://phabricator.wikimedia.org/T425680#11958881 (Clement_Goubert) Cool! Moving it to Radar on our side, feel free to ping me on task if you need us again on this task.
[10:44:20] netops, Traffic, Infrastructure-Foundations, SRE: Anycast services - depool strategy in terms of BGP routing - https://phabricator.wikimedia.org/T420821#11958890 (ayounsi)
[10:51:30] netops, Traffic, Infrastructure-Foundations: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11958926 (ayounsi) Open→Resolved a:ayounsi I think we can close this task as the new transport circuits will eliminate that routing loop...
[10:52:59] Traffic: Traffic: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421709#11958934 (ayounsi)
[10:53:03] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: lvs1019: reimage to move primary IP from private1-c-eqiad to private1-c7-eqiad vlan - https://phabricator.wikimedia.org/T405632#11958935 (ayounsi)
[10:53:04] netops, Traffic, Infrastructure-Foundations, SRE: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11958936 (ayounsi)
[10:57:54] netops, Infrastructure-Foundations, observability, SRE: Add Icinga check for SRX cluster status - https://phabricator.wikimedia.org/T271298#11958946 (ayounsi) Open→Declined We're not going to add more stuff to Icinga.
[11:06:28] netops, Infrastructure-Foundations: Create public vlans in eqiad and codfw - https://phabricator.wikimedia.org/T422043#11958978 (ayounsi)
[11:06:29] netops, Infrastructure-Foundations, SRE: Change codfw dns hosts BGP peering to top-of-rack switch - https://phabricator.wikimedia.org/T376894#11958977 (ayounsi)
[11:40:06] netops, Infrastructure-Foundations, SRE: GRE Interfaces statistics not being returned by Juniper MX via gnmi - https://phabricator.wikimedia.org/T403936#11959179 (ayounsi) Open→Resolved a:cmooney It's now showing up thanks to {T424683} https://grafana.wikimedia.org/goto/dfnbnedrb28sg...
[11:40:55] netops, Traffic, Infrastructure-Foundations, SRE: Map video and other large files to 'low-priority' network Qos queue - https://phabricator.wikimedia.org/T410133#11959189 (cmooney) Open→Resolved a:cmooney We actually added a mechanism to do this late last year when we had some une...
[11:51:19] netops, Traffic, Infrastructure-Foundations, SRE: Map internet-bound upload traffic to low-priority QoS queue - https://phabricator.wikimedia.org/T415649#11959238 (cmooney) Open→Declined I'm going to close this one. I hadn't fully thought out the way we serve things currently. `uplo...
[11:54:08] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#11959248 (ayounsi)
[12:31:40] Traffic, MediaViewer, Thumbor, Browser-Support-Firefox, and 3 others: 429 too many requests when trying to view .webp image in MediaViewer in Firefox - https://phabricator.wikimedia.org/T418346#11959368 (Robertsky)
[12:35:25] Traffic, MediaViewer, Thumbor, Browser-Support-Firefox, and 3 others: 429 too many requests when trying to view .webp image in MediaViewer in Firefox - https://phabricator.wikimedia.org/T418346#11959384 (Robertsky) This issue happens across all browsers as long as the original image size of the w...
[13:14:04] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393 (Papaul) NEW
[13:29:25] Traffic: Reboot lvs1019 for memory self-healing - https://phabricator.wikimedia.org/T426109#11959581 (ssingh) >>! In T426109#11957496, @BCornwall wrote: > @ssingh The Dell docs mention updating the BIOS: > >> update BIOS to the latest revision that includes many memory Self-healing capabilities and ongoing...
[14:56:07] netops, Traffic, Discovery-Search, Infrastructure-Foundations, and 3 others: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357#11959954 (ssingh) Depool for cp2044 looks good; please ping Traffic if you want us to take care of it.
[15:53:38] Traffic, SRE, Patch-For-Review: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11960227 (BCornwall) Open→In progress
[16:01:39] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11960266 (cmooney) That looks good to me @papaul good stuff. If we use vlan IDs 512/522 I guess the plan would be to change the vlan i...
[16:02:35] Traffic, SRE, Patch-For-Review: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11960280 (BCornwall)
[16:03:30] Traffic, SRE, Patch-For-Review: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11960281 (BCornwall)
[16:09:25] FIRING: SystemdUnitFailed: ipip-multiqueue-optimizer.service on lvs1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:35:47] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11960461 (Papaul) @cmooney yes we will change the VLAN-id and rename the VLAN for rack 0603 during the switch migration. so it will be...
[16:36:38] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11960468 (Papaul)
[16:52:51] Traffic folks, FYI, a toolforge user just told me that "claude recommended toolforge as a workaround" to an api throttle. We already have a policy against using toolforge as a proxy but some users are likely to slip through the gates so please let toolforge folks know if toolforge traffic spikes.
[16:57:48] andrewbogott: noted, please mention the _sec channel too
[16:58:13] ok
[18:12:25] RESOLVED: [2x] SystemdUnitFailed: haproxy.service on cp5024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:12:37] hmm
[18:12:40] yeah
[18:12:57] why was that just the resolved?
[18:13:10] https://sal.toolforge.org/log/a0ieap4BffdvpiTr_EfA
[18:13:16] ohhh
[18:13:22] so reboot and then stale alert
[18:13:26] it's not stale :)
[18:13:32] there's a race condition on the puppet run right after reboot
[18:13:50] yes for the TLS keys, but I meant stale as in why it didn't fire (because it was downtimed at that time)
[18:14:25] I was worried if this was related to the other issue
[18:14:27] but thankfully not
[18:14:56] ahh ok
[18:15:24] > ERR: certificate does not exist
[18:15:47] so all good. that's ExecStartPre=/usr/local/sbin/tls-check /etc/haproxy-tls-check.cfg
[18:15:50] failing
[18:19:19] oh. haproxy actually can't start until puppet has run?
[18:19:27] after a reboot?
[18:19:47] essentially yes, not until we have fetched the TLS keys and put them in the tmpfs
[18:19:57] since they are no longer on disk
[18:19:59] right
[18:22:52] I guess you could do something like `systemctl disable haproxy` before rebooting, then systemd wouldn't try to restart it in vain before puppet has had a chance to run
[18:23:02] but, eh
[18:24:23] :)
[18:29:12] sukhe: do you have any thoughts on how aggressively (or not) i should roll out the new deb ?
[18:30:23] I was thinking of doing something like all of ulsfo today
[18:31:05] cdanis: I would say at least one day of fermenting on the few hosts? then tomorrow yeah
[18:31:13] ok :)
[18:31:31] for clarity, since it happens under specific conditions, I am just waiting to see what happens under those
[20:22:36] Traffic, SRE, Patch-For-Review: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11961220 (BCornwall)
[20:40:37] Traffic, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11961282 (BCornwall)
[20:41:37] Traffic, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11961284 (BCornwall) In progress→Resolved
[22:10:26] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11961402 (Papaul)
[22:34:14] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11961461 (Papaul)