[03:43:51] FIRING: ErrorBudgetBurn: varnish-combined esams - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:11:02] netops, Infrastructure-Foundations, SRE: cr2-codfw alarm: FPC 5 power is unstable - https://phabricator.wikimedia.org/T416691#11594857 (ayounsi) Actually, that linecard is 13 years old and out of warranty, we keep it there just in case we needed it, but if it shows signs of failing we should just rem...
[06:31:29] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11594896 (ayounsi) a:cmooney→VRiley-WMF
[06:42:46] netops, Infrastructure-Foundations, SRE: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769#11594905 (ayounsi) Open→Resolved All good through https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1237509
[06:44:14] netops, DBA, Infrastructure-Foundations, observability, Patch-For-Review: librenms.syslog table is 800GB - https://phabricator.wikimedia.org/T415270#11594907 (Marostegui) I saw the above change being merged, can I go ahead and truncate this table again so it is left empty? ` root@db1213:/...
[06:50:47] netops, DBA, Infrastructure-Foundations, observability, Patch-For-Review: librenms.syslog table is 800GB - https://phabricator.wikimedia.org/T415270#11594908 (Marostegui) Resolved→Open
[07:43:51] FIRING: ErrorBudgetBurn: varnish-combined esams - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:37:25] netops, DBA, Infrastructure-Foundations, observability: librenms.syslog table is 800GB - https://phabricator.wikimedia.org/T415270#11595045 (ayounsi) Yep you're good to go!
[09:32:02] netops, Infrastructure-Foundations, SRE: cr2-codfw alarm: FPC 5 power is unstable - https://phabricator.wikimedia.org/T416691#11595302 (ayounsi) I also see that we have 2 old SCBE2-MX in the router: https://netbox.wikimedia.org/dcim/devices/1271/inventory/ (from 2014), as we've replaced them with SCB...
[09:48:54] Traffic, Data-Engineering, Infrastructure-Foundations: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11595502 (elukey) a:elukey
[10:10:07] netops, Infrastructure-Foundations: access request - read-only access to pfw's for Avishua Stein (astein) - https://phabricator.wikimedia.org/T413826#11595644 (ayounsi) Open→Resolved All good.
[10:11:26] netops, Infrastructure-Foundations, SRE: Offline script - adjust to work with fundraising - https://phabricator.wikimedia.org/T414321#11595653 (ayounsi) Open→Invalid Please reopen when anyone has more data for that one.
[10:12:04] Traffic, cloud-services-team, Data-Services, Datasets-General-or-Unknown, Patch-For-Review: Move dumps.wikimedia.org HTTP service behind CDN edge - https://phabricator.wikimedia.org/T306550#11595657 (ayounsi)
[10:28:56] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: cr2-codfw alarm: FPC 5 power is unstable - https://phabricator.wikimedia.org/T416691#11595732 (ayounsi) a:cmooney→None >>! In T416691#11594857, @ayounsi wrote: > Actually, that linecard is 13 years old and out of warranty, we...
[10:47:47] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11595973 (cmooney) Resolved→Open Well of course this has occurred again as soon as I made the decision to close. @ayounsi hit it today on //lsw1-d7-eqiad// trying to reimage au...
[10:51:22] netops, DBA, Infrastructure-Foundations, observability: librenms.syslog table is 800GB - https://phabricator.wikimedia.org/T415270#11596000 (Marostegui) Open→Resolved Done! Thanks
[11:00:43] jayme: can I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1237467?
[11:02:41] vgutierrez: it looks all right to me - but if you have another 5min we could verify that with the PCC run from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1237277
[11:03:11] oops.. too late
[11:03:16] :D
[11:03:17] let's see what PCC says anyways
[11:15:14] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11596132 (ayounsi) Ticket 05430684 created with Nokia
[11:20:36] vgutierrez: looks good IMHO
[11:20:53] basically just setting net.ipv4.conf.all.rp_filter = 0 now
[11:31:09] netops, Infrastructure-Foundations, Toolforge, tools-infrastructure-team: Plan networking for Toolforge-on-Metal experiment - https://phabricator.wikimedia.org/T407140#11596159 (fgiunchedi) Following up from Lisbon: there are essentially two options available wrt network implementation, @cmooney...
[11:34:33] jayme: and ipip0 should be deployed
[11:35:08] yeah, sure. I meant from what PCC says...the interfaces don't show up there
[11:35:49] netops, Traffic, Infrastructure-Foundations, SRE: Users reporting issues connecting to Gerrit with HTTPS from Orange, FR mobile network (AS 3215) - https://phabricator.wikimedia.org/T411203#11596183 (ayounsi) Open→Declined Not actionable on our side.
[11:38:52] netops, Infrastructure-Foundations, SRE: Update esams network pop diagrams - https://phabricator.wikimedia.org/T368084#11596195 (cmooney) @papaul is this something you might have already drawn out in EVE-NG?
[11:39:30] netops, Infrastructure-Foundations, SRE: Update esams network pop diagrams - https://phabricator.wikimedia.org/T368084#11596197 (ayounsi) a:cmooney→Papaul Hey @Papaul would you be interested in working on that ?
[11:43:51] FIRING: ErrorBudgetBurn: varnish-combined esams - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[11:47:43] FIRING: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp7001 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=magru&var-instance=cp7001 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[11:52:43] RESOLVED: [2x] HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp7001 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=magru&var-instance=cp7001 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[13:19:49] netops, Infrastructure-Foundations, SRE: Eqiad: move row-wide vlan gateways to Nokia switches - https://phabricator.wikimedia.org/T416872 (cmooney) NEW p:Triage→Medium
[13:20:05] netops, Infrastructure-Foundations, SRE: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11596553 (cmooney)
[13:20:07] netops, Infrastructure-Foundations, SRE: Eqiad: move row-wide vlan gateways to Nokia switches - https://phabricator.wikimedia.org/T416872#11596554 (cmooney)
[13:20:21] netops, Traffic, Infrastructure-Foundations, SRE: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11596556 (cmooney)
[13:20:27] netops, Infrastructure-Foundations, SRE: Eqiad: move row-wide vlan gateways to Nokia switches - https://phabricator.wikimedia.org/T416872#11596557 (cmooney)
[13:21:46] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11596563 (cmooney)
[13:22:11] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11596568 (cmooney) I've removed parent task T409286 to track this independently but commenting for the record.
[13:23:33] netops, Infrastructure-Foundations, SRE: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11596571 (cmooney)
[13:23:45] netops, Infrastructure-Foundations, SRE: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11596574 (cmooney)
[13:24:01] netops, Infrastructure-Foundations, SRE: Eqiad C/D refresh: move asw2-d-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T409067#11596577 (cmooney)
[13:24:09] netops, Infrastructure-Foundations, SRE: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11596578 (cmooney)
[13:24:21] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11596580 (cmooney)
[13:24:30] netops, Traffic, Infrastructure-Foundations, SRE: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11596581 (cmooney)
[13:24:54] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11596587 (cmooney)
[13:25:02] netops, Traffic, Infrastructure-Foundations, SRE: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11596588 (cmooney)
[13:25:12] Traffic, DC-Ops, ops-eqiad, SRE: lvs1018: decom links to asw2-c2-eqiad and asw2-d7-eqiad - https://phabricator.wikimedia.org/T410661#11596589 (cmooney)
[13:25:20] netops, Traffic, Infrastructure-Foundations, SRE: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11596590 (cmooney)
[13:25:30] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: lvs1018: remove cross-rack links to rows A, C and D - https://phabricator.wikimedia.org/T411781#11596592 (cmooney)
[13:25:46] netops, Traffic, Infrastructure-Foundations, SRE: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11596593 (cmooney)
[14:18:36] FIRING: [2x] ErrorBudgetBurn: varnish-combined esams - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:29:44] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11596826 (ssingh) ` sukhe@cumin1003:~$ sudo cumin "A:bastion" "dig +nsid en.wikipedia.org @2a02:ec80:53::1| grep NSID" 8 hosts will be targeted: bast[1003-1004,2003,3007,4005,5004,6003,7002].wik...
[15:23:58] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11597066 (Ladsgroup) >>! In T414805#11591828, @Ladsgroup wrote: > I think this should fix it: https://gerrit.wikimedia.org/r/c/ope...
[15:29:41] Traffic, Privacy Engineering, SRE: Create and document Wikidough's privacy policy - https://phabricator.wikimedia.org/T275409#11597081 (ssingh) a:ssingh
[15:37:47] netops, Traffic, Infrastructure-Foundations: 2026 Junos upgrade - https://phabricator.wikimedia.org/T416444#11597105 (ayounsi) p:Triage→Low
[15:37:53] netops, Infrastructure-Foundations: drmrs: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416441#11597107 (ayounsi) p:Triage→Medium
[15:38:45] netops, Infrastructure-Foundations, Observability-Metrics: gNMIc: investigate new "collector" command - https://phabricator.wikimedia.org/T416360#11597112 (ayounsi) p:Triage→Low
[15:39:41] netops, Infrastructure-Foundations: magru: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416442#11597115 (ayounsi) p:Triage→Medium
[15:39:48] netops, Infrastructure-Foundations: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450#11597116 (ayounsi) p:Triage→Medium
[15:40:30] netops, Cloud-VPS, Infrastructure-Foundations, tools-infrastructure-team: codfw: Upgrade cloudsw1-b1-codfw (2026) - https://phabricator.wikimedia.org/T416443#11597118 (ayounsi) p:Triage→Medium
[15:40:35] netops, Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11597119 (ayounsi) p:Triage→Medium
[15:40:39] netops, Infrastructure-Foundations: eqsin: upgrade routers (2026) - https://phabricator.wikimedia.org/T416563#11597120 (ayounsi) p:Triage→Medium
[15:42:21] netops, Traffic, DC-Ops, ops-esams, and 2 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11597140 (LSobanski)
[15:46:38] netops, Traffic, Infrastructure-Foundations: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#11597180 (MLechvien-WMF)
[15:48:04] jayme: ipip0 appears on pcc... `Augeas[ipip0_manual]`
[15:48:07] netops, Traffic, DC-Ops, ops-esams, and 2 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11597191 (ssingh) Open→Resolved a:ssingh Marking this as resolved since the failure was transient and has since resolved. The monitoring not being...
[15:49:09] and from the change catalog
[15:49:20] https://www.irccloud.com/pastebin/C2RVMCBv/
[15:53:14] netops, Infrastructure-Foundations, SRE: Update esams network pop diagrams - https://phabricator.wikimedia.org/T368084#11597211 (Papaul) Yes i can take it . thanks
[15:57:47] ah, right. brainfarted and just stared at modified resources
[16:15:42] Traffic, Security-Team, WMF-General-or-Unknown, ContentSecurityPolicy, Patch-For-Review: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618#11597391 (TheDJ) I've been looking a bit at logstash, and I can't see any significant change within the noise that i...
[16:17:42] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2026.01.23 - 2026.02.13), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11597410 (cmooney) p:Triage→Medium >>! In T414460#11582569, @Gehel wrote: > With the various...
[16:23:57] netops, Traffic, DC-Ops, ops-esams, and 2 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11597481 (cmooney) >>! In T415473#11552172, @ssingh wrote: > We had a transient link failure between eqiad and esams that resulted in this issue. Correct this...
[17:06:21] Traffic, SRE, Patch-For-Review: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11597848 (Xqt)
[18:18:51] FIRING: [2x] ErrorBudgetBurn: varnish-combined esams - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[19:13:26] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11598623 (ssingh) `ns1` v6 went live today. We will wait for a bit more data to come in but we can see an uptick in connections to the v6 already. @BBlack suggested that instead of doing unicas...
[19:16:20] bblack: topranks: https://w.wiki/Ho2u
[19:16:30] v6 is taking over v4 for ns1 basically
[19:18:49] coooool
[19:19:05] this is without the registrar-level glue updated in the public view yet, right?
[19:19:41] the real spike starts after (registrar glue updated at 2026-02-09 18:47 UTC)
[19:19:52] and you see the spike ~18:55ish
[19:20:09] interestingly enough:
[19:20:11] https://grafana.wikimedia.org/goto/QZaQLGvDR?orgId=1
[19:20:29] you also see a drop in dns3004 and during that period
[19:20:35] which is simply ns2 v4 anycast
[19:20:47] better view: https://grafana.wikimedia.org/goto/CaguYMvvR?orgId=1
[19:22:20] which perhaps confirms our theory that some recursor is preferring the v6 unicast in codfw, simply because it is v6, over lower latency
[19:34:45] technically it's hard to know, unless we investigate the flows that moved from esams->codfw in netflow
[19:35:07] it could also be the case that IPv4 anycast was poorly-routing some US network to esams, and in the IPv6 world things work better :)
[19:35:44] who knows really, without deep dives on the data
[19:36:12] yeah I definitely have plans to go through wmf_netflow once we have some data, to the extent we can
[19:58:36] FIRING: [2x] ErrorBudgetBurn: varnish-combined esams - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[20:02:34] yeah I noticed that
[20:02:38] we discussed this possibility
[20:03:20] "v6 is taking over" is an extreme way to put it though I'd not describe it that way
[20:04:18] I did notice there is actually more non-US traffic going to the ns1 IPv4 address than there is to the IPv6 ns1 address
[20:04:19] https://w.wiki/Ho3w
[20:04:39] https://w.wiki/Ho3x
[20:05:29] that's the ns2 anycast though
[20:05:32] 198.35.27.27
[20:05:39] oh shit
[20:05:52] haha, yeah I did that comparison cos I didn't expect it based on the other things I saw
[20:05:52] lol
[20:06:59] yeah it makes more sense now: https://w.wiki/Ho3y
[20:10:03] yeah
[20:10:46] thanks for adding the v6 for ns0
[20:10:52] alas, we have another complication :(
[20:11:17] writing it in the task, basically our puppetization does not support adding a v6 without a v4
[20:11:26] which means we can't do a v6 anycast if we have a v4 unicast
[20:11:32] heh
[20:11:34] because we will do a v4 unicast say in eqiad/codfw for v4
[20:11:48] someone wrote some poor puppetization (probably me!)
[20:11:49] but we can't do a v6 anycast in eqiad as well
[20:12:19] bblack: well we started with v4 and added v6 later, so we are making the assumption that we will couple v4 and v6
[20:12:27] we can fix it but yeah, that's certainly more work
[20:13:19] where is the assumption being made?
[20:14:06] surely we can change our puppetization to allow for this?
[20:14:21] yes
[20:14:28] the problem is in Bird anycast profile is it?
[20:14:31] yep
[20:14:32] I was looking at it by hand just now, but I don't see the issue yet
[20:14:34] oh ok
[20:14:48] https://phabricator.wikimedia.org/T81605#11598835 typed it here
[20:14:59] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11598835 (ssingh) There is unfortunately one more problem here. Our Puppetization for `bird` means that we can't set up v6 addresses without v4 ones, in the same configuration. This stems from...
[20:15:16] Traffic, collaboration-services, Gerrit: ATS/Gerrit: validate TLS hosts for gerrit (revert workaround that skips validation) - https://phabricator.wikimedia.org/T411904#11598836 (Dzahn) Stalled→Open Yes, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1215684/6/hieradata/common/profile/t...
[20:16:12] I will need to think more on what to change but I need to touch grass now so will pick this up with a fresher mind :P
[20:16:38] I did try it here https://gerrit.wikimedia.org/r/c/operations/puppet/+/1238012 but it failed so yes, we will need to fix it in some place
[20:16:41] see PCC https://puppet-compiler.wmflabs.org/output/1238012/8006/dns2004.wikimedia.org/change.dns2004.wikimedia.org.err
[20:17:16] fun stuff :)
[20:17:51] Traffic, collaboration-services, SRE, Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11598842 (Dzahn)
[20:31:38] Traffic, ServiceOps new, SRE: Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200#11598920 (MLechvien-WMF) a:Clement_Goubert Hi Clement, would you be able to triage this task? is it still relevant given the plans for API Gateway?
[20:32:06] Traffic, ServiceOps new, ServiceOps-SharedInfra, SRE: Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200#11598924 (MLechvien-WMF)
[20:55:08] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11599012 (cmooney) Yeah that seems fine to me. I've allocated some [[ https://netbox.wikimedia.org/ipam/prefixes/1398/prefixes/ | new /48s ]] from our RIPE /29 range not for these, and created...
[23:58:51] FIRING: ErrorBudgetBurn: varnish-combined esams - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn