[00:06:50] netops, Traffic, Infrastructure-Foundations: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11386711 (Papaul) @ayounsi @cmooney on the other QFX5120-48Y in magru we are running version 22.2R3.S3.18 or right now the recommande version for that model is 23.4R2-S5. Do you want...
[00:52:51] netops, Traffic, Infrastructure-Foundations: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11386786 (cmooney) >>! In T390813#11386711, @Papaul wrote: > @ayounsi @cmooney on the other QFX5120-48Y in magru we are running version 22.2R3.S3.18 or right now the recommande versio...
[06:52:08] netops, Traffic, Infrastructure-Foundations: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11386997 (ayounsi) Let's open a different task for magru. drmrs is more urgent as they're end of support (and older). magru is to be done when we have time (lower priority).
[07:58:43] FIRING: [3x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[08:03:43] FIRING: [13x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[08:08:43] FIRING: [13x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[08:13:43] RESOLVED: [13x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[08:28:37] netops, Infrastructure-Foundations, SRE: lsw1-d6-eqiad outage Nov 18 2025 - https://phabricator.wikimedia.org/T410455#11387153 (ayounsi)
[08:29:23] netops, Infrastructure-Foundations, SRE: lsw1-d6-eqiad outage Nov 18 2025 - https://phabricator.wikimedia.org/T410455#11387155 (ayounsi) Thanks for the great writeup. We should unfortunately look at upgrading Netbox first. TBD if we need to spend time on a workaround.
[08:42:03] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11387257 (ayounsi) Lots great thanks ! Not sure how best to show it on the diagram, but we also need to remove the 10G link between cr3 and cr4. Maybe you can...
[09:22:55] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11387375 (MatthewVernon) @BCornwall Pcre2 was first released in 2015. Pcre3 stopped receiving __any__ upstream support (including security fixes) back in 2021, and I filed bugs against all packages depending on the obsolete...
[09:36:51] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11387399 (ayounsi) Nice ! As the IPs are already available, we should change the cr3/cr4/mr1 loopbacks ahead of time, in a different maintenance window, so...
[10:20:21] Traffic, Infrastructure-Foundations, SRE, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11387503 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for ho...
[10:23:03] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11387510 (ayounsi) I might have found something in Redfish for Dell: `lang=python r = spicerack.redfish('sretest2004') dump = r.scp_dump() dump.config['SystemConfiguration']['Comp...
[10:48:19] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11387577 (ayounsi) Looks like it was a false hope, I looked at cirrussearch2115 which is showing the same behavior: ` lsw1-d3-codfw> show lldp neighbors | match xe-0/0/43 xe...
[11:07:18] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11387603 (ayounsi) Haven't dug yet, but maybe an option is to install Broadcom's niccli tool : https://docs.broadcom.com/docs/Linux_Niccli-233.0.198.0 Then disabling it with: ` D...
[11:18:00] FIRING: PurgedHighEventLag: High event process lag with purged on cp1115:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqiad%20prometheus/ops&var-instance=cp1115 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:18:49] claime: probably related to your restart?
[11:18:59] Very probable yes
[11:19:20] Nov 19 11:11:35 cp1115 purged[1467843]: %4|1763550695.258|SESSTMOUT|purged#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 10000 ms without a successful response from the group coordinator (broker 1003, last error was Broker: Not coordinator): revoking assignment and rejoining group
[11:19:22] yeah
[11:19:31] !log restarting purged on cp1115
[11:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:34] vgutierrez: So purged is one of the weird clients that doesn't handle changing brokers well?
[11:21:28] you would expect that librdkafka does what it claims to be doing
[11:21:36] but nope :D
[11:22:06] :D
[11:23:00] RESOLVED: [2x] PurgedHighEventLag: High event process lag with purged on cp1115:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqiad%20prometheus/ops&var-instance=cp1115 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:31:43] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11387687 (SLyngshede-WMF)
[11:55:30] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11387754 (cmooney) Another datapoint here, but the logspam seems worse on some switches: ` A:lsw1-d7-eqiad# show system logging buffer messages | grep -c "remote peer updated on i...
[13:24:02] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11388046 (MoritzMuehlenhoff) >>! In T401832#11386096, @BCornwall wrote: > @MatthewVernon Looks like you were a maintainer of pcre3 in Debian before it was axed in Trixie. Sadly, we're in need of that package for trafficserve...
[14:05:57] Traffic, Infrastructure-Foundations, SRE, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11388212 (MoritzMuehlenhoff) @ssingh The hcaptcha-proxy VMs in magru are up and running
[14:07:01] Traffic, Infrastructure-Foundations, SRE, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11388216 (ssingh) Oh wow, thanks @MoritzMuehlenhoff! But what was the issue for my understanding?
[14:10:37] Traffic, Infrastructure-Foundations, SRE, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11388223 (MoritzMuehlenhoff) >>! In T409860#11388216, @ssingh wrote: > Oh wow, thanks @MoritzMu...
[14:14:35] netops, Traffic, Infrastructure-Foundations: POPs LVS : remove public vlan trunking - https://phabricator.wikimedia.org/T367732#11388231 (cmooney) To confirm all that remains to be done is have someone on-site remove this cable: https://netbox.wikimedia.org/dcim/interfaces/27216/trace/ (assuming it...
[14:17:37] Traffic, Infrastructure-Foundations, SRE, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11388240 (ssingh) >>! In T409860#11388223, @MoritzMuehlenhoff wrote: >>>! In T409860#11388216,...
[14:31:31] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11388318 (cmooney) Resolved→Open a:cmooney→None Hi. Seems I made an error here as not all the work is complete on site. We still ne...
[14:34:23] netops, Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11388333 (cmooney) Open→Resolved >>! In T410047#11374122, @cmooney wrote: > Actually I discussed with @Papaul in relation to...
[14:36:47] Traffic, DC-Ops, ops-eqiad, SRE: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11388344 (Jclark-ctr) Open→Resolved a:RobH→Jclark-ctr All Servers for Traffic have been migrated to new nokia switches
[15:27:29] Traffic, Essential-Work, Experimentation Lab (Experiment Platform Sprint 15), Patch-For-Review: Test the impact of incremental increase in traffic for cache splitting experiments - https://phabricator.wikimedia.org/T407570#11388592 (JVanderhoop-WMF) This setup looks good to me @Sfaci! A good ques...
[15:27:34] netops, Traffic, Infrastructure-Foundations: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11388594 (SLyngshede-WMF) Depooling at 15:30 UTC ` % ssh cumin1003.eqiad.wmnet $ cookbook sre.dns.admin depool drmrs `
[15:28:30] netops, Traffic, Infrastructure-Foundations: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11388597 (SLyngshede-WMF) Pre-check ` $ sudo cookbook sre.dns.admin show ==> CURRENT STATE: text-addrs: pooled at all sites text-next: pooled at all sites upload-addrs: pooled at al...
[15:33:49] netops, Traffic, Infrastructure-Foundations: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11388705 (SLyngshede-WMF) ` $ sudo cookbook sre.dns.admin -t T390813 depool drmrs ==> CURRENT STATE: text-addrs: pooled at all sites text-next: pooled at all sites upload-addrs: poole...
[15:34:44] netops, Traffic, Infrastructure-Foundations: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11388728 (SLyngshede-WMF) ` $ sudo cookbook sre.dns.admin show ==> CURRENT STATE: text-addrs: depooled in drmrs text-next: depooled in drmrs upload-addrs: depooled in drmrs ncredir-ad...
[15:35:56] netops, Traffic, Infrastructure-Foundations: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11388744 (SLyngshede-WMF) DNS traffic, for those following at home: https://grafana.wikimedia.org/goto/gdcIGJmDR?orgId=1
[15:37:23] netops, Traffic, Infrastructure-Foundations: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11388756 (SLyngshede-WMF) @cmooney / @Papaul traffic is moving.
[15:40:54] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11388764 (Papaul) @ayounsi Please see below the steps to disable LLDP in the BIOS for Dell servers. - once in the BIOS go to "Device Settings" -pick the first NIC if it is 1G or...
[15:52:48] netops, Traffic, Infrastructure-Foundations: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11388817 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=03158056-58f0-40e9-8ef7-4dd2bc33743a) set by pt1979@cumin2002 for 1:00:00 on 5 host(s) and their services wi...
[16:07:07] netops, Traffic, Infrastructure-Foundations: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11388863 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1b28f61a-0c57-409c-a53a-429cb2d44ddb) set by pt1979@cumin2002 for 1:00:00 on 8 host(s) and their services wi...
[16:20:40] FIRING: [3x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp6011:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[16:20:43] FIRING: [3x] HaproxyKafkaExporterDown: HaproxyKafka on cp6009 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[16:20:43] FIRING: [3x] HaproxyKafkaExporterDown: HaproxyKafka on cp6001 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[16:21:45] yeah
[16:46:16] FIRING: SLOMetricAbsent: trafficserver-combined - https://slo.wikimedia.org/?search=trafficserver-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[16:46:48] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11388988 (Papaul)
[16:51:16] RESOLVED: [2x] SLOMetricAbsent: haproxy-combined - https://slo.wikimedia.org/?search=haproxy-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[16:53:03] netops, Infrastructure-Foundations, Sustainability (Incident Followup): Optimise WMF WAN Network Configuration - https://phabricator.wikimedia.org/T297355#11389011 (cmooney)
[16:53:03] netops, Infrastructure-Foundations, SRE: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841#11389010 (cmooney)
[17:03:07] netops, Traffic, Infrastructure-Foundations: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11389031 (Papaul) a:Papaul→cmooney Both switches in drmrs are now running Junos: 23.4R2-S5.8. @cmooney i am sending the task to you since you wanted to do the cloud switches.
[17:03:13] RESOLVED: [3x] HaproxyKafkaExporterDown: HaproxyKafka on cp6001 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[17:03:13] RESOLVED: [3x] HaproxyKafkaExporterDown: HaproxyKafka on cp6009 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[17:03:25] RESOLVED: [5x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp6009:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[17:06:58] Traffic: Clean up purged release branches - https://phabricator.wikimedia.org/T410530 (BCornwall) NEW
[17:22:39] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11389081 (Papaul) I think a am wrong on the public vlan for rack 22. We will not be re-imaging the servers in that rack with public vlan just changing the ne...
[17:24:07] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11389084 (Papaul) @ayounsi for the feed back i will work on it
[17:59:52] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11389226 (ayounsi) Thanks, looks like I missed it in my first look but it seems doable through Redfish on Dell : ` >>> dump.set('NIC.Integrated.1-2-1', 'Broadcom_LLDPNearestBridge...
[18:32:06] Traffic, Commons: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T410201#11389363 (ssingh) Hi @RoyZuo: we have tried to debug this on the CDN side and can't seem to find anything there that can point us to the problem. Can you upload any file at all, or is it simply this file, which...
[18:40:21] netops, Infrastructure-Foundations, SRE: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690#11389442 (cmooney) Open→Resolved a:cmooney Ok this is now done across the whole estate, eqiad and...
[18:53:55] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11389515 (BCornwall)
[18:59:28] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11389539 (BCornwall) >>! In T401832#11388046, @MoritzMuehlenhoff wrote: > I created an update based on the last version in Debian unstable before it got removed from the archive and fixed it to build on trixie (the more rece...
[19:27:02] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11389651 (RobH) Day 7 Update: * 33 hosts moved today, 44 remain * all row c wikikube migrated, some of row D wikikube migrated ** 23 wikikube hosts remain o...
[23:51:59] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11390515 (BCornwall)