[00:06:49] netops, Infrastructure-Foundations, Traffic: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11386711 (Papaul) @ayounsi @cmooney on the other QFX5120-48Y in magru we are running version 22.2R3.S3.18 or right now the recommended version for that model is 23.4R2-S5. Do you want...
[00:52:51] netops, Infrastructure-Foundations, Traffic: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11386786 (cmooney) >>! In T390813#11386711, @Papaul wrote: > @ayounsi @cmooney on the other QFX5120-48Y in magru we are running version 22.2R3.S3.18 or right now the recommended versio...
[06:52:08] netops, Infrastructure-Foundations, Traffic: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11386997 (ayounsi) Let's open a different task for magru. drmrs is more urgent as they're end of support (and older). magru is to be done when we have time (lower priority).
[08:28:36] netbox, Infrastructure-Foundations: Upgrade Netbox to 4.3.x - https://phabricator.wikimedia.org/T371889#11387154 (ayounsi)
[08:28:37] netops, Infrastructure-Foundations, SRE: lsw1-d6-eqiad outage Nov 18 2025 - https://phabricator.wikimedia.org/T410455#11387153 (ayounsi)
[08:29:23] netops, Infrastructure-Foundations, SRE: lsw1-d6-eqiad outage Nov 18 2025 - https://phabricator.wikimedia.org/T410455#11387155 (ayounsi) Thanks for the great writeup. We should unfortunately look at upgrading Netbox first. TBD if we need to spend time on a workaround.
[08:37:31] CAS-SSO, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, Patch-For-Review: sso failure in codfw1dev (labtesthorizon.wikimedia.org) - https://phabricator.wikimedia.org/T409328#11387208 (dcaro) p:Triage→High
[08:42:03] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11387257 (ayounsi) Lots great thanks !
Not sure how best to show it on the diagram, but we also need to remove the 10G link between cr3 and cr4. Maybe you can...
[09:36:51] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11387399 (ayounsi) Nice ! As the IPs are already available, we should change the cr3/cr4/mr1 loopbacks ahead of time, in a different maintenance window, so...
[10:16:59] CAS-SSO, Gerrit, Infrastructure-Foundations: Use IDP for authentication in Gerrit - https://phabricator.wikimedia.org/T147864#11387497 (Tacsipacsi) @hashar What is this stalled on? (I understand T147864#10541736 that it’s not a priority right now, but “stalled” is stronger than “not a priority”. I’d...
[10:23:03] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11387510 (ayounsi) I might have found something in Redfish for Dell: `lang=python r = spicerack.redfish('sretest2004') dump = r.scp_dump() dump.config['SystemConfiguration']['Comp...
[10:48:19] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11387577 (ayounsi) Looks like it was a false hope, I looked at cirrussearch2115 which is showing the same behavior: ` lsw1-d3-codfw> show lldp neighbors | match xe-0/0/43 xe...
[11:07:18] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11387603 (ayounsi) Haven't dug yet, but maybe an option is to install Broadcom's niccli tool : https://docs.broadcom.com/docs/Linux_Niccli-233.0.198.0 Then disabling it with: ` D...
[11:55:30] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11387754 (cmooney) Another datapoint here, but the logspam seems worse on some switches: ` A:lsw1-d7-eqiad# show system logging buffer messages | grep -c "remote peer updated on i...
[14:14:35] netops, Infrastructure-Foundations, Traffic: POPs LVS : remove public vlan trunking - https://phabricator.wikimedia.org/T367732#11388231 (cmooney) To confirm all that remains to be done is have someone on-site remove this cable: https://netbox.wikimedia.org/dcim/interfaces/27216/trace/ (assuming it...
[14:31:31] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11388318 (cmooney) Resolved→Open a:cmooney→None Hi. Seems I made an error here as not all the work is complete on site. We still ne...
[14:34:23] netops, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11388333 (cmooney) Open→Resolved >>! In T410047#11374122, @cmooney wrote: > Actually I discussed with @Papaul in relation to...
[14:41:52] CAS-SSO, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, Patch-For-Review: sso failure in codfw1dev (labtesthorizon.wikimedia.org) - https://phabricator.wikimedia.org/T409328#11388363 (taavi) a:taavi→None
[15:27:34] netops, Infrastructure-Foundations, Traffic: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11388594 (SLyngshede-WMF) Depooling at 15:30 UTC ` % ssh cumin1003.eqiad.wmnet $ cookbook sre.dns.admin depool drmrs `
[15:28:30] netops, Infrastructure-Foundations, Traffic: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11388597 (SLyngshede-WMF) Pre-check ` $ sudo cookbook sre.dns.admin show ==> CURRENT STATE: text-addrs: pooled at all sites text-next: pooled at all sites upload-addrs: pooled at al...
[15:33:47] netops, Infrastructure-Foundations, Traffic: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11388705 (SLyngshede-WMF) ` $ sudo cookbook sre.dns.admin -t T390813 depool drmrs ==> CURRENT STATE: text-addrs: pooled at all sites text-next: pooled at all sites upload-addrs: poole...
[15:34:42] netops, Infrastructure-Foundations, Traffic: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11388728 (SLyngshede-WMF) ` $ sudo cookbook sre.dns.admin show ==> CURRENT STATE: text-addrs: depooled in drmrs text-next: depooled in drmrs upload-addrs: depooled in drmrs ncredir-ad...
[15:35:54] netops, Infrastructure-Foundations, Traffic: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11388744 (SLyngshede-WMF) DNS traffic, for those following at home: https://grafana.wikimedia.org/goto/gdcIGJmDR?orgId=1
[15:37:23] netops, Infrastructure-Foundations, Traffic: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11388756 (SLyngshede-WMF) @cmooney / @Papaul traffic is moving.
[15:40:54] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11388764 (Papaul) @ayounsi Please see below the steps to disable LLDP in the BIOS for Dell servers. - once in the BIOS go to "Device Settings" -pick the first NIC if it is 1G or...
[15:47:06] sukhe: quick question related to the magru depool, once the sre.dns.admin cookbook is run, the only active services on the site are the anycast services (DNS/wikidough, etc), is there a cookbook to depool those? and if so should there be an option to the sre.dns.admin cookbook to drain those as well? (to only have 1 command for depools)
[15:51:59] XioNoX: yeah, it's a good idea. we can get the hosts in the site using the aliases and then stop puppet and bird to effectively do the depool
[15:52:03] I guess an observation there as well, it's been fairly hot now for a while but our transport to esams from eqiad is close to max since the depool in drmrs
[15:52:09] wow ok
[15:52:47] that's interesting because the last time we did it (it's been a while) things were fairly smooth?
[15:52:48] netops, Infrastructure-Foundations, Traffic: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11388817 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=03158056-58f0-40e9-8ef7-4dd2bc33743a) set by pt1979@cumin2002 for 1:00:00 on 5 host(s) and their services wi...
[16:07:06] netops, Infrastructure-Foundations, Traffic: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11388863 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1b28f61a-0c57-409c-a53a-429cb2d44ddb) set by pt1979@cumin2002 for 1:00:00 on 8 host(s) and their services wi...
[16:30:25] FIRING: SystemdUnitFailed: netbox_ganeti_drmrs01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:45:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_drmrs01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:45:55] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_drmrs01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:46:48] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11388988 (Papaul)
[16:53:03] netops, Infrastructure-Foundations, Sustainability (Incident Followup): Optimise WMF WAN Network Configuration - https://phabricator.wikimedia.org/T297355#11389011 (cmooney)
[16:53:03] netops, Infrastructure-Foundations, SRE: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841#11389010 (cmooney)
[17:00:55] RESOLVED: [2x] SystemdUnitFailed: netbox_ganeti_drmrs01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:03:07] netops, Infrastructure-Foundations, Traffic: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11389031 (Papaul) a:Papaul→cmooney Both switches in drmrs are now running Junos: 23.4R2-S5.8. @cmooney i am sending the task to you since you wanted to do the cloud switches.
[17:22:39] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11389081 (Papaul) I think I am wrong on the public vlan for rack 22. We will not be re-imaging the servers in that rack with public vlan just changing the ne...
[17:24:07] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11389084 (Papaul) @ayounsi for the feed back i will work on it
[17:59:52] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11389226 (ayounsi) Thanks, looks like I missed it in my first look but it seems doable through Redfish on Dell : ` >>> dump.set('NIC.Integrated.1-2-1', 'Broadcom_LLDPNearestBridge...
[18:04:08] SRE-tools, Infrastructure-Foundations, serviceops: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537 (RLazarus) NEW p:Triage→Medium
[18:08:31] SRE-tools, Infrastructure-Foundations, serviceops: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11389289 (RLazarus) (I'm not married to the specific CLI syntax in the example. Among other things, making it an --optional-flag means that the positional `host...
[18:40:21] netops, Infrastructure-Foundations, SRE: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690#11389442 (cmooney) Open→Resolved a:cmooney Ok this is now done across the whole estate, eqiad and...
[19:27:02] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11389651 (RobH) Day 7 Update: * 33 hosts moved today, 44 remain * all row c wikikube migrated, some of row D wikikube migrated ** 23 wikikube hosts remain o...
[22:00:04] SRE-tools, Infrastructure-Foundations, serviceops: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11390175 (Volans) Just for context referencing past ideas on the topic: T327300