[03:14:41] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[06:11:02] netops, Infrastructure-Foundations, SRE: cr2-codfw alarm: FPC 5 power is unstable - https://phabricator.wikimedia.org/T416691#11594857 (ayounsi) Actually, that linecard is 13 years old and out of warranty, we keep it there just in case we needed it, but if it shows signs of failing we should just rem...
[06:31:29] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11594896 (ayounsi) a:cmooney→VRiley-WMF
[06:42:46] netops, Infrastructure-Foundations, SRE: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769#11594905 (ayounsi) Open→Resolved All good through https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1237509
[06:44:14] netops, DBA, Infrastructure-Foundations, observability, Patch-For-Review: librenms.syslog table is 800GB - https://phabricator.wikimedia.org/T415270#11594907 (Marostegui) I saw the above change being merged, can I go ahead and truncate this table again so it is left empty? ` root@db1213:/...
[06:50:47] netops, DBA, Infrastructure-Foundations, observability, Patch-For-Review: librenms.syslog table is 800GB - https://phabricator.wikimedia.org/T415270#11594908 (Marostegui) Resolved→Open
[07:14:41] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[08:37:25] netops, DBA, Infrastructure-Foundations, observability: librenms.syslog table is 800GB - https://phabricator.wikimedia.org/T415270#11595045 (ayounsi) Yep you're good to go!
[09:32:02] netops, Infrastructure-Foundations, SRE: cr2-codfw alarm: FPC 5 power is unstable - https://phabricator.wikimedia.org/T416691#11595302 (ayounsi) I also see that we have 2 old SCBE2-MX in the router: https://netbox.wikimedia.org/dcim/devices/1271/inventory/ (from 2014), as we've replaced them with SCB...
[09:47:25] FIRING: SystemdUnitFailed: netbox_ganeti_codfw_test_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:02:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_codfw_test_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:10:07] netops, Infrastructure-Foundations: access request - read-only access to pfw's for Avishua Stein (astein) - https://phabricator.wikimedia.org/T413826#11595644 (ayounsi) Open→Resolved All good.
[10:11:26] netops, Infrastructure-Foundations, SRE: Offline script - adjust to work with fundraising - https://phabricator.wikimedia.org/T414321#11595653 (ayounsi) Open→Invalid Please reopen when anyone has more data for that one.
[10:28:56] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: cr2-codfw alarm: FPC 5 power is unstable - https://phabricator.wikimedia.org/T416691#11595732 (ayounsi) a:cmooney→None >>! In T416691#11594857, @ayounsi wrote: > Actually, that linecard is 13 years old and out of warranty, we...
[10:47:47] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11595973 (cmooney) Resolved→Open Well of course this has occurred again as soon as I made the decision to close. @ayounsi hit it today on //lsw1-d7-eqiad// trying to reimage au...
[10:51:22] netops, DBA, Infrastructure-Foundations, observability: librenms.syslog table is 800GB - https://phabricator.wikimedia.org/T415270#11596000 (Marostegui) Open→Resolved Done! Thanks
[11:13:40] Morning folks. I have an interesting ticket here and I'd be grateful for your input: T416864
[11:13:40] T416864: Hive-metastore failure - seemingly related to clock skew - https://phabricator.wikimedia.org/T416864
[11:14:42] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[11:15:14] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11596132 (ayounsi) Ticket 05430684 created with Nokia
[11:16:36] We're running `systemd-timesyncd` everywhere, aren't we? We got an error from the hive-metastore that suggests that we have more than 5 minutes of clock skew between an-coord1003 and krb1002. I'm highly sceptical of that, but I thought you might like to know about it.
[11:18:13] the clocks on an-coord1003 and krb1002 are in sync, so that looks like a bug in hive-metastore
[11:18:51] we also have indirect alerting if timesyncd has issues, so I doubt this is a result of past time sync issues either
[11:27:34] Thanks, Moritz. I'll let you know if I find anything else.
[11:31:09] netops, Infrastructure-Foundations, Toolforge, tools-infrastructure-team: Plan networking for Toolforge-on-Metal experiment - https://phabricator.wikimedia.org/T407140#11596159 (fgiunchedi) Following up from Lisbon: there are essentially two options available wrt network implementation, @cmooney...
[11:35:49] netops, Infrastructure-Foundations, SRE, Traffic: Users reporting issues connecting to Gerrit with HTTPS from Orange, FR mobile network (AS 3215) - https://phabricator.wikimedia.org/T411203#11596183 (ayounsi) Open→Declined Not actionable on our side.
[11:38:52] netops, Infrastructure-Foundations, SRE: Update esams network pop diagrams - https://phabricator.wikimedia.org/T368084#11596195 (cmooney) @papaul is this something you might have already drawn out in EVE-NG?
[11:39:30] netops, Infrastructure-Foundations, SRE: Update esams network pop diagrams - https://phabricator.wikimedia.org/T368084#11596197 (ayounsi) a:cmooney→Papaul Hey @Papaul would you be interested in working on that?
[13:19:49] netops, Infrastructure-Foundations, SRE: Eqiad: move row-wide vlan gateways to Nokia switches - https://phabricator.wikimedia.org/T416872 (cmooney) NEW p:Triage→Medium
[13:20:05] netops, Infrastructure-Foundations, SRE: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11596553 (cmooney)
[13:20:07] netops, Infrastructure-Foundations, SRE: Eqiad: move row-wide vlan gateways to Nokia switches - https://phabricator.wikimedia.org/T416872#11596554 (cmooney)
[13:20:21] netops, Infrastructure-Foundations, SRE, Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11596556 (cmooney)
[13:20:27] netops, Infrastructure-Foundations, SRE: Eqiad: move row-wide vlan gateways to Nokia switches - https://phabricator.wikimedia.org/T416872#11596557 (cmooney)
[13:21:46] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11596563 (cmooney)
[13:22:11] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11596568 (cmooney) I've removed parent task T409286 to track this independently but commenting for the record.
[13:23:33] netops, Infrastructure-Foundations, SRE: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11596571 (cmooney)
[13:23:43] netops, Infrastructure-Foundations, SRE: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11596574 (cmooney)
[13:23:59] netops, Infrastructure-Foundations, SRE: Eqiad C/D refresh: move asw2-d-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T409067#11596577 (cmooney)
[13:24:07] netops, Infrastructure-Foundations, SRE: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11596578 (cmooney)
[13:24:19] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11596580 (cmooney)
[13:24:27] netops, Infrastructure-Foundations, SRE, Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11596581 (cmooney)
[13:24:52] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11596587 (cmooney)
[13:25:00] netops, Infrastructure-Foundations, SRE, Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11596588 (cmooney)
[13:25:18] netops, Infrastructure-Foundations, SRE, Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11596590 (cmooney)
[13:25:28] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: lvs1018: remove cross-rack links to rows A, C and D - https://phabricator.wikimedia.org/T411781#11596592 (cmooney)
[13:25:46] netops, Infrastructure-Foundations, SRE, Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11596593 (cmooney)
[15:14:41] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[15:37:47] netops, Infrastructure-Foundations, Traffic: 2026 Junos upgrade - https://phabricator.wikimedia.org/T416444#11597105 (ayounsi) p:Triage→Low
[15:37:53] netops, Infrastructure-Foundations: drmrs: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416441#11597107 (ayounsi) p:Triage→Medium
[15:38:45] netops, Infrastructure-Foundations, Observability-Metrics: gNMIc: investigate new "collector" command - https://phabricator.wikimedia.org/T416360#11597112 (ayounsi) p:Triage→Low
[15:39:28] homer, SRE-tools, Infrastructure-Foundations: Decom cookbook: run Homer when needed - https://phabricator.wikimedia.org/T416313#11597114 (ayounsi) p:Triage→Low
[15:39:41] netops, Infrastructure-Foundations: magru: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416442#11597115 (ayounsi) p:Triage→Medium
[15:39:48] netops, Infrastructure-Foundations: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450#11597116 (ayounsi) p:Triage→Medium
[15:40:30] netops, Cloud-VPS, Infrastructure-Foundations, tools-infrastructure-team: codfw: Upgrade cloudsw1-b1-codfw (2026) - https://phabricator.wikimedia.org/T416443#11597118 (ayounsi) p:Triage→Medium
[15:40:35] netops, Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11597119 (ayounsi) p:Triage→Medium
[15:40:39] netops, Infrastructure-Foundations: eqsin: upgrade routers (2026) - https://phabricator.wikimedia.org/T416563#11597120 (ayounsi) p:Triage→Medium
[15:42:21] netops, DC-Ops, ops-esams, SRE, and 2 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11597140 (LSobanski)
[15:46:38] netops, Infrastructure-Foundations, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#11597180 (MLechvien-WMF)
[15:48:07] netops, DC-Ops, ops-esams, SRE, and 2 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11597191 (ssingh) Open→Resolved a:ssingh Marking this as resolved since the failure was transient and has since resolved. The monitoring not being fire...
[15:53:14] netops, Infrastructure-Foundations, SRE: Update esams network pop diagrams - https://phabricator.wikimedia.org/T368084#11597211 (Papaul) Yes, I can take it. Thanks.
[16:17:42] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2026.01.23 - 2026.02.13), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11597410 (cmooney) p:Triage→Medium >>! In T414460#11582569, @Gehel wrote: > With the various...
[16:23:57] netops, DC-Ops, ops-esams, SRE, and 2 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11597481 (cmooney) >>! In T415473#11552172, @ssingh wrote: > We had a transient link failure between eqiad and esams that resulted in this issue. Correct this cor...
[19:14:41] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[21:29:47] Puppet, Gerrit: Puppet should restart Gerrit when changing it's replication config - https://phabricator.wikimedia.org/T416929 (bd808) NEW
[21:57:14] Puppet, Gerrit: Puppet should restart Gerrit when changing it's replication config - https://phabricator.wikimedia.org/T416929#11599302 (bd808) Now that I understand what I'm reading, https://wikitech.wikimedia.org/wiki/Gerrit/Administration#Troubleshooting documents this problem as well: > Any changes...
[21:58:32] Puppet, Gerrit: Puppet should restart Gerrit when changing it's replication config - https://phabricator.wikimedia.org/T416929#11599305 (Reedy)
[22:05:14] Puppet, Gerrit, Patch-For-Review: Puppet should restart Gerrit when changing it's replication config - https://phabricator.wikimedia.org/T416929#11599335 (hashar) `git log -GautoReload modules/gerrit` has the history: That got disabled by Paladox in October 2019 referencing "a bug". I went to enabl...
[22:07:50] Puppet, Gerrit, Patch-For-Review: Puppet should restart Gerrit when changing it's replication config - https://phabricator.wikimedia.org/T416929#11599347 (Dzahn) > Or we should have some other mechanism to ensure that Gerrit is restarted following a replication config change. Or we should do restart...
[23:14:42] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag