[00:16:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:16:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:23:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[09:21:15] FIRING: [4x] ProbeDown: Service idp1005:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:35:03] CAS-SSO, Infrastructure-Foundations: Redirect loop on idp.wikimedia.org (trying to log into Turnilo) - https://phabricator.wikimedia.org/T410249 (daniel) NEW
[09:51:16] RESOLVED: [4x] ProbeDown: Service idp1005:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:23:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[11:53:12] netops, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11378556 (ayounsi) > Once the router change is done, therefore, we need to somehow adjust the netmask on all the existing hosts on the v...
[12:01:18] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11378569 (ayounsi) a:Papaul @papaul is that something you could look into ? Is there a way to disable the NIC's LLDP through the BIOS menu ? Maybe some solution from the la...
[12:54:49] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11378749 (ayounsi) > I personally prefer to use the first (ok second) address in each v6 subnet as the gateway, i.e. 2a02:ec80:400:1::1/64 Sounds good to me....
[13:01:02] netops, Infrastructure-Foundations, SRE, Traffic: Transport link saturation not alerting - https://phabricator.wikimedia.org/T409330#11378763 (ayounsi) a:ayounsi My bad ! I turned them off after adding the transit/peering saturation alerts. Forgetting transport and core links.... I'll take ca...
[13:14:08] CAS-SSO, Infrastructure-Foundations: Redirect loop on idp.wikimedia.org (trying to log into Turnilo) - https://phabricator.wikimedia.org/T410249#11378799 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This was caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1205208 a...
[14:37:19] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11379093 (Papaul) @ayounsi yes I can look into it. Thanks.
[14:45:56] netops, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11379108 (ssingh) >>! In T410047#11374122, @cmooney wrote: > @ssingh I made a patch and can kick off the changes in Netbox and on the ro...
[14:49:25] netops, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11379134 (ayounsi) You can use 198.35.26.5/28. It's marked as reserved for infra, but we don't need it (and we will even less need it af...
[14:50:49] netops, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11379141 (cmooney) >>! In T410047#11379108, @ssingh wrote: > My plan for now to unblock the hCaptcha work was to decommission one of the...
[14:52:31] netops, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11379156 (ssingh) >>! In T410047#11379134, @ayounsi wrote: > You can use 198.35.26.5/28. It's marked as reserved for infra, but we don't...
[14:53:09] netops, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11379157 (ssingh) >>! In T410047#11379141, @cmooney wrote: >>>! In T410047#11379108, @ssingh wrote: >> My plan for now to unblock the hC...
[14:56:36] netops, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11379165 (cmooney) >>! In T410047#11379157, @ssingh wrote: > Yeah, good point about the LVS IPs since we no longer need them given Liber...
[15:10:08] netops, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11379212 (ssingh) >>! In T410047#11379165, @cmooney wrote: >>>! In T410047#11379157, @ssingh wrote: >> Yeah, good point about the LVS IP...
[15:23:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[15:33:45] moritzm: hi! how do I debug a Ganeti VM that isn't coming back up during makevm?
[15:33:48] > Host rebooted via gnt-instance
[15:33:49] this is 3002
[15:35:04] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Netbox Cable report - incorrectly parsing Nokia power supplies - https://phabricator.wikimedia.org/T410073#11379334 (LSobanski) p:Triage→Medium a:ayounsi
[15:36:06] in a meeting, will check in a bit
[15:36:52] thanks! I tried to connect to the console but that doesn't work
[15:37:04] happy to RTFM and do some debugging, just don't know how
[15:37:22] `sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 2 --disk 20 --network public --os trixie -t T409860 --cluster esams03 --group B hcaptcha-proxy3002`
[15:37:23] T409860: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860
[15:41:33] SRE-tools, Infrastructure-Foundations: firmware-upgrade cookbook fails after successful upgrade - https://phabricator.wikimedia.org/T331135#11379352 (LSobanski) Open→Declined Considering the age of this task I'm resolving it, please reopen if specific issues occur.
[15:43:44] CFSSL-PKI, Infrastructure-Foundations: cfssl: investigate using post handshake authentication - https://phabricator.wikimedia.org/T332149#11379362 (LSobanski) p:Medium→Low
[15:50:11] SRE-tools, Ganeti, Infrastructure-Foundations: Ganeti: consider --no-wait-for-sync as a default option for instance creation - https://phabricator.wikimedia.org/T335522#11379406 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is being used and works well, resolving
[16:24:34] sukhe: the VM itself is fine, I enabled SPICE to the VM, and the last thing it shows is that by ipxe, the PXE boot for the VM failed
[16:25:32] this might be the issue of the DHCP failing since the VM is on the same node, checking
[16:25:43] thanks (in meeting)
[16:26:11] yeah, that's the issue, I'm moving it, then the reimage will work
[16:26:44] moritzm: that's a very cloud-native workaround, I love it
[16:29:03] it's only cloud-native if it gets broken by an AWS us-east-1 outage:-)
[16:29:32] context is https://phabricator.wikimedia.org/T396864 and we'll have it fixed once the new dnsmasq release is out
[16:50:44] "just kill the {instance,pod}" is absolutely cloud-native :D
[16:58:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:59:44] moritzm: TIL and thanks. I guess that also explains the magru 7002 issue I was seeing.
[17:00:40] > I'm moving it,
[17:01:23] that's gnt-node migrate right?
[17:02:03] gnt-instance migrate, for a single vm
[17:02:14] I don't think it's actually documented on wikitech
[17:02:28] just gnt-node failover, which, does a similar thing but many times over :)
[17:02:47] cdanis: it's here I _think_ https://wikitech.wikimedia.org/wiki/Ganeti#Reboot/Shutdown_for_maintenance_a_node
[17:02:50] sudo gnt-node migrate -f ganeti1004.eqiad.wmnet
[17:02:59] that moves all VMs off of the hardware node
[17:03:03] you can also gnt-instance migrate
[17:03:11] or gnt-instance failover
[17:03:14] which moves just the one VM
[17:03:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:03:44] ok thanks. let me try 3002 and then I will do that for 7002
[17:06:19] yeah I read the above incorrectly, that's for migrating the ganeti node itself. so -instance it is
[17:06:32] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11379965 (fgiunchedi)
[17:08:48] netops, Infrastructure-Foundations, SRE: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690#11379974 (fgiunchedi) Thank you @cmooney ! FYI as per Andrew we really only care about cloudcephosd1035 through c...
[17:23:28] CAS-SSO, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, Patch-For-Review: sso failure in codfw1dev (labtesthorizon.wikimedia.org) - https://phabricator.wikimedia.org/T409328#11380091 (Andrew) I'm leaning towards moving this service to a separate host. Ganeti request is T410294
[18:08:39] indeed, gnt-instance migrate
[18:32:59] saw the update thanks
[19:23:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[21:03:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[21:08:44] !log LDAP - added ankita97531 to group nda - T409894
[21:08:44] mutante: Not expecting to hear !log here
[21:08:44] T409894: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894
[23:07:28] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11381580 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2004.codfw.wmnet with OS trixie
[23:51:40] netops, Infrastructure-Foundations, SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11381762 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2004.codfw.wmnet with OS trixie completed: - sretest2004 (**PASS**)...
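
Note on the Ganeti discussion (16:24 to 18:08): the single-VM and whole-node commands mentioned in the log can be laid out side by side roughly as below. This is a minimal sketch, not a documented procedure. It assumes the commands would be run on the Ganeti master of the relevant cluster, and the host names `vm.example.wmnet` and `node.example.wmnet` are placeholders that do not come from the log. The helpers only print the commands; nothing is executed.

```shell
#!/bin/sh
# Sketch only: these helpers print commands rather than run them.
# "vm.example.wmnet" and "node.example.wmnet" are placeholder names.

# Single VM: move one instance to its other node
# (the approach settled on in the log for 3002 and 7002).
instance_migrate_cmd() {
    printf 'sudo gnt-instance migrate %s\n' "$1"
}

# Single VM: the alternative mentioned in the log.
instance_failover_cmd() {
    printf 'sudo gnt-instance failover %s\n' "$1"
}

# Whole node: move all VMs off a hardware node, as in the
# "sudo gnt-node migrate -f ..." line quoted in the log.
node_migrate_cmd() {
    printf 'sudo gnt-node migrate -f %s\n' "$1"
}

instance_migrate_cmd vm.example.wmnet
instance_failover_cmd vm.example.wmnet
node_migrate_cmd node.example.wmnet
```

Printing rather than executing keeps the sketch safe to paste anywhere; the printed line can be reviewed and then run by hand on the right host.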