[09:37:10] netbox, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11369925 (cmooney) @papaul I'm really getting sick of Juniper on this one. Personally I suspect the input voltage/frequency (i.e. our feed...
[10:45:41] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[11:45:41] RESOLVED: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[13:58:26] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: lsw1-d6-eqiad reboot failed, stuck in UEFI shell - https://phabricator.wikimedia.org/T409731#11370745 (Jclark-ctr) Swapped lswtest on Tuesday with the failed switch in D6, cabled it, and handed it over to Cathal for setup. Today, re...
[14:00:59] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: lsw1-d6-eqiad reboot failed, stuck in UEFI shell - https://phabricator.wikimedia.org/T409731#11370748 (cmooney) >>! In T409731#11370745, @Jclark-ctr wrote: > Swapped lswtest on Tuesday with the failed switch in D6, cabled it, and han...
[14:04:48] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: lsw1-d6-eqiad reboot failed, stuck in UEFI shell - https://phabricator.wikimedia.org/T409731#11370749 (Jclark-ctr) Open→Resolved a:cmooney→Jclark-ctr
[14:15:38] moritzm: in ulsfo, I am getting a resource allocation could not be fulfilled while doing makevm
[14:16:05] but I see nothing on the master node (ganeti4008). how do I debug this further?
[14:16:16] (sudo grep failure /var/log/ganeti/commands.log on 4008 is empty)
[14:16:35] `sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 2 --disk 20 --network public --os trixie -t T409860 --cluster ulsfo hcaptcha-proxy4002`
[14:16:36] T409860: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860
[14:16:58] we might have run out of public IPs for ulsfo? will check in a few
[14:17:09] ah interesting
[14:17:14] no worries I can check that
[14:19:35] https://netbox.wikimedia.org/ipam/prefixes/12/ seems fine hmm
[14:20:36] netops, Infrastructure-Foundations, SRE: Arelion 100G transport cr1-eqiad:et-1/1/2 <-> cr1-codfw:et-1/0/2 flapping on eqiad side [Oct 2025] - https://phabricator.wikimedia.org/T407578#11370815 (cmooney) Open→Resolved So this has bounced a few times since, however it is relatively stable....
[14:20:42] https://netbox.wikimedia.org/ipam/prefixes/13/ is full
[14:23:11] yeah but that's the /28 though
[14:30:40] I'm not familiar with the finer details of the accounting, but it seems the "Utilization: 100%" makes it fail
[14:31:06] ^ topranks: could you have a look when you have a moment?
[14:31:54] * topranks looking
[14:32:12] yeah it's full
[14:32:33] write me a big cheque - preferably novelty-size lottery winner style - and I will go buy more IPv4s for you
[14:34:39] so we are lucky here - I'd actually already earmarked that /28 to increase to /27 and the contiguous range is free so this can be done fairly easily
[14:34:44] https://phabricator.wikimedia.org/T408892#11330727
[14:35:21] I can change it in Netbox and on the routers, that bit is easy, and won't disrupt existing traffic
[14:35:54] however we need to change the configured netmask for all the existing hosts on the vlan before we add new hosts/VMs in the upper part of the expanded subnet
[14:36:29] otherwise the new and old devices on the vlan may not be able to communicate
[14:36:50] specifically anything that is allocated .15 or .16
[14:58:29] elukey: do you know if there are any test hosts available to test efi boot partition redundancy?
[15:01:03] jhathaway: o/ I think the best place may be the sretest hosts, the ms-be ones have been reimaged by Matthew IIRC
[15:01:51] cool, I couldn't recall if they had soft raid on any of those
[15:02:01] sukhe, topranks: let's maybe create a dedicated task for the above? given it will need config changes for ~ 10 hosts
[15:17:36] moritzm: yeah sounds like a plan leave it with me
[15:20:05] ok!
[15:21:34] topranks: ok! question though, that /24 doesn't have capacity? we allocate the DNS hosts to it
[15:21:37] I am not sure how that bit works
[15:21:47] https://netbox.wikimedia.org/ipam/prefixes/12/ shows 214 available IPs
[15:21:54] https://netbox.wikimedia.org/ipam/prefixes/12/ip-addresses/
[15:27:22] netops, Infrastructure-Foundations, SRE, Traffic: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047 (cmooney) NEW p:Triage→Medium
[15:27:48] sukhe: I've detailed the best way forward in the task I just made
[15:27:54] let me know if anything is unclear
[15:28:00] ok thanks, will read!
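The netmask concern raised at 14:35-14:36 can be sketched with Python's standard `ipaddress` module. This is only an illustration: the prefix `192.0.2.0/28` is a documentation range chosen here as an assumption, not the real public1-ulsfo subnet, and `usable_by_old_hosts` is a hypothetical helper name. It assumes the existing /28 is the lower half of the new /27, as the "upper part of the expanded subnet" remark suggests.

```python
import ipaddress

# Hypothetical documentation prefix, NOT the real public1-ulsfo range.
old_net = ipaddress.ip_network("192.0.2.0/28")
# Widen the /28 to the /27 that contains it.
new_net = old_net.supernet(new_prefix=27)


def usable_by_old_hosts(addr):
    """True if a host still configured with the old /28 mask would treat
    addr as an ordinary on-link host address."""
    ip = ipaddress.ip_address(addr)
    reserved = (old_net.network_address, old_net.broadcast_address)
    return ip in old_net and ip not in reserved


# Host addresses that only become assignable once the mask is /27.
newly_usable = [ip for ip in new_net.hosts() if not usable_by_old_hosts(ip)]

print(f"{old_net} -> {new_net}")
print("first newly usable addresses:", newly_usable[0], newly_usable[1])
```

Under these assumptions, a host still on the /28 mask sees the first newly usable address as its broadcast address and everything above it as off-link, which is one way to read why existing hosts need the new netmask before anything is allocated in the upper half.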
[15:28:19] > bring the subdivision of the public /24 there match what we have at other POPs, where we have a public /27 for each rack.
[15:28:38] hmm
[15:32:32] netops, Infrastructure-Foundations, SRE, Traffic: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11371150 (cmooney)
[15:32:58] netops, Infrastructure-Foundations, SRE, Traffic: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11371154 (cmooney)
[15:34:10] netops, Infrastructure-Foundations, SRE, Traffic: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11371173 (cmooney)
[15:40:15] netops, Infrastructure-Foundations, SRE, Traffic: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11371233 (Reedy)
[16:01:43] netbox, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11371319 (Papaul) After swapping both PEM 2 and 3 ` re0.cr1-codfw> show chassis environment pem PEM 0 status: State...
[16:05:00] topranks: thanks for the explanation, I now get the bit "Once all existing hosts have had this done we can safely add new hosts to the vlan, which will start using the free IPs in the upper half of the extended range."
[16:21:59] I am actually thinking of decommissioning one of the wikidough hosts in ulsfo and then using that IP
[16:23:07] the hcaptcha proxies did need public IPs?
[16:24:06] mutante: yeah, they need to be on the public VLAN to make outbound queries
[16:24:36] ack
[16:39:19] sukhe: the LVS hosts also don't need to be on there afaik
[16:40:38] topranks: yeah what's up with those?
[16:41:40] probably need to ask the traffic team about it :P
[16:41:46] haha
[16:42:39] so basically someone at some point removed the config (in puppet) for the vlan sub-interfaces for those hosts
[16:43:21] but they did not work with dc-ops to remove the cabling for the now unused ports, or do anything to clean up the IP allocations in Netbox
[16:43:54] can't even get the changelog
[16:43:55] Created 2020-06-19 00:00 · Updated 2023-04-05 17:22
[16:44:03] this seems to be around the time we set these up most likely
[16:48:16] FIRING: ProbeDown: Service idp1005:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:53:15] FIRING: [2x] ProbeDown: Service idp1005:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:58:16] RESOLVED: [2x] ProbeDown: Service idp1005:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:11:43] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11371762 (RobH) Day 5 Update: * Moved all remaining ganeti hosts today * 17 hosts moved today, 108 hosts remain. * All remaining hosts are either k8 hosts (i...
[17:41:37] CAS-SSO, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: sso failure in codfw1dev (labtesthorizon.wikimedia.org) - https://phabricator.wikimedia.org/T409328#11371866 (Andrew) andrewbogott> Andrew Bogott moritzm: do you still aspire to look at https://phabricator.wikimedia.org/T409328...
[17:45:04] CAS-SSO, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: sso failure in codfw1dev (labtesthorizon.wikimedia.org) - https://phabricator.wikimedia.org/T409328#11371888 (bd808) https://cloudidp-dev.wikimedia.org/oidc/.well-known is serving an infinite redirect loop to itself with a `serv...
[18:09:41] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[18:17:24] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: Netbox Cable report - incorrectly parsing Nokia power supplies - https://phabricator.wikimedia.org/T410073 (RobH) NEW
[18:18:11] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: Netbox Cable report - incorrectly parsing Nokia power supplies - https://phabricator.wikimedia.org/T410073#11372087 (RobH)
[18:25:41] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[19:14:41] RESOLVED: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[19:19:40] CAS-SSO, cloud-services-team, Striker, Patch-For-Review: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554#11372282 (Arendpieter) @taavi do I need to do something else for https://gerrit.wikimedia.org/r/c/labs/striker/+/1189915 ?
[19:30:41] RESOLVED: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[22:39:29] CAS-SSO, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: sso failure in codfw1dev (labtesthorizon.wikimedia.org) - https://phabricator.wikimedia.org/T409328#11372894 (Andrew) @taavi this is one of the codfw1dev issues that has me blocked. I've spent a while messing with the envoy conf...