[03:11:41] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[04:11:41] RESOLVED: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[09:58:15] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11391188 (ayounsi) One liners: `lang=python >>> spicerack.redfish('sretest2004').scp_dump().components['NIC.Integrated.1-1-1'].get('Broadcom_LLDPNearestBridg...
[10:03:09] netops, Infrastructure-Foundations: rancid: message has lines too long for transport - https://phabricator.wikimedia.org/T410606 (taavi) NEW
[10:04:36] netops, Infrastructure-Foundations, Toolforge, tools-infrastructure-team: Plan networking for Toolforge-on-Metal experiment - https://phabricator.wikimedia.org/T407140#11391233 (fgiunchedi) >>! In T407140#11383672, @cmooney wrote: > Ok thanks @fgiunchedi for the info. > > I think that seems doab...
[10:07:15] netops, Infrastructure-Foundations, Toolforge, tools-infrastructure-team: Plan networking for Toolforge-on-Metal experiment - https://phabricator.wikimedia.org/T407140#11391235 (taavi) >>! In T407140#11391233, @fgiunchedi wrote: > There are also NFS read only shares towards clouddump hosts, thoug...
[12:21:37] netops, Infrastructure-Foundations, Toolforge, tools-infrastructure-team: Plan networking for Toolforge-on-Metal experiment - https://phabricator.wikimedia.org/T407140#11391608 (cmooney) >>! In T407140#11391233, @fgiunchedi wrote: > The NFS shares host tools data and scratch space, specifically:...
[12:32:43] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11391704 (cmooney) >>! In T408892#11389081, @Papaul wrote: > I think I am wrong on the public vlan for rack 22. We will not be re-imaging the servers in that...
[12:51:26] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11391814 (cmooney)
[13:18:45] netops, Infrastructure-Foundations, Toolforge, tools-infrastructure-team: Plan networking for Toolforge-on-Metal experiment - https://phabricator.wikimedia.org/T407140#11391938 (ayounsi) Not sure if it has been discussed but what do you think of using Calico's [[ https://docs.tigera.io/calico/lat...
[13:33:09] netops, Infrastructure-Foundations, Toolforge, tools-infrastructure-team: Plan networking for Toolforge-on-Metal experiment - https://phabricator.wikimedia.org/T407140#11392000 (cmooney) >>! In T407140#11391938, @ayounsi wrote: > But instead of doing network separation using a complex VRF and `ip...
[13:47:22] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11392028 (ayounsi) @papaul, could you have a look at the BIOS of sretest1005 ? The matching Redfish keys don't exist :( `lang=python >>> dump3.components['N...
[13:51:43] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11392041 (cmooney)
[14:09:27] XioNoX: topranks: if I enable the BGP flag for a host, I still have to run homer against the cr/ToR switch in that site, correct?
[14:09:40] sukhe: yep
[14:09:42] sukhe: yep
[14:09:52] there is no auto-push to the network based on Netbox changes
[14:09:59] thanks!
[14:10:30] sukhe: is this for the new proxies ?
[14:10:41] topranks: yep, hcaptcha anycast one
[14:10:49] hmm ok, they are on the routed ganeti setup are they?
[14:10:59] 4 of them in esams/magru yes
[14:11:08] anything there to factor in?
[14:11:11] might be slightly different there, as they will peer with the ganeti host not top-of-rack
[14:11:25] yeah so no need to push to the switch for those
[14:11:30] yeah that makes sense
[14:11:32] cool!
[14:11:32] XioNoX probably knows the setup better
[14:11:47] so how does that work in that case, out of curiosity? I am assuming the ganeti's already have the BGP flag?
[14:11:48] probably needs a puppet run on the metal host as well as the vm?
[14:11:56] do we have dynamic neighbors on the ganeti hosts? or do we need puppet or something to configure the session from ganeti host to VM?
[14:12:29] you need a special bird version on the VM side
[14:12:33] ganeti should already have BGP flag for their connection to the switch yes (they propagate the routes they learn on internal BGP to VMs to the top-of-rack over physical link)
[14:12:52] dynamic neighbors on the ganeti side so no puppet run needed
[14:12:57] nice :)
[14:13:11] https://wikitech.wikimedia.org/wiki/Ganeti#VMs_BGP for the doc
[14:13:17] thanks <3
[14:13:23] XioNoX: we already do the routed ganeti bird via the hiera now
[14:13:27] if $routed_ganeti_apt {
[14:13:27] apt::package_from_component { 'bird2':
[14:13:27] component => 'component/bird-routed-ganeti',
[14:13:27] priority => 1002,
[14:13:27] }
[14:13:29] } else {
[14:13:32] ensure_packages('bird2')
[14:13:35] }
[14:13:37] is that what you meant?
[14:13:42] nice, yeah
[14:13:47] looks like it yeah
[14:13:54] that was moritz to be clear not me but yeah, I remember the patch
[14:13:54] and moritz packaged it for trixie already too
[14:14:11] topranks: yeah, he did. though I haven't tested it anywhere yet so we are doing bookworm for these hosts
[14:14:26] then I will reimage the durum hosts to trixie to test this version of bird and then reimage the other hosts to trixie as well
[14:15:02] so it should be automatic on the VM side, it will know it's on routed ganeti as it will use a 255.255.255.255 netmask and configure the specific bits automatically
[14:15:14] ok, thanks!
[14:15:46] it's a new setup so there might be some specific bits to hcaptcha proxy too (like IPs)
[14:17:23] specific bits meaning?
[14:17:44] so this setup is like anycast as well if that helps (since you were out when we discussed this)
[14:18:13] https://phabricator.wikimedia.org/T409780
[14:18:27] yeah, dunno if it's the first servers being deployed like that or not
[14:19:07] will read the task
[14:19:08] in my mind at least (right or wrong, you be the judge!) this is akin to the current anycast setup. but yeah I will double check
[14:23:41] it will be fine within a site I think, we'll have an extra ASN in the path but that won't matter if all of them are deployed on routed ganeti
[14:24:25] sukhe: to do the active/passive we need to set the community in Bird, and adjust our policies to match on the CRs
[14:24:41] all fairly easy but a little work, might be simpler to trial this as regular anycast and then add those bits when it's working
[14:24:59] topranks: yep thanks, it's on the list. but for now I am just bringing everything up and then I will ask you for that
[14:25:17] yep
[14:31:56] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11392187 (cmooney)
[14:54:41] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[15:19:01] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux ARP resolution bug on v24.10.x+ - https://phabricator.wikimedia.org/T409178#11392432 (cmooney)
[15:36:23] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11392634 (cmooney)
[15:54:41] RESOLVED: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[16:16:19] netops, Infrastructure-Foundations, SRE, Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11392926 (cmooney)
[16:22:45] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11392975 (Papaul) @ayounsi sretest1005 is the same as 2004 see below. what you can maybe check is the redfish /IDRAC version on sretest2004 and 1005 {F703...
[16:51:51] topranks:
[16:51:52] ERROR:homer_plugins.wmf-netbox:No BGP group found for hcaptcha-proxy1001.
[16:52:01] what am I missing here? where do I specify that?
[16:52:08] homer "cr*eqiad*" commit "bring up hcaptcha-proxy1001"
[16:52:20] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/homer/deploy/+/refs/heads/master/plugins/wmf-netbox.py#17
[16:57:29] oh that's interesting.
[16:58:24] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11393374 (ayounsi) Thanks, yeah that must be the reason : ` >>> spicerack.redfish('sretest1005').hw_model 9 >>> spicerack.redfish('sretest2004').hw_model 9 >...
[17:01:47] I am surprised this requires a homer update
[17:01:59] unless I am missing something else on where to set this in Netbox itself
[17:02:18] did not need to touch homer when introducing new host name type
[17:02:24] so far
[17:02:36] mutante: for a host that does BGP?
[17:02:49] does hcaptcha do that?
[17:02:56] yeah the new ones do
[17:03:02] oh, ok, nevermind then
[17:03:34] interestingly none of the tcp-proxy VMs have the BGP custom field set
[17:03:47] is that required to be enabled on the VMs or just on the bare-metal host itself?
[17:03:56] are they doing BGP?
[17:04:06] they are on routed ganeti, but, -- oh you're adding an anycast service?
[17:04:10] yeah
[17:04:11] oh ok
[17:04:13] nvm :)
[17:04:17] they are not doing BGP themselves
[17:06:01] some are on routed ganeti and some are not.. in meeting, bbl
[17:07:25] sukhe: hcaptcha-proxy1001 is in eqiad so old ganeti and peers with the core routers, so it will need to be added to the file taavi shared
[17:07:42] routed ganeti hosts shouldn't need it
[17:07:46] VMs I mean
[17:07:47] XioNoX: ok, so then that also means for everything except magru/esams
[17:07:49] cool, patching it
[17:08:11] sukhe: yep, it's the wmf-netbox.py so doesn't require a homer release, "just" a new deploy
[17:08:48] basically the steps from https://wikitech.wikimedia.org/wiki/Homer#In_the_deployment_server (after merge)
[17:09:06] cool, thanks, TIL!
[17:09:11] not ideal, I think Cathal has something somewhere I need to review to make it better
[17:09:13] it's been a while since a new anycast service was added
[17:09:18] (for me)
[17:09:23] or thinking about it, still need to catch up on 200 emails :)
[17:09:38] https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/1207915
[17:10:06] going to leave the homer thing to cathal; he is always looking to do more work
[17:10:09] :P
[17:17:37] sukhe: do you need this today?
[17:17:49] topranks: no, I would never ask you for that outside of business hours :)
[17:17:52] so please no
[17:18:05] tomorrow or Monday is fine. I have other stuff to do, so not blocked on you
[17:18:12] ok cool
[17:18:30] I'll merge the patch in the morning in that case, needs a new minor homer release to integrate those
[17:18:57] yep all good. I need to reimage all the VMs and do other stuff, so tomorrow or even Monday
[18:13:36] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11393756 (RobH) Day 8 Update: * 22 hosts moved today, 22 remain ** all wikikube and aux host migrations completed ** (3) pc hosts in discussion with data-p...