[07:26:46] sukhe: unrelated, but kinda related: we'll also need https://gerrit.wikimedia.org/r/1208212 in esams/magru
[09:02:09] netops, Infrastructure-Foundations, SRE: mr1-codfw is single-homed to lsw1-a2-codfw - https://phabricator.wikimedia.org/T407488#11394993 (ayounsi) management routers are physically single homed in the old design (eqsin, codfw, eqiad), probably because it was best to not over engineer it, and mgmt net...
[09:38:38] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11395092 (ayounsi) New updated list, we're at 642 hosts if the MySQL query from the initial task is still the proper way. Up from 90 in 2020, probably becaus...
[09:49:31] is someone using sretest2003? Can I reboot it for some LLDP tests?
[09:52:04] sretest2003 is a Core DB server (mariadb::core) -- I hope it's not a real one
[09:55:38] this was used by Manuel for tests, but he's off today for on call comp
[09:56:15] this is possibly superseded by the db-test* servers Federico has been setting up, so you could validate with him if sretest2003 is good to reboot
[09:56:42] thx
[09:57:26] federico3, ping ? ^
[09:58:09] looking
[10:01:11] XioNoX: do you need a phy host to reboot I suppose? Another phys test db would be ok?
[10:02:22] federico3: I need one from that list: https://phabricator.wikimedia.org/T250367#11395092 (and not all of them are eligible either, for example sretest1003 doesn't fit)
[10:02:45] but I can check if it's eligible if you give me another option :)
[10:13:40] XioNoX: you can also use ganeti-test2001 or ganeti-test2003
[10:15:05] XioNoX: db1176 and db2230 are test hosts that can be rebooted but they don't seem to appear on the list
[10:16:41] despite the name, sretest2003 appears to be pooled in so I would doublecheck with Manuel
[10:18:24] cool, thx, I'm glad I asked :)
[10:29:24] moritzm: seems like they're using a too old idrac version :(
[10:34:23] XioNoX: have you confirmed it's down to the idrac version?
[10:35:19] topranks: not 100%, but so far the faulty ones return 14 for `spicerack.redfish('ganeti-test2003').generation`
[10:35:48] and working ones "16"
[10:37:04] hmm ok
[10:37:20] well at least upgrading idrac can be done without affecting the host afaik
[10:38:08] yeah I haven't tried, but I think on too old hosts there is a max idrac version, I wanted to chat with someone who knows more about that before trying
[10:38:34] sretest1005 is 6 years ol
[10:38:35] d
[10:40:50] we can try to run the firmware upgrade cookbook?
[10:41:20] I upgraded them the last time during the bookworm update, so if 16 is available in the mean time, we could catch up that way
[10:43:49] looks like I got it mixed up, idrac is up to date, but the generation is the Dell server generation
[10:43:58] https://www.irccloud.com/pastebin/JHK2GNSP/
[10:46:10] so maybe it's just a too old server, and even recent idrac doesn't expose the LLDP attribute?
[10:51:00] well that sucks
[10:51:28] having to reboot to disable is a major headache, you'd think not worth it probably :(
[11:13:46] netops, Infrastructure-Foundations, SRE: mr1-codfw is single-homed to lsw1-a2-codfw - https://phabricator.wikimedia.org/T407488#11395278 (cmooney) >>!
In T407488#11394993, @ayounsi wrote: > management routers are physically single homed in the old design (eqsin, codfw, eqiad), probably because it was...
[11:26:49] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: mr1-codfw: add connection from - https://phabricator.wikimedia.org/T410717 (cmooney) NEW p:Triage→Low
[11:31:54] federico3: just got a message from Manuel, he said that sretest2003 was in production
[11:32:20] "was"? :D
[11:33:33] federico3: oops, I mean "is"
[11:34:12] I feel like it's a risky move though
[11:44:15] netops, Infrastructure-Foundations, SRE: mr1-codfw is single-homed to lsw1-a2-codfw - https://phabricator.wikimedia.org/T407488#11395392 (cmooney) Open→Resolved a:cmooney Will open task on getting the second link in place on mr1-codfw. We can look to do the same in eqiad once the row...
[12:08:14] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11395518 (cmooney)
[12:10:13] ///
[12:11:03] moritzm: yeah, I added to the new role already
[12:11:36] and the confusion is true indeed but will remove that once this is out and working :)
[12:12:23] yeah, sorry for the noise. I abandoned that patch earlier, but had already pinged you
[12:12:40] no worries, appreciate the diligence!
[12:12:43] was the revert back to bookworm caused by the BGP issues or other reasons?
[12:14:04] will share when online fully but wanted to get this up and running first, test routed ganeti bird for trixie on durum nodes and then upgrade these
[12:17:29] makes sense
[12:18:01] I was just curious since I had seen the reimages on the Phab task and was wondering if there had been a technical blocker
[12:30:35] sukhe: the homer update for the hcaptcha proxy BGP group mapping should be live on cumin hosts now
[13:59:47] federico3: looking at some reports, it looks like es2040 is missing a primary v4 and v6 IP - https://netbox.wikimedia.org/dcim/devices/5182/ . The IPs are set on the interface, but not set as the host's primary IP
[14:00:09] how does that happen?
[14:01:06] topranks: https://netbox.wikimedia.org/extras/changelog/239129/ and https://netbox.wikimedia.org/extras/changelog/239132/
[14:01:54] as in.. it was never configured? looking
[14:02:37] looks like it was removed by Papaul, I pinged him on -dcops
[14:03:14] I see changes from 2019
[14:04:07] Time 2025-08-21 15:37:55
[14:04:15] federico3: is that host in production?
[14:04:59] no
[14:05:20] it seems it was an old host
[14:05:45] ah wait, es2040
[14:06:17] yes, es2040 is live: https://zarcillo.wikimedia.org/ui/sections#es7
[14:06:52] ok, so it seems like a minor misconfig in Netbox, but better get it fixed sooner rather than later, because it causes something
[14:07:00] will wait for Papaul's answer then fix it
[14:07:31] is it possible that making the change in Netbox would trigger puppet runs changing anything on the host? If so we should rather depool the host for safety before the change
[14:08:05] it shouldn't, but if the depool is easy, doesn't hurt
[14:08:13] topranks: <3
[14:08:30] moritzm: yeah, no, no technical blocker but I wanted to get this up and running and will take that up over the next weeks
[14:08:31] federico3: are you in charge of ms-be ?
[14:08:41] I will also reimage a durum node in esams/magru to trixie first
[14:10:11] ms-be2057 is the only ms-be host with no AAAA on its v6 IP, so I guess it's also a misconfig
[14:10:21] XioNoX: no, only mariadb hosts
[14:10:32] federico3: any idea who I should ping?
[14:15:33] XioNoX: Emperor on #wikimedia-data-persistence - I already pinged him
[14:16:03] thx!
[14:16:39] * Emperor was summoned here...
[14:18:33] XioNoX: just for context, are you seeing missing ipaddrs only around data persistence stuff or is it a general issue? Is there something we can/should do to improve our provisioning?
[14:20:15] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11396006 (ayounsi) The current cable is already to lsw1-a2 (https://netbox.wikimedia.org/dcim/cables/7147/) so probably a3 is the next one. To...
[14:20:17] XioNoX: we didn't use to provision v6 on ms-* nodes, it's gradually being rolled out as we provision new ones. All ms-* nodes have to keep v4 indefinitely, though, as v4 addresses are what are used in the swift rings
[14:21:12] federico3: looking at this coherence report: https://netbox.wikimedia.org/extras/scripts/results/271822/
[14:21:31] some are red herrings
[14:22:18] Emperor: ah ok, so ms-be2057 will be decom soon-ish? then all good :) Just making sure there were no unexpected issues
[14:23:24] XioNoX: yeah, was bought 2020-08-11 so due to be replaced by the newly-arrived ms-be209[0-4] in my Copious Free Time
[14:24:09] cool, then nothing special to do, thanks for not ignoring v6 on the new hosts :)
[14:24:45] 👍
[14:25:09] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11396019 (cmooney) >>! In T410717#11396006, @ayounsi wrote: > The current cable is already to lsw1-a2 (https://netbox.wikimedia.org/dcim/cables...
[14:52:34] netops, Infrastructure-Foundations, SRE, Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11396143 (cmooney) Just to update this we are not planning to move the IP gateways for vlans to the Nokia switches in t...
[14:52:56] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: lvs1019: reimage to move primary IP from private1-c-eqiad to private1-c7-eqiad vlan - https://phabricator.wikimedia.org/T405632#11396146 (cmooney) Just to update this we are not planning to move the IP gateways for vlans to the...
[17:02:20] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11396665 (RobH) Day 8 Update: * 3 hosts moved, 19 remain * John worked with Amir directly today to depool and migrate pc101[678] since the depool and repoo...
[17:34:01] topranks: I cleared up all the v6-related changes for the hcaptcha ipv4_only group. anything still missing is my bad so let me know if you see something but cr* and as* runs should be clean
[22:17:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed