[08:43:28] hello people
[08:44:02] <_joe_> no one likes you, elukey, sorry
[08:44:09] bam
[08:44:26] he's not actually sorry though
[08:44:32] I have a weird problem with an-test-worker1002, apparently the link with the switch is up, but from ~15 hours ago till now the connectivity is down
[08:44:42] <_joe_> ema: he just defined his 6 months of work with me as "Training Day"
[08:44:43] and I don't see why
[08:44:59] (I am deliberately ignoring noise in this chat)
[08:45:02] (:D )
[08:45:04] <_joe_> sorry :P
[08:45:15] <_joe_> I didn't think you had a real issue to talk about
[08:45:29] <_joe_> so the link is down on the host but the switch sees it up?
[08:45:35] <_joe_> can you access the server via OOB?
[08:46:11] <_joe_> I would try to get into console and look at dmesg first of all
[08:46:14] the interface is up on the host and the link seems detected/up, and it is up also on the switch.. I can access via OOB and I don't see anything weird in dmesg, route, arp, etc..
[08:46:24] <_joe_> oh sigh
[08:46:28] the only weird thing is https://librenms.wikimedia.org/device/device=162/tab=port/port=14780/
[08:46:59] there is a rise in broadcast traffic at the same time that the host lost connectivity
[08:47:51] (of course this happens when I want to test an upgrade procedure for hadoop)
[08:48:20] tried also to powercycle, ifup/ifdown, etc..
[08:48:50] <_joe_> elukey: can you ping the default gw?
[08:48:55] <_joe_> I guess not even icmp works
[08:49:02] exactly yes
[08:49:40] <_joe_> so either something's broken in the router, or the cable is damaged?
[08:49:44] <_joe_> if no config was changed
[08:49:54] <_joe_> s/router/switch/
[08:50:12] I was wondering the same thing, it might be the DAC cable, I don't see anything on the switch indicating a problem with the interface sigh
[08:54:23] XioNoX: if you have time, any idea?
[08:56:07] what's up?
[08:57:03] elukey: did you try to bounce the switch port?
[08:57:03] I am trying to get why an-test-worker1002 lost connectivity ~15h ago, but after some checks I cannot really figure out why
[08:57:08] nope didn't
[08:57:27] I'd do that then change the DAC indeed
[08:57:31] or re-seat
[08:58:37] XioNoX: so basically set interfaces batman disabled ; delete interface batman disabled;
[08:58:53] and then open a task if it doesn't work
[08:59:24] 🦇 👨
[09:00:05] kormat: I have appreciated your support in not adding comments on top of _joe_'s and ema's earlier on
[09:00:23] elukey: :D
[09:00:25] elukey: yep
[09:08:11] didn't work :(
[09:08:25] opening a task, thanks!
[09:09:00] <_joe_> ok so, can we go back to shitposting?
[09:09:02] <_joe_> :D
[09:10:38] yes!
[10:20:10] ack, sounds good
[10:20:19] er oops
[10:30:00] heads up, I am going to start today to do some restarts of backup hosts, this may momentarily generate some global alerts/metrics lost that are difficult to downtime individually
[10:38:41] volans: I just found an exception in the cookbooks.sre.hosts.decommission: https://phabricator.wikimedia.org/T271519#6747047 rings any bell?
[10:40:05] arturo: from the look of it seems related to netbox upgrade to 2.9 that chaomodus did the other day
[10:41:31] arturo: physical or ganeti vm?
[10:41:39] volans: physical
[10:42:45] just sent you the output in private
[10:43:10] thx
[10:53:05] seems related to https://phabricator.wikimedia.org/T266487#6747092, having a look
[11:03:01] ok
[11:28:47] volans: can I "workaround" it somehow, following manual methods?
[11:29:12] I'm working on upgrading pynetbox
[11:29:24] if it can wiat a bit
[11:29:29] *wait
[11:29:47] yes, it can wait, no problem :-)
[12:15:51] arturo: could you retry to run it from cumin2001 please ?
[12:16:00] I've updated python3-pynetbox there to test it
[12:16:04] volans: ok
[12:16:24] moritzm: if it's not too much bother, could you retry the makevm too from cumin2001 when you have a chance?
this might have fixed that one too
[12:16:25] volans: I'm running `sudo cookbook sre.hosts.decommission -t T271519 labtestvirt2003.codfw.wmnet`
[12:16:25] T271519: codfw1dev: repurpose/rename labtestvirt2003.codfw.wmnet as cloudgw2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271519
[12:16:33] ack
[12:20:22] the script is now at the `[INFO] Gathering devices, interfaces, addresses and prefixes from Netbox` stage
[12:22:33] takes a while I know, netbox apis are particularly slow compared to the internal api
[12:33:55] sure thing, re-trying now
[12:36:01] it's still failing, but with a different error now, so progress: https://phabricator.wikimedia.org/P13770
[12:36:53] interesting, that's totally new, chaomodus ^^^
[12:47:00] volans: ulsfo is full https://netbox.wikimedia.org/ipam/prefixes/13/ip-addresses/
[12:47:22] I think we can delete 13/14 that were caused by the previous runs
[12:47:31] lolol
[12:47:37] thx for looking I
[12:47:43] I'm still at the previous fix...
[12:48:24] moritzm: go for it!
[12:48:25] XioNoX: there's a typo in the homer call in the decom cookbook
[12:48:34] ah?
[12:48:44] logger.info('Running Homer on {switch}, it takes time ⏳, don\'t worry').format(switch=switch)
[12:48:49] parentheses closed too early
[12:48:54] before the format
[12:49:12] arturo: you have (or XioNoX will do it for you) to run homer for that device
[12:49:17] XioNoX: you mean remove .13 and .14 in Netbox and re-run the makevm cookbook?
[12:49:31] moritzm: I removed them, you can re-run it
[12:49:40] ack, doing that now
[12:49:58] ok
[12:50:03] volans: nice catch
[12:50:04] I've never run homer myself before
[12:50:14] I would appreciate some guidelines
[12:50:26] seems to work fine now, it's beyond the point where it failed earlier
[12:50:27] I have to run for lunch, could you take care of it XioNoX ?
it's a quick patch
[12:50:35] yep
[12:50:41] thx
[12:50:52] I've upgraded pynetbox everywhere fwiw
[12:54:03] arturo: sure
[12:54:15] arturo: let me know when you're done with the script?
[12:55:01] XioNoX: the decom script is already done, it failed in what seems to be the last step: running homer
[12:55:56] ok, cool
[12:56:08] arturo: which host was it for?
[12:56:30] labtestvirt2003
[12:56:41] context is T271519
[12:56:41] T271519: codfw1dev: repurpose/rename labtestvirt2003.codfw.wmnet as cloudgw2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271519
[12:57:56] arturo: on cumin, run `homer asw-b-codfw commit "T271519"` it will show you a diff before doing the change
[12:58:19] feel free to share the diff if you want to be extra safe
[12:58:40] mmm
[12:58:44] there wasn't a diff at all heh
[12:58:46] https://www.irccloud.com/pastebin/LOebtwwv/
[12:58:56] wait, let me try as root
[12:59:19] same
[12:59:21] https://www.irccloud.com/pastebin/qfcq6JEe/
[12:59:22] arturo: er, `homer asw-b-codfw* commit "T271519"`
[12:59:27] see the *
[12:59:31] root or user?
[12:59:37] user
[13:00:12] ok, now seeing some relevant output
[13:01:12] XioNoX: diff:
[13:01:15] https://www.irccloud.com/pastebin/a1wEOiNr/
[13:01:17] LGTM
[13:01:33] yep
[13:01:54] committing
[13:03:12] XioNoX: ok, success
[13:03:45] arturo: are you decom'ing several hosts?
[13:03:58] no, I'm renaming labtestvirt2003 to cloudgw2001-dev
[13:05:08] ok
[13:05:23] I fixed the decom script bug, so next time it should be fine
[13:05:26] there is a script in netbox for that ;)
[13:06:20] I'm following https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging (I noticed the warning at the top but decided to keep moving anyway)
[13:06:22] volans: to move, but not rename iirc?
[13:06:57] right, my bad
[13:19:43] arturo: I think the doc is missing a line too
[13:21:38] ok
[13:22:37] what would that be?
[13:23:19] something like this: -----------
[13:23:34] there is nothing to re-enable its switch port and configure its vlan
[13:23:45] same as configuring the IP I think
[13:24:59] I can't check right now but does it mention the allocation script?
[13:26:16] nop
[13:26:36] forgot if we can run it manually for that case
[13:28:06] * arturo food time
[13:28:48] arturo: my makevm run is prompting me the addition of cloudgw2001-dev, shall I merge that along?
[13:29:16] plus [virt|wan].cloudgw.codfw1dev
[13:29:32] (in the netbox/DNS sync)
[13:31:04] XioNoX: yes it should if mgmt only is what's left on the host
[13:31:15] IIRC
[13:38:38] I updated tags at https://phabricator.wikimedia.org/project/manage/1025/ but only to the best of my ability, feel free to add latest changes
[13:46:11] moritzm: yes
[13:46:57] ack, doing that now
[13:50:41] thanks
[13:51:03] I probably left a prompt asking for confirmation somewhere
[13:57:52] volans: in the last DNS diff from netbox I see something weird with the filename
[13:57:56] https://www.irccloud.com/pastebin/K6eoRDZZ/
[13:58:32] I recall this is related to T266331
[13:58:32] T266331: Cloud: define relationship between wikimediacloud.org domain, CIDR prefixes and netbox automation - https://phabricator.wikimedia.org/T266331
[14:06:47] arturo: /me back
[14:08:35] no big deal, perhaps we just need some manual cleanup
[14:09:55] for now we decided to keep that zone manually managed so that's ok
[14:10:05] what was weird in the diff for you?
[14:10:18] b/wikimediacloud.org-codfw
[14:10:48] that's how zonefiles are generated for global zones, to allow to migrate them on a per-dc basis
[14:10:56] ok, then fine
[14:11:02] wikimedia.org-eqiad, etc...
[14:11:12] anyway, worth noting that these records are already in ops/dns.git:
[14:11:13] https://www.irccloud.com/pastebin/IB5CZYwI/
[14:11:55] yep and not included from netbox data
[14:12:02] excellent
[16:59:38] think there was an ipv6 convo in here the other day which spurred me to check with my isp if they offer native ipv6, their response:
[16:59:42] no, we do not support IPv6 , we own more than enough v4 IP’s.
[17:34:20] tbf they are small enough they may entertain routing some rfc1918 space to me which would at least avoid my NAT
[17:34:37] * jbond42 feeling overly optimistic considering the response to the ipv6 questions
[17:40:09] jbond42: you can probably negotiate some IPs in exchange to auditing their puppet :)
[17:40:33] XioNoX: lol :D
[17:44:50] the mkdir -p thing should be enough to convince them to have jbond42 come on board
[17:47:12] :D "trading offering puppet functions for [ospf/bgp] neighbours"
[17:47:45] their reply: "what's puppet? we just keep the router configs in CVS. mostly"
[17:51:27] ....i lied i know nothing of IT what even is a NAT i have overcommitted..............
[17:53:16] https://media.giphy.com/media/3o7ZeEZUzRjyvWuuIg/giphy.gif
[17:57:56] CVS is a pharmacy. You're lucky if they might have a manual backup of a 3 year old copy of a screenshot of a router config, somewhere :)
[18:18:09] SCCS is where it's at. Programs check in and never check out. Like a motel.
[22:27:58] klausman: AOD, append-only development. I like it :)
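
Note on the 12:48:44 message: a minimal sketch of why the `logger.info(...).format(...)` call quoted there fails, and of one possible fix. Python's `logging` methods return `None`, so calling `.format()` on the result raises `AttributeError` after the message has already been emitted with the literal `{switch}` placeholder. The logger name and function names below are hypothetical, and the fix shown is only an illustration, not necessarily the patch that went into the cookbook.

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("decom_sketch")


def log_homer_buggy(switch):
    # Parenthesis closed too early: .format() runs on the return value of
    # logger.info(), which is None, so this raises AttributeError.
    logger.info('Running Homer on {switch}, it takes time ⏳, don\'t worry').format(switch=switch)


def log_homer_fixed(switch):
    # One possible fix: pass the value as a logging argument so that the
    # logging module does the interpolation itself.
    logger.info("Running Homer on %s, it takes time ⏳, don't worry", switch)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_homer_fixed("asw-b-codfw")
    try:
        log_homer_buggy("asw-b-codfw")
    except AttributeError as exc:
        print(f"buggy call failed: {exc}")
```

Moving the closing parenthesis after `.format(switch=switch)` would also work; the `%s` form just defers string building until the record is actually emitted.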