[08:57:26] hi, I'm observing some kind of issue with the debmonitor hook in one of my servers [08:57:43] https://www.irccloud.com/pastebin/KbAoWImr/ [08:58:21] apparently the hook is stuck in the `/usr/bin/debmonitor-client-unpriv -g || true` step [08:58:32] is debmonitor reachable from that instance? [08:58:52] I can test manually, what is the address/port? [08:59:27] simply run debmonitor-client on the shell [08:59:51] or debmonitor-client as non-root [09:00:36] it ask for the server address [09:01:07] ok, works as sudo [09:01:10] aborrero@cloudgw2002-dev:~ $ sudo debmonitor-client [09:01:10] INFO:debmonitor:Found 528 installed binary packages [09:01:10] INFO:debmonitor:Found 0 upgradable binary packages (including new dependencies) [09:01:10] INFO:debmonitor:Successfully sent the full update to the DebMonitor server [09:08:55] it's a buster instance, so I doubt there's something different in the way apt sends the hook, and the hook is also versioned [09:11:04] arturo: how much time was it stuck before you hit ctrl+c? [09:11:16] aproximately ofc [09:11:16] couple of minutes [09:11:39] because recently john added some retry logic, let me check the paramters [09:12:23] it does 3 attempts, with a backoff_factor of 60s [09:12:42] so yeah can take more than 2 minutes in the worse case scenario [09:13:10] shall I leave it untouched until it times out? and see if that makes any diff? [09:13:26] is it reproducible? [09:13:39] yes, try yourself if you want [09:14:02] su: warning: cannot change directory to /nonexistent: No such file or directory [09:14:02] INFO:debmonitor:Got 2 updates from dpkg hook version 3 [09:14:07] the warning is also generated by the script [09:14:23] yes, that's "expected", morit.z can tell you the whole story [09:15:23] ok [09:15:40] volans: you can use this machine to debug whatever you need [09:15:47] arturo: ack, looking it right now [09:17:32] ok it's getting a 500 on the debmonitor side, I'll check what's happening there [09:18:26] arturo: to unblock you just let it run and it will finish (because of the || true) without affecting the installation, I'll chat with john to maybe reduce the retry interval [09:18:40] as I have it already running should I let it run? or do you want to check the output? [09:18:52] nah, it is OK [09:19:01] just let me know when it finishes [09:19:06] ack [09:19:26] (and if there is any scary warning during the kernel install) [09:20:11] like here be dragons :D [09:20:35] or a command like `format C:\\` [09:20:50] hehe read: kormat C:\\ [09:21:10] arturo: done [09:21:12] dpkg: warning: downgrading linux-image-amd64 from 5.10.24-1~bpo10+1 to 4.19+105+deb10u11 [09:21:21] marostegui: :-P [09:21:27] and 3 warnings for Possible missing firmware for the tg3 module [09:21:38] volans: ack, thanks! I'll take it from there [09:22:24] all yours, if I have to bet the issue is related to the downgraded package, I'll have a look at the debmonitor side of things [09:23:36] 👍 [09:31:37] I've opened T282529 [09:31:38] T282529: debmonitor: 500 on upgraded+downgraded package - https://phabricator.wikimedia.org/T282529 [09:55:33] arturo: by any chance cloudgw2001-dev is in the same situation cloudgw2002-dev was before? [09:55:51] volans: yeah, also feel free to test stuff there [09:56:10] but for the record, I already reverted the kernel downgrade (so, upgraded it again) [09:56:43] ah ok, I jsut need to retry the install command but don't need to wait the end, I will ctrl+c it [09:56:51] that will help with the debug [09:58:47] all yours! [09:59:04] thanks, sorry for the trouble [09:59:39] no problem :-) we have these -dev servers for a safe playground [10:13:57] arturo: thanks, all done, I have the payload I needed to test troubleshoot without bothering your hosts ;) [10:14:19] excellent :-) [10:26:36] volans: I may have another bug report [10:26:49] in the reboot_single cookbook [10:27:27] arturo: sure, although not "mine" ;) [10:27:43] fair :-P [10:28:08] not a big deal, but I rebooted a server using it, the server is up and running, and for whatever reason it fails to detect the reboot and proceed [10:28:20] endless loop of [42/360, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Cumin execution failed (exit_code=2)` [10:28:55] did it maybe fail to complete a puppet run? [10:29:07] can happen with servers which are still WIP [10:30:05] can have a look at the cumin logs later [10:30:11] no, this server is cloudgw1002, not the -dev counterpart [10:30:31] puppet agent run fine, I just tested [10:30:50] moritzm: I can open a phab task and collect some info if you want [10:31:42] moritzm: it's timing out to get the uptime [10:32:24] it's suuuuuuper slow to connec there, cecking [10:33:29] it's stuck at [10:33:32] debug1: Connecting to cloudgw1002.eqiad.wmnet [2620:0:861:118:10:64:20:5] port 22. [10:33:41] (ssh -vvv) [10:33:53] do we have anything network wise that might explain that? [10:33:57] seems some sort of timeout [10:34:01] that sounds like firewalling [10:34:05] let me review it [10:34:10] 30s [10:34:12] seems [10:34:19] the line after is [10:34:20] debug1: Connection established. [10:34:34] ah no got it [10:34:37] it fails on v6 [10:34:39] and retry on v4 [10:34:44] debug1: Connecting to cloudgw1002.eqiad.wmnet [2620:0:861:118:10:64:20:5] port 22. [10:34:47] debug1: connect to address 2620:0:861:118:10:64:20:5 port 22: Connection timed out [10:34:50] debug1: Connecting to cloudgw1002.eqiad.wmnet [10.64.20.5] port 22. [10:34:52] and apaprently has a 30s retry logic there [10:34:54] debug1: Connection established. [10:35:02] while we have a shorter timeout for the uptime check [10:35:16] why would it fail on v6? weird [10:35:28] is ssh listening on v6? [10:35:47] or firewalled? [10:36:05] both yes, it should work [10:36:20] 10:35:49.672838 IP6 cumin1001.eqiad.wmnet.46936 > cloudgw1002.eqiad.wmnet.ssh: Flags [S], seq 52186391, win 43200, options [mss 1440,sackOK,TS val 3756873696 ecr 0,nop,wscale 9], length 0 [10:36:20] 10:35:49.672924 IP6 cloudgw1002.eqiad.wmnet.ssh > cumin1001.eqiad.wmnet.46936: Flags [S.], seq 2268034202, ack 52186392, win 42840, options [mss 1440,sackOK,TS val 3882605519 ecr 3756873696,nop,wscale 9], length 0 [10:36:47] I know the problem: the cloudgw1002 server lacks a default v6 route [10:37:14] I would have expected that to be created by the installer .... [10:38:09] mmmm no, the v6 routing table is quite similar to a working server [10:40:19] so it might be a filter somewhere [10:40:48] the cloudgw1002 server puts the v6 packet in the wire, but it doesn't reach cumin1001 (the syn+ack) [10:42:02] anyway, brb [11:14:36] volans: re timeout feel free to change i picked those values arbitrarily [11:17:08] arturo: volans: looks like there is no cumin term in labs-in6 on the core routers [11:17:24] right [11:20:15] jbond42: indeed, but was never there, so what changed? why the other are working? [11:20:56] *other hosts [11:21:11] * volans running a quick test [11:21:22] other may not work either :-) [11:21:26] others* [11:21:33] I'm checking [11:23:18] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/688978 [11:24:34] a normal run has a 10s delay on some cloud hosts, while the direct ssh to cloudgw has more like 30s [11:25:00] but yes we should add it, i'llleave it to you two and arzhel, sorry gotta run for lunch [11:25:19] 👍 [11:25:59] merging [11:26:22] 👍 [11:27:21] something is wrong here [11:27:23] https://www.irccloud.com/pastebin/eTo3FAyf/ [11:29:35] s/protocol/next-header/ [11:30:19] jbond42: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/688979 [11:30:31] looking [11:31:45] arturo: +1 fyi i genrally run homer with diff first to catch issues like this [11:32:11] what should I do? [11:32:28] for the next time [11:32:56] arturo: `homer cr*eqiad* diff` [11:33:10] ok, let me try that [11:33:13] will first show a diff and will also spit any errors it would get from a commit [11:34:56] and if errors, then revert or patch accordingly before homer commit, no? [11:35:42] arturo: exactly [11:36:00] cool [11:36:22] BTW that was it. As soon as the patch was merged into homer, the cookbook finished :-P [11:38:34] \o/ [12:43:58] great [13:51:00] effie: re: prior art for swift account requests: https://phabricator.wikimedia.org/T280773 [14:12:24] cheers tx [14:19:18] arturo: FYI I've sent a patch to fix the debmonitor issue, basically you got unlucky, to hit it you had to install a version of package that is not the natual candidate one in APT *and* that package had to have changed source package name between the two versions [14:19:38] :-P [14:20:11] or you got lucky, depending of your point of view ;)