[11:01:24] last year Luca tracked down a segfault in uwsgi, which was exposed by Netbox (at the time running on netmon*): https://phabricator.wikimedia.org/T212697
[11:02:15] The patch was backported to a stretch-wikimedia uwsgi package 2.0.14+20161117-3+deb9u2+wmf1
[11:03:13] there's now a new stretch update, netmon/netbox are running buster these days, and I don't think we've seen those segfaults elsewhere
[11:04:03] so I'm planning to remove the 2.0.14+20161117-3+deb9u2+wmf1 package from apt.wikimedia.org instead of rebasing the patch onto +deb9u3
[11:04:38] but let me know if anyone thinks otherwise!
[11:41:26] the only useful feature in the new LibreNMS release is Matrix alerting support https://github.com/librenms/librenms/pull/12018 good timing
[12:32:40] moritzm: is that used by ores by any chance?
[12:33:01] I don't recall if acmechief saw the same issue
[12:33:07] * volans looking at https://debmonitor.wikimedia.org/packages/uwsgi
[12:33:26] it is: https://debmonitor.wikimedia.org/packages/uwsgi
[12:33:52] but the uwsgi bugfix was never fully rolled out
[12:34:03] and currently it's running without it in codfw
[12:34:26] lol
[12:34:51] I guess we can remove it and remember about it if we hit that again
[12:35:00] what about a reimage though?
[12:35:13] if we have to reimage an existing host on stretch (for $reasons)
[12:35:20] it would get back the old package with the issue, I guess
[12:35:57] no, when dropping the custom package I'll also update to the new deb9u3 from Debian, and every new reimage will also get that one
[12:36:12] so a reimage should not be necessary
[12:38:08] ack, no objections then
[12:44:48] ack, thx
[13:29:25] if you're interested in a riddle: I've launched this and the output says 'waiting on reboot', though I can run 'cumin' just fine
[13:29:28] cumin1001:~# wmf-auto-reimage-host --new ms-be2057.codfw.wmnet
[13:29:42] the log file says this
[13:29:43] 100.0% (1/1) of nodes failed to execute command 'cat /proc/uptime': ms-be2057.codfw.wmnet
[13:30:18] though interactively I can run cumin fine
[13:30:28] i.e. cumin1001:~# cumin --force ms-be2057.codfw.wmnet 'cat /proc/uptime'
[13:31:01] maybe I'm holding it wrong (?)
[13:32:06] godog: most likely it didn't reboot at all
[13:32:30] it checks that the reboot time is after the issued reboot
[13:33:06] mmmh, up 24 min
[13:33:14] it didn't reboot into d-i
[13:33:27] so basically it didn't get the PXE bit set, although we set it and check that it's set
[13:33:45] if you force the host to reboot into PXE, the reimage script will happily continue
[13:33:54] if you do that within the timeout (which is fairly long)
[13:34:22] this is new hardware, so it might be that the PXE setting didn't work at all
[13:34:53] we've seen this happen on some hosts, but too few to make assumptions about hardware/firmware/versions
[13:35:10] volans: the thing that confused me is the cumin output log file, which says it failed to execute cat /proc/uptime, though that's only part of the problem?
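[Editor's note: a minimal sketch of the reboot check volans describes above — derive the host's boot time from the first field of /proc/uptime and compare it with the time the reboot was issued. Function and parameter names are mine, not the actual wmf-auto-reimage code.]

```python
import time


def parse_uptime(proc_uptime_output: str) -> float:
    """Return seconds since boot, the first field of /proc/uptime."""
    return float(proc_uptime_output.split()[0])


def host_has_rebooted(proc_uptime_output: str,
                      reboot_issued_at: float,
                      now: float = None) -> bool:
    """True if the host's boot time is later than when the reboot was issued.

    A host that is "up 24 min" when the reboot was issued minutes ago has a
    boot time earlier than the reboot, so it never actually went down.
    """
    if now is None:
        now = time.time()
    boot_time = now - parse_uptime(proc_uptime_output)
    return boot_time > reboot_issued_at
```

For example, with a reboot issued at t=1000 and the check running at t=1100, an uptime of 1440 s (24 min) means the host booted at t=-340, i.e. before the reboot was issued, so it has not rebooted yet.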
[13:35:49] godog: the user-facing message is
[13:35:50] 2020-09-03 13:34:12 [INFO] (filippo) wmf-auto-reimage::print_line: Still waiting for reboot after 30.0 minutes
[13:36:06] the failed run is actually correct
[13:36:14] because cumin is trying to connect with the d-i key
[13:36:20] which is authorized in d-i and then removed
[13:36:24] and not the actual production key
[13:36:34] ah got it, thank you
[13:36:35] it uses /etc/cumin/config-installer.yaml
[13:36:49] but I agree that might be confusing
[13:36:59] in this specific case, in which normal prod cumin "works"
[13:37:46] yeah, perhaps logging which commands wmf-auto-reimage-host is executing in its log file might help with debugging what goes on
[13:37:54] I've tried rebooting manually now
[13:38:10] rebooting into PXE, that is
[13:46:27] godog: one option could be, after N failures of proc/uptime, to check with the normal key and, if that succeeds, tell the user the host rebooted into the previous system and/or retry forcing PXE
[13:48:50] volans: that'd be sweet for sure, yeah
[13:50:04] feel free to open a feature request :) I should work on migrating that to cookbooks this Q (not sure I'll be able to at this point)
[13:52:52] for sure, will do
[14:05:44] thx
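[Editor's note: a sketch of the fallback volans proposes at 13:46 — after N failed checks with the installer (d-i) key, retry with the production key; if that one connects, the host evidently booted back into the old system instead of d-i. The callables and return values are hypothetical, not the real cumin or wmf-auto-reimage API.]

```python
from typing import Callable


def check_reboot_with_fallback(run_with_installer_key: Callable[[str], bool],
                               run_with_prod_key: Callable[[str], bool],
                               max_failures: int = 5) -> str:
    """Classify why a host is not answering with the d-i key.

    Each callable runs a command on the host with the given SSH key and
    returns True on success. Returns one of:
      'in-installer'             - the d-i key works, reimage is proceeding
      'rebooted-into-old-system' - only the prod key works: PXE was ignored
      'unreachable'              - neither key works
    """
    for _attempt in range(max_failures):
        if run_with_installer_key("cat /proc/uptime"):
            return "in-installer"
    # The d-i key never worked; see whether the normal production key does.
    if run_with_prod_key("cat /proc/uptime"):
        # Host is up on its previous OS: the PXE bit was likely not honored,
        # so warn the user and/or retry forcing PXE.
        return "rebooted-into-old-system"
    return "unreachable"
```

In godog's case above, this would have reported 'rebooted-into-old-system' instead of only logging the failed `cat /proc/uptime`.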