[00:00:36] add filter > date range > current date
[00:01:21] cool, thx
[07:17:24] moritzm: I'm getting an error trying to install cp3055
[07:17:30] Execution of preseeded command "wget -O /tmp/late_command http://apt.wikimedia.org/autoinstall/scripts/late_command.sh && sh
[07:17:30] /tmp/late_command" failed with exit code 1
[07:17:33] is that expected?
[07:20:31] It's a stretch installation
[07:23:55] let me check
[07:26:20] late_command.sh has some special casing for NVMe drives which were used in the last batch of caches in eqiad (1075-1090), do we maybe need to apply the same setting for the new esams caches as well?
[07:27:33] maybe... I'm not sure TBH
[07:28:06] those hosts come with a single NVMe drive
[07:28:36] are they all the same hardware? probably not?
[07:28:52] we could install bast3004 to rule out a generic d-i issue
[07:29:05] or one of the ganeti servers
[07:29:45] there was a recent change to late_command.sh, but it's totally unrelated, we dropped the workaround for re-installing puppet4 masters
[07:31:11] and maybe check if there's anything in /var/log/syslog, basically anything that gets logged in d-i ends up there
[07:31:22] I have ganeti3002 on the list for today as well
[07:31:32] I'
[07:31:53] I'd say, let's try installing that one, then we can see if it's specific to the cache hardware
[07:32:31] vgutierrez: unrelated, the certs warnings in icinga for cp hosts can be downtimed and/or are known?
[07:32:52] or dns3002
[07:33:02] are known and we got already a digicert-2009 cert deployed
[07:33:15] moritzm: ack
[07:33:33] vgutierrez: kk, I'll silence for 10d, thanks!
[07:37:02] bblack: you added ganeti3002 to the PXE boot config with stretch as the OS, but I think it makes more sense to directly install these with Buster (when we have the time to tackle them later). Our current main Ganeti clusters are on stretch, but the servers for the codfw expansion were also installed with Buster
[07:39:17] moritzm: let me trigger a "reimage" (--new) of dns3002 then
[07:39:37] ah, is that one installed already?
[07:39:47] nvm, misread what you wrote
[07:40:16] nope, it isn't :)
[07:41:32] hmmm it looks like we are missing the dns prod entries for dns3002
[07:41:38] * vgutierrez checking
[07:44:03] oh right
[07:44:15] L8 on my side, dns3002.wikimedia.org, not dns3002.esams.wmnet
[08:34:21] moritzm: the installation didn't fail on dns3002
[08:34:25] it went smoothly
[08:35:29] that at least rules out a generic d-i issue
[08:35:50] let me check logs on cp3035 whether I can find something
[08:35:58] cp3055
[08:35:59] :)
[08:36:07] ah, yes :-)
[08:42:29] yeah, I think I found the error: the new caches are configured to use the same partman recipe as cp1075-cp1090 (cp2018), but we also need to switch late_command.sh to do the nvme setup for them, patch incoming
[13:58:40] godog: you scooped me!
[14:03:00] cdanis: lol
[14:03:19] yeah one bogus availability alert too many
[14:03:33] I've been meaning to reenable paging for 50x for months now really
[14:03:44] very glad it is getting done
[14:05:36] yeah! thanks for fixing the underlying metric, I'm thinking some days after we have the alerts we can turn paging on and let it be
[14:06:11] I would be bold and turn it on right away, but I don't think you necessarily have to
[14:07:01] yeah why not
[14:07:10] brb
[14:12:24] also I just remembered I had a dream last night involving somehow changing check_prometheus_metric.py to evaluate prometheus alerting rules (which as it turns out would not be easy)
[14:12:27] sigh
[14:19:34] cdanis: that would be a dream indeed!
[14:19:50] although yeah quite the impedance mismatch there
[14:22:14] it wouldn't be too bad to define the alerting rules in the usual way and then check /api/v1/alerts
[14:22:22] but also, we could just set up alertmanager 🙃
[14:23:41] absolutely, this Q escalation took precedence but alertmanager is next in my mental agenda
[15:11:32] godog: I found this in my puppet repo git stash and I can't remember why I wrote it https://phabricator.wikimedia.org/P9473 does it ring any bells for you?
[15:11:40] did this come up earlier this week?
[15:13:20] cdanis: don't remember it this week but in general I think it makes sense since https is what ats uses to talk to swift anyways
[15:17:19] I could have sworn this fired for some reason or was somehow relevant
[15:17:23] but I can't even find it in IRC log
[15:17:26] it's been a long week
[15:28:04] yeah I seem to remember at some point we came across it too
[18:33:43] bstorm_: looking for a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/545550 before merging
[18:34:12] Will take a look in a sec. Have some fun with database crashes at the moment :)
[18:34:18] :) k np thank you
[22:47:06] today I was happy to learn that you can use ProxyPass together with ProxyRemote in apache2
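
Editor's note on the 14:22:14 idea of defining alerting rules in Prometheus and then checking /api/v1/alerts: a minimal sketch of what such a check could look like, assuming a Nagios/Icinga-style plugin that maps "alert is firing" to CRITICAL. This is not the existing check_prometheus_metric.py; the function names, the exit-code mapping, and the choice to match on the `alertname` label are all assumptions made for illustration. Only the /api/v1/alerts endpoint and the general shape of its JSON response come from the Prometheus HTTP API.

```python
import json
import urllib.request

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def firing_alerts(payload, alertname):
    """Return the firing alerts called `alertname` from a /api/v1/alerts response."""
    if payload.get("status") != "success":
        raise ValueError("unexpected Prometheus status: %r" % payload.get("status"))
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        alert
        for alert in alerts
        if alert.get("state") == "firing"
        and alert.get("labels", {}).get("alertname") == alertname
    ]


def check(payload, alertname):
    """Map a parsed API response to an (exit_code, message) pair."""
    try:
        firing = firing_alerts(payload, alertname)
    except ValueError as exc:
        return UNKNOWN, "UNKNOWN: %s" % exc
    if firing:
        return CRITICAL, "CRITICAL: %d firing instance(s) of %s" % (len(firing), alertname)
    return OK, "OK: %s is not firing" % alertname


def fetch_alerts(base_url, timeout=10):
    """GET <base_url>/api/v1/alerts and return the decoded JSON body."""
    url = base_url.rstrip("/") + "/api/v1/alerts"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def main(argv):
    """Hypothetical entry point: argv = [prog, prometheus_base_url, alertname]."""
    if len(argv) != 3:
        print("UNKNOWN: usage: %s <prometheus_base_url> <alertname>" % argv[0])
        return UNKNOWN
    try:
        payload = fetch_alerts(argv[1])
    except (OSError, ValueError) as exc:
        print("UNKNOWN: could not query Prometheus: %s" % exc)
        return UNKNOWN
    code, message = check(payload, argv[2])
    print(message)
    return code
```

To use it as a plugin, one would call `sys.exit(main(sys.argv))` from a wrapper script. Keeping `check` separate from `fetch_alerts` means the decision logic can be exercised with a canned response, without a running Prometheus. Pending alerts are treated as OK here; whether they should map to WARNING is a policy choice the log does not settle.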