[06:19:15] <elukey>	 good morning
[06:19:57] <elukey>	 I'd need to add druid100[7,8] to the lvs service druid-public-broker, IIUC it is sufficient to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/597918/ and let puppet do its job
[06:20:16] <elukey>	 (I checked the lo interface on the new hosts, they have the LVS IP)
[06:20:46] <elukey>	 anything else to do? (double checking to avoid a PEBCAK)
[06:53:55] <_joe_>	 elukey: you need to pool them afterwards, and set a weight
[06:57:38] <elukey>	 ack :)
[07:11:12] <_joe_>	 https://cdk8s.io/ ok I have to admit it. this is more or less what I thought a DSL for kubernetes should look like
[07:11:30] <_joe_>	 at the same time, the devil is as usual in the details
[07:13:16] <_joe_>	 for instance, I did appreciate the ability of helm to inject values at runtime. How will this work here?
[07:34:59] <ema>	 so writing code instead of config?
[07:35:25] <_joe_>	 s/config/yaml/
[07:35:27] <_joe_>	 but yes
[07:36:11] <ema>	 now we need a DSL for hiera!
[07:36:27] <_joe_>	 nah come on, hiera is a single-level yaml
[07:36:31] <_joe_>	 it's manageable
[07:36:51] <_joe_>	 have you ever taken the time to go look at the yaml of one of our k8s deployments?
[07:36:58] <ema>	 nope
[07:37:07] <ema>	 point me at one
[07:37:34] <jayme>	 ... down the rabbit hole :)
[07:38:25] <_joe_>	 $ . .hfenv && helm get production |wc -l
[07:38:27] <_joe_>	 1576
[07:38:31] <_joe_>	 sorry, the full path
[07:38:41] <_joe_>	 deploy1001:/srv/deployment-charts/helmfile.d/services/staging/changeprop$ . .hfenv && helm get production |wc -l
[07:39:17] <_joe_>	 and this doesn't have TLS termination, for instance
[07:40:49] <_joe_>	 the immediate issue I see with cdk8s is it's incompatible with helm
[07:41:03] <_joe_>	 and I kinda-like what helm does with deployments
[07:41:29] <jayme>	 it's just generating the yaml output but lacks the "management" part of helm?
[07:43:03] <ema>	 in 2 clicks I have ended up on "Chocolatey"
[07:43:18] * ema closes the window
[07:46:29] <_joe_>	 jayme: AIUI yes
[07:47:16] <_joe_>	 so it would mean we'd have to write our own deployment tool, or let people use kubectl directly
[09:03:12] <jayme>	 hm..that's not that cool. But maybe something grows around it with some time and we can revise then
[09:12:02] <_joe_>	 I mean the idea in itself is pretty good, but I need to try to use it
[09:12:34] <_joe_>	 the two things helm does well are: 1 - support multi-level injection of parameters in your definitions 2 - deployment
[09:14:56] <_joe_>	 I just discovered a gem in puppet
[09:15:22] <_joe_>	 basically if you define a class locally in a file where you generate a function
[09:15:28] <_joe_>	 it will be re-defined at every call
[09:16:03] <_joe_>	 because puppet ofc doesn't just include those files, it does its magic
[09:16:47] <_joe_>	 so for example I doubt the dns resolution cache in ipresolve has ever worked properly :/
[10:11:06] <paravoid>	 so re: that diffscan email, I'm curious, why is install1003 exposed to WMCS? is that by design or by accident?
[10:12:39] <paravoid>	 ah it's not just wmcs
[10:12:40] <mutante>	 paravoid: yesterday wmcs team reported they could not install a server from the cloudvirt VLAN.
[10:13:00] <paravoid>	 cloudvirt != cloud-instances though
[10:13:01] <mutante>	 i am about to limit it to DOMAIN_NETWORKS
[10:14:41] <mutante>	 it's also about the " install_servers (install*) should have a webserver like apt* servers and serve the tftp environment." so we can have install servers in POPs
[10:15:25] <mutante>	 adding the nginx was to unblock them while they were already in their maintenance window
[10:15:52] <paravoid>	 I'm not sure I understand
[10:15:54] <mutante>	 while debugging why they could not install from cloudvirt i saw firewall drop connections to port 80
[10:16:36] <mutante>	 paravoid: would you agree that limiting it to $DOMAIN_NETWORKS is the right thing?
[10:16:52] <mutante>	 because that's what i was about to upload
[10:17:16] <paravoid>	 I think so? haven't touched those things for a long time :)
[10:17:30] <paravoid>	 I'm curious if cloud-instances (i.e. VPSes) depend on anything from installNNNN
[10:18:35] <mutante>	 i don't think so
[10:19:11] <mutante>	 but the install* servers need to have a webserver and not just apt* servers
[10:19:40] <mutante>	 before the split into 2 roles it was all combined
[10:20:03] <paravoid>	 indeed - we switched from TFTP to HTTP as it makes it faster on high-latency links and easier to traverse firewalls/ACLs
[10:20:24] <mutante>	 yea, so first i thought it is actually TFTP and therefore no webserver needed on the new "light" install servers
[10:21:12] <mutante>	 then we talked about it some more and i opened a ticket to add a webserver so that they would be usable in POPs as well
[10:21:29] <mutante>	 then wmcs reported their install issue and i saw the dropped packets
[10:21:52] <mutante>	 that made me add nginx (as in "we were going to do that anyways") and it fixed their issue
[10:22:13] <mutante>	 now let me just limit it.. but imho it was the same in the past before we split stuff
[10:22:34] <paravoid>	 pxelinux.pathprefix in DHCP still points to apt.wikimedia.org though?
[10:22:44] <paravoid>	 how does that even work now :)
[10:23:05] <paravoid>	 and https://apt.wikimedia.org/tftpboot/buster-installer/ still exists
[10:24:32] <paravoid>	 gotta go, ttyl :)
[10:25:56] <mutante>	 yes, for some reason it worked for the eqiad/codfw VLANs and that still exists because it was all a quick workaround yesterday just to unblock their maintenance window.
[13:08:47] <_joe_>	 cdanis: let's talk here maybe
[13:08:54] <_joe_>	 so on authdns1001 I see
[13:09:13] <_joe_>	 at 14:51:48 puppet-agent says  Exec[systemd start for prometheus-nic-firmware-textfile.service]
[13:09:31] <_joe_>	 and systemd says May 21 14:51:48 authdns1001 systemd[1]: prometheus-nic-firmware-textfile.service: Succeeded.
[13:09:39] <cdanis>	 mmhmm
[13:09:54] <_joe_>	 and after that I see
[13:10:17] <cdanis>	 the three hosts I was looking at were: thanos-be2002 (new-ish reimage, but patch merged after the image; didn't work), authdns1001 (prom-nic-firmware run as part of reimage; didn't work), and dns1001 (prom-nic-firmware installed on existing Buster machine; worked)
[13:10:35] <_joe_>	 https://phabricator.wikimedia.org/P11285
[13:11:13] <_joe_>	 the systemd timer seems to have worked, but I see that when the timer fires, it says
[13:11:19] <_joe_>	 May 21 14:52:21 authdns1001 systemd[1]: Stopped Periodic execution of prometheus-nic-firmware-textfile.service.
[13:12:31] <cdanis>	 that's strange, and doesn't show on e.g. thanos-be2002
[13:12:43] <_joe_>	 no it does
[13:12:51] <_joe_>	 zgrep prometheus-nic-firmware-textfile.service /var/log/syslog.*.gz
[13:13:03] <_joe_>	 on may 19
[13:14:02] <cdanis>	 there's not the 'Stopped' message there
[13:15:06] <cdanis>	 sigh okay -- so on thanos-be2002 the timer runs every 5 minutes for an hour and change
[13:15:20] <cdanis>	 then the machine is rebooted, and starting the timer unit doesn't do anything by itself.
[13:16:50] <cdanis>	 authdns1001 was also rebooted shortly after the first run of the service unit
[13:17:07] <_joe_>	 bingo
[13:17:15] <cdanis>	 I've sent a patch
[13:17:20] <_joe_>	 so we need to add
[13:17:21] <_joe_>	 oh ok
[13:17:23] <_joe_>	 :D
[13:43:44] <cdanis>	 _joe_: I'm working on a 'proper' patch now; OnActiveSec solves our original problem 🤦
[13:44:00] <_joe_>	 you mean OnBootSec ?
[13:44:09] <cdanis>	 OnBootSec is necessary as well
[13:44:14] <cdanis>	 that solves _this_ problem
[13:44:27] <cdanis>	 OnActiveSec solves the problem of "you need to exec systemctl start on the service unit the first time"
[13:44:31] <cdanis>	 so we can clean this up quite a bit
[13:44:42] <_joe_>	 oh does it?
[13:44:48] <cdanis>	 yeah, I just tried it on my machine
[13:45:20] <cdanis>	 puppet already does a systemctl start on the timer, which will trigger OnActiveSec
[13:45:58] <_joe_>	 oh ok
[14:08:31] <kormat>	 mysql was a mistake. anyone have a time machine?
[14:10:12] <_joe_>	 kormat: I didn't take you for an optimist
[14:10:39] <_joe_>	 you really think that if you give the nerds the chance of doing it all over again we will end up any better?
[14:10:41] <kormat>	 it's the desperation talking, i assure you
[14:10:46] <kormat>	 hahah
[14:10:54] <_joe_>	 I mean, you could end up having to deal with postgres
[14:10:59] <_joe_>	 or worse.
[14:12:33] <cdanis>	 _joe_: is ... is there some introductory documentation to writing puppet spec tests you'd recommend?
[14:12:50] <cdanis>	 I naively wrote this:
[14:12:52] <cdanis>	           is_expected.to contain_systemd__timer('dummy-test')
[14:12:54] <_joe_>	 yes, there is a good tutorial, lemme find it
[14:12:54] <cdanis>	                            .with_content(/OnActiveSec=/)
[14:12:56] <cdanis>	 but that does not work
[14:13:18] <_joe_>	 yeah lemme see a sec
[14:13:31] <cdanis>	 the change looks good in PCC btw https://puppet-compiler.wmflabs.org/compiler1002/22707/mw1299.eqiad.wmnet/index.html
[14:13:49] <_joe_>	 so this https://en.wikipedia.org/wiki/Necronomicon is the best starting point to learn puppet spec testing
[14:14:14] <_joe_>	 as an alternative, you can try https://rspec-puppet.com/tutorial/
[14:14:25] <_joe_>	 so!
[14:15:01] <_joe_>	 with_content means "the systemd::timer resource has a parameter called 'content', whose value matches this regex"
[14:15:09] <cdanis>	 aha
[14:16:17] <_joe_>	 so, testing what you're trying to do can be done by testing precisely the array that ends in timer_intervals
[14:16:45] <_joe_>	 oh TIL .all is in puppet 5.5?
[14:16:55] <cdanis>	 *sigh*
[14:17:05] <_joe_>	 nice
[14:17:50] <_joe_>	 cdanis: so you're changing behaviour
[14:18:11] <_joe_>	 before you would add the systemctl start if we had one interval containing OnUnit...
[14:18:19] <_joe_>	 now you do so only if *all* of them do
[14:18:25] <_joe_>	 which seems more correct to me
[14:18:26] <cdanis>	 that's more correct
[14:18:35] <_joe_>	 as in general another timer interval will fire
[14:19:04] <cdanis>	 I mean, the examples for which it actually changes behavior are all kind of strange
[14:19:20] <cdanis>	 "I want this timer to fire on 00:00 May 1st 2021, and then every five minutes thereafter"
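The any-versus-all change discussed above can be sketched in Python. This is only an illustration of the logic, not the Puppet code from the patch: the function names, the empty-list guard, and the sample interval strings are assumptions, and the check simply tests for the `OnUnit` prefix quoted in the log.

```python
def needs_start_any(intervals):
    """Behaviour described as 'before': true if at least one
    interval starts with the 'OnUnit' prefix."""
    return any(i.startswith("OnUnit") for i in intervals)


def needs_start_all(intervals):
    """Behaviour described as 'now': true only if every interval
    starts with the 'OnUnit' prefix.

    The empty-list guard is an assumption of this sketch; Python's
    all() returns True for an empty iterable.
    """
    return bool(intervals) and all(i.startswith("OnUnit") for i in intervals)


# Hypothetical intervals modelled on the "strange" example in the log:
# one calendar-based interval plus one relative interval.
mixed = ["OnCalendar=2021-05-01 00:00:00", "OnUnitActiveSec=5min"]
relative_only = ["OnUnitActiveSec=5min"]

print(needs_start_any(mixed), needs_start_all(mixed))
print(needs_start_any(relative_only), needs_start_all(relative_only))
```

The two functions only disagree on mixed lists like `mixed`, which matches the remark that the cases where behaviour actually changes are unusual ones.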
[14:29:37] <kormat>	 anyone know why icinga would say "Check systemd state" is failing, but there are no failed units on the machine?
[14:32:20] <XioNoX>	 kormat: recovered before icinga had time to run?
[14:32:32] <kormat>	 it's been reporting this for 2d now
[14:32:47] <kormat>	 https://cas-icinga.wikimedia.org/icinga/images/export_link.png
[14:33:28] <kormat>	 er https://cas-icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=db2137
[14:34:30] <marostegui>	 kormat: there is one failed unit there
[14:34:37] <marostegui>	 ● prometheus-mysqld-exporter.service loaded failed failed Prometheus exporter for MySQL server
[14:34:46] <marostegui>	 so you might want to disable that and reset it
[14:35:38] * kormat blinks
[14:36:17] <kormat>	 https://phabricator.wikimedia.org/P11291 - following the instructions from https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state doesn't show it
[14:36:23] <mutante>	 kormat: most of the time with these alerts a "systemctl reset-failed" clears it
[14:36:53] <kormat>	 aaand i'm logged into the wrong machine
[14:36:54] <volans>	 I just use systemctl and look at the output
[14:36:57] <volans>	 ahahah
[14:37:09] <kormat>	 db2073 != db2173 *sobs*
[14:37:10] <volans>	 that might do the trick :D
[14:37:21] <marostegui>	 XDD
[14:38:11] <cdanis>	 ahaha
[14:38:31] <kormat>	 (in fact, db2073 != db2137. it's even worse)
[14:39:55] <kormat>	 fixed. i'd like to thank everyone for their help in this embarrassing time.
[14:40:24] <cdanis>	 _joe_: okay, issues addressed, let me know about what I copypastaed in the spec test https://gerrit.wikimedia.org/r/c/operations/puppet/+/598050
[14:40:36] <marostegui>	 kormat: https://jynus.com/gif/cheers.gifv
[14:40:45] <kormat>	 :D
[14:53:38] <kormat>	 cdanis: i like how dbctl instance X edit will throw away all your changes if it can't parse anything
[14:53:54] <cdanis>	 kormat: I'm sorry about that, and also, it's not trivial to fix :(
[14:54:11] <cdanis>	 hm maybe it wouldn't be too hard to save a backup
[14:54:27] <kormat>	 it creates a tempfile - couldn't that be retained if something fails?
[14:54:46] <cdanis>	 the complicated things about the code path are all self-inflicted
[14:54:54] * kormat grins
[14:55:51] <kormat>	 i guess i should get in the habit of doing `:w a` while in the editor
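The idea of retaining the tempfile when parsing fails could look roughly like the Python sketch below. It is not dbctl's code: the names `edit_with_backup` and `EditRejected`, the `edit_fn` callback standing in for launching an editor, and the use of JSON as the format are all assumptions made for illustration.

```python
import json
import os
import tempfile


class EditRejected(Exception):
    """Raised when the edited text does not parse; `path` points at
    the retained tempfile so the user's changes are not lost."""

    def __init__(self, path, cause):
        super().__init__(f"edit did not parse, kept at {path}: {cause}")
        self.path = path


def edit_with_backup(original_text, edit_fn, parse_fn=json.loads):
    # Write the current content to a tempfile for editing.
    fd, path = tempfile.mkstemp(suffix=".edit")
    with os.fdopen(fd, "w") as fh:
        fh.write(original_text)

    edit_fn(path)  # stand-in for running the user's editor on the file

    with open(path) as fh:
        edited = fh.read()
    try:
        parsed = parse_fn(edited)
    except ValueError as exc:
        # Parsing failed: leave the tempfile on disk and report where it is.
        raise EditRejected(path, exc) from exc

    # Parsing succeeded: the tempfile is no longer needed.
    os.unlink(path)
    return parsed
```

A caller would catch `EditRejected`, print `exc.path`, and let the user reopen that file instead of starting over.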
[14:57:32] <cdanis>	 kormat: unrelatedly https://wikitech.wikimedia.org/wiki/Dbctl#Schema_upgrades
[14:58:03] <kormat>	 ohno
[15:09:08] <_joe_>	 kormat: you're welcome to add a schema-update command to dbctl though
[15:09:23] <_joe_>	 cdanis: so there is a secret I didn't tell you about puppet specs
[15:09:58] <_joe_>	 I cargo-cult it as well. And y'all (with the exception of alex and john) copy from my cargo-cult
[15:10:29] <_joe_>	 at times, we have religious mergers where two cargo-cults are joined in one that looks vaguely more appropriate
[15:10:42] <_joe_>	 think of it like "partman, but in ruby"
[15:17:52] <cdanis>	 _joe_: sigh, I thought my patch would also fix the machines that had gotten rebooted
[15:18:06] <cdanis>	 but the timer unit has already 'started' there
[15:18:20] <_joe_>	 yeah
[15:18:25] <_joe_>	 you have to restart them
[15:18:29] <cdanis>	 I am just going to cumin it, yeah
[15:18:42] <_joe_>	 "to cumin"
[15:26:57] <cdanis>	 _joe_: mw[2271-2272].codfw.wmnet still have puppet disabled -- should I reenable?
[15:27:08] <_joe_>	 not sure why, lemme check
[15:27:21] <cdanis>	 "switch tls to envoy --joe"
[15:27:34] <_joe_>	 heh this means the switch didn't happen on those servers
[15:27:48] <_joe_>	 lemme fix that
[15:42:24] <jbond42>	 cdanis: not necessarily a recommendation but i use https://github.com/nwops/puppet-retrospec with a slightly modified template https://github.com/b4ldr/retrospec-templates, for generating the rspec boilerplate.  it's not pretty but can save some of the copy pasting
[15:42:40] <cdanis>	 thanks!
[17:52:38] <rzl>	 trying to route https://phabricator.wikimedia.org/T252826, but I don't even know for sure what part of the stack handles CORS headers, anyone have a pointer?
[17:52:42] <rzl>	 bblack maybe? ^
[17:53:33] <rzl>	 I did find https://gerrit.wikimedia.org/g/mediawiki/core/+/master/includes/api/ApiMain.php#749 but X-Wikimedia-Debug ought to be allowed, by my reading
[17:53:40] <rzl>	 (https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/master/wmf-config/CommonSettings.php#265)
[17:53:57] <rzl>	 so I wonder if there are other moving parts, in Varnish or elsewhere
[18:06:23] <bd808>	 rzl: random guess is -- https://github.com/wikimedia/restbase/blob/f1d9547e1a1258c7a231e70b10ea8aa677965e61/lib/security_response_header_filter.js#L70-L78 -- which would be CPT.
[18:07:56] <bd808>	 CORS handling is pretty twisted across many layers and Seddon seems to be talking about 2 different backends that he's struggling with.
[18:09:10] <rzl>	 I hadn't even thought about restbase but yeah of course, good find
[18:09:16] <rzl>	 I'll try that and see if it gets anywhere
[18:09:19] <rzl>	 thanks!