[06:19:15] good morning
[06:19:57] I'd need to add druid100[7,8] to the lvs service druid-public-broker, IIUC it is sufficient to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/597918/ and let puppet do its job
[06:20:16] (I checked the lo interface on the new hosts, they have the LVS IP)
[06:20:46] anything else to do? (double checking to avoid a PEBCAK)
[06:53:55] <_joe_> elukey: you need to pool them afterwards, and set a weight
[06:57:38] ack :)
[07:11:12] <_joe_> https://cdk8s.io/ ok I have to admit it. this is more or less what I thought a DSL for kubernetes should look like
[07:11:30] <_joe_> at the same time, the devil is as usual in the details
[07:13:16] <_joe_> for instance, I did appreciate the ability of helm to inject values at runtime. How will this work here?
[07:34:59] so writing code instead of config?
[07:35:25] <_joe_> s/config/yaml/
[07:35:27] <_joe_> but yes
[07:36:11] now we need a DSL for hiera!
[07:36:27] <_joe_> nah come on, hiera is a single-level yaml
[07:36:31] <_joe_> it's manageable
[07:36:51] <_joe_> have you ever taken the time to go look at the yaml of one of our k8s deployments?
[07:36:58] nope
[07:37:07] point me at one
[07:37:34] ... down the rabbit hole :)
[07:38:25] <_joe_> $ . .hfenv && helm get production |wc -l
[07:38:27] <_joe_> 1576
[07:38:31] <_joe_> sorry, the full path
[07:38:41] <_joe_> deploy1001:/srv/deployment-charts/helmfile.d/services/staging/changeprop$ . .hfenv && helm get production |wc -l
[07:39:17] <_joe_> and this doesn't have TLS termination, for instance
[07:40:49] <_joe_> the immediate issue I see with cdk8s is it's incompatible with helm
[07:41:03] <_joe_> and I kinda-like what helm does with deployments
[07:41:29] it's just generating the yaml output but lacks the "management" part of helm?
[07:43:03] in 2 clicks I have ended up on "Chocolatey"
[07:43:18] * ema closes the window
[07:46:29] <_joe_> jayme: AIUI yes
[07:47:16] <_joe_> so it would mean we'd have to write our own deployment tool, or let people use kubectl directly
[09:03:12] hm..that's not that cool. But maybe something grows around it with some time and we can revise then
[09:12:02] <_joe_> I mean the idea in itself is pretty good, but I need to try to use it
[09:12:34] <_joe_> the two things helm does well are: 1 - support multi-level injection of parameters in your definitions 2 - deployment
[09:14:56] <_joe_> I just discovered a gem in puppet
[09:15:22] <_joe_> basically if you define a class locally in a file where you generate a function
[09:15:28] <_joe_> it will be re-defined at every call
[09:16:03] <_joe_> because puppet ofc doesn't just include those files, it does its magic
[09:16:47] <_joe_> so for example I doubt the dns resolution cache in ipresolve has ever worked properly :/
[10:11:06] so re: that diffscan email, I'm curious, why is install1003 exposed to WMCS? is that by design or by accident?
[10:12:39] ah it's not just wmcs
[10:12:40] paravoid: yesterday wmcs team reported they could not install a server from the cloudvirt VLAN.
[10:13:00] cloudvirt != cloud-instances though
[10:13:01] i am about to limit it to DOMAIN_NETWORKS
[10:14:41] it's also about the " install_servers (install*) should have a webserver like apt* servers and serve the tftp environment." so we can have install servers in POPs
[10:15:25] adding the nginx was to unblock them while they were already in their maintenance window
[10:15:52] I'm not sure I understand
[10:15:54] while debugging why they could not install from cloudvirt i saw firewall drop connections to port 80
[10:16:36] paravoid: would you agree that limiting it to $DOMAIN_NETWORKS is the right thing?
[10:16:52] because that's what i was about to upload
[10:17:16] I think so? haven't touched those things for a long time :)
[10:17:30] I'm curious if cloud-instances (i.e. VPSes) depend on anything from installNNNN
[10:18:35] i don't think so
[10:19:11] but the install* servers need to have a webserver and not just apt* servers
[10:19:40] before the split into 2 roles it was all combined
[10:20:03] indeed - we switched from TFTP to HTTP as it makes it faster on high-latency links and easier to traverse firewalls/ACLs
[10:20:24] yea, so first i thought it is actually TFTP and therefore no webserver needed on the new "light" install servers
[10:21:12] then we talked about it some more and i opened a ticket to add a webserver so that they would be usable in POPs as well
[10:21:29] then wmcs reported their install issue and i saw the dropped packets
[10:21:52] that made me add nginx (as in "we were going to do that anyways") and it fixed their issue
[10:22:13] now let me just limit it.. but imho it was the same in the past before we split stuff
[10:22:34] pxelinux.pathprefix in DHCP still points to apt.wikimedia.org though?
[10:22:44] how does that even work now :)
[10:23:05] and https://apt.wikimedia.org/tftpboot/buster-installer/ still exists
[10:24:32] gotta go, ttyl :)
[10:25:56] yes, for some reason it worked for the eqiad/codfw VLANs and that still exists because it was all a quick workaround yesterday just to unblock their maintenance window.
[13:08:47] <_joe_> cdanis: let's talk here maybe
[13:08:54] <_joe_> so on authdns1001 I see
[13:09:13] <_joe_> at 14:51:48 puppet-agent says Exec[systemd start for prometheus-nic-firmware-textfile.service]
[13:09:31] <_joe_> and systemd says May 21 14:51:48 authdns1001 systemd[1]: prometheus-nic-firmware-textfile.service: Succeeded.
[13:09:39] mmhmm
[13:09:54] <_joe_> and after that I see
[13:10:17] the three hosts I was looking at were: thanos-be2002 (new-ish reimage, but patch merged after the image; didn't work), authdns1001 (prom-nic-firmware run as part of reimage; didn't work), and dns1001 (prom-nic-firmware installed on existing Buster machine; worked)
[13:10:35] <_joe_> https://phabricator.wikimedia.org/P11285
[13:11:13] <_joe_> the systemd timer seems to have worked , but I see that when the timer fires, it says
[13:11:19] <_joe_> May 21 14:52:21 authdns1001 systemd[1]: Stopped Periodic execution of prometheus-nic-firmware-textfile.service.
[13:12:31] that's strange, and doesn't show on e.g. thanos-be2002
[13:12:43] <_joe_> no it does
[13:12:51] <_joe_> zgrep prometheus-nic-firmware-textfile.service /var/log/syslog.*.gz
[13:13:03] <_joe_> on may 19
[13:14:02] there's not the 'Stopped' message there
[13:15:06] sigh okay -- so on thanos-be2002 the timer runs every 5 minutes for an hour and change
[13:15:20] then the machine is rebooted, and starting the timer unit doesn't do anything by itself.
[13:16:50] authdns1001 was also rebooted shortly after the first run of the service unit
[13:17:07] <_joe_> bingo
[13:17:15] I've sent a patch
[13:17:20] <_joe_> so we need to add
[13:17:21] <_joe_> oh ok
[13:17:23] <_joe_> :D
[13:43:44] _joe_: I'm working on a 'proper' patch now; OnActiveSec solves our original problem 🤦
[13:44:00] <_joe_> you mean OnBootSec ?
[13:44:09] OnBootSec is necessary as well
[13:44:14] that solves _this_ problem
[13:44:27] OnActiveSec solves the problem of "you need to exec systemctl start on the service unit the first time"
[13:44:31] so we can clean this up quite a bit
[13:44:42] <_joe_> oh does it?
[13:44:48] yeah, I just tried it on my machine
[13:45:20] puppet already does a systemctl start on the timer, which will trigger OnActiveSec
[13:45:58] <_joe_> oh ok
[14:08:31] mysql was a mistake. anyone have a time machine?
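A rough sketch of the timer settings discussed between 13:15 and 13:45. In systemd, OnUnitActiveSec= counts from the last activation of the service unit, OnActiveSec= counts from the moment the timer unit itself is started, and OnBootSec= counts from boot. A timer with only OnUnitActiveSec= has nothing to count from until the service has run once, which matches what was seen after the reboots. The Python helper below only renders such a [Timer] section as text; the function name, its parameters and the 60s delay are assumptions made up for illustration, not the actual puppet template.

```python
def render_timer_section(service_unit, interval_sec, first_run_delay_sec=60):
    """Render the [Timer] section of a systemd timer unit as text.

    Hypothetical helper: names and default values are illustrative only.
    """
    lines = [
        "[Timer]",
        f"Unit={service_unit}",
        # First trigger relative to the timer unit being started
        # (e.g. by the systemctl start that puppet runs on the timer).
        f"OnActiveSec={first_run_delay_sec}s",
        # First trigger relative to machine boot.
        f"OnBootSec={first_run_delay_sec}s",
        # Repeat relative to the last activation of the service unit.
        f"OnUnitActiveSec={interval_sec}s",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 300s corresponds to the "every 5 minutes" mentioned in the log.
    print(render_timer_section("prometheus-nic-firmware-textfile.service", 300))
```

With all three settings present, the service gets a first run without an explicit systemctl start on the service unit, and the OnUnitActiveSec= interval then keeps it repeating.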
[14:10:12] <_joe_> kormat: I didn't make you an optimist
[14:10:39] <_joe_> you really think that if you give the nerds the chance of doing it all over again we will end up any better?
[14:10:41] it's the desperation talking, i assure you
[14:10:46] hahah
[14:10:54] <_joe_> I mean, you could end up having to deal with postgres
[14:10:59] <_joe_> or worse.
[14:12:33] _joe_: is ... is there some introductory documentation to writing puppet spec tests you'd recommend?
[14:12:50] I naively wrote this:
[14:12:52] is_expected.to contain_systemd__timer('dummy-test')
[14:12:54] <_joe_> yes, there is a good tutorial, lemme find it
[14:12:54] .with_content(/OnActiveSec=/)
[14:12:56] but that does not work
[14:13:18] <_joe_> yeah lemme see a sec
[14:13:31] the change looks good in PCC btw https://puppet-compiler.wmflabs.org/compiler1002/22707/mw1299.eqiad.wmnet/index.html
[14:13:49] <_joe_> so this https://en.wikipedia.org/wiki/Necronomicon is the best starting point to learn puppet spec testing
[14:14:14] <_joe_> as an alternative, you can try https://rspec-puppet.com/tutorial/
[14:14:25] <_joe_> so!
[14:15:01] <_joe_> with_content means "the resource named systemd::timer has a parameter called 'content', whose value contains this regex"
[14:15:09] aha
[14:16:17] <_joe_> so, testing what you're trying to do can be done by testing precisely the array that ends in timer_intervals
[14:16:45] <_joe_> oh TIL .all is in puppet 5.5?
[14:16:55] *sigh*
[14:17:05] <_joe_> nice
[14:17:50] <_joe_> cdanis: so you're changing behaviour
[14:18:11] <_joe_> before you would add the systemctl start if we had one interval containing OnUnit...
[14:18:19] <_joe_> now you do so only if *all* of them do
[14:18:25] <_joe_> which seems more correct to me
[14:18:26] that's more correct
[14:18:35] <_joe_> as in general another timer interval will fire
[14:19:04] I mean, the examples for which it actually changes behavior are all kind of strange
[14:19:20] "I want this timer to fire on 00:00 May 1st 2021, and then every five minutes thereafter"
[14:29:37] anyone know why icinga would say "Check systemd state" is failing, but there are no failed units on the machine?
[14:32:20] kormat: recovered before icinga had time to run?
[14:32:32] it's been reporting this for 2d now
[14:32:47] https://cas-icinga.wikimedia.org/icinga/images/export_link.png
[14:33:28] er https://cas-icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=db2137
[14:34:30] kormat: there is one failed unit there
[14:34:37] ● prometheus-mysqld-exporter.service loaded failed failed Prometheus exporter for MySQL server
[14:34:46] so you might want to disable that and reset it
[14:35:38] * kormat blinks
[14:36:17] https://phabricator.wikimedia.org/P11291 - following the instructions from https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state doesn't show it
[14:36:23] kormat: most of the times with these alerts an "systemctl reset-failed" clears it
[14:36:53] aaand i'm logged into the wrong machine
[14:36:54] I just use systemctl and look at the output
[14:36:57] ahahah
[14:37:09] db2073 != db2173 *sobs*
[14:37:10] that might do the trick :D
[14:37:21] XDD
[14:38:11] ahaha
[14:38:31] (in fact, db2073 != db2137. it's even worse)
[14:39:55] fixed. i'd like to thank everyone for their help in this embarrassing time.
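The behaviour change discussed from 14:17 to 14:19 boils down to `any` versus `all`. A minimal Python sketch, assuming each timer interval is a dict whose `start` key names the systemd directive; this data shape and the function names are hypothetical, and the real logic lives in the puppet code under review, not here.

```python
def is_unit_relative(interval):
    """True for directives such as OnUnitActiveSec / OnUnitInactiveSec."""
    return interval["start"].startswith("OnUnit")


def needs_explicit_start_old(intervals):
    # Old behaviour as described in the log: one OnUnit... interval
    # was enough to add the explicit systemctl start.
    return any(is_unit_relative(i) for i in intervals)


def needs_explicit_start_new(intervals):
    # New behaviour: only when every interval is OnUnit..., because
    # otherwise another interval will fire and run the service anyway.
    # The bool() guard avoids all([]) being True for an empty list.
    return bool(intervals) and all(is_unit_relative(i) for i in intervals)


if __name__ == "__main__":
    # The "strange" example from the log: fire once at a calendar time,
    # then every five minutes thereafter.
    mixed = [
        {"start": "OnCalendar", "interval": "2021-05-01 00:00:00"},
        {"start": "OnUnitActiveSec", "interval": "5min"},
    ]
    print(needs_explicit_start_old(mixed), needs_explicit_start_new(mixed))
```

For timers that only use OnUnit... intervals, both versions agree; they differ only for mixed cases like the one above.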
[14:40:24] _joe_: okay, issues addressed, let me know about what I copypastaed in the spec test https://gerrit.wikimedia.org/r/c/operations/puppet/+/598050
[14:40:36] kormat: https://jynus.com/gif/cheers.gifv
[14:40:45] :D
[14:53:38] cdanis: i like how dbctl instance X edit will throw away all your changes if it can't parse anything
[14:53:54] kormat: I'm sorry about that, and also, it's not trivial to fix :(
[14:54:11] hm maybe it wouldn't be too hard to save a backup
[14:54:27] it creates a tempfile - couldn't that be retained if something fails?
[14:54:46] the complicated things about the code path are all self-inflicted
[14:54:54] * kormat grins
[14:55:51] i guess i should get in the habit of doing `:w a` while in the editor
[14:57:32] kormat: unrelatedly https://wikitech.wikimedia.org/wiki/Dbctl#Schema_upgrades
[14:58:03] ohno
[15:09:08] <_joe_> kormat: you're welcome to add a schema-update command to dbctl though
[15:09:23] <_joe_> cdanis: so there is a secret I didn't tell you about puppet specs
[15:09:58] <_joe_> I cargo-cult it as well. And y'all (with the exception of alex and john) copy from my cargo-cult
[15:10:29] <_joe_> at times, we have religious mergers where two cargo-cults are joined in one that looks vaguely more appropriate
[15:10:42] <_joe_> think of it like "partman, but in ruby"
[15:17:52] _joe_: sigh, I thought my patch would also fix the machines that had gotten rebooted
[15:18:06] but the timer unit has already 'started' there
[15:18:20] <_joe_> yeah
[15:18:25] <_joe_> you have to restart them
[15:18:29] I am just going to cumin it, yeah
[15:18:42] <_joe_> "to cumin"
[15:26:57] _joe_: mw[2271-2272].codfw.wmnet still have puppet disabled -- should I reenable?
[15:27:08] <_joe_> not sure why, lemme check
[15:27:21] "switch tls to envoy --joe"
[15:27:34] <_joe_> heh this means the switch didn't happen on those servers
[15:27:48] <_joe_> lemme fix that
[15:42:24] cdanis: not necessarily a recommendation but i use https://github.com/nwops/puppet-retrospec with a slightly modified template https://github.com/b4ldr/retrospec-templates, for generating the rspec boilerplate. it's not pretty but can save some of the copy pasting
[15:42:40] thanks!
[17:52:38] trying to route https://phabricator.wikimedia.org/T252826, but I don't even know for sure what part of the stack handles CORS headers, anyone have a pointer?
[17:52:42] bblack maybe? ^
[17:53:33] I did find https://gerrit.wikimedia.org/g/mediawiki/core/+/master/includes/api/ApiMain.php#749 but X-Wikimedia-Debug ought to be allowed, by my reading
[17:53:40] (https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/master/wmf-config/CommonSettings.php#265)
[17:53:57] so I wonder if there are other moving parts, in Varnish or elsewhere
[18:06:23] rzl: random guess is -- https://github.com/wikimedia/restbase/blob/f1d9547e1a1258c7a231e70b10ea8aa677965e61/lib/security_response_header_filter.js#L70-L78 -- which would be CPT.
[18:07:56] CORS handling is pretty twisted across many layers and Seddon seems to be talking about 2 different backends that he's struggling with.
[18:09:10] I hadn't even thought about restbase but yeah of course, good find
[18:09:16] I'll try that and see if it gets anywhere
[18:09:19] thanks!
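On the 14:53 to 14:55 exchange about `dbctl instance X edit` discarding changes when parsing fails: one possible shape for "retain the tempfile if something fails", sketched in Python. This is not dbctl's code. `edit_with_backup`, `EditFailed`, the `edit` callback and the choice of JSON as the format are all assumptions for illustration.

```python
import json
import os
import tempfile


class EditFailed(Exception):
    """Raised when the edited text cannot be parsed; the file is kept."""

    def __init__(self, path, cause):
        super().__init__(f"could not parse edit, your changes are in {path}: {cause}")
        self.path = path


def edit_with_backup(initial_text, edit, parse=json.loads):
    """Write initial_text to a tempfile, let edit(path) modify it, parse it.

    On success the tempfile is removed and the parsed object returned.
    On a parse error the tempfile is left on disk so nothing is lost.
    """
    fd, path = tempfile.mkstemp(suffix=".edit")
    with os.fdopen(fd, "w") as f:
        f.write(initial_text)

    edit(path)  # hypothetical callback, e.g. one that runs $EDITOR on path

    with open(path) as f:
        edited = f.read()
    try:
        result = parse(edited)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError
        raise EditFailed(path, exc) from exc

    os.unlink(path)
    return result
```

A caller would catch `EditFailed`, print the message, and could offer to reopen the retained file instead of starting from scratch.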