[01:13:55] effie: can you review that https://wikitech.wikimedia.org/wiki/Debugging_in_production#Directly is acceptable/correct/recommended as a way to reach mw-debug and mw-experimental over HTTP?
[01:14:09] i.e. when debugging cache headers behind the CDN etc
[01:14:55] I've done this a few times but the docs still mention mwdebug. This seems to work, but if there's an easier/different way, edit as you wish :)
[10:18:05] XioNoX, Emperor: daniel is deploying a change that may or may not cause problems :)
[10:21:20] ack 🤞
[10:22:24] godspeed!
[10:28:31] Emperor, XioNoX: FYI, I'm repooling puppetserver1002 in a few minutes; it had caused puppet failures before, but we made some config changes and are cautiously optimistic that it now works fine again
[10:28:54] noted, thx
[10:31:03] Krinkle: I will have a look sometime this week, primarily because I find that we can simplify some things
[10:31:28] Krinkle: overall, many thanks for keeping this current
[13:28:24] did someone already merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1277508 ? I don't see it when I run `puppet-merge` on puppetserver
[13:28:58] it's in the git log, so I guess so ;)
[13:34:41] inflatador: BTW, there's one wdqs role still on the old discovery intermediate: wdqs::internal
[13:34:51] and also wcqs::public
[13:43:11] inflatador: o/
[13:44:45] moritzm: thanks. I don't think `wdqs::internal` is used anymore since we split the graph, but will definitely fix `wcqs::public`
[13:45:01] o/
[13:45:23] ah, ok. let's nuke the role then, to avoid further confusion :-)
[13:45:32] I checked wdqs1026 earlier on; IIUC from the puppet logs the CSR issue is not there anymore, the "only" weirdness is that the envoy config is not re-created
[13:45:49] or have I missed something? I noticed your comment about having to do the same procedure as yesterday
[13:46:42] elukey: that's right, there is config in `envoy.yaml` that is not automatically refreshed by puppet, I think this is related to the services proxy as opposed to the TLS terminator
[13:47:48] inflatador: ah ok, I didn't check puppet, but what do you mean by services proxy vs TLS terminator?
[13:47:52] * elukey checks
[13:47:55] https://phabricator.wikimedia.org/P91477 more details here
[13:48:26] I could be wrong about it being part of the services proxy
[13:48:37] but it is not in the TLS terminator config, it's directly in envoy.yaml
[13:49:45] ah ok, I didn't get what you were referring to; you use profile::tlsproxy::envoy on wdqs
[13:49:58] yeah, sometimes the envoy.yaml is not rebuilt, there is an exec about it
[13:50:03] you may be hitting some weird corner case
[13:50:18] I ran into the same for apt-staging
[13:50:51] and now that I'm checking closer, also for debmonitor; I'll force a rebuild of these
[13:51:02] there is `/etc/envoy/listeners.d/00-tls_terminator_443.yaml` which is correctly updated, and `/etc/envoy/envoy.yaml` which is not correctly updated
[13:52:18] based on the error about JSON and YAML in the above paste, maybe it has something to do with Puppet's JSON->YAML conversion not properly detecting changes? I wanna say I've seen that with Ansible before
[13:53:40] and it definitely works if you remove the dir and force puppet to re-render everything
[13:53:59] nono, the main issue is that the envoy.yaml config is "peculiar": we create an empty file via puppet, and then use a build-config script that creates the real config
[13:54:08] for some reason that step is not triggered
[13:57:58] running `build-envoy-config -c` manually correctly generates the config with the new cert, so the bug is somewhere in this exec not being triggered
[13:58:11] correct yes, this is my understanding
[13:58:26] if you remove the admin yaml file it works as well, puppet triggers the rebuild
[13:59:49] currently we only have a notify on the yaml files; we could add a notify on /etc/envoy/ssl, that would trigger a rewrite once cfssl writes the new certs there?
[13:59:51] I had a thing recently with the build-envoy-config script, which is not triggered if you delete the envoy.yaml config expecting puppet to recreate it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275827
[14:04:19] * inflatador wonders if Sophroid could manage the envoys on baremetal
[14:05:16] arnaudb: no idea about "tag", never used it, but we could be safer and just keep `Exec['verify-envoy-config']`?
[14:05:59] IIUC you added it also to the envoy.yaml file resource, which is empty at first and then gets populated. I feel that adding `Exec['verify-envoy-config']` would break the first puppet run somehow, triggering it too early. wdyt?
[14:06:17] but it would solve the actual issue of not being able to re-create the file if removed
[14:06:55] elukey: unless I'm misunderstanding (very possible), I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275827/8/modules/envoyproxy/manifests/init.pp#127 means that the tag effectively *is* a notify on Exec['verify-envoy-config']
[14:07:02] File <| tag == 'envoyproxy::verify_config' |> ~> Exec['verify-envoy-config']
[14:08:12] cdanis: yeah yeah, I was suggesting to play it safe for this use case and avoid adding "tags" if we've never used them before
[14:08:33] we can introduce them later
[14:09:24] elukey: tell me when you have some time so I can merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1277506 when you are around
[14:10:20] jynus: you can proceed! thanks!
[14:10:25] doing
[14:10:26] elukey: https://codesearch.wmcloud.org/puppet/?q=%3C.*tag.*%3E.*%7E%3E&files=&excludeFiles=&repos=
[14:10:49] on the positive side, looks like the changes worked flawlessly on cirrussearch (which doesn't use envoy)
[14:11:47] nice :)
[14:12:09] {◕ ◡ ◕}
[14:12:16] so yesterday I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1278480 as an attempt to fix the envoy.yaml file re-creation, but now I think it may be working against it
[14:12:38] because verify (and hence the rebuild) doesn't trigger if there is a non-empty file on disk
[14:12:51] like when we are changing the config
[14:13:14] so since it doesn't really work, I'd just revert it
[14:14:54] the other one, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1277452, should have fixed the issue with the .csr file not being there before the build-config script ran (another problem that Brian discovered)
[14:14:59] wdyt?
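[Editor's sketch: to make the pattern being debated above concrete, here is a minimal Puppet illustration of a refreshonly exec fed by tagged file resources, matching the collector line quoted at 14:07:02. The exec title, tag name, and collector line come from the chat; the file attributes and paths are assumptions for illustration, not the actual modules/envoyproxy code.]

```puppet
# Minimal sketch, not the real modules/envoyproxy/manifests/init.pp.
# The exec rebuilds envoy.yaml, but only when another resource notifies
# it -- which is why a cert renewed on disk without a changed .yaml
# file never triggered a rebuild.
exec { 'verify-envoy-config':
    command     => '/usr/local/sbin/build-envoy-config -c /etc/envoy',
    refreshonly => true,  # never runs on a plain puppet run
}

# Config fragments carry a tag instead of an explicit notify:
file { '/etc/envoy/listeners.d/00-tls_terminator_443.yaml':
    source => 'puppet:///modules/envoyproxy/tls_terminator.yaml',  # illustrative path
    tag    => 'envoyproxy::verify_config',
}

# The collector realizes the relationship for every tagged file at once,
# equivalent to putting `notify => Exec['verify-envoy-config']` on each:
File <| tag == 'envoyproxy::verify_config' |> ~> Exec['verify-envoy-config']

# The alternative floated at 13:59:49 would extend the notify to the
# puppet-managed cert files, e.g. (hypothetical resource):
# file { '/etc/envoy/ssl/tls.crt': ..., notify => Exec['verify-envoy-config'] }
```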
[14:17:50] I feel that we'll have to trigger a manual envoy config rebuild anyway on all the hosts on which puppet already ran
[14:18:06] so `build-envoy-config -c /etc/envoy && systemctl restart envoyproxy.service`
[14:18:28] better: `depool && build-envoy-config -c /etc/envoy && systemctl restart envoyproxy.service && pool`
[14:25:55] +1 from me on that
[14:29:50] yeah, sounds good to me
[14:30:04] all right, merging and running with puppet disabled, just-to-be-sure(tm)
[14:45:14] ok, puppet re-enabled, tested on ~50 hosts and it was a no-op
[14:46:49] so I just tried `cumin -m async 'wdqs1026*' 'depool' '/usr/local/sbin/build-envoy-config -c /etc/envoy' 'systemctl restart envoyproxy' 'pool'` and it worked
[14:46:58] inflatador: I think we have to run it on all wdqs hosts
[14:47:04] does it make sense?
[14:47:45] IIUC we can just use wdqs*, but maybe some cluster is still not updated yet, lemme know
[14:47:48] I can kick it off
[14:48:31] elukey: does that mean the Puppet code works, or will it still need the manual command as above?
[14:49:01] either way I am fine w/ you running that against all WDQS hosts
[14:49:06] or I can do it if you prefer
[14:50:23] I'll backfill the config on debmonitor and apt-staging
[14:52:42] done
[14:54:45] inflatador: so the new puppet code may be ok, but I'd need to test it, not super sure. In either case, only puppet runs that change the intermediate will benefit from it, not existing ones :( so yes, for your hosts you'll have to run the command above
[14:55:32] if you have time and you want to do it, please go ahead, I'd be grateful (I have my son running around home at the moment, home early from the kindergarten :D)
[14:55:38] I'll switch Puppet to validate that it now works for a new host
[14:55:53] Puppetboard, I meant
[14:57:02] thanks! I still feel something is missing
[14:58:06] we'll see shortly :-)
[14:58:39] elukey: yes, I will do it and report back; might be a while as I have 2h of mtgs. If I can do anything else to help, please LMK
[15:07:10] elukey, inflatador: with the revert in place it works now for newly converted hosts
[15:07:30] puppetboard2003 picked up the 2026 cert in /etc/envoy/envoy.yaml
[15:07:38] \o/
[15:07:49] ok, so yesterday I fixed one thing and broke another one
[15:07:56] not bad
[15:07:56] yeah :-)
[15:07:59] and Envoy got restarted as well
[15:08:09] with the multi-layered epoch restarts they do
[15:08:33] public shame: I should have checked and tested with Pontoon, this is a good use case for it and I didn't do it because I was sure about the fix.
[15:08:42] will do next time
[15:08:54] I'll create tasks for the sub-teams whose remaining tlsproxy::envoy roles need to be converted
[15:09:00] I am making sure to run puppet, check the config update is applied, restart services just in case that is not triggered automatically, then check the service; would anything else be needed?
[15:09:36] jynus: to be sure, check with openssl that the new intermediate is picked up
[15:09:48] do you have the oneliner?
[15:09:48] but apart from that, all good! thanks
[15:10:11] you can use `openssl s_client -connect ip:port | openssl x509 -text -noout`
[15:10:13] I can search for it, but probably you have it handy
[15:10:17] thanks
[15:10:18] should do the trick
[15:11:22] moritzm: re: tasks - thanks!
[15:11:56] moritzm: since everything works now, I can take care of filing the puppet patches and pinging the teams like I did yesterday
[15:12:00] should be quicker
[15:16:53] I don't think that's specifically needed; given how simple these are, people can easily create and merge them themselves?
[15:17:18] fine either way, but then let me create them first, so that they can be referenced in Gerrit
[15:17:57] I wouldn't feel too bad about not using Pontoon, I think this is more about putting complexity in places that are hard to see and test. WMF-specific puppet code vs. python + systemd timers
[15:21:15] inflatador: sure, but in this case the complexity was there and fleet-wide, so it is always good to verify one's assumptions in these cases
[15:21:37] we can also work on the complexity as a separate thing of course (and Pontoon is again a good candidate)
[15:22:09] I am mainly saying that it is a great tool that we don't use enough, out of laziness (at least me, most of the time)
[15:31:17] inflatador: https://phabricator.wikimedia.org/T424672 tracks wcqs::public and the remaining Data Platform services which need to be moved
[16:35:33] Yeah, to me the bigger question is why we need tools like PCC and Pontoon at all, how we can make our environments more reproducible and sustainable for new hires, etc etc. You've heard it all before from me ;)
[16:37:27] well, Pontoon should address the part about reproducing a certain environment in a place where invasive changes are safe and sound
[16:37:52] some safety net will always be needed
[16:39:14] ideally we should invest more time in Pontoon to reduce the "scaffolding" time as much as possible, so the time-to-test anything will be (ideally) a few mins every time (it is already like that in some cases)
[16:39:48] but Filippo has only two hands and also a team to attend to :D
[16:40:09] I think simplifying the infrastructure makes more sense. Again, the fact that there needs to be a PCC and a Pontoon is a red flag
[16:40:35] inflatador: how is PCC different from the deployment-charts helmfile diff?
[16:41:29] it's custom software that we own and maintain forever
[16:41:37] inflatador: simplifying the infra is always good, I am all for it. But I doubt we'll ever be in a place where PCC or Pontoon won't be needed
[16:41:38] so is our helmfile stack
[16:41:50] how so?
[16:44:05] everything from our own templating, to tooling to help manage that templating (`sextant`), to bespoke integration with Gerrit and/or GitLab, and then all the way down the rest of the stack too
[16:45:41] So when I run `helmfile diff` I'm running custom code?
[16:45:55] if you're doing it on the deploy hosts, absolutely
[16:46:07] Interesting, I did not know that
[16:47:11] even the selection of the tool `helmfile` was non-trivial engineering work
[16:47:39] I think it was a good choice, because there is visibility
[16:48:11] yes, but there are still multiple ways you can 'hold' that tool, and we have our own conventions and assumptions in deployment-charts and in the ... let's say developer ecosystem surrounding it
[16:48:16] Whereas with Puppet, it is ready-fire-aim. We maintain custom tools like Pontoon and PCC to help, but AFAIK that is intrinsic to Puppet
[16:48:58] Right, and it's a vibrant ecosystem (at least I hope it is) and there's a lot less friction onboarding someone to K8s
[16:50:34] helmfile, and our wrappers and bespoke tooling and production-specific template libraries on top of it, is a way to make gitops feasible in a repo that can be public and shared
[16:51:11] similarly, PCC is a tool to make gitops feasible in a messy world of baremetal servers we're racking ourselves. there are maybe some more-reasonable and/or more-industry-standard ways to do that in 2026, but ... well, we barely have enough time to do the work that's both urgent *and* important
[16:51:36] I absolutely concede it's important. is it important and urgent, especially relative to other needs?
[16:54:17] I can only offer my own selfish perspective as a "captive customer"
[16:55:57] But I do think there is an over-reliance on custom tooling owned by 1 or 2 people to do really low-level stuff, and that is reflected in rework, delays, etc. that's not being tracked AFAIK
[16:57:09] I get that everyone's busy, and I'm not trying to force my way to the front of the line
[16:57:29] Just want to say that there is a continued cost to doing things this way that may not be apparent to everyone
[16:59:18] I understand that perspective, I do. can I offer you a little of my perspective on the issues? it's not like I am any lover of Puppet; when I joined here long ago, I was coming to it, and PHP, very fresh to all that, after spending 10 years immersed in a particular dialect of C++ and the ur-Kubernetes, with many mistakes along the way
[17:02:01] we can also move to DM if you'd prefer -- just want to be sensitive to the venue and everyone else in the channel, because this has been a fair bit of length in a big channel already :)
[17:02:03] and even with very well maintained open source projects that we use and deploy, there is customization needed most of the time. The PKI infrastructure relies on cfssl, but the project has not been very active over the years and we maintain custom patches (for example, the cfssl-issuer on k8s is built in house as a cert-manager plugin)
[17:02:43] Sure! Or Slack if Luca wants to talk too
[17:02:53] I don't wanna keep y'all all day though, you know I'm a ranter ;)
[17:03:12] ranting is good :)
[17:03:58] but we have to compromise most of the time :)
[17:05:06] {◕ ◡ ◕}
[17:06:05] I don't think anyone on the team actually *likes* Puppet. but it's what we have. and we have a lot of people who have both broad and deep experience with it, and infrastructure built around it, and the running of a lot of Stuff encoded into it, Stuff that we can't just up and stop running
[17:06:37] * jhathaway hides
[17:06:56] (I also think it's totally fine if you do like it, btw)
[17:07:01] :)
[17:07:44] I actually like it, but mostly for its failed potential; it could have been a terraform-type project
[17:08:29] its testing story is rough; at my last company we built a bespoke system, which is what many other companies did, e.g. GitHub
[17:08:36] I think it's less about Puppet and more about the stuff we have around it. PCC, Pontoon, Spicerack, cookbooks, Cumin, etc etc. All this stuff that (to me anyway) is solved better by the community. Not to mention Puppet is owned by private equity and had a nasty breakup with its community
[17:08:53] inflatador: but we still need all that stuff
[17:09:05] Unfortunately for the Puppet piece there is no good community option
[17:09:21] I'm definitely on the "Puppet hater" side, but I also try to be pragmatic about ways we might ever get out from under it
[17:09:30] hi folks, just a heads up, will do some redis host reboots during the infra window ))
[17:09:53] and what I decided to do was to try to also be a k8s lover (and, when I can, an advocate and enabler of others)
[17:10:02] I don't even hate Puppet anymore. I'm also quite over Ansible TBH, been looking at Nix and pyinfra
[17:10:22] the more Stuff we move to k8s, the less we need Puppet, and the less we need to support it
[17:11:06] for sure, and if we shrink our Puppet use, then replacing it becomes more feasible
[17:11:10] exactly
[17:11:24] I still have to reimage at least 300 servers every year, even if I move everything to K8s
[17:11:46] inflatador: but moving everything to k8s gives us a way to rebuild the engines without making the applications stop flying
[17:12:42] I worked for a huge public cloud, so my expectations may be out of whack, but we ran reimages on baremetal constantly. Mostly just cobbler/DHCP with an airflow-like orchestration layer on top
[17:13:13] k8s nodes being a majority or even a plurality of nodes in the fleet gives you an excuse to experiment with a simpler, better-fitting provisioning process
[17:15:03] (and, btw, it won't be "everything", it just has to be a lot of things, and doing the easier pieces first still helps)
[17:15:16] Cool, I appreciate your listening and your perspective ;)
[17:15:19] but now *I've* ranted too much, and I'm going to step down from my soapbox and get some lunch
[18:15:24] cdanis: ranting is good, it allows me to steal your thoughts for my day job. A lot can be learnt from the opinions shared in rants about infrastructure.
[18:28:00] oh, the non-k8s but not-too-far-away-either escape hatch is something like container images + reusing a lot of k8s infra to produce k8s-YAML-formatted systemd quadlets to run on bare-metal hosts
[18:59:23] andrewbogott: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/afeb6b993f82aeaab69d413c856c8ceebda8c9ed is causing puppet failures in eqiad.
[18:59:34] E: Conflicting values set for option Signed-By regarding source http://apt.wikimedia.org/wikimedia/ trixie-wikimedia: /etc/apt/keyrings/wikimedia-archive-keyring.gpg != /etc/apt/keyrings/56056AB2FEE4EECB_openstack.gpg
[19:00:35] sukhe: what host? Or, all of them?
[19:02:08] all cloud* hosts https://puppetboard.wikimedia.org/failures
[19:02:28] cloudvirt*, rather, sorry
[19:02:44] ok, I see it
[19:02:46] thx, will fix
[19:03:00] thanks!
[19:07:26] andrewbogott: I ran into a similar one not long ago. if you are using apt::repository for a third-party repo, you likely need to add the keyfile_path: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1260766/6/modules/profile/manifests/ci/thirdparty_apt.pp#21
[19:09:22] grep Signed-By /etc/apt/sources.list.d/*
[19:09:59] if there are different Signed-By lines in there for the wikimedia source, that creates an error like the above
[19:10:39] I think yeah, it's just a remnant of a previous version of the patch where I was using an external archive
[19:10:59] ack. maybe then you can simply delete the old .list file
[19:12:17] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1278543/1/modules/openstack/manifests/serverpackages/flamingo/trixie.pp
[19:12:24] whether it's manual or puppet, the point is that all wikimedia sources have just /etc/apt/keyrings/wikimedia-archive-keyring.gpg, and this Signed-By line is only needed on bookworm and later
[19:12:27] no, I mean, a remnant in the puppet patch
[19:12:32] gotcha
[19:13:04] andrewbogott: yes, but you need to add .gpg
[19:13:13] oops, thanks
[19:13:17] in my patch I forgot that and followed up with another one
[19:13:20] to add the extension
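[Editor's sketch: the fix being described, as a hedged Puppet illustration. The shape follows the thirdparty_apt.pp example linked at 19:07:26, but the resource title, URI, and keyring filename here are hypothetical, and the parameter names are an assumption about the WMF apt::repository interface, not a confirmed signature.]

```puppet
# Hypothetical third-party repo: it gets its own keyring (note the .gpg
# extension that was initially forgotten), while everything from
# apt.wikimedia.org keeps the shared wikimedia-archive-keyring.gpg, so
# APT never sees two Signed-By values for the same source.
apt::repository { 'thirdparty-openstack':                  # hypothetical title
    uri        => 'http://mirror.example.org/openstack',   # hypothetical URI
    dist       => 'trixie',
    components => 'main',
    keyfile    => 'puppet:///modules/openstack/openstack-archive-keyring.gpg',
}
```

Afterwards, the `grep Signed-By /etc/apt/sources.list.d/*` check suggested at 19:09:22 should show a single keyring per source.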