[06:56:09] inflatador: if you have a host that has that listener enabled (I doubt you do), that one. If not, there is none. But you will see a diff in deployment-charts CI for charts that enable the listener
[06:59:44] since thumbor has no real owner, if anyone is up for reviewing https://gerrit.wikimedia.org/r/c/operations/software/thumbor-plugins/+/1243135 please go ahead :-)
[07:11:13] moritzm: {done} ;)
[07:12:08] Thanks!
[14:05:44] seems that puppet is broken on the deployment server, with
[14:05:44] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Service "mw-parsoid" not found in service::catalog (file: /srv/puppet_code/environments/production/modules/profile/manifests/kubernetes/deployment_server/global_config.pp, line: 48, column: 13) on node
[14:05:44] deploy1003.eqiad.wmnet
[14:06:16] meh, sorry, maybe not worthy of this chan
[14:07:21] puppet being broken on the deploy host seems worthy of this chan to me
[14:08:38] The last Puppet run was at Mon Mar 30 19:22:39 UTC 2026 (1119 minutes ago).
[14:08:47] effie: do we need to edit hieradata/common/profile/services_proxy/envoy.yaml maybe?
[14:09:01] https://gitlab.wikimedia.org/repos/sre/CIDERGRINDER/-/merge_requests/10
[14:09:03] err no
[14:09:05] T420468
[14:09:06] T420468: Retire mw-parsoid LVS service - https://phabricator.wikimedia.org/T420468
[14:09:07] that.
[14:11:51] effie: I think it might be related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1262052
[14:12:23] oh
[14:12:24] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1262054
[14:12:52] let's merge, cdanis?
[14:12:53] brouberol: I suggest you merge that and run puppet; if it still breaks I can take a look
[14:13:20] ack. On it, thanks
[14:13:25] thank you :)
[14:13:35] bloody hell, I actually didn't see that coming (I am off today)
[14:13:55] then please enjoy your time off!
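[Editor's note: the failure above is a profile still referencing a service (mw-parsoid) after its removal from service::catalog. The kind of pre-merge check that could catch this can be sketched in a few lines; the catalog contents, data layout, and function name below are hypothetical, not the actual puppet data structures.]

```python
# Sketch of a pre-merge consistency check: every service name referenced by
# the deployment-server profile (or an envoy listener) must still exist in
# service::catalog. The catalog contents and layout here are hypothetical.

def find_unknown_services(catalog: dict, referenced: set) -> set:
    """Return referenced service names missing from the catalog."""
    return referenced - set(catalog)

# Hypothetical service::catalog after mw-parsoid is retired (T420468):
catalog = {
    "mw-api-int": {"port": 4446},
    "mw-web": {"port": 4450},
}
# Service names still referenced in hiera / profile code:
referenced = {"mw-api-int", "mw-parsoid"}

missing = find_unknown_services(catalog, referenced)
print(sorted(missing))  # ['mw-parsoid']
```

A check like this run in CI would flag the dangling reference before merge instead of breaking puppet on deploy1003.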
[14:13:55] tbf, it is something CI should warn you about
[14:14:03] we got this, don't worry
[14:14:18] a spec test that the deployment profile compiles when you edit the service catalog, or something
[14:16:36] yep, we seem to be in a better spot
[14:16:40] it is my bad really, I mean, it was literally there, thanks folks
[14:20:29] np!
[14:20:54] Notice: Applied catalog in 138.65 seconds
[14:20:58] all good
[15:32:03] Is there someone who can lend a second set of eyes to this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1265450 It's fallout from the removal of the mw-parsoid listener. PCC is returning the very same errors I see on the restbase cluster, but I've removed the get_url invocation for mw-parsoid
[15:32:27] I'm sure this is something dumb that I'm staring right at...
[15:35:47] <_joe_> urandom: uhm, why not point restbase to mw-api-int?
[15:36:02] <_joe_> but sigh, it looks like a lot of such cases weren't handled properly
[15:36:09] urandom: https://puppet-compiler.wmflabs.org/output/1265450/6304/ --> "Hosts that have no differences (or compile correctly only with the change)"
[15:36:20] CI run failed successfully
[15:39:48] elukey: I merged a couple of spicerack patches; is it safe/easy to deploy a new release to the cumin hosts? Should I make a phab task for that?
[15:41:31] (there's no particular rush, so if that happens on a standard cadence I can just wait)
[15:42:30] andrewbogott: you are lucky, since I needed to cut a release yesterday; they are already on the cumin hosts :)
[15:42:48] that's handy! ty
[15:43:34] urandom: oh. the thing making actual CI fail is:
[15:43:36] 11:42:38 Could not parse for environment *root*: Syntax error at '=>' (file: /srv/workspace/puppet/modules/profile/manifests/restbase.pp, line: 85, column: 26)
[15:43:55] looks like you dropped a comma, heh
[15:45:28] yeah, that one I get :)
[15:46:16] (that one was new)
[15:46:23] oho
[15:47:47] urandom: so, PCC "worked" -- compilation failed in the prod state, and succeeded in the change state
[15:48:11] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-puppet7-test/6307/console
[15:48:40] oh. and it calls that failed overall?
[15:49:00] 1) geez, 2) why have I never noticed that?
[15:50:05] seems so, yeah
[15:50:08] not a very common state, I guess
[15:51:24] in the context of a change being compiled to correct a production failure, calling one that succeeds in doing that "failed" is... an odd choice
[15:52:31] cdanis: thank you though!
[15:53:18] andrewbogott: the deploy on the cloudcumin hosts is left to us to do, just fyi
[15:53:42] That's just 'apt install' though, right?
[15:54:22] yes
[16:12:58] urandom: FWIW, I have seen PCC 'fail' to compile in the prod state a few times recently, where it actually was a failure in the PCC infra (or something like that), rather than a production compilation actually failing. e.g. https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-puppet7-test/6159/console
[16:13:46] A_smart_kitten: https://puppet-compiler.wmflabs.org/output/1256301/6159/phab1004.eqiad.wmnet/prod.phab1004.eqiad.wmnet.err
[16:13:51] same case, where the production run failed
[16:16:37] cdanis: yeah, I basically interpreted that as an error in the PCC infra rather than an error in the actual current production state.
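[Editor's note: PCC compiles each host twice, once against the production state and once against the change, and the confusion discussed here comes from collapsing the two results into a single pass/fail verdict. A minimal sketch of how the four per-host combinations could be labelled instead; the label strings are hypothetical, not PCC's actual output.]

```python
def classify(prod_ok: bool, change_ok: bool, differs: bool = False) -> str:
    """Classify one host's PCC result from its two compilations.

    Hypothetical labels -- not what PCC actually prints."""
    if prod_ok and change_ok:
        return "diff" if differs else "noop"
    if not prod_ok and change_ok:
        # The case discussed above: production is already broken and the
        # change fixes it, yet the job reports an overall failure.
        return "fixed-by-change"
    if prod_ok and not change_ok:
        return "broken-by-change"
    return "error"  # both compilations fail: PCC infra or pre-existing breakage

print(classify(prod_ok=False, change_ok=True))  # fixed-by-change
```

Surfacing "fixed-by-change" as its own state, rather than folding it into "failed", would make runs like 6307 above much less surprising.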
but then I guess it means that PCC needs to be rerun in order to get the actual diff that the Gerrit change would make
[16:16:50] I think it is to call your attention to that, yeah
[16:16:59] in the same way a check engine light might
[16:20:21] sukhe: Am I correct in assuming we do not log dns queries to our pdns-recursor from internal hosts?
[16:24:53] <_joe_> A_smart_kitten, cdanis: yes, I can confirm
[16:25:44] <_joe_> but also, the puppet compiler was originally born to do very different stuff from what it's doing now, and we didn't really work on all the things that should've been adapted to its final role in our environment
[16:26:19] I can propose some good first patches that are easy and relevant-to-this-discussion UX fixes, if anyone is interested
[16:33:04] Ok, so puppet runs cleanly on the restbase cluster again, but the transform endpoint no longer works. The only errors I'm seeing are those provoked by me running the service checker, so unless I hear anything I'm not going to mess with it further.
[16:34:56] a gif of a dog in a burning room is implied, I think
[16:37:32] topranks: that is correct, yep
[16:38:01] we have hardly any logging on the gdnsd side as well, for that matter
[16:38:16] the gdnsd side is a different question, I feel, and probably ok not to log
[16:38:35] for security it might be an idea to enable query logging for pdns-recursor, to see what domains our own systems are connecting to
[16:38:58] anyway, it's not there now so it won't help my current sleuthing, thanks!
[16:40:12] topranks: yeah, in some ways I don't see any downside to it, but I think we don't do it because we didn't have a use case
[16:40:53] topranks: depending on what you are doing, the debugging, I think it may just be fine to disable puppet on a single host and then test stuff (and enable logging for that period) in the interim
[16:42:33] ah, it's not important. I know from working in other environments that if there is a high number of queries, logging them (and the replies) can put quite a lot of load on the resolvers, so it's often not something to wade into (then there are things like dnstap to log dns activity more efficiently)
[16:43:04] it's definitely a good thing to have if possible though, to understand what connections are being initiated from your hosts (i.e. C&C domains being looked up, dns data exfil junk)
[16:43:15] ha, yeah. pdns-rec does support dnstap
[16:43:20] for what I'm looking at right now the connections are via the squid proxy, so the logs there will do me
[16:43:23] ok!
[16:44:01] topranks: internal hosts use the proxies to connect to the internet, so we can audit it there. and my vote would go to doing the same for public hosts (after reducing the number of public hosts we have to the strict minimum)
[16:44:36] XioNoX: yes, agreed on that general approach
[16:45:21] what is missing there is: I see in Netflow a lot of traffic to IP x.x.x.x, and I can find who owns the IP. I then also see connections to some domain name in the squid logs.
[16:45:41] in this case it's fairly easy to make the assumption that the domain is resolving to that IP, based on the pattern
[16:46:32] I think squid doesn't record the IP the name resolves to, as the system handles that for it. Not 100% sure though; perhaps it could record that too.
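[Editor's note: the correlation topranks describes (netflow traffic to an IP, squid logs showing domains but not resolved IPs) can be approximated after the fact by resolving the logged domains and intersecting with the flow IPs. A best-effort sketch, with the caveat that DNS answers at analysis time can differ from what the host saw at connection time; all names and data below are hypothetical.]

```python
import socket
from collections import defaultdict

def correlate(domains, flow_ips, resolve=None):
    """Map each netflow IP of interest to the squid-logged domains that
    currently resolve to it. Best-effort only: DNS answers at analysis
    time may differ from what the host got at connection time."""
    if resolve is None:
        def resolve(name):
            try:
                return {ai[4][0] for ai in socket.getaddrinfo(name, None)}
            except socket.gaierror:
                return set()
    hits = defaultdict(set)
    for domain in domains:
        for ip in resolve(domain) & set(flow_ips):
            hits[ip].add(domain)
    return dict(hits)

# Example with a stub resolver and hypothetical data (RFC 5737 test IPs):
table = {"example.org": {"203.0.113.7"}, "updates.example": {"198.51.100.9"}}
print(correlate(table, ["203.0.113.7"], resolve=lambda d: table.get(d, set())))
# {'203.0.113.7': {'example.org'}}
```

A dnstap feed from pdns-recursor, as mentioned above, would give the authoritative answer-at-query-time mapping without this guesswork.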