[08:27:57] Morning. Is there a known issue with Puppet in WMCS? There's a high number of PuppetAgentNoResources alerts: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DPuppetAgentNoResources
[08:29:49] sobanski: yeah, something definitely going on
[08:32:34] bd808: I removed all the puppet code for labtestwikitech, because it had been broken since early October. Puppet being broken affected my ability to work with labtesthorizon, which I need.
[08:33:27] bd808: the patch, which can be reverted with changes if required, is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1081968
[08:47:22] I created T377803 as UBN!
[08:47:22] T377803: Cloud VPS: cloud-wide puppet problem related to puppet-enc 2024-10-22 - https://phabricator.wikimedia.org/T377803
[09:16:40] dhinus: I would like to run a cumin command, and would like you to +1 it before I run it
[09:22:41] arturo: sure
[09:24:11] dhinus: from cloudcumin1001
[09:24:13] $ sudo cumin -x --force O{*} 'systemctl list-units --all | grep -q puppetserver.service && systemctl try-restart puppetserver.service || echo "no puppetserver service"'
[09:24:26] my intention is to:
[09:24:45] 1) only try restarting the puppetserver on VMs that have it defined as a service
[09:25:02] 2) only restart it if it was previously running (the `try-restart`)
[09:25:29] do you need -x?
[09:26:02] I would try on a couple of different hosts before running it on 'O{*}'
[09:26:07] maybe not, because of the `||`
[09:26:20] exactly, the || should be enough
[09:27:01] I might have another improvement, give me a sec
[09:28:39] ok
[09:31:32] nah, my alternative would probably be more complicated, I was thinking of using "systemctl list-units puppetserver.service --plain --no-legend"
[09:32:06] the exit code is always 0, so you would need to use "wc -l" or something like that
[09:32:44] ok, I'll be using the previous command with `list-units --all`
[09:33:34] I'm running it now, will take ~5 minutes, I will report when it finishes
[09:33:38] ok!
[09:33:48] thanks!
[09:39:58] the command did not work as expected, I don't think it restarted anything
[09:40:12] for all VMs, it returned "no puppetserver service"
[09:42:10] hmmm
[09:43:21] it seems to work on tools-puppetserver-01
[09:43:42] systemctl list-units --all | grep -q puppetserver.service && echo true || echo false
[09:44:52] can you see the list of servers it ran on?
[09:45:14] (655) :-)
[09:45:35] oh wait
[09:45:54] the cumin command targeted 760 VMs
[09:46:06] and 655 reported "no puppetserver service"
[09:46:21] so, I guess ~100 VMs got the service restarted
[09:46:57] sounds better. I guess I can check on a random one if the service was restarted recently
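The `wc -l` alternative mentioned above is only described in words; here is a minimal sketch of both checks as they would look on a single VM, without the cumin wrapper (the unit name and the try-restart behaviour come from the log, everything else is illustrative):

    # Variant actually used: grep -q sets the exit status, so &&/|| branches on
    # whether systemd knows about the puppetserver unit at all; try-restart then
    # only restarts it if it is currently running.
    systemctl list-units --all | grep -q puppetserver.service \
      && systemctl try-restart puppetserver.service \
      || echo "no puppetserver service"

    # Variant discussed but not used: list-units exits 0 even with no match, so
    # the matching lines have to be counted instead of checking the exit status.
    if [ "$(systemctl list-units puppetserver.service --plain --no-legend | wc -l)" -gt 0 ]; then
        systemctl try-restart puppetserver.service
    else
        echo "no puppetserver service"
    fi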
[09:47:26] I will run this
[09:47:26] 'systemctl list-units --all | grep -q puppetserver.service && journalctl -u puppetserver.service --since "10m ago" || echo "no puppetserver service"'
[09:47:32] to see what's in the logs
[09:48:56] 10m is not enough
[09:49:16] ok :-(
[09:50:28] also be careful about the " inside other "
[09:50:37] ah no it's fine, sorry
[09:51:50] re-running with 30m
[09:53:14] sounds good
[09:56:18] this one worked as expected
[09:56:26] almost all puppetservers have restarted just fine
[09:57:02] example log
[09:57:04] https://www.irccloud.com/pastebin/1jsD7pdf/
[09:57:35] there is one weirdo
[09:57:43] https://www.irccloud.com/pastebin/3vS3qQO7/
[10:01:11] I think that host might have a broken puppet
[10:01:16] I remember seeing alerts
[10:01:48] [Cloud VPS alert][mariadbtest] Puppet failure on puppet-1.mariadbtest.eqiad1.wikimedia.cloud
[10:02:06] ok
[11:05:09] dhinus: I would appreciate it if you could take a look at this decision request when you have a few spare cycles in the CPU: https://phabricator.wikimedia.org/T377467
[11:11:08] that's a big one :) I will comment in the task but probably not today
[11:11:21] thanks
[11:11:36] thanks for the pointer
[11:31:35] dhinus: quick review here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082188
[11:34:12] arturo: I'm out for lunch, sorry
[11:34:25] ok
[12:51:39] slyngs: puppet agent runs are broken on cloudinfra-idp-1.cloudinfra.eqiad1.wikimedia.cloud
[12:51:51] https://www.irccloud.com/pastebin/JZr1tW9t/
[12:51:54] missing hiera
[13:12:57] I thought I had it this time :-(
[13:14:18] Just a sec
[13:40:54] arturo: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082209
[14:10:09] slyngs: if you set a `Hosts:` entry in the commit message to `re: cloudinfra-idp-1.*` you can then run PCC by clicking the 'check experimental' button in gerrit
[14:10:58] With the "re: " or just cloudinfra-idp-1.*
[14:11:48] with the `re:` included
[14:15:41] I think I did it wrong https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082209
[14:22:39] slyngs: maybe it's without the space, let me try
[14:23:25] slyngs: there you go, it's without the space
[14:23:35] now PCC fails for legit reasons :-)
[14:23:46] But weird reasons.
[14:23:51] I'll take a look
[14:24:37] slyngs: Function lookup() did not find a value for the name 'profile::idp::standalone::django_secret_key'
[14:24:48] that usually indicates a missing entry in the labs/private.git repo
[14:25:04] It's probably something I did, but it shouldn't need that
[14:33:20] slyngs: because you sent a newer patch, the line `Hosts: re: cloudinfra-idp-1.*` should become `Hosts: re:cloudinfra-idp-1.*`
[14:34:24] Fixed, the missing variable should already be in labs/private, it's from back in 2022
[14:37:21] slyngs: you need the hiera key to be present in the _public_ version of labs/private, which is what PCC uses, see https://gerrit.wikimedia.org/r/plugins/gitiles/labs/private/
[14:37:39] Aaah, yes, you are correct, I just realized the same thing
[14:39:45] That also already has the variable :-(
[14:44:55] where?
[14:45:11] common/profile/idp/standalone.yaml
[14:47:39] arturo: thanks for the explanation on labtestwikitech. Do you know if anyone has thought about where the keystone integration with labtestwikitech in that deployment should point instead? I think at this point the bot edits that make a Nova Resource namespace page for each project are the only connection left between our OpenStack setup and MediaWiki.
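Neither the re-run with the 30-minute window nor the final form of the PCC trigger appears verbatim above; this is a reconstruction from the conversation (cumin host, target selector and inner command follow the earlier invocation; dropping `-x` is assumed from the earlier exchange, since the trailing `||` already keeps the exit status at 0):

    # Verification pass from cloudcumin1001: dump recent puppetserver logs wherever the unit exists.
    sudo cumin --force 'O{*}' 'systemctl list-units --all | grep -q puppetserver.service && journalctl -u puppetserver.service --since "30m ago" || echo "no puppetserver service"'

    # PCC trigger for the cloudinfra-idp patch: add this trailer to the commit
    # message (no space after "re:"), then press 'check experimental' in Gerrit.
    #   Hosts: re:cloudinfra-idp-1.*
    # PCC compiles against the public labs/private repo, so a key such as
    # profile::idp::standalone::django_secret_key needs a dummy value there.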
[14:49:04] bd808: maybe the keystone integration can point to some special namespace in the actual wikitech?
[14:49:45] or some other test/disposable/whatever environment within MediaWiki
[14:50:13] running a mediawiki deploy just for this seems a bit overkill to me
[14:50:14] arturo: yeah, that seems possible. Or even the same namespace and we add some optional per-deployment prefix for the page names.
[14:50:21] yeah
[14:50:24] bd808: speaking of that, did anyone ever fix the bot credentials after the wikitech auth change?
[14:50:36] taavi: I did not get to do it :-(
[14:50:56] T376220
[14:50:56] T376220: Labslogbot needs new SUL OAuth credentials after Wikitech authn changes - https://phabricator.wikimedia.org/T376220
[14:50:58] ah, I was going to say that I thought arturo was looking into it
[14:51:40] taavi: this looks like something that will take you 10 minutes to fix, and will take me a whole morning to learn which buttons to press
[14:54:50] arturo: I'll see if I can figure out tomorrow why PCC can't find the missing variable. I suspect it has never worked, which is a little strange because it has clearly worked in production
[14:55:08] I mean WMCS production
[14:55:12] slyngs: ack
[15:03:54] bd808: I like the per-deployment prefix (or suffix), similar to what we do with per-tenant, per-deploy domain names
[15:14:37] arturo: :nod: That seems like less work than another custom namespace (although is Nova Resource still a registered namespace now?)
[15:15:04] I have no idea
[15:15:46] it seems to be. I haven't dug in to see where we set it up, as I thought it was from the OpenStack extension
[15:19:16] codesearch says https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/7b80931ea4dcd414dff6333e5c510e7591e264b3/wmf-config/core-Namespaces.php#2441
[15:23:59] `Heira`
[15:24:08] :-S
[15:25:51] that hiera namespace was super cool
[15:29:15] slyngs: the same puppet error on cloudweb2002-dev
[15:29:18] https://www.irccloud.com/pastebin/1lhdD0li/
[15:31:15] bd808: taavi: now that you are at the laptop, I invite you both to comment on T377467
[15:31:15] T377467: Decision Request - How to do the Cloud VPS VXLAN/IPv6 migration - https://phabricator.wikimedia.org/T377467
[16:04:38] * arturo offline
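The side question about whether Nova Resource is still a registered namespace was answered via codesearch above; it can also be checked against the live site with the standard MediaWiki siteinfo API. A sketch, assuming wikitech's usual api.php endpoint and the "Nova Resource" display name from the log:

    # List wikitech's registered namespaces and pull out any whose name starts with "Nova Resource".
    curl -s 'https://wikitech.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces&format=json' \
      | grep -o '"Nova Resource[^"]*"'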