[08:27:57] Morning. Is there a known issue with Puppet in WMCS? There's a high number of PuppetAgentNoResources alerts: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DPuppetAgentNoResources
[08:29:49] sobanski: yeah, something definitely going on
[08:32:34] bd808: I removed all the puppet code for labtestwikitech, because it had been broken since early October. Puppet being broken affected my ability to work with labtesthorizon, which I need.
[08:33:27] bd808: the patch, which can be reverted with changes if required, is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1081968
[08:47:22] I created T377803 as UBN!
[08:47:22] T377803: Cloud VPS: cloud-wide puppet problem related to puppet-enc 2024-10-22 - https://phabricator.wikimedia.org/T377803
[09:16:40] dhinus: I would like to run a cumin command, and would like you to +1 it before I run it
[09:22:41] arturo: sure
[09:24:11] dhinus: from cloudcumin1001
[09:24:13] $ sudo cumin -x --force O{*} 'systemctl list-units --all | grep -q puppetserver.service && systemctl try-restart puppetserver.service || echo "no puppetserver service"'
[09:24:26] my intention is to:
[09:24:45] 1) only try restarting the puppetserver on VMs that have it defined as a service
[09:25:02] 2) only restart it if it was previously running (the `try-restart`)
[09:25:29] do you need -x?
[09:26:02] I would try on a couple of different hosts before running it on 'O{*}'
[09:26:07] maybe not, because of the `||`
[09:26:20] exactly, the || should be enough
[09:27:01] I might have another improvement, give me a sec
[09:28:39] ok
[09:31:32] nah, my alternative would probably be more complicated, I was thinking of using "systemctl list-units puppetserver.service --plain --no-legend"
[09:32:06] the exit code is always 0, so you would need to use "wc -l" or something like that
[09:32:44] ok, I'll be using the previous command with `list-units --all`
[09:33:34] I'm running it now, will take ~5 minutes, I will report when it finishes
[09:33:38] ok!
[09:33:48] thanks!
[09:39:58] the command did not work as expected, I don't think it restarted anything
[09:40:12] for all VMs, it returned "no puppetserver service"
[09:42:10] hmmm
[09:43:21] it seems to work on tools-puppetserver-01
[09:43:42] systemctl list-units --all | grep -q puppetserver.service && echo true || echo false
[09:44:52] can you see the list of servers it ran on?
[09:45:14] (655) :-)
[09:45:35] oh wait
[09:45:54] the cumin command targeted 760 VMs
[09:46:06] and 655 reported "no puppetserver service"
[09:46:21] so, I guess ~100 VMs got the service restarted
[09:46:57] sounds better. I guess I can check on a random one if the service was restarted recently
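The `wc -l` alternative mentioned above is only described in words; here is a minimal sketch of both checks as they would look on a single VM, without the cumin wrapper (the unit name and the try-restart behaviour come from the log, everything else is illustrative):

    # Variant actually used: grep -q sets the exit status, so &&/|| branches on
    # whether systemd knows about the puppetserver unit at all; try-restart then
    # only restarts it if it is currently running.
    systemctl list-units --all | grep -q puppetserver.service \
      && systemctl try-restart puppetserver.service \
      || echo "no puppetserver service"

    # Variant discussed but not used: list-units exits 0 even with no match, so
    # the matching lines have to be counted instead of checking the exit status.
    if [ "$(systemctl list-units puppetserver.service --plain --no-legend | wc -l)" -gt 0 ]; then
        systemctl try-restart puppetserver.service
    else
        echo "no puppetserver service"
    fi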
[09:47:26] I will run this
[09:47:26] 'systemctl list-units --all | grep -q puppetserver.service && journalctl -u puppetserver.service --since "10m ago" || echo "no puppetserver service"'
[09:47:32] to see what's in the logs
[09:48:56] 10m is not enough
[09:49:16] ok :-(
[09:50:28] also be careful about the " inside other "
[09:50:37] ah no it's fine, sorry
[09:51:50] re-running with 30m
[09:53:14] sounds good
[09:56:18] this one worked as expected
[09:56:26] almost all puppetservers have restarted just fine
[09:57:02] example log
[09:57:04] https://www.irccloud.com/pastebin/1jsD7pdf/
[09:57:35] there is one weirdo
[09:57:43] https://www.irccloud.com/pastebin/3vS3qQO7/
[10:01:11] I think that host might have a broken puppet
[10:01:16] I remember seeing alerts
[10:01:48] [Cloud VPS alert][mariadbtest] Puppet failure on puppet-1.mariadbtest.eqiad1.wikimedia.cloud
[10:02:06] ok
[11:05:09] dhinus: I would appreciate it if you could take a look at this decision request when you have a few spare cycles in the CPU: https://phabricator.wikimedia.org/T377467
[11:11:08] that's a big one :) I will comment in the task but probably not today
[11:11:21] thanks
[11:11:36] thanks for the pointer
[11:31:35] dhinus: quick review here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082188
[11:34:12] arturo: I'm out for lunch, sorry
[11:34:25] ok
[12:51:39] slyngs: puppet agent runs are broken on cloudinfra-idp-1.cloudinfra.eqiad1.wikimedia.cloud
[12:51:51] https://www.irccloud.com/pastebin/JZr1tW9t/
[12:51:54] missing hiera
[13:12:57] I thought I had it this time :-(
[13:14:18] Just a sec
[13:40:54] arturo: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082209
[14:10:09] slyngs: if you set a `Hosts:` entry in the commit message to `re: cloudinfra-idp-1.*` you can then run PCC by clicking the 'check experimental' button in gerrit
[14:10:58] With the "re: " or just cloudinfra-idp-1.*
[14:11:48] with the `re:` included
[14:15:41] I think I did it wrong https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082209
[14:22:39] slyngs: maybe it's without the space, let me try
[14:23:25] slyngs: there you go, it's without the space
[14:23:35] now PCC fails for legit reasons :-)
[14:23:46] But weird reasons.
[14:23:51] I'll take a look
[14:24:37] slyngs: Function lookup() did not find a value for the name 'profile::idp::standalone::django_secret_key'
[14:24:48] that usually indicates a missing entry in the labs/private.git repo
[14:25:04] It's probably something I did, but it shouldn't need that
[14:33:20] slyngs: because you sent a newer patch, the line `Hosts: re: cloudinfra-idp-1.*` should become `Hosts: re:cloudinfra-idp-1.*`
[14:34:24] Fixed, the missing variable should already be in labs/private, it's from back in 2022
[14:37:21] slyngs: you need the hiera key to be present in the _public_ version of labs/private, which is what PCC uses, see https://gerrit.wikimedia.org/r/plugins/gitiles/labs/private/
[14:37:39] Aaah, yes, you are correct, I just realized the same thing
[14:39:45] That also already has the variable :-(
[14:44:55] where?
[14:45:11] common/profile/idp/standalone.yaml
[14:47:39] arturo: thanks for the explanation on labtestwikitech. Do you know if anyone has thought about where the keystone integration with labtestwikitech in that deployment should point instead? I think at this point the bot edits that make a Nova Resource namespace page for each project are the only connection left between our OpenStack setup and MediaWiki.
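Neither the re-run with the 30-minute window nor the final form of the PCC trigger appears verbatim above; this is a reconstruction from the conversation (cumin host, target selector and inner command follow the earlier invocation; dropping `-x` is assumed from the earlier exchange, since the trailing `||` already keeps the exit status at 0):

    # Verification pass from cloudcumin1001: dump recent puppetserver logs wherever the unit exists.
    sudo cumin --force 'O{*}' 'systemctl list-units --all | grep -q puppetserver.service && journalctl -u puppetserver.service --since "30m ago" || echo "no puppetserver service"'

    # PCC trigger for the cloudinfra-idp patch: add this trailer to the commit
    # message (no space after "re:"), then press 'check experimental' in Gerrit.
    #   Hosts: re:cloudinfra-idp-1.*
    # PCC compiles against the public labs/private repo, so a key such as
    # profile::idp::standalone::django_secret_key needs a dummy value there.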
[14:49:04] bd808: maybe the keystone integration can point to some special namespace in the actual wikitech?
[14:49:45] or some other test/disposable/whatever environment within MediaWiki
[14:50:13] running a mediawiki deploy just for this seems a bit overkill to me
[14:50:14] arturo: yeah, that seems possible. Or even the same namespace and we add some optional per-deployment prefix for the page names.
[14:50:21] yeah
[14:50:24] bd808: speaking of that, did anyone ever fix the bot credentials after the wikitech auth change?
[14:50:36] taavi: I did not get to do it :-(
[14:50:56] T376220
[14:50:56] T376220: Labslogbot needs new SUL OAuth credentials after Wikitech authn changes - https://phabricator.wikimedia.org/T376220
[14:50:58] ah, I was going to say that I thought arturo was looking into it
[14:51:40] taavi: this looks like something that will take you 10 minutes to fix, and will take me a whole morning to learn which buttons to press
[14:54:50] arturo: I'll see if I can figure out tomorrow why PCC can't find the missing variable. I suspect it has never worked, which is a little strange because it has clearly worked in production
[14:55:08] I mean WMCS production
[14:55:12] slyngs: ack
[15:03:54] bd808: I like the per-deployment prefix (or suffix), similar to what we do with per-tenant, per-deploy domain names
[15:14:37] arturo: :nod: That seems like less work than another custom namespace (although is Nova Resource still a registered namespace now?)
[15:15:04] I have no idea
[15:15:46] it seems to be. I haven't dug in to see where we set it up, as I thought it was from the OpenStack extension
[15:19:16] codesearch says https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/7b80931ea4dcd414dff6333e5c510e7591e264b3/wmf-config/core-Namespaces.php#2441
[15:23:59] `Heira`
[15:24:08] :-S
[15:25:51] that hiera namespace was super cool
[15:29:15] slyngs: the same puppet error on cloudweb2002-dev
[15:29:18] https://www.irccloud.com/pastebin/1lhdD0li/
[15:31:15] bd808: taavi: now that you are at the laptop, I invite you both to comment on T377467
[15:31:15] T377467: Decision Request - How to do the Cloud VPS VXLAN/IPv6 migration - https://phabricator.wikimedia.org/T377467
[16:04:38] * arturo offline
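The side question about whether Nova Resource is still a registered namespace was answered via codesearch above; it can also be checked against the live site with the standard MediaWiki siteinfo API. A sketch, assuming wikitech's usual api.php endpoint and the "Nova Resource" display name from the log:

    # List wikitech's registered namespaces and pull out any whose name starts with "Nova Resource".
    curl -s 'https://wikitech.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces&format=json' \
      | grep -o '"Nova Resource[^"]*"'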