[12:04:48] FIRING: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:14:48] RESOLVED: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:19:40] hi folks. grafana is 503ing right now, could use a check
[17:31:54] seems to be in a systemd restart loop on grafana1002
[17:32:01] May 20 17:30:32 grafana1002 grafana[268558]: logger=provisioning t=2026-05-20T17:30:32.864631687Z level=error msg="Failed to provision plugins" error="app provisioning error: open /etc/grafana/provisioning/plugins/mahendrapaipuri-dashboardreporter-app.yaml: permission denied"
[17:32:08] related to this?
[17:32:08] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1286986
[17:32:23] though seems to have been working a decent length of time after it was patched
[17:32:25] that seems to be what it's logging about, yeah
[17:32:54] `error="app provisioning error: open /etc/grafana/provisioning/plugins/mahendrapaipuri-dashboardreporter-app.yaml: permission denied"`
[17:33:14] yep
[17:33:34] I'd be tempted to maybe roll back the patch...
[17:33:47] herron: you around?
[17:35:28] hey, I'm at lunch at the moment
[17:35:42] I was gonna say "I'm just gonna do it cos we got paged again", but the page is for Grafana
[17:36:15] that's odd. It was working earlier.
[17:36:29] yeah you merged the patch a while back
[17:37:02] Yeah, feel free to revert or possibly the file needs permissions changed
[17:37:13] codfw is reporting the same error
[17:37:17] +1 on revert if we're dealing with other outages
[17:37:54] herron: you feel the revert is safe? I'm not sure what the dashboardreporter is we can live without it for a little while?
[17:38:14] I've manually chgrped that file in case that fixes temporarily
[17:38:20] it's safe yeah it's a new plug-in nothing depending on it yet
[17:40:06] hnowlan: I did the same on the codfw one
[17:40:09] and grafana seems back
[17:40:18] This one?
[17:40:24] cmooney@grafana1002:/etc/grafana/provisioning/plugins$ ls -lah
[17:40:24] total 24K
[17:40:24] drwxr-xr-x 2 root grafana 4.0K May 20 16:00 .
[17:40:24] drwxr-xr-x 8 root grafana 4.0K Feb 12 2024 ..
[17:40:24] -rw-r----- 1 root grafana 8.4K May 20 16:00 mahendrapaipuri-dashboardreporter-app.yaml
[17:40:25] -rw-r----- 1 root grafana 219 Feb 12 2024 sample.yaml
[17:40:28] yep
[17:40:29] ack, I will toggle puppet on them for the moment to stop it breaking back
[17:40:47] hnowlan: thanks
[17:41:00] yep to clarify I posted wrong thing, but the yaml file was owned by root before
[17:41:02] -rw-r----- 1 root root 8.4K May 20 13:57 mahendrapaipuri-dashboardreporter-app.yaml
[17:41:21] or in the root group, changed to grafana group service came back
[17:41:29] thanks apologies for the hassle. I'll upload a patch to sort it out today.
[17:41:30] odd that the 17:17 puppet run (on 1002) would suddenly apply these corrective group and owner changes
[17:41:35] disabled
[17:41:47] we should be good for now
[17:42:02] topranks: yeah, checks out my fix was `sudo chgrp grafana /etc/grafana/provisioning/plugins/mahendrapaipuri-dashboardreporter-app.yaml `
[17:42:59] cool, thanks! yeah I made the same change on the other one
[18:27:32] looks like grafana was upgraded with security patches as well which may help explain why the plugin worked initially, then permissions went sideways later on. I think the interaction where puppet alters the default package file permissions, and restarts the service happened shortly before the alert. at any rate here's a patch to persist the chgrp https://gerrit.wikimedia.org/r/c/operations/puppet/+/1290038
[19:33:15] patch deployed and puppet is re-enabled on grafana
[21:29:52] thanks!