[00:48:40] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:43:25] FIRING: [2x] SystemdUnitFailed: arclamp_generate_svgs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:48:25] FIRING: [2x] SystemdUnitFailed: arclamp_generate_svgs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:48:25] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:46:31] hi all! in https://phabricator.wikimedia.org/T371520 we’re seeing several broken Grafana dashboards that seemingly forgot about their Graphite sources; does anyone here know what might be happening there?
[08:52:22] hey Lucas_WMDE, I'll take a look
[08:53:26] thanks!
[08:58:49] I bet that's a side effect of an incomplete audit/change of graphite dashboards in T269333
[08:58:50] T269333: Switch default Grafana datasource to Thanos - https://phabricator.wikimedia.org/T269333
[08:59:17] I see…
[08:59:29] so we had dashboards that didn’t explicitly set the data source to graphite?
[08:59:58] I believe that's true, though I'm not super clear on the details
[09:00:07] I'm tempted to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1057882
[09:00:13] and see if we get the dashboards back
[09:00:59] hm, the panel JSON seems to say "datasource": { "uid": "000000026", "type": "prometheus" }
[09:01:06] but maybe Grafana is just filling in the default data source dynamically
[09:01:34] yeah I think you are right, if you go and save the dashboard you'll see the diff
[09:01:42] which is grafana updating the dashboard model
[09:01:54] ok
[09:02:01] of course without actually saving the dashboard, I looked at the diff
[09:03:02] sigh I wish there was an easy way to export that diff
[09:04:26] Lucas_WMDE: I'll revert the default datasource patch
[09:04:37] alright, thanks!
[09:04:39] let’s hope it helps
[09:04:47] and doesn’t break anything else 😬
[09:05:17] indeed
[09:08:06] and then we’re hoping that a no-op save on the broken dashboards will explicitly save the data source so we can change the default again?
[09:08:29] that I don't know
[09:11:55] Lucas_WMDE: ok wikidata-edits is back for me
[09:12:04] \o/
[09:12:06] * Lucas_WMDE looks
[09:12:17] thank you!
[09:12:31] sure np, I'll leave things as is until Keith is back
[09:12:35] yeah, if I just press save there’s definitely a diff
[09:12:38] from "datasource": "-- Grafana --",
[09:12:46] to a JSON with uid grafana
[09:13:03] yeah totally, I can see it
[09:13:51] apparently keith was able to somehow find dashboards with null datasources, so maybe he’ll know how to find the ones with "-- Grafana --" too
[09:13:58] (guessing that’s why they escaped notice before)
[09:14:09] * Lucas_WMDE has no idea how to search for grafana dashboards by their contents
[09:15:13] operations/software.git has a misc/search-grafana-dashboards.js utility
[09:17:16] ooh, sounds good
[09:20:35] hot damn, https://grafana-rw.wikimedia.org/d/000000188/wikidata-special-entitydata?orgId=1&refresh=30m&forceLogin=true# was on schema version *16*
[09:20:36] → 38
[09:20:39] more than doubling it now ^^
[09:21:43] hehe
[09:22:57] Lucas_WMDE: to be clear, I don't know what are the proper next steps
[09:23:34] nor I have the time/bandwidth to investigate now, I'm sure Keith will know more when he's back at the end of Aug
[09:24:02] yeah, I’m fine with waiting for that (if keeping the reverted default is fine until then)
[09:24:17] or, if you want to change the default again, now we know what to do if we find more broken dashboards
[09:24:37] I think we're fine to live with the revert for a while
[09:25:39] ok
[09:25:46] thanks a lot for your help!
[09:27:38] sure np, thanks for reaching out
[10:48:40] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:48:40] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:23:25] RESOLVED: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
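
(editor's note on the 09:12-09:15 exchange) The open question in the log was how to find dashboards whose JSON still carries a null or string datasource (such as "-- Grafana --") instead of an explicit object with a uid. Below is a rough sketch of one way to scan an exported dashboard JSON for such entries. It is not how misc/search-grafana-dashboards.js works; the function name `find_implicit_datasources` and the sample dashboard are made up for illustration, and only the uid/type pair and the schema version are taken from the chat.

```python
import json


def find_implicit_datasources(node, path="$"):
    """Hypothetical helper: list every "datasource" field that is not an
    explicit object (i.e. it is null or a plain string), with its JSON path."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "datasource" and not isinstance(value, dict):
                # null or legacy string form: Grafana resolves this at load time
                hits.append((child, value))
            else:
                hits.extend(find_implicit_datasources(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_implicit_datasources(item, f"{path}[{index}]"))
    return hits


# Made-up dashboard export, for illustration only.
sample = json.loads("""
{
  "schemaVersion": 16,
  "annotations": {"list": [{"datasource": "-- Grafana --"}]},
  "panels": [
    {"title": "a", "datasource": null},
    {"title": "b", "datasource": {"uid": "000000026", "type": "prometheus"}}
  ]
}
""")

hits = find_implicit_datasources(sample)
for where, value in hits:
    print(where, "->", repr(value))
```

Walking the whole document rather than only the top-level panels avoids assuming a particular schema version, which seems safer given that one dashboard in the log was still on version 16.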