[08:28:56] FIRING: [3x] SystemdUnitFailed: prometheus-mysqld-exporter.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:29:35] will check/fix ↑
[08:40:25] https://phabricator.wikimedia.org/T371049 ↑
[08:41:57] ah no, not at all
[09:00:38] actually, I'm not sure, I've added my findings as a comment: T371049#10279895
[09:00:38] T371049: prometheus-mysqld-exporter doesn't fully support multi-instances for pt-heartbeat - https://phabricator.wikimedia.org/T371049
[09:02:01] you need to manually disable prometheus-mysqld-exporter.service
[09:02:17] puppet cannot do it because it doesn't know if that should be up or not
[09:02:32] or at least not without a source of truth as it is now
[09:02:58] as the profiles only enable the multiinstance, but don't know if the main unit should be enabled or not
[09:04:02] are there some cases where multiinstance and single instance are coexisting?
[09:07:11] fixed
[09:10:12] thanks! we could maybe try to make those cases mutually exclusive if there was no case where both normal and @ exporter are coexisting
[09:11:10] you disabled s3 and s4, you will have to reenable those
[09:11:30] it is better if you disable notifications until data is loaded, then reenable notifications
[09:11:38] I did not disable anything, it's why I found it weird
[09:11:47] then someone did
[09:11:48] (systemd wise)
[09:12:00] "disabled; preset: enabled"
[09:12:18] oh I believe you, I'm trying to make sense of it
[09:12:36] RESOLVED: [3x] SystemdUnitFailed: prometheus-mysqld-exporter.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:13:03] if disable notifications still sends notifications you should report that to observability team
[09:13:26] as it used to work for icinga, it is not ok not to work for prometheus
[09:13:45] it is usually, idk this one spammed 🤔
[09:14:45] (and this is why I don't want orchestration on puppet, but on a different, state-dependent service)
[09:15:11] health of the host cannot be monitored until data is loaded
[09:15:22] and data cannot be loaded until configuration is loaded
[09:15:39] so current puppet workflow doesn't fit our needs
[09:15:49] I find the exporter dynamic useful in that context, it helps layering the monitoring cake down to the lowest possible bit
[09:16:35] sorry, didn't fully understand the last sentence "layering the monitoring cake"?
[09:17:05] haha sorry I tried a metaphor that was a bit too far fetched
[09:17:40] I am just providing my opinion, feel free to disagree- but you are the one suffering those shortcomings
[09:20:45] I'm just commenting :) I find exporters useful as they are decoupled from each other but they need to fit our workflow indeed. In that case I was specifically asking about a detail: given that puppet is able to know if it's supposed to deploy 1 or n services for prometheus-mysqld-exporter, disabling the default exporter could be handled at
[09:20:45] installation time, avoiding that kind of issue (I'm not sure that it was 100% of the issue as I feel I'm missing something here), if both cases are not coexisting
[09:21:24] (both cases being: multi instance mysqld exporter + single instance mysqld exporter)
[09:21:38] that won't solve the issue I am mentioning- prometheus monitoring not being able to work until mysql is started
[09:23:00] ok, I'm narrowing down where I failed to understand! it's why I was mentioning the "layered cake", mysqld-exporter won't monitor mysqld but the node (and the services) will be monitored, so we have monitoring, just up to the state at which the node is, no?
[09:23:01] But regarding your question, I think the right approach would be to export a resource that can be collected at "role level" (can be a common profile, I hope that terminology is understood) and if 0 resources are exported by profiles, disable it
[09:23:41] Again, will that work if the untils fail because they cannot run to completion? (who knows)
[09:23:59] *units
[09:25:55] maybe there are simpler ways that don't hardcode what the other profiles do
[11:55:04] arnaudb: sorry, I was in a meeting. I'm not sure I understand your questions, but not sure I can answer those. Either puppet or obs team (or their docs) will have an answer. Sorry I may not be able to help there.
[14:10:16] if no one objects to this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1084730 I'll merge it this afternoon
[14:11:38] (mycli: https://www.mycli.net/ it's quite handy)
[14:14:04] arnaudb: LGTM, but ideally get a +1 from a database person?
[14:14:55] thanks Emperor, indeed I'd be happy to have a +1 from them!
[14:21:54] As a reminder, I am not a database person, I take care of backups
[14:28:01] I think Amir1 is around today...
[14:30:05] * dhinus wants a t-shirt with "not a database person"
[14:33:11] are we expecting db2190 down?
[14:42:37] dhinus: I'm still trying to find a good design for "webmaster@wikipedia.org" as suggested by moritzm, I'm looking for a suggestion box for the shop :D
[14:45:31] Give me a bit
[14:46:04] Emperor: I disabled notification on db2190 but it's not working it seems. Even on icinga
[14:46:25] Amir1: don't worry it's been downtimed
[14:46:43] Yeah, this is what I mentioned before- I don't think alertmanager is doing the right thing
[14:46:56] for me this is a regression on how it works on puppet
[14:47:11] so definitely not your fault, Amir1
[14:47:30] (not that it is anyone's fault, but I think it is a bug IMHO)
[14:47:58] Yeah. I thought it would be just prometheus ones but icinga is not working either
[14:48:06] oh
[14:48:18] I see, I think I know what it is
[14:48:26] because puppet didn't run (couldn't)
[14:48:38] resources weren't generated
[14:48:56] so it may work well, but it has a gap (if puppet cannot run, it doesn't work)
[15:01:23] Oh no, the down db system has started speaking in Mexican! https://phab.wmfusercontent.org/file/data/eofruw7hrincbb64yofn/PHID-FILE-fxllyczrciyqj7opba3r/image.png
[15:05:16] that's all I could gather: https://phabricator.wikimedia.org/T378628#10281209 cannot help further without physical access
[15:47:36] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:27:24] Is there anyone around that can sanity-check https://gerrit.wikimedia.org/r/c/operations/puppet/+/1085430?
[16:31:22] {done}
[16:31:46] arnaudb: ty!
[16:36:50] Can I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1085434 please? I think on reflection going with upstream-expected scrape interval of 15s is a more sensible place to be at this point, and matches what upstream's dashboards expect too
[16:39:07] tiny change, but I'd like to push it today if poss :)
[18:12:48] FIRING: PuppetFailure: Puppet has failed on thanos-be2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:07:36] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:12:48] FIRING: PuppetFailure: Puppet has failed on thanos-be2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:07:36] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed