[08:28:56] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: prometheus-mysqld-exporter.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:29:35] <arnaudb>	 will check/fix ↑
[08:40:25] <arnaudb>	 https://phabricator.wikimedia.org/T371049 ↑ 
[08:41:57] <arnaudb>	 ah no, not at all
[09:00:38] <arnaudb>	 actually, I'm not sure, I've added my findings as a comment: T371049#10279895
[09:00:38] <stashbot>	 T371049: prometheus-mysqld-exporter doesn't fully support multi-instances for pt-heartbeat - https://phabricator.wikimedia.org/T371049
[09:02:01] <jynus>	 you need to manually disable prometheus-mysqld-exporter.service
[09:02:17] <jynus>	 puppet cannot do it because it doesn't know if that should be up or not
[09:02:32] <jynus>	 or at least not without a source of truth, as it is now
[09:02:58] <jynus>	 as the profiles only enable the multiinstance, but don't know if the main unit should be enabled or not
[09:04:02] <arnaudb>	 are there some cases where multiinstance and single instance coexist?
[09:07:11] <jynus>	 fixed
[09:10:12] <arnaudb>	 thanks! we could maybe try to make those cases mutually exclusive if there was no case where both normal and @ exporter are coexisting
[09:11:10] <jynus>	 you disabled s3 and s4, you will have to reenable those
[09:11:30] <jynus>	 it is better if you disable notifications until data is loaded, then reenable notifications
[09:11:38] <arnaudb>	 I did not disable anything, that's why I found it weird
[09:11:47] <jynus>	 then someone did
[09:11:48] <arnaudb>	 (systemd wise)
[09:12:00] <jynus>	 "disabled; preset: enabled"
[09:12:18] <arnaudb>	 oh I believe you, I'm trying to make sense of it
[09:12:36] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: prometheus-mysqld-exporter.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:13:03] <jynus>	 if disabling notifications still sends notifications, you should report that to the observability team
[09:13:26] <jynus>	 as it used to work for icinga, it is not ok not to work for prometheus
[09:13:45] <arnaudb>	 it usually does, idk why this one spammed 🤔
[09:14:45] <jynus>	 (and this is why I don't want orchestration on puppet, but on a different, state-dependent service)
[09:15:11] <jynus>	 health of the host cannot be monitored until data is loaded
[09:15:22] <jynus>	 and data cannot be loaded until configuration is loaded
[09:15:39] <jynus>	 so current puppet workflow doesn't fit our needs
[09:15:49] <arnaudb>	 I find the exporter dynamic useful in that context, it helps layering the monitoring cake down to the lowest possible bit
[09:16:35] <jynus>	 sorry, didn't fully understand the last sentence "layering the monitoring cake"?
[09:17:05] <arnaudb>	 haha sorry I tried a metaphor that was a bit too far fetched
[09:17:40] <jynus>	 I am just providing my opinion, feel free to disagree- but you are the one suffering those shortcomings
[09:20:45] <arnaudb>	 I'm just commenting :) I find exporters useful as they are decoupled from each other but they need to fit our workflow indeed. In that case I was specifically asking about a detail: given that puppet is able to know if it's supposed to deploy 1 or n services for prometheus-mysqld-exporter, disabling the default exporter could be handled at
[09:20:45] <arnaudb>	 installation time, avoiding that kind of issue (I'm not sure that it was 100% of the issue as I feel I'm missing something here), if both cases are not coexisting 
[09:21:24] <arnaudb>	 (both cases being: multi instance mysqld exporter + single instance mysqld exporter)
[09:21:38] <jynus>	 that won't solve the issue I am mentioning- prometheus monitoring not being able to work until mysql is started
[09:23:00] <arnaudb>	 ok, I'm narrowing down where I failed to understand! that's why I was mentioning the "layered cake": mysqld-exporter won't monitor mysqld, but the node (and the services) will be monitored, so we have monitoring, just up to the state the node is in, no?
[09:23:01] <jynus>	 But regarding your question, I think the right approach would be to export a resource that can be collected at "role level" (can be a common profile, I hope that terminology is understood) and if 0 resources are exported by profiles, disable it
[09:23:41] <jynus>	 Again, will that work if the units fail because they cannot run to completion? (who knows)
[09:25:55] <jynus>	 maybe there are simpler ways that don't hardcode what the other profiles do
[11:55:04] <jynus>	 arnaudb: sorry, I was in a meeting. I'm not sure I understand your questions, nor that I can answer them. Either puppet or obs team (or their docs) will have an answer. Sorry I may not be able to help there.
[14:10:16] <arnaudb>	 if no one objects to this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1084730 I'll merge it this afternoon
[14:11:38] <arnaudb>	 (mycli: https://www.mycli.net/ it's quite handy)
[14:14:04] <Emperor>	 arnaudb: LGTM, but ideally get a +1 from a database person?
[14:14:55] <arnaudb>	 thanks Emperor, indeed I'd be happy to have a +1 from them!
[14:21:54] <jynus>	 As a reminder, I am not a database person, I take care of backups
[14:28:01] <Emperor>	 I think Amir1 is around today...
[14:30:05] * dhinus wants a t-shirt with "not a database person"
[14:33:11] <Emperor>	 are we expecting db2190 to be down?
[14:42:37] <arnaudb>	 dhinus: I'm still trying to find a good design for "webmaster@wikipedia.org" as suggested by moritzm, I'm looking for a suggestion box for the shop :D
[14:45:31] <Amir1>	 Give me a bit 
[14:46:04] <Amir1>	 Emperor: I disabled notification on db2190 but it's not working it seems. Even on icinga
[14:46:25] <arnaudb>	 Amir1: don't worry it's been downtimed
[14:46:43] <jynus>	 Yeah, this is what I mentioned before- I don't think alertmanager is doing the right thing
[14:46:56] <jynus>	 for me this is a regression on how it works on puppet
[14:47:11] <jynus>	 so definitely not your fault, Amir1
[14:47:30] <jynus>	 (not that it is anyone's fault, but I think it is a bug IMHO)
[14:47:58] <Amir1>	 Yeah. I thought it would be just prometheus ones but icinga is not working either 
[14:48:06] <jynus>	 oh
[14:48:18] <jynus>	 I see, I think I know what it is
[14:48:26] <jynus>	 because puppet didn't run (couldn't)
[14:48:38] <jynus>	 resources weren't generated
[14:48:56] <jynus>	 so it may work well, but it has a gap (if puppet cannot run, it doesn't work)
[15:01:23] <jynus>	 Oh no, the down db system has started speaking in Mexican! https://phab.wmfusercontent.org/file/data/eofruw7hrincbb64yofn/PHID-FILE-fxllyczrciyqj7opba3r/image.png
[15:05:16] <jynus>	 that's all I could gather: https://phabricator.wikimedia.org/T378628#10281209 cannot help further without physical access
[15:47:36] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:27:24] <urandom>	 Is there anyone around that can sanity-check https://gerrit.wikimedia.org/r/c/operations/puppet/+/1085430?
[16:31:22] <arnaudb>	 {done}
[16:31:46] <urandom>	 arnaudb: ty! 
[16:36:50] <Emperor>	 Can I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1085434 please? I think on reflection going with the upstream-expected scrape interval of 15s is a more sensible place to be at this point, and matches what upstream's dashboards expect too
[16:39:07] <Emperor>	 tiny change, but I'd like to push it today if poss :)
[18:12:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on thanos-be2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:07:36] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:12:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on thanos-be2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:07:36] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed