[00:05:25] FIRING: SystemdUnitFailed: logrotate.service on logstash1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:45:25] FIRING: [2x] SystemdUnitFailed: sync-icinga-state.service on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:35:25] FIRING: [2x] SystemdUnitFailed: sync-icinga-state.service on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:41:57] FIRING: PuppetFailure: Puppet has failed on logstash1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:51:57] FIRING: [2x] PuppetFailure: Puppet has failed on logstash1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:56:49] FIRING: [3x] PuppetFailure: Puppet has failed on kafka-logging1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:56:57] FIRING: PuppetFailure: Puppet has failed on webperf1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:01:49] FIRING: [8x] PuppetFailure: Puppet has failed on kafka-logging1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:01:57] FIRING: PuppetFailure: Puppet has failed on titan1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:05:10] FIRING: PuppetFailure: Puppet has failed on prometheus1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:06:48] FIRING: [8x] PuppetFailure: Puppet has failed on kafka-logging1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:06:57] RESOLVED: PuppetFailure: Puppet has failed on titan1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:10:19] RESOLVED: PuppetFailure: Puppet has failed on prometheus1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:11:49] FIRING: [8x] PuppetFailure: Puppet has failed on kafka-logging1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:16:48] FIRING: [8x] PuppetFailure: Puppet has failed on kafka-logging1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:21:49] FIRING: [6x] PuppetFailure: Puppet has failed on kafka-logging1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:21:53] RESOLVED: PuppetFailure: Puppet has failed on webperf1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:26:48] RESOLVED: [4x] PuppetFailure: Puppet has failed on kafka-logging1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:35:40] FIRING: SystemdUnitFailed: logrotate.service on logstash1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:50:25] RESOLVED: SystemdUnitFailed: logrotate.service on logstash1027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:32:41] FIRING: [2x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[14:34:59] ^^I'm going to check it, there are some problems with the OpenSearch rules.
[15:02:20] `found duplicate series for the match group {cluster=\"dumps\", nodename=\"snapshot1015\"} on the right hand-side of the operation: [{cluster=\"dumps\", instance=\"snapshot1015:9100\", nodename=\"snapshot1015\", team=\"data-platform\"}, {cluster=\"dumps\", instance=\"snapshot1015:9100\", nodename=\"snapshot1015\", team=\"data-engineering\"}];many-to-many matching not allowed:
[15:02:22] matching labels must be unique on one side`
[15:07:25] Appears to be a recurrence of https://phabricator.wikimedia.org/T374178
[15:10:22] yes cwhite, same error, different reason
[15:10:25] I'm trying to understand why the Data Engineering team is in charge of snapshot1015. Moreover, after looking into the puppet repository, it seems that the contact for the server has been listed as 'Data Engineer' since 2024-01-29, while Prometheus was reporting it as 'data-platform' until today at 14:25.
[15:14:20] tappof: https://gerrit.wikimedia.org/r/c/operations/puppet/+/993659
[15:14:42] `profile::contacts::role_contacts` was changed in that patch.
[15:19:39] Okay, cwhite, it's clear. I was using git blame to check the timing on hieradata, and it showed me the commit date, so I assumed the patchset had been applied since January. I was wrong. Thank you.
[15:24:06] cwhite: Another question: it seems that snapshot1015 is the only server affiliated with the Data Engineering team in the role_owner metric. Can we assume that this is a mistake?
[15:26:03] I'm not sure. Maybe btullis would know?
[15:26:30] Hello. Reading backscroll now.
[15:27:01] Oh yeah, that should be Data Platform on snapshot1015.
[15:27:19] ok, thank you btullis
[15:27:55] Shall I patch that now? Is this the only issue?
[15:28:15] Yes btullis, I think so
[15:28:26] k, on it.
[15:30:30] cwhite: The RuleEvaluationFailures alert will recover within 24 hours of the corrective patch set from btullis being applied.
[15:30:38] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1076219
[15:41:07] Awesome, thank you! I submitted a 25hr silence for that alert. :)
[15:50:29] Great. Thanks and sorry for the goof-up.
[15:56:42] I don't think apologies are necessary. We only noticed it because one of our rules is somehow sensitive to role_contacts changes. I feel this event raises the questions: Would we have caught this error another way? And what problems would result if this change continued unnoticed?
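
Note on the error quoted at 15:02:20: the rule failed because two series on the right-hand side carried the same match-group labels (`cluster`, `nodename`) and differed only in `team`. The sketch below shows one possible way to look for that condition ahead of time. It is only an illustration: the function name `find_duplicate_match_groups` and the `example1001` entry are made up, the first two label sets are copied from the error message, and the real check would need to read series from Prometheus rather than from a hard-coded list.

```python
from collections import defaultdict


def find_duplicate_match_groups(series, match_labels):
    """Return the match groups that contain more than one series.

    `series` is a list of label dicts; `match_labels` names the labels
    the rule joins on. A group with several members is what triggers
    "many-to-many matching not allowed" in a one-sided vector match.
    """
    groups = defaultdict(list)
    for labels in series:
        # Build a hashable key from the join labels only.
        key = tuple((name, labels.get(name, "")) for name in match_labels)
        groups[key].append(labels)
    return {key: members for key, members in groups.items() if len(members) > 1}


# First two entries are the label sets from the error message;
# the third is a hypothetical control that should not be reported.
role_owner_series = [
    {"cluster": "dumps", "instance": "snapshot1015:9100",
     "nodename": "snapshot1015", "team": "data-platform"},
    {"cluster": "dumps", "instance": "snapshot1015:9100",
     "nodename": "snapshot1015", "team": "data-engineering"},
    {"cluster": "example", "instance": "example1001:9100",
     "nodename": "example1001", "team": "data-platform"},
]

duplicates = find_duplicate_match_groups(role_owner_series, ("cluster", "nodename"))
for key, members in duplicates.items():
    teams = sorted(member["team"] for member in members)
    print(dict(key), "->", teams)
```

Run against the sample data, this reports only the snapshot1015 group, with both team values listed. A check along these lines would flag a host claimed by two teams, but it would not say which team label is the correct one; that still needs someone who knows the ownership, as happened at 15:27:01.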