[02:35:48] FIRING: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:45:48] RESOLVED: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:52:34] FIRING: DiskSpace: Disk space prometheus1007:9100:/srv/prometheus/k8s-aux 3.986% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=prometheus1007 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[10:52:57] ^^ looking
[11:02:34] RESOLVED: DiskSpace: Disk space prometheus1007:9100:/srv/prometheus/k8s-aux 3.966% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=prometheus1007 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:24:47] tappof: question for you about https://gerrit.wikimedia.org/r/c/operations/alerts/+/1297163
[12:24:47] Now the certificate expired, and `probe_ssl_earliest_cert_expiry` is returning "no data". What would be the cleanest way to keep alerting even after the expiration? https://grafana.wikimedia.org/goto/cfo4b1il51m9se?orgId=1 I think C is cleanest but maybe too resource intensive?
[12:45:01] XioNoX: Please take a look at D (https://grafana.wikimedia.org/goto/afo4cjt6c8o3kc?orgId=1). The query returns the value of probe_ssl_earliest_cert_expiry when it is defined, or 0 for all network devices that are currently being successfully scraped but do not have a value for probe_ssl_earliest_cert_expiry (i.e., those with an expired certificate).
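(Editor's sketch of the fallback described at 12:45:01. Query D is only linked in the log, not quoted, so this Python is a rough model of the stated behaviour and not the actual PromQL. The function name `expiry_or_zero` and the device names are hypothetical.)

```python
def expiry_or_zero(expiry_by_target, scraped_targets):
    """Keep every known expiry value; add 0 for scraped targets without one."""
    # Start with 0 for every target that is currently being scraped...
    merged = {target: 0 for target in scraped_targets}
    # ...then let real expiry values take precedence where they exist.
    merged.update(expiry_by_target)
    return merged


# Hypothetical sample data, not taken from the log.
expiry = {"router-a": 1790000000}   # has an expiry sample
scraped = ["router-a", "router-b"]  # router-b is scraped but has no expiry sample

result = expiry_or_zero(expiry, scraped)
print(result)
```

Under these assumptions `router-b` ends up with the placeholder value 0, which is what lets an alert rule keep matching a device whose expiry series has disappeared.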
[12:45:08] Please take a look at the plotted series and verify that they match what you are expecting to see.
[12:48:34] FIRING: DiskSpace: Disk space prometheus1007:9100:/srv/prometheus/k8s-aux 3.986% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=prometheus1007 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:58:34] FIRING: [2x] DiskSpace: Disk space prometheus1007:9100:/srv/prometheus/k8s-aux 3.959% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:03:34] RESOLVED: [2x] DiskSpace: Disk space prometheus1007:9100:/srv/prometheus/k8s-aux 3.954% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:06:31] tappof: thanks! that helped find more issues :) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1297688 because of that all the Nokia devices were not being monitored properly (cc topranks)
[13:18:34] Sorry for generating more work with my suggestion XioNoX :) but I'm glad it helped
[13:22:23] hahaha, good kind of work, I'll send you a CR shortly to update monitoring
[13:36:38] tappof: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1297698
[13:59:05] LGTM XioNoX, I've just added the test case.
[13:59:10] tappof: <3
[14:25:34] tappof, topranks: all good https://alerts.wikimedia.org/?q=scope%3Dnetwork&q=alertname%3DCertAlmostExpired !
[14:37:15] nice !
[14:37:23] I presume 0s means "already expired" ?
[14:38:31] yeah
[14:38:47] clearing the alert before I step away
[14:38:49] hrm probe_ssl_earliest_cert_expiry is the unix timestamp of when it expires, not the number of seconds until expiry
[14:39:02] so 0 as the value of that indicates that the scrape failed, not that it's already expired
[14:39:23] taavi: it's more complicated, see https://gerrit.wikimedia.org/r/c/operations/alerts/+/1297698
[21:44:56] FIRING: PrometheusZombieSeriesDetected: Zombie series detected on k8s (eqiad) - https://wikitech.wikimedia.org/wiki/Prometheus#Runbooks - https://grafana.wikimedia.org/d/taff979/prometheus-tsdb-cardinality-monitoring?orgId=1&from=now-14d&to=now&timezone=utc&var-prometheus=k8s&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DPrometheusZombieSeriesDetected
[22:14:56] FIRING: [2x] PrometheusZombieSeriesDetected: Zombie series detected on k8s (codfw) - https://wikitech.wikimedia.org/wiki/Prometheus#Runbooks - https://alerts.wikimedia.org/?q=alertname%3DPrometheusZombieSeriesDetected
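(Editor's sketch for the 14:38:49 point that probe_ssl_earliest_cert_expiry is a unix timestamp, not a countdown. It only illustrates the arithmetic and makes no claim about how the linked alert rule is written. The helper `seconds_until_expiry` and the sample timestamps are hypothetical.)

```python
import time


def seconds_until_expiry(expiry_timestamp, now=None):
    """Convert an absolute expiry timestamp into seconds remaining.

    A negative result means the timestamp lies in the past.
    """
    if now is None:
        now = time.time()
    return expiry_timestamp - now


# Hypothetical fixed "now" so the arithmetic is reproducible.
now = 1_700_000_000

one_day_left = seconds_until_expiry(1_700_086_400, now)  # expires 24h after "now"
placeholder = seconds_until_expiry(0, now)               # raw value 0

print(one_day_left)  # 86400
print(placeholder)   # -1700000000
```

A raw value of 0 therefore reads as an expiry at the unix epoch, not as "zero seconds left", which is why its meaning depends on how the rule produces and interprets it.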