[00:04:12] FIRING: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[00:32:10] !log [WDQS] Restarted blazegraph on `wdqs101[1,3]`
[00:32:11] ryankemper: Not expecting to hear !log here
[00:32:23] woops, thought i was in operations
[02:17:49] FIRING: DiskSpace: Disk space mwlog2002:9100:/srv 2.893% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=mwlog2002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[02:57:34] RESOLVED: DiskSpace: Disk space mwlog2002:9100:/srv 3.86% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=mwlog2002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[04:04:27] FIRING: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[04:23:34] FIRING: DiskSpace: Disk space mwlog2002:9100:/srv 3.962% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=mwlog2002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[04:53:34] RESOLVED: DiskSpace: Disk space mwlog2002:9100:/srv 3.767% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=mwlog2002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[08:04:27] FIRING: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[09:01:42] godog: hi! would you have a bit of time to help me figure out a weird rsyslog error on cephosd1003?
[10:04:00] nvm, I figured it out
[10:14:09] brouberol: ok! nicely done
[10:14:17] what was it ?
[10:16:31] sorry yes, I should have mentioned. rsyslog failed and crashed with
[10:16:31] Feb 20 08:28:32 cephosd1003 rsyslogd[18467]: file '/var/spool/rsyslog/centrallog1002.eqiad.wmnet:6514.00001368': open error: No such file or directory [v8.2302.0 try https://www.rsyslog.com/e/2040 ]
[10:17:05] and If I were to create the file, it would crash on another not being found, with the counter incremented by 1
[10:17:05] Feb 20 08:28:32 cephosd1003 rsyslogd[18467]: file '/var/spool/rsyslog/centrallog1002.eqiad.wmnet:6514.00001369': open error: No such file or directory [v8.2302.0 try https://www.rsyslog.com/e/2040 ]
[10:17:28] turns out I had a corrupted spool file in /var/spool/rsyslog, that, when cleared up, allowed rsyslog to restart cleanly
[10:17:43] ah yeah totally, that makes sense
[10:18:09] rsyslog spools has been observed to crash in the past
[10:20:43] I cleaned up millions of stale files from some s3 buckets, and turns out the rados gateway logs each API call to syslog, which would then end up in the rsyslog spool, until the rootfs filled up due to /var/log/ceph/radosgw logs taking the whole disk
[10:20:55] at that point, I suspect, the spool files got corrupted
[10:21:19] quite likely yes, also ow re: radosgw logs
[10:21:34] yep, that was an interesting failure mode ^^'
[10:23:39] can you guess when it happened? https://grafana.wikimedia.org/d/tbO9LAiZK/ceph-cluster?orgId=1&refresh=1m&var-interval=%24__auto&from=now-7d&to=now&timezone=utc&var-DS_PROMETHEUS=000000026&var-cluster=cephosd&var-site=eqiad&viewPanel=panel-47
[10:26:55] hahaha oops
[10:27:03] brutal
[12:04:27] FIRING: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[12:54:12] RESOLVED: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[14:08:12] FIRING: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[16:53:12] RESOLVED: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[17:00:34] FIRING: DiskSpace: Disk space mwlog2002:9100:/srv 3.965% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=mwlog2002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:55:12] FIRING: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[18:57:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest_live - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest_live - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[19:37:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest_live - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest_live - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[21:00:49] FIRING: DiskSpace: Disk space mwlog2002:9100:/srv 2.793% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=mwlog2002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:10:12] RESOLVED: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[21:10:24] FIRING: SystemdUnitFailed: thanos-compact.service on titan2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:11:01] FIRING: ThanosCompactIsDown: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[21:11:56] ^^ it's me
[21:35:24] RESOLVED: SystemdUnitFailed: thanos-compact.service on titan2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:36:01] RESOLVED: ThanosCompactIsDown: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[23:45:34] FIRING: [2x] DiskSpace: Disk space mwlog1002:9100:/srv 3.952% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace