[00:04:11] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:16:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:18:55] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:27:07] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[00:28:43] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[00:30:03] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:21] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:59:07] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:17:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:30:03] (PS1) Jeena Huneidi: Migrate tools/release to gitlab [puppet] - https://gerrit.wikimedia.org/r/879908 (https://phabricator.wikimedia.org/T290260)
[02:07:46] (JobUnavailable) firing: (5) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:46] (JobUnavailable) firing: (11) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:03] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:20:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:27:46] (JobUnavailable) firing: (13) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:32:21] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[02:33:57] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[02:37:46] (JobUnavailable) firing: (13) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:47:46] (JobUnavailable) firing: (13) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:31:41] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[03:36:27] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[03:58:31] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[04:01:33] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[04:12:47] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[04:14:21] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[04:36:24] (Abandoned) Andrew Bogott: Nova: puppetize /etc/nova/api-paste.ini [puppet] - https://gerrit.wikimedia.org/r/874938 (https://phabricator.wikimedia.org/T323086) (owner: Andrew Bogott)
[05:19:19] PROBLEM - SSH on an-worker1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:39:41] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:41:19] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:54:09] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:03:47] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:13:25] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:45] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:34:21] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:43:57] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:47:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:58:19] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:04:41] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:51] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:35:01] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:47] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:07] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:06:15] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:23] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:18:53] PROBLEM - Host cp2031 is DOWN: PING CRITICAL - Packet loss = 100%
[08:19:15] PROBLEM - Host ms-be2046 is DOWN: PING CRITICAL - Packet loss = 100%
[08:19:25] PROBLEM - Host elastic2041 is DOWN: PING CRITICAL - Packet loss = 100%
[08:19:33] PROBLEM - Host kafka-logging2002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:19:55] PROBLEM - Host mc2043 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:01] PROBLEM - Host thanos-fe2002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:11] PROBLEM - Host elastic2063 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:11] PROBLEM - Host cp2032 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:13] PROBLEM - Host elastic2064 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:13] PROBLEM - Host elastic2057 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:13] PROBLEM - Host lvs2008 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:15] PROBLEM - Host elastic2077 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:19] PROBLEM - Host elastic2078 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:19] PROBLEM - Host mc2042 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:23] PROBLEM - Host ms-fe2010 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:27] PROBLEM - Host ms-be2041 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:27] PROBLEM - Host ml-cache2002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:29] PROBLEM - Host elastic2042 is DOWN: PING CRITICAL - Packet loss = 100%
[08:21:03] (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:21:35] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:22:17] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:22:17] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:22:23] PROBLEM - Juniper virtual chassis ports on asw-b-codfw is CRITICAL: CRIT: Down: 7 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[08:22:43] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:23:38] (virtual-chassis crash) firing: Alert for device asw-b-codfw.mgmt.codfw.wmnet - virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash
[08:24:15] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[08:26:03] (ProbeDown) resolved: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:27:46] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:28:47] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:07] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1005 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[08:29:35] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[08:33:38] (virtual-chassis crash) resolved: Device asw-b-codfw.mgmt.codfw.wmnet recovered from virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash
[08:34:21] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2001 is CRITICAL: 76 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2001
[08:34:35] Erm, them hosts all down can’t be good
[08:35:33] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2003 is CRITICAL: 76 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2003
[08:36:44] _joe_: ^
[08:41:13] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:43:13] SRE: asw-b-codfw virtual chassis crash - https://phabricator.wikimedia.org/T327001 (RhinosF1)
[08:43:19] SRE: asw-b-codfw virtual chassis crash - https://phabricator.wikimedia.org/T327001 (RhinosF1) p: Triage→Unbreak!
[08:46:25] PROBLEM - configured eth on lvs2009 is CRITICAL: ens3f0np0 reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[08:46:29] SRE: asw-b-codfw virtual chassis crash - https://phabricator.wikimedia.org/T327001 (RhinosF1) Following hosts are down: cp2031 ms-be2046 elastic2041 kafka-logging2002 mc2043 thanos-fe2002 elastic2063 cp2032 elastic2046 elastic2057 lvs2008 elastic2077 elastic2078 mc2042 ms-fe2010 ms-be2041 ml-cache2002 elasti...
[08:50:45] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:52:39] good morning
[08:57:24] SRE: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (taavi)
[08:59:08] i don't see any user-facing impact, but also it doesn't feel safe to me to leave it like that for the weekend
[08:59:10] * taavi klaxons
[08:59:24] taavi: thank you, I have no klaxon
[08:59:33] 100% agree though, not safe over weekend
[09:00:55] It's 1AM here in SF so I'm about to sleep, but I see someone else is paging now
[09:01:35] TheresNoTime: when did you end up in SF
[09:02:00] About a week now
[09:05:11] That's also worth a "lower impact" #page for T327001
[09:05:12] T327001: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001
[09:06:51] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:09:29] hey, checking
[09:10:30] hey godog! tl;dr is that the switch in codfw b2 went down, and took all the hosts in it (https://netbox.wikimedia.org/dcim/racks/52/) with it
[09:10:47] *nod* thanks taavi for the context and heads up
[09:10:53] also that page took very long to reach anyone
[09:11:06] do we know if there's any user impact so far ?
[09:11:39] can't see any obvious signs from NEL, so that's good
[09:11:50] * Emperor appears
[09:12:07] I've tried looking at the relevant dashboards (webrequest 50x and mediawiki-errors) and didn't find anything that looks to be caused by this
[09:12:10] (got the p.age by email but not SMS)
[09:12:27] godog: i doubt any major user impact, possibly a few dead requests if the cp’s were serving as it dropped
[09:12:35] but I klaxon'd anyways since this doesn't feel like a safe situation to be left over a weekend
[09:12:44] I think it’s quite lucky only 2 cp, no mw or db
[09:13:02] https://phabricator.wikimedia.org/T327001#8525268 is a list down
[09:13:20] agreed
[09:13:23] if we can't get the switch back, I should depool ms-fe2010 and thanos-fe2002
[09:13:59] Emperor: yeah please do now anyways, not sure if we are getting the switch back
[09:14:17] There’s a lot of elastic hosts, not sure if they need action
[09:14:33] That row is about 50% elastic
[09:14:58] * TheresNoTime will /away now
[09:15:29] iirc elastic should be able to deal with a rack down by itself, and I don't see any alerts to indicate otherwise
[09:15:30] the virtual chassis crash alert has cleared; dunno if that means it's possible to get the switch to start up again?
[09:16:24] not sure
[09:16:29] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:36] There’s 2 alerts that look related to kafka-logging2002 being down
[09:17:07] it might be that the virtual chassis is healthy and a single switch isn't
[09:17:09] lvs can cope with 1 of them being down, that got said yesterday
[09:17:59] yeah that's right RhinosF1
[09:18:01] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic2057-production-search-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[09:19:00] !log depool ms-fe2010 T327001
[09:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:04] T327001: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001
[09:19:23] !log mvernon@cumin2002 conftool action : set/pooled=no; selector: name=ms-fe2010.codfw.wmnet
[09:19:52] let's see if we can powercycle the switch
[09:19:56] !log depool thanos-fe2002 T327001
[09:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:16] !log mvernon@cumin2002 conftool action : set/pooled=no; selector: name=thanos-fe2002.codfw.wmnet
[09:21:19] SRE: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (fgiunchedi) from netbox b2 was the master previously I think, master is now b7 ` filippo@asw-b-codfw> show virtual-chassis Preprovisioned Virtual Chassis Fabric Fabric ID: 5ddb.095b.79f3 Fabric Mode: Mixed...
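The `!log` entries above show the depools being done with conftool on a cumin host. A minimal sketch of the equivalent invocations, assuming the `confctl` selector syntax shown in the logged commands (the `build_depool_cmd` helper and its print-only behaviour are hypothetical, for review before running anything for real):

```shell
# Hypothetical wrapper mirroring the two depools logged above.
# It only prints the confctl command so the selector can be checked
# before executing it on a cumin host.
build_depool_cmd() {
  local fqdn="$1"
  printf "sudo confctl select 'name=%s' set/pooled=no\n" "$fqdn"
}

build_depool_cmd ms-fe2010.codfw.wmnet
build_depool_cmd thanos-fe2002.codfw.wmnet
```

Pooling the hosts back after the switch is replaced would be the same selector with `set/pooled=yes`.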
[09:22:10] tbh I can't remember if we have individual power control for rack switches
[09:23:01] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: (2) Elasticsearch instance elastic2057-production-search-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[09:23:26] godog: I'm afraid I've not really interacted with our switches very much :-/
[09:23:57] yeah me neither Emperor
[09:27:53] godog: I think you want "request system reboot member X"
[09:28:19] [where X is the member switch you want to kick]
[09:28:26] * Emperor reading https://www.juniper.net/documentation/us/en/software/junos/virtual-chassis-qfx/topics/task/virtual-chassis-ex4200.html
[09:29:08] (but if that doesn't fix it we might need one of the people who actually know about this stuff)
[09:30:18] my understanding is that's possible when the switch is present/connected heh
[09:31:19] might be worth a try?
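The commands being discussed run in the Junos CLI on the virtual-chassis master, not in a POSIX shell. As a sketch of the sequence (the printing helper below is hypothetical; `show virtual-chassis` and `request system reboot member X` are the standard Junos commands named in the conversation, with member 2 corresponding to asw-b2-codfw):

```shell
# Print the Junos virtual-chassis recovery steps discussed above, for a
# given member number. Only this helper is made up; the quoted commands
# are the ones from the conversation.
junos_vc_recovery_steps() {
  local member="$1"
  cat <<EOF
show virtual-chassis                    # which member is down, who is master now
request system reboot member ${member}  # attempt to reboot the failed member
EOF
}

junos_vc_recovery_steps 2
```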
[09:32:31] I don't think so
[09:32:36] FE
[09:34:45] IIRC we do have individual outlet controls in "network racks" in the PDUs, I'm checking if that's the case for B2
[09:35:03] 'k
[09:35:45] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:37:23] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:43:43] yeah can confirm ps1-b2-codfw is not a switched PDU
[09:43:57] alas
[09:44:26] going to try request system reboot member 2, might as well
[09:45:07] 🤞
[09:46:06] !log issue 'request system reboot member 2' - T327001
[09:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:10] T327001: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001
[09:46:28] not sure that did anything, yet at least
[09:48:36] off the top of my head the next thing would be either dcops or smart hands to power cycle the device
[09:48:37] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:49:19] I guess those are Monday things now? It's obv. not ideal, but I think we're meant to be able to run with a rack down
[09:50:37] assuming we can invoke smart hands that's a 24/7 thing iirc, but yeah afaict none of the hosts down in the rack requires immediate intervention
[09:53:27] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:53:54] SRE: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (fgiunchedi) >>! In T327001#8525305, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/iSesr4UB6FQ6iqKiDtno} [2023-01-14T09:46:06Z] issue 'request system reboot member...
[09:54:31] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[09:55:20] I'm checking docs to see if we could even get a hold of codfw smart hands now
[09:56:05] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[09:57:02] Hey. I am in bed with a headache and an otitis, so can't be much help, but let me know if you need anything
[09:57:36] ouch, GWS!
[09:57:38] thank you akosiaris, take care
[10:00:06] no joy with docs on how to invoke smart hands
[10:00:51] gehel ryankemper inflatador ^ FYI there's a switch down in codfw with a few elastic hosts (list in T327001)
[10:00:52] T327001: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001
[10:01:12] I don't think there's any action required at this time, please correct me if I'm mistaken
[10:02:04] papaul wiki_willy ^ FYI too
[10:02:25] I'm going to shoot them an email too
[10:02:32] Thanks :)
[10:09:33] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:11:09] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:25] godog: thanks for the ping!
[10:13:52] gehel: for sure! hope there's indeed no action required now ?
[10:14:09] we can escalate if that's the case though
[10:14:26] (at least in theory, I'm not sure I know how to raise codfw smart hands :) )
[10:14:33] We should be good. I don't have access to a computer right now, but I'll have a look in a few hours
[10:15:01] *nod* thank you
[10:15:20] We should be able to lose 5-6 servers without much impact at all. There might be a few alerts about unassigned shards
[10:15:34] gehel, godog, Emperor, hello hello
[10:15:49] yo XioNoX
[10:15:51] just woke up
[10:15:56] morning :)
[10:16:27] gehel: ack
[10:16:43] I saw some direct pages, anything I can help with?
[10:16:57] I'll resolve the oncall incident for now or it'll page again
[10:17:06] gehel: I count 7, there was a memory pressure alert earlier
[10:17:22] XioNoX: asw-b2-codfw is down
[10:17:31] XioNoX: tl;dr is that asw-b2-codfw went down, virtual chassis failed over to another master and a bunch of hosts are down
[10:17:39] XioNoX: https://phabricator.wikimedia.org/T327001
[10:17:45] all B2 racks I guess
[10:17:47] :)
[10:18:13] indeed
[10:18:27] did everything fail over as it should have?
[10:18:47] looks like it yeah, master is b7 now and there's no user impact afaics
[10:19:03] awesome
[10:19:24] I depooled a couple of things in b2, but we should be OK without them 'til Monday
[10:19:26] I emailed papaul and willy for awareness, though the current thinking is we can leave things be
[10:19:27] B2 and B7 are the links to the routers, so that means we don't have redundancy
[10:19:57] Emperor: Monday is a US holiday, so we might need to wait until Tuesday
[10:19:57] hah, thank you I missed that
[10:20:15] is there a task?
[10:20:20] T327001
[10:20:20] T327001: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001
[10:22:21] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:23:00] as you might expect, the switch's console is.... dead too
[10:24:07] sadness
[10:27:50] thoughts on next steps ?
[10:28:12] godog: updating the task, one sec
[10:28:34] godog: https://phabricator.wikimedia.org/T327001#8525324
[10:28:46] SRE: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (ayounsi) Thanks for the task, quite an eventful week for switches :) Indeed the switch is dead, console doesn't reply either. Everything that can be done remotely is done, next step is to replace it with a spare switch and RMA it. Monday...
[10:29:28] XioNoX: ack, thanks for the update
[10:30:23] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:31:32] sigh, hardware
[10:34:08] in the risky stuff, 'lvs2008' => 'high-traffic2', so it's upload, but the secondary node took over properly: https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1&viewPanel=27
[10:36:43] indeed, so loss of redundancy for now
[10:39:30] if there's agreement we're okay for now I'll go back to my day
[10:40:03] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:42:38] +1
[10:44:50] ack, XioNoX ^
[10:45:17] Thanks everyone
[10:46:51] godog: it might be worthwhile to file a documentation task to improve the smarthands documentation for The Future™
[10:48:57] heheh good point p858snake, I'll ask dcops since we might have docs and I wasn't able to find them
[10:49:12] godog: I'd probably say that should have auto-paged too
[10:49:13] "how to try and access the switch console" too, maybe?
[10:49:38] not so sure about that; we are capable of running with a switch failed
[10:49:59] (by design)
[10:50:30] ...but I think that discussion can wait 'til Mon/Tue also :)
[10:51:02] I’ll update the task later with full timings and everything
[10:51:10] Emperor: should always document for the worst case, just because the design is for safe operation with one down, doesn't mean it always goes that way
[10:51:17] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:51:27] SGTM, thanks all so far
[10:51:46] Please go enjoy your weekends now
[11:00:57] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:05:47] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:08:57] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:17:12] SRE: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (RhinosF1) P43154 has a draft IR if needed, collates times and actionables to save any SRE time.
[11:19:02] SRE, ops-codfw: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (RhinosF1) p: Unbreak!→High
[11:19:14] SRE, ops-codfw: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (RhinosF1) Lowering to deal with on Monday/Tuesday and updated description. Thanks everyone for the response.
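Aside from the switch incident, the log is full of check_systemd_state flapping on an-worker1125: the check reports `degraded` whenever any unit (here systemd-timedated.service) is in the failed state, and OK once it clears. A minimal sketch of that summary logic (the `summarize_systemd_state` function is hypothetical; in practice the state comes from `systemctl is-system-running` and the failed units from `systemctl --failed`, and the message shapes match the alerts above):

```shell
# Reproduce the OK/CRITICAL summaries emitted by the systemd state check.
# $1 is a system state such as "running" or "degraded"; any remaining
# arguments are the names of failed units.
summarize_systemd_state() {
  local state="$1"; shift
  if [ "$state" = "running" ]; then
    echo "OK - running: The system is fully operational"
  else
    echo "CRITICAL - $state: The following units failed: $*"
  fi
}

summarize_systemd_state degraded systemd-timedated.service
summarize_systemd_state running
```

A flap like the one above usually means the unit keeps failing and being restarted (or reset), so each PROBLEM/RECOVERY pair tracks one failure cycle.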
[11:19:23] Tidied task up
[11:19:28] I'm going to my weekend now
[11:20:11] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:25] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:39:15] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:07] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:41] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:10:47] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:15] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:27:46] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:29:35] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[12:34:41] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:40:53] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:50:27] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:06:31] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:11:19] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:24:11] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:28:43] PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 878722 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[13:39:47] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:43:53] PROBLEM - Disk space on doh3001 is CRITICAL: DISK CRITICAL - free space: / 341 MB (3% inode=87%): /tmp 341 MB (3% inode=87%): /var/tmp 341 MB (3% inode=87%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=doh3001&var-datasource=esams+prometheus/ops
[13:51:03] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:21] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:15:15] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:24:47] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:25:07] RECOVERY - Disk space on doh3001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=doh3001&var-datasource=esams+prometheus/ops
[14:36:25] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:39:05] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:43:55] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:51:57] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:59:55] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:03] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:35] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:43] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:29:15] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:37:23] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:40:37] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:41:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:46:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:51:51] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:52:12] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Taavi - https://phabricator.wikimedia.org/T327013 (10taavi)
[16:01:03] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:10:15] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:11:11] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:15:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10taavi) 05Resolved→03Open Re-opening. The developer account `Hxi-ctr` has shell name `xihua`, not `hxi-ctr` which was added to Puppet in this pa...
[16:22:15] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:22:25] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:27:15] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:27:46] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:29:35] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[16:33:43] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:40:09] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:48:15] PROBLEM - Disk space on doh3002 is CRITICAL: DISK CRITICAL - free space: / 340 MB (3% inode=87%): /tmp 340 MB (3% inode=87%): /var/tmp 340 MB (3% inode=87%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=doh3002&var-datasource=esams+prometheus/ops
[16:51:23] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:59:29] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:04:09] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:04:37] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:10:01] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:10:27] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:10:27] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:16:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.294 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:16:47] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:21:43] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:26:33] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:37:27] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:45:17] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:46:53] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:46:55] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:57:07] (03PS1) 10Krinkle: team-perf: Remove firstinputtiming alerts [alerts] - 10https://gerrit.wikimedia.org/r/879925 (https://phabricator.wikimedia.org/T323623)
[17:57:51] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:58:11] (03PS1) 10Krinkle: Remove unused eventlogging_FirstInputTiming stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879926 (https://phabricator.wikimedia.org/T323623)
[18:03:56] (03PS2) 10Krinkle: Remove unused eventlogging_RUMSpeedIndex stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726854 (https://phabricator.wikimedia.org/T286700) (owner: 10Phedenskog)
[18:04:00] (03CR) 10Krinkle: [C: 03+1] Remove unused eventlogging_RUMSpeedIndex stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726854 (https://phabricator.wikimedia.org/T286700) (owner: 10Phedenskog)
[18:10:35] RECOVERY - Disk space on doh3002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=doh3002&var-datasource=esams+prometheus/ops
[18:13:47] (03PS3) 10Krinkle: eventlogging: Remove obsoleted navtiming schemas [puppet] - 10https://gerrit.wikimedia.org/r/726852 (https://phabricator.wikimedia.org/T281103) (owner: 10Phedenskog)
[18:16:45] (03PS3) 10Krinkle: Remove unused eventlogging_RUMSpeedIndex stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726854 (https://phabricator.wikimedia.org/T286700) (owner: 10Phedenskog)
[18:16:47] (03PS2) 10Krinkle: Remove former EventLogging streams for navtiming [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879926 (https://phabricator.wikimedia.org/T281103)
[18:18:51] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:28:29] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:29:19] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:31:28] 10SRE, 10Wikimedia-Mailing-lists: Request control of IRC mailing list - https://phabricator.wikimedia.org/T327014 (10Legoktm) a:03Legoktm
[18:35:17] 10SRE, 10Wikimedia-Mailing-lists: Request control of IRC mailing list - https://phabricator.wikimedia.org/T327014 (10Legoktm) 05Open→03Resolved Done.
[18:39:47] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:39:55] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:41:45] PROBLEM - Host ps1-b2-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[18:48:53] 10SRE, 10ops-codfw: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (10Papaul) @ayounsi The spare switch is in place. I am using https://netbox-next.wikimedia.org/dcim/devices/3423/. Let me know if you want to set it up now or wait until next week
[18:58:45] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:08:19] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[19:09:51] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[19:11:37] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:11:39] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:14:13] PROBLEM - Disk space on doh4001 is CRITICAL: DISK CRITICAL - free space: / 342 MB (3% inode=87%): /tmp 342 MB (3% inode=87%): /var/tmp 342 MB (3% inode=87%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=doh4001&var-datasource=ulsfo+prometheus/ops
[19:28:51] RECOVERY - configured eth on lvs2009 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[19:29:19] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:35:35] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:40:19] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:43:25] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:59:33] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:08:55] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:10:29] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:22:36] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T327015 (10phaultfinder)
[20:27:46] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:29:35] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[20:29:37] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:40:31] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:46:49] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:52:55] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:58:01] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:00:57] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:10:27] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:21:37] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:31:17] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:36:07] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:40:57] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:48:59] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:50:39] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:50:55] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:52:15] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:01:55] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:11:33] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:11:35] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:20:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:21:01] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:21:17] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:22:51] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:23:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.293 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:24:03] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:32:29] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:35:43] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:42:11] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:51:49] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:53:25] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:01:23] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:07:51] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:11:05] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:22:23] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:24:03] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:33:43] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:40:03] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:51:13] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:54:25] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state