[00:00:37] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:55] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:55] PROBLEM - MegaRAID on db1155 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:16:56] ACKNOWLEDGEMENT - MegaRAID on db1155 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T328825 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:17:01] SRE, ops-eqiad: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T328825 (ops-monitoring-bot)
[00:46:25] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:58] SRE, ops-eqiad, DBA: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T328825 (Peachey88)
[02:20:45] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:27:41] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 132 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:32:55] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 116 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:36:23] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 43 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:04:17] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:04:33] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 180 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:08:05] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 50 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:09:18] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:11:37] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:15:09] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 16 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:36:43] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (install5002), Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:48:03] (CR) Gergő Tisza: [C: +1] GrowthExperiments: Enable leveling up features on beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/886342 (https://phabricator.wikimedia.org/T328757) (owner: Kosta Harlan)
[05:48:09] (CR) Gergő Tisza: [C: +1] GrowthExperiments: Disable leveling up features in production [mediawiki-config] - https://gerrit.wikimedia.org/r/886343 (https://phabricator.wikimedia.org/T328757) (owner: Kosta Harlan)
[06:19:51] SRE, ops-eqiad, DBA: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T328825 (Marostegui) p:Triage→High @jclarck-ctr @wiki_willy I'm not sure if this host is still under warranty. But if it isn't, can we still get a spare disk somewhere? It is an important host for us.
[09:18:44] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[09:31:21] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:47] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:53:35] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[09:55:15] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[10:27:49] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:32:54] SRE, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Joe) I would suggest that instead of handling individual systems, we depool the whole datacenter from external and internal traffic for the duration of t...
[10:37:21] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:41:15] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 117 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:07:51] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:08:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:14:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.221 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:15:01] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:44:23] SRE, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui) >>! In T327925#8587104, @Joe wrote: > I would suggest that instead of handling individual systems, we depool the whole datacenter from extern...
[16:12:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:17:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:43:42] SRE, ops-eqiad, DBA: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T328825 (wiki_willy) a:Jclark-ctr Looks like we're still under warranty until November 2023. Assigning over to @Jclark-ctr to submit a RMA on Monday. Thanks, Willy
[20:00:17] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:01:51] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 1085 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:05:17] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:05:31] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:15:26] SRE, ops-eqiad, DBA: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T328825 (Marostegui) Thank you!
[23:17:49] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:28:39] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring