[00:15:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:17:58] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:31:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:31:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:34:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:34:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:49:04] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:56] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:34:38] <icinga-wm>	 RECOVERY - Disk space on centrallog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops
[01:51:12] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:01:14] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:10:48] <icinga-wm>	 PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:25:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:27:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:57:38] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:00] <icinga-wm>	 PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220130T0800)
[08:02:04] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:25:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:27:58] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:01:16] <icinga-wm>	 RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:07:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:11:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:20:39] <wikibugs>	 (03PS1) 10BrandonXLF: wiki replicas: Only hide log_params when bit 0 is on in log_delete [puppet] - 10https://gerrit.wikimedia.org/r/758081
[09:25:28] <wikibugs>	 (03PS2) 10BrandonXLF: wiki replicas: Only hide log_params when bit 0 is on in log_delete [puppet] - 10https://gerrit.wikimedia.org/r/758081
[09:27:13] <jinxer-wm>	 (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[09:37:13] <jinxer-wm>	 (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[09:54:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:57:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:06:38] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1029 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:16] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:55:08] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::redis_sentinel: fix hardcoded interface [puppet] - 10https://gerrit.wikimedia.org/r/758090 (https://phabricator.wikimedia.org/T153810)
[15:32:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[15:42:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[15:46:58] <wikibugs>	 (03PS1) 10Majavah: openstack encapi: Drop special treatment for puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/758091
[15:48:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack encapi: Drop special treatment for puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/758091 (owner: 10Majavah)
[15:50:13] <wikibugs>	 (03PS2) 10Majavah: openstack encapi: Drop special treatment for puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/758091
[15:52:39] <wikibugs>	 (03PS1) 10Majavah: hieradata: pcc: add cloudinfra-internal-puppetmaster02 key [puppet] - 10https://gerrit.wikimedia.org/r/758092
[15:54:59] <wikibugs>	 (03PS1) 10Majavah: spec_helper: increase max supported_os to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/758093
[16:09:46] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:28:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:30:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:53:37] <wikibugs>	 (03PS1) 10Majavah: P:toolforge: remove useless prometheus_fixup [puppet] - 10https://gerrit.wikimedia.org/r/758095
[20:18:55] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] backup: remove fileset for static-bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/757509 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn)
[20:25:24] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:27:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:35:53] <wikibugs>	 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10AndyRussG) Hi again @faidon! A few more notes here... For the "Add your site manually", options for verification are:  1) add a small XML file to a site's root directory;  2) add an...
[20:38:06] <icinga-wm>	 PROBLEM - Disk space on centrallog1001 is CRITICAL: DISK CRITICAL - free space: /srv 33663 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops
[20:51:36] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:05:50] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[23:33:55] <icinga-wm>	 PROBLEM - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: / 1405 MB (3% inode=95%): /tmp 1405 MB (3% inode=95%): /var/tmp 1405 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops