[00:16:33] <icinga-wm>	 PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:20:29] <icinga-wm>	 RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:20:37] <icinga-wm>	 RECOVERY - snapshot of s3 in codfw on alert1001 is OK: Last snapshot for s3 at codfw (db2139.codfw.wmnet:3313) taken on 2021-08-14 21:08:01 (1122 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[01:52:45] <jinxer-wm>	 (Storage over 90%) firing: Storage over 90%   - https://alerts.wikimedia.org
[03:01:27] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[03:34:55] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[04:01:21] <icinga-wm>	 PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:02:13] <icinga-wm>	 RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:52:45] <jinxer-wm>	 (Storage over 90%) firing: Storage over 90%   - https://alerts.wikimedia.org
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210815T0700)
[07:06:19] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[07:31:47] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[07:37:21] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[08:05:05] <icinga-wm>	 PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:43:37] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[08:54:57] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[09:08:07] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[09:52:45] <jinxer-wm>	 (Storage over 90%) firing: Storage over 90%   - https://alerts.wikimedia.org
[11:07:47] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[11:13:21] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[12:40:37] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[12:42:33] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 12 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[12:50:57] <Bsadowski1>	 Ooh debian 11 is out
[13:52:45] <jinxer-wm>	 (Storage over 90%) firing: Storage over 90%   - https://alerts.wikimedia.org
[13:53:19] <icinga-wm>	 PROBLEM - Disk space on maps1004 is CRITICAL: DISK CRITICAL - free space: /srv 61487 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops
[14:10:19] <icinga-wm>	 RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:33:28] <wikibugs>	 (CR) Andrew Bogott: [C: +2] P::toolforge: Prepare apt repos for bullseye [puppet] - https://gerrit.wikimedia.org/r/713052 (https://phabricator.wikimedia.org/T284590) (owner: Majavah)
[15:10:52] <wikibugs>	 (PS2) Majavah: Add Bullseye based images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/713053 (https://phabricator.wikimedia.org/T284590)
[16:10:03] <logmsgbot>	 !log andrew@deploy1002 Started deploy [horizon/deploy@c23a155]: adding cinder volume resize warning
[16:10:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:55] <logmsgbot>	 !log andrew@deploy1002 Finished deploy [horizon/deploy@c23a155]: adding cinder volume resize warning (duration: 03m 52s)
[16:14:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:39] <wikibugs>	 (CR) Majavah: [C: +2] Add Bullseye based images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/713053 (https://phabricator.wikimedia.org/T284590) (owner: Majavah)
[16:27:17] <wikibugs>	 (Merged) jenkins-bot: Add Bullseye based images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/713053 (https://phabricator.wikimedia.org/T284590) (owner: Majavah)
[16:27:46] <wikibugs>	 (PS3) Majavah: kubernetes: Add Bullseye images [software/tools-webservice] - https://gerrit.wikimedia.org/r/713055 (https://phabricator.wikimedia.org/T284590)
[16:28:11] <wikibugs>	 (CR) Majavah: [C: +2] kubernetes: Add Bullseye images [software/tools-webservice] - https://gerrit.wikimedia.org/r/713055 (https://phabricator.wikimedia.org/T284590) (owner: Majavah)
[16:28:51] <wikibugs>	 (Merged) jenkins-bot: kubernetes: Add Bullseye images [software/tools-webservice] - https://gerrit.wikimedia.org/r/713055 (https://phabricator.wikimedia.org/T284590) (owner: Majavah)
[16:30:54] <wikibugs>	 (PS1) Majavah: d/changelog: Prepare for 0.76 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/713079 (https://phabricator.wikimedia.org/T284590)
[16:31:08] <wikibugs>	 (CR) Majavah: [C: +2] d/changelog: Prepare for 0.76 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/713079 (https://phabricator.wikimedia.org/T284590) (owner: Majavah)
[16:32:20] <wikibugs>	 (Merged) jenkins-bot: d/changelog: Prepare for 0.76 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/713079 (https://phabricator.wikimedia.org/T284590) (owner: Majavah)
[17:17:58] <wikibugs>	 (PS1) Majavah: php74-sssd: Add lighttpd-mod-openssl [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/713080
[17:18:05] <wikibugs>	 (CR) Majavah: [C: +2] php74-sssd: Add lighttpd-mod-openssl [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/713080 (owner: Majavah)
[17:18:43] <wikibugs>	 (Merged) jenkins-bot: php74-sssd: Add lighttpd-mod-openssl [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/713080 (owner: Majavah)
[17:52:45] <jinxer-wm>	 (Storage over 90%) firing: Storage over 90%   - https://alerts.wikimedia.org
[18:10:10] <wikibugs>	 (PS4) Jforrester: Provide nodejs12-slim and -devel based on Bullseye [docker-images/production-images] - https://gerrit.wikimedia.org/r/697672 (https://phabricator.wikimedia.org/T284346)
[19:14:11] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:16:11] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:29:03] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:57] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 7.589 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[20:02:18] <addshore>	 !log restarting blazegraph on wdqs2004
[20:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:37] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:03:47] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[20:26:21] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:01:13] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:02:28] <wikibugs>	 (PS1) Zabe: osm: migrate cron osm_sync_lag to systemd timer [puppet] - https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673)
[21:25:51] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:52:45] <jinxer-wm>	 (Storage over 90%) firing: Storage over 90%   - https://alerts.wikimedia.org
[22:01:59] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:26:49] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:02:19] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:26:03] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state