[00:16:33] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:20:29] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:20:37] RECOVERY - snapshot of s3 in codfw on alert1001 is OK: Last snapshot for s3 at codfw (db2139.codfw.wmnet:3313) taken on 2021-08-14 21:08:01 (1122 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[01:52:45] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
[03:01:27] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[03:34:55] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[04:01:21] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:02:13] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:52:45] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210815T0700)
[07:06:19] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[07:31:47] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[07:37:21] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[08:05:05] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:43:37] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[08:54:57] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[09:08:07] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[09:52:45] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
[11:07:47] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[11:13:21] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[12:40:37] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[12:42:33] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 12 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[12:50:57] Ooh debian 11 is out
[13:52:45] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
[13:53:19] PROBLEM - Disk space on maps1004 is CRITICAL: DISK CRITICAL - free space: /srv 61487 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops
[14:10:19] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:33:28] (CR) Andrew Bogott: [C: +2] P::toolforge: Prepare apt repos for bullseye [puppet] - https://gerrit.wikimedia.org/r/713052 (https://phabricator.wikimedia.org/T284590) (owner: Majavah)
[15:10:52] (PS2) Majavah: Add Bullseye based images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/713053 (https://phabricator.wikimedia.org/T284590)
[16:10:03] !log andrew@deploy1002 Started deploy [horizon/deploy@c23a155]: adding cinder volume resize warning
[16:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:55] !log andrew@deploy1002 Finished deploy [horizon/deploy@c23a155]: adding cinder volume resize warning (duration: 03m 52s)
[16:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:39] (CR) Majavah: [C: +2] Add Bullseye based images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/713053 (https://phabricator.wikimedia.org/T284590) (owner: Majavah)
[16:27:17] (Merged) jenkins-bot: Add Bullseye based images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/713053 (https://phabricator.wikimedia.org/T284590) (owner: Majavah)
[16:27:46] (PS3) Majavah: kubernetes: Add Bullseye images [software/tools-webservice] - https://gerrit.wikimedia.org/r/713055 (https://phabricator.wikimedia.org/T284590)
[16:28:11] (CR) Majavah: [C: +2] kubernetes: Add Bullseye images [software/tools-webservice] - https://gerrit.wikimedia.org/r/713055 (https://phabricator.wikimedia.org/T284590) (owner: Majavah)
[16:28:51] (Merged) jenkins-bot: kubernetes: Add Bullseye images [software/tools-webservice] - https://gerrit.wikimedia.org/r/713055 (https://phabricator.wikimedia.org/T284590) (owner: Majavah)
[16:30:54] (PS1) Majavah: d/changelog: Prepare for 0.76 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/713079 (https://phabricator.wikimedia.org/T284590)
[16:31:08] (CR) Majavah: [C: +2] d/changelog: Prepare for 0.76 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/713079 (https://phabricator.wikimedia.org/T284590) (owner: Majavah)
[16:32:20] (Merged) jenkins-bot: d/changelog: Prepare for 0.76 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/713079 (https://phabricator.wikimedia.org/T284590) (owner: Majavah)
[17:17:58] (PS1) Majavah: php74-sssd: Add lighttpd-mod-openssl [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/713080
[17:18:05] (CR) Majavah: [C: +2] php74-sssd: Add lighttpd-mod-openssl [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/713080 (owner: Majavah)
[17:18:43] (Merged) jenkins-bot: php74-sssd: Add lighttpd-mod-openssl [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/713080 (owner: Majavah)
[17:52:45] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
[18:10:10] (PS4) Jforrester: Provide nodejs12-slim and -devel based on Bullseye [docker-images/production-images] - https://gerrit.wikimedia.org/r/697672 (https://phabricator.wikimedia.org/T284346)
[19:14:11] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:16:11] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:29:03] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:57] PROBLEM - Query Service HTTP Port on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 7.589 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[20:02:18] !log restarting blazegraph on wdqs2004
[20:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:37] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:03:47] RECOVERY - Query Service HTTP Port on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[20:26:21] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:01:13] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:02:28] (PS1) Zabe: osm: migrate cron osm_sync_lag to systemd timer [puppet] - https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673)
[21:25:51] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:52:45] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
[22:01:59] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:26:49] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:02:19] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:26:03] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state