[00:09:37] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:16:05] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), No backups: 6 (dbprov1001, ...), Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210801T0700)
[08:58:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:00:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:10:45] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[09:12:41] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[09:29:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:32:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:21:51] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[11:27:39] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[11:58:14] SRE, SRE-swift-storage: Can't delete a file - https://phabricator.wikimedia.org/T287828 (Peachey88)
[11:59:11] SRE, SRE-swift-storage: Unable to delete `Балістичні таблиці P1720666.JPG` on uk.wikipedia - An unknown error occurred in storage backend "local-multiwrite" - https://phabricator.wikimedia.org/T287828 (Peachey88)
[11:59:27] SRE, SRE-swift-storage: Unable to delete `Балістичні таблиці P1720666.JPG` on uk.wikipedia - An unknown error occurred in storage backend "local-multiwrite" - https://phabricator.wikimedia.org/T287828 (RhinosF1) T244567 ?
[12:00:02] SRE, SRE-swift-storage: Unable to delete `Балістичні таблиці P1720666.JPG` on uk.wikipedia - An unknown error occurred in storage backend "local-multiwrite" - https://phabricator.wikimedia.org/T287828 (RhinosF1)
[12:00:13] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (RhinosF1)
[12:00:50] SRE-swift-storage, MediaWiki-File-management, Structured Data Engineering, Structured-Data-Backlog, Wikimedia-production-error: Cannot delete one image file on Thai Wikipedia: Error deleting file: An unknown error occurred in storage backend "local-mult... - https://phabricator.wikimedia.org/T270811
[12:08:47] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (Andriy.v) >>! In T244567#7231975, @WindE...
[12:17:17] PROBLEM - MariaDB memory on clouddb1019 is CRITICAL: CRIT Memory 98% used. Largest process: mysqld (18326) = 75.8% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:19:52] SRE-swift-storage, MediaWiki-File-management, Structured Data Engineering, Structured-Data-Backlog, Wikimedia-production-error: Cannot delete one image file on Thai Wikipedia: Error deleting file: An unknown error occurred in storage backend "local-mult... - https://phabricator.wikimedia.org/T270811
[12:20:02] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (Zabe)
[12:21:03] Ty zabe
[12:21:52] np
[12:22:12] * RhinosF1 didn't do it in case he was blind and missing anything
[12:25:04] * zabe just thought to himself: 'They can reopen the task again anyway in such a case'
[12:25:59] T174269 seems to also be a duplicate
[12:26:00] T174269: Two cases of local-multiwrite storage backend failure - https://phabricator.wikimedia.org/T174269
[12:26:41] Amir1: that's yours ^
[12:27:29] That seems older
[12:27:34] By 3 years than the main
[12:30:07] SRE, SRE-swift-storage: Two cases of local-multiwrite storage backend failure - https://phabricator.wikimedia.org/T174269 (Zabe) Is this the same as T244567?
[13:45:22] (PS5) Labdajiwa: Set the project namespace and sitename for Javanese Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/708206 (https://phabricator.wikimedia.org/T287437)
[15:45:11] PROBLEM - MariaDB memory on clouddb1019 is CRITICAL: CRIT Memory 98% used. Largest process: mysqld (18326) = 75.9% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[17:26:03] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[17:28:00] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 11 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[18:24:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:26:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:12:52] SRE, ops-eqiad: Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T287838 (ops-monitoring-bot)
[20:26:01] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[20:27:57] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 11 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[21:02:24] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[21:04:19] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 14 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[21:37:00] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[21:38:55] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator