[00:01:29] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:04:29] RECOVERY - Disk space on ml-etcd2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ml-etcd2002&var-datasource=codfw+prometheus/ops
[00:12:15] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: product-analytics-movement-metrics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:39:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[01:48:57] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.28% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[02:34:01] (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[02:34:31] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[02:39:16] (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[02:40:31] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[02:45:31] (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[02:46:31] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[05:29:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[05:34:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[05:53:45] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:39:51] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:46:46] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[06:54:59] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:41:03] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220109T0800)
[08:53:29] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:16:31] (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[09:45:27] SRE, ops-codfw, Discovery-Search, Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (elukey) Downtimed the host for a day (from now), so that it will not show up in icinga.
[09:54:37] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:57:16] SRE-swift-storage, Commons, MediaWiki-Uploading: File upload not working: Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T295343 (Aklapper)
[09:58:38] SRE, Wikimedia-Mailing-lists: Wikipedia-l list needs owners - https://phabricator.wikimedia.org/T295244 (Aklapper) @Quiddity: Anything left to do here, or can this be resolved? Thanks.
[10:15:44] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (Yann) Note that it is possible to overwr...
[10:24:42] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (Yann) Right now, 82 files in https://com...
[10:44:57] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:27:08] (CR) Majavah: [C: +1] Use namespaced CentralAuthUser [mediawiki-config] - https://gerrit.wikimedia.org/r/752308 (https://phabricator.wikimedia.org/T298840) (owner: Zabe)
[12:09:54] (PS2) Urbanecm: Growth: Add GEMentorDashboardDeploymentMode [mediawiki-config] - https://gerrit.wikimedia.org/r/752187 (https://phabricator.wikimedia.org/T298792)
[12:51:27] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[12:58:31] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:02:39] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:13:51] PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:24:43] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:51:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:01:24] SRE, Traffic: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563 (Yann) I get this error message, while the file uploaded fine. Very annoying. `Request from 2.7.118.246 via cp3064 cp3064, Varnish XID 322415563 Error: 503, Backend fetch failed at Sun, 09 Jan 2022 13:47:48 GMT`...
[14:03:02] SRE, Traffic: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563 (Yann) Resolved→Open This happens regularly for files from 25 to 35 MB.
[14:11:43] SRE, Traffic: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563 (RhinosF1) Open→Resolved Please open a new task. These will be separate issues.
[14:40:20] (Abandoned) RhinosF1: Use f string to avoid repeating words. [cookbooks] - https://gerrit.wikimedia.org/r/656923 (owner: RhinosF1)
[14:49:51] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:53:50] SRE: upgrade ping offload servers to bullseye (was: ping servers running out of disk) - https://phabricator.wikimedia.org/T273509 (Majavah) Stalled→Open
[15:00:53] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:04:57] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:07:42] (PS1) Majavah: P:mw::maintenance: add centralauth group purge job [puppet] - https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815)
[16:08:37] (CR) Majavah: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815) (owner: Majavah)
[16:10:03] (CR) Urbanecm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815) (owner: Majavah)
[16:17:19] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:22:18] (PS1) Majavah: beta: Enable temporary global user groups [mediawiki-config] - https://gerrit.wikimedia.org/r/752343 (https://phabricator.wikimedia.org/T153815)
[16:22:20] (PS1) Majavah: Enable temporary global user groups on production [mediawiki-config] - https://gerrit.wikimedia.org/r/752344 (https://phabricator.wikimedia.org/T153815)
[16:22:46] (CR) Majavah: [C: -2] "not yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/752344 (https://phabricator.wikimedia.org/T153815) (owner: Majavah)
[16:34:13] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:36:29] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:12:33] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/752343 (https://phabricator.wikimedia.org/T153815) (owner: Majavah)
[17:12:46] (CR) Urbanecm: [C: +1] "code LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/752344 (https://phabricator.wikimedia.org/T153815) (owner: Majavah)
[17:29:19] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:04:31] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:08:41] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:05:47] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:24:09] PROBLEM - Check systemd state on elastic2035 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:31:45] PROBLEM - MD RAID on elastic2035 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:31:46] ACKNOWLEDGEMENT - MD RAID on elastic2035 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T298853 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:31:50] SRE, ops-codfw: Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853 (ops-monitoring-bot)
[19:56:01] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:03:33] SRE, ops-codfw, Discovery-Search: Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853 (Peachey88)
[20:04:15] PROBLEM - Device not healthy -SMART- on elastic2035 is CRITICAL: cluster=elasticsearch device=sdb instance=elastic2035 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2035&var-datasource=codfw+prometheus/ops
[20:57:03] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:12:17] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:43:03] RECOVERY - Check systemd state on elastic2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:11:15] RECOVERY - Device not healthy -SMART- on elastic2035 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2035&var-datasource=codfw+prometheus/ops