[00:01:29] <icinga-wm>	 RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:04:29] <icinga-wm>	 RECOVERY - Disk space on ml-etcd2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ml-etcd2002&var-datasource=codfw+prometheus/ops
[00:12:15] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: product-analytics-movement-metrics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:59] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:55] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:07] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:39:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[01:48:57] <icinga-wm>	 PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.28% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[02:34:01] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[02:34:31] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[02:39:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[02:40:31] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[02:45:31] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[02:46:31] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[05:29:13] <jinxer-wm>	 (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[05:34:13] <jinxer-wm>	 (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[05:53:45] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:39:51] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:46:46] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[06:54:59] <icinga-wm>	 RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:41:03] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220109T0800)
[08:53:29] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:16:31] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[09:45:27] <wikibugs>	 SRE, ops-codfw, Discovery-Search, Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (elukey) Downtimed the host for a day (from now), so that it will not show up in icinga.
[09:54:37] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:57:16] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-Uploading: File upload not working: Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T295343 (Aklapper)
[09:58:38] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Wikipedia-l list needs owners - https://phabricator.wikimedia.org/T295244 (Aklapper) @Quiddity: Anything left to do here, or can this be resolved? Thanks.
[10:15:44] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (Yann) Note that it is possible to overwr...
[10:24:42] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (Yann) Right now, 82 files in https://com...
[10:44:57] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:27:08] <wikibugs>	 (CR) Majavah: [C: +1] Use namespaced CentralAuthUser [mediawiki-config] - https://gerrit.wikimedia.org/r/752308 (https://phabricator.wikimedia.org/T298840) (owner: Zabe)
[12:09:54] <wikibugs>	 (PS2) Urbanecm: Growth: Add GEMentorDashboardDeploymentMode [mediawiki-config] - https://gerrit.wikimedia.org/r/752187 (https://phabricator.wikimedia.org/T298792)
[12:51:27] <icinga-wm>	 RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[12:58:31] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:02:39] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:13:51] <icinga-wm>	 PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:24:43] <icinga-wm>	 PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:51:05] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:11] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:01:24] <wikibugs>	 SRE, Traffic: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563 (Yann) I get this error message, while the file uploaded fine. Very annoying.  `Request from 2.7.118.246 via cp3064 cp3064, Varnish XID 322415563 Error: 503, Backend fetch failed at Sun, 09 Jan 2022 13:47:48 GMT`...
[14:03:02] <wikibugs>	 SRE, Traffic: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563 (Yann) Resolved→Open This happens regularly for files from 25 to 35 MB.
[14:11:43] <wikibugs>	 SRE, Traffic: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563 (RhinosF1) Open→Resolved Please open a new task. These will be separate issues.
[14:40:20] <wikibugs>	 (Abandoned) RhinosF1: Use f string to avoid repeating words. [cookbooks] - https://gerrit.wikimedia.org/r/656923 (owner: RhinosF1)
[14:49:51] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:53:50] <wikibugs>	 SRE: upgrade ping offload servers to bullseye (was: ping servers running out of disk) - https://phabricator.wikimedia.org/T273509 (Majavah) Stalled→Open
[15:00:53] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:04:57] <icinga-wm>	 RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:07:42] <wikibugs>	 (PS1) Majavah: P:mw::maintenance: add centralauth group purge job [puppet] - https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815)
[16:08:37] <wikibugs>	 (CR) Majavah: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815) (owner: Majavah)
[16:10:03] <wikibugs>	 (CR) Urbanecm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815) (owner: Majavah)
[16:17:19] <icinga-wm>	 RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:22:18] <wikibugs>	 (PS1) Majavah: beta: Enable temporary global user groups [mediawiki-config] - https://gerrit.wikimedia.org/r/752343 (https://phabricator.wikimedia.org/T153815)
[16:22:20] <wikibugs>	 (PS1) Majavah: Enable temporary global user groups on production [mediawiki-config] - https://gerrit.wikimedia.org/r/752344 (https://phabricator.wikimedia.org/T153815)
[16:22:46] <wikibugs>	 (CR) Majavah: [C: -2] "not yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/752344 (https://phabricator.wikimedia.org/T153815) (owner: Majavah)
[16:34:13] <icinga-wm>	 PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:36:29] <icinga-wm>	 RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:12:33] <wikibugs>	 (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/752343 (https://phabricator.wikimedia.org/T153815) (owner: Majavah)
[17:12:46] <wikibugs>	 (CR) Urbanecm: [C: +1] "code LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/752344 (https://phabricator.wikimedia.org/T153815) (owner: Majavah)
[17:29:19] <icinga-wm>	 RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:04:31] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:08:41] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:05:47] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:24:09] <icinga-wm>	 PROBLEM - Check systemd state on elastic2035 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:31:45] <icinga-wm>	 PROBLEM - MD RAID on elastic2035 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:31:46] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on elastic2035 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T298853 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:31:50] <wikibugs>	 SRE, ops-codfw: Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853 (ops-monitoring-bot)
[19:56:01] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:03:33] <wikibugs>	 SRE, ops-codfw, Discovery-Search: Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853 (Peachey88)
[20:04:15] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on elastic2035 is CRITICAL: cluster=elasticsearch device=sdb instance=elastic2035 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2035&var-datasource=codfw+prometheus/ops
[20:57:03] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:12:17] <icinga-wm>	 RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:43:03] <icinga-wm>	 RECOVERY - Check systemd state on elastic2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:11:15] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on elastic2035 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2035&var-datasource=codfw+prometheus/ops