[00:02:48] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8 [00:39:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:51:48] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:10:25] 10SRE, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531 (10Andrew) [01:31:02] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:40:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:53:08] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:26:51] 10SRE: Research improvements to Pwstore process - https://phabricator.wikimedia.org/T298194 (10BCornwall) This may or may not be relevant but the Arch Linux project has had similar issues of keeping track of expiring keys: They're moving toward WKD as infrastructure rather than the dying/unreliable keyserver inf... [02:29:25] 10SRE: Research improvements to Pwstore process - https://phabricator.wikimedia.org/T298194 (10Peachey88) [02:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:43:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [02:51:22] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:30:02] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:37:02] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: geoip_update_main.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:43:34] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:03:59] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1016.eqiad.wmnet [04:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:07:49] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [04:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:08:03] (03PS1) 10Andrew Bogott: Remove references to cloudvirt1016 [puppet] - 10https://gerrit.wikimedia.org/r/789890 (https://phabricator.wikimedia.org/T307825) [04:12:22] (03CR) 10Andrew Bogott: [C: 03+2] Remove references to cloudvirt1016 [puppet] - 10https://gerrit.wikimedia.org/r/789890 (https://phabricator.wikimedia.org/T307825) (owner: 10Andrew Bogott) [04:12:36] (03PS2) 10Andrew Bogott: Remove references to cloudvirt1016 [puppet] - 10https://gerrit.wikimedia.org/r/789890 (https://phabricator.wikimedia.org/T307825) [04:13:19] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [04:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:15:42] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1016.eqiad.wmnet [04:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:16:49] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudvirt1016 - https://phabricator.wikimedia.org/T307825 (10Andrew) a:03Cmjohnson [04:19:58] 10SRE, 10Privacy Engineering, 10Research, 10Security-Team, and 3 others: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (10bmansurov) @BBlack thank you. Who would be a good person to review the patch? @Dzahn, would you be able... [04:39:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:44:19] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:08:23] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:47:25] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:55:13] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:09:31] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:43:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220507T0700) [07:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:10:55] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:01:24] Rook: Hey, I noticed you created a local userpage at mw wiki, If you create a userpage on metawiki, It will act as a global userpage for all the wikis that you visit (and auto create a local account) - https://meta.wikimedia.org/wiki/Global_user_pages [08:01:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:19] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:18:29] p858snake: neat, I'll do that. Thanks! [08:39:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:58:47] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:58:53] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:59:55] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:01:01] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:04:13] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:24:27] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:43:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [10:53:43] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:19:07] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:06:37] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:39:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:04:49] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:05:51] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:02:51] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:08] (03PS1) 10Andrew Bogott: Horizon: include Victoria openstack bpos on all cloudweb/labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/789904 (https://phabricator.wikimedia.org/T307561) [14:15:46] (03Abandoned) 10Andrew Bogott: Horizon: include Victoria openstack bpos on all cloudweb/labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/789904 (https://phabricator.wikimedia.org/T307561) (owner: 10Andrew Bogott) [14:26:27] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:43:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [14:53:49] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:06:28] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: seeking consistency between codfw1dev and eqiad1 [15:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:39] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:16:52] 10SRE-swift-storage, 10Commons, 10Wikimedia-Site-requests, 10affects-Kiwix-and-openZIM: JPEG image is reported with the wrong mime-type application/octet-stream - https://phabricator.wikimedia.org/T298011 (10Kelson) @Reedy Would you be able please to help here? [15:17:23] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: seeking consistency between codfw1dev and eqiad1 (duration: 10m 55s) [15:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:32] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: seeking consistency between codfw1dev and eqiad1 [15:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:27] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: seeking consistency between codfw1dev and eqiad1 (duration: 10m 55s) [15:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:09] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): testing scapping to cloudweb2002 [15:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:42] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): testing scapping to cloudweb2002 (duration: 00m 33s) [15:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:48:59] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): (no justification provided) [15:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:04] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): (no justification provided) (duration: 00m 05s) [15:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:12] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): (no justification provided) [15:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:29] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): (no justification provided) (duration: 00m 17s) [15:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:36] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): testing scapping to cloudweb2002 [15:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:51] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): testing scapping to cloudweb2002 (duration: 01m 14s) [15:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:56:56] 10SRE-swift-storage, 10Commons, 10Wikimedia-Site-requests, 10affects-Kiwix-and-openZIM: JPEG image is reported with the wrong mime-type application/octet-stream - https://phabricator.wikimedia.org/T298011 (10Reedy) 05Open→03Resolved a:03Reedy ` reedy@deploy1002:~$ mwscript maintenance/refreshFileHead... [16:39:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:39:51] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 59.82 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [17:42:07] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [17:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:24:55] 10SRE-swift-storage, 10Commons, 10Wikimedia-Site-requests, 10affects-Kiwix-and-openZIM: JPEG image is reported with the wrong mime-type application/octet-stream - https://phabricator.wikimedia.org/T298011 (10Kelson) @Reedy Thank you! [18:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:43:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [19:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:20:33] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [19:38:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:41:05] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [19:43:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:03:23] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:05:07] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:32:01] (03PS4) 10Gergő Tisza: GrowthExperiments: Start mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775951 (https://phabricator.wikimedia.org/T307844) (owner: 10Kosta Harlan) [20:32:11] (03CR) 10jerkins-bot: [V: 04-1] GrowthExperiments: Start mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775951 (https://phabricator.wikimedia.org/T307844) (owner: 10Kosta Harlan) [20:39:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:04:35] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:05:35] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: seeking consistency between codfw1dev and eqiad1 [21:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:17] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:11:26] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: seeking consistency between codfw1dev and eqiad1 (duration: 05m 51s) [21:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:07] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:25:34] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: seeking consistency between codfw1dev and eqiad1 [21:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:38] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: seeking consistency between codfw1dev and eqiad1 (duration: 04m 04s) [21:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:07:37] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:43:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [23:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:02:21] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale