[00:07:06] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1016.eqiad.wmnet with OS bullseye
[00:07:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:12] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1016.eqiad.wmnet with OS bullseye
[00:08:54] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (Papaul) 1016 already had Buster installed. I am re-running the cookbook to install Bullseye
[00:11:34] <icinga-wm>	 PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:46] <icinga-wm>	 PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS5511/IPv4: Idle - Orange, AS5511/IPv6: Idle - Orange https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:20:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[00:39:56] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1016.eqiad.wmnet with reason: host reimage
[00:39:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:22] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1016.eqiad.wmnet with reason: host reimage
[00:43:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:58] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 59, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:52:08] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:55:56] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:56:35] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1016.eqiad.wmnet with OS bullseye
[00:56:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:41] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1016.eqiad.wmnet with OS bullseye completed: - aqs1016 (**PASS**)...
[00:57:44] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:59:22] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 60, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:04:44] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:05:41] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:07:06] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1017.eqiad.wmnet with OS bullseye
[01:07:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:11] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1017.eqiad.wmnet with OS bullseye
[01:08:03] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (Papaul)
[01:10:34] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:34] <icinga-wm>	 RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:13:06] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:16:30] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:17:12] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:23:06] <wikibugs>	 (CR) Tim Starling: "Getting bored of waiting for review, tempted to just merge it" [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling)
[01:37:19] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:38:52] <wikibugs>	 SRE, Platform Engineering, serviceops, Performance-Team (Radar): Phasing out "redis_sessions" MediaWiki cluster and away from the memcached cluster - https://phabricator.wikimedia.org/T267581 (tstarling)
[01:39:00] <wikibugs>	 SRE, Performance-Team, Platform Engineering, Goal: Decommission the "session redis" cluster - https://phabricator.wikimedia.org/T243520 (tstarling)
[01:39:28] <wikibugs>	 (Abandoned) Tim Starling: Switch wgMainStash back to Redis [mediawiki-config] - https://gerrit.wikimedia.org/r/804024 (https://phabricator.wikimedia.org/T212129) (owner: Tim Starling)
[01:39:54] <wikibugs>	 SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (tstarling) Open→Resolved Metrics on db1151 look fine. Disk space usage on db1151 is growing at a rate of...
[01:39:55] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1017.eqiad.wmnet with reason: host reimage
[01:39:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:42:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:43:04] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1017.eqiad.wmnet with reason: host reimage
[01:43:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:28] <icinga-wm>	 RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:10] <wikibugs>	 (CR) Thcipriani: [C: +1] Fix unsupported $wgLogos default configurations [mediawiki-config] - https://gerrit.wikimedia.org/r/806068 (https://phabricator.wikimedia.org/T310767) (owner: Thiemo Kreuz (WMDE))
[01:54:47] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1017.eqiad.wmnet with OS bullseye
[01:54:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:52] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1017.eqiad.wmnet with OS bullseye completed: - aqs1017 (**WARN**)...
[02:00:47] <wikibugs>	 (CR) Tim Starling: [C: +2] "To be on the safe side, I'll do an eval.php check of the tagline default before scap, then I'll check for a missing tagline at https://aa." [mediawiki-config] - https://gerrit.wikimedia.org/r/806068 (https://phabricator.wikimedia.org/T310767) (owner: Thiemo Kreuz (WMDE))
[02:01:37] <wikibugs>	 (Merged) jenkins-bot: Fix unsupported $wgLogos default configurations [mediawiki-config] - https://gerrit.wikimedia.org/r/806068 (https://phabricator.wikimedia.org/T310767) (owner: Thiemo Kreuz (WMDE))
[02:02:25] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1018.eqiad.wmnet with OS bullseye
[02:02:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:02:31] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1018.eqiad.wmnet with OS bullseye
[02:03:00] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:47] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 03m 43s)
[02:06:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:41] <wikibugs>	 (CR) Tim Starling: [C: +2] "Seems fine." [mediawiki-config] - https://gerrit.wikimedia.org/r/806068 (https://phabricator.wikimedia.org/T310767) (owner: Thiemo Kreuz (WMDE))
[02:08:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:08:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:40] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 59, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:09:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:09:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:09:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:10:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:20:00] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 60, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:36:24] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1018.eqiad.wmnet with reason: host reimage
[02:36:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:39:31] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1018.eqiad.wmnet with reason: host reimage
[02:39:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:34] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 59, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:50:56] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 60, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:51:12] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:51:22] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1018.eqiad.wmnet with OS bullseye
[02:51:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:51:27] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1018.eqiad.wmnet with OS bullseye completed: - aqs1018 (**WARN**)...
[03:12:32] <icinga-wm>	 RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 25, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:20:58] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (Papaul)
[03:38:01] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:52:35] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:58:31] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:59:51] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:06:33] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:11:07] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:53] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:19:57] <icinga-wm>	 PROBLEM - SSH on an-worker1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:20:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:24:07] <icinga-wm>	 PROBLEM - Host an-worker1109 is DOWN: PING CRITICAL - Packet loss = 100%
[04:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[04:53:39] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:59:37] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:00:37] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:03:00] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:08:55] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:10:33] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:10:41] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:13:03] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:22:01] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:23:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:26:37] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:32:23] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[06:46:53] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:43] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:55:59] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220617T0700)
[07:02:47] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:11:15] <wikibugs>	 (PS2) Muehlenhoff: Remove webperf1002/webperf2002 from Kafka firewall rules [puppet] - https://gerrit.wikimedia.org/r/804334 (https://phabricator.wikimedia.org/T305460)
[07:13:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:15:33] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove webperf1002/webperf2002 from Kafka firewall rules [puppet] - https://gerrit.wikimedia.org/r/804334 (https://phabricator.wikimedia.org/T305460) (owner: Muehlenhoff)
[07:16:50] <wikibugs>	 (CR) Muehlenhoff: [V: +2 C: +2] cas: Update to 6.5.5 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/806203 (https://phabricator.wikimedia.org/T305518) (owner: Muehlenhoff)
[07:17:15] <wikibugs>	 (Abandoned) Muehlenhoff: envoyproxy: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/799311 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:17:38] <wikibugs>	 (Abandoned) Muehlenhoff: cas: Update to 6.5.5 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/806174 (https://phabricator.wikimedia.org/T305518) (owner: Muehlenhoff)
[07:17:51] <wikibugs>	 (PS2) Muehlenhoff: Bump changelog for 6.5.5 and add some docs how to resync the overlay [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518)
[07:23:15] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:27:55] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/806216 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:31:32] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/806218 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:31:38] <wikibugs>	 (PS2) Muehlenhoff: spamassassin: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/806218 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:32:21] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:38:19] <wikibugs>	 (CR) JMeybohm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/806208 (owner: Jbond)
[07:39:13] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:39:45] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/806219 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:39:52] <wikibugs>	 (PS2) Muehlenhoff: tomcat: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/806219 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:41:23] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on ml-staging-ctrl[2001-2002].codfw.wmnet with reason: Rebooting to activate new kernel for T310483
[07:41:25] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ml-staging-ctrl[2001-2002].codfw.wmnet with reason: Rebooting to activate new kernel for T310483
[07:41:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:29] <icinga-wm>	 RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:24] <wikibugs>	 (PS1) Kosta Harlan: GrowthExperiments: Enable link recommendations frontend, round 4 [mediawiki-config] - https://gerrit.wikimedia.org/r/806365 (https://phabricator.wikimedia.org/T304548)
[07:50:02] <wikibugs>	 (CR) Muehlenhoff: vrts: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806220 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:54:28] <wikibugs>	 (PS1) Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673)
[08:01:45] <wikibugs>	 (PS2) Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673)
[08:02:42] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet
[08:02:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:29] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:06:35] <wikibugs>	 (PS3) Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673)
[08:07:45] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35899/console" [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:08:39] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2001.codfw.wmnet
[08:08:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:46] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging2002.codfw.wmnet
[08:10:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:39] <wikibugs>	 (CR) Slyngshede: [C: +2] snapshot: migrate adds-changes cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/779016 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[08:11:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:12:57] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:17:09] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2002.codfw.wmnet
[08:17:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti4004.ulsfo.wmnet with reason: Enable virt in BIOS
[08:17:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti4004.ulsfo.wmnet with reason: Enable virt in BIOS
[08:17:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:56] <jinxer-wm>	 (CertManagerCertNotReady) resolved: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[08:20:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:21:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[08:21:52] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on ml-serve-ctrl[2001-2002].codfw.wmnet with reason: Rebooting to activate new kernel for T310483
[08:21:54] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ml-serve-ctrl[2001-2002].codfw.wmnet with reason: Rebooting to activate new kernel for T310483
[08:21:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:33] <wikibugs>	 (PS1) Ayounsi: Prometheus/Netbox: use netbox.wikimedia.org SNI [puppet] - https://gerrit.wikimedia.org/r/806368 (https://phabricator.wikimedia.org/T243928)
[08:22:55] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:22:56] <jinxer-wm>	 (CertManagerCertNotReady) firing: (2) Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[08:24:43] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] P:apt do not include private apt repo on cloud hosts. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806197 (owner: Slyngshede)
[08:27:27] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:27:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/806368 (https://phabricator.wikimedia.org/T243928) (owner: Ayounsi)
[08:29:15] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:31:43] <wikibugs>	 (CR) Ayounsi: [C: +2] Prometheus/Netbox: use netbox.wikimedia.org SNI [puppet] - https://gerrit.wikimedia.org/r/806368 (https://phabricator.wikimedia.org/T243928) (owner: Ayounsi)
[08:33:47] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[08:37:12] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (MoritzMuehlenhoff)
[08:37:35] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (MoritzMuehlenhoff) Resolved→Open The server doesn't have virtualisation enabled. I tried to enable it via the BIOS over the serial console, but I'm not getting a cons...
[08:38:43] <icinga-wm>	 RECOVERY - Host an-worker1109 is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms
[08:39:07] <icinga-wm>	 RECOVERY - SSH on an-worker1109 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:39:44] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2001.codfw.wmnet
[08:39:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:07] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:41:28] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (MoritzMuehlenhoff) The server can be powered down any time, while it already has the ganeti role, it's not yet added to the cluster.
[08:45:11] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:47:27] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:47:47] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2001.codfw.wmnet
[08:47:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:04] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM added hashar as a heads up" [puppet] - https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[08:49:48] <wikibugs>	 (CR) Jbond: [C: +1] SREBatchBase: Fix broken batchsize argument [cookbooks] - https://gerrit.wikimedia.org/r/806286 (owner: JMeybohm)
[08:51:20] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2002.codfw.wmnet
[08:51:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:37] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:53:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:56:03] <wikibugs>	 (CR) Tacsipacsi: CommonSettings: clean up and simplify some code (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805433 (owner: DannyS712)
[08:56:39] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:57:09] <wikibugs>	 (CR) Jbond: [C: +1] SREBatchBase: Fix broken batchsize argument [cookbooks] - https://gerrit.wikimedia.org/r/806286 (owner: JMeybohm)
[08:57:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:58:28] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2002.codfw.wmnet
[08:58:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:55] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:01:22] <wikibugs>	 (PS3) Muehlenhoff: Bump changelog for 6.5.5 and add some docs how to resync the overlay [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518)
[09:01:42] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2003.codfw.wmnet
[09:01:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:53] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/806285 (owner: JMeybohm)
[09:02:51] <icinga-wm>	 PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 101.1 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[09:02:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:02:58] <wikibugs>	 (CR) Muehlenhoff: Bump changelog for 6.5.5 and add some docs how to resync the overlay (2 comments) [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518) (owner: Muehlenhoff)
[09:04:03] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:07:33] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm)
[09:07:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:08:03] <wikibugs>	 (PS2) Zabe: snapshot: remove absented add-changes cron [puppet] - https://gerrit.wikimedia.org/r/779017 (https://phabricator.wikimedia.org/T273673)
[09:09:43] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2003.codfw.wmnet
[09:09:44] <wikibugs>	 (CR) Jbond: [C: +2] php: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/806217 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[09:09:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:39] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2004.codfw.wmnet
[09:11:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:38] <wikibugs>	 (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/35900/" [puppet] - https://gerrit.wikimedia.org/r/805836 (owner: Muehlenhoff)
[09:12:55] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:12:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (2) ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:13:13] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:14:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2003.codfw.wmnet
[09:14:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:15] <icinga-wm>	 PROBLEM - Host ganeti4004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:18:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2003.codfw.wmnet
[09:18:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:42] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2004.codfw.wmnet
[09:19:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti4004.ulsfo.wmnet with reason: Enable virt in BIOS
[09:23:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti4004.ulsfo.wmnet with reason: Enable virt in BIOS
[09:23:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2004.codfw.wmnet
[09:23:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:24] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2005.codfw.wmnet
[09:24:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2004.codfw.wmnet
[09:25:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:57] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:26:59] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: add profile::pontoon::base [puppet] - https://gerrit.wikimedia.org/r/806373
[09:27:01] <wikibugs>	 (PS1) Filippo Giunchedi: base: include profile::pontoon::base [puppet] - https://gerrit.wikimedia.org/r/806374
[09:27:03] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: fix race between SD/dnsmasq and resolvconf [puppet] - https://gerrit.wikimedia.org/r/806375
[09:27:05] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: enable SD for stack observability [puppet] - https://gerrit.wikimedia.org/r/806376
[09:27:07] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: update hiera configs [puppet] - https://gerrit.wikimedia.org/r/806377
[09:27:09] <wikibugs>	 (PS1) Filippo Giunchedi: wmcs: add default for metricsinfra_prometheus_nodes [puppet] - https://gerrit.wikimedia.org/r/806378
[09:27:11] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: add metricsinfra_prometheus_nodes to settings [puppet] - https://gerrit.wikimedia.org/r/806379
[09:28:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1003.eqiad.wmnet
[09:28:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:07] <wikibugs>	 (PS4) Muehlenhoff: Bump changelog for 6.5.5 and add some docs how to resync the overlay [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518)
[09:30:24] <wikibugs>	 (CR) CI reject: [V: -1] pontoon: add profile::pontoon::base [puppet] - https://gerrit.wikimedia.org/r/806373 (owner: Filippo Giunchedi)
[09:30:53] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2005.codfw.wmnet
[09:30:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:00] <wikibugs>	 (CR) Jbond: [C: +1] sre.k8s.reboot-node: Dynamically adjust batchsize [cookbooks] - https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm)
[09:32:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1003.eqiad.wmnet
[09:32:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1004.eqiad.wmnet
[09:33:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:28] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2006.codfw.wmnet
[09:34:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1004.eqiad.wmnet
[09:35:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:07] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:38:04] <wikibugs>	 (CR) Jbond: "This is probably acceptable as long as we track it, however please get a +1 from moritz so its definitely on thier radar" [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[09:40:35] <wikibugs>	 (PS1) Btullis: Disable the telemetry for DataHub [deployment-charts] - https://gerrit.wikimedia.org/r/806381 (https://phabricator.wikimedia.org/T310079)
[09:40:37] <wikibugs>	 (CR) Volans: "What if instead we solve the problem accepting both absolute and percentage batch sizes values like cumin?" [cookbooks] - https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm)
[09:41:54] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2006.codfw.wmnet
[09:41:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:57] <icinga-wm>	 PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 104.5 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[09:44:05] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2007.codfw.wmnet
[09:44:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:15] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:46:09] <wikibugs>	 (CR) Btullis: [C: +2] Disable the telemetry for DataHub [deployment-charts] - https://gerrit.wikimedia.org/r/806381 (https://phabricator.wikimedia.org/T310079) (owner: Btullis)
[09:50:15] <wikibugs>	 (Merged) jenkins-bot: Disable the telemetry for DataHub [deployment-charts] - https://gerrit.wikimedia.org/r/806381 (https://phabricator.wikimedia.org/T310079) (owner: Btullis)
[09:50:33] <wikibugs>	 (PS1) Jbond: P:promethous::ops: add host header to scrap config [puppet] - https://gerrit.wikimedia.org/r/806382
[09:51:49] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2007.codfw.wmnet
[09:51:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:37] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[09:52:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:58] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[09:53:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:29] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main
[09:55:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:51] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[09:55:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:03] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main
[09:56:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:33] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[09:56:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:05] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2008.codfw.wmnet
[09:58:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:55] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:03:00] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:05:35] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2008.codfw.wmnet
[10:05:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:17] <wikibugs>	 (PS1) Jbond: P:netbox: add proxy for the metricts endpoint in the exports vhost [puppet] - https://gerrit.wikimedia.org/r/806383
[10:11:30] <wikibugs>	 (Abandoned) Jbond: P:promethous::ops: add host header to scrap config [puppet] - https://gerrit.wikimedia.org/r/806382 (owner: Jbond)
[10:12:12] <wikibugs>	 (PS1) Btullis: Update the container image used for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/806384 (https://phabricator.wikimedia.org/T310629)
[10:14:23] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35904/console" [puppet] - https://gerrit.wikimedia.org/r/806383 (owner: Jbond)
[10:18:41] <wikibugs>	 (PS2) Jbond: P:netbox: add proxy for the metricts endpoint in the exports vhost [puppet] - https://gerrit.wikimedia.org/r/806383
[10:19:14] <wikibugs>	 (CR) Muehlenhoff: "Enable unpriv user_ns seems fine for this use case, but I think two aspects are relevant here:" [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[10:20:19] <wikibugs>	 (CR) Btullis: [C: +2] Update the container image used for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/806384 (https://phabricator.wikimedia.org/T310629) (owner: Btullis)
[10:21:44] <wikibugs>	 (CR) Ayounsi: [C: +1] P:netbox: add proxy for the metricts endpoint in the exports vhost [puppet] - https://gerrit.wikimedia.org/r/806383 (owner: Jbond)
[10:21:46] <wikibugs>	 (PS3) Jbond: P:netbox: add proxy for the metricts endpoint in the exports vhost [puppet] - https://gerrit.wikimedia.org/r/806383
[10:22:11] <wikibugs>	 (PS4) Jbond: P:netbox: add proxy for the metricts endpoint in the exports vhost [puppet] - https://gerrit.wikimedia.org/r/806383
[10:22:15] <wikibugs>	 (CR) Muehlenhoff: [V: +2 C: +2] Bump changelog for 6.5.5 and add some docs how to resync the overlay [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518) (owner: Muehlenhoff)
[10:22:51] <wikibugs>	 (PS1) Ayounsi: Revert "Prometheus/Netbox: use netbox.wikimedia.org SNI" [puppet] - https://gerrit.wikimedia.org/r/806252
[10:23:25] <wikibugs>	 (Merged) jenkins-bot: Update the container image used for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/806384 (https://phabricator.wikimedia.org/T310629) (owner: Btullis)
[10:24:27] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (MoritzMuehlenhoff)
[10:25:08] <wikibugs>	 (CR) Jbond: [C: +1] Revert "Prometheus/Netbox: use netbox.wikimedia.org SNI" [puppet] - https://gerrit.wikimedia.org/r/806252 (owner: Ayounsi)
[10:25:21] <wikibugs>	 (CR) Jbond: [C: +2] P:netbox: add proxy for the metricts endpoint in the exports vhost [puppet] - https://gerrit.wikimedia.org/r/806383 (owner: Jbond)
[10:25:59] <wikibugs>	 (CR) Ayounsi: [C: +2] Revert "Prometheus/Netbox: use netbox.wikimedia.org SNI" [puppet] - https://gerrit.wikimedia.org/r/806252 (owner: Ayounsi)
[10:28:02] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (MoritzMuehlenhoff) Open→Resolved This is complete. The ulsfo cluster is affected by T309724, but that will be investigated via that task (and it doesn't have a functional impact apart fr...
[10:28:22] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (MoritzMuehlenhoff) Open→Resolved This is complete. The eqsin cluster is affected by T309724, but that will be investigated via that task (and it doesn't have a functional impact apart fr...
[10:28:42] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 (MoritzMuehlenhoff) Open→Resolved This is complete.
[10:31:04] <wikibugs>	 (PS3) Volans: icinga: ensure that the downtime was applied [software/spicerack] - https://gerrit.wikimedia.org/r/803317 (https://phabricator.wikimedia.org/T309447)
[10:32:43] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[10:32:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:59] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[10:34:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:15] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main
[10:34:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:16] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[10:35:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:30] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main
[10:35:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:32] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[10:36:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:23] <wikibugs>	 (PS1) Cathal Mooney: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299)
[10:45:27] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:46:03] <wikibugs>	 (CR) CI reject: [V: -1] Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: Cathal Mooney)
[10:48:58] <wikibugs>	 (PS2) Cathal Mooney: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299)
[10:50:14] <wikibugs>	 (CR) CI reject: [V: -1] Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: Cathal Mooney)
[10:51:03] <wikibugs>	 (PS1) Slyngshede: Admin: grant samtar access to deployment [puppet] - https://gerrit.wikimedia.org/r/806391 (https://phabricator.wikimedia.org/T302231)
[10:53:23] <wikibugs>	 (PS2) Slyngshede: Admin: grant samtar access to deployment [puppet] - https://gerrit.wikimedia.org/r/806391 (https://phabricator.wikimedia.org/T302231)
[10:54:18] <wikibugs>	 (PS1) Jbond: netbox: add hostname to allowed list of hostnames [puppet] - https://gerrit.wikimedia.org/r/806392
[10:54:22] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/806391 (https://phabricator.wikimedia.org/T302231) (owner: Slyngshede)
[10:54:35] <wikibugs>	 (PS3) Cathal Mooney: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299)
[10:55:05] <wikibugs>	 (CR) Slyngshede: [C: +2] Admin: grant samtar access to deployment [puppet] - https://gerrit.wikimedia.org/r/806391 (https://phabricator.wikimedia.org/T302231) (owner: Slyngshede)
[10:56:16] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (SLyngshede-WMF)
[10:56:25] <wikibugs>	 (CR) CI reject: [V: -1] Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: Cathal Mooney)
[10:57:06] <wikibugs>	 (CR) CI reject: [V: -1] netbox: add hostname to allowed list of hostnames [puppet] - https://gerrit.wikimedia.org/r/806392 (owner: Jbond)
[10:57:21] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF
[10:58:50] <wikibugs>	 (PS2) Jbond: netbox: add hostname to allowed list of hostnames [puppet] - https://gerrit.wikimedia.org/r/806392
[11:00:00] <wikibugs>	 SRE, LDAP-Access-Requests, Product-Analytics: Requesting access to Superset for Ricardo Baeza-Yates - https://phabricator.wikimedia.org/T310227 (SLyngshede-WMF) Open→Resolved
[11:00:44] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe1010.eqiad.wmnet
[11:00:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:11] <wikibugs>	 (CR) Jbond: [C: +2] netbox: add hostname to allowed list of hostnames [puppet] - https://gerrit.wikimedia.org/r/806392 (owner: Jbond)
[11:06:27] <wikibugs>	 (PS1) Muehlenhoff: Add missing file [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/806393
[11:06:49] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1010.eqiad.wmnet
[11:06:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:58] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe1011.eqiad.wmnet
[11:07:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:13] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:08:41] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:08:48] <wikibugs>	 (CR) Muehlenhoff: [V: +2 C: +2] Add missing file [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/806393 (owner: Muehlenhoff)
[11:09:04] <wikibugs>	 (PS1) Jbond: Revert "netbox: add hostname to allowed list of hostnames" [puppet] - https://gerrit.wikimedia.org/r/806253
[11:09:17] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:10:41] <wikibugs>	 (PS1) Kevin Bazira: ml-services: add ptwiki draftquality isvc [deployment-charts] - https://gerrit.wikimedia.org/r/806394 (https://phabricator.wikimedia.org/T310704)
[11:10:47] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:11:23] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.332 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:12:01] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_esams_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:12:29] <wikibugs>	 (03PS1) 10Jbond: C:netbox: fic typo [puppet] - 10https://gerrit.wikimedia.org/r/806395
[11:12:33] <icinga-wm>	 PROBLEM - Check systemd state on netbox2002 is CRITICAL: CRITICAL - degraded: The following units failed: rq-netbox.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:12:39] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service,netbox_ganeti_esams_sync.service,netbox_report_puppetdb_virtual_run.service,rq-netbox.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:13:07] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:13:21] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1011.eqiad.wmnet
[11:13:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:10] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] MentorDashboard: enable the Vue version of the dashboard in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno)
[11:15:22] <wikibugs>	 (03PS2) 10Jbond: C:netbox: fic typo [puppet] - 10https://gerrit.wikimedia.org/r/806395
[11:16:26] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe1012.eqiad.wmnet
[11:16:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:57] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:17:40] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35908/console" [puppet] - 10https://gerrit.wikimedia.org/r/806395 (owner: 10Jbond)
[11:17:54] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] C:netbox: fic typo [puppet] - 10https://gerrit.wikimedia.org/r/806395 (owner: 10Jbond)
[11:20:41] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:22:47] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1012.eqiad.wmnet
[11:22:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:19] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_esams_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:23:45] <wikibugs>	 (03PS1) 10Btullis: Disable native authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/806396 (https://phabricator.wikimedia.org/T310079)
[11:26:32] <wikibugs>	 (03PS1) 10Jaime Nuche: scap bootstrap: refactor [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740)
[11:28:00] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Disable native authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/806396 (https://phabricator.wikimedia.org/T310079) (owner: 10Btullis)
[11:31:06] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe2010.codfw.wmnet
[11:31:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:09] <wikibugs>	 (03Merged) 10jenkins-bot: Disable native authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/806396 (https://phabricator.wikimedia.org/T310079) (owner: 10Btullis)
[11:31:57] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:32:24] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[11:32:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:59] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[11:33:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:12] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main
[11:33:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:15] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[11:35:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:27] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main
[11:35:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:03] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:36:13] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[11:36:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:09] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2010.codfw.wmnet
[11:37:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:42] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe2011.codfw.wmnet
[11:37:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:47] <icinga-wm>	 RECOVERY - Check systemd state on netbox2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:39:03] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] snapshot: remove absented add-changes cron [puppet] - 10https://gerrit.wikimedia.org/r/779017 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[11:40:09] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:40:17] <moritzm>	 !log upload cas 6.5.5+wmf11u1 to apt.wikimedia.org T305518
[11:40:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:22] <stashbot>	 T305518: Upgrade IDPs to CAS 6.5/Bullseye and enable webauthn - https://phabricator.wikimedia.org/T305518
[11:43:33] <wikibugs>	 (03PS1) 10Ayounsi: Revert "Revert "Prometheus/Netbox: use netbox.wikimedia.org SNI"" [puppet] - 10https://gerrit.wikimedia.org/r/806254
[11:43:43] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2011.codfw.wmnet
[11:43:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:54] <wikibugs>	 (03PS2) 10Muehlenhoff: coal: Remove support for pre Bullseye installs [puppet] - 10https://gerrit.wikimedia.org/r/804340 (https://phabricator.wikimedia.org/T305460)
[11:45:11] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:46:31] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Revert "Revert "Prometheus/Netbox: use netbox.wikimedia.org SNI"" [puppet] - 10https://gerrit.wikimedia.org/r/806254 (owner: 10Ayounsi)
[11:47:16] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe2012.codfw.wmnet
[11:47:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:53:16] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:53:18] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2012.codfw.wmnet
[11:53:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:37] <logmsgbot>	 !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@18182aa]: (no justification provided)
[11:54:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:51] <logmsgbot>	 !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@18182aa]: (no justification provided) (duration: 00m 13s)
[11:54:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:07] <wikibugs>	 (03PS1) 10Ayounsi: Netbox: add monitoring to dns.git endpoint [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831)
[11:58:00] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:58:15] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:58:34] <wikibugs>	 (03PS2) 10Ayounsi: Netbox: add monitoring to dns.git endpoint [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831)
[12:01:17] <wikibugs>	 (CR) Jbond: [C: +1] Reenable U2F for now (1 comment) [puppet] - https://gerrit.wikimedia.org/r/805836 (owner: Muehlenhoff)
[12:01:19] <wikibugs>	 (03PS4) 10Cathal Mooney: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299)
[12:01:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: 10Cathal Mooney)
[12:04:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] coal: Remove support for pre Bullseye installs [puppet] - 10https://gerrit.wikimedia.org/r/804340 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff)
[12:04:20] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "i dont think we should include this in the production profile, i think it makes more sense to inject it via the pontoon enc, happy to chat" [puppet] - 10https://gerrit.wikimedia.org/r/806374 (owner: 10Filippo Giunchedi)
[12:05:33] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:05:50] <wikibugs>	 (03PS1) 10Muehlenhoff: squid/racktables: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806406 (https://phabricator.wikimedia.org/T308013)
[12:07:09] <wikibugs>	 (03CR) 10Jbond: "CI error relates to spdx" [puppet] - 10https://gerrit.wikimedia.org/r/806373 (owner: 10Filippo Giunchedi)
[12:09:17] <godog>	 jbond: thank you for the reviews, however none of https://gerrit.wikimedia.org/r/q/topic:pontoon-latest-merges is ready yet
[12:09:59] <wikibugs>	 (03CR) 10Jbond: "LGTM, Sorry i missed this when i did prod" [puppet] - 10https://gerrit.wikimedia.org/r/806377 (owner: 10Filippo Giunchedi)
[12:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:12:34] <wikibugs>	 (03PS1) 10Urbanecm: Add a throttle rule for a Czech course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806407 (https://phabricator.wikimedia.org/T310885)
[12:12:55] <wikibugs>	 (CR) Jbond: Bump changelog for 6.5.5 and add some docs how to resync the overlay (2 comments) [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518) (owner: Muehlenhoff)
[12:16:55] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:17] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [cirrus] Fix typo in config var (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/801792 (owner: DCausse)
[12:20:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:26:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[12:29:09] <wikibugs>	 (CR) Muehlenhoff: [V: +2 C: +2] Bump changelog for 6.5.5 and add some docs how to resync the overlay (1 comment) [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518) (owner: Muehlenhoff)
[12:35:13] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:35:28] <SandraEbele>	 !log deployed daily airflow dag for 3 Wikidata metrics.
[12:35:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[12:36:57] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[12:47:25] <wikibugs>	 (03PS5) 10Cathal Mooney: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299)
[12:48:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: 10Cathal Mooney)
[12:52:21] <wikibugs>	 (03PS1) 10Ssingh: dnsdist: remove redundant parameters for qps_max [puppet] - 10https://gerrit.wikimedia.org/r/806414
[12:53:28] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (Majavah) a:hashar→None
[12:53:31] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35913/console" [puppet] - 10https://gerrit.wikimedia.org/r/806414 (owner: 10Ssingh)
[12:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:56:12] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsdist: remove redundant parameters for qps_max [puppet] - 10https://gerrit.wikimedia.org/r/806414 (owner: 10Ssingh)
[13:00:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Michael)
[13:03:51] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:09:03] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:11:19] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:11:35] <wikibugs>	 10SRE, 10Traffic, 10observability, 10Upstream: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (10RhinosF1) This just alerted again: > 14:09:04 <+icinga-wm> PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate li...
[13:13:31] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:19:27] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:20:23] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:35:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Add Michael Große to contributors [puppet] - 10https://gerrit.wikimedia.org/r/806417 (https://phabricator.wikimedia.org/T308013)
[13:35:31] <wikibugs>	 (03PS1) 10Majavah: openstack::keystone: provision new security group rules for metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108)
[13:36:43] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35914/console" [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) (owner: 10Majavah)
[13:38:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add Michael Große to contributors [puppet] - 10https://gerrit.wikimedia.org/r/806417 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[13:38:40] <wikibugs>	 (03PS2) 10Majavah: openstack::keystone: provision new security group rules for metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108)
[13:39:30] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35915/console" [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) (owner: 10Majavah)
[13:40:09] <wikibugs>	 (03PS3) 10Majavah: openstack::keystone: provision new security group rules for metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108)
[13:41:10] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35916/console" [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) (owner: 10Majavah)
[13:42:18] <wikibugs>	 (03PS6) 10Cathal Mooney: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299)
[13:43:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] openstack::keystone: provision new security group rules for metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) (owner: 10Majavah)
[13:46:11] <wikibugs>	 (03PS4) 10Majavah: openstack::keystone: provision new security group rules for metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108)
[13:47:18] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35917/console" [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) (owner: 10Majavah)
[13:49:45] <wikibugs>	 SRE-tools, Infrastructure-Foundations, netbox, Patch-For-Review: Complete Netbox prometheus scraping - https://phabricator.wikimedia.org/T243928 (ayounsi) Open→Resolved All done thanks to John and Filippo.  Example dashboard can be seen there https://grafana.wikimedia.org/d/DvXT6LCnk/ fee...
[13:53:48] <wikibugs>	 (03PS3) 10Ayounsi: Netbox: add monitoring to dns.git endpoint [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831)
[13:55:36] <wikibugs>	 (03PS5) 10Majavah: openstack::keystone: provision new security group rules for metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108)
[14:00:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] openstack::keystone: provision new security group rules for metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) (owner: 10Majavah)
[14:01:49] <wikibugs>	 (03CR) 10Ayounsi: "This fails PCC, dunno how to make it work :)" [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: 10Ayounsi)
[14:04:57] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:05:01] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:07:33] <wikibugs>	 (03PS1) 10Ayounsi: Netbox stats, set scrape interval to 1h [puppet] - 10https://gerrit.wikimedia.org/r/806422
[14:07:49] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Papaul) @Eevans hello i talked to @Cmjohnson on this task since i will be working on it. He said that you wanted Buster on it so i just wanted to confirm if it is s...
[14:08:01] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Papaul)
[14:11:07] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:13:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10MoritzMuehlenhoff) I reached out to Marc-André Pelletier (Coren) via email and he replied the following (quoted with his permission), as such I'm listing him under CONTRIB...
[14:15:42] <wikibugs>	 (03PS1) 10Muehlenhoff: Add Coren to contributors [puppet] - 10https://gerrit.wikimedia.org/r/806424 (https://phabricator.wikimedia.org/T308013)
[14:19:06] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan)
[14:19:11] <wikibugs>	 (03PS2) 10Muehlenhoff: Add Coren to contributors [puppet] - 10https://gerrit.wikimedia.org/r/806424 (https://phabricator.wikimedia.org/T308013)
[14:21:47] <wikibugs>	 (03PS1) 10Jbond: wmflib: update kernel_details to also include kernel.unprivileged_userns_clone [puppet] - 10https://gerrit.wikimedia.org/r/806425
[14:22:19] <wikibugs>	 (03CR) 10Andrew Bogott: "As mentioned on the ticket, I'm inclined to leave this part of the monitoring infra undone for now." [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) (owner: 10Majavah)
[14:24:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add Coren to contributors [puppet] - 10https://gerrit.wikimedia.org/r/806424 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[14:24:47] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.cf
[14:24:48] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[14:24:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove references [puppet] - 10https://gerrit.wikimedia.org/r/806426
[14:32:51] <wikibugs>	 (CR) Jbond: Netbox: add monitoring to dns.git endpoint (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: Ayounsi)
[14:33:58] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon)
[14:36:11] <wikibugs>	 (03PS1) 10David Caro: wmcs.openstaack: Add runbook to increase the quotas [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/806429 (https://phabricator.wikimedia.org/T297606)
[14:36:45] <wikibugs>	 SRE, Maps: Allow Wikimedia Maps usage on desciclopedia.org - https://phabricator.wikimedia.org/T310761 (Aklapper) Open→Declined Hi, please see https://wikitech.wikimedia.org/wiki/Maps/External_usage : "maps.wikimedia.org tiles may only be used by Wikimedia wikis, and sites hosted by Wikimedia Aff...
[14:38:07] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1040.eqiad.wmnet
[14:38:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:57] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Eevans) >>! In T305570#8011724, @Papaul wrote: > @Eevans hello i talked to @Cmjohnson on this task since i will be working on it. He said that you wanted Buster on...
[14:41:30] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Papaul) @Eevans thanks
[14:46:33] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1016.eqiad.wmnet with OS buster
[14:46:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:38] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1016.eqiad.wmnet with OS buster
[14:47:45] <wikibugs>	 (03PS1) 10Cwhite: logstash: alertmanager use logsource as source for host.name field [puppet] - 10https://gerrit.wikimedia.org/r/806430 (https://phabricator.wikimedia.org/T222826)
[14:51:30] <wikibugs>	 (03CR) 10AikoChou: [C: 03+1] ml-services: add ptwiki draftquality isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/806394 (https://phabricator.wikimedia.org/T310704) (owner: 10Kevin Bazira)
[14:54:54] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2040.codfw.wmnet
[14:54:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:12] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1040.eqiad.wmnet
[14:55:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:21] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1016.eqiad.wmnet with reason: host reimage
[14:59:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:47] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1041.eqiad.wmnet
[14:59:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:32] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1016.eqiad.wmnet with reason: host reimage
[15:02:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:40] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1017.eqiad.wmnet with OS buster
[15:03:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:46] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1017.eqiad.wmnet with OS buster
[15:04:52] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:09:33] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1041.eqiad.wmnet
[15:09:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:21] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1042.eqiad.wmnet
[15:12:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:07] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1016.eqiad.wmnet with OS buster
[15:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:14] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1016.eqiad.wmnet with OS buster completed: - aqs1016 (**PASS**)   -...
[15:16:13] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1017.eqiad.wmnet with reason: host reimage
[15:16:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:53] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1018.eqiad.wmnet with OS buster
[15:16:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:59] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1018.eqiad.wmnet with OS buster
[15:17:54] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2040.codfw.wmnet
[15:17:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:26] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2041.codfw.wmnet
[15:18:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:47] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1017.eqiad.wmnet with reason: host reimage
[15:18:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:09] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti4004.mgmt.ulsfo.wmnet with reboot policy GRACEFUL
[15:19:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:30] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4004.mgmt.ulsfo.wmnet with reboot policy GRACEFUL
[15:19:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:00] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1042.eqiad.wmnet
[15:20:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:24] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (RobH)
[15:21:40] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1043.eqiad.wmnet
[15:21:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:48] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (RobH) Open→In progress >>! In T289715#8010871, @MoritzMuehlenhoff wrote: > The server doesn't have virtualisation enabled. I tried to enable it via the BIOS over the...
[15:26:36] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:28:20] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1043.eqiad.wmnet
[15:28:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:37] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1018.eqiad.wmnet with reason: host reimage
[15:29:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:05] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (RobH) In progress→Open
[15:30:08] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (RobH)
[15:30:24] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (RobH) a:RobH→MoritzMuehlenhoff virtualization is now enabled, you should be able to push this into service as needed now
[15:30:30] <wikibugs>	 (PS1) Majavah: P:openstack::puppetmaster: alert for puppet certs for deleted instances [puppet] - https://gerrit.wikimedia.org/r/806433
[15:31:02] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1017.eqiad.wmnet with OS buster
[15:31:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:08] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1017.eqiad.wmnet with OS buster completed: - aqs1017 (**PASS**)   -...
[15:31:12] <icinga-wm>	 RECOVERY - Host ganeti4004 is UP: PING OK - Packet loss = 0%, RTA = 68.97 ms
[15:31:29] <wikibugs>	 (CR) CI reject: [V: -1] P:openstack::puppetmaster: alert for puppet certs for deleted instances [puppet] - https://gerrit.wikimedia.org/r/806433 (owner: Majavah)
[15:31:34] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:32:04] <wikibugs>	 (PS2) Majavah: P:openstack::puppetmaster: alert for puppet certs for deleted instances [puppet] - https://gerrit.wikimedia.org/r/806433
[15:32:19] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (MoritzMuehlenhoff) Thanks! I'll do that on Tuesday
[15:32:56] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1018.eqiad.wmnet with reason: host reimage
[15:32:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:13] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1044.eqiad.wmnet
[15:33:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:23] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35923/console" [puppet] - https://gerrit.wikimedia.org/r/806433 (owner: Majavah)
[15:35:23] <wikibugs>	 (PS3) Majavah: P:openstack::puppetmaster: alert for puppet certs for deleted instances [puppet] - https://gerrit.wikimedia.org/r/806433
[15:36:13] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35924/console" [puppet] - https://gerrit.wikimedia.org/r/806433 (owner: Majavah)
[15:36:46] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2041.codfw.wmnet
[15:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:01] <wikibugs>	 (PS4) Majavah: P:openstack::puppetmaster: alert for puppet certs for deleted instances [puppet] - https://gerrit.wikimedia.org/r/806433
[15:37:39] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35925/console" [puppet] - https://gerrit.wikimedia.org/r/806433 (owner: Majavah)
[15:39:48] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1019.eqiad.wmnet with OS buster
[15:39:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:54] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1019.eqiad.wmnet with OS buster
[15:39:55] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1019.eqiad.wmnet with OS buster
[15:39:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:04] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1019.eqiad.wmnet with OS buster executed with errors: - aqs1019 (**...
[15:43:10] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1044.eqiad.wmnet
[15:43:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:48] <wikibugs>	 (CR) Ayounsi: Add check to network report to ensure IPs match connected Vlans (3 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: Cathal Mooney)
[15:43:59] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:31] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:46:48] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1018.eqiad.wmnet with OS buster
[15:46:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:53] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1018.eqiad.wmnet with OS buster completed: - aqs1018 (**PASS**)   -...
[15:49:57] <wikibugs>	 SRE, ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (ayounsi) Noted, that's the link we do the least traffic on so we can keep it down for some time. I'll take care of it on Monday.
[15:51:51] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2042.codfw.wmnet
[15:51:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:06] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1045.eqiad.wmnet
[15:52:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:19] <wikibugs>	 (CR) Ayounsi: "This should significantly decrease the number of "GET extras-api:jobresult-detail 200"" [puppet] - https://gerrit.wikimedia.org/r/806422 (owner: Ayounsi)
[15:56:47] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1045.eqiad.wmnet
[15:56:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:55] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1046.eqiad.wmnet
[15:57:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:31] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2042.codfw.wmnet
[15:59:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:09] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2043.codfw.wmnet
[16:01:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:42] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:03:52] <wikibugs>	 (PS1) Btullis: Update the container used for datahub deployments [deployment-charts] - https://gerrit.wikimedia.org/r/806435 (https://phabricator.wikimedia.org/T310079)
[16:04:23] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1046.eqiad.wmnet
[16:04:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:28] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1019.eqiad.wmnet with OS buster
[16:06:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:34] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1019.eqiad.wmnet with OS buster
[16:06:35] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1019.eqiad.wmnet with OS buster
[16:06:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:41] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1019.eqiad.wmnet with OS buster executed with errors: - aqs1019 (**...
[16:10:02] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1019.eqiad.wmnet with OS buster
[16:10:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:08] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1019.eqiad.wmnet with OS buster
[16:10:54] <wikibugs>	 (PS1) Andrew Bogott: Replace the bash/socket-based galera healthcheck with a python flask app [puppet] - https://gerrit.wikimedia.org/r/806437 (https://phabricator.wikimedia.org/T310664)
[16:12:00] <wikibugs>	 (CR) CI reject: [V: -1] Replace the bash/socket-based galera healthcheck with a python flask app [puppet] - https://gerrit.wikimedia.org/r/806437 (https://phabricator.wikimedia.org/T310664) (owner: Andrew Bogott)
[16:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:15:07] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2043.codfw.wmnet
[16:15:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:42] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:15:57] <wikibugs>	 (PS2) Andrew Bogott: Replace the bash/socket-based galera healthcheck with a python flask app [puppet] - https://gerrit.wikimedia.org/r/806437 (https://phabricator.wikimedia.org/T310664)
[16:18:04] <wikibugs>	 (PS1) Majavah: metricsinfra: add metricsinfra-prometheus-2 [puppet] - https://gerrit.wikimedia.org/r/806439 (https://phabricator.wikimedia.org/T288108)
[16:18:41] <wikibugs>	 (PS2) Majavah: metricsinfra: add metricsinfra-prometheus-2 [puppet] - https://gerrit.wikimedia.org/r/806439 (https://phabricator.wikimedia.org/T310799)
[16:20:33] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Replace the bash/socket-based galera healthcheck with a python flask app [puppet] - https://gerrit.wikimedia.org/r/806437 (https://phabricator.wikimedia.org/T310664) (owner: Andrew Bogott)
[16:20:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:21:15] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1020.eqiad.wmnet with OS buster
[16:21:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:20] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1020.eqiad.wmnet with OS buster
[16:22:04] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:22:43] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1019.eqiad.wmnet with reason: host reimage
[16:22:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:49] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1019.eqiad.wmnet with reason: host reimage
[16:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[16:27:40] <wikibugs>	 (CR) Btullis: [C: +2] Update the container used for datahub deployments [deployment-charts] - https://gerrit.wikimedia.org/r/806435 (https://phabricator.wikimedia.org/T310079) (owner: Btullis)
[16:29:19] <wikibugs>	 (PS1) Andrew Bogott: galera-nodecheck: fix filename [puppet] - https://gerrit.wikimedia.org/r/806441
[16:31:13] <wikibugs>	 (Merged) jenkins-bot: Update the container used for datahub deployments [deployment-charts] - https://gerrit.wikimedia.org/r/806435 (https://phabricator.wikimedia.org/T310079) (owner: Btullis)
[16:32:20] <wikibugs>	 (PS2) Andrew Bogott: galera-nodecheck: fix filename [puppet] - https://gerrit.wikimedia.org/r/806441
[16:32:40] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[16:32:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:48] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[16:33:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:56] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main
[16:33:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:32] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1020.eqiad.wmnet with reason: host reimage
[16:34:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:48] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[16:34:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:56] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main
[16:34:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:36] <wikibugs>	 (CR) Andrew Bogott: [C: +2] galera-nodecheck: fix filename [puppet] - https://gerrit.wikimedia.org/r/806441 (owner: Andrew Bogott)
[16:35:56] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[16:35:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[16:36:51] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Remove references [puppet] - https://gerrit.wikimedia.org/r/806426 (owner: Muehlenhoff)
[16:37:36] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1020.eqiad.wmnet with reason: host reimage
[16:37:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:24] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1019.eqiad.wmnet with OS buster
[16:38:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:30] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1019.eqiad.wmnet with OS buster completed: - aqs1019 (**PASS**)   -...
[16:40:59] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1021.eqiad.wmnet with OS buster
[16:41:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:05] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1021.eqiad.wmnet with OS buster
[16:49:27] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1020.eqiad.wmnet with OS buster
[16:49:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:32] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1020.eqiad.wmnet with OS buster completed: - aqs1020 (**WARN**)   -...
[16:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:57:24] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:59:56] <wikibugs>	 (CR) Andrew Bogott: [C: +2] metricsinfra: add metricsinfra-prometheus-2 [puppet] - https://gerrit.wikimedia.org/r/806439 (https://phabricator.wikimedia.org/T310799) (owner: Majavah)
[17:01:44] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:06:01] <wikibugs>	 (PS3) BCornwall: analytics: add varnishkafka delivery error alarms [alerts] - https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723)
[17:13:18] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:13:22] <wikibugs>	 (PS1) Majavah: P:openstack::haproxy: fix galera health check [puppet] - https://gerrit.wikimedia.org/r/806448
[17:21:56] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:22:10] <wikibugs>	 (CR) BCornwall: analytics: add varnishkafka delivery error alarms (1 comment) [alerts] - https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[17:23:21] <wikibugs>	 (PS2) Andrew Bogott: P:openstack::haproxy: fix galera health check [puppet] - https://gerrit.wikimedia.org/r/806448 (owner: Majavah)
[17:25:12] <wikibugs>	 (PS4) BCornwall: data-engineering: add varnishkafka delivery errors [alerts] - https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723)
[17:26:13] <wikibugs>	 (CR) CI reject: [V: -1] P:openstack::haproxy: fix galera health check [puppet] - https://gerrit.wikimedia.org/r/806448 (owner: Majavah)
[17:27:25] <wikibugs>	 (PS5) Samtar: Remove unused $wgExtraLanguageNames['qqq'] assignment [mediawiki-config] - https://gerrit.wikimedia.org/r/628773 (https://phabricator.wikimedia.org/T263441) (owner: Lucas Werkmeister (WMDE))
[17:27:43] <Lucas_WMDE>	 👀
[17:28:02] <TheresNoTime>	 oops did that ping you?
[17:28:09] <Lucas_WMDE>	 yeah ^^^
[17:28:11] <Lucas_WMDE>	 * ^^
[17:28:48] <TheresNoTime>	 sorrryyy ^^ is a pointless rebase really, but the big red "merge conflict" irritates me :P
[17:28:52] <Lucas_WMDE>	 but it’s not a bad thing that I get notified when people think they can rebase my patches :P
[17:29:02] <Lucas_WMDE>	 it’s okay 😌
[17:29:10] <wikibugs>	 (PS3) Andrew Bogott: P:openstack::haproxy: fix galera health check [puppet] - https://gerrit.wikimedia.org/r/806448 (https://phabricator.wikimedia.org/T310664) (owner: Majavah)
[17:31:04] <wikibugs>	 (PS1) Andrew Bogott: Galera/haproxy: remove old bash healthcheck script, replace with flask [puppet] - https://gerrit.wikimedia.org/r/806449 (https://phabricator.wikimedia.org/T310664)
[17:32:12] <wikibugs>	 (CR) Btullis: data-engineering: add varnishkafka delivery errors (2 comments) [alerts] - https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[17:32:28] <wikibugs>	 (CR) CI reject: [V: -1] P:openstack::haproxy: fix galera health check [puppet] - https://gerrit.wikimedia.org/r/806448 (https://phabricator.wikimedia.org/T310664) (owner: Majavah)
[17:34:58] <wikibugs>	 (PS4) Andrew Bogott: P:openstack::haproxy: fix galera health check [puppet] - https://gerrit.wikimedia.org/r/806448 (https://phabricator.wikimedia.org/T310664) (owner: Majavah)
[17:35:08] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1021.eqiad.wmnet with reason: host reimage
[17:35:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:38] <wikibugs>	 (PS2) Andrew Bogott: Galera/haproxy: remove old bash healthcheck script, replace with flask [puppet] - https://gerrit.wikimedia.org/r/806449 (https://phabricator.wikimedia.org/T310664)
[17:35:40] <wikibugs>	 (PS1) Andrew Bogott: profile::openstack::base::galera::node: remove old absented files [puppet] - https://gerrit.wikimedia.org/r/806450 (https://phabricator.wikimedia.org/T310664)
[17:35:57] <wikibugs>	 (PS1) Cwhite: profile: add kibana to dashboards rewrite rule [puppet] - https://gerrit.wikimedia.org/r/806451 (https://phabricator.wikimedia.org/T310360)
[17:38:24] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1021.eqiad.wmnet with reason: host reimage
[17:38:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:28] <wikibugs>	 (CR) Andrew Bogott: [C: +2] P:openstack::haproxy: fix galera health check [puppet] - https://gerrit.wikimedia.org/r/806448 (https://phabricator.wikimedia.org/T310664) (owner: Majavah)
[17:39:08] <wikibugs>	 (PS5) BCornwall: data-engineering: add varnishkafka delivery errors [alerts] - https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723)
[17:40:08] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Galera/haproxy: remove old bash healthcheck script, replace with flask [puppet] - https://gerrit.wikimedia.org/r/806449 (https://phabricator.wikimedia.org/T310664) (owner: Andrew Bogott)
[17:40:42] <wikibugs>	 (CR) BCornwall: data-engineering: add varnishkafka delivery errors (2 comments) [alerts] - https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[17:43:48] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[17:45:01] <wikibugs>	 (PS6) BCornwall: data-engineering: add varnishkafka delivery errors [alerts] - https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723)
[17:49:54] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1021.eqiad.wmnet with OS buster
[17:49:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:00] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1021.eqiad.wmnet with OS buster completed: - aqs1021 (**WARN**)   -...
[17:50:22] <wikibugs>	 SRE, Traffic: Review Debian Buster VMs set for 2022-06 termination - https://phabricator.wikimedia.org/T310910 (RhinosF1)
[17:52:46] <wikibugs>	 SRE, Traffic: Review Debian Buster VMs set for 2022-06 termination - https://phabricator.wikimedia.org/T310910 (RhinosF1) @BCornwall: See {T306101} which already exists and has some comments on regarding diff scan which has already been shut down via {T306245} and can likely just be deleted @ayounsi conf...
[17:56:20] <wikibugs>	 SRE, Traffic: Review Debian Buster VMs set for 2022-06 termination - https://phabricator.wikimedia.org/T310910 (BCornwall) @RhinosF1: Ugh, sorry about that. My searches didn't manage to find those, so thanks for doing the dirty work for me...  ._.
[17:56:22] <wikibugs>	 SRE, Traffic: Review Debian Buster VMs set for 2022-06 termination - https://phabricator.wikimedia.org/T310910 (ayounsi) +1 to delete the old instance.
[17:56:50] <wikibugs>	 SRE, Traffic: Review Debian Buster VMs set for 2022-06 termination - https://phabricator.wikimedia.org/T310910 (BCornwall) Open→Invalid
[17:58:36] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:03:08] <wikibugs>	 (CR) AOkoth: [C: +2] site: remove old gitlab runners [puppet] - https://gerrit.wikimedia.org/r/806279 (https://phabricator.wikimedia.org/T307142) (owner: AOkoth)
[18:04:06] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (Papaul)
[18:04:52] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (Papaul) Open→Resolved @Cmjohnson @Eevans this is complete with all the hosts running Buster.
[18:04:57] <wikibugs>	 SRE, Traffic: Review Debian Stretch VMs set for 2022-06 termination - https://phabricator.wikimedia.org/T310910 (BCornwall)
[18:10:41] <wikibugs>	 (PS2) Andrew Bogott: profile::openstack::base::galera::node: remove old absented files [puppet] - https://gerrit.wikimedia.org/r/806450 (https://phabricator.wikimedia.org/T310664)
[18:10:43] <wikibugs>	 (PS1) Andrew Bogott: remove nodecheck.sh. It was replaced with nodecheck.py [puppet] - https://gerrit.wikimedia.org/r/806457 (https://phabricator.wikimedia.org/T310664)
[18:10:51] <wikibugs>	 (PS1) Andrew Bogott: galera-nodecheck: turn logging way, way down [puppet] - https://gerrit.wikimedia.org/r/806458
[18:12:00] <wikibugs>	 (PS1) Majavah: openstack::trove: reduce workers [puppet] - https://gerrit.wikimedia.org/r/806459
[18:12:59] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35929/console" [puppet] - https://gerrit.wikimedia.org/r/806459 (owner: Majavah)
[18:17:03] <wikibugs>	 (PS2) Majavah: openstack::trove: reduce workers [puppet] - https://gerrit.wikimedia.org/r/806459
[18:18:13] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35930/console" [puppet] - https://gerrit.wikimedia.org/r/806459 (owner: Majavah)
[18:20:40] <wikibugs>	 SRE, MediaWiki-General, Traffic-Icebox: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (ori) Re-ordering duplicate query parameters could be problematic. If a parameter appears multiple times, its value in `$_GET` will be set based on the latterm...
[18:23:00] <wikibugs>	 (PS1) Majavah: openstack::nova: reduce max amount of open connections [puppet] - https://gerrit.wikimedia.org/r/806460
[18:28:48] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:30:47] <wikibugs>	 SRE, Traffic, Patch-For-Review, SRE Observability (FY2021/2022-Q4), User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (BCornwall) Open→In progress
[18:33:40] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:45:32] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[18:51:06] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Install NVMe SSDs into  moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (RobH)
[18:51:10] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Install NVMe SSDs into  moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (RobH)
[18:51:20] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Install NVMe SSDs into  moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (RobH)
[18:51:38] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Install NVMe SSDs into  moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (RobH)
[19:04:31] <wikibugs>	 (PS3) Andrew Bogott: profile::openstack::base::galera::node: remove old absented files [puppet] - https://gerrit.wikimedia.org/r/806450 (https://phabricator.wikimedia.org/T310664)
[19:04:33] <wikibugs>	 (PS2) Andrew Bogott: remove nodecheck.sh. It was replaced with nodecheck.py [puppet] - https://gerrit.wikimedia.org/r/806457 (https://phabricator.wikimedia.org/T310664)
[19:04:35] <wikibugs>	 (PS2) Andrew Bogott: galera-nodecheck: turn logging way, way down [puppet] - https://gerrit.wikimedia.org/r/806458
[19:04:37] <wikibugs>	 (PS1) Andrew Bogott: neutron: increase rpc_response_timeout [puppet] - https://gerrit.wikimedia.org/r/806466 (https://phabricator.wikimedia.org/T309930)
[19:07:04] <wikibugs>	 (CR) Andrew Bogott: [C: +2] neutron: increase rpc_response_timeout [puppet] - https://gerrit.wikimedia.org/r/806466 (https://phabricator.wikimedia.org/T309930) (owner: Andrew Bogott)
[19:08:52] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:09:08] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:11:21] <wikibugs>	 (CR) Krinkle: [C: +1] profile: add kibana to dashboards rewrite rule [puppet] - https://gerrit.wikimedia.org/r/806451 (https://phabricator.wikimedia.org/T310360) (owner: Cwhite)
[19:20:42] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:24:30] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:39:20] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:04:54] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:16:20] <Tamzin>	 upstream connect error or disconnect/reset before headers. reset reason: connection failure
[20:16:25] <Tamzin>	 hello friends
[20:16:32] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:18:32] <RhinosF1>	 Tamzin: enwiki fine here
[20:18:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[20:18:51] <jynus>	 there is something going on
[20:18:55] <Tamzin>	 yeah, bounced back already on my end, if slightly slow
[20:19:09] <Tamzin>	 heard the same error report from someone else in another channel though
[20:19:10] <jynus>	 db1111 maybe?
[20:19:28] <jynus>	 is someone around to try to depool it?
[20:19:58] <marostegui>	 I'm in bed but the command if the host is broken is: dbctl instance db1111 depool
[20:20:04] <marostegui>	 from cumin1001
[20:20:14] * jhathaway here
[20:20:22] <jynus>	 I am doing that
[20:20:30] <jhathaway>	 jynus: thanks
[20:20:39] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1111', diff saved to https://phabricator.wikimedia.org/P29907 and previous config saved to /var/cache/conftool/dbconfig/20220617-202038-jynus.json
[20:20:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:20:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:51] <marostegui>	 and then dbctl config commit -m "depool"
[20:20:57] <jynus>	 yeah, done already
[20:21:01] <marostegui>	 thanks 
[20:21:05] <marostegui>	 is it down?
[20:21:16] <jynus>	 no idea but it is the first thing I thought
[20:21:24] <jynus>	 if it helps, it is that; if not, I will repool it
[20:21:32] <jynus>	 it was complaining in logs
[20:21:34] <jynus>	 and icinga
[20:21:41] <RhinosF1>	 There was a drop in queries on it marostegui
[20:21:43] <RhinosF1>	 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1111&var-port=9104&from=now-15m&to=now
[20:21:58] <cdanis>	 it is not down
[20:22:12] <cdanis>	 jynus: please repool it
[20:22:17] <jynus>	 doing
[20:22:41] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1111', diff saved to https://phabricator.wikimedia.org/P29908 and previous config saved to /var/cache/conftool/dbconfig/20220617-202240-jynus.json
[20:22:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:35] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[20:25:57] <wikibugs>	 (PS15) Ayounsi: Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261
[20:26:33] <wikibugs>	 (CR) Ayounsi: "Thanks, reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/803261 (owner: Ayounsi)
[20:26:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[20:29:01] <wikibugs>	 (CR) CI reject: [V: -1] Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261 (owner: Ayounsi)
[20:31:22] <icinga-wm>	 RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:34:14] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:35:58] <wikibugs>	 (PS1) Thcipriani: Docker homepage builder: relicense Apache-2.0 [puppet] - https://gerrit.wikimedia.org/r/806473 (https://phabricator.wikimedia.org/T67270)
[20:36:00] <wikibugs>	 (CR) Thcipriani: [C: -1] Docker homepage builder: relicense Apache-2.0 [puppet] - https://gerrit.wikimedia.org/r/806473 (https://phabricator.wikimedia.org/T67270) (owner: Thcipriani)
[20:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[20:37:37] <wikibugs>	 (CR) Thcipriani: [C: -1] "Needs input from @Legoktm before merge (it's mostly his work here :))" [puppet] - https://gerrit.wikimedia.org/r/806473 (https://phabricator.wikimedia.org/T67270) (owner: Thcipriani)
[20:45:22] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:49:11] <hauskatze>	 Do we have tasks for ^
[20:49:13] <hauskatze>	 ?
[20:49:45] <RhinosF1>	 hauskatze: yes
[20:50:01] <hauskatze>	 okay
[20:50:02] <RhinosF1>	 hauskatze: it just needs an apache restart
[20:50:02] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:50:25] <RhinosF1>	 It's an issue where reload doesn't destroy the child processes properly
[20:50:30] <RhinosF1>	 Certificate is renewed fine
[20:50:38] <hauskatze>	 Just until August?
[20:50:50] <RhinosF1>	 It's LE so every 3 months
[20:51:01] <hauskatze>	 ah, that explains
[20:52:27] <RhinosF1>	 hauskatze: https://phabricator.wikimedia.org/T293826
[20:52:59] <RhinosF1>	 I'll poke around during the week to get someone to do it if it's still flapping
[20:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:56:10] <wikibugs>	 SRE, Sustainability (Incident Followup): replace TODOs with links to logs and runbook in HAProxy pages (page thanos sre) - https://phabricator.wikimedia.org/T310933 (Dzahn)
[20:58:56] <wikibugs>	 (CR) DannyS712: phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[21:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:10:27] <wikibugs>	 (PS1) Dzahn: gitlab: add prometheus blackbox http monitor [puppet] - https://gerrit.wikimedia.org/r/806476
[21:10:28] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:13:36] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:15:06] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1005 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:15:48] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.099 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:17:22] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:17:24] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:30:35] <wikibugs>	 SRE, MediaWiki-General, Traffic: Query canonicalization for MediaWiki - https://phabricator.wikimedia.org/T310087 (Krinkle) Regarding parameter - One thing that comes to mind from a previous experiment long ago (I don't recall specifics and couldn't find any) is OAuth verification.  OAuth is sensitiv...
[21:35:28] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:54:38] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:13:04] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:14:54] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:15:33] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (Dzahn) - deleted store@ and merchandise@ after they were created in Google- coordinated with Brendan of ITS and Sandra Hust, store manager  - introduced some...
[22:15:59] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (Dzahn) In progress→Resolved
[22:18:00] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:18:29] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (Dzahn) The remaining SRE aliases in the file can now be separated into:  - standards - SREs - DNS related - network related - dumps related - monitoring rela...
[22:19:34] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:41:28] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (wiki_willy) a:cmooney→Cmjohnson
[22:56:18] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 58.61 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[22:58:38] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 102.9 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:05:33] <wikibugs>	 (PS4) Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)
[23:06:30] <wikibugs>	 (CR) CI reject: [V: -1] base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[23:09:37] <wikibugs>	 (PS5) Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)
[23:10:31] <wikibugs>	 (CR) CI reject: [V: -1] base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[23:10:58] <wikibugs>	 (PS6) Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)
[23:11:52] <wikibugs>	 (CR) CI reject: [V: -1] base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[23:14:04] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:16:33] <wikibugs>	 (PS1) Cwhite: logstash: copy aqs info field to error.message [puppet] - https://gerrit.wikimedia.org/r/806484 (https://phabricator.wikimedia.org/T310760)
[23:17:55] <wikibugs>	 (PS7) Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)
[23:18:50] <wikibugs>	 (CR) CI reject: [V: -1] base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[23:19:28] <wikibugs>	 (PS8) Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)
[23:25:32] <wikibugs>	 (CR) Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[23:30:33] <wikibugs>	 (PS1) Dzahn: cumin: add alias for hosts with sensitive sysctl settings [puppet] - https://gerrit.wikimedia.org/r/806486 (https://phabricator.wikimedia.org/T287081)
[23:30:48] <wikibugs>	 (CR) Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/806486/" [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[23:31:55] <wikibugs>	 (PS2) Dzahn: cumin: add alias for hosts with sensitive sysctl settings [puppet] - https://gerrit.wikimedia.org/r/806486 (https://phabricator.wikimedia.org/T287081)
[23:33:36] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (Dzahn)
[23:37:15] <wikibugs>	 (PS1) Dzahn: admin: add taavi to contint-admins [puppet] - https://gerrit.wikimedia.org/r/806487 (https://phabricator.wikimedia.org/T309375)
[23:41:12] <wikibugs>	 (CR) Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners (2 comments) [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[23:41:21] <wikibugs>	 (PS9) Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)