[00:07:06] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1016.eqiad.wmnet with OS bullseye
[00:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:12] SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1016.eqiad.wmnet with OS bullseye
[00:08:54] SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (Papaul) 1016 had already Buster installed . I am re-running the cookbook again to install Bullseye
[00:11:34] PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:46] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS5511/IPv4: Idle - Orange, AS5511/IPv6: Idle - Orange https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:20:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[00:39:56] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1016.eqiad.wmnet with reason: host reimage
[00:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:22] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1016.eqiad.wmnet with reason: host reimage
[00:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:58] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 59, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:52:08] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:55:56] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:56:35] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1016.eqiad.wmnet with OS bullseye
[00:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:41] SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1016.eqiad.wmnet with OS bullseye completed: - aqs1016 (**PASS**)...
[00:57:44] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:59:22] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 60, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:04:44] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:07:06] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1017.eqiad.wmnet with OS bullseye
[01:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:11] SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1017.eqiad.wmnet with OS bullseye
[01:08:03] SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (Papaul)
[01:10:34] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:34] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:13:06] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:16:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:17:12] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:23:06] (CR) Tim Starling: "Getting bored of waiting for review, tempted to just merge it" [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling)
[01:37:19] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:38:52] SRE, Platform Engineering, serviceops, Performance-Team (Radar): Phasing out "redis_sessions" MediaWiki cluster and away from the memcached cluster - https://phabricator.wikimedia.org/T267581 (tstarling)
[01:39:00] SRE, Performance-Team, Platform Engineering, Goal: Decommission the "session redis" cluster - https://phabricator.wikimedia.org/T243520 (tstarling)
[01:39:28] (Abandoned) Tim Starling: Switch wgMainStash back to Redis [mediawiki-config] - https://gerrit.wikimedia.org/r/804024 (https://phabricator.wikimedia.org/T212129) (owner: Tim Starling)
[01:39:54] SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (tstarling) Open→Resolved Metrics on db1151 look fine. Disk space usage on db1151 is growing at a rate of...
[01:39:55] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1017.eqiad.wmnet with reason: host reimage
[01:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:42:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:43:04] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1017.eqiad.wmnet with reason: host reimage
[01:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:28] RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:10] (CR) Thcipriani: [C: +1] Fix unsupported $wgLogos default configurations [mediawiki-config] - https://gerrit.wikimedia.org/r/806068 (https://phabricator.wikimedia.org/T310767) (owner: Thiemo Kreuz (WMDE))
[01:54:47] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1017.eqiad.wmnet with OS bullseye
[01:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:52] SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1017.eqiad.wmnet with OS bullseye completed: - aqs1017 (**WARN**)...
[02:00:47] (CR) Tim Starling: [C: +2] "To be on the safe side, I'll do an eval.php check of the tagline default before scap, then I'll check for a missing tagline at https://aa." [mediawiki-config] - https://gerrit.wikimedia.org/r/806068 (https://phabricator.wikimedia.org/T310767) (owner: Thiemo Kreuz (WMDE))
[02:01:37] (Merged) jenkins-bot: Fix unsupported $wgLogos default configurations [mediawiki-config] - https://gerrit.wikimedia.org/r/806068 (https://phabricator.wikimedia.org/T310767) (owner: Thiemo Kreuz (WMDE))
[02:02:25] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1018.eqiad.wmnet with OS bullseye
[02:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:02:31] SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1018.eqiad.wmnet with OS bullseye
[02:03:00] (JobUnavailable) firing: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:47] !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 03m 43s)
[02:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:41] (CR) Tim Starling: [C: +2] "Seems fine." [mediawiki-config] - https://gerrit.wikimedia.org/r/806068 (https://phabricator.wikimedia.org/T310767) (owner: Thiemo Kreuz (WMDE))
[02:08:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:40] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 59, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:09:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:09:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:20:00] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 60, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:36:24] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1018.eqiad.wmnet with reason: host reimage
[02:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:39:31] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1018.eqiad.wmnet with reason: host reimage
[02:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:34] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 59, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:50:56] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 60, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:51:12] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:51:22] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1018.eqiad.wmnet with OS bullseye
[02:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:51:27] SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1018.eqiad.wmnet with OS bullseye completed: - aqs1018 (**WARN**)...
[03:12:32] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 25, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:20:58] SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (Papaul)
[03:38:01] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:52:35] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:58:31] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:59:51] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:06:33] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:11:07] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:53] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:19:57] PROBLEM - SSH on an-worker1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:20:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:24:07] PROBLEM - Host an-worker1109 is DOWN: PING CRITICAL - Packet loss = 100%
[04:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[04:53:39] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:59:37] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:00:37] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:03:00] (JobUnavailable) firing: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:08:55] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:10:33] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:10:41] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:13:03] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:22:01] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:23:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:26:37] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:32:23] (CR) Thiemo Kreuz (WMDE): phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[06:46:53] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:43] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:55:59] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220617T0700)
[07:02:47] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:11:15] (PS2) Muehlenhoff: Remove webperf1002/webperf2002 from Kafka firewall rules [puppet] - https://gerrit.wikimedia.org/r/804334 (https://phabricator.wikimedia.org/T305460)
[07:13:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:15:33] (CR) Muehlenhoff: [C: +2] Remove webperf1002/webperf2002 from Kafka firewall rules [puppet] - https://gerrit.wikimedia.org/r/804334 (https://phabricator.wikimedia.org/T305460) (owner: Muehlenhoff)
[07:16:50] (CR) Muehlenhoff: [V: +2 C: +2] cas: Update to 6.5.5 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/806203 (https://phabricator.wikimedia.org/T305518) (owner: Muehlenhoff)
[07:17:15] (Abandoned) Muehlenhoff: envoyproxy: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/799311 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:17:38] (Abandoned) Muehlenhoff: cas: Update to 6.5.5 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/806174 (https://phabricator.wikimedia.org/T305518) (owner: Muehlenhoff)
[07:17:51] (PS2) Muehlenhoff: Bump changelog for 6.5.5 and add some docs how to resync the overlay [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518)
[07:23:15] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:27:55] (CR) Muehlenhoff: [C: +2] "Thanks! Mergin" [puppet] - https://gerrit.wikimedia.org/r/806216 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:31:32] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/806218 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:31:38] (PS2) Muehlenhoff: spamassassin: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/806218 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:32:21] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:38:19] (CR) JMeybohm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/806208 (owner: Jbond)
[07:39:13] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:39:45] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/806219 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:39:52] (PS2) Muehlenhoff: tomcat: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/806219 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:41:23] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on ml-staging-ctrl[2001-2002].codfw.wmnet with reason: Rebooting to activate new kernel for T310483
[07:41:25] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ml-staging-ctrl[2001-2002].codfw.wmnet with reason: Rebooting to activate new kernel for T310483
[07:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:29] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:24] (PS1) Kosta Harlan: GrowthExperiments: Enable link recommendations frontend, round 4 [mediawiki-config] - https://gerrit.wikimedia.org/r/806365 (https://phabricator.wikimedia.org/T304548)
[07:50:02] (CR) Muehlenhoff: vrts: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806220 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:54:28] (PS1) Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673)
[08:01:45] (PS2) Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673)
[08:02:42] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet
[08:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:29] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:06:35] (PS3) Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673)
[08:07:45] (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35899/console" [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:08:39] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2001.codfw.wmnet
[08:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:46] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging2002.codfw.wmnet
[08:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:39] (CR) Slyngshede: [C: +2] snapshot: migrate adds-changes cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/779016 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[08:11:58] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:12:57] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:17:09] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2002.codfw.wmnet
[08:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:13] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti4004.ulsfo.wmnet with reason: Enable virt in BIOS
[08:17:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti4004.ulsfo.wmnet with reason: Enable virt in BIOS
[08:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:56] (CertManagerCertNotReady) resolved: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[08:20:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:21:41] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[08:21:52] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on ml-serve-ctrl[2001-2002].codfw.wmnet with reason: Rebooting to activate new kernel for T310483
[08:21:54] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ml-serve-ctrl[2001-2002].codfw.wmnet with reason: Rebooting to activate new kernel for T310483
[08:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:33] (PS1) Ayounsi: Prometheus/Netbox: use netbox.wikimedia.org SNI [puppet] - https://gerrit.wikimedia.org/r/806368 (https://phabricator.wikimedia.org/T243928)
[08:22:55] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:22:56] (CertManagerCertNotReady) firing: (2) Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[08:24:43] (CR) Slyngshede: [V: +1 C: +2] P:apt do not include private apt repo on cloud hosts. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806197 (owner: Slyngshede)
[08:27:27] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:27:50] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/806368 (https://phabricator.wikimedia.org/T243928) (owner: Ayounsi)
[08:29:15] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:31:43] (CR) Ayounsi: [C: +2] Prometheus/Netbox: use netbox.wikimedia.org SNI [puppet] - https://gerrit.wikimedia.org/r/806368 (https://phabricator.wikimedia.org/T243928) (owner: Ayounsi)
[08:33:47] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[08:37:12] SRE, ops-ulsfo, DC-Ops, Traffic, decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (MoritzMuehlenhoff)
[08:37:35] SRE, ops-ulsfo, DC-Ops, Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (MoritzMuehlenhoff) Resolved→Open The server doesn't have virtualisation enabled. I tried to enable it via the BIOS over the serial console, but I'm not getting a cons...
[08:38:43] RECOVERY - Host an-worker1109 is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms
[08:39:07] RECOVERY - SSH on an-worker1109 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:39:44] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2001.codfw.wmnet
[08:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:41:28] SRE, ops-ulsfo, DC-Ops, Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (MoritzMuehlenhoff) The server can be powered down any time, while it already has the ganeti role, it's not yet added to the cluster.
[08:45:11] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:47:27] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:47:47] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2001.codfw.wmnet
[08:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:04] (CR) Jbond: [C: +1] "LGTM added hashar as a heads up" [puppet] - https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[08:49:48] (CR) Jbond: [C: +1] SREBatchBase: Fix broken batchsize argument [cookbooks] - https://gerrit.wikimedia.org/r/806286 (owner: JMeybohm)
[08:51:20] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2002.codfw.wmnet
[08:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:37] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:53:58] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:56:03] (CR) Tacsipacsi: CommonSettings: clean up and simplify some code (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805433 (owner: DannyS712)
[08:56:39] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:57:09] (CR) Jbond: [C: +1] SREBatchBase: Fix broken batchsize argument [cookbooks] - https://gerrit.wikimedia.org/r/806286 (owner: JMeybohm)
[08:57:58] (KubernetesCalicoDown) firing: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:58:28] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2002.codfw.wmnet
[08:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:55] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:01:22] (PS3) Muehlenhoff: Bump changelog for 6.5.5 and add some docs how to resync the overlay [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518)
[09:01:42] !log klausman@cumin1001
START - Cookbook sre.hosts.reboot-single for host ml-serve2003.codfw.wmnet [09:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:53] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/806285 (owner: 10JMeybohm) [09:02:51] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 101.1 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [09:02:58] (KubernetesCalicoDown) resolved: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:02:58] (03CR) 10Muehlenhoff: Bump changelog for 6.5.5 and add some docs how to resync the overlay (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [09:04:03] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:07:33] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:07:58] (KubernetesCalicoDown) firing: ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:08:03] (03PS2) 10Zabe: snapshot: remove absented add-changes cron [puppet] - 
10https://gerrit.wikimedia.org/r/779017 (https://phabricator.wikimedia.org/T273673) [09:09:43] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2003.codfw.wmnet [09:09:44] (03CR) 10Jbond: [C: 03+2] php: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806217 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:39] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2004.codfw.wmnet [09:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:38] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/35900/" [puppet] - 10https://gerrit.wikimedia.org/r/805836 (owner: 10Muehlenhoff) [09:12:55] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:12:58] (KubernetesCalicoDown) resolved: (2) ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:13:13] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:14:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2003.codfw.wmnet [09:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:15] PROBLEM - Host ganeti4004 is DOWN: PING CRITICAL - Packet loss = 100% [09:18:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2003.codfw.wmnet [09:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:42] !log klausman@cumin1001 END 
(PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2004.codfw.wmnet [09:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:19] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti4004.ulsfo.wmnet with reason: Enable virt in BIOS [09:23:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti4004.ulsfo.wmnet with reason: Enable virt in BIOS [09:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2004.codfw.wmnet [09:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:24] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2005.codfw.wmnet [09:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2004.codfw.wmnet [09:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:57] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:26:59] (03PS1) 10Filippo Giunchedi: pontoon: add profile::pontoon::base [puppet] - 10https://gerrit.wikimedia.org/r/806373 [09:27:01] (03PS1) 10Filippo Giunchedi: base: include profile::pontoon::base [puppet] - 10https://gerrit.wikimedia.org/r/806374 [09:27:03] (03PS1) 10Filippo Giunchedi: pontoon: fix race between SD/dnsmasq and resolvconf [puppet] - 10https://gerrit.wikimedia.org/r/806375 [09:27:05] (03PS1) 10Filippo Giunchedi: pontoon: enable SD for stack observability [puppet] - 
10https://gerrit.wikimedia.org/r/806376 [09:27:07] (03PS1) 10Filippo Giunchedi: pontoon: update hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/806377 [09:27:09] (03PS1) 10Filippo Giunchedi: wmcs: add default for metricsinfra_prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/806378 [09:27:11] (03PS1) 10Filippo Giunchedi: pontoon: add metricsinfra_prometheus_nodes to settings [puppet] - 10https://gerrit.wikimedia.org/r/806379 [09:28:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1003.eqiad.wmnet [09:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:07] (03PS4) 10Muehlenhoff: Bump changelog for 6.5.5 and add some docs how to resync the overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518) [09:30:24] (03CR) 10CI reject: [V: 04-1] pontoon: add profile::pontoon::base [puppet] - 10https://gerrit.wikimedia.org/r/806373 (owner: 10Filippo Giunchedi) [09:30:53] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2005.codfw.wmnet [09:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:00] (03CR) 10Jbond: [C: 03+1] sre.k8s.reboot-node: Dynamically adjust batchsize [cookbooks] - 10https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:32:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1003.eqiad.wmnet [09:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1004.eqiad.wmnet [09:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:28] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2006.codfw.wmnet [09:34:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1004.eqiad.wmnet [09:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:38:04] (03CR) 10Jbond: "This is probably acceptable as long as we track it, however please get a +1 from moritz so it's definitely on their radar" [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn) [09:40:35] (03PS1) 10Btullis: Disable the telemetry for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/806381 (https://phabricator.wikimedia.org/T310079) [09:40:37] (03CR) 10Volans: "What if instead we solve the problem accepting both absolute and percentage batch sizes values like cumin?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:41:54] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2006.codfw.wmnet [09:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:57] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 104.5 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [09:44:05] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2007.codfw.wmnet [09:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:15] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:46:09] (03CR) 10Btullis: [C: 03+2] Disable the telemetry for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/806381 (https://phabricator.wikimedia.org/T310079) (owner: 10Btullis) [09:50:15] (03Merged) 10jenkins-bot: Disable the telemetry for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/806381 (https://phabricator.wikimedia.org/T310079) (owner: 10Btullis) [09:50:33] (03PS1) 10Jbond: P:promethous::ops: add host header to scrap config [puppet] - 10https://gerrit.wikimedia.org/r/806382 [09:51:49] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2007.codfw.wmnet [09:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:37] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [09:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:58] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: 
sync on main [09:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:29] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [09:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:51] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [09:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:03] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [09:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:33] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [09:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:05] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2008.codfw.wmnet [09:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:55] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:03:00] (JobUnavailable) firing: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:05:35] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2008.codfw.wmnet [10:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:17] (03PS1) 10Jbond: P:netbox: add proxy for the metricts endpoint in the exports vhost [puppet] - 10https://gerrit.wikimedia.org/r/806383 [10:11:30] (03Abandoned) 10Jbond: P:promethous::ops: add host header to 
scrap config [puppet] - 10https://gerrit.wikimedia.org/r/806382 (owner: 10Jbond) [10:12:12] (03PS1) 10Btullis: Update the container image used for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/806384 (https://phabricator.wikimedia.org/T310629) [10:14:23] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35904/console" [puppet] - 10https://gerrit.wikimedia.org/r/806383 (owner: 10Jbond) [10:18:41] (03PS2) 10Jbond: P:netbox: add proxy for the metricts endpoint in the exports vhost [puppet] - 10https://gerrit.wikimedia.org/r/806383 [10:19:14] (03CR) 10Muehlenhoff: "Enable unpriv user_ns seems fine for this use case, but I think two aspects are relevant here:" [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn) [10:20:19] (03CR) 10Btullis: [C: 03+2] Update the container image used for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/806384 (https://phabricator.wikimedia.org/T310629) (owner: 10Btullis) [10:21:44] (03CR) 10Ayounsi: [C: 03+1] P:netbox: add proxy for the metricts endpoint in the exports vhost [puppet] - 10https://gerrit.wikimedia.org/r/806383 (owner: 10Jbond) [10:21:46] (03PS3) 10Jbond: P:netbox: add proxy for the metricts endpoint in the exports vhost [puppet] - 10https://gerrit.wikimedia.org/r/806383 [10:22:11] (03PS4) 10Jbond: P:netbox: add proxy for the metricts endpoint in the exports vhost [puppet] - 10https://gerrit.wikimedia.org/r/806383 [10:22:15] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Bump changelog for 6.5.5 and add some docs how to resync the overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [10:22:51] (03PS1) 10Ayounsi: Revert "Prometheus/Netbox: use netbox.wikimedia.org SNI" [puppet] - 10https://gerrit.wikimedia.org/r/806252 [10:23:25] (03Merged) 10jenkins-bot: Update the 
container image used for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/806384 (https://phabricator.wikimedia.org/T310629) (owner: 10Btullis) [10:24:27] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10MoritzMuehlenhoff) [10:25:08] (03CR) 10Jbond: [C: 03+1] Revert "Prometheus/Netbox: use netbox.wikimedia.org SNI" [puppet] - 10https://gerrit.wikimedia.org/r/806252 (owner: 10Ayounsi) [10:25:21] (03CR) 10Jbond: [C: 03+2] P:netbox: add proxy for the metricts endpoint in the exports vhost [puppet] - 10https://gerrit.wikimedia.org/r/806383 (owner: 10Jbond) [10:25:59] (03CR) 10Ayounsi: [C: 03+2] Revert "Prometheus/Netbox: use netbox.wikimedia.org SNI" [puppet] - 10https://gerrit.wikimedia.org/r/806252 (owner: 10Ayounsi) [10:28:02] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10MoritzMuehlenhoff) 05Open→03Resolved This is complete. The ulsfo cluster is affected by T309724, but that will be investigated via that task (and it doesn't have a functional impact apart fr... [10:28:22] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10MoritzMuehlenhoff) 05Open→03Resolved This is complete. The eqsin cluster is affected by T309724, but that will be investigated via that task (and it doesn't have a functional impact apart fr... [10:28:42] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 (10MoritzMuehlenhoff) 05Open→03Resolved This is complete. 
[10:31:04] (03PS3) 10Volans: icinga: ensure that the downtime was applied [software/spicerack] - 10https://gerrit.wikimedia.org/r/803317 (https://phabricator.wikimedia.org/T309447) [10:32:43] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:59] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [10:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:15] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [10:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:16] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [10:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:30] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [10:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:32] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [10:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:23] (03PS1) 10Cathal Mooney: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) [10:45:27] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:46:03] (03CR) 10CI reject: [V: 04-1] Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: 10Cathal Mooney) 
[10:48:58] (03PS2) 10Cathal Mooney: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) [10:50:14] (03CR) 10CI reject: [V: 04-1] Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: 10Cathal Mooney) [10:51:03] (03PS1) 10Slyngshede: Admin: grant samtar access to deployment [puppet] - 10https://gerrit.wikimedia.org/r/806391 (https://phabricator.wikimedia.org/T302231) [10:53:23] (03PS2) 10Slyngshede: Admin: grant samtar access to deployment [puppet] - 10https://gerrit.wikimedia.org/r/806391 (https://phabricator.wikimedia.org/T302231) [10:54:18] (03PS1) 10Jbond: netbox: add hostname to allowed list of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/806392 [10:54:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/806391 (https://phabricator.wikimedia.org/T302231) (owner: 10Slyngshede) [10:54:35] (03PS3) 10Cathal Mooney: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) [10:55:05] (03CR) 10Slyngshede: [C: 03+2] Admin: grant samtar access to deployment [puppet] - 10https://gerrit.wikimedia.org/r/806391 (https://phabricator.wikimedia.org/T302231) (owner: 10Slyngshede) [10:56:16] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (10SLyngshede-WMF) [10:56:25] (03CR) 10CI reject: [V: 04-1] Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: 10Cathal Mooney) [10:57:06] (03CR) 10CI reject: [V: 04-1] netbox: add hostname to allowed list of 
hostnames [puppet] - 10https://gerrit.wikimedia.org/r/806392 (owner: 10Jbond) [10:57:21] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF [10:58:50] (03PS2) 10Jbond: netbox: add hostname to allowed list of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/806392 [11:00:00] 10SRE, 10LDAP-Access-Requests, 10Product-Analytics: Requesting access to Superset for Ricardo Baeza-Yates - https://phabricator.wikimedia.org/T310227 (10SLyngshede-WMF) 05Open→03Resolved [11:00:44] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe1010.eqiad.wmnet [11:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:11] (03CR) 10Jbond: [C: 03+2] netbox: add hostname to allowed list of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/806392 (owner: 10Jbond) [11:06:27] (03PS1) 10Muehlenhoff: Add missing file [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806393 [11:06:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1010.eqiad.wmnet [11:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:58] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe1011.eqiad.wmnet [11:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:13] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:08:41] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:08:48] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add missing file [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806393 (owner: 10Muehlenhoff) [11:09:04] 
(03PS1) 10Jbond: Revert "netbox: add hostname to allowed list of hostnames" [puppet] - 10https://gerrit.wikimedia.org/r/806253 [11:09:17] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:10:41] (03PS1) 10Kevin Bazira: ml-services: add ptwiki draftquality isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/806394 (https://phabricator.wikimedia.org/T310704) [11:10:47] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:11:23] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.332 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:12:01] PROBLEM - Check unit status of netbox_ganeti_esams_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:12:29] (03PS1) 10Jbond: C:netbox: fic typo [puppet] - 10https://gerrit.wikimedia.org/r/806395 [11:12:33] PROBLEM - Check systemd state on netbox2002 is CRITICAL: CRITICAL - degraded: The following units failed: rq-netbox.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:39] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service,netbox_ganeti_esams_sync.service,netbox_report_puppetdb_virtual_run.service,rq-netbox.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:07] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:21] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1011.eqiad.wmnet [11:13:22] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:10] (03CR) 10Urbanecm: [C: 03+1] MentorDashboard: enable the Vue version of the dashboard in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [11:15:22] (03PS2) 10Jbond: C:netbox: fic typo [puppet] - 10https://gerrit.wikimedia.org/r/806395 [11:16:26] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe1012.eqiad.wmnet [11:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:57] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:17:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35908/console" [puppet] - 10https://gerrit.wikimedia.org/r/806395 (owner: 10Jbond) [11:17:54] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:netbox: fic typo [puppet] - 10https://gerrit.wikimedia.org/r/806395 (owner: 10Jbond) [11:20:41] PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:22:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1012.eqiad.wmnet [11:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:19] RECOVERY - Check unit status of netbox_ganeti_esams_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:23:45] (03PS1) 10Btullis: Disable native authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/806396 (https://phabricator.wikimedia.org/T310079) [11:26:32] (03PS1) 
10Jaime Nuche: scap bootstrap: refactor [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) [11:28:00] (03CR) 10Btullis: [C: 03+2] Disable native authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/806396 (https://phabricator.wikimedia.org/T310079) (owner: 10Btullis) [11:31:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe2010.codfw.wmnet [11:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:09] (03Merged) 10jenkins-bot: Disable native authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/806396 (https://phabricator.wikimedia.org/T310079) (owner: 10Btullis) [11:31:57] RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:32:24] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:59] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:12] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [11:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:15] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [11:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:27] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [11:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:03] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The 
following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:13] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [11:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2010.codfw.wmnet [11:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe2011.codfw.wmnet [11:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:47] RECOVERY - Check systemd state on netbox2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:03] (03CR) 10ArielGlenn: [C: 03+2] snapshot: remove absented add-changes cron [puppet] - 10https://gerrit.wikimedia.org/r/779017 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:40:09] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:17] !log upload cas 6.5.5+wmf11u1 to apt.wikimedia.org T305518 [11:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:22] T305518: Upgrade IDPs to CAS 6.5/Bullseye and enable webauthn - https://phabricator.wikimedia.org/T305518 [11:43:33] (03PS1) 10Ayounsi: Revert "Revert "Prometheus/Netbox: use netbox.wikimedia.org SNI"" [puppet] - 10https://gerrit.wikimedia.org/r/806254 [11:43:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2011.codfw.wmnet [11:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:54] (03PS2) 10Muehlenhoff: coal: Remove support for pre Bullseye installs [puppet] - 
10https://gerrit.wikimedia.org/r/804340 (https://phabricator.wikimedia.org/T305460) [11:45:11] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:31] (03CR) 10Ayounsi: [C: 03+2] Revert "Revert "Prometheus/Netbox: use netbox.wikimedia.org SNI"" [puppet] - 10https://gerrit.wikimedia.org/r/806254 (owner: 10Ayounsi) [11:47:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe2012.codfw.wmnet [11:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:45] (JobUnavailable) resolved: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:53:16] (JobUnavailable) firing: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:53:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2012.codfw.wmnet [11:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:37] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@18182aa]: (no justification provided) [11:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:51] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@18182aa]: (no justification provided) (duration: 00m 13s) [11:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:07] (03PS1) 10Ayounsi: Netbox: add monitoring to dns.git endpoint [puppet] - 10https://gerrit.wikimedia.org/r/806405 
(https://phabricator.wikimedia.org/T310831) [11:58:00] (JobUnavailable) firing: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:58:15] (JobUnavailable) resolved: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:58:34] (03PS2) 10Ayounsi: Netbox: add monitoring to dns.git endpoint [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) [12:01:17] (03CR) 10Jbond: [C: 03+1] Reenable U2F for now (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805836 (owner: 10Muehlenhoff) [12:01:19] (03PS4) 10Cathal Mooney: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) [12:01:54] (03CR) 10CI reject: [V: 04-1] Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: 10Cathal Mooney) [12:04:18] (03CR) 10Muehlenhoff: [C: 03+2] coal: Remove support for pre Bullseye installs [puppet] - 10https://gerrit.wikimedia.org/r/804340 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [12:04:20] (03CR) 10Jbond: [C: 04-1] "i dont think we should include this in the production profile, i think it makes more sense to inject it via the pontoon enc, happy to chat" [puppet] - 10https://gerrit.wikimedia.org/r/806374 (owner: 10Filippo Giunchedi) [12:05:33] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: 
swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:50] (03PS1) 10Muehlenhoff: squid/racktables: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806406 (https://phabricator.wikimedia.org/T308013) [12:07:09] (03CR) 10Jbond: "CI error relates to spdx" [puppet] - 10https://gerrit.wikimedia.org/r/806373 (owner: 10Filippo Giunchedi) [12:09:17] jbond: thank you for the reviews, however none of https://gerrit.wikimedia.org/r/q/topic:pontoon-latest-merges is ready yet [12:09:59] (03CR) 10Jbond: "LGTM, Sorry i missed this when i did prod" [puppet] - 10https://gerrit.wikimedia.org/r/806377 (owner: 10Filippo Giunchedi) [12:12:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:12:34] (03PS1) 10Urbanecm: Add a throttle rule for a Czech course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806407 (https://phabricator.wikimedia.org/T310885) [12:12:55] (03CR) 10Jbond: Bump changelog for 6.5.5 and add some docs how to resync the overlay (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [12:16:55] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:17] (03CR) 10Lucas Werkmeister (WMDE): [cirrus] Fix typo in config var (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801792 (owner: 10DCausse) [12:20:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:26:41] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [12:29:09] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Bump changelog for 6.5.5 and add some docs how to resync the overlay (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [12:35:13] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:28] !log deployed daily airflow dag for 3 Wikidata metrics. [12:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [12:36:57] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:47:25] (03PS5) 10Cathal Mooney: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) [12:48:16] (03CR) 10CI reject: [V: 04-1] Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: 10Cathal Mooney) [12:52:21] (03PS1) 10Ssingh: dnsdist: remove redundant parameters for qps_max [puppet] - 10https://gerrit.wikimedia.org/r/806414 [12:53:28] 10SRE, 10SRE-Access-Requests: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (10Majavah) a:05hashar→03None [12:53:31] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35913/console" [puppet] - 10https://gerrit.wikimedia.org/r/806414 (owner: 10Ssingh) [12:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:56:12] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsdist: remove redundant parameters for qps_max [puppet] - 10https://gerrit.wikimedia.org/r/806414 (owner: 10Ssingh) [13:00:06] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Michael) [13:03:51] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:09:03] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:11:19] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:11:35] 10SRE, 10Traffic, 10observability, 10Upstream: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (10RhinosF1) This just alerted again: > 14:09:04 <+icinga-wm> PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate li... [13:13:31] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:19:27] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:20:23] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:35:08] (03PS1) 10Muehlenhoff: Add Michael Große to contributors [puppet] - 10https://gerrit.wikimedia.org/r/806417 (https://phabricator.wikimedia.org/T308013) [13:35:31] (03PS1) 10Majavah: openstack::keystone: provision new security group rules for metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) [13:36:43] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35914/console" [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) (owner: 10Majavah) [13:38:03] (03CR) 10Muehlenhoff: [C: 03+2] Add Michael Große to contributors [puppet] - 10https://gerrit.wikimedia.org/r/806417 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:38:40] (03PS2) 10Majavah: openstack::keystone: provision new security group rules for metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) [13:39:30] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35915/console" [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) (owner: 10Majavah) [13:40:09] (03PS3) 10Majavah: openstack::keystone: provision new security group rules for metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) [13:41:10] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35916/console" [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) (owner: 10Majavah) [13:42:18] (03PS6) 10Cathal Mooney: Add check to network report to ensure IPs match connected Vlans 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) [13:43:35] (03CR) 10CI reject: [V: 04-1] openstack::keystone: provision new security group rules for metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) (owner: 10Majavah) [13:46:11] (03PS4) 10Majavah: openstack::keystone: provision new security group rules for metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) [13:47:18] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35917/console" [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) (owner: 10Majavah) [13:49:45] 10SRE-tools, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review: Complete Netbox prometheus scraping - https://phabricator.wikimedia.org/T243928 (10ayounsi) 05Open→03Resolved All done thanks to John and Filippo. Example dashboard can be seen there https://grafana.wikimedia.org/d/DvXT6LCnk/ fee... 
[13:53:48] (03PS3) 10Ayounsi: Netbox: add monitoring to dns.git endpoint [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) [13:55:36] (03PS5) 10Majavah: openstack::keystone: provision new security group rules for metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) [14:00:19] (03CR) 10CI reject: [V: 04-1] openstack::keystone: provision new security group rules for metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) (owner: 10Majavah) [14:01:49] (03CR) 10Ayounsi: "This fails PCC, dunno how to make it work :)" [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: 10Ayounsi) [14:04:57] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:05:01] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:07:33] (03PS1) 10Ayounsi: Netbox stats, set scrape interval to 1h [puppet] - 10https://gerrit.wikimedia.org/r/806422 [14:07:49] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Papaul) @Eevans hello i talked to @Cmjohnson on this task since i will be working on it. He said that you wanted Buster on it so i just wanted to confirm if it is s... 
[14:08:01] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Papaul) [14:11:07] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:13:48] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10MoritzMuehlenhoff) I reached out to Marc-André Pelletier (Coren) via email and he replied the following (quoted with his permission), as such I'm listing him under CONTRIB... [14:15:42] (03PS1) 10Muehlenhoff: Add Coren to contributors [puppet] - 10https://gerrit.wikimedia.org/r/806424 (https://phabricator.wikimedia.org/T308013) [14:19:06] (03CR) 10Eevans: [C: 03+1] cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [14:19:11] (03PS2) 10Muehlenhoff: Add Coren to contributors [puppet] - 10https://gerrit.wikimedia.org/r/806424 (https://phabricator.wikimedia.org/T308013) [14:21:47] (03PS1) 10Jbond: wmflib: update kernel_details to also include kernel.unprivileged_userns_clone [puppet] - 10https://gerrit.wikimedia.org/r/806425 [14:22:19] (03CR) 10Andrew Bogott: "As mentioned on the ticket, I'm inclined to leave this part of the monitoring infra undone for now." 
[puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) (owner: 10Majavah) [14:24:32] (03CR) 10Muehlenhoff: [C: 03+2] Add Coren to contributors [puppet] - 10https://gerrit.wikimedia.org/r/806424 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:24:47] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [14:24:48] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [14:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:54] (03PS1) 10Muehlenhoff: Remove references [puppet] - 10https://gerrit.wikimedia.org/r/806426 [14:32:51] (03CR) 10Jbond: Netbox: add monitoring to dns.git endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: 10Ayounsi) [14:33:58] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [14:36:11] (03PS1) 10David Caro: wmcs.openstaack: Add runbook to increase the quotas [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/806429 (https://phabricator.wikimedia.org/T297606) [14:36:45] 10SRE, 10Maps: Allow Wikimedia Maps usage on desciclopedia.org - https://phabricator.wikimedia.org/T310761 (10Aklapper) 05Open→03Declined Hi, please see https://wikitech.wikimedia.org/wiki/Maps/External_usage : "maps.wikimedia.org tiles may only be used by Wikimedia wikis, and sites hosted by Wikimedia Aff... [14:38:07] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1040.eqiad.wmnet [14:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:57] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Eevans) >>! 
In T305570#8011724, @Papaul wrote: > @Eevans hello i talked to @Cmjohnson on this task since i will be working on it. He said that you wanted Buster on... [14:41:30] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Papaul) @Eevans thanks [14:46:33] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1016.eqiad.wmnet with OS buster [14:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:38] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1016.eqiad.wmnet with OS buster [14:47:45] (03PS1) 10Cwhite: logstash: alertmanager use logsource as source for host.name field [puppet] - 10https://gerrit.wikimedia.org/r/806430 (https://phabricator.wikimedia.org/T222826) [14:51:30] (03CR) 10AikoChou: [C: 03+1] ml-services: add ptwiki draftquality isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/806394 (https://phabricator.wikimedia.org/T310704) (owner: 10Kevin Bazira) [14:54:54] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2040.codfw.wmnet [14:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1040.eqiad.wmnet [14:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:21] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1016.eqiad.wmnet with reason: host reimage [14:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:47] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1041.eqiad.wmnet [14:59:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:32] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1016.eqiad.wmnet with reason: host reimage [15:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:40] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1017.eqiad.wmnet with OS buster [15:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:46] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1017.eqiad.wmnet with OS buster [15:04:52] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:09:33] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1041.eqiad.wmnet [15:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:21] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1042.eqiad.wmnet [15:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:07] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1016.eqiad.wmnet with OS buster [15:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:14] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1016.eqiad.wmnet with OS buster completed: - aqs1016 (**PASS**) -... 
[15:16:13] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1017.eqiad.wmnet with reason: host reimage [15:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:53] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1018.eqiad.wmnet with OS buster [15:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:59] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1018.eqiad.wmnet with OS buster [15:17:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2040.codfw.wmnet [15:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:26] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2041.codfw.wmnet [15:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:47] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1017.eqiad.wmnet with reason: host reimage [15:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:09] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti4004.mgmt.ulsfo.wmnet with reboot policy GRACEFUL [15:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:30] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4004.mgmt.ulsfo.wmnet with reboot policy GRACEFUL [15:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1042.eqiad.wmnet [15:20:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:24] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (10RobH) [15:21:40] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1043.eqiad.wmnet [15:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:48] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10RobH) 05Open→03In progress >>! In T289715#8010871, @MoritzMuehlenhoff wrote: > The server doesn't have virtualisation enabled. I tried to enable it via the BIOS over the... [15:26:36] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:28:20] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1043.eqiad.wmnet [15:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:37] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1018.eqiad.wmnet with reason: host reimage [15:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:05] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10RobH) 05In progress→03Open [15:30:08] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (10RobH) [15:30:24] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10RobH) a:05RobH→03MoritzMuehlenhoff virtualization is now enabled, you should be able to push this into service as needed now [15:30:30] (03PS1) 10Majavah: P:openstack::puppetmaster: 
alert for puppet certs for deleted instances [puppet] - 10https://gerrit.wikimedia.org/r/806433 [15:31:02] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1017.eqiad.wmnet with OS buster [15:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:08] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1017.eqiad.wmnet with OS buster completed: - aqs1017 (**PASS**) -... [15:31:12] RECOVERY - Host ganeti4004 is UP: PING OK - Packet loss = 0%, RTA = 68.97 ms [15:31:29] (03CR) 10CI reject: [V: 04-1] P:openstack::puppetmaster: alert for puppet certs for deleted instances [puppet] - 10https://gerrit.wikimedia.org/r/806433 (owner: 10Majavah) [15:31:34] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:32:04] (03PS2) 10Majavah: P:openstack::puppetmaster: alert for puppet certs for deleted instances [puppet] - 10https://gerrit.wikimedia.org/r/806433 [15:32:19] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10MoritzMuehlenhoff) Thanks! 
I'll do that on Tuesday [15:32:56] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1018.eqiad.wmnet with reason: host reimage [15:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:13] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1044.eqiad.wmnet [15:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:23] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35923/console" [puppet] - 10https://gerrit.wikimedia.org/r/806433 (owner: 10Majavah) [15:35:23] (03PS3) 10Majavah: P:openstack::puppetmaster: alert for puppet certs for deleted instances [puppet] - 10https://gerrit.wikimedia.org/r/806433 [15:36:13] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35924/console" [puppet] - 10https://gerrit.wikimedia.org/r/806433 (owner: 10Majavah) [15:36:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2041.codfw.wmnet [15:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:01] (03PS4) 10Majavah: P:openstack::puppetmaster: alert for puppet certs for deleted instances [puppet] - 10https://gerrit.wikimedia.org/r/806433 [15:37:39] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35925/console" [puppet] - 10https://gerrit.wikimedia.org/r/806433 (owner: 10Majavah) [15:39:48] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1019.eqiad.wmnet with OS buster [15:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:54] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1019.eqiad.wmnet with OS buster [15:39:55] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1019.eqiad.wmnet with OS buster [15:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:04] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1019.eqiad.wmnet with OS buster executed with errors: - aqs1019 (**... [15:43:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1044.eqiad.wmnet [15:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:48] (03CR) 10Ayounsi: Add check to network report to ensure IPs match connected Vlans (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: 10Cathal Mooney) [15:43:59] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:31] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:46:48] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1018.eqiad.wmnet with OS buster [15:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:53] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1018.eqiad.wmnet with OS buster completed: - aqs1018 (**PASS**) -... [15:49:57] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10ayounsi) Noted, that's the link we do the least traffic on so we can keep it down for some time. I'll take care of it on Monday. [15:51:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2042.codfw.wmnet [15:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1045.eqiad.wmnet [15:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:19] (03CR) 10Ayounsi: "This should significantly decrease the number of "GET extras-api:jobresult-detail 200"" [puppet] - 10https://gerrit.wikimedia.org/r/806422 (owner: 10Ayounsi) [15:56:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1045.eqiad.wmnet [15:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1046.eqiad.wmnet [15:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
ms-be2042.codfw.wmnet [15:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2043.codfw.wmnet [16:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:42] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:52] (03PS1) 10Btullis: Update the container used for datahub deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/806435 (https://phabricator.wikimedia.org/T310079) [16:04:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1046.eqiad.wmnet [16:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:28] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1019.eqiad.wmnet with OS buster [16:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:34] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1019.eqiad.wmnet with OS buster [16:06:35] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1019.eqiad.wmnet with OS buster [16:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:41] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1019.eqiad.wmnet with OS buster executed with errors: - aqs1019 (**... 
[16:10:02] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1019.eqiad.wmnet with OS buster [16:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:08] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1019.eqiad.wmnet with OS buster [16:10:54] (03PS1) 10Andrew Bogott: Replace the bash/socket-based galera healthcheck with a python flask app [puppet] - 10https://gerrit.wikimedia.org/r/806437 (https://phabricator.wikimedia.org/T310664) [16:12:00] (03CR) 10CI reject: [V: 04-1] Replace the bash/socket-based galera healthcheck with a python flask app [puppet] - 10https://gerrit.wikimedia.org/r/806437 (https://phabricator.wikimedia.org/T310664) (owner: 10Andrew Bogott) [16:12:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:15:07] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2043.codfw.wmnet [16:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:42] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:57] (03PS2) 10Andrew Bogott: Replace the bash/socket-based galera healthcheck with a python flask app [puppet] - 10https://gerrit.wikimedia.org/r/806437 (https://phabricator.wikimedia.org/T310664) [16:18:04] (03PS1) 10Majavah: metricsinfra: add metricsinfra-prometheus-2 [puppet] - 10https://gerrit.wikimedia.org/r/806439 
(https://phabricator.wikimedia.org/T288108) [16:18:41] (03PS2) 10Majavah: metricsinfra: add metricsinfra-prometheus-2 [puppet] - 10https://gerrit.wikimedia.org/r/806439 (https://phabricator.wikimedia.org/T310799) [16:20:33] (03CR) 10Andrew Bogott: [C: 03+2] Replace the bash/socket-based galera healthcheck with a python flask app [puppet] - 10https://gerrit.wikimedia.org/r/806437 (https://phabricator.wikimedia.org/T310664) (owner: 10Andrew Bogott) [16:20:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:21:15] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1020.eqiad.wmnet with OS buster [16:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:20] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1020.eqiad.wmnet with OS buster [16:22:04] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:22:43] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1019.eqiad.wmnet with reason: host reimage [16:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:49] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1019.eqiad.wmnet with reason: host reimage [16:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:41] (CertManagerCertNotReady) firing: Certificate 
istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [16:27:40] (03CR) 10Btullis: [C: 03+2] Update the container used for datahub deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/806435 (https://phabricator.wikimedia.org/T310079) (owner: 10Btullis) [16:29:19] (03PS1) 10Andrew Bogott: galera-nodecheck: fix filename [puppet] - 10https://gerrit.wikimedia.org/r/806441 [16:31:13] (03Merged) 10jenkins-bot: Update the container used for datahub deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/806435 (https://phabricator.wikimedia.org/T310079) (owner: 10Btullis) [16:32:20] (03PS2) 10Andrew Bogott: galera-nodecheck: fix filename [puppet] - 10https://gerrit.wikimedia.org/r/806441 [16:32:40] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [16:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:48] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [16:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:56] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [16:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:32] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1020.eqiad.wmnet with reason: host reimage [16:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:48] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [16:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:56] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [16:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:35:36] (03CR) 10Andrew Bogott: [C: 03+2] galera-nodecheck: fix filename [puppet] - 10https://gerrit.wikimedia.org/r/806441 (owner: 10Andrew Bogott) [16:35:56] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [16:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [16:36:51] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove references [puppet] - 10https://gerrit.wikimedia.org/r/806426 (owner: 10Muehlenhoff) [16:37:36] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1020.eqiad.wmnet with reason: host reimage [16:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:24] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1019.eqiad.wmnet with OS buster [16:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:30] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1019.eqiad.wmnet with OS buster completed: - aqs1019 (**PASS**) -... 
[16:40:59] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1021.eqiad.wmnet with OS buster [16:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:05] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1021.eqiad.wmnet with OS buster [16:49:27] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1020.eqiad.wmnet with OS buster [16:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:32] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1020.eqiad.wmnet with OS buster completed: - aqs1020 (**WARN**) -... 
[16:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:57:24] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:56] (03CR) 10Andrew Bogott: [C: 03+2] metricsinfra: add metricsinfra-prometheus-2 [puppet] - 10https://gerrit.wikimedia.org/r/806439 (https://phabricator.wikimedia.org/T310799) (owner: 10Majavah) [17:01:44] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:06:01] (03PS3) 10BCornwall: analytics: add varnishkafka delivery error alarms [alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) [17:13:18] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:13:22] (03PS1) 10Majavah: P:openstack::haproxy: fix galera health check [puppet] - 10https://gerrit.wikimedia.org/r/806448 [17:21:56] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:22:10] (03CR) 10BCornwall: analytics: add varnishkafka delivery error alarms (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) 
[17:23:21] (03PS2) 10Andrew Bogott: P:openstack::haproxy: fix galera health check [puppet] - 10https://gerrit.wikimedia.org/r/806448 (owner: 10Majavah) [17:25:12] (03PS4) 10BCornwall: data-engineering: add varnishkafka delivery errors [alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) [17:26:13] (03CR) 10CI reject: [V: 04-1] P:openstack::haproxy: fix galera health check [puppet] - 10https://gerrit.wikimedia.org/r/806448 (owner: 10Majavah) [17:27:25] (03PS5) 10Samtar: Remove unused $wgExtraLanguageNames['qqq'] assignment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628773 (https://phabricator.wikimedia.org/T263441) (owner: 10Lucas Werkmeister (WMDE)) [17:27:43] 👀 [17:28:02] oops did that ping you? [17:28:09] yeah ^^^ [17:28:11] * ^^ [17:28:48] sorrryyy ^^ is a pointless rebase really, but the big red "merge conflict" irritates me :P [17:28:52] but it’s not a bad thing that I get notified when people think they can rebase my patches :P [17:29:02] it’s okay 😌 [17:29:10] (03PS3) 10Andrew Bogott: P:openstack::haproxy: fix galera health check [puppet] - 10https://gerrit.wikimedia.org/r/806448 (https://phabricator.wikimedia.org/T310664) (owner: 10Majavah) [17:31:04] (03PS1) 10Andrew Bogott: Galera/haproxy: remove old bash healthcheck script, replace with flask [puppet] - 10https://gerrit.wikimedia.org/r/806449 (https://phabricator.wikimedia.org/T310664) [17:32:12] (03CR) 10Btullis: data-engineering: add varnishkafka delivery errors (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [17:32:28] (03CR) 10CI reject: [V: 04-1] P:openstack::haproxy: fix galera health check [puppet] - 10https://gerrit.wikimedia.org/r/806448 (https://phabricator.wikimedia.org/T310664) (owner: 10Majavah) [17:34:58] (03PS4) 10Andrew Bogott: P:openstack::haproxy: fix galera health check [puppet] - 10https://gerrit.wikimedia.org/r/806448 
(https://phabricator.wikimedia.org/T310664) (owner: 10Majavah) [17:35:08] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1021.eqiad.wmnet with reason: host reimage [17:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:38] (03PS2) 10Andrew Bogott: Galera/haproxy: remove old bash healthcheck script, replace with flask [puppet] - 10https://gerrit.wikimedia.org/r/806449 (https://phabricator.wikimedia.org/T310664) [17:35:40] (03PS1) 10Andrew Bogott: profile::openstack::base::galera::node: remove old absented files [puppet] - 10https://gerrit.wikimedia.org/r/806450 (https://phabricator.wikimedia.org/T310664) [17:35:57] (03PS1) 10Cwhite: profile: add kibana to dashboards rewrite rule [puppet] - 10https://gerrit.wikimedia.org/r/806451 (https://phabricator.wikimedia.org/T310360) [17:38:24] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1021.eqiad.wmnet with reason: host reimage [17:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:28] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::haproxy: fix galera health check [puppet] - 10https://gerrit.wikimedia.org/r/806448 (https://phabricator.wikimedia.org/T310664) (owner: 10Majavah) [17:39:08] (03PS5) 10BCornwall: data-engineering: add varnishkafka delivery errors [alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) [17:40:08] (03CR) 10Andrew Bogott: [C: 03+2] Galera/haproxy: remove old bash healthcheck script, replace with flask [puppet] - 10https://gerrit.wikimedia.org/r/806449 (https://phabricator.wikimedia.org/T310664) (owner: 10Andrew Bogott) [17:40:42] (03CR) 10BCornwall: data-engineering: add varnishkafka delivery errors (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [17:43:48] PROBLEM - Number of backend failures per minute from CirrusSearch on 
graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [17:45:01] (03PS6) 10BCornwall: data-engineering: add varnishkafka delivery errors [alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) [17:49:54] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1021.eqiad.wmnet with OS buster [17:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:00] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1021.eqiad.wmnet with OS buster completed: - aqs1021 (**WARN**) -... [17:50:22] 10SRE, 10Traffic: Review Debian Buster VMs set for 2022-06 termination - https://phabricator.wikimedia.org/T310910 (10RhinosF1) [17:52:46] 10SRE, 10Traffic: Review Debian Buster VMs set for 2022-06 termination - https://phabricator.wikimedia.org/T310910 (10RhinosF1) @BCornwall: See {T306101} which already exists and has some comments on regarding diff scan which has already been shut down via {T306245} and can likely just be deleted @ayounsi conf... [17:56:20] 10SRE, 10Traffic: Review Debian Buster VMs set for 2022-06 termination - https://phabricator.wikimedia.org/T310910 (10BCornwall) @RhinosF1: Ugh, sorry about that. My searches didn't manage to find those, so thanks for doing the dirty work for me... ._. [17:56:22] 10SRE, 10Traffic: Review Debian Buster VMs set for 2022-06 termination - https://phabricator.wikimedia.org/T310910 (10ayounsi) +1 to delete the old instance. 
[17:56:50] 10SRE, 10Traffic: Review Debian Buster VMs set for 2022-06 termination - https://phabricator.wikimedia.org/T310910 (10BCornwall) 05Open→03Invalid [17:58:36] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:03:08] (03CR) 10AOkoth: [C: 03+2] site: remove old gitlab runners [puppet] - 10https://gerrit.wikimedia.org/r/806279 (https://phabricator.wikimedia.org/T307142) (owner: 10AOkoth) [18:04:06] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Papaul) [18:04:52] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Papaul) 05Open→03Resolved @Cmjohnson @Eevans this is complete with all the hosts running Buster. [18:04:57] 10SRE, 10Traffic: Review Debian Stretch VMs set for 2022-06 termination - https://phabricator.wikimedia.org/T310910 (10BCornwall) [18:10:41] (03PS2) 10Andrew Bogott: profile::openstack::base::galera::node: remove old absented files [puppet] - 10https://gerrit.wikimedia.org/r/806450 (https://phabricator.wikimedia.org/T310664) [18:10:43] (03PS1) 10Andrew Bogott: remove nodecheck.sh. 
It was replaced with nodecheck.py [puppet] - 10https://gerrit.wikimedia.org/r/806457 (https://phabricator.wikimedia.org/T310664) [18:10:51] (03PS1) 10Andrew Bogott: galera-nodecheck: turn logging way, way down [puppet] - 10https://gerrit.wikimedia.org/r/806458 [18:12:00] (03PS1) 10Majavah: openstack::trove: reduce workers [puppet] - 10https://gerrit.wikimedia.org/r/806459 [18:12:59] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35929/console" [puppet] - 10https://gerrit.wikimedia.org/r/806459 (owner: 10Majavah) [18:17:03] (03PS2) 10Majavah: openstack::trove: reduce workers [puppet] - 10https://gerrit.wikimedia.org/r/806459 [18:18:13] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35930/console" [puppet] - 10https://gerrit.wikimedia.org/r/806459 (owner: 10Majavah) [18:20:40] 10SRE, 10MediaWiki-General, 10Traffic-Icebox: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10ori) Re-ordering duplicate query parameters could be problematic. If a parameter appears multiple times, its value in `$_GET` will be set based on the latterm... [18:23:00] (03PS1) 10Majavah: openstack::nova: reduce max amount of open connections [puppet] - 10https://gerrit.wikimedia.org/r/806460 [18:28:48] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:30:47] 10SRE, 10Traffic, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q4), 10User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (10BCornwall) 05Open→03In progress [18:33:40] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:45:32] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [18:51:06] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10RobH) [18:51:10] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10RobH) [18:51:20] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10RobH) [18:51:38] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10RobH) [19:04:31] (03PS3) 10Andrew Bogott: profile::openstack::base::galera::node: remove old absented files [puppet] - 10https://gerrit.wikimedia.org/r/806450 (https://phabricator.wikimedia.org/T310664) [19:04:33] (03PS2) 10Andrew Bogott: remove nodecheck.sh. 
It was replaced with nodecheck.py [puppet] - 10https://gerrit.wikimedia.org/r/806457 (https://phabricator.wikimedia.org/T310664) [19:04:35] (03PS2) 10Andrew Bogott: galera-nodecheck: turn logging way, way down [puppet] - 10https://gerrit.wikimedia.org/r/806458 [19:04:37] (03PS1) 10Andrew Bogott: neutron: increase rpc_response_timeout [puppet] - 10https://gerrit.wikimedia.org/r/806466 (https://phabricator.wikimedia.org/T309930) [19:07:04] (03CR) 10Andrew Bogott: [C: 03+2] neutron: increase rpc_response_timeout [puppet] - 10https://gerrit.wikimedia.org/r/806466 (https://phabricator.wikimedia.org/T309930) (owner: 10Andrew Bogott) [19:08:52] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:09:08] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:11:21] (03CR) 10Krinkle: [C: 03+1] profile: add kibana to dashboards rewrite rule [puppet] - 10https://gerrit.wikimedia.org/r/806451 (https://phabricator.wikimedia.org/T310360) (owner: 10Cwhite) [19:20:42] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:24:30] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:39:20] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:04:54] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:13] (KubernetesRsyslogDown) firing: (2) 
rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:16:20] upstream connect error or disconnect/reset before headers. reset reason: connection failure [20:16:25] hello friends [20:16:32] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:32] Tamzin: enwiki fine here [20:18:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:18:51] there is something going on [20:18:55] yeah, bounced back already on my end, if slightly slow [20:19:09] heard the same error report from someone else in another channel though [20:19:10] db1111 maybe? [20:19:28] someone around to try to depool it? 
[20:19:58] I'm in bed but the command if the host is broken is: dbctl instance db1111 depool [20:20:04] from cumin1001 [20:20:14] * jhathaway here [20:20:22] I am doing that [20:20:30] jynus: thanks [20:20:39] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1111', diff saved to https://phabricator.wikimedia.org/P29907 and previous config saved to /var/cache/conftool/dbconfig/20220617-202038-jynus.json [20:20:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:51] and then dbctl config commit -m "depool" [20:20:57] yeah, done already [20:21:01] thanks [20:21:05] is it down? [20:21:16] no idea but it is the first thing I thought [20:21:24] if it helps, it is that, if not I will repool it [20:21:32] it was complaining on logs [20:21:34] and icinga [20:21:41] There was a drop in queries on it marostegui [20:21:43] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1111&var-port=9104&from=now-15m&to=now [20:21:58] it is not down [20:22:12] jynus: please repool it [20:22:17] doing [20:22:41] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1111', diff saved to https://phabricator.wikimedia.org/P29908 and previous config saved to /var/cache/conftool/dbconfig/20220617-202240-jynus.json [20:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:25:57] (03PS15) 
10Ayounsi: Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [20:26:33] (03CR) 10Ayounsi: "Thanks, reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [20:26:41] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [20:29:01] (03CR) 10CI reject: [V: 04-1] Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [20:31:22] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:34:14] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:35:58] (03PS1) 10Thcipriani: Docker homepage builder: relicense Apache-2.0 [puppet] - 10https://gerrit.wikimedia.org/r/806473 (https://phabricator.wikimedia.org/T67270) [20:36:00] (03CR) 10Thcipriani: [C: 04-1] Docker homepage builder: relicense Apache-2.0 [puppet] - 10https://gerrit.wikimedia.org/r/806473 (https://phabricator.wikimedia.org/T67270) (owner: 10Thcipriani) [20:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [20:37:37] (03CR) 10Thcipriani: [C: 04-1] "Needs input from @Legoktm before merge (it's mostly his work here :))" [puppet] - 10https://gerrit.wikimedia.org/r/806473 (https://phabricator.wikimedia.org/T67270) (owner: 10Thcipriani) [20:45:22] PROBLEM - mailman list 
info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:49:11] Do we have tasks for ^ [20:49:13] ? [20:49:45] hauskatze: yes [20:50:01] okay [20:50:02] hauskatze: it just needs an apache restart [20:50:02] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:50:25] It's an issue where reload doesn't destroy the children properly [20:50:30] Certificate is renewed fine [20:50:38] Just until August? [20:50:50] It's LE so every 3 months [20:51:01] ah, that explains [20:52:27] hauskatze: https://phabricator.wikimedia.org/T293826 [20:52:59] I'll poke around during the week to get someone to do it if it's still flapping [20:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:56:10] 10SRE, 10Sustainability (Incident Followup): replace TODOs with links to logs and runbook in HAProxy pages (page thanos sre) - https://phabricator.wikimedia.org/T310933 (10Dzahn) [20:58:56] (03CR) 10DannyS712: phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [21:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:10:27] (03PS1) 10Dzahn: gitlab: add prometheus blackbox http monitor [puppet] - 
10https://gerrit.wikimedia.org/r/806476 [21:10:28] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:13:36] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:15:06] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1005 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:15:48] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.099 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:17:22] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:17:24] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:30:35] 10SRE, 10MediaWiki-General, 10Traffic: Query canonicalization for MediaWiki - https://phabricator.wikimedia.org/T310087 (10Krinkle) Regarding parameter - One thing that comes to mind from a previous experiment long ago (I don't recall specifics and couldn't find any) is OAuth verification. OAuth is sensitiv... 
[21:35:28] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:54:38] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:13:04] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:14:54] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:15:33] SRE, Infrastructure-Foundations, Mail, Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (Dzahn) - deleted store@ and merchandise@ after they were created in Google- coordinated with Brendan of ITS and Sandra Hust, store manager - introduced some...
[22:15:59] SRE, Infrastructure-Foundations, Mail, Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (Dzahn) In progress→Resolved
[22:18:00] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:18:29] SRE, Infrastructure-Foundations, Mail, Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (Dzahn) The remaining SRE aliases in the file can now be separated into: - standards - SREs - DNS related - network related - dumps related - monitoring rela...
[22:19:34] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:41:28] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (wiki_willy) a:cmooney→Cmjohnson
[22:56:18] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 58.61 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[22:58:38] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 102.9 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:05:33] (PS4) Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)
[23:06:30] (CR) CI reject: [V: -1] base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[23:09:37] (PS5) Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)
[23:10:31] (CR) CI reject: [V: -1] base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[23:10:58] (PS6) Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)
[23:11:52] (CR) CI reject: [V: -1] base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[23:14:04] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:16:33] (PS1) Cwhite: logstash: copy aqs info field to error.message [puppet] - https://gerrit.wikimedia.org/r/806484 (https://phabricator.wikimedia.org/T310760)
[23:17:55] (PS7) Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)
[23:18:50] (CR) CI reject: [V: -1] base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[23:19:28] (PS8) Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)
[23:25:32] (CR) Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[23:30:33] (PS1) Dzahn: cumin: add alias for hosts with sensitive sysctl settings [puppet] - https://gerrit.wikimedia.org/r/806486 (https://phabricator.wikimedia.org/T287081)
[23:30:48] (CR) Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/806486/" [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[23:31:55] (PS2) Dzahn: cumin: add alias for hosts with sensitive sysctl settings [puppet] - https://gerrit.wikimedia.org/r/806486 (https://phabricator.wikimedia.org/T287081)
[23:33:36] SRE, SRE-Access-Requests: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (Dzahn)
[23:37:15] (PS1) Dzahn: admin: add taavi to contint-admins [puppet] - https://gerrit.wikimedia.org/r/806487 (https://phabricator.wikimedia.org/T309375)
[23:41:12] (CR) Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners (2 comments) [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[23:41:21] (PS9) Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)