[00:00:29] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [00:00:37] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:21] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [00:06:59] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:11:05] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [00:17:31] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [00:29:39] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [00:32:05] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:37:33] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [00:39:05] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [00:39:13] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:42:43] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:44:55] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [00:59:37] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:06:03] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:13:56] SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (Papaul) [01:20:41] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:31:15] PROBLEM - etcd request latencies on kubemaster1002 is
CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:36:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [01:36:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:44:51] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:45:43] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:45] (JobUnavailable) firing: (10) Reduced availability for 
job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:55:29] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:57:07] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:02:49] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:06:09] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:06:45] (JobUnavailable) resolved: (8) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:19] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:21:31] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:32:11] PROBLEM - 
etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:38:41] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:50:53] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:05:37] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:10:29] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:22:43] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [03:30:05] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:37:27] PROBLEM - etcd request 
latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:40:43] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:44:47] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:57:01] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:11:45] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:15:03] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:16:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:21:41] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:23:25] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 
0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:23:53] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:23:59] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:38:39] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:41:57] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:50:51] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:51:41] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:05:35] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:17:37] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:25:55] PROBLEM - etcd request latencies on 
kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:26:37] SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) Self note to double check it was compiled with the fix: ` # egrep -B1 "goto wait_for_unzip;|wait_for_unzip|\!buf_LRU_fr... [05:27:48] SRE, ops-codfw, DBA: db2149 broken storage after reboot - https://phabricator.wikimedia.org/T316494 (Marostegui) The RAID is now fine: ` root@db2149:~# megacli -LDInfo -Lall -aALL Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Pri... [05:28:05] (PS1) Marostegui: Revert "db2149: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/829005 [05:28:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2149 T316494 ', diff saved to https://phabricator.wikimedia.org/P33738 and previous config saved to /var/cache/conftool/dbconfig/20220902-052841-marostegui.json [05:28:47] T316494: db2149 broken storage after reboot - https://phabricator.wikimedia.org/T316494 [05:28:58] (CR) Marostegui: [C: +2] Revert "db2149: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/829005 (owner: Marostegui) [05:36:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:39:04] (PS1) Marostegui: db1107: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/829105 (https://phabricator.wikimedia.org/T316870) [05:39:48]
(CR) Marostegui: [C: +2] db1107: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/829105 (https://phabricator.wikimedia.org/T316870) (owner: Marostegui) [05:42:05] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:42:23] SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) Installed the fixed version on db1143 (s4), if all goes ok during the weekend I will start repooling it along with db11... [05:44:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119 to clone db1107 T316870', diff saved to https://phabricator.wikimedia.org/P33739 and previous config saved to /var/cache/conftool/dbconfig/20220902-054405-root.json [05:44:12] T316870: Move db1107 to s1 - https://phabricator.wikimedia.org/T316870 [05:44:43] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:46:59] (PS1) Marostegui: mariadb: Move db1107 to s1 [puppet] - https://gerrit.wikimedia.org/r/829106 (https://phabricator.wikimedia.org/T316870) [05:51:40] (CR) Marostegui: [C: +2] mariadb: Move db1107 to s1 [puppet] - https://gerrit.wikimedia.org/r/829106 (https://phabricator.wikimedia.org/T316870) (owner: Marostegui) [05:54:29] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:57:23] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200
handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [06:01:32] (CR) Legoktm: Use shell webservice-runner for node16 image (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/827009 (https://phabricator.wikimedia.org/T293552) (owner: Legoktm) [06:01:34] (PS2) Legoktm: Use shell webservice-runner for node16 image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/827009 (https://phabricator.wikimedia.org/T293552) [06:02:04] (PS3) Legoktm: Use shell webservice-runner for node16 image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/827009 (https://phabricator.wikimedia.org/T293552) [06:23:39] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [06:25:31] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [06:32:46] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method [06:33:50] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [06:37:26] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method [06:38:01] this is noise ^^ we get too few post requests on 7.4 for it to be meaningful atm [06:38:40] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [06:39:49] (PS1) Legoktm: Use shell webservice-runner for golang111 image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829107 (https://phabricator.wikimedia.org/T293552) [06:44:10] (PS2) Legoktm: Use shell webservice-runner for golang111 image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829107 (https://phabricator.wikimedia.org/T293552) [06:50:40] (PS5) Clément Goubert: C:cpufrequtils: Exclude VM from cpufrequtils [puppet] - https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) [06:51:16] (CR) CI reject: [V: -1] C:cpufrequtils: Exclude VM from cpufrequtils [puppet] - https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) (owner: Clément Goubert) [06:51:49] (PS6) Clément Goubert: C:cpufrequtils: Exclude VM from cpufrequtils [puppet] - https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) [06:53:50] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method [06:58:04] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [06:58:54] (PS1) Slyngshede: C:spamassassin Allow debugging of why service fails. [puppet] - https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) [06:59:29] (CR) CI reject: [V: -1] C:spamassassin Allow debugging of why service fails. [puppet] - https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: Slyngshede) [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220902T0700) [07:00:59] (PS2) Slyngshede: C:spamassassin Allow debugging of why service fails. [puppet] - https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) [07:04:09] (CR) Muehlenhoff: [C: +1] "Looks good, one nit inline."
[puppet] - https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) (owner: Clément Goubert) [07:05:46] (CR) Muehlenhoff: [C: +2] sre.misc-clusters.thumbor: Switch to SRELBBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/829023 (owner: Muehlenhoff) [07:05:50] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [07:08:05] (PS7) Clément Goubert: C:cpufrequtils: Exclude VM from cpufrequtils [puppet] - https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) [07:08:38] (CR) Clément Goubert: C:cpufrequtils: Exclude VM from cpufrequtils (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) (owner: Clément Goubert) [07:08:46] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) (owner: Clément Goubert) [07:09:04] (CR) Clément Goubert: [C: +2] C:cpufrequtils: Exclude VM from cpufrequtils [puppet] - https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) (owner: Clément Goubert) [07:17:06] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:17:16] !log restarting blazegraph on wdqs1016 (BlazegraphFreeAllocatorsDecreasingRapidly) [07:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:02] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:18:14] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [07:20:49] (PS17) David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) [07:20:55] (PS4) David Caro: tox: use the default python3 for the system [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 [07:21:14] (PS10) Jcrespo: P:dbbackups::mydumper Move mydumper from cron to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:21:49] (PS11) Jcrespo: P:dbbackups::mydumper Move mydumper from cron to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:21:53] (CR) David Caro: [C: +2] wmcs.novafullstack: Remove nrpe checks [puppet] - https://gerrit.wikimedia.org/r/814798 (owner: David Caro) [07:26:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [07:27:45] (PS3) David Caro: wmcs.novafullstack: Remove nrpe check and cleanup absent resources [puppet] - https://gerrit.wikimedia.org/r/814798 [07:29:02] (PS4) David Caro: wmcs.novafullstack: Remove nrpe check and cleanup absent resources [puppet] -
https://gerrit.wikimedia.org/r/814798 [07:30:02] (PS12) Jcrespo: P:dbbackups::mydumper Move mydumper from cron to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:31:27] (PS5) David Caro: wmcs.novafullstack: Remove nrpe check and cleanup absent resources [puppet] - https://gerrit.wikimedia.org/r/814798 (https://phabricator.wikimedia.org/T316919) [07:31:40] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:32:43] (CR) David Caro: [C: +2] ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: David Caro) [07:32:56] (CR) David Caro: [C: +2] tox: use the default python3 for the system [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro) [07:33:23] (CR) David Caro: [C: +2] wmcs.novafullstack: Remove nrpe check and cleanup absent resources [puppet] - https://gerrit.wikimedia.org/r/814798 (https://phabricator.wikimedia.org/T316919) (owner: David Caro) [07:35:40] (PS3) Slyngshede: C:spamassassin Allow debugging of why service fails.
[puppet] - https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) [07:35:49] SRE, Infrastructure-Foundations, fundraising-tech-ops, netops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (ayounsi) [07:37:38] (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37098/console" [puppet] - https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: Slyngshede) [07:38:37] SRE, Infrastructure-Foundations, fundraising-tech-ops, netops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (ayounsi) [07:39:18] SRE, Infrastructure-Foundations, fundraising-tech-ops, netops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (ayounsi) [07:39:33] (Merged) jenkins-bot: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: David Caro) [07:39:35] (Merged) jenkins-bot: tox: use the default python3 for the system [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro) [07:44:13] (CR) Slyngshede: [V: +1] "Avoid triggering alerts, but send output to foundations team for debugging." [puppet] - https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: Slyngshede) [07:46:28] (CR) Jcrespo: [C: +2] P:dbbackups::mydumper Move mydumper from cron to systemd timer.
[puppet] - https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:47:06] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:47:37] (CR) David Caro: [C: +1] "Feel free to ignore the nit, let me know if/when you want to merge (probably monday?)" [puppet] - https://gerrit.wikimedia.org/r/826986 (https://phabricator.wikimedia.org/T316463) (owner: Majavah) [07:50:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2002.wikimedia.org [07:54:14] (PS1) David Caro: tox: add py310 [alerts] - https://gerrit.wikimedia.org/r/829111 [07:54:26] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:54:38] (PS7) David Caro: wmcs: add ldap getent speed alerts [alerts] - https://gerrit.wikimedia.org/r/813915 [07:54:40] (PS2) David Caro: tox: add py310 [alerts] - https://gerrit.wikimedia.org/r/829111 [07:55:28] (PS3) David Caro: tox: add py310 [alerts] - https://gerrit.wikimedia.org/r/829111 [07:56:45] (CR) David Caro: [C: +2] wmcs.novafullstack: stop sending stats to statsd [puppet] - https://gerrit.wikimedia.org/r/814800 (owner: David Caro) [07:56:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2002.wikimedia.org [07:56:54] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [07:57:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1002.wikimedia.org [07:57:09] (CR) CI reject: [V: -1] wmcs: add ldap getent speed alerts
[alerts] - 10https://gerrit.wikimedia.org/r/813915 (owner: 10David Caro) [07:57:29] (03CR) 10CI reject: [V: 04-1] tox: add py310 [alerts] - 10https://gerrit.wikimedia.org/r/829111 (owner: 10David Caro) [08:01:00] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:01:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1002.wikimedia.org [08:03:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp2002.wikimedia.org [08:04:53] (03CR) 10Ayounsi: BGP: remove local-as 14907 loops 2 for anycast peers (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/827950 (owner: 10Ayounsi) [08:06:08] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:06:26] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:07:30] (03PS1) 10David Caro: Remove buster0 buildpacks images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/829116 [08:07:55] (03PS2) 10David Caro: Remove buster0 buildpacks images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/829116 [08:11:18] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:13:14] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:13:28] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host idp2002.wikimedia.org [08:13:42] PROBLEM - Check systemd state on idp2002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:28] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:15:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2002.codfw.wmnet [08:16:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2002.codfw.wmnet [08:16:52] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: send SIGUSR2 on log rotation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829034 (owner: 10Ssingh) [08:23:04] 10SRE, 10Acme-chief, 10Traffic-Icebox, 10Patch-For-Review: Let's Encrypt transitioning to ISRG's Root - https://phabricator.wikimedia.org/T263006 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [08:23:26] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:24:00] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:26:03] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) for TThoabala - https://phabricator.wikimedia.org/T315409 (10Jelto) 
05Open→03Stalled a:05Ladsgroup→03TThoabala [08:26:34] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-codfw [08:29:04] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:30:02] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:30:34] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:34:14] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10pfischer) [08:34:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-codfw [08:35:16] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10pfischer) Added missing information. 
[08:36:34] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-eqiad [08:37:49] !log fnegri@cumin1001 START - Cookbook sre.dns.netbox [08:38:07] 10SRE, 10Traffic: Implement SLI measurement for HAProxy - https://phabricator.wikimedia.org/T307898 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [08:38:54] 10SRE, 10Traffic: Implement SLI measurement for ATS - https://phabricator.wikimedia.org/T316921 (10Vgutierrez) [08:39:12] 10SRE, 10Traffic: Implement SLI measurement for ATS - https://phabricator.wikimedia.org/T316921 (10Vgutierrez) p:05Triage→03Medium [08:41:37] !log fnegri@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:44:28] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (10Vgutierrez) @wiki_willy @RobH I'm assuming this host will be decommissioned rather than fixed considering that we are already working in refreshing eqsin? [08:44:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-eqiad [08:44:38] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:45:12] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:47:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ayounsi) 05Resolved→03Open @Cmjohnson @Dzahn The following hosts are alerting in https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/ ` mw1459 (W... 
[08:48:00] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (10Vgutierrez) meanwhile I'll remove it from puppet, cause it's been a month since the host crashed and it already got pruned from puppetdb [08:51:43] (03PS1) 10Vgutierrez: cache: Remove cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/829118 (https://phabricator.wikimedia.org/T314256) [08:52:17] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10ayounsi) From https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/ `moss-be1001 (WMF5034) Device is Staged in Netbox but is missing from... [08:52:30] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:53:18] (03CR) 10Vgutierrez: [C: 03+2] cache: Remove cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/829118 (https://phabricator.wikimedia.org/T314256) (owner: 10Vgutierrez) [08:53:33] (03PS1) 10Muehlenhoff: testreduce: Switch systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/829119 [08:53:55] (03CR) 10Alexandros Kosiaris: [C: 03+1] trafficserver: remove search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/826884 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn) [08:54:05] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (10dcaro) It seems that the runbook did not clean up puppetdb or it was repopulated right after, as the host still shows there: https://debm... 
[08:54:13] (03CR) 10CI reject: [V: 04-1] testreduce: Switch systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/829119 (owner: 10Muehlenhoff) [08:54:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ayounsi) 05Resolved→03Open FYI, Netbox is alerting with: `an-presto1009 (WMF11494) Device is in PuppetDB but is Planned in Netbox (should... [08:56:46] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [08:56:46] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:56:59] (03PS2) 10Muehlenhoff: testreduce: Switch systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/829119 [08:59:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:59:26] RECOVERY - k8s requests count to the API on ml-serve-ctrl1001 is OK: (C)100 ge (W)50 ge 48.47 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [09:00:24] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10cmooney) [09:02:26] (03PS1) 10Muehlenhoff: webperf: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/829121 [09:03:02] (03CR) 10CI reject: [V: 04-1] webperf: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/829121 
(owner: 10Muehlenhoff) [09:04:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:04:26] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:05:22] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:06:32] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:07:04] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:09:34] (03PS2) 10Muehlenhoff: webperf: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/829121 [09:10:37] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10Gehel) Re-added SRE as this is ready to move forward. [09:11:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Pretty nice, let's finally puppetize this! Couple of inline suggestions." 
[puppet] - 10https://gerrit.wikimedia.org/r/828673 (owner: 10AOkoth) [09:12:07] (03CR) 10Ladsgroup: [C: 03+1] Exclude cloud-eqiad prefix from lists trusted networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828016 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [09:14:07] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10Ladsgroup) \o/ Welcome Peter! [09:18:46] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:21:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:22:09] (03PS1) 10Muehlenhoff: acme-chief: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/829122 [09:22:45] (03CR) 10CI reject: [V: 04-1] acme-chief: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/829122 (owner: 10Muehlenhoff) [09:23:59] (03CR) 10Vgutierrez: [C: 04-1] "looking good, just fix the syntax error :)" [puppet] - 10https://gerrit.wikimedia.org/r/829122 (owner: 10Muehlenhoff) [09:25:49] (03PS2) 10Muehlenhoff: acme-chief: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/829122 [09:25:51] (03CR) 10Muehlenhoff: acme-chief: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829122 (owner: 10Muehlenhoff) [09:26:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:26:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [09:26:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [09:26:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [09:26:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [09:27:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T314041)', diff saved to https://phabricator.wikimedia.org/P33743 and previous config saved to /var/cache/conftool/dbconfig/20220902-092704-ladsgroup.json [09:27:09] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [09:30:44] (03CR) 10Vgutierrez: [C: 03+1] acme-chief: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/829122 (owner: 10Muehlenhoff) [09:33:48] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:34:51] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (10Volans) The decom cookbook is meant to be idempotent, so you can safely re-run it. That said I can look next week on the logs of the prev... 
[09:35:44] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:46:46] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10Jelto) p:05Triage→03Medium a:03Jelto [09:47:12] (03PS1) 10Muehlenhoff: Update various comments [puppet] - 10https://gerrit.wikimedia.org/r/829146 (https://phabricator.wikimedia.org/T267673) [09:47:58] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:49:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/829038 (https://phabricator.wikimedia.org/T302145) (owner: 10ArielGlenn) [09:50:10] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:50:18] (03PS2) 10ArielGlenn: Add Hannah Okwelum to icinga read access, remove Holger Knust [puppet] - 10https://gerrit.wikimedia.org/r/829038 (https://phabricator.wikimedia.org/T302145) [09:51:40] (03CR) 10ArielGlenn: [C: 03+2] Add Hannah Okwelum to icinga read access, remove Holger Knust [puppet] - 10https://gerrit.wikimedia.org/r/829038 (https://phabricator.wikimedia.org/T302145) (owner: 10ArielGlenn) [09:53:20] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:58:58] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve 
announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:00:00] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:00:52] (03PS1) 10Jelto: admin: add production access for pfischer [puppet] - 10https://gerrit.wikimedia.org/r/829148 (https://phabricator.wikimedia.org/T316090) [10:01:42] (03CR) 10Gehel: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/829148 (https://phabricator.wikimedia.org/T316090) (owner: 10Jelto) [10:06:55] (03PS1) 10Jelto: admin: add pfischer to search*, analytics and deployment group [puppet] - 10https://gerrit.wikimedia.org/r/829150 (https://phabricator.wikimedia.org/T316090) [10:07:26] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [10:08:15] !log depooled kubemaster1002 for tests [10:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:34] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [10:10:44] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [10:12:30] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Discovery-Search (Current work), 10Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10Jelto) Welcome @pfischer! Thanks for the request and all the approvals. 
We are missing one last approval from @thcipri... [10:14:33] (03PS1) 10Muehlenhoff: Remove access for mmandere [puppet] - 10https://gerrit.wikimedia.org/r/829155 [10:20:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:21:33] (03CR) 10Peter Fischer: [C: 03+1] "checked public key" [puppet] - 10https://gerrit.wikimedia.org/r/829148 (https://phabricator.wikimedia.org/T316090) (owner: 10Jelto) [10:31:12] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:48] (03CR) 10Klausman: [C: 03+2] ml-services: update outlink image to support pre-transformed data [deployment-charts] - 10https://gerrit.wikimedia.org/r/829015 (https://phabricator.wikimedia.org/T315998) (owner: 10AikoChou) [10:33:04] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [10:33:24] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:33:31] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Discovery-Search (Current work), 10Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10Jelto) SSH key verified by Meet session and Gerrit +1 in https://gerrit.wikimedia.org/r/829148 [10:33:36] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for mmandere [puppet] - 10https://gerrit.wikimedia.org/r/829155 (owner: 10Muehlenhoff) [10:34:32] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging MMandere out of all services on: 779 hosts [10:34:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging MMandere out of all services on: 779 hosts [10:35:09] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging MMandere out of all services on: 1235 hosts [10:35:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging MMandere out of all services on: 1235 hosts [10:35:45] (03CR) 10DCausse: "nice!" 
[software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [10:35:49] (03Merged) 10jenkins-bot: ml-services: update outlink image to support pre-transformed data [deployment-charts] - 10https://gerrit.wikimedia.org/r/829015 (https://phabricator.wikimedia.org/T315998) (owner: 10AikoChou) [10:38:30] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:56] (03PS2) 10Hnowlan: Fix environment in prep stage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/828503 (https://phabricator.wikimedia.org/T312104) [10:43:02] (03CR) 10Hnowlan: Fix environment in prep stage (032 comments) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/828503 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [10:44:26] (03PS1) 10Vgutierrez: mtail::atsbackend: Add SLI counters [puppet] - 10https://gerrit.wikimedia.org/r/829163 (https://phabricator.wikimedia.org/T316921) [10:51:26] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:58:06] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:58:24] (03PS1) 10Muehlenhoff: Remove Marc from Icinga permissions [puppet] - 10https://gerrit.wikimedia.org/r/829165 [11:03:50] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (10phaultfinder) [11:05:20] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:46] (03PS2) 10Vgutierrez: mtail::atsbackend: Add SLI 
counters [puppet] - 10https://gerrit.wikimedia.org/r/829163 (https://phabricator.wikimedia.org/T316921) [11:27:06] (03CR) 10Muehlenhoff: [C: 03+2] Remove Marc from Icinga permissions [puppet] - 10https://gerrit.wikimedia.org/r/829165 (owner: 10Muehlenhoff) [11:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [11:31:32] (03PS1) 10Muehlenhoff: Also remove Marc's lowercase username variant from Icinga permissions [puppet] - 10https://gerrit.wikimedia.org/r/829169 [11:42:19] (03CR) 10Muehlenhoff: [C: 03+2] Also remove Marc's lowercase username variant from Icinga permissions [puppet] - 10https://gerrit.wikimedia.org/r/829169 (owner: 10Muehlenhoff) [11:43:34] (03PS6) 10Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) [11:44:53] (03CR) 10Slyngshede: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/829146 (https://phabricator.wikimedia.org/T267673) (owner: 10Muehlenhoff) [11:45:28] (03CR) 10Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. 
(034 comments) [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [11:48:20] (03PS1) 10Muehlenhoff: Remove Marc from sms contact group [puppet] - 10https://gerrit.wikimedia.org/r/829173 [11:51:15] (03CR) 10Muehlenhoff: [C: 03+2] Remove Marc from sms contact group [puppet] - 10https://gerrit.wikimedia.org/r/829173 (owner: 10Muehlenhoff) [12:00:34] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:52] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:14] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) [12:41:06] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) db1132 has now the fix installed. I will leave it replicating during the weekend and start pooling it back next week. [12:47:25] (03CR) 10DCausse: "some unit tests would be nice :)" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar) [12:56:19] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:00:40] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[13:01:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Jclark-ctr) [13:04:57] (03CR) 10Vgutierrez: "As mentioned in T316932:" [puppet] - 10https://gerrit.wikimedia.org/r/769827 (owner: 10Legoktm) [13:07:06] (03CR) 10DCausse: "lgtm," [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/816115 (owner: 10Hashar) [13:09:37] (03CR) 10DCausse: [C: 03+1] build: manage dependencies with rules_jvm_external [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/816172 (owner: 10Hashar) [13:10:12] !log redepooled kubemaster1002 [13:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:55] "redepooling".. that's tricky :) [13:15:07] hrhr [13:15:14] !log repooled kubemaster1002 [13:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:22] thanks for pointing out :) [13:15:29] oh you didn't depool it again actually [13:15:53] nope :) [13:31:10] !log pt1979@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1196'] [13:31:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Jclark-ctr) kafka-stretch1001 E3 U17 Port 17 cableid 20220230 kafka-stretch1002 F3 U17 Port 17 cableid 20220229 [13:31:43] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1196'] [13:31:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Jclark-ctr) [13:32:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Jclark-ctr) 
a:05Jclark-ctr→03Cmjohnson [13:48:30] (03PS1) 10Papaul: when using sorted i get the error msg:TypeError: '<' not supported between instances of 'DellDriver' and 'DellDriver' changed this back to list to get some install for now [cookbooks] - 10https://gerrit.wikimedia.org/r/829193 [13:51:10] (03PS1) 10JMeybohm: Cache helm list results for one minute [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/829194 [13:52:32] (03PS2) 10JMeybohm: Cache helm list results for one minute [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/829194 [13:53:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Jclark-ctr) druid1009 A5 U06 Port25 CableID 230000145 druid1010 B5 U13 Port 4 CableID 2988 druid1011 D6 U37 Port 37 Cabl... [13:54:08] (03CR) 10Vgutierrez: [C: 03+2] mtail::atsbackend: Add SLI counters [puppet] - 10https://gerrit.wikimedia.org/r/829163 (https://phabricator.wikimedia.org/T316921) (owner: 10Vgutierrez) [13:55:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Jclark-ctr) [13:55:32] (03CR) 10Papaul: [C: 03+2] when using sorted i get the error msg:TypeError: '<' not supported between instances of 'DellDriver' and 'DellDriver' changed this back to l [cookbooks] - 10https://gerrit.wikimedia.org/r/829193 (owner: 10Papaul) [13:57:09] !log pt1979@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1196'] [13:57:31] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1196'] [13:57:43] (03CR) 10Alexandros Kosiaris: [C: 03+1] Cache helm list results for one minute [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/829194 (owner: 10JMeybohm) [13:59:02] (03CR) 10JMeybohm: [V: 03+2 C: 
03+2] Cache helm list results for one minute [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/829194 (owner: 10JMeybohm) [14:01:29] (03PS1) 10JMeybohm: helm-state-metrics: Update to v0.1.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/829197 [14:01:49] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1196'] [14:02:33] (03PS1) 10JMeybohm: helm-state-metrics: Update to v0.1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/829198 [14:05:54] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1197'] [14:07:17] (03CR) 10Hashar: "I have to:" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [14:07:33] (03PS3) 10Hashar: Json schema from Gerrit Java event classes [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) [14:08:03] PROBLEM - SSH on stat1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:11:41] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudservices1003 [14:14:53] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:46] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [14:16:37] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['db1197'] [14:18:26] (03PS1) 10Muehlenhoff: puppet_compiler: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829200 (https://phabricator.wikimedia.org/T308013) [14:18:28] (03PS1) 10Muehlenhoff: keyholder: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829201 (https://phabricator.wikimedia.org/T308013) [14:18:30] 
(03PS1) 10Muehlenhoff: releases: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829202 (https://phabricator.wikimedia.org/T308013) [14:18:32] (03PS1) 10Muehlenhoff: udp2log: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829203 (https://phabricator.wikimedia.org/T308013) [14:18:34] (03PS1) 10Muehlenhoff: sslcert: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829204 (https://phabricator.wikimedia.org/T308013) [14:18:36] (03PS1) 10Muehlenhoff: k8s: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829205 (https://phabricator.wikimedia.org/T308013) [14:18:36] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:18:36] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudservices1003 [14:18:42] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `cloudservices1003` - cloudservices1003 (**F... 
[14:19:36] (03CR) 10CI reject: [V: 04-1] keyholder: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829201 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:20:03] (03CR) 10CI reject: [V: 04-1] releases: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829202 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:20:33] (03CR) 10CI reject: [V: 04-1] udp2log: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829203 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:21:44] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['db1196'] [14:22:33] (03CR) 10CI reject: [V: 04-1] k8s: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829205 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:22:52] (03CR) 10Giuseppe Lavagetto: [C: 03+1] C:ipmi::monitor: Order service after package install [puppet] - 10https://gerrit.wikimedia.org/r/828494 (owner: 10Clément Goubert) [14:23:38] (03CR) 10CI reject: [V: 04-1] sslcert: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829204 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:31:28] (03PS2) 10Muehlenhoff: keyholder: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829201 (https://phabricator.wikimedia.org/T308013) [14:32:41] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1196'] [14:33:46] 10SRE: Upgrade ferm to 2.5.1 - https://phabricator.wikimedia.org/T248954 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None [14:33:53] 10SRE: Upgrade ferm to 2.5.1 - https://phabricator.wikimedia.org/T248954 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:34:07] RECOVERY - Disk space on ms-be2037 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2037&var-datasource=codfw+prometheus/ops [14:34:22] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (10Dzahn) dmarc-rua@ is an alias for dmarc@donate.wikimedia.org donate.wikimedia.org mail is routed to the fundraising-tech's CiviCRM system. While it's po... [14:34:35] (03PS1) 10Giuseppe Lavagetto: Rebuild php 7.4 images with newer php versions [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/829206 [14:34:45] 10SRE, 10SRE-Access-Requests, 10Fundraising Tech - Chaos Crew, 10Infrastructure-Foundations, 10Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (10Dzahn) [14:37:52] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Rebuild php 7.4 images with newer php versions [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/829206 (owner: 10Giuseppe Lavagetto) [14:38:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1196'] [14:39:06] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1196'] [14:41:45] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1197'] [14:42:21] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudservices1003.wikimedia.org [14:43:08] (03PS2) 10Muehlenhoff: releases: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829202 (https://phabricator.wikimedia.org/T308013) [14:43:12] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (10Andrew) It looks like the decom script can't find it in puppetdb even though the alert says it is in puppetdb. 
[14:46:13] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [14:47:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1196'] [14:47:27] (03PS2) 10Muehlenhoff: udp2log: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829203 (https://phabricator.wikimedia.org/T308013) [14:47:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1197'] [14:48:41] (03PS1) 10CDanis: vcl: Temporarily restore past Badtitle behavior [puppet] - 10https://gerrit.wikimedia.org/r/829207 (https://phabricator.wikimedia.org/T316932) [14:49:06] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:49:06] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudservices1003.wikimedia.org [14:49:12] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `cloudservices1003.wikimedia.org` - cloudser... 
[14:49:29] RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:49:40] (03PS1) 10Vgutierrez: mtail::atsbacken: Handle cache_read|write_time=-1 [puppet] - 10https://gerrit.wikimedia.org/r/829208 (https://phabricator.wikimedia.org/T316938) [14:51:54] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudservices1003.wikimedia.org [14:52:05] (03PS2) 10Vgutierrez: mtail::atsbackend: Handle cache_read|write_time=-1 [puppet] - 10https://gerrit.wikimedia.org/r/829208 (https://phabricator.wikimedia.org/T316938) [14:53:46] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1197'] [14:55:05] PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: swift-object-replicator.service,swift-object.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:46] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [14:57:06] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1198'] [14:57:33] RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:44] (03PS2) 10Muehlenhoff: k8s: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829205 (https://phabricator.wikimedia.org/T308013) [14:58:45] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:58:46] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudservices1003.wikimedia.org [14:58:51] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: 
`cloudservices1003.wikimedia.org` - cloudser... [14:59:08] (03CR) 10Dzahn: "Adding Arnold. This is about what I had in mind but if the mail gets sent only to infra-foundations he won't be able to debug it. Maybe if" [puppet] - 10https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: 10Slyngshede) [14:59:17] (03CR) 10Vgutierrez: [C: 03+2] mtail::atsbackend: Handle cache_read|write_time=-1 [puppet] - 10https://gerrit.wikimedia.org/r/829208 (https://phabricator.wikimedia.org/T316938) (owner: 10Vgutierrez) [15:00:02] (03CR) 10Dzahn: "and like Moritz said, we do not expect this is caused by switching to systemd timer. We expect that this merely gave us the new monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: 10Slyngshede) [15:00:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Jclark-ctr) druid1009 E2 U38 Port43 Cableid 23000023 druid1009 E3 U38 Port43 Cableid 23000054 druid1009 F2 U38 Port43 Ca... 
[15:00:55] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1196.eqiad.wmnet with OS bullseye [15:01:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [15:01:04] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1196.eqiad.wmnet with OS bullseye [15:01:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1197'] [15:02:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1198'] [15:03:02] (03PS2) 10Muehlenhoff: sslcert: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829204 (https://phabricator.wikimedia.org/T308013) [15:03:28] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1198'] [15:03:30] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:20] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1198'] [15:04:59] !log depooled kubemaster1001 [15:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:51] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (10Andrew) a:05Cmjohnson→03Volans I am now officially out of ideas :) Over to you, @volans! 
[15:06:55] (03CR) 10JMeybohm: [C: 03+1] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/829205 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:09:06] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1198'] [15:09:29] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1199'] [15:09:33] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1198'] [15:09:53] 10SRE, 10Traffic, 10Patch-For-Review: atsbackend.mtail doesn't track requests with cache read|write time set to -1 properly - https://phabricator.wikimedia.org/T316938 (10Vgutierrez) 05In progress→03Resolved after merging https://gerrit.wikimedia.org/r/829208 sli_total|sli_good counters seem sane: `vguti... [15:13:00] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1196.eqiad.wmnet with reason: host reimage [15:14:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1199'] [15:15:53] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1199'] [15:16:51] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1196.eqiad.wmnet with reason: host reimage [15:18:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1197.eqiad.wmnet with OS bullseye [15:18:55] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1197.eqiad.wmnet with OS bullseye [15:19:01] (03Abandoned) 10Jdrewniak: Add wgWMEWebUIScrollTrackingSamplingRate config to beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784690 
(https://phabricator.wikimedia.org/T303297) (owner: 10Jdrewniak) [15:19:30] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1198'] [15:23:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1199'] [15:27:27] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1201'] [15:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:28:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1198'] [15:30:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1197.eqiad.wmnet with reason: host reimage [15:31:45] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) @Jclark-ctr for more information: the MMF/MTP fibers ordered in https://phabricator.wikimedia.org/T313464 we want 1 fiber from rack c2 to rack a1 1 fi... [15:31:59] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1196.eqiad.wmnet with OS bullseye [15:32:05] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1196.eqiad.wmnet with OS bullseye completed: - db1... 
[15:32:20] (03PS1) 10Vgutierrez: Add Trafficserver SLO dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921) [15:32:45] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1200'] [15:32:49] (03CR) 10Dzahn: [C: 03+1] releases: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829202 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:33:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1201'] [15:33:58] (03PS2) 10Vgutierrez: Add Trafficserver SLO dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921) [15:34:01] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1198.eqiad.wmnet with OS bullseye [15:34:08] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1198.eqiad.wmnet with OS bullseye [15:34:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1197.eqiad.wmnet with reason: host reimage [15:34:54] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10Papaul) [15:35:37] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1201'] [15:35:59] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1201'] [15:37:27] !log repooled kubemaster1001 [15:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:52] !log depool kubemaster2002 [15:39:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1200'] [15:42:19] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1200'] [15:45:59] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1198.eqiad.wmnet with reason: host reimage [15:47:59] (03PS1) 10Hnowlan: api-gateway: Distinguish between internal host and host header setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/829216 [15:48:44] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [15:49:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1197.eqiad.wmnet with OS bullseye [15:49:54] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1198.eqiad.wmnet with reason: host reimage [15:49:58] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1197.eqiad.wmnet with OS bullseye completed: - db1... [15:50:25] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:50:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1200'] [15:51:55] (03CR) 10Vgutierrez: [C: 03+1] vcl: Temporarily restore past Badtitle behavior (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829207 (https://phabricator.wikimedia.org/T316932) (owner: 10CDanis) [15:52:37] PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:07] (03PS1) 10Giuseppe Lavagetto: api_appserver: convert all canaries to php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/829217 (https://phabricator.wikimedia.org/T271736) [15:54:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:55:18] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37100/console" [puppet] - 10https://gerrit.wikimedia.org/r/829217 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [15:56:18] (03PS1) 10David Caro: bullseye0: Improve the install-packages script [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/829218 (https://phabricator.wikimedia.org/T316854) [15:57:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1199.eqiad.wmnet with OS bullseye [15:57:24] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1199.eqiad.wmnet with OS bullseye [15:57:25] RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:47] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1201'] [15:57:57] !log repool kubemaster2002 [15:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:58:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:58:40] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (10MatthewVernon) Hi @Jclark-ctr sorry to be a bother, but could you check another drive in this system, please? `/dev/sdy` has started being unhappy since 31 August at about the tim... [15:58:50] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (10MatthewVernon) 05Resolved→03Open [16:01:53] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1202'] [16:02:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:03:49] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1198.eqiad.wmnet with OS bullseye [16:03:56] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1198.eqiad.wmnet with OS bullseye completed: - db1... 
[16:05:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1201'] [16:07:17] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1200.eqiad.wmnet with OS bullseye [16:07:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:07:22] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1200.eqiad.wmnet with OS bullseye [16:08:14] PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:04] (03CR) 10CDanis: [C: 03+2] vcl: Temporarily restore past Badtitle behavior [puppet] - 10https://gerrit.wikimedia.org/r/829207 (https://phabricator.wikimedia.org/T316932) (owner: 10CDanis) [16:09:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1199.eqiad.wmnet with reason: host reimage [16:10:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1202'] [16:10:52] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1203'] [16:11:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:11:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:11:19] (03PS2) 10CDanis: vcl: Temporarily restore past Badtitle behavior [puppet] - 10https://gerrit.wikimedia.org/r/829207 (https://phabricator.wikimedia.org/T316932) [16:11:29] (03CR) 10CDanis: [C: 03+2] vcl: Temporarily restore past Badtitle behavior (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829207 
(https://phabricator.wikimedia.org/T316932) (owner: 10CDanis) [16:13:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1199.eqiad.wmnet with reason: host reimage [16:15:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:15:45] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1202'] [16:17:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1203'] [16:19:15] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1200.eqiad.wmnet with reason: host reimage [16:23:15] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1200.eqiad.wmnet with reason: host reimage [16:23:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1202'] [16:26:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1199.eqiad.wmnet with OS bullseye [16:26:57] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1199.eqiad.wmnet with OS bullseye completed: - db1... 
[16:27:09] RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:26] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1203'] [16:31:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1201.eqiad.wmnet with OS bullseye [16:31:48] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1201.eqiad.wmnet with OS bullseye [16:37:59] PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:38:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1203'] [16:39:05] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1200.eqiad.wmnet with OS bullseye [16:39:10] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1200.eqiad.wmnet with OS bullseye completed: - db1... 
[16:40:56] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1202.eqiad.wmnet with OS bullseye [16:41:02] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1202.eqiad.wmnet with OS bullseye [16:42:32] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10Papaul) [16:43:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1201.eqiad.wmnet with reason: host reimage [16:47:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1201.eqiad.wmnet with reason: host reimage [16:52:50] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1202.eqiad.wmnet with reason: host reimage [16:56:17] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1202.eqiad.wmnet with reason: host reimage [17:00:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1201.eqiad.wmnet with OS bullseye [17:00:14] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1201.eqiad.wmnet with OS bullseye completed: - db1... 
[17:01:51] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:57] RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:04] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1202.eqiad.wmnet with OS bullseye [17:11:12] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1202.eqiad.wmnet with OS bullseye completed: - db1... [17:12:45] (03CR) 10JMeybohm: "This change is ready for review." (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: 10JMeybohm) [17:13:55] (03CR) 10Subramanya Sastry: [C: 03+1] "I am okay with it if it is okay with Daniel. :)" [puppet] - 10https://gerrit.wikimedia.org/r/829119 (owner: 10Muehlenhoff) [17:14:05] (03PS7) 10JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) [17:28:20] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (10wiki_willy) Hi @Vgutierrez - yeah, probably makes more sense to replace than purchase a replacement part, since the new servers have already been ordered and are expected to arrive in Oct... 
[17:28:44] (03PS1) 10Andrew Bogott: sre-sandbox: remove automatic VM purge logic [puppet] - 10https://gerrit.wikimedia.org/r/829231 (https://phabricator.wikimedia.org/T247517) [17:29:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1203.eqiad.wmnet with OS bullseye [17:29:32] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1203.eqiad.wmnet with OS bullseye [17:41:11] 10SRE, 10Cloud-VPS (Project-requests), 10Patch-For-Review, 10cloud-services-team (Kanban): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (10Andrew) @jbond, after the recent unpleasantness with @herron having VMs deleted by surprise I'm revisiting the practices in... [17:41:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1203.eqiad.wmnet with reason: host reimage [17:44:28] (03PS1) 10Milimetric: aqs: update mw history snapshot [puppet] - 10https://gerrit.wikimedia.org/r/829233 [17:45:03] (03CR) 10Milimetric: [C: 04-1] "this has to wait for the druid load to finish, just putting it out before I forget." 
[puppet] - 10https://gerrit.wikimedia.org/r/829233 (owner: 10Milimetric) [17:45:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1203.eqiad.wmnet with reason: host reimage [17:46:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:51:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:51:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:51:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:51:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:52:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:58:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1203.eqiad.wmnet with OS bullseye [17:58:25] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1203.eqiad.wmnet with OS bullseye completed: - db1... 
[18:09:37] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10Papaul) [18:10:41] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10Papaul) 05Open→03Resolved @Marostegui @Jclark-ctr this is complete [18:16:47] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:24:40] (03CR) 10Andrew Bogott: [C: 03+2] Move profile::wmcs::backup_glance_images from cloudcontrols to backup servers [puppet] - 10https://gerrit.wikimedia.org/r/828623 (https://phabricator.wikimedia.org/T316738) (owner: 10Andrew Bogott) [18:30:39] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (10RobH) >>! In T314256#8207835, @Vgutierrez wrote: > @wiki_willy @RobH I'm assuming this host will be decommissioned rather than fixed considering that we are already working in refreshing... 
[18:39:31] I'm going to do some mw-on-k8s image build/deploy testing [18:39:36] (03CR) 10Dzahn: [C: 03+1] "yes, looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/829119 (owner: 10Muehlenhoff) [18:39:50] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/37101/testreduce1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/829119 (owner: 10Muehlenhoff) [18:40:02] !log dancy@deploy1002 Started scap: testing T299648 [18:40:08] T299648: Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 [18:41:12] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "only affects testreduce1001 - https://puppet-compiler.wmflabs.org/pcc-worker1001/37102/" [puppet] - 10https://gerrit.wikimedia.org/r/829119 (owner: 10Muehlenhoff) [18:42:28] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "File[/etc/sysusers.d/testreduce.conf]/ensure: defined content .. /Exec[Refresh sysusers]: Triggered 'refresh' from 1 event ..and that's al" [puppet] - 10https://gerrit.wikimedia.org/r/829119 (owner: 10Muehlenhoff) [18:43:20] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "[testreduce1001:~] $ cat /etc/sysusers.d/testreduce.conf" [puppet] - 10https://gerrit.wikimedia.org/r/829119 (owner: 10Muehlenhoff) [18:46:12] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37103/webperf1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/829121 (owner: 10Muehlenhoff) [18:47:30] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:48:10] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "just merged a similar change for testreduce at I8304306b989642 and it was just fine. 
it will create the new config file and refresh sysuse" [puppet] - 10https://gerrit.wikimedia.org/r/829121 (owner: 10Muehlenhoff) [18:49:38] (03CR) 10Dzahn: [C: 03+1] acme-chief: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/829122 (owner: 10Muehlenhoff) [18:51:38] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:51:56] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:55:51] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:56:07] !log dancy@deploy1002 dancy: testing T299648 synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [18:56:11] T299648: Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 [18:58:04] !log dancy@deploy1002 Sync cancelled. [19:00:09] Test completed [19:01:54] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "testing if it's gone everywhere with: cumin -b 20 -s 2 -x C:puppet::agent 'file /etc/cron.d/puppet'" [puppet] - 10https://gerrit.wikimedia.org/r/826274 (owner: 10Muehlenhoff) [19:02:31] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (10bking) I cleaned out the `flink_ha_storage` pseudofolder from the `rdf-streaming-updater-codfw` bucket as r... 
[19:03:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:03:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:03:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:03:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:11:42] (03PS2) 10Dzahn: Update various comments [puppet] - 10https://gerrit.wikimedia.org/r/829146 (https://phabricator.wikimedia.org/T273673) (owner: 10Muehlenhoff) [19:12:05] (03CR) 10Dzahn: [C: 03+1] "was accidentally linked to an unrelated ticket. fixing bug: link" [puppet] - 10https://gerrit.wikimedia.org/r/829146 (https://phabricator.wikimedia.org/T273673) (owner: 10Muehlenhoff) [19:13:00] (03CR) 10Dzahn: [C: 03+2] "comments-only and all look good to me" [puppet] - 10https://gerrit.wikimedia.org/r/829146 (https://phabricator.wikimedia.org/T273673) (owner: 10Muehlenhoff) [19:14:25] 10SRE, 10SRE-Access-Requests, 10Fundraising Tech - Chaos Crew, 10Infrastructure-Foundations, 10Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (10Jgreen) >>! In T316899#8208607, @Dzahn wrote: > dmarc-rua@ is an alias for dmarc@donate.wikimedia.org > > donate.wikim... [19:19:28] 10SRE, 10SRE-Access-Requests, 10Fundraising Tech - Chaos Crew, 10Infrastructure-Foundations, 10Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (10Dzahn) @Jgreen Alright, gotcha! In that case can we move the alias into the section that says managed by fr-tech and y... [19:22:27] 10SRE, 10SRE-Access-Requests, 10Fundraising Tech - Chaos Crew, 10Infrastructure-Foundations, 10Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (10Jgreen) >>! In T316899#8209163, @Dzahn wrote: > @Jgreen Alright, gotcha! In that case can we move the alias into the s... 
[19:24:13] 10SRE, 10SRE-Access-Requests, 10Fundraising Tech - Chaos Crew, 10Infrastructure-Foundations, 10Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (10Dzahn) @Jgreen I am not sure but let's ask for input from Infra Foundations. [19:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [19:33:59] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:38:55] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (10phaultfinder) [20:01:27] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:09:39] (03PS3) 10Dzahn: rancid: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/826867 (owner: 10Muehlenhoff) [20:11:19] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:21:47] 10SRE, 10SRE-Access-Requests, 10Fundraising Tech - Chaos Crew, 10Infrastructure-Foundations, 10Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (10Jgreen) T86209 is relevant, we do appear to still be sending dmarc-ruf@ to dmarcian, although I don't know who has acce... 
[20:35:05] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:41:06] (03CR) 10Dzahn: [C: 03+2] rancid: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/826867 (owner: 10Muehlenhoff) [20:42:57] (03CR) 10Dzahn: [C: 03+2] "on netmon1002, netmon1003, netmon2001 /etc/sysusers.d/rancid.conf was created and sysusers was refreshed and nothing else happened because" [puppet] - 10https://gerrit.wikimedia.org/r/826867 (owner: 10Muehlenhoff) [20:48:04] 10SRE, 10SRE-Access-Requests, 10Fundraising Tech - Chaos Crew, 10Infrastructure-Foundations, 10Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (10Jgreen) I was able to log into the "wikimedia" dmarcian account and determine that the subscription expired, so I sent... [20:49:53] 10SRE, 10SRE-Access-Requests, 10Fundraising Tech - Chaos Crew, 10Infrastructure-Foundations, and 2 others: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (10Jgreen) p:05Triage→03Medium [20:53:47] (03PS1) 10Dzahn: phabricator: ensure only the one active_server connects to rw mysql [puppet] - 10https://gerrit.wikimedia.org/r/829244 (https://phabricator.wikimedia.org/T315713) [20:56:16] (03CR) 10Volans: Remove obsolete absented cron file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826274 (owner: 10Muehlenhoff) [20:57:01] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37104/" [puppet] - 10https://gerrit.wikimedia.org/r/829244 (https://phabricator.wikimedia.org/T315713) (owner: 10Dzahn) [21:01:04] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "Ok, I'll revert if that's a problem. I am not entirely sure what the call to action is though." [puppet] - 10https://gerrit.wikimedia.org/r/826274 (owner: 10Muehlenhoff) [21:06:50] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "well.. 
if the file doesn't exist anywhere I fail to see what the problem is that we need to fix" [puppet] - 10https://gerrit.wikimedia.org/r/826274 (owner: 10Muehlenhoff) [21:07:10] (03PS1) 10Dzahn: Revert "Remove obsolete absented cron file" [puppet] - 10https://gerrit.wikimedia.org/r/829143 [21:07:21] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:08:59] (03CR) 10Dzahn: [C: 03+2] Revert "Remove obsolete absented cron file" [puppet] - 10https://gerrit.wikimedia.org/r/829143 (owner: 10Dzahn) [21:10:45] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "reverted" [puppet] - 10https://gerrit.wikimedia.org/r/826274 (owner: 10Muehlenhoff) [21:15:01] (03PS3) 10Dduvall: phabricator: Deploy user should own everything under old rev directories [puppet] - 10https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) [21:16:34] (03CR) 10Dduvall: phabricator: Deploy user should own everything under old rev directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [21:18:25] (03CR) 10Dzahn: phabricator: Deploy user should own everything under old rev directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [21:19:46] (03CR) 10Dzahn: phabricator: Deploy user should own everything under old rev directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [21:20:44] (03PS4) 10Dduvall: phabricator: Deploy user should own everything under old rev directories [puppet] - 10https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) [21:21:08] (03PS5) 10Dduvall: phabricator: Deploy user should own everything under old rev directories [puppet] - 
10https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) [21:21:17] (03CR) 10Dzahn: "there is a rule that shell scripts are not supposed to be created from .sh.erb files anymore because then CI can't check the .sh files. Th" [puppet] - 10https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [21:22:45] (03CR) 10Dduvall: phabricator: Deploy user should own everything under old rev directories (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [21:40:31] (03PS6) 10Dduvall: phabricator: Deploy user should own everything under old rev directories [puppet] - 10https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) [21:42:15] (03PS7) 10Dduvall: phabricator: Deploy user should own everything under old rev directories [puppet] - 10https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) [21:43:56] (03CR) 10Dduvall: phabricator: Deploy user should own everything under old rev directories (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [21:46:29] (03CR) 10Volans: Remove obsolete absented cron file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826274 (owner: 10Muehlenhoff) [22:01:45] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:08] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10Dzahn) fwiw WMF-NDA on Phabricator and LDAP groups are entirely unrelated things and handled by different people. 
[22:11:01] (03PS1) 10Dzahn: Revert "Revert "Remove obsolete absented cron file"" [puppet] - 10https://gerrit.wikimedia.org/r/829144 [22:13:01] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "Remove obsolete absented cron file"" [puppet] - 10https://gerrit.wikimedia.org/r/829144 (owner: 10Dzahn) [22:14:06] (03CR) 10Dzahn: [V: 03+1 C: 03+2] Remove obsolete absented cron file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826274 (owner: 10Muehlenhoff) [22:19:22] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "cool, thank you. lgtm and compiles" [puppet] - 10https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [23:07:45] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:26:09] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:26:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert