[00:00:29] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[00:00:37] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:21] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[00:06:59] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:11:05] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[00:17:31] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[00:29:39] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[00:32:05] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:37:33] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[00:39:05] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[00:39:13] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:43] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:55] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[00:59:37] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:06:03] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:13:56] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (Papaul)
[01:20:41] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:31:15] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:36:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[01:36:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:44:51] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:45:43] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:46:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:55:29] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:57:07] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[02:02:49] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[02:06:09] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:06:45] <jinxer-wm>	 (JobUnavailable) resolved: (8) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:19] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[02:21:31] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[02:32:11] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[02:38:41] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[02:50:53] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:05:37] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:10:29] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:22:43] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:27:34] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[03:30:05] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:37:27] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:40:43] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:44:47] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:57:01] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:11:45] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:15:03] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:16:07] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:21:41] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:23:25] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:23:53] <icinga-wm>	 PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:23:59] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:38:39] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:41:57] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:50:51] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:51:41] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:05:35] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:17:37] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:25:55] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:26:37] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) Self note to double check it was compiled with the fix: ` # egrep -B1 "goto wait_for_unzip;|wait_for_unzip|\!buf_LRU_fr...
[05:27:48] <wikibugs>	 SRE, ops-codfw, DBA: db2149 broken storage after reboot - https://phabricator.wikimedia.org/T316494 (Marostegui) The RAID is now fine: ` root@db2149:~# megacli -LDInfo -Lall -aALL   Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name                : RAID Level          : Pri...
[05:28:05] <wikibugs>	 (PS1) Marostegui: Revert "db2149: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/829005
[05:28:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2149 T316494 ', diff saved to https://phabricator.wikimedia.org/P33738 and previous config saved to /var/cache/conftool/dbconfig/20220902-052841-marostegui.json
[05:28:47] <stashbot>	 T316494: db2149 broken storage after reboot - https://phabricator.wikimedia.org/T316494
[05:28:58] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db2149: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/829005 (owner: Marostegui)
[05:36:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:39:04] <wikibugs>	 (PS1) Marostegui: db1107: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/829105 (https://phabricator.wikimedia.org/T316870)
[05:39:48] <wikibugs>	 (CR) Marostegui: [C: +2] db1107: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/829105 (https://phabricator.wikimedia.org/T316870) (owner: Marostegui)
[05:42:05] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:42:23] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) Installed the fixed version on db1143 (s4), if all goes ok during the weekend I will start repooling it along with db11...
[05:44:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119 to clone db1107 T316870', diff saved to https://phabricator.wikimedia.org/P33739 and previous config saved to /var/cache/conftool/dbconfig/20220902-054405-root.json
[05:44:12] <stashbot>	 T316870: Move db1107 to s1 - https://phabricator.wikimedia.org/T316870
[05:44:43] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:46:59] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1107 to s1 [puppet] - https://gerrit.wikimedia.org/r/829106 (https://phabricator.wikimedia.org/T316870)
[05:51:40] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Move db1107 to s1 [puppet] - https://gerrit.wikimedia.org/r/829106 (https://phabricator.wikimedia.org/T316870) (owner: Marostegui)
[05:54:29] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:57:23] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[06:01:32] <wikibugs>	 (CR) Legoktm: Use shell webservice-runner for node16 image (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/827009 (https://phabricator.wikimedia.org/T293552) (owner: Legoktm)
[06:01:34] <wikibugs>	 (PS2) Legoktm: Use shell webservice-runner for node16 image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/827009 (https://phabricator.wikimedia.org/T293552)
[06:02:04] <wikibugs>	 (PS3) Legoktm: Use shell webservice-runner for node16 image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/827009 (https://phabricator.wikimedia.org/T293552)
[06:23:39] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:25:31] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[06:32:46] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method
[06:33:50] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[06:37:26] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method
[06:38:01] <joe>	 this is noise ^^ we get too few POST requests on 7.4 for it to be meaningful atm
[06:38:40] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[06:39:49] <wikibugs>	 (PS1) Legoktm: Use shell webservice-runner for golang111 image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829107 (https://phabricator.wikimedia.org/T293552)
[06:44:10] <wikibugs>	 (PS2) Legoktm: Use shell webservice-runner for golang111 image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829107 (https://phabricator.wikimedia.org/T293552)
[06:50:40] <wikibugs>	 (PS5) Clément Goubert: C:cpufrequtils: Exclude VM from cpufrequtils [puppet] - https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398)
[06:51:16] <wikibugs>	 (CR) CI reject: [V: -1] C:cpufrequtils: Exclude VM from cpufrequtils [puppet] - https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) (owner: Clément Goubert)
[06:51:49] <wikibugs>	 (PS6) Clément Goubert: C:cpufrequtils: Exclude VM from cpufrequtils [puppet] - https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398)
[06:53:50] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method
[06:58:04] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[06:58:54] <wikibugs>	 (PS1) Slyngshede: C:spamassassin Allow debugging of why service fails. [puppet] - https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903)
[06:59:29] <wikibugs>	 (CR) CI reject: [V: -1] C:spamassassin Allow debugging of why service fails. [puppet] - https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: Slyngshede)
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220902T0700)
[07:00:59] <wikibugs>	 (PS2) Slyngshede: C:spamassassin Allow debugging of why service fails. [puppet] - https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903)
[07:04:09] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, one nit inline." [puppet] - https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) (owner: Clément Goubert)
[07:05:46] <wikibugs>	 (CR) Muehlenhoff: [C: +2] sre.misc-clusters.thumbor: Switch to SRELBBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/829023 (owner: Muehlenhoff)
[07:05:50] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:08:05] <wikibugs>	 (PS7) Clément Goubert: C:cpufrequtils: Exclude VM from cpufrequtils [puppet] - https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398)
[07:08:38] <wikibugs>	 (CR) Clément Goubert: C:cpufrequtils: Exclude VM from cpufrequtils (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) (owner: Clément Goubert)
[07:08:46] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) (owner: Clément Goubert)
[07:09:04] <wikibugs>	 (CR) Clément Goubert: [C: +2] C:cpufrequtils: Exclude VM from cpufrequtils [puppet] - https://gerrit.wikimedia.org/r/829040 (https://phabricator.wikimedia.org/T315398) (owner: Clément Goubert)
[07:17:06] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:17:16] <dcausse>	 !log restarting blazegraph on wdqs1016 (BlazegraphFreeAllocatorsDecreasingRapidly)
[07:17:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:02] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:18:14] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:20:49] <wikibugs>	 (PS17) David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870)
[07:20:55] <wikibugs>	 (PS4) David Caro: tox: use the default python3 for the system [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782
[07:21:14] <wikibugs>	 (PS10) Jcrespo: P:dbbackups::mydumper Move mydumper from cron to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:21:49] <wikibugs>	 (PS11) Jcrespo: P:dbbackups::mydumper Move mydumper from cron to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:21:53] <wikibugs>	 (CR) David Caro: [C: +2] wmcs.novafullstack: Remove nrpe checks [puppet] - https://gerrit.wikimedia.org/r/814798 (owner: David Caro)
[07:26:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:27:34] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[07:27:45] <wikibugs>	 (PS3) David Caro: wmcs.novafullstack: Remove nrpe check and cleanup absent resources [puppet] - https://gerrit.wikimedia.org/r/814798
[07:29:02] <wikibugs>	 (PS4) David Caro: wmcs.novafullstack: Remove nrpe check and cleanup absent resources [puppet] - https://gerrit.wikimedia.org/r/814798
[07:30:02] <wikibugs>	 (PS12) Jcrespo: P:dbbackups::mydumper Move mydumper from cron to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:31:27] <wikibugs>	 (PS5) David Caro: wmcs.novafullstack: Remove nrpe check and cleanup absent resources [puppet] - https://gerrit.wikimedia.org/r/814798 (https://phabricator.wikimedia.org/T316919)
[07:31:40] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:32:43] <wikibugs>	 (CR) David Caro: [C: +2] ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[07:32:56] <wikibugs>	 (CR) David Caro: [C: +2] tox: use the default python3 for the system [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro)
[07:33:23] <wikibugs>	 (CR) David Caro: [C: +2] wmcs.novafullstack: Remove nrpe check and cleanup absent resources [puppet] - https://gerrit.wikimedia.org/r/814798 (https://phabricator.wikimedia.org/T316919) (owner: David Caro)
[07:35:40] <wikibugs>	 (PS3) Slyngshede: C:spamassassin Allow debugging of why service fails. [puppet] - https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903)
[07:35:49] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, netops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (ayounsi)
[07:37:38] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37098/console" [puppet] - https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: Slyngshede)
[07:38:37] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, netops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (ayounsi)
[07:39:18] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, netops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (ayounsi)
[07:39:33] <wikibugs>	 (Merged) jenkins-bot: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[07:39:35] <wikibugs>	 (Merged) jenkins-bot: tox: use the default python3 for the system [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro)
[07:44:13] <wikibugs>	 (CR) Slyngshede: [V: +1] "Avoid triggering alerts, but send output to foundations team for debugging." [puppet] - https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: Slyngshede)
[07:46:28] <wikibugs>	 (CR) Jcrespo: [C: +2] P:dbbackups::mydumper Move mydumper from cron to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:47:06] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:47:37] <wikibugs>	 (CR) David Caro: [C: +1] "Feel free to ignore the nit, let me know if/when you want to merge (probably monday?)" [puppet] - https://gerrit.wikimedia.org/r/826986 (https://phabricator.wikimedia.org/T316463) (owner: Majavah)
[07:50:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2002.wikimedia.org
[07:54:14] <wikibugs>	 (PS1) David Caro: tox: add py310 [alerts] - https://gerrit.wikimedia.org/r/829111
[07:54:26] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:38] <wikibugs>	 (PS7) David Caro: wmcs: add ldap getent speed alerts [alerts] - https://gerrit.wikimedia.org/r/813915
[07:54:40] <wikibugs>	 (PS2) David Caro: tox: add py310 [alerts] - https://gerrit.wikimedia.org/r/829111
[07:55:28] <wikibugs>	 (PS3) David Caro: tox: add py310 [alerts] - https://gerrit.wikimedia.org/r/829111
[07:56:45] <wikibugs>	 (CR) David Caro: [C: +2] wmcs.novafullstack: stop sending stats to statsd [puppet] - https://gerrit.wikimedia.org/r/814800 (owner: David Caro)
[07:56:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2002.wikimedia.org
[07:56:54] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:57:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1002.wikimedia.org
[07:57:09] <wikibugs>	 (CR) CI reject: [V: -1] wmcs: add ldap getent speed alerts [alerts] - https://gerrit.wikimedia.org/r/813915 (owner: David Caro)
[07:57:29] <wikibugs>	 (CR) CI reject: [V: -1] tox: add py310 [alerts] - https://gerrit.wikimedia.org/r/829111 (owner: David Caro)
[08:01:00] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:01:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1002.wikimedia.org
[08:03:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp2002.wikimedia.org
[08:04:53] <wikibugs>	 (CR) Ayounsi: BGP: remove local-as 14907 loops 2 for anycast peers (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/827950 (owner: Ayounsi)
[08:06:08] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:06:26] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:07:30] <wikibugs>	 (PS1) David Caro: Remove buster0 buildpacks images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829116
[08:07:55] <wikibugs>	 (PS2) David Caro: Remove buster0 buildpacks images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829116
[08:11:18] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:13:14] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:13:28] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host idp2002.wikimedia.org
[08:13:42] <icinga-wm>	 PROBLEM - Check systemd state on idp2002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:14:28] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:15:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2002.codfw.wmnet
[08:16:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2002.codfw.wmnet
[08:16:52] <wikibugs>	 (CR) Vgutierrez: [C: +1] trafficserver: send SIGUSR2 on log rotation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829034 (owner: Ssingh)
[08:23:04] <wikibugs>	 SRE, Acme-chief, Traffic-Icebox, Patch-For-Review: Let's Encrypt transitioning to ISRG's Root - https://phabricator.wikimedia.org/T263006 (Vgutierrez) Open→Resolved a:Vgutierrez
[08:23:26] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:24:00] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:26:03] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Operations: Access request to analytics system(s) for TThoabala - https://phabricator.wikimedia.org/T315409 (Jelto) Open→Stalled a:Ladsgroup→TThoabala
[08:26:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-codfw
[08:29:04] <icinga-wm>	 RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:30:02] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:30:34] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:34:14] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (pfischer)
[08:34:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-codfw
[08:35:16] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (pfischer) Added missing information.
[08:36:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-eqiad
[08:37:49] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.dns.netbox
[08:38:07] <wikibugs>	 SRE, Traffic: Implement SLI measurement for HAProxy - https://phabricator.wikimedia.org/T307898 (Vgutierrez) Open→Resolved a:Vgutierrez
[08:38:54] <wikibugs>	 SRE, Traffic: Implement SLI measurement for ATS - https://phabricator.wikimedia.org/T316921 (Vgutierrez)
[08:39:12] <wikibugs>	 SRE, Traffic: Implement SLI measurement for ATS - https://phabricator.wikimedia.org/T316921 (Vgutierrez) p:Triage→Medium
[08:41:37] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:44:28] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (Vgutierrez) @wiki_willy @RobH I'm assuming this host will be decommissioned rather than fixed considering that we are already working in refreshing eqsin?
[08:44:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-eqiad
[08:44:38] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:45:12] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:47:55] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (ayounsi) Resolved→Open @Cmjohnson @Dzahn   The following hosts are alerting in https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/ ` mw1459 (W...
[08:48:00] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (Vgutierrez) meanwhile I'll remove it from puppet, cause it's been a month since the host crashed and it already got pruned from puppetdb
[08:51:43] <wikibugs>	 (PS1) Vgutierrez: cache: Remove cp5001 [puppet] - https://gerrit.wikimedia.org/r/829118 (https://phabricator.wikimedia.org/T314256)
[08:52:17] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Install NVMe SSDs into  moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (ayounsi) From https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/ `moss-be1001 (WMF5034)  Device is Staged in Netbox but is missing from...
[08:52:30] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:53:18] <wikibugs>	 (CR) Vgutierrez: [C: +2] cache: Remove cp5001 [puppet] - https://gerrit.wikimedia.org/r/829118 (https://phabricator.wikimedia.org/T314256) (owner: Vgutierrez)
[08:53:33] <wikibugs>	 (PS1) Muehlenhoff: testreduce: Switch systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/829119
[08:53:55] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] trafficserver: remove search.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/826884 (https://phabricator.wikimedia.org/T316296) (owner: Dzahn)
[08:54:05] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (dcaro) It seems that the runbook did not cleanup puppetdb or it was repopulated right after, as the host still shows there:  https://debm...
[08:54:13] <wikibugs>	 (CR) CI reject: [V: -1] testreduce: Switch systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/829119 (owner: Muehlenhoff)
[08:54:46] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ayounsi) Resolved→Open FYI, Netbox is alerting with: `an-presto1009 (WMF11494)  Device is in PuppetDB but is Planned in Netbox (should...
[08:56:46] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[08:56:46] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:56:59] <wikibugs>	 (PS2) Muehlenhoff: testreduce: Switch systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/829119
[08:59:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:59:26] <icinga-wm>	 RECOVERY - k8s requests count to the API on ml-serve-ctrl1001 is OK: (C)100 ge (W)50 ge 48.47 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[09:00:24] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (cmooney)
[09:02:26] <wikibugs>	 (PS1) Muehlenhoff: webperf: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/829121
[09:03:02] <wikibugs>	 (CR) CI reject: [V: -1] webperf: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/829121 (owner: Muehlenhoff)
[09:04:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:04:26] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:05:22] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:06:32] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:07:04] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:09:34] <wikibugs>	 (PS2) Muehlenhoff: webperf: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/829121
[09:10:37] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (Gehel) Re-added SRE as this is ready to move forward.
[09:11:48] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] "Pretty nice, let's finally puppetize this! Couple of inline suggestions." [puppet] - https://gerrit.wikimedia.org/r/828673 (owner: AOkoth)
[09:12:07] <wikibugs>	 (CR) Ladsgroup: [C: +1] Exclude cloud-eqiad prefix from lists trusted networks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/828016 (https://phabricator.wikimedia.org/T265864) (owner: Ayounsi)
[09:14:07] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (Ladsgroup) \o/ Welcome Peter!
[09:18:46] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:21:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:22:09] <wikibugs>	 (PS1) Muehlenhoff: acme-chief: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/829122
[09:22:45] <wikibugs>	 (CR) CI reject: [V: -1] acme-chief: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/829122 (owner: Muehlenhoff)
[09:23:59] <wikibugs>	 (CR) Vgutierrez: [C: -1] "looking good, just fix the syntax error :)" [puppet] - https://gerrit.wikimedia.org/r/829122 (owner: Muehlenhoff)
[09:25:49] <wikibugs>	 (PS2) Muehlenhoff: acme-chief: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/829122
[09:25:51] <wikibugs>	 (CR) Muehlenhoff: acme-chief: Switch to systemd::sysuser (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829122 (owner: Muehlenhoff)
[09:26:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:26:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[09:26:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[09:26:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[09:26:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[09:27:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T314041)', diff saved to https://phabricator.wikimedia.org/P33743 and previous config saved to /var/cache/conftool/dbconfig/20220902-092704-ladsgroup.json
[09:27:09] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[09:30:44] <wikibugs>	 (CR) Vgutierrez: [C: +1] acme-chief: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/829122 (owner: Muehlenhoff)
[09:33:48] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:34:51] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (Volans) The decom cookbook is meant to be idempotent, so you can safely re-run it. That said I can look next week on the logs of the prev...
[09:35:44] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:46:46] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Discovery-Search (Current work): Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (Jelto) p:Triage→Medium a:Jelto
[09:47:12] <wikibugs>	 (PS1) Muehlenhoff: Update various comments [puppet] - https://gerrit.wikimedia.org/r/829146 (https://phabricator.wikimedia.org/T267673)
[09:47:58] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:49:00] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/829038 (https://phabricator.wikimedia.org/T302145) (owner: ArielGlenn)
[09:50:10] <icinga-wm>	 PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:50:18] <wikibugs>	 (PS2) ArielGlenn: Add Hannah Okwelum to icinga read access, remove Holger Knust [puppet] - https://gerrit.wikimedia.org/r/829038 (https://phabricator.wikimedia.org/T302145)
[09:51:40] <wikibugs>	 (CR) ArielGlenn: [C: +2] Add Hannah Okwelum to icinga read access, remove Holger Knust [puppet] - https://gerrit.wikimedia.org/r/829038 (https://phabricator.wikimedia.org/T302145) (owner: ArielGlenn)
[09:53:20] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:58:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:00:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:00:52] <wikibugs>	 (PS1) Jelto: admin: add production access for pfischer [puppet] - https://gerrit.wikimedia.org/r/829148 (https://phabricator.wikimedia.org/T316090)
[10:01:42] <wikibugs>	 (CR) Gehel: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/829148 (https://phabricator.wikimedia.org/T316090) (owner: Jelto)
[10:06:55] <wikibugs>	 (PS1) Jelto: admin: add pfischer to search*, analytics and deployment group [puppet] - https://gerrit.wikimedia.org/r/829150 (https://phabricator.wikimedia.org/T316090)
[10:07:26] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[10:08:15] <jayme>	 !log depooled kubemaster1002 for tests 
[10:08:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:34] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[10:10:44] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[10:12:30] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Discovery-Search (Current work), Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (Jelto) Welcome @pfischer! Thanks for the request and all the approvals.  We are missing one last approval from @thcipri...
[10:14:33] <wikibugs>	 (PS1) Muehlenhoff: Remove access for mmandere [puppet] - https://gerrit.wikimedia.org/r/829155
[10:20:10] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:21:33] <wikibugs>	 (CR) Peter Fischer: [C: +1] "checked public key" [puppet] - https://gerrit.wikimedia.org/r/829148 (https://phabricator.wikimedia.org/T316090) (owner: Jelto)
[10:31:12] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:31:38] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:31:48] <wikibugs>	 (CR) Klausman: [C: +2] ml-services: update outlink image to support pre-transformed data [deployment-charts] - https://gerrit.wikimedia.org/r/829015 (https://phabricator.wikimedia.org/T315998) (owner: AikoChou)
[10:33:04] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[10:33:24] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:33:31] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Discovery-Search (Current work), Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (Jelto) SSH key verified by Meet session and Gerrit +1 in https://gerrit.wikimedia.org/r/829148
[10:33:36] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for mmandere [puppet] - https://gerrit.wikimedia.org/r/829155 (owner: Muehlenhoff)
[10:34:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging MMandere out of all services on: 779 hosts
[10:34:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging MMandere out of all services on: 779 hosts
[10:35:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging MMandere out of all services on: 1235 hosts
[10:35:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging MMandere out of all services on: 1235 hosts
[10:35:45] <wikibugs>	 (CR) DCausse: "nice!" [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: Hashar)
[10:35:49] <wikibugs>	 (Merged) jenkins-bot: ml-services: update outlink image to support pre-transformed data [deployment-charts] - https://gerrit.wikimedia.org/r/829015 (https://phabricator.wikimedia.org/T315998) (owner: AikoChou)
[10:38:30] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:42:56] <wikibugs>	 (PS2) Hnowlan: Fix environment in prep stage [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/828503 (https://phabricator.wikimedia.org/T312104)
[10:43:02] <wikibugs>	 (CR) Hnowlan: Fix environment in prep stage (2 comments) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/828503 (https://phabricator.wikimedia.org/T312104) (owner: Hnowlan)
[10:44:26] <wikibugs>	 (PS1) Vgutierrez: mtail::atsbackend: Add SLI counters [puppet] - https://gerrit.wikimedia.org/r/829163 (https://phabricator.wikimedia.org/T316921)
[10:51:26] <icinga-wm>	 RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:58:06] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:58:24] <wikibugs>	 (PS1) Muehlenhoff: Remove Marc from Icinga permissions [puppet] - https://gerrit.wikimedia.org/r/829165
[11:03:50] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (phaultfinder)
[11:05:20] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:26:46] <wikibugs>	 (PS2) Vgutierrez: mtail::atsbackend: Add SLI counters [puppet] - https://gerrit.wikimedia.org/r/829163 (https://phabricator.wikimedia.org/T316921)
[11:27:06] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove Marc from Icinga permissions [puppet] - https://gerrit.wikimedia.org/r/829165 (owner: Muehlenhoff)
[11:27:34] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[11:31:32] <wikibugs>	 (PS1) Muehlenhoff: Also remove Marc's lowercase username variant from Icinga permissions [puppet] - https://gerrit.wikimedia.org/r/829169
[11:42:19] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Also remove Marc's lowercase username variant from Icinga permissions [puppet] - https://gerrit.wikimedia.org/r/829169 (owner: Muehlenhoff)
[11:43:34] <wikibugs>	 (PS6) Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595)
[11:44:53] <wikibugs>	 (CR) Slyngshede: [C: +1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/829146 (https://phabricator.wikimedia.org/T267673) (owner: Muehlenhoff)
[11:45:28] <wikibugs>	 (CR) Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. (4 comments) [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: Slyngshede)
[11:48:20] <wikibugs>	 (PS1) Muehlenhoff: Remove Marc from sms contact group [puppet] - https://gerrit.wikimedia.org/r/829173
[11:51:15] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove Marc from sms contact group [puppet] - https://gerrit.wikimedia.org/r/829173 (owner: Muehlenhoff)
[12:00:34] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:03:52] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:21:14] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui)
[12:41:06] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) db1132 has now the fix installed. I will leave it replicating during the weekend and start pooling it back next week.
[12:47:25] <wikibugs>	 (CR) DCausse: "some unit tests would be nice :)" [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814807 (owner: Hashar)
[12:56:19] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[13:00:40] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[13:01:00] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (Jclark-ctr)
[13:04:57] <wikibugs>	 (CR) Vgutierrez: "As mentioned in T316932:" [puppet] - https://gerrit.wikimedia.org/r/769827 (owner: Legoktm)
[13:07:06] <wikibugs>	 (CR) DCausse: "lgtm," [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/816115 (owner: Hashar)
[13:09:37] <wikibugs>	 (CR) DCausse: [C: +1] build: manage dependencies with rules_jvm_external [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/816172 (owner: Hashar)
[13:10:12] <jayme>	 !log redepooled kubemaster1002 
[13:10:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:55] <vgutierrez>	 "redepooling".. that's tricky :)
[13:15:07] <jayme>	 hrhr
[13:15:14] <jayme>	 !log repooled kubemaster1002 
[13:15:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:22] <jayme>	 thanks for pointing out :)
[13:15:29] <vgutierrez>	 oh you didn't depool it again actually
[13:15:53] <jayme>	 nope :)
[13:31:10] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1196']
[13:31:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (Jclark-ctr) kafka-stretch1001  E3   U17   Port 17    cableid  20220230 kafka-stretch1002  F3   U17   Port 17    cableid  20220229
[13:31:43] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1196']
[13:31:49] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (Jclark-ctr)
[13:32:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (Jclark-ctr) a:Jclark-ctr→Cmjohnson
[13:48:30] <wikibugs>	 (PS1) Papaul: when using sorted i get the error msg:TypeError: '<' not supported between instances of 'DellDriver' and 'DellDriver' changed this back to list to get some install for now [cookbooks] - https://gerrit.wikimedia.org/r/829193
[13:51:10] <wikibugs>	 (PS1) JMeybohm: Cache helm list results for one minute [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/829194
[13:52:32] <wikibugs>	 (PS2) JMeybohm: Cache helm list results for one minute [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/829194
[13:53:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data Engineering Planning, Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (Jclark-ctr) druid1009 A5 U06   Port25   CableID  230000145 druid1010 B5 U13   Port 4  CableID 2988  druid1011 D6 U37   Port 37  Cabl...
[13:54:08] <wikibugs>	 (CR) Vgutierrez: [C: +2] mtail::atsbackend: Add SLI counters [puppet] - https://gerrit.wikimedia.org/r/829163 (https://phabricator.wikimedia.org/T316921) (owner: Vgutierrez)
[13:55:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data Engineering Planning, Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (Jclark-ctr)
[13:55:32] <wikibugs>	 (CR) Papaul: [C: +2] when using sorted i get the error msg:TypeError: '<' not supported between instances of 'DellDriver' and 'DellDriver' changed this back to l [cookbooks] - https://gerrit.wikimedia.org/r/829193 (owner: Papaul)
[13:57:09] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1196']
[13:57:31] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1196']
[13:57:43] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] Cache helm list results for one minute [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/829194 (owner: JMeybohm)
[13:59:02] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] Cache helm list results for one minute [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/829194 (owner: JMeybohm)
[14:01:29] <wikibugs>	 (PS1) JMeybohm: helm-state-metrics: Update to v0.1.2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/829197
[14:01:49] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1196']
[14:02:33] <wikibugs>	 (PS1) JMeybohm: helm-state-metrics: Update to v0.1.2 [deployment-charts] - https://gerrit.wikimedia.org/r/829198
[14:05:54] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1197']
[14:07:17] <wikibugs>	 (CR) Hashar: "I have to:" [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: Hashar)
[14:07:33] <wikibugs>	 (PS3) Hashar: Json schema from Gerrit Java event classes [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947)
[14:08:03] <icinga-wm>	 PROBLEM - SSH on stat1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:11:41] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudservices1003
[14:14:53] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:15:46] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[14:16:37] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['db1197']
[14:18:26] <wikibugs>	 (PS1) Muehlenhoff: puppet_compiler: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829200 (https://phabricator.wikimedia.org/T308013)
[14:18:28] <wikibugs>	 (PS1) Muehlenhoff: keyholder: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829201 (https://phabricator.wikimedia.org/T308013)
[14:18:30] <wikibugs>	 (PS1) Muehlenhoff: releases: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829202 (https://phabricator.wikimedia.org/T308013)
[14:18:32] <wikibugs>	 (PS1) Muehlenhoff: udp2log: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829203 (https://phabricator.wikimedia.org/T308013)
[14:18:34] <wikibugs>	 (PS1) Muehlenhoff: sslcert: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829204 (https://phabricator.wikimedia.org/T308013)
[14:18:36] <wikibugs>	 (PS1) Muehlenhoff: k8s: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829205 (https://phabricator.wikimedia.org/T308013)
[14:18:36] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:18:36] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudservices1003
[14:18:42] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `cloudservices1003` - cloudservices1003 (**F...
[14:19:36] <wikibugs>	 (CR) CI reject: [V: -1] keyholder: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829201 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[14:20:03] <wikibugs>	 (CR) CI reject: [V: -1] releases: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829202 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[14:20:33] <wikibugs>	 (CR) CI reject: [V: -1] udp2log: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829203 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[14:21:44] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['db1196']
[14:22:33] <wikibugs>	 (CR) CI reject: [V: -1] k8s: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829205 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[14:22:52] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] C:ipmi::monitor: Order service after package install [puppet] - https://gerrit.wikimedia.org/r/828494 (owner: Clément Goubert)
[14:23:38] <wikibugs>	 (CR) CI reject: [V: -1] sslcert: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829204 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[14:31:28] <wikibugs>	 (PS2) Muehlenhoff: keyholder: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829201 (https://phabricator.wikimedia.org/T308013)
[14:32:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1196']
[14:33:46] <wikibugs>	 SRE: Upgrade ferm to 2.5.1 - https://phabricator.wikimedia.org/T248954 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→None
[14:33:53] <wikibugs>	 SRE: Upgrade ferm to 2.5.1 - https://phabricator.wikimedia.org/T248954 (MoritzMuehlenhoff) p:Triage→Medium
[14:34:07] <icinga-wm>	 RECOVERY - Disk space on ms-be2037 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2037&var-datasource=codfw+prometheus/ops
[14:34:22] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (Dzahn) dmarc-rua@ is an alias for dmarc@donate.wikimedia.org  donate.wikimedia.org mail is routed to the fundraising-tech's CiviCRM system.  While it's po...
[14:34:35] <wikibugs>	 (PS1) Giuseppe Lavagetto: Rebuild php 7.4 images with newer php versions [docker-images/production-images] - https://gerrit.wikimedia.org/r/829206
[14:34:45] <wikibugs>	 SRE, SRE-Access-Requests, Fundraising Tech - Chaos Crew, Infrastructure-Foundations, Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (Dzahn)
[14:37:52] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] Rebuild php 7.4 images with newer php versions [docker-images/production-images] - https://gerrit.wikimedia.org/r/829206 (owner: Giuseppe Lavagetto)
[14:38:13] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1196']
[14:39:06] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1196']
[14:41:45] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1197']
[14:42:21] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudservices1003.wikimedia.org
[14:43:08] <wikibugs>	 (PS2) Muehlenhoff: releases: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829202 (https://phabricator.wikimedia.org/T308013)
[14:43:12] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (Andrew) It looks like the decom script can't find it in puppetdb even though the alert says it is in puppetdb.
[14:46:13] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[14:47:14] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1196']
[14:47:27] <wikibugs>	 (PS2) Muehlenhoff: udp2log: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829203 (https://phabricator.wikimedia.org/T308013)
[14:47:32] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1197']
[14:48:41] <wikibugs>	 (PS1) CDanis: vcl: Temporarily restore past Badtitle behavior [puppet] - https://gerrit.wikimedia.org/r/829207 (https://phabricator.wikimedia.org/T316932)
[14:49:06] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:49:06] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudservices1003.wikimedia.org
[14:49:12] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `cloudservices1003.wikimedia.org` - cloudser...
[14:49:29] <icinga-wm>	 RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:49:40] <wikibugs>	 (PS1) Vgutierrez: mtail::atsbacken: Handle cache_read|write_time=-1 [puppet] - https://gerrit.wikimedia.org/r/829208 (https://phabricator.wikimedia.org/T316938)
[14:51:54] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudservices1003.wikimedia.org
[14:52:05] <wikibugs>	 (PS2) Vgutierrez: mtail::atsbackend: Handle cache_read|write_time=-1 [puppet] - https://gerrit.wikimedia.org/r/829208 (https://phabricator.wikimedia.org/T316938)
[14:53:46] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1197']
[14:55:05] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: swift-object-replicator.service,swift-object.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:55:46] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[14:57:06] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1198']
[14:57:33] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:58:44] <wikibugs>	 (PS2) Muehlenhoff: k8s: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829205 (https://phabricator.wikimedia.org/T308013)
[14:58:45] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:58:46] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudservices1003.wikimedia.org
[14:58:51] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `cloudservices1003.wikimedia.org` - cloudser...
[14:59:08] <wikibugs>	 (CR) Dzahn: "Adding Arnold. This is about what I had in mind but if the mail gets sent only to infra-foundations he won't be able to debug it. Maybe if" [puppet] - https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: Slyngshede)
[14:59:17] <wikibugs>	 (CR) Vgutierrez: [C: +2] mtail::atsbackend: Handle cache_read|write_time=-1 [puppet] - https://gerrit.wikimedia.org/r/829208 (https://phabricator.wikimedia.org/T316938) (owner: Vgutierrez)
[15:00:02] <wikibugs>	 (CR) Dzahn: "and like Moritz said, we do not expect this is caused by switching to systemd timer. We expect that this merely gave us the new monitoring" [puppet] - https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: Slyngshede)
[15:00:36] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data Engineering Planning, Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (Jclark-ctr) druid1009  E2  U38  Port43  Cableid 23000023 druid1009  E3  U38  Port43  Cableid 23000054 druid1009  F2  U38  Port43  Ca...
[15:00:55] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1196.eqiad.wmnet with OS bullseye
[15:01:00] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data Engineering Planning, Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (Jclark-ctr) a:Jclark-ctr→Cmjohnson
[15:01:04] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1196.eqiad.wmnet with OS bullseye
[15:01:51] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1197']
[15:02:38] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1198']
[15:03:02] <wikibugs>	 (PS2) Muehlenhoff: sslcert: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829204 (https://phabricator.wikimedia.org/T308013)
[15:03:28] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1198']
[15:03:30] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:20] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1198']
[15:04:59] <jayme>	 !log depooled kubemaster1001
[15:05:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:51] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (Andrew) a:Cmjohnson→Volans I am now officially out of ideas :)  Over to you, @volans!
[15:06:55] <wikibugs>	 (CR) JMeybohm: [C: +1] "Thanks" [puppet] - https://gerrit.wikimedia.org/r/829205 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[15:09:06] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1198']
[15:09:29] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1199']
[15:09:33] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1198']
[15:09:53] <wikibugs>	 SRE, Traffic, Patch-For-Review: atsbackend.mtail doesn't track requests with cache read|write time set to -1 properly - https://phabricator.wikimedia.org/T316938 (Vgutierrez) In progress→Resolved after merging https://gerrit.wikimedia.org/r/829208 sli_total|sli_good counters seem sane: `vguti...
[15:13:00] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1196.eqiad.wmnet with reason: host reimage
[15:14:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1199']
[15:15:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1199']
[15:16:51] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1196.eqiad.wmnet with reason: host reimage
[15:18:50] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1197.eqiad.wmnet with OS bullseye
[15:18:55] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1197.eqiad.wmnet with OS bullseye
[15:19:01] <wikibugs>	 (Abandoned) Jdrewniak: Add wgWMEWebUIScrollTrackingSamplingRate config to beta [mediawiki-config] - https://gerrit.wikimedia.org/r/784690 (https://phabricator.wikimedia.org/T303297) (owner: Jdrewniak)
[15:19:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1198']
[15:23:56] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1199']
[15:27:27] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1201']
[15:27:34] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[15:28:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1198']
[15:30:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1197.eqiad.wmnet with reason: host reimage
[15:31:45] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) @Jclark-ctr for more information: the MMF/MTP fibers ordered in https://phabricator.wikimedia.org/T313464 we want   1 fiber from rack c2 to rack a1 1 fi...
[15:31:59] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1196.eqiad.wmnet with OS bullseye
[15:32:05] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1196.eqiad.wmnet with OS bullseye completed: - db1...
[15:32:20] <wikibugs>	 (PS1) Vgutierrez: Add Trafficserver SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921)
[15:32:45] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1200']
[15:32:49] <wikibugs>	 (CR) Dzahn: [C: +1] releases: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829202 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[15:33:07] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1201']
[15:33:58] <wikibugs>	 (PS2) Vgutierrez: Add Trafficserver SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921)
[15:34:01] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1198.eqiad.wmnet with OS bullseye
[15:34:08] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1198.eqiad.wmnet with OS bullseye
[15:34:35] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1197.eqiad.wmnet with reason: host reimage
[15:34:54] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (Papaul)
[15:35:37] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1201']
[15:35:59] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1201']
[15:37:27] <jayme>	 !log repooled kubemaster1001
[15:37:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:52] <jayme>	 !log depool kubemaster2002
[15:39:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:12] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1200']
[15:42:19] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1200']
[15:45:59] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1198.eqiad.wmnet with reason: host reimage
[15:47:59] <wikibugs>	 (PS1) Hnowlan: api-gateway: Distinguish between internal host and host header setting [deployment-charts] - https://gerrit.wikimedia.org/r/829216
[15:48:44] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[15:49:52] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1197.eqiad.wmnet with OS bullseye
[15:49:54] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1198.eqiad.wmnet with reason: host reimage
[15:49:58] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1197.eqiad.wmnet with OS bullseye completed: - db1...
[15:50:25] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:50:25] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1200']
[15:51:55] <wikibugs>	 (CR) Vgutierrez: [C: +1] vcl: Temporarily restore past Badtitle behavior (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829207 (https://phabricator.wikimedia.org/T316932) (owner: CDanis)
[15:52:37] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:53:07] <wikibugs>	 (PS1) Giuseppe Lavagetto: api_appserver: convert all canaries to php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/829217 (https://phabricator.wikimedia.org/T271736)
[15:54:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:55:18] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37100/console" [puppet] - https://gerrit.wikimedia.org/r/829217 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[15:56:18] <wikibugs>	 (PS1) David Caro: bullseye0: Improve the install-packages script [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829218 (https://phabricator.wikimedia.org/T316854)
[15:57:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1199.eqiad.wmnet with OS bullseye
[15:57:24] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1199.eqiad.wmnet with OS bullseye
[15:57:25] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:47] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1201']
[15:57:57] <jayme>	 !log repool kubemaster2002
[15:58:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:58:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:58:40] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (MatthewVernon) Hi @Jclark-ctr sorry to be a bother, but could you check another drive in this system, please? `/dev/sdy` has started being unhappy since 31 August at about the tim...
[15:58:50] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (MatthewVernon) Resolved→Open
[16:01:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1202']
[16:02:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[16:03:49] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1198.eqiad.wmnet with OS bullseye
[16:03:56] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1198.eqiad.wmnet with OS bullseye completed: - db1...
[16:05:47] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1201']
[16:07:17] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1200.eqiad.wmnet with OS bullseye
[16:07:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:07:22] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1200.eqiad.wmnet with OS bullseye
[16:08:14] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:09:04] <wikibugs>	 (CR) CDanis: [C: +2] vcl: Temporarily restore past Badtitle behavior [puppet] - https://gerrit.wikimedia.org/r/829207 (https://phabricator.wikimedia.org/T316932) (owner: CDanis)
[16:09:18] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1199.eqiad.wmnet with reason: host reimage
[16:10:37] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1202']
[16:10:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1203']
[16:11:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[16:11:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[16:11:19] <wikibugs>	 (PS2) CDanis: vcl: Temporarily restore past Badtitle behavior [puppet] - https://gerrit.wikimedia.org/r/829207 (https://phabricator.wikimedia.org/T316932)
[16:11:29] <wikibugs>	 (CR) CDanis: [C: +2] vcl: Temporarily restore past Badtitle behavior (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829207 (https://phabricator.wikimedia.org/T316932) (owner: CDanis)
[16:13:13] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1199.eqiad.wmnet with reason: host reimage
[16:15:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[16:15:45] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1202']
[16:17:50] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1203']
[16:19:15] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1200.eqiad.wmnet with reason: host reimage
[16:23:15] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1200.eqiad.wmnet with reason: host reimage
[16:23:50] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1202']
[16:26:51] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1199.eqiad.wmnet with OS bullseye
[16:26:57] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1199.eqiad.wmnet with OS bullseye completed: - db1...
[16:27:09] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:30:26] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1203']
[16:31:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1201.eqiad.wmnet with OS bullseye
[16:31:48] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1201.eqiad.wmnet with OS bullseye
[16:37:59] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:38:35] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1203']
[16:39:05] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1200.eqiad.wmnet with OS bullseye
[16:39:10] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1200.eqiad.wmnet with OS bullseye completed: - db1...
[16:40:56] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1202.eqiad.wmnet with OS bullseye
[16:41:02] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1202.eqiad.wmnet with OS bullseye
[16:42:32] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (Papaul)
[16:43:58] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1201.eqiad.wmnet with reason: host reimage
[16:47:45] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1201.eqiad.wmnet with reason: host reimage
[16:52:50] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1202.eqiad.wmnet with reason: host reimage
[16:56:17] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1202.eqiad.wmnet with reason: host reimage
[17:00:08] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1201.eqiad.wmnet with OS bullseye
[17:00:14] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1201.eqiad.wmnet with OS bullseye completed: - db1...
[17:01:51] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:07:57] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:11:04] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1202.eqiad.wmnet with OS bullseye
[17:11:12] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1202.eqiad.wmnet with OS bullseye completed: - db1...
[17:12:45] <wikibugs>	 (CR) JMeybohm: "This change is ready for review." (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: JMeybohm)
[17:13:55] <wikibugs>	 (CR) Subramanya Sastry: [C: +1] "I am okay with it if it is okay with Daniel. :)" [puppet] - https://gerrit.wikimedia.org/r/829119 (owner: Muehlenhoff)
[17:14:05] <wikibugs>	 (PS7) JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663)
[17:28:20] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (wiki_willy) Hi @Vgutierrez - yeah, probably makes more sense to replace than purchase a replacement part, since the new servers have already been ordered and are expected to arrive in Oct...
[17:28:44] <wikibugs>	 (PS1) Andrew Bogott: sre-sandbox: remove automatic VM purge logic [puppet] - https://gerrit.wikimedia.org/r/829231 (https://phabricator.wikimedia.org/T247517)
[17:29:24] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1203.eqiad.wmnet with OS bullseye
[17:29:32] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1203.eqiad.wmnet with OS bullseye
[17:41:11] <wikibugs>	 SRE, Cloud-VPS (Project-requests), Patch-For-Review, cloud-services-team (Kanban): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (Andrew) @jbond, after the recent unpleasantness with @herron having VMs deleted by surprise I'm revisiting the practices in...
[17:41:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1203.eqiad.wmnet with reason: host reimage
[17:44:28] <wikibugs>	 (PS1) Milimetric: aqs: update mw history snapshot [puppet] - https://gerrit.wikimedia.org/r/829233
[17:45:03] <wikibugs>	 (CR) Milimetric: [C: -1] "this has to wait for the druid load to finish, just putting it out before I forget." [puppet] - https://gerrit.wikimedia.org/r/829233 (owner: Milimetric)
[17:45:08] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1203.eqiad.wmnet with reason: host reimage
[17:46:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:51:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:51:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:51:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:51:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:52:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:58:20] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1203.eqiad.wmnet with OS bullseye
[17:58:25] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1203.eqiad.wmnet with OS bullseye completed: - db1...
[18:09:37] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (Papaul)
[18:10:41] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (Papaul) Open→Resolved @Marostegui @Jclark-ctr this is complete
[18:16:47] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:24:40] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Move profile::wmcs::backup_glance_images from cloudcontrols to backup servers [puppet] - https://gerrit.wikimedia.org/r/828623 (https://phabricator.wikimedia.org/T316738) (owner: Andrew Bogott)
[18:30:39] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (RobH) >>! In T314256#8207835, @Vgutierrez wrote: > @wiki_willy @RobH I'm assuming this host will be decommissioned rather than fixed considering that we are already working in refreshing...
[18:39:31] <dancy>	 I'm going to do some mw-on-k8s image build/deploy testing
[18:39:36] <wikibugs>	 (CR) Dzahn: [C: +1] "yes, looks good to me" [puppet] - https://gerrit.wikimedia.org/r/829119 (owner: Muehlenhoff)
[18:39:50] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/37101/testreduce1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/829119 (owner: Muehlenhoff)
[18:40:02] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing T299648
[18:40:08] <stashbot>	 T299648: Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648
[18:41:12] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "only affects testreduce1001 - https://puppet-compiler.wmflabs.org/pcc-worker1001/37102/" [puppet] - https://gerrit.wikimedia.org/r/829119 (owner: Muehlenhoff)
[18:42:28] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "File[/etc/sysusers.d/testreduce.conf]/ensure: defined content .. /Exec[Refresh sysusers]: Triggered 'refresh' from 1 event ..and that's al" [puppet] - https://gerrit.wikimedia.org/r/829119 (owner: Muehlenhoff)
[18:43:20] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "[testreduce1001:~] $ cat /etc/sysusers.d/testreduce.conf" [puppet] - https://gerrit.wikimedia.org/r/829119 (owner: Muehlenhoff)
[18:46:12] <wikibugs>	 (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37103/webperf1003.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/829121 (owner: Muehlenhoff)
[18:47:30] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:48:10] <wikibugs>	 (CR) Dzahn: [V: +1 C: +1] "just merged a similar change for testreduce at I8304306b989642 and it was just fine. it will create the new config file and refresh sysuse" [puppet] - https://gerrit.wikimedia.org/r/829121 (owner: Muehlenhoff)
[18:49:38] <wikibugs>	 (CR) Dzahn: [C: +1] acme-chief: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/829122 (owner: Muehlenhoff)
[18:51:38] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:51:56] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:55:51] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:56:07] <logmsgbot>	 !log dancy@deploy1002 dancy: testing T299648 synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[18:56:11] <stashbot>	 T299648: Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648
[18:58:04] <logmsgbot>	 !log dancy@deploy1002 Sync cancelled.
[19:00:09] <dancy>	 Test completed
[19:01:54] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "testing if it's gone everywhere with: cumin -b 20 -s 2 -x C:puppet::agent 'file /etc/cron.d/puppet'" [puppet] - https://gerrit.wikimedia.org/r/826274 (owner: Muehlenhoff)
[19:02:31] <wikibugs>	 SRE, SRE-swift-storage, Data Engineering Planning, Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (bking) I cleaned out the `flink_ha_storage` pseudofolder from the `rdf-streaming-updater-codfw` bucket as r...
[19:03:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:03:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:03:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:03:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:11:42] <wikibugs>	 (PS2) Dzahn: Update various comments [puppet] - https://gerrit.wikimedia.org/r/829146 (https://phabricator.wikimedia.org/T273673) (owner: Muehlenhoff)
[19:12:05] <wikibugs>	 (CR) Dzahn: [C: +1] "was accidentally linked to an unrelated ticket. fixing bug: link" [puppet] - https://gerrit.wikimedia.org/r/829146 (https://phabricator.wikimedia.org/T273673) (owner: Muehlenhoff)
[19:13:00] <wikibugs>	 (CR) Dzahn: [C: +2] "comments-only and all look good to me" [puppet] - https://gerrit.wikimedia.org/r/829146 (https://phabricator.wikimedia.org/T273673) (owner: Muehlenhoff)
[19:14:25] <wikibugs>	 SRE, SRE-Access-Requests, Fundraising Tech - Chaos Crew, Infrastructure-Foundations, Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (Jgreen) >>! In T316899#8208607, @Dzahn wrote: > dmarc-rua@ is an alias for dmarc@donate.wikimedia.org >  > donate.wikim...
[19:19:28] <wikibugs>	 SRE, SRE-Access-Requests, Fundraising Tech - Chaos Crew, Infrastructure-Foundations, Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (Dzahn) @Jgreen Alright, gotcha!  In that case can we move the alias into the section that says managed by fr-tech and y...
[19:22:27] <wikibugs>	 SRE, SRE-Access-Requests, Fundraising Tech - Chaos Crew, Infrastructure-Foundations, Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (Jgreen) >>! In T316899#8209163, @Dzahn wrote: > @Jgreen Alright, gotcha!  In that case can we move the alias into the s...
[19:24:13] <wikibugs>	 SRE, SRE-Access-Requests, Fundraising Tech - Chaos Crew, Infrastructure-Foundations, Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (Dzahn) @Jgreen I am not sure but let's ask for input from Infra Foundations.
[19:27:34] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[19:33:59] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:38:55] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (phaultfinder)
[20:01:27] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:09:39] <wikibugs>	 (PS3) Dzahn: rancid: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/826867 (owner: Muehlenhoff)
[20:11:19] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:21:47] <wikibugs>	 SRE, SRE-Access-Requests, Fundraising Tech - Chaos Crew, Infrastructure-Foundations, Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (Jgreen) T86209 is relevant, we do appear to still be sending dmarc-ruf@ to dmarcian, although I don't know who has acce...
[20:35:05] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:41:06] <wikibugs>	 (CR) Dzahn: [C: +2] rancid: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/826867 (owner: Muehlenhoff)
[20:42:57] <wikibugs>	 (CR) Dzahn: [C: +2] "on netmon1002, netmon1003, netmon2001 /etc/sysusers.d/rancid.conf was created and sysusers was refreshed and nothing else happened because" [puppet] - https://gerrit.wikimedia.org/r/826867 (owner: Muehlenhoff)
[20:48:04] <wikibugs>	 SRE, SRE-Access-Requests, Fundraising Tech - Chaos Crew, Infrastructure-Foundations, Mail: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (Jgreen) I was able to log into the "wikimedia" dmarcian account and determine that the subscription expired, so I sent...
[20:49:53] <wikibugs>	 SRE, SRE-Access-Requests, Fundraising Tech - Chaos Crew, Infrastructure-Foundations, and 2 others: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (Jgreen) p:Triage→Medium
[20:53:47] <wikibugs>	 (PS1) Dzahn: phabricator: ensure only the one active_server connects to rw mysql [puppet] - https://gerrit.wikimedia.org/r/829244 (https://phabricator.wikimedia.org/T315713)
[20:56:16] <wikibugs>	 (CR) Volans: Remove obsolete absented cron file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826274 (owner: Muehlenhoff)
[20:57:01] <wikibugs>	 (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37104/" [puppet] - https://gerrit.wikimedia.org/r/829244 (https://phabricator.wikimedia.org/T315713) (owner: Dzahn)
[21:01:04] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "Ok, I'll revert if that's a problem. I am not entirely sure what the call to action is though." [puppet] - https://gerrit.wikimedia.org/r/826274 (owner: Muehlenhoff)
[21:06:50] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "well.. if the file doesn't exist anywhere I fail to see what the problem is that we need to fix" [puppet] - https://gerrit.wikimedia.org/r/826274 (owner: Muehlenhoff)
[21:07:10] <wikibugs>	 (PS1) Dzahn: Revert "Remove obsolete absented cron file" [puppet] - https://gerrit.wikimedia.org/r/829143
[21:07:21] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:08:59] <wikibugs>	 (CR) Dzahn: [C: +2] Revert "Remove obsolete absented cron file" [puppet] - https://gerrit.wikimedia.org/r/829143 (owner: Dzahn)
[21:10:45] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "reverted" [puppet] - https://gerrit.wikimedia.org/r/826274 (owner: Muehlenhoff)
[21:15:01] <wikibugs>	 (PS3) Dduvall: phabricator: Deploy user should own everything under old rev directories [puppet] - https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953)
[21:16:34] <wikibugs>	 (CR) Dduvall: phabricator: Deploy user should own everything under old rev directories (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) (owner: Dduvall)
[21:18:25] <wikibugs>	 (CR) Dzahn: phabricator: Deploy user should own everything under old rev directories (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) (owner: Dduvall)
[21:19:46] <wikibugs>	 (CR) Dzahn: phabricator: Deploy user should own everything under old rev directories (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) (owner: Dduvall)
[21:20:44] <wikibugs>	 (PS4) Dduvall: phabricator: Deploy user should own everything under old rev directories [puppet] - https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953)
[21:21:08] <wikibugs>	 (PS5) Dduvall: phabricator: Deploy user should own everything under old rev directories [puppet] - https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953)
[21:21:17] <wikibugs>	 (CR) Dzahn: "there is a rule that shell scripts are not supposed to be created from .sh.erb files anymore because then CI can't check the .sh files. Th" [puppet] - https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) (owner: Dduvall)
[21:22:45] <wikibugs>	 (CR) Dduvall: phabricator: Deploy user should own everything under old rev directories (2 comments) [puppet] - https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) (owner: Dduvall)
[21:40:31] <wikibugs>	 (PS6) Dduvall: phabricator: Deploy user should own everything under old rev directories [puppet] - https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953)
[21:42:15] <wikibugs>	 (PS7) Dduvall: phabricator: Deploy user should own everything under old rev directories [puppet] - https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953)
[21:43:56] <wikibugs>	 (CR) Dduvall: phabricator: Deploy user should own everything under old rev directories (2 comments) [puppet] - https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) (owner: Dduvall)
[21:46:29] <wikibugs>	 (CR) Volans: Remove obsolete absented cron file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826274 (owner: Muehlenhoff)
[22:01:45] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:08:08] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests, Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (Dzahn) fwiw WMF-NDA on Phabricator and LDAP groups are entirely unrelated things and handled by different people.
[22:11:01] <wikibugs>	 (PS1) Dzahn: Revert "Revert "Remove obsolete absented cron file"" [puppet] - https://gerrit.wikimedia.org/r/829144
[22:13:01] <wikibugs>	 (CR) Dzahn: [C: +2] Revert "Revert "Remove obsolete absented cron file"" [puppet] - https://gerrit.wikimedia.org/r/829144 (owner: Dzahn)
[22:14:06] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] Remove obsolete absented cron file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826274 (owner: Muehlenhoff)
[22:19:22] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "cool, thank you. lgtm and compiles" [puppet] - https://gerrit.wikimedia.org/r/829063 (https://phabricator.wikimedia.org/T313953) (owner: Dduvall)
[23:07:45] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:26:09] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:26:13] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:27:34] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert