[00:02:13] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:44:23] <wikibugs>	 Puppet, Infrastructure-Foundations: Puppet failure on deploy-1004.devtools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T319681 (dancy)
[00:48:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:53:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[01:36:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:41:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:50:37] <icinga-wm>	 PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:51:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:13:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[02:20:31] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:20:35] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:21:05] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:21:10] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:21:23] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:21:24] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:21:43] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:22:07] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:22:49] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:22:53] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:23:23] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1051 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:23:29] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:23:41] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:23:42] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:24:01] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:24:21] <andrewbogott>	 I'm on top of ^ for the moment
[02:24:25] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:51:49] <icinga-wm>	 RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:57:12] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[02:59:47] <wikibugs>	 (PS18) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[03:05:36] <wikibugs>	 (PS19) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[03:07:58] <wikibugs>	 (PS20) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[03:29:02] <wikibugs>	 (CR) Raymond Ndibe: "Fixed the issue of how to run tests without root. It required making measure changes to the code, but it's done now. Next is to implement " [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[03:30:03] <icinga-wm>	 RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:36:55] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: search-drop-query-clicks.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:02:50] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (phaultfinder)
[04:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[05:27:57] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 242, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:28:23] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:32:35] <icinga-wm>	 PROBLEM - SSH on db1116.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:36:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:41:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:42:13] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:54:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:59:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:07:23] <wikibugs>	 (CR) Ayounsi: [C: +1] "On one side it's better when the default option do the right thing out of the box, but on the other it's not worth the time investigating " [homer/public] - https://gerrit.wikimedia.org/r/839634 (https://phabricator.wikimedia.org/T304501) (owner: Cathal Mooney)
[06:11:04] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi) This opened {T314998} automatically.  Please sync up with Netops before doing the work as live traffic is using the port.
[06:12:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:13:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[06:32:33] <icinga-wm>	 PROBLEM - SSH on ms-be1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:33:43] <icinga-wm>	 RECOVERY - SSH on db1116.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:57:12] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221007T0700)
[07:03:01] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:06:49] <wikibugs>	 (CR) Elukey: Add a spark-operator production image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[07:11:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1026.eqiad.wmnet with OS bullseye
[07:11:10] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1026.eqiad.wmnet with OS bullseye
[07:14:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1014.eqiad.wmnet with reason: Downtime for removal from Ganeti cluster and eventual bullseye reimage
[07:14:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1014.eqiad.wmnet with reason: Downtime for removal from Ganeti cluster and eventual bullseye reimage
[07:15:10] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ayounsi) I went through the useful https://apps.juniper.net/feature-explorer/select-software.html?typ=1&swName=Junos%20OS&rel=21.2R3&sid=1211&platform=MX204&pi...
[07:16:06] <wikibugs>	 (CR) Elukey: [C: +1] "Left a note but LGTM!" [puppet] - https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) (owner: Btullis)
[07:22:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1029.eqiad.wmnet to cluster eqiad and group A
[07:22:44] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1029.eqiad.wmnet to cluster eqiad and group A
[07:23:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1026.eqiad.wmnet with reason: host reimage
[07:26:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1026.eqiad.wmnet with reason: host reimage
[07:30:18] <wikibugs>	 (CR) Elukey: "Left some comments, but LGTM! (tested also the get-kubernetes-release.sh script as well)." [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/820888 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[07:32:30] <icinga-wm>	 RECOVERY - SSH on ms-be1040.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:41:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1026.eqiad.wmnet with OS bullseye
[07:41:14] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1026.eqiad.wmnet with OS bullseye completed: - ganeti1026 (**PASS**)   - Downtimed on...
[07:49:55] <wikibugs>	 Puppet, Infrastructure-Foundations: Puppet failure on deploy-1004.devtools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T319681 (hashar)
[07:49:57] <elukey>	 !log re-initialize docker on dse-k8s-worker100[5-8] - wrong storage type set (devicemapper instead of overlay2)
[07:49:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1014.eqiad.wmnet with OS bullseye
[07:50:21] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1014.eqiad.wmnet with OS bullseye
[07:54:31] <elukey>	 !log re-initialize docker on dse-k8s-worker1004 - wrong storage type set (devicemapper instead of overlay2)
[07:54:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:02:45] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-openstack-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:03:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1014.eqiad.wmnet with reason: host reimage
[08:07:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1014.eqiad.wmnet with reason: host reimage
[08:07:46] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9200 on elastic1094 is CRITICAL: CRITICAL - [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500] does not match [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500, elastic1093.eqiad.wmnet:9500, elastic1098.eqiad.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:07:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:08:36] <gehel>	 ^^ ryankemper: looks like we have settings mismatch. I suspect this is related to the change in eligible masters?
[08:11:41] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[08:11:53] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[08:12:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:14:03] <wikibugs>	 (PS3) JMeybohm: Update to Kubernetes v1.23.12 [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/820888 (https://phabricator.wikimedia.org/T307943)
[08:15:25] <wikibugs>	 (CR) JMeybohm: Update to Kubernetes v1.23.12 (3 comments) [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/820888 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[08:19:03] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[08:19:29] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[08:19:38] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[08:19:59] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[08:20:57] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[08:21:35] <wikibugs>	 (PS1) Vgutierrez: trafficserver: Enable cache partitioning in cp6016 [puppet] - https://gerrit.wikimedia.org/r/840061 (https://phabricator.wikimedia.org/T317748)
[08:21:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:21:56] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[08:22:14] <vgutierrez>	 !log partition ats-be cache in cp6016 - T317748
[08:22:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:18] <stashbot>	 T317748: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748
[08:22:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1014.eqiad.wmnet with OS bullseye
[08:22:30] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9600 on elastic1102 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:22:32] <wikibugs>	 (CR) Vgutierrez: [C: +2] trafficserver: Enable cache partitioning in cp6016 [puppet] - https://gerrit.wikimedia.org/r/840061 (https://phabricator.wikimedia.org/T317748) (owner: Vgutierrez)
[08:22:35] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1014.eqiad.wmnet with OS bullseye completed: - ganeti1014 (**PASS**)   - Downtimed on...
[08:22:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:23:56] <logmsgbot>	 !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudnet1003.eqiad.wmnet with OS bullseye
[08:23:58] <logmsgbot>	 !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudnet1004.eqiad.wmnet with OS bullseye
[08:24:01] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, cloud-services-team (Hardware): decommission cloudnet1003.eqiad.wmnet - https://phabricator.wikimedia.org/T319682 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudnet1003.eqiad.wmnet with OS bullseye execu...
[08:24:04] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, cloud-services-team (Hardware): decommission cloudnet1004.eqiad.wmnet - https://phabricator.wikimedia.org/T319683 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudnet1004.eqiad.wmnet with OS bullseye execu...
[08:24:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:25:02] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9600 on elastic1095 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:25:02] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9200 on elastic1100 is CRITICAL: CRITICAL - [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500] does not match [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500, elastic1093.eqiad.wmnet:9500, elastic1098.eqiad.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:25:03] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9400 on elastic1093 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:25:17] <wikibugs>	 (CR) Muehlenhoff: sre.hosts.reimage: support different installers (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/839627 (https://phabricator.wikimedia.org/T319067) (owner: Volans)
[08:26:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1029.eqiad.wmnet to cluster eqiad and group A
[08:27:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (2) dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:28:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1029.eqiad.wmnet to cluster eqiad and group A
[08:29:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:30:58] <wikibugs>	 (CR) Volans: "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/839627 (https://phabricator.wikimedia.org/T319067) (owner: Volans)
[08:31:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:32:23] <wikibugs>	 (CR) Elukey: Update to Kubernetes v1.23.12 (1 comment) [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/820888 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[08:32:50] <wikibugs>	 (CR) Elukey: [C: +1] Update to Kubernetes v1.23.12 [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/820888 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[08:32:59] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:33:04] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:33:17] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/839627 (https://phabricator.wikimedia.org/T319067) (owner: Volans)
[08:35:34] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[08:35:59] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[08:36:07] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[08:36:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:37:57] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[08:38:10] <wikibugs>	 (CR) Cathal Mooney: Add explicit BFD session mode (single/multi-hop) to Anycast groups (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/839634 (https://phabricator.wikimedia.org/T304501) (owner: Cathal Mooney)
[08:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:39:56] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[08:39:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:41:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:42:06] <wikibugs>	 (03PS2) 10Cathal Mooney: Add explicit BFD session mode (single/multi-hop) to Anycast groups [homer/public] - 10https://gerrit.wikimedia.org/r/839634 (https://phabricator.wikimedia.org/T304501)
[08:43:04] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[08:43:50] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (10karapayneWMDE) a:05karapayneWMDE→03Arnoldokoth Approved! And thanks :)
[08:43:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:44:24] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney)
[08:44:50] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[08:44:57] <wikibugs>	 (03CR) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney)
[08:44:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:45:41] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add explicit BFD session mode (single/multi-hop) to Anycast groups [homer/public] - 10https://gerrit.wikimedia.org/r/839634 (https://phabricator.wikimedia.org/T304501) (owner: 10Cathal Mooney)
[08:46:08] <wikibugs>	 (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Add explicit BFD session mode (single/multi-hop) to Anycast groups [homer/public] - 10https://gerrit.wikimedia.org/r/839634 (https://phabricator.wikimedia.org/T304501) (owner: 10Cathal Mooney)
[08:46:22] <wikibugs>	 (03Merged) 10jenkins-bot: Add explicit BFD session mode (single/multi-hop) to Anycast groups [homer/public] - 10https://gerrit.wikimedia.org/r/839634 (https://phabricator.wikimedia.org/T304501) (owner: 10Cathal Mooney)
[08:48:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:52:24] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[08:53:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:57:49] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (10phaultfinder)
[08:58:19] <wikibugs>	 (03CR) 10Mabualruz: Automate icon generation (036 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson)
[08:59:30] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[09:00:12] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9400 on elastic1098 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:00:35] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10aborrero)
[09:02:19] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10aborrero)
[09:06:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:11:19] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10jbond) p:05Lowest→03Medium
[09:12:14] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10jbond) I changed the priority to medium. The lack of a proper solution for network management causes periodic problems eno...
[09:17:48] <wikibugs>	 (03CR) 10Jbond: "lgtm, optional comment still open" [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff)
[09:17:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff)
[09:18:01] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[09:23:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: IPv6 BFD Sessions Failing from Bird (Anycast VMs) to Juniper QFX in drmrs - https://phabricator.wikimedia.org/T304501 (10cmooney) 05Open→03Resolved Change applied across all routers now, so hopefully this is the last we see of this kind of issue.
[09:23:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:25:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Vgutierrez) @Jclark-ctr please let me or @BCornwall know when it would be a good time for you to perform the change
[09:26:33] <elukey>	 !log delete calico pods in CrashLoop on dse-k8s-codfw (probably due to the incorrect docker settings)
[09:26:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/839627 (https://phabricator.wikimedia.org/T319067) (owner: 10Volans)
[09:28:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:49:05] <wikibugs>	 (03CR) 10Btullis: Remove legacy AQS host configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis)
[09:49:09] <wikibugs>	 (03PS4) 10Btullis: Remove legacy AQS host configuration [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277)
[09:56:35] <wikibugs>	 (03PS4) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529)
[10:00:51] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi)
[10:04:06] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:05:14] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:11:56] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Remove legacy AQS host configuration [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis)
[10:12:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:15:36] <wikibugs>	 (03CR) 10Volans: "couple of optional nits inline" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney)
[10:19:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:24:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:27:17] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: support different installers (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/839627 (https://phabricator.wikimedia.org/T319067) (owner: 10Volans)
[10:31:57] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.reimage: support different installers [cookbooks] - 10https://gerrit.wikimedia.org/r/839627 (https://phabricator.wikimedia.org/T319067) (owner: 10Volans)
[10:33:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:36:30] <wikibugs>	 (03PS1) 10Btullis: Remove absented resource definitions for aqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277)
[10:38:06] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37479/console" [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis)
[10:38:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:40:50] <wikibugs>	 (03PS6) 10Btullis: Add a new production image for spark version 3.3.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730)
[10:41:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye
[10:43:33] <wikibugs>	 (03PS1) 10Jgiannelos: changeprop: Disable pregeneration on PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/840097 (https://phabricator.wikimedia.org/T319365)
[10:44:14] <wikibugs>	 (03CR) 10Jgiannelos: [C: 04-1] "Blocking this patch until deployment." [deployment-charts] - 10https://gerrit.wikimedia.org/r/840097 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos)
[10:45:42] <wikibugs>	 (03PS2) 10Jgiannelos: changeprop: Disable restbase pregeneration on PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/840097 (https://phabricator.wikimedia.org/T319365)
[10:47:36] <wikibugs>	 (03CR) 10Muehlenhoff: Remove absented resource definitions for aqs nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis)
[10:47:48] <wikibugs>	 (03PS3) 10Jgiannelos: changeprop: Disable restbase pregeneration on PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/840097 (https://phabricator.wikimedia.org/T319365)
[10:48:49] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS bullseye
[10:49:18] <wikibugs>	 (03PS4) 10Jgiannelos: changeprop: Disable restbase pregeneration on PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/840097 (https://phabricator.wikimedia.org/T319365)
[10:49:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye
[10:49:36] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] Remove absented resource definitions for aqs nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis)
[10:57:12] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[11:01:30] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1001.eqiad.wmnet with OS bullseye
[11:04:17] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:07:50] <wikibugs>	 (03PS1) 10Volans: sre.hosts.dhcp: support different installers [cookbooks] - 10https://gerrit.wikimedia.org/r/840103 (https://phabricator.wikimedia.org/T319067)
[11:10:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/840103 (https://phabricator.wikimedia.org/T319067) (owner: 10Volans)
[11:24:52] <wikibugs>	 (03PS1) 10Cathal Mooney: Add section for PIC config of QFX5120-48Y port block speeds [homer/public] - 10https://gerrit.wikimedia.org/r/840105 (https://phabricator.wikimedia.org/T303529)
[11:25:49] <wikibugs>	 (03PS2) 10Btullis: Remove absented resource definitions for aqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277)
[11:26:16] <wikibugs>	 (03PS5) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529)
[11:26:48] <wikibugs>	 (03PS7) 10Btullis: Add a new production image for spark version 3.3.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730)
[11:27:53] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloud: eqiad1: depool rabbitmq02 [puppet] - 10https://gerrit.wikimedia.org/r/840107 (https://phabricator.wikimedia.org/T320232)
[11:27:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
[11:28:26] <wikibugs>	 (03CR) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds (033 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney)
[11:28:53] <wikibugs>	 (03CR) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney)
[11:29:46] <wikibugs>	 (03PS4) 10Btullis: Add a spark-operator production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730)
[11:31:38] <wikibugs>	 (03PS1) 10Clément Goubert: Prepare 3.9.4 release [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108
[11:32:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Prepare 3.9.4 release [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert)
[11:32:45] <wikibugs>	 (03CR) 10Btullis: Add a spark-operator production image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[11:34:18] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "may not be needed after all." [puppet] - 10https://gerrit.wikimedia.org/r/840107 (https://phabricator.wikimedia.org/T320232) (owner: 10Arturo Borrero Gonzalez)
[11:36:33] <wikibugs>	 (03CR) 10Btullis: Remove absented resource definitions for aqs nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis)
[11:38:23] <wikibugs>	 10SRE, 10Observability-Logging, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1): Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10colewhite) a:03colewhite
[11:38:55] <wikibugs>	 (03PS6) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529)
[11:44:20] <wikibugs>	 (03PS2) 10Clément Goubert: Prepare 3.9.4 release [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108
[11:45:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Prepare 3.9.4 release [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert)
[11:46:03] <wikibugs>	 (03CR) 10JMeybohm: Prepare 3.9.4 release (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert)
[11:49:36] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS buster
[11:50:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
[11:50:08] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1001.eqiad.wmnet with OS buster
[11:50:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
[11:50:50] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1001.eqiad.wmnet with OS buster
[11:51:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
[11:56:46] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS buster
[11:57:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
[11:57:40] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS buster
[11:58:30] <wikibugs>	 (03CR) 10Clément Goubert: Prepare 3.9.4 release (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert)
[11:59:17] <wikibugs>	 10SRE, 10API Platform: Block non-browser requests that use generic agents - https://phabricator.wikimedia.org/T319423 (10daniel)
[12:02:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
[12:16:03] <wikibugs>	 (03CR) 10Hashar: Json schema from Gerrit Java event classes (032 comments) [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar)
[12:17:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[12:20:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[12:21:15] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: noc: use php7.4 [puppet] - 10https://gerrit.wikimedia.org/r/840117
[12:21:47] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] noc: use php7.4 [puppet] - 10https://gerrit.wikimedia.org/r/840117 (owner: 10Giuseppe Lavagetto)
[12:27:55] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Prepare 3.9.4 release (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert)
[12:29:13] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "Ah, wait. Changelog bump is missing here. Also make sure to set distribution to bullseye-wikimedia for the new version." [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert)
[12:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[12:38:23] <wikibugs>	 (03PS1) 10KartikMistry: RecentSignificantEditStore: Force section titles to be an index array [extensions/ContentTranslation] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/840041 (https://phabricator.wikimedia.org/T319799)
[12:39:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster
[12:40:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
[12:46:02] <wikibugs>	 (03PS2) 10Slavina Stefanova: Add SSH key for sstefanova to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/826219 (https://phabricator.wikimedia.org/T313934)
[12:48:04] <wikibugs>	 (03PS1) 10Majavah: openstack: keystone: enable app credentials everywhere [puppet] - 10https://gerrit.wikimedia.org/r/840121 (https://phabricator.wikimedia.org/T294195)
[12:49:20] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37480/console" [puppet] - 10https://gerrit.wikimedia.org/r/840121 (https://phabricator.wikimedia.org/T294195) (owner: 10Majavah)
[12:54:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Junos: resolve DNS through mgmt_junos - https://phabricator.wikimedia.org/T317175 (10ayounsi)
[13:02:14] <wikibugs>	 (03PS2) 10Hashar: Json schema from Gerrit Java event classes [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947)
[13:02:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Json schema from Gerrit Java event classes [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar)
[13:04:15] <wikibugs>	 (03CR) 10Hashar: "recheck" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar)
[13:04:46] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] dns: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837098 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[13:07:16] <wikibugs>	 (03CR) 10Hashar: Json schema from Gerrit Java event classes (031 comment) [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar)
[13:07:24] <wikibugs>	 (03PS1) 10Volans: cumin: fix missing lab->cloud alias rename [puppet] - 10https://gerrit.wikimedia.org/r/840123
[13:07:30] <wikibugs>	 (03PS3) 10Hashar: Json schema from Gerrit Java event classes [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947)
[13:08:36] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS buster
[13:09:51] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Jclark-ctr) @Vgutierrez will schedule for next week, I will not be on site today unless @Cmjohnson is available today, I will have to get wi...
[13:11:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS buster
[13:13:18] <wikibugs>	 (03CR) 10Clément Goubert: Prepare 3.9.4 release (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert)
[13:16:20] <wikibugs>	 (03PS6) 10Hashar: Implement REST API and Ssh commands [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947)
[13:17:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Implement REST API and Ssh commands [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar)
[13:18:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[13:18:21] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch profile::base::linux510 to the new meta package [puppet] - 10https://gerrit.wikimedia.org/r/840125 (https://phabricator.wikimedia.org/T319067)
[13:24:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Papaul) @Jclark-ctr if you want, you can also ping me for the port configuration.
[13:24:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[13:26:29] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:26:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[13:33:47] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:33:57] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Prepare 3.9.4 release (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert)
[13:34:19] <wikibugs>	 (03PS1) 10Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed alerts. [software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277)
[13:37:34] <wikibugs>	 (03CR) 10Clément Goubert: Prepare 3.9.4 release (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert)
[13:38:21] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1040 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:39:49] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:41:08] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] Remove absented resource definitions for aqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis)
[13:41:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] icinga: allow wait_for_optimal to ignore ack'ed alerts. [software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: 10Slyngshede)
[13:44:01] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "And thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/840125 (https://phabricator.wikimedia.org/T319067) (owner: 10Muehlenhoff)
[13:44:19] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10MoritzMuehlenhoff) I created a new netinst environment based on the latest buster plus the 5.10.136 Linux kernel under /var/lib/puppet/volatile/tftpboot/...
[13:44:23] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:44:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis)
[13:47:42] <wikibugs>	 (03CR) 10Hashar: "recheck" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar)
[13:51:13] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Remove absented resource definitions for aqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis)
[13:51:28] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Remove absented resource definitions for aqs nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis)
[13:52:03] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:54:23] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (ssingh) Thanks for the update and for working on this!  >>! In T319067#8300254, @MoritzMuehlenhoff wrote: > I created a new netinst environment based on...
[13:55:38] <wikibugs>	 ops-eqiad, Data Engineering Planning, Patch-For-Review, Shared-Data-Infrastructure (Sprint 02): Decommission old AQS cluster nodes - https://phabricator.wikimedia.org/T302277 (BTullis) a:BTullis→Cmjohnson
[13:56:14] <wikibugs>	 (PS3) Clément Goubert: Release upstream version 3.9.4 [debs/helm3] - https://gerrit.wikimedia.org/r/840108
[13:56:30] <wikibugs>	 (CR) Clément Goubert: Release upstream version 3.9.4 (1 comment) [debs/helm3] - https://gerrit.wikimedia.org/r/840108 (owner: Clément Goubert)
[13:57:58] <wikibugs>	 ops-eqiad, Data Engineering Planning, Patch-For-Review, Shared-Data-Infrastructure (Sprint 02): Decommission old AQS cluster nodes - https://phabricator.wikimedia.org/T302277 (BTullis) I believe that the //service owner// part of this task is all done, so I'm tagging #ops-eqiad and assigning to @...
[14:00:20] <wikibugs>	 (CR) Clément Goubert: Release upstream version 3.9.4 (1 comment) [debs/helm3] - https://gerrit.wikimedia.org/r/840108 (owner: Clément Goubert)
[14:01:10] <wikibugs>	 (CR) CI reject: [V: -1] Release upstream version 3.9.4 [debs/helm3] - https://gerrit.wikimedia.org/r/840108 (owner: Clément Goubert)
[14:07:55] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1040 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:09:21] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:09:23] <wikibugs>	 (CR) Muehlenhoff: cumin: fix missing lab->cloud alias rename (1 comment) [puppet] - https://gerrit.wikimedia.org/r/840123 (owner: Volans)
[14:12:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:16:21] <wikibugs>	 (CR) Reedy: Automate icon generation (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[14:17:58] <wikibugs>	 (CR) Klausman: [C: +1] Add a spark-operator production image [docker-images/production-images] - https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[14:19:35] <wikibugs>	 (CR) Klausman: [C: +1] Add a new production image for spark version 3.3.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[14:21:08] <wikibugs>	 (CR) Btullis: [C: -1] "I'm only giving a -1 because of the merge conflict, since the 'aqs' role has been removed at the moment." [puppet] - https://gerrit.wikimedia.org/r/838167 (owner: Snwachukwu)
[14:30:27] <wikibugs>	 (PS1) Muehlenhoff: maps: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840139 (https://phabricator.wikimedia.org/T308013)
[14:30:29] <wikibugs>	 (PS1) Muehlenhoff: microsites: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840140 (https://phabricator.wikimedia.org/T308013)
[14:30:31] <wikibugs>	 (PS1) Muehlenhoff: trafficserver: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013)
[14:30:33] <wikibugs>	 (PS1) Muehlenhoff: pki: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840142 (https://phabricator.wikimedia.org/T308013)
[14:30:35] <wikibugs>	 (PS1) Muehlenhoff: wmcs::services: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840143 (https://phabricator.wikimedia.org/T308013)
[14:30:37] <wikibugs>	 (PS1) Muehlenhoff: hadoop: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840144 (https://phabricator.wikimedia.org/T308013)
[14:30:42] <wikibugs>	 (PS1) Muehlenhoff: logstash: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840145 (https://phabricator.wikimedia.org/T308013)
[14:33:47] <wikibugs>	 (PS2) Snwachukwu: role::common::aqs: update mw history snapshot [puppet] - https://gerrit.wikimedia.org/r/838167
[14:35:05] <wikibugs>	 (CR) Snwachukwu: role::common::aqs: update mw history snapshot (1 comment) [puppet] - https://gerrit.wikimedia.org/r/838167 (owner: Snwachukwu)
[14:35:24] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS buster
[14:37:32] <wikibugs>	 (PS3) Snwachukwu: role::common::aqs: update mw history snapshot [puppet] - https://gerrit.wikimedia.org/r/838167
[14:37:52] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (phaultfinder)
[14:38:51] <wikibugs>	 (PS4) Snwachukwu: role::common::aqs: update mw history snapshot [puppet] - https://gerrit.wikimedia.org/r/838167
[14:42:03] <wikibugs>	 (CR) Arturo Borrero Gonzalez: cumin: fix missing lab->cloud alias rename (1 comment) [puppet] - https://gerrit.wikimedia.org/r/840123 (owner: Volans)
[14:42:29] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (Arnoldokoth) p:Triage→Medium
[14:44:16] <wikibugs>	 (PS2) Volans: cumin: remove unused misc-wmcs alias [puppet] - https://gerrit.wikimedia.org/r/840123
[14:44:18] <wikibugs>	 (CR) Volans: "ack, removed, thx" [puppet] - https://gerrit.wikimedia.org/r/840123 (owner: Volans)
[14:45:12] <wikibugs>	 (CR) Majavah: "Does this work properly when building images with a :testing tag?" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/839745 (https://phabricator.wikimedia.org/T320100) (owner: BryanDavis)
[14:45:14] <wikibugs>	 (PS2) AOkoth: admin: add hshaikh and ptiwary to private-data users [puppet] - https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326)
[14:45:16] <wikibugs>	 (PS1) AOkoth: admin: add lucas to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/840152 (https://phabricator.wikimedia.org/T319014)
[14:46:28] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (Lucas_Werkmeister_WMDE) >  [] User has provided a public SSH key. This ssh key pair should only be used for WMF cluster access, and not share...
[14:46:46] <wikibugs>	 (PS2) Muehlenhoff: trafficserver: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013)
[14:46:51] <wikibugs>	 (CR) Muehlenhoff: trafficserver: Add SPDX headers (2 comments) [puppet] - https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[14:53:14] <wikibugs>	 (PS4) Clément Goubert: Release upstream version 3.9.4 [debs/helm3] - https://gerrit.wikimedia.org/r/840108
[14:54:36] <wikibugs>	 (CR) Btullis: [C: +2] role::common::aqs: update mw history snapshot [puppet] - https://gerrit.wikimedia.org/r/838167 (owner: Snwachukwu)
[14:56:55] <wikibugs>	 (PS2) Muehlenhoff: hadoop: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840144 (https://phabricator.wikimedia.org/T308013)
[14:57:07] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[14:57:12] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[14:58:23] <wikibugs>	 (CR) Ema: [C: +1] "Looks good thanks!" [puppet] - https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[14:58:37] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/840123 (owner: Volans)
[14:59:58] <wikibugs>	 (CR) Hashar: "Well done :-]" [debs/helm3] - https://gerrit.wikimedia.org/r/840108 (owner: Clément Goubert)
[15:00:16] <wikibugs>	 (PS3) Muehlenhoff: trafficserver: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013)
[15:06:42] <wikibugs>	 (PS5) Clément Goubert: Release upstream version 3.9.4 [debs/helm3] - https://gerrit.wikimedia.org/r/840108 (https://phabricator.wikimedia.org/T317511)
[15:08:02] <wikibugs>	 (PS2) Clément Goubert: Add build instructions in debian/README [debs/helm3] - https://gerrit.wikimedia.org/r/839550
[15:08:43] <wikibugs>	 (CR) CI reject: [V: -1] Add build instructions in debian/README [debs/helm3] - https://gerrit.wikimedia.org/r/839550 (owner: Clément Goubert)
[15:08:48] <wikibugs>	 (PS1) Hokwelum: Add labstore1006 to dumps distribution active web server [puppet] - https://gerrit.wikimedia.org/r/840158 (https://phabricator.wikimedia.org/T319269)
[15:09:25] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[15:15:14] <wikibugs>	 (PS2) Hokwelum: Add labstore1006 to dumps distribution servers [puppet] - https://gerrit.wikimedia.org/r/840158 (https://phabricator.wikimedia.org/T319269)
[15:15:58] <wikibugs>	 (PS3) Hokwelum: Add labstore1006 to dumps distribution servers [puppet] - https://gerrit.wikimedia.org/r/840158 (https://phabricator.wikimedia.org/T319269)
[15:16:49] <wikibugs>	 (PS4) Hokwelum: Add labstore1006 to dumps distribution servers [puppet] - https://gerrit.wikimedia.org/r/840158 (https://phabricator.wikimedia.org/T319269)
[15:18:00] <wikibugs>	 (PS1) Ssingh: P:base: configure Linux 5.10 on buster via Hiera [puppet] - https://gerrit.wikimedia.org/r/840162
[15:20:54] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37481/console" [puppet] - https://gerrit.wikimedia.org/r/840162 (owner: Ssingh)
[15:22:07] <wikibugs>	 (CR) Btullis: [C: +1] "Looks good to me, thanks." [puppet] - https://gerrit.wikimedia.org/r/840144 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[15:26:45] <wikibugs>	 (CR) Ssingh: [V: +1] "WMCS (NOOP): https://puppet-compiler.wmflabs.org/pcc-worker1003/37483/cloudgw1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/840162 (owner: Ssingh)
[15:26:59] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (nskaggs) Given the new machines much larger capacity, I believe any pending requests for more space can now be reconsidered....
[15:29:30] <wikibugs>	 (PS1) Sbisson: Make discovery mode config default to 'off' [extensions/Wikistories] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/840178 (https://phabricator.wikimedia.org/T314582)
[15:32:07] <wikibugs>	 (PS2) Ssingh: P:base: configure Linux 5.10 on buster via Hiera [puppet] - https://gerrit.wikimedia.org/r/840162 (https://phabricator.wikimedia.org/T319067)
[15:34:25] <wikibugs>	 (PS1) Btullis: Add dse-k8s-worker as a permitted alias for the reboot-nodes cookbook [cookbooks] - https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196)
[15:35:53] <wikibugs>	 (PS2) Btullis: Add dse-k8s-worker as a permitted alias for the reboot-nodes cookbook [cookbooks] - https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196)
[15:36:40] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[15:38:57] <wikibugs>	 (CR) Volans: [C: +2] sre.hosts.dhcp: support different installers [cookbooks] - https://gerrit.wikimedia.org/r/840103 (https://phabricator.wikimedia.org/T319067) (owner: Volans)
[15:40:44] <wikibugs>	 (CR) Elukey: [C: +1] Add dse-k8s-worker as a permitted alias for the reboot-nodes cookbook [cookbooks] - https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[15:44:14] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.dhcp: support different installers [cookbooks] - https://gerrit.wikimedia.org/r/840103 (https://phabricator.wikimedia.org/T319067) (owner: Volans)
[15:45:00] <wikibugs>	 (CR) Volans: [C: +2] cumin: remove unused misc-wmcs alias [puppet] - https://gerrit.wikimedia.org/r/840123 (owner: Volans)
[15:50:59] <icinga-wm>	 PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:06:14] <wikibugs>	 (PS1) Brennen Bearnes: Check whether title actually exists [core] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/840180 (https://phabricator.wikimedia.org/T319798)
[16:09:25] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:11:53] <wikibugs>	 (CR) JMeybohm: [C: +1] "Cool!" [debs/helm3] - https://gerrit.wikimedia.org/r/840108 (https://phabricator.wikimedia.org/T317511) (owner: Clément Goubert)
[16:15:39] <brennen>	 !log train 1.40.0-wmf.4 (T314193) blockers have patches; after discussion in releng, going ahead with friday deploy in interest of avoiding a scramble during the coming holiday week
[16:15:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:44] <stashbot>	 T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193
[16:19:18] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/840180 (https://phabricator.wikimedia.org/T319798) (owner: Brennen Bearnes)
[16:25:26] <kart_>	 brennen: It is pretty late, but feel free to deploy fix: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/840041
[16:25:28] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Default allowed SSH parameters on upgraded Juniper mgmt routers prevent some connections - https://phabricator.wikimedia.org/T320272 (cmooney) p:Triage→Low
[16:25:51] <brennen>	 kart_: will do.
[16:26:56] <kart_>	 brennen: Thanks! 
[16:27:09] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] RecentSignificantEditStore: Force section titles to be an index array [extensions/ContentTranslation] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/840041 (https://phabricator.wikimedia.org/T319799) (owner: KartikMistry)
[16:28:48] <wikibugs>	 (PS7) Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529)
[16:29:24] <wikibugs>	 (PS8) Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529)
[16:34:16] <wikibugs>	 (CR) BryanDavis: Use explicit 'latest' tags on upstream base images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/839745 (https://phabricator.wikimedia.org/T320100) (owner: BryanDavis)
[16:35:46] <wikibugs>	 (Merged) jenkins-bot: Check whether title actually exists [core] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/840180 (https://phabricator.wikimedia.org/T319798) (owner: Brennen Bearnes)
[16:36:14] <logmsgbot>	 !log brennen@deploy1002 Started scap: Backport for [[gerrit:840180|Check whether title actually exists (T319798)]]
[16:36:19] <stashbot>	 T319798: Wikimedia\Assert\PreconditionException: Precondition failed: This Title instance does not represent a proper page, but merely a link target (from SearchResultSetWidget) - https://phabricator.wikimedia.org/T319798
[16:36:41] <logmsgbot>	 !log brennen@deploy1002 brennen and brennen: Backport for [[gerrit:840180|Check whether title actually exists (T319798)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[16:36:51] <wikibugs>	 (CR) Majavah: [C: +1] Use explicit 'latest' tags on upstream base images (2 comments) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/839745 (https://phabricator.wikimedia.org/T320100) (owner: BryanDavis)
[16:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:41:17] <wikibugs>	 (Merged) jenkins-bot: RecentSignificantEditStore: Force section titles to be an index array [extensions/ContentTranslation] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/840041 (https://phabricator.wikimedia.org/T319799) (owner: KartikMistry)
[16:42:01] <logmsgbot>	 !log brennen@deploy1002 Finished scap: Backport for [[gerrit:840180|Check whether title actually exists (T319798)]] (duration: 05m 47s)
[16:42:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:42:06] <stashbot>	 T319798: Wikimedia\Assert\PreconditionException: Precondition failed: This Title instance does not represent a proper page, but merely a link target (from SearchResultSetWidget) - https://phabricator.wikimedia.org/T319798
[16:43:13] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy1002 using scap backport" [extensions/ContentTranslation] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/840041 (https://phabricator.wikimedia.org/T319799) (owner: KartikMistry)
[16:43:37] <logmsgbot>	 !log brennen@deploy1002 Started scap: Backport for [[gerrit:840041|RecentSignificantEditStore: Force section titles to be an index array (T319799)]]
[16:43:41] <stashbot>	 T319799: TypeError: Argument 6 passed to ContentTranslation\Entity\RecentSignificantEdit::__construct() must be of the type array, object given - https://phabricator.wikimedia.org/T319799
[16:44:00] <logmsgbot>	 !log brennen@deploy1002 brennen and kartik: Backport for [[gerrit:840041|RecentSignificantEditStore: Force section titles to be an index array (T319799)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[16:46:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[16:46:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[16:46:16] <brennen>	 going ahead with this one, should be clear if the errors stop i think.
[16:46:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[16:50:19] <logmsgbot>	 !log brennen@deploy1002 Finished scap: Backport for [[gerrit:840041|RecentSignificantEditStore: Force section titles to be an index array (T319799)]] (duration: 06m 41s)
[16:50:23] <stashbot>	 T319799: TypeError: Argument 6 passed to ContentTranslation\Entity\RecentSignificantEdit::__construct() must be of the type array, object given - https://phabricator.wikimedia.org/T319799
[16:51:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:52:20] <wikibugs>	 (CR) Ayounsi: "Some comments but overall LGTM! That will help catch cabling errors sooner." [software/netbox-extras] - https://gerrit.wikimedia.org/r/812376 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney)
[16:52:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[16:52:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[16:55:04] <wikibugs>	 (PS1) TrainBranchBot: all wikis to 1.40.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/840195 (https://phabricator.wikimedia.org/T314193)
[16:55:05] <wikibugs>	 (CR) TrainBranchBot: [C: +2] all wikis to 1.40.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/840195 (https://phabricator.wikimedia.org/T314193) (owner: TrainBranchBot)
[16:55:57] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.40.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/840195 (https://phabricator.wikimedia.org/T314193) (owner: TrainBranchBot)
[16:56:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[16:58:47] <wikibugs>	 (CR) FNegri: [C: -1] "I think the current policy is that this key should be a different key from the one that was added in https://gerrit.wikimedia.org/r/c/oper" [labs/private] - https://gerrit.wikimedia.org/r/826219 (https://phabricator.wikimedia.org/T313934) (owner: Slavina Stefanova)
[16:59:58] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.4  refs T314193
[17:00:03] <stashbot>	 T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193
[17:01:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:02:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:02:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:03:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:10:35] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:13:08] <sukhe>	 !migrate ganeti4004: T317249
[17:13:09] <stashbot>	 T317249: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249
[17:13:34] <taavi>	 sukhe: missing !log from the start of that line
[17:13:39] <sukhe>	 ha
[17:13:47] <sukhe>	 !log migrate ganeti4004: T317249
[17:13:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:12] <wikibugs>	 (PS1) Ssingh: hiera: decom ganeti4004 [puppet] - https://gerrit.wikimedia.org/r/840199 (https://phabricator.wikimedia.org/T317249)
[17:18:13] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (Ottomata) > Access to full superset information, especially for the banner bump investigation  This sounds like ssh-less access to analytics-privatedata-users group  Appr...
[17:18:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[17:20:38] <sukhe>	 !log sudo gnt-node evacuate -s ganeti4004.ulsfo.wmnet
[17:20:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:51] <icinga-wm>	 PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:51:23] <ryankemper>	 !log [Elastic] Updated list of cross-cluster remote seeds for all eqiad/codfw elastic clusters; should resolve `ElasticSearch setting check` alerts
[17:51:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:16] <wikibugs>	 (PS6) Vlad.shapik: Update the logic to run code coverage [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016)
[17:58:38] <wikibugs>	 (PS7) Vlad.shapik: Update the logic to run code coverage [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016)
[18:12:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:25:28] <wikibugs>	 (CR) Dzahn: [C: +2] microsites: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840140 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[18:25:42] <wikibugs>	 (PS2) Dzahn: microsites: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840140 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[18:26:07] <icinga-wm>	 RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:27:24] <wikibugs>	 (CR) Dzahn: [C: +1] admin: add lucas to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/840152 (https://phabricator.wikimedia.org/T319014) (owner: AOkoth)
[18:32:24] <wikibugs>	 (CR) Dzahn: "re: "only needed when restarting Gerrit which will happen at some point in the future anyway". That's what I don't like, when we change co" [puppet] - https://gerrit.wikimedia.org/r/839694 (owner: Hashar)
[18:33:01] <wikibugs>	 (CR) Dzahn: [C: +2] gerrit: use 2 threads to replicate to GitHub [puppet] - https://gerrit.wikimedia.org/r/839694 (owner: Hashar)
[18:40:43] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic1057 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:40:43] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic1068 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:40:45] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic1076 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:40:47] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic1093 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:40:47] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic1098 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:40:49] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic2042 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:40:49] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic2047 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:40:51] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic2052 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:40:51] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9600 on elastic2054 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:42:30] <wikibugs>	 (CR) AOkoth: [C: +2] admin: add lucas to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/840152 (https://phabricator.wikimedia.org/T319014) (owner: AOkoth)
[18:42:58] <wikibugs>	 (CR) Dzahn: [C: +2] admin: add lucas to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/840152 (https://phabricator.wikimedia.org/T319014) (owner: AOkoth)
[18:45:47] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[18:53:13] <wikibugs>	 (PS2) AOkoth: admin: add lucas to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/840152 (https://phabricator.wikimedia.org/T319014)
[18:53:32] <wikibugs>	 (PS3) AOkoth: admin: add lucas to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/840152 (https://phabricator.wikimedia.org/T319014)
[18:53:36] <wikibugs>	 (CR) Dzahn: [C: +1] admin: add lucas to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/840152 (https://phabricator.wikimedia.org/T319014) (owner: AOkoth)
[18:54:35] <icinga-wm>	 RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:57:12] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[18:58:33] <wikibugs>	 (CR) AOkoth: [C: +2] admin: add lucas to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/840152 (https://phabricator.wikimedia.org/T319014) (owner: AOkoth)
[19:00:05] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (Arnoldokoth) Hey @Lucas_Werkmeister_WMDE Yeah, I was actually debating whether to remove that checkbox but I'll just leave it unchecked since...
[19:01:21] <wikibugs>	 (CR) Dzahn: "Adding new types to wmflib needs different reviewers, so yea, this is unfortunately now mixing different things into a single patch." [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[19:05:02] <sukhe>	 !log sudo gnt-node remove ganeti4004.ulsfo.wmnet T317249
[19:05:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:07] <stashbot>	 T317249: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249
[19:07:13] <jinxer-wm>	 (IcingaOverload) firing: Checks are taking long to execute on alert1001:9245 - https://wikitech.wikimedia.org/wiki/Icinga#IcingaOverload - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org/?q=alertname%3DIcingaOverload
[19:07:21] <sukhe>	 !log decommission ganeti4004.ulsfo.wmnet: T317249
[19:07:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:31] <icinga-wm>	 PROBLEM - ganeti-mond running on ganeti4004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti
[19:11:17] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti4004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[19:11:47] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti4004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[19:12:57] <mutante>	 sukhe: the decom cookbook should handle it
[19:14:11] <wikibugs>	 (CR) Dzahn: [C: -1] ""Evaluation Error: Resource type not found: HTTP_PROXY " https://puppet-compiler.wmflabs.org/pcc-worker1003/37485/gitlab-runner1004.eqiad." [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[19:17:05] <sukhe>	 mutante: yes sorry 
[19:17:31] <sukhe>	 stepped out for a bit, didn't know it would complain :D
[19:17:44] <wikibugs>	 (PS1) Andrew Bogott: clouddumps1001: profile::dumps::distribution::web::is_primary_server = true [puppet] - https://gerrit.wikimedia.org/r/840213 (https://phabricator.wikimedia.org/T319269)
[19:17:48] <wikibugs>	 (CR) Dzahn: [C: -1] P:gitlab::runner: Provide proxy variables to runner jobs (3 comments) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[19:18:33] <mutante>	 sukhe: ack :)
[19:20:25] <wikibugs>	 (CR) Ahmon Dancy: P:gitlab::runner: Provide proxy variables to runner jobs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[19:21:18] <wikibugs>	 (CR) Ahmon Dancy: P:gitlab::runner: Provide proxy variables to runner jobs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[19:24:52] <wikibugs>	 (PS1) Ahmon Dancy: Add type Wmflib::POSIX::Name [puppet] - https://gerrit.wikimedia.org/r/840215
[19:25:12] <wikibugs>	 (CR) Ahmon Dancy: P:gitlab::runner: Provide proxy variables to runner jobs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[19:31:27] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[19:37:14] <wikibugs>	 (PS2) Andrew Bogott: clouddumps1001: profile::dumps::distribution::web::is_primary_server = true [puppet] - https://gerrit.wikimedia.org/r/840213 (https://phabricator.wikimedia.org/T319269)
[19:37:53] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti4004.ulsfo.wmnet
[19:42:56] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[19:46:08] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:46:08] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts ganeti4004.ulsfo.wmnet
[19:46:40] <sukhe>	 ha
[19:47:33] <wikibugs>	 (CR) Andrew Bogott: [C: +2] clouddumps1001: profile::dumps::distribution::web::is_primary_server = true [puppet] - https://gerrit.wikimedia.org/r/840213 (https://phabricator.wikimedia.org/T319269) (owner: Andrew Bogott)
[19:49:12] <wikibugs>	 (CR) Ssingh: [C: +2] hiera: decom ganeti4004 [puppet] - https://gerrit.wikimedia.org/r/840199 (https://phabricator.wikimedia.org/T317249) (owner: Ssingh)
[19:49:59] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (ssingh) @RobH: ganeti4004 has been decommissioned and is ready for you. Thanks!
[20:02:13] <jinxer-wm>	 (IcingaOverload) firing: (2) Checks are taking long to execute on alert1001:9245 - https://wikitech.wikimedia.org/wiki/Icinga#IcingaOverload - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org/?q=alertname%3DIcingaOverload
[20:08:52] <wikibugs>	 (CR) Dzahn: "you were using this in "misc-wmcs"" [puppet] - https://gerrit.wikimedia.org/r/836795 (owner: Muehlenhoff)
[20:14:17] <wikibugs>	 (PS1) Dzahn: cumin: fix misc-wmcs alias, labweb->cloudweb [puppet] - https://gerrit.wikimedia.org/r/840229
[20:15:42] <wikibugs>	 (Abandoned) Dzahn: cumin: fix misc-wmcs alias, labweb->cloudweb [puppet] - https://gerrit.wikimedia.org/r/840229 (owner: Dzahn)
[20:20:41] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:22:55] <wikibugs>	 (PS8) Jdlrobson: Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223)
[20:22:57] <wikibugs>	 (CR) Jdlrobson: Automate icon generation (6 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[20:23:37] <wikibugs>	 (CR) CI reject: [V: -1] Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[20:32:44] <wikibugs>	 (PS9) Jdlrobson: Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223)
[20:32:46] <wikibugs>	 (PS4) Jdlrobson: Move wordmarks and taglines from InitialiseSettings.php to yaml files [mediawiki-config] - https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223)
[20:33:36] <wikibugs>	 (CR) CI reject: [V: -1] Move wordmarks and taglines from InitialiseSettings.php to yaml files [mediawiki-config] - https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[20:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[20:43:23] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[20:44:14] <wikibugs>	 (PS5) Dzahn: P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[20:46:31] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[20:52:37] <wikibugs>	 (CR) Dzahn: "thanks Ahmon. I took the liberty to amend and remove that part and also comment the lines causing the error I pointed out before. Just to " [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[20:56:49] <wikibugs>	 (PS6) Dzahn: P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[20:57:23] <wikibugs>	 (CR) CI reject: [V: -1] P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[20:58:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:00:07] <wikibugs>	 (PS2) Ahmon Dancy: Add type Wmflib::POSIX::Name [puppet] - https://gerrit.wikimedia.org/r/840215
[21:00:27] <wikibugs>	 (PS7) Dzahn: P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[21:00:37] <wikibugs>	 (CR) Ahmon Dancy: P:gitlab::runner: Provide proxy variables to runner jobs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[21:02:07] <wikibugs>	 (PS8) Dzahn: P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[21:03:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:06:13] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1001/37487/gitlab-runner1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[21:06:19] <wikibugs>	 (CR) Mabualruz: [C: +1] Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson)
[21:08:41] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[21:09:01] <wikibugs>	 (CR) Dzahn: "This compiles now and looks like it would work, except we only have the non-capitalized version of the env variable names. The capitalized" [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[21:15:00] <wikibugs>	 (CR) Dzahn: P:gitlab::runner: Provide proxy variables to runner jobs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[21:15:09] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:18:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[21:18:20] <wikibugs>	 (CR) Dzahn: [C: +2] P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[21:21:16] <wikibugs>	 (CR) Dzahn: [C: +2] "on gitlab-runner1003: /etc/default/buildkitd has been created. /home/gitlab-runner/.gitlab-runner/managed.toml has been edited. command li" [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[21:21:19] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] P:gitlab::runner: Provide proxy variables to runner jobs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[21:22:04] <wikibugs>	 (CR) Dzahn: [C: +2] P:gitlab::runner: Provide proxy variables to runner jobs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[21:24:41] <wikibugs>	 (CR) Dzahn: [C: +2] "Oct 07 21:23:43 gitlab-runner1002 systemd[1]: Stopped buildkitd." [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[21:25:37] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1003 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:25:53] <mutante>	 sigh, caused by latest merge
[21:26:32] <dancy>	 mutante: I'm around if needed.
[21:27:49] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:28:21] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:28:56] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: debugging
[21:29:19] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: debugging
[21:31:31] <mutante>	 dancy: looks like the "<%= @image %>" was removed from the docker command line ... let me add that back :p
[21:31:42] <mutante>	 in https://gerrit.wikimedia.org/r/c/operations/puppet/+/833125/9/modules/buildkitd/templates/buildkitd.service.erb#b20
[21:33:06] <dancy>	 ah yes.. overly aggressive deletion
[21:33:58] <mutante>	 would it be nicer to move <%= @image %> \ to the very end anyways?
[21:34:45] <dancy>	 It has to be between the docker flags and the flags to buildkitd itself (which start with --addr)
[21:35:05] <mutante>	 ack
[21:36:41] <wikibugs>	 (PS1) Dzahn: buildkitd: re-add <%= @image %> to the docker ExecStart command line [puppet] - https://gerrit.wikimedia.org/r/840234 (https://phabricator.wikimedia.org/T317997)
[21:37:11] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] buildkitd: re-add <%= @image %> to the docker ExecStart command line [puppet] - https://gerrit.wikimedia.org/r/840234 (https://phabricator.wikimedia.org/T317997) (owner: Dzahn)
[21:37:26] <wikibugs>	 (CR) Dzahn: [C: +2] buildkitd: re-add <%= @image %> to the docker ExecStart command line [puppet] - https://gerrit.wikimedia.org/r/840234 (https://phabricator.wikimedia.org/T317997) (owner: Dzahn)
[21:39:11] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:39:29] <wikibugs>	 (PS1) Stang: trwikivoyage: Install WikiLove extension [mediawiki-config] - https://gerrit.wikimedia.org/r/840235 (https://phabricator.wikimedia.org/T319537)
[21:39:41] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:41:25] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:41:42] <mutante>	 dancy: it's running again and with --env-file /etc/default/buildkitd
[21:42:05] <dancy>	 ok.. I'll look into BUILDKITD_HOST
[21:42:15] <mutante>	 that file has the settings but it's just the non-capitalized versions
[21:42:16] <wikibugs>	 (CR) Stang: "See T319537#8301427" [mediawiki-config] - https://gerrit.wikimedia.org/r/840235 (https://phabricator.wikimedia.org/T319537) (owner: Stang)
[21:42:56] <mutante>	 dancy: does this actually work? "BUILDKIT_HOST=tcp://buildkitd.gitlab-runner:1234"
[21:43:06] <mutante>	 just ending in .gitlab-runner
[21:44:38] <wikibugs>	 (PS10) Jdlrobson: Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223)
[21:45:30] <dancy>	 https://www.irccloud.com/pastebin/0PnzyVns/
[21:46:05] <mutante>	 :) ok, thanks
[21:53:01] <Lazowik>	 Hi, does anyone here know the release schedule of the mobileapps service? I can see that in the past it has been released at a sub-month frequency. The thing is that a pretty important fix (for end users, not ops) landed more than a month ago. It makes page previews show the flagged revision instead of the latest one. Right now on prod, vandalism is shown even when the changes have not been reviewed.
[21:55:07] <Lazowik>	 Current version is `2022-08-16-171635-production`, latest docker label is `2022-10-03-111410-production`
[21:56:28] <Lazowik>	 Of course this only affects wikis with flagged revisions + it's not a regression, it's been like that since launch...
[21:58:14] <mutante>	 nemo-yiannis: ^
[21:59:44] <Lazowik>	 The fix for reference: https://gerrit.wikimedia.org/r/c/mediawiki/services/mobileapps/+/821218
[22:00:25] <mutante>	 Lazowik: maybe try an email to https://meta.wikimedia.org/wiki/User:JGiannelos_(WMF)
[22:00:33] <mutante>	 I just say that based on logs
[22:04:10] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] P:gitlab::runner: Provide proxy variables to runner jobs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[22:04:20] <dancy>	 mutante: ^
[22:10:21] <wikibugs>	 (CR) Dzahn: [C: +2] P:gitlab::runner: Provide proxy variables to runner jobs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[22:11:13] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9600 on elastic2080 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:12:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:13:26] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] P:gitlab::runner: Provide proxy variables to runner jobs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[22:15:49] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9600 on elastic2075 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:16] <mutante>	 dancy: fully agree with your comment, let's continue after the weekend
[22:17:39] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic2073 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:18:35] <dancy>	 Have a good one!
[22:19:22] <mutante>	 you too. I am leaving IRC, watching the staff meeting, then off. cu
[22:26:21] <icinga-wm>	 PROBLEM - SSH on mw1328.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:26:29] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic2086 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:31:01] <icinga-wm>	 PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:36:23] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9600 on elastic2083 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:36:23] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9600 on elastic2076 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:40:27] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:57:12] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[23:02:15] <icinga-wm>	 PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:27:29] <icinga-wm>	 RECOVERY - SSH on mw1328.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook