[00:02:13] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [00:44:23] Puppet, Infrastructure-Foundations: Puppet failure on deploy-1004.devtools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T319681 (dancy) [00:48:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:53:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [01:36:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:41:45] (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:58] (KubernetesAPILatency) firing: (2) High Kubernetes 
API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:50:37] PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:51:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:06:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:45] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:13:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [02:20:31] PROBLEM - nova-compute proc minimum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:20:35] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:21:05] PROBLEM - nova-compute proc minimum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:21:10] PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:21:23] PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:21:24] PROBLEM - nova-compute proc minimum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:21:43] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:22:07] PROBLEM - nova-compute proc minimum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:22:49] RECOVERY - nova-compute proc minimum on cloudvirt1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:22:53] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:23:23] RECOVERY - nova-compute 
proc minimum on cloudvirt1051 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:23:29] RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:23:41] RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:23:42] RECOVERY - nova-compute proc minimum on cloudvirt1025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:24:01] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:24:21] I'm on top of ^ for the moment [02:24:25] RECOVERY - nova-compute proc minimum on cloudvirt1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:51:49] RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:57:12] (ThanosCompactIsDown) firing: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [02:59:47] (PS18) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [03:05:36] (PS19) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [03:07:58] (PS20) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [03:29:02] (CR) Raymond Ndibe: "Fixed the issue of how to run tests without root. It required making measure changes to the code, but it's done now. Next is to implement " [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [03:30:03] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:36:55] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: search-drop-query-clicks.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:02:50] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (phaultfinder) [04:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [05:27:57] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 242, down: 1, dormant: 0, 
excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:28:23] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:32:35] PROBLEM - SSH on db1116.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:36:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:41:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:42:13] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:54:27] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:59:05] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: 
All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:07:23] (CR) Ayounsi: [C: +1] "On one side it's better when the default option does the right thing out of the box, but on the other it's not worth the time investigating " [homer/public] - https://gerrit.wikimedia.org/r/839634 (https://phabricator.wikimedia.org/T304501) (owner: Cathal Mooney) [06:11:04] SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi) This opened {T314998} automatically. Please sync up with Netops before doing the work as live traffic is using the port. [06:12:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:13:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [06:32:33] PROBLEM - SSH on ms-be1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:33:43] RECOVERY - SSH on db1116.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:57:12] (ThanosCompactIsDown) firing: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221007T0700) [07:03:01] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:06:49] (CR) Elukey: Add a spark-operator production image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: Btullis) [07:11:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1026.eqiad.wmnet with OS bullseye [07:11:10] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1026.eqiad.wmnet with OS bullseye [07:14:09] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1014.eqiad.wmnet with reason: Downtime for removal from Ganeti cluster and eventual bullseye reimage [07:14:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1014.eqiad.wmnet with reason: Downtime for removal from Ganeti cluster and eventual bullseye reimage [07:15:10] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ayounsi) I went through the useful https://apps.juniper.net/feature-explorer/select-software.html?typ=1&swName=Junos%20OS&rel=21.2R3&sid=1211&platform=MX204&pi... [07:16:06] (CR) Elukey: [C: +1] "Left a note but LGTM!" 
[puppet] - https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) (owner: Btullis) [07:22:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1029.eqiad.wmnet to cluster eqiad and group A [07:22:44] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1029.eqiad.wmnet to cluster eqiad and group A [07:23:22] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1026.eqiad.wmnet with reason: host reimage [07:26:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1026.eqiad.wmnet with reason: host reimage [07:30:18] (CR) Elukey: "Left some comments, but LGTM! (tested also the get-kubernetes-release.sh script as well)." [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/820888 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [07:32:30] RECOVERY - SSH on ms-be1040.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:41:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1026.eqiad.wmnet with OS bullseye [07:41:14] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1026.eqiad.wmnet with OS bullseye completed: - ganeti1026 (**PASS**) - Downtimed on... 
[07:49:55] Puppet, Infrastructure-Foundations: Puppet failure on deploy-1004.devtools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T319681 (hashar) [07:49:57] !log re-initialize docker on dse-k8s-worker100[5-8] - wrong storage type set (devicemapper instead of overlay2) [07:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1014.eqiad.wmnet with OS bullseye [07:50:21] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1014.eqiad.wmnet with OS bullseye [07:54:31] !log re-initialize docker on dse-k8s-worker1004 - wrong storage type set (devicemapper instead of overlay2) [07:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:02:45] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-openstack-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:59] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1014.eqiad.wmnet with reason: host reimage [08:07:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1014.eqiad.wmnet with reason: host reimage [08:07:46] PROBLEM - ElasticSearch setting check - 9200 on elastic1094 is CRITICAL: CRITICAL - [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500] does not match [elastic1057.eqiad.wmnet:9500, 
elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500, elastic1093.eqiad.wmnet:9500, elastic1098.eqiad.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [08:07:58] (KubernetesCalicoDown) firing: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:08:36] ^^ ryankemper: looks like we have settings mismatch. I suspect this is related to the change in eligible masters? [08:11:41] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [08:11:53] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [08:12:58] (KubernetesCalicoDown) resolved: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:14:03] (PS3) JMeybohm: Update to Kubernetes v1.23.12 [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/820888 (https://phabricator.wikimedia.org/T307943) [08:15:25] (CR) JMeybohm: Update to Kubernetes v1.23.12 (3 comments) [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/820888 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [08:19:03] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [08:19:29] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[08:19:38] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [08:19:59] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [08:20:57] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [08:21:35] (PS1) Vgutierrez: trafficserver: Enable cache partitioning in cp6016 [puppet] - https://gerrit.wikimedia.org/r/840061 (https://phabricator.wikimedia.org/T317748) [08:21:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:21:56] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[08:22:14] !log partition ats-be cache in cp6016 - T317748 [08:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:18] T317748: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 [08:22:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1014.eqiad.wmnet with OS bullseye [08:22:30] PROBLEM - ElasticSearch setting check - 9600 on elastic1102 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [08:22:32] (CR) Vgutierrez: [C: +2] trafficserver: Enable cache partitioning in cp6016 [puppet] - https://gerrit.wikimedia.org/r/840061 (https://phabricator.wikimedia.org/T317748) (owner: Vgutierrez) [08:22:35] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1014.eqiad.wmnet with OS bullseye completed: - ganeti1014 (**PASS**) - Downtimed on... 
[08:22:58] (KubernetesCalicoDown) firing: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:23:56] !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudnet1003.eqiad.wmnet with OS bullseye [08:23:58] !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudnet1004.eqiad.wmnet with OS bullseye [08:24:01] SRE, ops-eqiad, decommission-hardware, cloud-services-team (Hardware): decommission cloudnet1003.eqiad.wmnet - https://phabricator.wikimedia.org/T319682 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudnet1003.eqiad.wmnet with OS bullseye execu... [08:24:04] SRE, ops-eqiad, decommission-hardware, cloud-services-team (Hardware): decommission cloudnet1004.eqiad.wmnet - https://phabricator.wikimedia.org/T319683 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudnet1004.eqiad.wmnet with OS bullseye execu... 
[08:24:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:25:02] PROBLEM - ElasticSearch setting check - 9600 on elastic1095 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [08:25:02] PROBLEM - ElasticSearch setting check - 9200 on elastic1100 is CRITICAL: CRITICAL - [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500] does not match [elastic1057.eqiad.wmnet:9500, elastic1068.eqiad.wmnet:9500, elastic1076.eqiad.wmnet:9500, elastic1093.eqiad.wmnet:9500, elastic1098.eqiad.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [08:25:03] PROBLEM - ElasticSearch setting check - 9400 on elastic1093 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [08:25:17] (CR) Muehlenhoff: sre.hosts.reimage: support different installers (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/839627 (https://phabricator.wikimedia.org/T319067) (owner: Volans) [08:26:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1029.eqiad.wmnet to cluster eqiad and group A [08:27:58] (KubernetesCalicoDown) resolved: (2) 
dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:28:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1029.eqiad.wmnet to cluster eqiad and group A [08:29:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:30:58] (CR) Volans: "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/839627 (https://phabricator.wikimedia.org/T319067) (owner: Volans) [08:31:58] (KubernetesCalicoDown) firing: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:32:23] (CR) Elukey: Update to Kubernetes v1.23.12 (1 comment) [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/820888 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [08:32:50] (CR) Elukey: [C: +1] Update to Kubernetes v1.23.12 [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/820888 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [08:32:59] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:33:04] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[08:33:17] (CR) Muehlenhoff: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/839627 (https://phabricator.wikimedia.org/T319067) (owner: Volans) [08:35:34] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [08:35:59] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [08:36:07] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [08:36:58] (KubernetesCalicoDown) resolved: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:37:57] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [08:38:10] (CR) Cathal Mooney: Add explicit BFD session mode (single/multi-hop) to Anycast groups (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/839634 (https://phabricator.wikimedia.org/T304501) (owner: Cathal Mooney) [08:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:39:56] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[08:39:58] (KubernetesCalicoDown) firing: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:41:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:42:06] (PS2) Cathal Mooney: Add explicit BFD session mode (single/multi-hop) to Anycast groups [homer/public] - https://gerrit.wikimedia.org/r/839634 (https://phabricator.wikimedia.org/T304501) [08:43:04] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [08:43:50] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (karapayneWMDE) a:karapayneWMDE→Arnoldokoth Approved! And thanks :) [08:43:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:44:24] (CR) Cathal Mooney: [C: +2] Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: Cathal Mooney) [08:44:50] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[08:44:57] (CR) Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: Cathal Mooney) [08:44:58] (KubernetesCalicoDown) resolved: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:45:41] (CR) Cathal Mooney: [C: +2] Add explicit BFD session mode (single/multi-hop) to Anycast groups [homer/public] - https://gerrit.wikimedia.org/r/839634 (https://phabricator.wikimedia.org/T304501) (owner: Cathal Mooney) [08:46:08] (CR) Cathal Mooney: [V: +2 C: +2] Add explicit BFD session mode (single/multi-hop) to Anycast groups [homer/public] - https://gerrit.wikimedia.org/r/839634 (https://phabricator.wikimedia.org/T304501) (owner: Cathal Mooney) [08:46:22] (Merged) jenkins-bot: Add explicit BFD session mode (single/multi-hop) to Anycast groups [homer/public] - https://gerrit.wikimedia.org/r/839634 (https://phabricator.wikimedia.org/T304501) (owner: Cathal Mooney) [08:48:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:52:24] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[08:53:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:57:49] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (10phaultfinder) [08:58:19] (03CR) 10Mabualruz: Automate icon generation (036 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [08:59:30] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [09:00:12] PROBLEM - ElasticSearch setting check - 9400 on elastic1098 is CRITICAL: CRITICAL - [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300] does not match [elastic1054.eqiad.wmnet:9300, elastic1074.eqiad.wmnet:9300, elastic1081.eqiad.wmnet:9300, elastic1094.eqiad.wmnet:9300, elastic1100.eqiad.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [09:00:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10aborrero) [09:02:19] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10aborrero) [09:06:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:11:19] 10Puppet, 10SRE, 
10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10jbond) p:05Lowest→03Medium [09:12:14] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10jbond) I changed the priority to medium. The lack of a proper solution for network management causes periodic problems eno... [09:17:48] (03CR) 10Jbond: "lgtm, optional comment still open" [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [09:17:53] (03CR) 10Jbond: [C: 03+1] Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [09:18:01] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [09:23:22] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: IPv6 BFD Sessions Failing from Bird (Anycast VMs) to Juniper QFX in drmrs - https://phabricator.wikimedia.org/T304501 (10cmooney) 05Open→03Resolved Change applied across all routers now, so hopefully the last we see this kind of issue. 
[09:23:58] (KubernetesCalicoDown) firing: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:25:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Vgutierrez) @Jclark-ctr please let me or @BCornwall know when it would be a good time for you to perform the change [09:26:33] !log delete calico pods in CrashLoop on dse-k8s-codfw (probably due to the incorrect docker settings) [09:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:20] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/839627 (https://phabricator.wikimedia.org/T319067) (owner: 10Volans) [09:28:58] (KubernetesCalicoDown) resolved: dse-k8s-worker1005.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:49:05] (03CR) 10Btullis: Remove legacy AQS host configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis) [09:49:09] (03PS4) 10Btullis: Remove legacy AQS host configuration [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) [09:56:35] (03PS4) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) [10:00:51] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 
10netops, 10Sustainability (Incident Followup): eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) [10:04:06] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:05:14] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:11:56] (03CR) 10Btullis: [C: 03+2] Remove legacy AQS host configuration [puppet] - 10https://gerrit.wikimedia.org/r/839605 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis) [10:12:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:15:36] (03CR) 10Volans: "couple of optional nits inline" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [10:19:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:24:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:27:17] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: support different installers (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/839627 (https://phabricator.wikimedia.org/T319067) (owner: 10Volans) [10:31:57] (03Merged) 10jenkins-bot: sre.hosts.reimage: support different installers [cookbooks] - 10https://gerrit.wikimedia.org/r/839627 (https://phabricator.wikimedia.org/T319067) (owner: 10Volans) [10:33:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:36:30] (03PS1) 10Btullis: Remove absented resource definitions for aqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) [10:38:06] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37479/console" [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis) [10:38:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:40:50] (03PS6) 10Btullis: Add a new production image for spark version 3.3.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) [10:41:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [10:43:33] (03PS1) 10Jgiannelos: changeprop: Disable pregeneration on PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/840097 (https://phabricator.wikimedia.org/T319365) [10:44:14] (03CR) 10Jgiannelos: [C: 04-1] "Blocking this patch 
until deployment." [deployment-charts] - 10https://gerrit.wikimedia.org/r/840097 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [10:45:42] (03PS2) 10Jgiannelos: changeprop: Disable restbase pregeneration on PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/840097 (https://phabricator.wikimedia.org/T319365) [10:47:36] (03CR) 10Muehlenhoff: Remove absented resource definitions for aqs nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis) [10:47:48] (03PS3) 10Jgiannelos: changeprop: Disable restbase pregeneration on PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/840097 (https://phabricator.wikimedia.org/T319365) [10:48:49] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS bullseye [10:49:18] (03PS4) 10Jgiannelos: changeprop: Disable restbase pregeneration on PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/840097 (https://phabricator.wikimedia.org/T319365) [10:49:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [10:49:36] (03CR) 10Btullis: [V: 03+1] Remove absented resource definitions for aqs nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis) [10:57:12] (ThanosCompactIsDown) firing: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [11:01:30] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1001.eqiad.wmnet with OS bullseye [11:04:17] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:07:50] (03PS1) 10Volans: sre.hosts.dhcp: support different installers [cookbooks] - 10https://gerrit.wikimedia.org/r/840103 (https://phabricator.wikimedia.org/T319067) [11:10:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/840103 (https://phabricator.wikimedia.org/T319067) (owner: 10Volans) [11:24:52] (03PS1) 10Cathal Mooney: Add section for PIC config of QFX5120-48Y port block speeds [homer/public] - 10https://gerrit.wikimedia.org/r/840105 (https://phabricator.wikimedia.org/T303529) [11:25:49] (03PS2) 10Btullis: Remove absented resource definitions for aqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) [11:26:16] (03PS5) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) [11:26:48] (03PS7) 10Btullis: Add a new production image for spark version 3.3.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) [11:27:53] (03PS1) 10Arturo Borrero Gonzalez: cloud: eqiad1: depool rabbitmq02 [puppet] - 10https://gerrit.wikimedia.org/r/840107 (https://phabricator.wikimedia.org/T320232) [11:27:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [11:28:26] (03CR) 10Cathal Mooney: Modify wmf-netbox plugin to provide 
QFX5120-48Y port block speeds (033 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [11:28:53] (03CR) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [11:29:46] (03PS4) 10Btullis: Add a spark-operator production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) [11:31:38] (03PS1) 10Clément Goubert: Prepare 3.9.4 release [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 [11:32:41] (03CR) 10CI reject: [V: 04-1] Prepare 3.9.4 release [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert) [11:32:45] (03CR) 10Btullis: Add a spark-operator production image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [11:34:18] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "may not be needed after all." 
[puppet] - 10https://gerrit.wikimedia.org/r/840107 (https://phabricator.wikimedia.org/T320232) (owner: 10Arturo Borrero Gonzalez) [11:36:33] (03CR) 10Btullis: Remove absented resource definitions for aqs nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis) [11:38:23] 10SRE, 10Observability-Logging, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1): Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10colewhite) a:03colewhite [11:38:55] (03PS6) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) [11:44:20] (03PS2) 10Clément Goubert: Prepare 3.9.4 release [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 [11:45:18] (03CR) 10CI reject: [V: 04-1] Prepare 3.9.4 release [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert) [11:46:03] (03CR) 10JMeybohm: Prepare 3.9.4 release (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert) [11:49:36] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS buster [11:50:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [11:50:08] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1001.eqiad.wmnet with OS buster [11:50:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [11:50:50] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1001.eqiad.wmnet with OS buster [11:51:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [11:56:46] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage 
(exit_code=97) for host sretest1001.eqiad.wmnet with OS buster [11:57:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [11:57:40] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS buster [11:58:30] (03CR) 10Clément Goubert: Prepare 3.9.4 release (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert) [11:59:17] 10SRE, 10API Platform: Block non-browser requests that use generic agents - https://phabricator.wikimedia.org/T319423 (10daniel) [12:02:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [12:16:03] (03CR) 10Hashar: Json schema from Gerrit Java event classes (032 comments) [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [12:17:02] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [12:20:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [12:21:15] (03PS1) 10Giuseppe Lavagetto: noc: use php7.4 [puppet] - 10https://gerrit.wikimedia.org/r/840117 [12:21:47] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] noc: use php7.4 [puppet] - 10https://gerrit.wikimedia.org/r/840117 (owner: 10Giuseppe Lavagetto) [12:27:55] (03CR) 10JMeybohm: [C: 03+1] Prepare 3.9.4 release (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert) [12:29:13] (03CR) 10JMeybohm: [C: 04-1] "Ah, wait. Changelog bump is missing here. Also make sure to set distribution to bullseye-wikimedia for the new version." 
[debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert) [12:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:38:23] (03PS1) 10KartikMistry: RecentSignificantEditStore: Force section titles to be an index array [extensions/ContentTranslation] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/840041 (https://phabricator.wikimedia.org/T319799) [12:39:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster [12:40:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [12:46:02] (03PS2) 10Slavina Stefanova: Add SSH key for sstefanova to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/826219 (https://phabricator.wikimedia.org/T313934) [12:48:04] (03PS1) 10Majavah: openstack: keystone: enable app credentials everywhere [puppet] - 10https://gerrit.wikimedia.org/r/840121 (https://phabricator.wikimedia.org/T294195) [12:49:20] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37480/console" [puppet] - 10https://gerrit.wikimedia.org/r/840121 (https://phabricator.wikimedia.org/T294195) (owner: 10Majavah) [12:54:48] 10SRE, 10Infrastructure-Foundations, 10netops: Junos: resolve DNS through mgmt_junos - https://phabricator.wikimedia.org/T317175 (10ayounsi) [13:02:14] (03PS2) 10Hashar: Json schema from Gerrit Java event classes [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) [13:02:42] (03CR) 10CI reject: [V: 04-1] Json schema from Gerrit Java event classes [software/gerrit/plugins/events-wikimedia] - 
10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [13:04:15] (03CR) 10Hashar: "recheck" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [13:04:46] (03CR) 10Ssingh: [C: 03+1] dns: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837098 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:07:16] (03CR) 10Hashar: Json schema from Gerrit Java event classes (031 comment) [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [13:07:24] (03PS1) 10Volans: cumin: fix missing lab->cloud alias rename [puppet] - 10https://gerrit.wikimedia.org/r/840123 [13:07:30] (03PS3) 10Hashar: Json schema from Gerrit Java event classes [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) [13:08:36] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS buster [13:09:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Jclark-ctr) @Vgutierrez will schedule for next week i will not be on site today unless @Cmjohnson is available today i will have to get wi... 
[13:11:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS buster [13:13:18] (03CR) 10Clément Goubert: Prepare 3.9.4 release (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert) [13:16:20] (03PS6) 10Hashar: Implement REST API and Ssh commands [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) [13:17:01] (03CR) 10CI reject: [V: 04-1] Implement REST API and Ssh commands [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [13:18:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [13:18:21] (03PS1) 10Muehlenhoff: Switch profile::base::linux510 to the new meta package [puppet] - 10https://gerrit.wikimedia.org/r/840125 (https://phabricator.wikimedia.org/T319067) [13:24:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Papaul) @Jclark-ctr if you want, you can also ping me for the port configuration. 
[13:24:16] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [13:26:29] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:26:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [13:33:47] PROBLEM - nova-compute proc maximum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:33:57] (03CR) 10JMeybohm: [C: 03+1] Prepare 3.9.4 release (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert) [13:34:19] (03PS1) 10Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed alerts. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) [13:37:34] (03CR) 10Clément Goubert: Prepare 3.9.4 release (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert) [13:38:21] RECOVERY - nova-compute proc maximum on cloudvirt1040 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:39:49] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:41:08] (03CR) 10Eevans: [C: 03+1] Remove absented resource definitions for aqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis) [13:41:40] (03CR) 10CI reject: [V: 04-1] icinga: allow wait_for_optimal to ignore ack'ed alerts. [software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: 10Slyngshede) [13:44:01] (03CR) 10Ssingh: [C: 03+1] "And thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/840125 (https://phabricator.wikimedia.org/T319067) (owner: 10Muehlenhoff) [13:44:19] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10MoritzMuehlenhoff) I created a new netinst environment based on the latest buster plus the 5.10.136 Linux kernel under /var/lib/puppet/volatile/tftpboot/... 
[13:44:23] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:44:49] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis) [13:47:42] (03CR) 10Hashar: "recheck" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [13:51:13] (03CR) 10Btullis: [C: 03+2] Remove absented resource definitions for aqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis) [13:51:28] (03CR) 10Btullis: [C: 03+2] Remove absented resource definitions for aqs nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/840096 (https://phabricator.wikimedia.org/T302277) (owner: 10Btullis) [13:52:03] PROBLEM - nova-compute proc maximum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:54:23] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ssingh) Thanks for the update and for working on this! >>! In T319067#8300254, @MoritzMuehlenhoff wrote: > I created a new netinst environment based on... 
[13:55:38] 10ops-eqiad, 10Data Engineering Planning, 10Patch-For-Review, 10Shared-Data-Infrastructure (Sprint 02): Decommission old AQS cluster nodes - https://phabricator.wikimedia.org/T302277 (10BTullis) a:05BTullis→03Cmjohnson [13:56:14] (03PS3) 10Clément Goubert: Release upstream version 3.9.4 [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 [13:56:30] (03CR) 10Clément Goubert: Release upstream version 3.9.4 (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert) [13:57:58] 10ops-eqiad, 10Data Engineering Planning, 10Patch-For-Review, 10Shared-Data-Infrastructure (Sprint 02): Decommission old AQS cluster nodes - https://phabricator.wikimedia.org/T302277 (10BTullis) I believe that the //service owner// part of this task is all done, so I'm tagging #ops-eqiad and assigning to @... [14:00:20] (03CR) 10Clément Goubert: Release upstream version 3.9.4 (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert) [14:01:10] (03CR) 10CI reject: [V: 04-1] Release upstream version 3.9.4 [debs/helm3] - 10https://gerrit.wikimedia.org/r/840108 (owner: 10Clément Goubert) [14:07:55] RECOVERY - nova-compute proc maximum on cloudvirt1040 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:09:21] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:09:23] (03CR) 10Muehlenhoff: cumin: fix missing lab->cloud alias rename (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/840123 (owner: 10Volans) [14:12:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:21] (03CR) 10Reedy: Automate icon generation (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [14:17:58] (03CR) 10Klausman: [C: 03+1] Add a spark-operator production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [14:19:35] (03CR) 10Klausman: [C: 03+1] Add a new production image for spark version 3.3.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [14:21:08] (03CR) 10Btullis: [C: 04-1] "I'm only giving a -1 because of the merge conflict, since the 'aqs' role has been removed at the moment." [puppet] - 10https://gerrit.wikimedia.org/r/838167 (owner: 10Snwachukwu) [14:30:27] (03PS1) 10Muehlenhoff: maps: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840139 (https://phabricator.wikimedia.org/T308013) [14:30:29] (03PS1) 10Muehlenhoff: microsites: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840140 (https://phabricator.wikimedia.org/T308013) [14:30:31] (03PS1) 10Muehlenhoff: trafficserver: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) [14:30:33] (03PS1) 10Muehlenhoff: pki: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840142 (https://phabricator.wikimedia.org/T308013) [14:30:35] (03PS1) 10Muehlenhoff: wmcs::services: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840143 (https://phabricator.wikimedia.org/T308013) [14:30:37] (03PS1) 10Muehlenhoff: hadoop: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840144 (https://phabricator.wikimedia.org/T308013) [14:30:42] (03PS1) 10Muehlenhoff: logstash: Add SPDX headers [puppet] - 
10https://gerrit.wikimedia.org/r/840145 (https://phabricator.wikimedia.org/T308013) [14:33:47] (03PS2) 10Snwachukwu: role::common::aqs: update mw history snapshot [puppet] - 10https://gerrit.wikimedia.org/r/838167 [14:35:05] (03CR) 10Snwachukwu: role::common::aqs: update mw history snapshot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/838167 (owner: 10Snwachukwu) [14:35:24] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS buster [14:37:32] (03PS3) 10Snwachukwu: role::common::aqs: update mw history snapshot [puppet] - 10https://gerrit.wikimedia.org/r/838167 [14:37:52] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (10phaultfinder) [14:38:51] (03PS4) 10Snwachukwu: role::common::aqs: update mw history snapshot [puppet] - 10https://gerrit.wikimedia.org/r/838167 [14:42:03] (03CR) 10Arturo Borrero Gonzalez: cumin: fix missing lab->cloud alias rename (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/840123 (owner: 10Volans) [14:42:29] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (10Arnoldokoth) p:05Triage→03Medium [14:44:16] (03PS2) 10Volans: cumin: remove unused misc-wmcs alias [puppet] - 10https://gerrit.wikimedia.org/r/840123 [14:44:18] (03CR) 10Volans: "ack, removed, thx" [puppet] - 10https://gerrit.wikimedia.org/r/840123 (owner: 10Volans) [14:45:12] (03CR) 10Majavah: "Does this work properly when building images with a :testing tag?" 
[docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/839745 (https://phabricator.wikimedia.org/T320100) (owner: BryanDavis) [14:45:14] (PS2) AOkoth: admin: add hshaikh and ptiwary to private-data users [puppet] - https://gerrit.wikimedia.org/r/839667 (https://phabricator.wikimedia.org/T319326) [14:45:16] (PS1) AOkoth: admin: add lucas to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/840152 (https://phabricator.wikimedia.org/T319014) [14:46:28] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (Lucas_Werkmeister_WMDE) > [] User has provided a public SSH key. This ssh key pair should only be used for WMF cluster access, and not share... [14:46:46] (PS2) Muehlenhoff: trafficserver: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) [14:46:51] (CR) Muehlenhoff: trafficserver: Add SPDX headers (2 comments) [puppet] - https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [14:53:14] (PS4) Clément Goubert: Release upstream version 3.9.4 [debs/helm3] - https://gerrit.wikimedia.org/r/840108 [14:54:36] (CR) Btullis: [C: +2] role::common::aqs: update mw history snapshot [puppet] - https://gerrit.wikimedia.org/r/838167 (owner: Snwachukwu) [14:56:55] (PS2) Muehlenhoff: hadoop: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840144 (https://phabricator.wikimedia.org/T308013) [14:57:07] !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [14:57:12] (ThanosCompactIsDown) firing: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [14:58:23] (CR) Ema: [C: +1] "Looks good thanks!" [puppet] - https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [14:58:37] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/840123 (owner: Volans) [14:59:58] (CR) Hashar: "Well done :-]" [debs/helm3] - https://gerrit.wikimedia.org/r/840108 (owner: Clément Goubert) [15:00:16] (PS3) Muehlenhoff: trafficserver: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) [15:06:42] (PS5) Clément Goubert: Release upstream version 3.9.4 [debs/helm3] - https://gerrit.wikimedia.org/r/840108 (https://phabricator.wikimedia.org/T317511) [15:08:02] (PS2) Clément Goubert: Add build instructions in debian/README [debs/helm3] - https://gerrit.wikimedia.org/r/839550 [15:08:43] (CR) CI reject: [V: -1] Add build instructions in debian/README [debs/helm3] - https://gerrit.wikimedia.org/r/839550 (owner: Clément Goubert) [15:08:48] (PS1) Hokwelum: Add labstore1006 to dumps distribution active web server [puppet] - https://gerrit.wikimedia.org/r/840158 (https://phabricator.wikimedia.org/T319269) [15:09:25] !log btullis@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[15:15:14] (PS2) Hokwelum: Add labstore1006 to dumps distribution servers [puppet] - https://gerrit.wikimedia.org/r/840158 (https://phabricator.wikimedia.org/T319269) [15:15:58] (PS3) Hokwelum: Add labstore1006 to dumps distribution servers [puppet] - https://gerrit.wikimedia.org/r/840158 (https://phabricator.wikimedia.org/T319269) [15:16:49] (PS4) Hokwelum: Add labstore1006 to dumps distribution servers [puppet] - https://gerrit.wikimedia.org/r/840158 (https://phabricator.wikimedia.org/T319269) [15:18:00] (PS1) Ssingh: P:base: configure Linux 5.10 on buster via Hiera [puppet] - https://gerrit.wikimedia.org/r/840162 [15:20:54] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37481/console" [puppet] - https://gerrit.wikimedia.org/r/840162 (owner: Ssingh) [15:22:07] (CR) Btullis: [C: +1] "Looks good to me, thanks." [puppet] - https://gerrit.wikimedia.org/r/840144 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [15:26:45] (CR) Ssingh: [V: +1] "WMCS (NOOP): https://puppet-compiler.wmflabs.org/pcc-worker1003/37483/cloudgw1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/840162 (owner: Ssingh) [15:26:59] SRE, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (nskaggs) Given the new machines much larger capacity, I believe any pending requests for more space can now be reconsidered.... 
[15:29:30] (PS1) Sbisson: Make discovery mode config default to 'off' [extensions/Wikistories] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/840178 (https://phabricator.wikimedia.org/T314582) [15:32:07] (PS2) Ssingh: P:base: configure Linux 5.10 on buster via Hiera [puppet] - https://gerrit.wikimedia.org/r/840162 (https://phabricator.wikimedia.org/T319067) [15:34:25] (PS1) Btullis: Add dse-k8s-worker as a permitted alias for the reboot-nodes cookbook [cookbooks] - https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) [15:35:53] (PS2) Btullis: Add dse-k8s-worker as a permitted alias for the reboot-nodes cookbook [cookbooks] - https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) [15:36:40] (CR) Muehlenhoff: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) (owner: Btullis) [15:38:57] (CR) Volans: [C: +2] sre.hosts.dhcp: support different installers [cookbooks] - https://gerrit.wikimedia.org/r/840103 (https://phabricator.wikimedia.org/T319067) (owner: Volans) [15:40:44] (CR) Elukey: [C: +1] Add dse-k8s-worker as a permitted alias for the reboot-nodes cookbook [cookbooks] - https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) (owner: Btullis) [15:44:14] (Merged) jenkins-bot: sre.hosts.dhcp: support different installers [cookbooks] - https://gerrit.wikimedia.org/r/840103 (https://phabricator.wikimedia.org/T319067) (owner: Volans) [15:45:00] (CR) Volans: [C: +2] cumin: remove unused misc-wmcs alias [puppet] - https://gerrit.wikimedia.org/r/840123 (owner: Volans) [15:50:59] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:06:14] (PS1) Brennen Bearnes: Check whether title actually exists [core] (wmf/1.40.0-wmf.4) - 
https://gerrit.wikimedia.org/r/840180 (https://phabricator.wikimedia.org/T319798) [16:09:25] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:11:53] (CR) JMeybohm: [C: +1] "Cool!" [debs/helm3] - https://gerrit.wikimedia.org/r/840108 (https://phabricator.wikimedia.org/T317511) (owner: Clément Goubert) [16:15:39] !log train 1.40.0-wmf.4 (T314193) blockers have patches; after discussion in releng, going ahead with friday deploy in interest of avoiding a scramble during the coming holiday week [16:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:44] T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193 [16:19:18] (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/840180 (https://phabricator.wikimedia.org/T319798) (owner: Brennen Bearnes) [16:25:26] brennen: It is pretty late, but feel free to deploy fix: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/840041 [16:25:28] SRE, Infrastructure-Foundations, netops: Default allowed SSH parameters on upgraded Juniper mgmt routers prevent some connections - https://phabricator.wikimedia.org/T320272 (cmooney) p:Triage→Low [16:25:51] kart_: will do. [16:26:56] brennen: Thanks! 
[16:27:09] (CR) Brennen Bearnes: [C: +2] RecentSignificantEditStore: Force section titles to be an index array [extensions/ContentTranslation] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/840041 (https://phabricator.wikimedia.org/T319799) (owner: KartikMistry) [16:28:48] (PS7) Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) [16:29:24] (PS8) Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) [16:34:16] (CR) BryanDavis: Use explicit 'latest' tags on upstream base images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/839745 (https://phabricator.wikimedia.org/T320100) (owner: BryanDavis) [16:35:46] (Merged) jenkins-bot: Check whether title actually exists [core] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/840180 (https://phabricator.wikimedia.org/T319798) (owner: Brennen Bearnes) [16:36:14] !log brennen@deploy1002 Started scap: Backport for [[gerrit:840180|Check whether title actually exists (T319798)]] [16:36:19] T319798: Wikimedia\Assert\PreconditionException: Precondition failed: This Title instance does not represent a proper page, but merely a link target (from SearchResultSetWidget) - https://phabricator.wikimedia.org/T319798 [16:36:41] !log brennen@deploy1002 brennen and brennen: Backport for [[gerrit:840180|Check whether title actually exists (T319798)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [16:36:51] (CR) Majavah: [C: +1] Use explicit 'latest' tags on upstream base images (2 comments) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/839745 (https://phabricator.wikimedia.org/T320100) (owner: 
BryanDavis) [16:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:41:17] (Merged) jenkins-bot: RecentSignificantEditStore: Force section titles to be an index array [extensions/ContentTranslation] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/840041 (https://phabricator.wikimedia.org/T319799) (owner: KartikMistry) [16:42:01] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:840180|Check whether title actually exists (T319798)]] (duration: 05m 47s) [16:42:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:42:06] T319798: Wikimedia\Assert\PreconditionException: Precondition failed: This Title instance does not represent a proper page, but merely a link target (from SearchResultSetWidget) - https://phabricator.wikimedia.org/T319798 [16:43:13] (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy1002 using scap backport" [extensions/ContentTranslation] (wmf/1.40.0-wmf.4) - https://gerrit.wikimedia.org/r/840041 (https://phabricator.wikimedia.org/T319799) (owner: KartikMistry) [16:43:37] !log brennen@deploy1002 Started scap: Backport for [[gerrit:840041|RecentSignificantEditStore: Force section titles to be an index array (T319799)]] [16:43:41] T319799: TypeError: Argument 6 passed to ContentTranslation\Entity\RecentSignificantEdit::__construct() must be of the type array, object given - https://phabricator.wikimedia.org/T319799 [16:44:00] !log brennen@deploy1002 brennen and kartik: Backport for [[gerrit:840041|RecentSignificantEditStore: Force section titles to be an index array (T319799)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [16:46:03] !log mwdebug-deploy@deploy1002 helmfile 
[eqiad] DONE helmfile.d/services/mwdebug: apply [16:46:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:46:16] going ahead with this one, should be clear if the errors stop i think. [16:46:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:50:19] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:840041|RecentSignificantEditStore: Force section titles to be an index array (T319799)]] (duration: 06m 41s) [16:50:23] T319799: TypeError: Argument 6 passed to ContentTranslation\Entity\RecentSignificantEdit::__construct() must be of the type array, object given - https://phabricator.wikimedia.org/T319799 [16:51:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:52:20] (CR) Ayounsi: "Some comments but overall LGTM! That will help catch cabling errors sooner." [software/netbox-extras] - https://gerrit.wikimedia.org/r/812376 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney) [16:52:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:52:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:55:04] (PS1) TrainBranchBot: all wikis to 1.40.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/840195 (https://phabricator.wikimedia.org/T314193) [16:55:05] (CR) TrainBranchBot: [C: +2] all wikis to 1.40.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/840195 (https://phabricator.wikimedia.org/T314193) (owner: TrainBranchBot) [16:55:57] (Merged) jenkins-bot: all wikis to 1.40.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/840195 (https://phabricator.wikimedia.org/T314193) (owner: TrainBranchBot) [16:56:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:58:47] (CR) FNegri: [C: -1] "I think the current policy is that this key 
should be a different key from the one that was added in https://gerrit.wikimedia.org/r/c/oper" [labs/private] - https://gerrit.wikimedia.org/r/826219 (https://phabricator.wikimedia.org/T313934) (owner: Slavina Stefanova) [16:59:58] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.4 refs T314193 [17:00:03] T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193 [17:01:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:02:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:02:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:03:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:10:35] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:13:08] !migrate ganeti4004: T317249 [17:13:09] T317249: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 [17:13:34] sukhe: missing !log from the start of that line [17:13:39] ha [17:13:47] !log migrate ganeti4004: T317249 [17:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:12] (PS1) Ssingh: hiera: decom ganeti4004 [puppet] - https://gerrit.wikimedia.org/r/840199 (https://phabricator.wikimedia.org/T317249) [17:18:13] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (Ottomata) > Access to full superset information, especially for the banner bump investigation This sounds like ssh-less access to analytics-privatedata-users group Appr... 
[17:18:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [17:20:38] !log sudo gnt-node evacuate -s ganeti4004.ulsfo.wmnet [17:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:51] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:51:23] !log [Elastic] Updated list of cross-cluster remote seeds for all eqiad/codfw elastic clusters; should resolve `ElasticSearch setting check` alerts [17:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:16] (PS6) Vlad.shapik: Update the logic to run code coverage [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) [17:58:38] (PS7) Vlad.shapik: Update the logic to run code coverage [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) [18:12:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:25:28] (CR) Dzahn: [C: +2] microsites: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840140 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [18:25:42] (PS2) Dzahn: microsites: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840140 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [18:26:07] 
RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:27:24] (CR) Dzahn: [C: +1] admin: add lucas to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/840152 (https://phabricator.wikimedia.org/T319014) (owner: AOkoth) [18:32:24] (CR) Dzahn: "re: "only needed when restarting Gerrit which will happen at some point in the future anyway". That's what I don't like, when we change co" [puppet] - https://gerrit.wikimedia.org/r/839694 (owner: Hashar) [18:33:01] (CR) Dzahn: [C: +2] gerrit: use 2 threads to replicate to GitHub [puppet] - https://gerrit.wikimedia.org/r/839694 (owner: Hashar) [18:40:43] RECOVERY - ElasticSearch setting check - 9400 on elastic1057 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [18:40:43] RECOVERY - ElasticSearch setting check - 9400 on elastic1068 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [18:40:45] RECOVERY - ElasticSearch setting check - 9400 on elastic1076 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [18:40:47] RECOVERY - ElasticSearch setting check - 9400 on elastic1093 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [18:40:47] RECOVERY - ElasticSearch setting check - 9400 on elastic1098 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [18:40:49] RECOVERY - ElasticSearch setting check - 9400 on elastic2042 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [18:40:49] RECOVERY - ElasticSearch setting check - 9400 on elastic2047 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [18:40:51] RECOVERY - ElasticSearch setting check - 9400 on elastic2052 is OK: OK - All good! 
https://wikitech.wikimedia.org/wiki/Search%23Administration [18:40:51] RECOVERY - ElasticSearch setting check - 9600 on elastic2054 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [18:42:30] (CR) AOkoth: [C: +2] admin: add lucas to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/840152 (https://phabricator.wikimedia.org/T319014) (owner: AOkoth) [18:42:58] (CR) Dzahn: [C: +2] admin: add lucas to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/840152 (https://phabricator.wikimedia.org/T319014) (owner: AOkoth) [18:45:47] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [18:53:13] (PS2) AOkoth: admin: add lucas to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/840152 (https://phabricator.wikimedia.org/T319014) [18:53:32] (PS3) AOkoth: admin: add lucas to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/840152 (https://phabricator.wikimedia.org/T319014) [18:53:36] (CR) Dzahn: [C: +1] admin: add lucas to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/840152 (https://phabricator.wikimedia.org/T319014) (owner: AOkoth) [18:54:35] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:57:12] (ThanosCompactIsDown) firing: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [18:58:33] (CR) AOkoth: [C: +2] admin: add lucas to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/840152 (https://phabricator.wikimedia.org/T319014) (owner: AOkoth) [19:00:05] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Lucas Werkmeister - https://phabricator.wikimedia.org/T319014 (Arnoldokoth) Hey @Lucas_Werkmeister_WMDE Yeah, I was actually debating whether to remove that checkbox but I'll just leave it unchecked since... [19:01:21] (CR) Dzahn: "Adding new types to wmflib needs different reviewers, so yea, this is unfortunately now mixing different things into a single patch." [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall) [19:05:02] !log sudo gnt-node remove ganeti4004.ulsfo.wmnet T317249 [19:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:07] T317249: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 [19:07:13] (IcingaOverload) firing: Checks are taking long to execute on alert1001:9245 - https://wikitech.wikimedia.org/wiki/Icinga#IcingaOverload - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org/?q=alertname%3DIcingaOverload [19:07:21] !log decommission ganeti4004.ulsfo.wmnet: T317249 [19:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:31] PROBLEM - ganeti-mond running on ganeti4004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [19:11:17] PROBLEM - ganeti-confd running on ganeti4004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-confd), command name ganeti-confd 
https://wikitech.wikimedia.org/wiki/Ganeti [19:11:47] PROBLEM - ganeti-noded running on ganeti4004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [19:12:57] sukhe: the decom cookbook should handle it [19:14:11] (CR) Dzahn: [C: -1] ""Evaluation Error: Resource type not found: HTTP_PROXY " https://puppet-compiler.wmflabs.org/pcc-worker1003/37485/gitlab-runner1004.eqiad." [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall) [19:17:05] mutante: yes sorry [19:17:31] stepped out for a didn't know it would complain :D [19:17:44] (PS1) Andrew Bogott: clouddumps1001: profile::dumps::distribution::web::is_primary_server = true [puppet] - https://gerrit.wikimedia.org/r/840213 (https://phabricator.wikimedia.org/T319269) [19:17:48] (CR) Dzahn: [C: -1] P:gitlab::runner: Provide proxy variables to runner jobs (3 comments) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall) [19:18:33] sukhe: ack :) [19:20:25] (CR) Ahmon Dancy: P:gitlab::runner: Provide proxy variables to runner jobs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall) [19:21:18] (CR) Ahmon Dancy: P:gitlab::runner: Provide proxy variables to runner jobs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall) [19:24:52] (PS1) Ahmon Dancy: Add type Wmflib::POSIX::Name [puppet] - https://gerrit.wikimedia.org/r/840215 [19:25:12] (CR) Ahmon Dancy: P:gitlab::runner: Provide proxy variables to runner jobs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall) [19:31:27] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% 
above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [19:37:14] (PS2) Andrew Bogott: clouddumps1001: profile::dumps::distribution::web::is_primary_server = true [puppet] - https://gerrit.wikimedia.org/r/840213 (https://phabricator.wikimedia.org/T319269) [19:37:53] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti4004.ulsfo.wmnet [19:42:56] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [19:46:08] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:46:08] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts ganeti4004.ulsfo.wmnet [19:46:40] ha [19:47:33] (CR) Andrew Bogott: [C: +2] clouddumps1001: profile::dumps::distribution::web::is_primary_server = true [puppet] - https://gerrit.wikimedia.org/r/840213 (https://phabricator.wikimedia.org/T319269) (owner: Andrew Bogott) [19:49:12] (CR) Ssingh: [C: +2] hiera: decom ganeti4004 [puppet] - https://gerrit.wikimedia.org/r/840199 (https://phabricator.wikimedia.org/T317249) (owner: Ssingh) [19:49:59] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (ssingh) @RobH: ganeti4004 has been decommissioned and is ready for you. Thanks! 
[20:02:13] (IcingaOverload) firing: (2) Checks are taking long to execute on alert1001:9245 - https://wikitech.wikimedia.org/wiki/Icinga#IcingaOverload - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org/?q=alertname%3DIcingaOverload [20:08:52] (CR) Dzahn: "you were using this in "misc-wmcs"" [puppet] - https://gerrit.wikimedia.org/r/836795 (owner: Muehlenhoff) [20:14:17] (PS1) Dzahn: cumin: fix misc-wmcs alias, labweb->cloudweb [puppet] - https://gerrit.wikimedia.org/r/840229 [20:15:42] (Abandoned) Dzahn: cumin: fix misc-wmcs alias, labweb->cloudweb [puppet] - https://gerrit.wikimedia.org/r/840229 (owner: Dzahn) [20:20:41] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:55] (PS8) Jdlrobson: Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) [20:22:57] (CR) Jdlrobson: Automate icon generation (6 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson) [20:23:37] (CR) CI reject: [V: -1] Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson) [20:32:44] (PS9) Jdlrobson: Automate icon generation [mediawiki-config] - https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) [20:32:46] (PS4) Jdlrobson: Move wordmarks and taglines from InitialiseSettings.php to yaml files [mediawiki-config] - https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223) [20:33:36] (CR) CI reject: [V: -1] Move wordmarks and taglines from InitialiseSettings.php to yaml files [mediawiki-config] - https://gerrit.wikimedia.org/r/839679 (https://phabricator.wikimedia.org/T319223) (owner: 
Jdlrobson) [20:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:43:23] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [20:44:14] (PS5) Dzahn: P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall) [20:46:31] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:52:37] (CR) Dzahn: "thanks Ahmon. I took the liberty to amend and remove that part and also comment the lines causing the error I pointed out before. 
Just to " [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall) [20:56:49] (PS6) Dzahn: P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall) [20:57:23] (CR) CI reject: [V: -1] P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall) [20:58:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:00:07] (PS2) Ahmon Dancy: Add type Wmflib::POSIX::Name [puppet] - https://gerrit.wikimedia.org/r/840215 [21:00:27] (PS7) Dzahn: P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall) [21:00:37] (CR) Ahmon Dancy: P:gitlab::runner: Provide proxy variables to runner jobs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall) [21:02:07] (PS8) Dzahn: P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall) [21:03:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:06:13] (CR) Dzahn: 
"https://puppet-compiler.wmflabs.org/pcc-worker1001/37487/gitlab-runner1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [21:06:19] (03CR) 10Mabualruz: [C: 03+1] Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [21:08:41] (03CR) 10Ahmon Dancy: [C: 03+1] P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [21:09:01] (03CR) 10Dzahn: "This compiles now and looks like it would work, except we only have the non-capitalized version of the env variable names. The capitalized" [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [21:15:00] (03CR) 10Dzahn: P:gitlab::runner: Provide proxy variables to runner jobs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [21:15:09] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:18:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [21:18:20] (03CR) 10Dzahn: [C: 03+2] P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [21:21:16] (03CR) 10Dzahn: [C: 03+2] "on gitlab-runner1003: /etc/default/buildkitd has been created. 
/home/gitlab-runner/.gitlab-runner/managed.toml has been edited. command li" [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [21:21:19] (03CR) 10Ahmon Dancy: [C: 03+1] P:gitlab::runner: Provide proxy variables to runner jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [21:22:04] (03CR) 10Dzahn: [C: 03+2] P:gitlab::runner: Provide proxy variables to runner jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [21:24:41] (03CR) 10Dzahn: [C: 03+2] "Oct 07 21:23:43 gitlab-runner1002 systemd[1]: Stopped buildkitd." [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [21:25:37] PROBLEM - Check systemd state on gitlab-runner1003 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:53] sigh, caused by latest merge [21:26:32] mutante: I'm around if needed. [21:27:49] PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:28:21] PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:28:56] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: debugging [21:29:19] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: debugging [21:31:31] dancy: looks like the "<%= @image %>" was removed from the docker command line ... 
let me add that back :p [21:31:42] in https://gerrit.wikimedia.org/r/c/operations/puppet/+/833125/9/modules/buildkitd/templates/buildkitd.service.erb#b20 [21:33:06] ah yes.. overly aggressive deletion [21:33:58] would it be nicer to move <%= @image %> \ to the very end anyways? [21:34:45] It has to be between the docker flags and the flags to buildkitd itself (which start with --addr) [21:35:05] ack [21:36:41] (03PS1) 10Dzahn: buildkitd: re-add <%= @image %> to the docker ExecStart command line [puppet] - 10https://gerrit.wikimedia.org/r/840234 (https://phabricator.wikimedia.org/T317997) [21:37:11] (03CR) 10Ahmon Dancy: [C: 03+1] buildkitd: re-add <%= @image %> to the docker ExecStart command line [puppet] - 10https://gerrit.wikimedia.org/r/840234 (https://phabricator.wikimedia.org/T317997) (owner: 10Dzahn) [21:37:26] (03CR) 10Dzahn: [C: 03+2] buildkitd: re-add <%= @image %> to the docker ExecStart command line [puppet] - 10https://gerrit.wikimedia.org/r/840234 (https://phabricator.wikimedia.org/T317997) (owner: 10Dzahn) [21:39:11] RECOVERY - Check systemd state on gitlab-runner1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:29] (03PS1) 10Stang: trwikivoyage: Install WikiLove extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/840235 (https://phabricator.wikimedia.org/T319537) [21:39:41] RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:41:25] RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:41:42] dancy: it's running again and with --env-file /etc/default/buildkitd [21:42:05] ok.. 
I'll look into BUILDKITD_HOS [21:42:07] T [21:42:15] that file has the settings but it's just the non-capitalized versions [21:42:16] (03CR) 10Stang: "See T319537#8301427" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/840235 (https://phabricator.wikimedia.org/T319537) (owner: 10Stang) [21:42:56] dancy: does this actually work? BUILDKIT_HOST=tcp://buildkitd.gitlab-runner:1234" [21:43:06] just ending in .gitlab-runner [21:44:38] (03PS10) 10Jdlrobson: Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) [21:45:30] https://www.irccloud.com/pastebin/0PnzyVns/ [21:46:05] :) ok, thanks [21:53:01] Hi, does anyone here know what is the release schedule of the mobileapps service? I can see that in the past it has been at a sub-month frequency. The thing is that a pretty important fix (for end users, not ops) landed more than a month ago. It makes page previews show flagged revision instead of latest. Right now on prod vandalisms are shown even when the changes have not been reviewed. [21:55:07] Current version is `2022-08-16-171635-production`, latest docker label is `2022-10-03-111410-production` [21:56:28] Of course this only affects wikis with flagged revisions + it's not a regression, it's been like that since launch... 
[21:58:14] nemo-yiannis: ^ [21:59:44] The fix for reference: https://gerrit.wikimedia.org/r/c/mediawiki/services/mobileapps/+/821218 [22:00:25] Lazowik: maybe try an email to https://meta.wikimedia.org/wiki/User:JGiannelos_(WMF) [22:00:33] I just say that based on logs [22:04:10] (03CR) 10Ahmon Dancy: [C: 03+1] P:gitlab::runner: Provide proxy variables to runner jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [22:04:20] mutante: ^ [22:10:21] (03CR) 10Dzahn: [C: 03+2] P:gitlab::runner: Provide proxy variables to runner jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [22:11:13] RECOVERY - ElasticSearch setting check - 9600 on elastic2080 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [22:12:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:13:26] (03CR) 10Ahmon Dancy: [C: 03+1] P:gitlab::runner: Provide proxy variables to runner jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [22:15:49] RECOVERY - ElasticSearch setting check - 9600 on elastic2075 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:16] dancy: fully agree with your comment, let's continue after the weekend [22:17:39] RECOVERY - ElasticSearch setting check - 9400 on elastic2073 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [22:18:35] Have a good one! [22:19:22] you too. I am leaving IRC, watching the staff meeting, then off. 
cu [22:26:21] PROBLEM - SSH on mw1328.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:26:29] RECOVERY - ElasticSearch setting check - 9400 on elastic2086 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [22:31:01] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:36:23] RECOVERY - ElasticSearch setting check - 9600 on elastic2083 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [22:36:23] RECOVERY - ElasticSearch setting check - 9600 on elastic2076 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [22:40:27] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:57:12] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [23:02:15] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:27:29] RECOVERY - SSH on mw1328.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook