[00:00:25] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1226']
[00:00:27] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1227']
[00:00:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1231']
[00:00:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1232']
[00:01:37] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1228']
[00:01:40] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1230']
[00:01:48] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1233']
[00:02:01] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1229']
[00:02:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:04:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:09:36] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1231']
[00:10:10] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1233']
[00:10:36] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1232']
[00:13:10] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (Jclark-ctr)
[00:26:53] <wikibugs>	 (PS2) Ebernhardson: Draft: Pull some flink config down into the chart [deployment-charts] - https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315)
[00:26:55] <wikibugs>	 (PS1) Ebernhardson: Draft: flink-app: Provide kafka hosts as properties file [deployment-charts] - https://gerrit.wikimedia.org/r/959066
[00:38:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pki1002.eqiad.wmnet with OS bullseye
[00:38:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pki1002.eqiad.wmnet with OS bullseye
[00:38:40] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/958970
[00:38:46] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/958970 (owner: TrainBranchBot)
[00:42:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:43:31] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:47:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:47:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:49:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:49:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:50:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:51:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:51:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pki1002.eqiad.wmnet with reason: host reimage
[00:52:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:53:34] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/958970 (owner: TrainBranchBot)
[00:53:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:54:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:54:30] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki1002.eqiad.wmnet with reason: host reimage
[00:54:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:57:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:58:27] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:08:41] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:12:47] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:12:48] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pki1002.eqiad.wmnet with OS bullseye
[01:12:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pki1002.eqiad.wmnet with OS bullseye completed: - pki1002 (**PASS**)   - R...
[01:13:12] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (Jhancock.wm)
[01:13:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (Jhancock.wm) Open→Resolved @joanna_borun all finished
[01:40:13] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:44:27] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:46:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:48:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:48:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:49:19] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:49:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:50:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:55:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:59:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:07:36] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:12:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:22:36] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:23:32] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[02:23:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:25:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:37:36] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:42:28] <wikibugs>	 (Abandoned) Krinkle: speed-tests: Test selector changes [mediawiki-config] - https://gerrit.wikimedia.org/r/912366 (owner: Krinkle)
[02:46:51] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:48:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:49:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:50:51] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:51:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:52:31] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:53:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:55:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:58:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:03:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:39:01] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:40:25] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:40:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:41:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:55:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:55:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:56:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:57:25] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:15:51] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:23:19] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (cloudservices2004-dev), Fresh: 125 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:52:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:52:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:53:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:54:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:55:07] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:55:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:55:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:55:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:00:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:03:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:03:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:04:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:06:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:07:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:15:39] <icinga-wm>	 PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100%
[05:15:51] <icinga-wm>	 PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:19:15] <icinga-wm>	 PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[05:25:43] <icinga-wm>	 RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 38, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:43] <icinga-wm>	 RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 76.96 ms
[05:30:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:30:17] <icinga-wm>	 RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 72.19 ms
[05:30:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:31:27] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:32:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:36:27] <wikibugs>	 SRE-OnFire, Traffic, conftool, serviceops, and 2 others: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (Joe) Open→Resolved AIUI this is now resolved
[05:45:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:47:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:49:45] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:54:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:55:39] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:57:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:57:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T0600)
[06:00:59] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:03:41] <icinga-wm>	 PROBLEM - Check systemd state on idm2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync_bitu_username_block.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:04:51] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:06:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:07:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:08:55] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:23:32] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[06:28:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:30:25] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:38:09] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:40:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1011.eqiad.wmnet
[06:40:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:42:21] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:42:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:42:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1011.eqiad.wmnet
[06:43:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:45:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1010.eqiad.wmnet
[06:47:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:50:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:50:32] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] C:idm::redis bind to both IPv4 and IPv6 [puppet] - https://gerrit.wikimedia.org/r/958936 (owner: Slyngshede)
[06:50:50] <wikibugs>	 (PS1) Elukey: Replace yaml load() calls with safe_load() [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/959080
[06:51:19] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:51:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1010.eqiad.wmnet
[06:51:42] <wikibugs>	 (PS2) Elukey: Replace yaml load() calls with safe_load() [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/959080
[06:52:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1009.eqiad.wmnet
[06:54:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:55:25] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:55:35] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:55:57] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:56:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1009.eqiad.wmnet
[06:56:26] <wikibugs>	 (Abandoned) Slyngshede: C:idm::redis Allow replication via IPv6 [puppet] - https://gerrit.wikimedia.org/r/958907 (owner: Slyngshede)
[06:57:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:58:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:58:58] <wikibugs>	 (PS1) Slyngshede: R:IDM Switch idm1001 to install as package. [puppet] - https://gerrit.wikimedia.org/r/959145
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:25] <taavi>	 morning. I'll deploy some patches of my own
[07:00:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1008.eqiad.wmnet
[07:02:19] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/959042 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[07:02:21] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/959043 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[07:02:56] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43407/console" [puppet] - https://gerrit.wikimedia.org/r/959145 (owner: Slyngshede)
[07:03:25] <wikibugs>	 (Merged) jenkins-bot: Set READ_NEW for Wikitech on OATHAuth multiple devices migration [mediawiki-config] - https://gerrit.wikimedia.org/r/959042 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[07:03:29] <wikibugs>	 (Merged) jenkins-bot: Set WRITE_NEW for OATHAuth multiple devices on fishbowls/privates [mediawiki-config] - https://gerrit.wikimedia.org/r/959043 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[07:03:59] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:05:06] <logmsgbot>	 !log taavi@deploy2002 Started scap: Backport for [[gerrit:959042|Set READ_NEW for Wikitech on OATHAuth multiple devices migration (T242031)]], [[gerrit:959043|Set WRITE_NEW for OATHAuth multiple devices on fishbowls/privates (T242031)]]
[07:05:16] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[07:05:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1008.eqiad.wmnet
[07:06:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1007.eqiad.wmnet
[07:06:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:08:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:08:36] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] R:IDM Switch idm1001 to install as package. [puppet] - https://gerrit.wikimedia.org/r/959145 (owner: Slyngshede)
[07:09:19] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.hosts.reimage for host idm1001.wikimedia.org with OS bookworm
[07:09:28] <wikibugs>	 SRE, Bitu, Infrastructure-Foundations: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host idm1001.wikimedia.org with OS bookworm
[07:10:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1007.eqiad.wmnet
[07:14:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1005.eqiad.wmnet
[07:15:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1005.eqiad.wmnet
[07:16:35] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:22:03] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on idm1001.wikimedia.org with reason: host reimage
[07:22:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-web1001.eqiad.wmnet
[07:24:35] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idm1001.wikimedia.org with reason: host reimage
[07:24:54] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.458 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:25:14] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:26:54] <logmsgbot>	 !log taavi@deploy2002 taavi: Backport for [[gerrit:959042|Set READ_NEW for Wikitech on OATHAuth multiple devices migration (T242031)]], [[gerrit:959043|Set WRITE_NEW for OATHAuth multiple devices on fishbowls/privates (T242031)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:26:59] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[07:28:21] <logmsgbot>	 !log taavi@deploy2002 taavi: Continuing with sync
[07:28:41] <wikibugs>	 (PS1) Stevemunene: Bring druid1009.equad.wmnet into service [puppet] - https://gerrit.wikimedia.org/r/959147 (https://phabricator.wikimedia.org/T336042)
[07:28:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-web1001.eqiad.wmnet
[07:29:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-test-ui1001.eqiad.wmnet
[07:30:28] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:31:28] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:33:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-ui1001.eqiad.wmnet
[07:34:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:34:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:34:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2014.codfw.wmnet
[07:34:50] <icinga-wm>	 RECOVERY - Check systemd state on idm1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:35:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:36:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:38:58] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:39:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2014.codfw.wmnet
[07:39:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:40:12] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:41:16] <logmsgbot>	 !log taavi@deploy2002 Finished scap: Backport for [[gerrit:959042|Set READ_NEW for Wikitech on OATHAuth multiple devices migration (T242031)]], [[gerrit:959043|Set WRITE_NEW for OATHAuth multiple devices on fishbowls/privates (T242031)]] (duration: 36m 09s)
[07:41:21] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[07:41:36] <wikibugs>	 SRE, ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (MoritzMuehlenhoff) >>! In T341546#9178623, @Jhancock.wm wrote: > yes, it's a bios setting. so it would require a reboot to apply. I should have caught that when I was fixing it the first time around so that's...
[07:42:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:42:46] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idm1001.wikimedia.org with OS bookworm
[07:42:52] <wikibugs>	 SRE, Bitu, Infrastructure-Foundations: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host idm1001.wikimedia.org with OS bookworm completed: - idm1001 (**PASS**)   - Downtimed...
[07:43:50] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:44:36] <wikibugs>	 (PS1) Muehlenhoff: buster updates [puppet] - https://gerrit.wikimedia.org/r/959148
[07:44:54] <wikibugs>	 (PS2) Muehlenhoff: buster updates [puppet] - https://gerrit.wikimedia.org/r/959148
[07:45:59] <jinxer-wm>	 (PuppetDisabled) firing: Puppet disabled on puppetdb2002:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[07:46:18] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:46:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:46:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:47:49] <wikibugs>	 (PS12) Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233)
[07:47:51] <wikibugs>	 (PS1) Jcrespo: bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring [puppet] - https://gerrit.wikimedia.org/r/959149 (https://phabricator.wikimedia.org/T339894)
[07:47:53] <wikibugs>	 (CR) Muehlenhoff: [C: +2] buster updates [puppet] - https://gerrit.wikimedia.org/r/959148 (owner: Muehlenhoff)
[07:48:15] <wikibugs>	 (PS13) Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233)
[07:48:27] <wikibugs>	 (PS1) Slyngshede: idm: switch back to idm1001 as primary. [dns] - https://gerrit.wikimedia.org/r/959150
[07:48:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:50:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:50:50] <wikibugs>	 (PS8) Stevemunene: admin: Create analytics-wmde system user and airflow admin group [puppet] - https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648)
[07:50:59] <jinxer-wm>	 (PuppetDisabled) firing: (2) Puppet disabled on puppetdb1002:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[07:51:02] <wikibugs>	 (PS1) Slyngshede: P:IDM Switch production back to idm1001 [puppet] - https://gerrit.wikimedia.org/r/959151
[07:57:17] <wikibugs>	 (CR) Stevemunene: [V: +1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43408/console" [puppet] - https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: Stevemunene)
[07:57:54] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good (we don't strictly need to move back except validating the new server works fine, the active IDM can be floating freely between" [puppet] - https://gerrit.wikimedia.org/r/959151 (owner: Slyngshede)
[07:59:00] <wikibugs>	 (CR) Stevemunene: [V: +1] admin: Create analytics-wmde system user and airflow admin group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: Stevemunene)
[08:00:05] <jouncebot>	 brennen and jnuche: That opportune time is upon us again. Time for a MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T0800).
[08:01:21] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "All approvals are in and the patch looks fine" [puppet] - https://gerrit.wikimedia.org/r/955940 (https://phabricator.wikimedia.org/T345877) (owner: Vgutierrez)
[08:02:11] <moritzm>	 !log installing libwebp security updates on buster
[08:02:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:14] <wikibugs>	 (CR) Marostegui: [C: +1] dbbackups: Add new check (focused on ES) of long running backups [puppet] - https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: Jcrespo)
[08:04:47] <wikibugs>	 (CR) Slyngshede: [C: +2] idm: switch back to idm1001 as primary. [dns] - https://gerrit.wikimedia.org/r/959150 (owner: Slyngshede)
[08:04:56] <wikibugs>	 (PS2) Stevemunene: Bring druid1009.eqiad.wmnet into service [puppet] - https://gerrit.wikimedia.org/r/959147 (https://phabricator.wikimedia.org/T336042)
[08:07:04] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:07:41] <wikibugs>	 (CR) Slyngshede: [C: +2] P:IDM Switch production back to idm1001 [puppet] - https://gerrit.wikimedia.org/r/959151 (owner: Slyngshede)
[08:08:38] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] o11y: complement prometheus alerting rules [alerts] - https://gerrit.wikimedia.org/r/958929 (owner: Filippo Giunchedi)
[08:08:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] "Thank you for the quick reviews folks" [alerts] - https://gerrit.wikimedia.org/r/958929 (owner: Filippo Giunchedi)
[08:08:52] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2 C: +2] o11y: complement prometheus alerting rules [alerts] - https://gerrit.wikimedia.org/r/958929 (owner: Filippo Giunchedi)
[08:09:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:09:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:10:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:10:31] <moritzm>	 !log restarting FPM on mw* to pick up libwebp security updates
[08:10:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:18] <wikibugs>	 (CR) DCausse: [C: +1] rdf-streaming-updater: start adding per-env ZK path root [deployment-charts] - https://gerrit.wikimedia.org/r/957967 (https://phabricator.wikimedia.org/T342149) (owner: Bking)
[08:12:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:13:11] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on idm2001 is CRITICAL: CRITICAL - degraded: The following units failed: expire_bitu_signups.service,sync_bitu_username_block.service Slyngshede Switch over https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:13:16] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (hashar) My intent was to let @Mabualruz run a backport during the training which in turns require access to the deployment group hence why I came back...
[08:15:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on puppetdb2002.codfw.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway
[08:16:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on puppetdb2002.codfw.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway
[08:16:19] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=11ec6d55-6d8f-4537-a398-4863d7f38c9c) set by jmm@cumin2002 for...
[08:16:22] <icinga-wm>	 RECOVERY - Check systemd state on idm2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:16:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on puppetdb1002.eqiad.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway
[08:17:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on puppetdb1002.eqiad.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway
[08:17:17] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=708cd0d4-307e-4f35-acfa-ddae4ae88236) set by jmm@cumin2002 for...
[08:17:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:19:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:20:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry
[08:20:33] <wikibugs>	 (PS1) Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - https://gerrit.wikimedia.org/r/959155
[08:20:35] <wikibugs>	 (CR) Jcrespo: [C: +2] dbbackups: Add new check (focused on ES) of long running backups [puppet] - https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: Jcrespo)
[08:21:00] <wikibugs>	 (CR) CI reject: [V: -1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - https://gerrit.wikimedia.org/r/959155 (owner: Slyngshede)
[08:21:07] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2023-09-13-074325-production [deployment-charts] - https://gerrit.wikimedia.org/r/959156 (https://phabricator.wikimedia.org/T346045)
[08:21:59] <wikibugs>	 (PS1) Phedenskog: alertmanager: setup QTE mailing group. [puppet] - https://gerrit.wikimedia.org/r/959157 (https://phabricator.wikimedia.org/T346870)
[08:22:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry
[08:23:07] <wikibugs>	 (CR) Vgutierrez: [C: +2] admin: Grant shell access to acooper [puppet] - https://gerrit.wikimedia.org/r/955940 (https://phabricator.wikimedia.org/T345877) (owner: Vgutierrez)
[08:23:57] <wikibugs>	 (PS2) Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - https://gerrit.wikimedia.org/r/959155
[08:24:19] <wikibugs>	 (PS1) Filippo Giunchedi: thanos: read-only access for thanos.w.o/bucket [puppet] - https://gerrit.wikimedia.org/r/959159
[08:24:28] <wikibugs>	 (CR) CI reject: [V: -1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - https://gerrit.wikimedia.org/r/959155 (owner: Slyngshede)
[08:25:09] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43410/console" [puppet] - https://gerrit.wikimedia.org/r/959155 (owner: Slyngshede)
[08:27:23] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: cloudservices1005: prepare for reimage and back into service [puppet] - https://gerrit.wikimedia.org/r/958915 (https://phabricator.wikimedia.org/T346042)
[08:28:20] <wikibugs>	 (CR) JMeybohm: "Nice, thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) (owner: Ebernhardson)
[08:28:36] <wikibugs>	 (PS3) Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - https://gerrit.wikimedia.org/r/959155
[08:29:08] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudservices1005: prepare for reimage and back into service [puppet] - https://gerrit.wikimedia.org/r/958915 (https://phabricator.wikimedia.org/T346042) (owner: Arturo Borrero Gonzalez)
[08:30:02] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudservices1005
[08:30:14] <logmsgbot>	 !log aborrero@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudservices1005
[08:30:45] <wikibugs>	 (CR) CI reject: [V: -1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - https://gerrit.wikimedia.org/r/959155 (owner: Slyngshede)
[08:30:59] <wikibugs>	 SRE, SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (Vgutierrez) Open→Resolved a:Vgutierrez Patch has been merged, it should be effective in ~30 minutes when puppet runs. @acooper should h...
[08:31:13] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudservices1005
[08:31:33] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudservices1005
[08:31:38] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[08:32:02] <wikibugs>	 SRE, SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (Vgutierrez)
[08:32:59] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:33:21] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1005.eqiad.wmnet with OS bullseye
[08:33:29] <wikibugs>	 SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye
[08:34:22] <wikibugs>	 (CR) Fabfur: [V: +2 C: +2] add simple Makefile and README [software/purged] - https://gerrit.wikimedia.org/r/958477 (owner: Fabfur)
[08:36:14] <wikibugs>	 (CR) Jcrespo: [C: +2] bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring [puppet] - https://gerrit.wikimedia.org/r/959149 (https://phabricator.wikimedia.org/T339894) (owner: Jcrespo)
[08:36:57] <godog>	 !log stop benthos@webrequest_live.service on centrallog1002 to test redundancy/capacity - T346871
[08:37:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:03] <stashbot>	 T346871: Test benthos webrequest_live with only one host - https://phabricator.wikimedia.org/T346871
[08:39:22] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 125 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:39:22] <wikibugs>	 (PS4) Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - https://gerrit.wikimedia.org/r/959155
[08:40:17] <klausman>	 !log Draining ml-serve1002 for kubelet partition increase (T339231)
[08:40:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:22] <logmsgbot>	 !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudservices1005.eqiad.wmnet with OS bullseye
[08:40:23] <stashbot>	 T339231: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231
[08:40:30] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye exe...
[08:40:34] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1005.eqiad.wmnet with OS bullseye
[08:40:43] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye
[08:41:11] <wikibugs>	 (03PS2) 10Fabfur: add Dockerfile just for build [software/purged] - 10https://gerrit.wikimedia.org/r/959049
[08:41:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede)
[08:42:36] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:43:26] <wikibugs>	 (03PS5) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155
[08:45:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede)
[08:46:53] <wikibugs>	 (03PS3) 10Fabfur: add Dockerfile just for build [software/purged] - 10https://gerrit.wikimedia.org/r/959049 (https://phabricator.wikimedia.org/T346874)
[08:47:09] <godog>	 !log temp bump threads to 15 for benthos@webrequest_live on centrallog2002 - T346871
[08:47:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:47:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:14] <stashbot>	 T346871: Test benthos webrequest_live with only one host - https://phabricator.wikimedia.org/T346871
[08:47:20] <wikibugs>	 10SRE, 10User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (10aborrero)
[08:47:50] <klausman>	 !log Draining ml-serve1003 for kubelet partition increase (T339231)
[08:47:54] <wikibugs>	 (03PS6) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155
[08:47:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:55] <stashbot>	 T339231: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231
[08:48:42] <wikibugs>	 (03PS1) 10JMeybohm: kubernetes::node: Reserve CPU resources for system daemons [puppet] - 10https://gerrit.wikimedia.org/r/959164 (https://phabricator.wikimedia.org/T277876)
[08:48:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:49:58] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:50:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede)
[08:50:08] <wikibugs>	 (03PS1) 10Fabfur: makefile: do not fail on already removed files [software/purged] - 10https://gerrit.wikimedia.org/r/959165
[08:50:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:50:48] <wikibugs>	 10SRE, 10Cloud-VPS, 10User-aborrero: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10aborrero)
[08:51:06] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43411/console" [puppet] - 10https://gerrit.wikimedia.org/r/959164 (https://phabricator.wikimedia.org/T277876) (owner: 10JMeybohm)
[08:51:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:52:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:53:18] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] add Dockerfile just for build (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/959049 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur)
[08:53:40] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Separate ensure file and 'ensure' job as followup to 51607b8 [puppet] - 10https://gerrit.wikimedia.org/r/959166 (https://phabricator.wikimedia.org/T346233)
[08:54:50] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Separate ensure file and 'ensure' job as followup to 51607b8 [puppet] - 10https://gerrit.wikimedia.org/r/959166 (https://phabricator.wikimedia.org/T346233)
[08:54:53] <wikibugs>	 (03PS7) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155
[08:55:04] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959166 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo)
[08:57:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede)
[08:57:03] <klausman>	 !log Draining ml-serve1004 for kubelet partition increase (T339231)
[08:57:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:09] <stashbot>	 T339231: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231
[08:57:58] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] dbbackups: Separate ensure file and 'ensure' job as followup to 51607b8 [puppet] - 10https://gerrit.wikimedia.org/r/959166 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo)
[08:58:51] <wikibugs>	 (03PS4) 10Fabfur: add Dockerfile just for build [software/purged] - 10https://gerrit.wikimedia.org/r/959049 (https://phabricator.wikimedia.org/T346874)
[08:58:55] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero)
[08:59:45] <wikibugs>	 (03CR) 10Fabfur: add Dockerfile just for build (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/959049 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur)
[08:59:59] <godog>	 !log restore benthos@webrequest_live running on both centrallog hosts - T346871
[09:00:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:10] <stashbot>	 T346871: Test benthos webrequest_live with only one host - https://phabricator.wikimedia.org/T346871
[09:00:26] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack::util::patch: add define [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro)
[09:01:02] <wikibugs>	 (03PS8) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155
[09:01:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:01:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede)
[09:02:36] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:02:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:03:01] <wikibugs>	 (03PS9) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155
[09:03:44] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Fix bug on config directory path: defaults -> default [puppet] - 10https://gerrit.wikimedia.org/r/959167 (https://phabricator.wikimedia.org/T346233)
[09:04:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dbbackups: Fix bug on config directory path: defaults -> default [puppet] - 10https://gerrit.wikimedia.org/r/959167 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo)
[09:04:27] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: recursor: fix list of pdns hosts [puppet] - 10https://gerrit.wikimedia.org/r/959168 (https://phabricator.wikimedia.org/T346042)
[09:04:40] <wikibugs>	 (03PS2) 10Fabfur: makefile: do not fail on already removed files [software/purged] - 10https://gerrit.wikimedia.org/r/959165
[09:04:47] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: recursor: fix list of pdns hosts [puppet] - 10https://gerrit.wikimedia.org/r/959168 (https://phabricator.wikimedia.org/T346042)
[09:05:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede)
[09:05:25] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Fix bug on config directory path: defaults -> default [puppet] - 10https://gerrit.wikimedia.org/r/959167 (https://phabricator.wikimedia.org/T346233)
[09:05:41] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959167 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo)
[09:06:00] <klausman>	 !log Draining ml-serve1005 for kubelet partition increase (T339231)
[09:06:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:06] <stashbot>	 T339231: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231
[09:06:13] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959168 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[09:06:23] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur)
[09:08:00] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] add Dockerfile just for build [software/purged] - 10https://gerrit.wikimedia.org/r/959049 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur)
[09:08:34] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: pdns: recursor: fix list of pdns hosts [puppet] - 10https://gerrit.wikimedia.org/r/959168 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[09:08:47] <fabfur>	 !log applied patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/957292 (T344175) to add new mobile redirect domains to Varnish. Changes will be applied automatically by puppet on all cp hosts 
[09:08:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:53] <stashbot>	 T344175: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175
[09:09:16] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] dbbackups: Fix bug on config directory path: defaults -> default [puppet] - 10https://gerrit.wikimedia.org/r/959167 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo)
[09:09:37] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] add Dockerfile just for build [software/purged] - 10https://gerrit.wikimedia.org/r/959049 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur)
[09:09:39] <wikibugs>	 (03PS10) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155
[09:09:43] <wikibugs>	 (03CR) 10Fabfur: [V: 03+2 C: 03+2] add Dockerfile just for build [software/purged] - 10https://gerrit.wikimedia.org/r/959049 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur)
[09:09:57] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices1005.eqiad.wmnet with reason: host reimage
[09:11:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede)
[09:12:09] <wikibugs>	 10SRE, 10Traffic: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (10Fabfur) Hi @nshahquinn-wmf , changes to the first batch of domains (https://gerrit.wikimedia.org/r/c/operations/puppet/+/957292) should be applied during the next 30'. If you notice something strange p...
[09:12:15] <wikibugs>	 (03PS1) 10JMeybohm: kubernetes: Make control_plane_class_name mandatory [puppet] - 10https://gerrit.wikimedia.org/r/959170
[09:12:26] <wikibugs>	 (03CR) 10Vgutierrez: allow to specify buffer size for backend, frontend or both (033 comments) [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (owner: 10Fabfur)
[09:12:47] <wikibugs>	 (03CR) 10Vgutierrez: allow to specify buffer size for backend, frontend or both (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (owner: 10Fabfur)
[09:13:03] <wikibugs>	 (03PS11) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155
[09:13:04] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1005.eqiad.wmnet with reason: host reimage
[09:15:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:15:38] <klausman>	 !log Draining ml-serve1006 for kubelet partition increase (T339231)
[09:15:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:44] <stashbot>	 T339231: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231
[09:16:30] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43413/console" [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede)
[09:16:31] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:17:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: alertmanager: setup QTE mailing group. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959157 (https://phabricator.wikimedia.org/T346870) (owner: 10Phedenskog)
[09:17:38] <wikibugs>	 (03CR) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede)
[09:18:24] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43412/console" [puppet] - 10https://gerrit.wikimedia.org/r/959170 (owner: 10JMeybohm)
[09:21:42] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: services: enable cloud-private for cloudservices1005 [puppet] - 10https://gerrit.wikimedia.org/r/959171 (https://phabricator.wikimedia.org/T346042)
[09:22:21] <wikibugs>	 (03CR) 10Fabfur: allow to specify buffer size for backend, frontend or both (034 comments) [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (owner: 10Fabfur)
[09:22:23] <wikibugs>	 (03PS17) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160
[09:22:59] <wikibugs>	 (03PS1) 10David Caro: replica_cnf_api: handle the mysql hashing only at the api layer [puppet] - 10https://gerrit.wikimedia.org/r/959172 (https://phabricator.wikimedia.org/T345742)
[09:23:01] <wikibugs>	 (03PS2) 10Fabfur: allow to specify buffer size for backend, frontend or both [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (https://phabricator.wikimedia.org/T346874)
[09:24:00] <klausman>	 !log Draining ml-serve1007 for kubelet partition increase (T339231)
[09:24:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:06] <stashbot>	 T339231: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231
[09:25:45] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: services: enable cloud-private for cloudservices1005 [puppet] - 10https://gerrit.wikimedia.org/r/959171 (https://phabricator.wikimedia.org/T346042)
[09:27:27] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10Ladsgroup) Today we have the datacenter switchover.
[09:27:42] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959171 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[09:29:23] <klausman>	 !log Draining ml-serve1008 for kubelet partition increase (T339231)
[09:29:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:30] <stashbot>	 T339231: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231
[09:29:49] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: services: enable cloud-private for cloudservices1005 [puppet] - 10https://gerrit.wikimedia.org/r/959171 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[09:30:32] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 (owner: 10Slyngshede)
[09:31:55] <wikibugs>	 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloudswitch: codfw: figure out procurement - https://phabricator.wikimedia.org/T346724 (10cmooney) Hi @aborrero   We should order the same as we already have for cloudsw1-b1-codfw.  Which is Juniper QFX5120 (Broadcom Trident 3).  To be...
[09:32:10] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[09:33:12] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Improve ML team's SLO calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey)
[09:33:29] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:34:02] <logmsgbot>	 !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:34:08] <logmsgbot>	 !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:34:24] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudservices1005 - aborrero@cumin1001 - T346042"
[09:34:27] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+1] "LGTM based on other similar changes to remove Ferm syntax." [puppet] - 10https://gerrit.wikimedia.org/r/958962 (owner: 10Muehlenhoff)
[09:34:29] <stashbot>	 T346042: cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042
[09:35:11] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudservices1005 - aborrero@cumin1001 - T346042"
[09:36:14] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] "elukey@grafana1002:/srv/grafana-grizzly$ grr apply slo_dashboards.jsonnet" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey)
[09:38:46] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudservices1005 - aborrero@cumin1001 - T346042"
[09:39:46] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudservices1005 - aborrero@cumin1001 - T346042"
[09:39:52] <stashbot>	 T346042: cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042
[09:40:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:40:46] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] replica_cnf_api: handle the mysql hashing only at the api layer [puppet] - 10https://gerrit.wikimedia.org/r/959172 (https://phabricator.wikimedia.org/T345742) (owner: 10David Caro)
[09:41:04] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab-runner2004.codfw.wmnet with OS bullseye
[09:41:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:41:40] <wikibugs>	 (03PS1) 10Filippo Giunchedi: envoyproxy: make sure 'envoy' user exists on log directory creation [puppet] - 10https://gerrit.wikimedia.org/r/959174 (https://phabricator.wikimedia.org/T346129)
[09:46:27] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] replica_cnf_api: handle the mysql hashing only at the api layer [puppet] - 10https://gerrit.wikimedia.org/r/959172 (https://phabricator.wikimedia.org/T345742) (owner: 10David Caro)
[09:48:07] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on kafka-jumbo1003.eqiad.wmnet with reason: investigation by brouberol and elukey about kafka ACL issues that might be fixed by a broker restart
[09:48:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/959170 (owner: 10JMeybohm)
[09:48:31] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on kafka-jumbo1003.eqiad.wmnet with reason: investigation by brouberol and elukey about kafka ACL issues that might be fixed by a broker restart
[09:48:49] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes: Make control_plane_class_name mandatory [puppet] - 10https://gerrit.wikimedia.org/r/959170 (owner: 10JMeybohm)
[09:49:07] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:50:31] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:50:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:51:48] <wikibugs>	 (03PS3) 10Elukey: Lower ores.wikimedia.org's TTL to 5M [dns] - 10https://gerrit.wikimedia.org/r/957689 (https://phabricator.wikimedia.org/T341696)
[09:51:50] <wikibugs>	 (03PS2) 10Elukey: Set ores.wikimedia.org as CNAME for ores-legacy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/957690
[09:51:58] <wikibugs>	 (03CR) 10Elukey: Lower ores.wikimedia.org's TTL to 5M (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/957689 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey)
[09:52:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:54:32] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] Lower ores.wikimedia.org's TTL to 5M [dns] - 10https://gerrit.wikimedia.org/r/957689 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey)
[09:54:46] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin1001"
[09:55:13] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudservices1006: stop announcing ns0.openstack.eqiad1.wikimediacloud.org [puppet] - 10https://gerrit.wikimedia.org/r/959175 (https://phabricator.wikimedia.org/T346042)
[09:55:33] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin1001"
[09:55:34] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudservices1005.eqiad.wmnet with OS bullseye
[09:55:47] <wikibugs>	 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmne...
[09:56:07] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:56:38] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "Let's merge when we are happy cloudservices1005 is ready to take over, just before we configure cloudsw1-d5-eqiad to speak BGP to it." [puppet] - 10https://gerrit.wikimedia.org/r/959175 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[09:56:58] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner2004.codfw.wmnet with reason: host reimage
[09:57:31] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:57:48] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoyproxy: make sure 'envoy' user exists on log directory creation [puppet] - 10https://gerrit.wikimedia.org/r/959174 (https://phabricator.wikimedia.org/T346129) (owner: 10Filippo Giunchedi)
[09:58:01] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:58:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] envoyproxy: make sure 'envoy' user exists on log directory creation [puppet] - 10https://gerrit.wikimedia.org/r/959174 (https://phabricator.wikimedia.org/T346129) (owner: 10Filippo Giunchedi)
[09:59:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:59:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:00:02] <Emperor>	 !log ms-be10[61-75] swift package updates T346730
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T1000)
[10:00:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:17] <stashbot>	 T346730: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730
[10:00:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:01:23] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] Lower ores.wikimedia.org's TTL to 5M [dns] - 10https://gerrit.wikimedia.org/r/957689 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey)
[10:01:55] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner2004.codfw.wmnet with reason: host reimage
[10:02:23] <klausman>	 !log Merging change 957689 (T341696) to lower DNS TTL to 5m for ORES name.
[10:02:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:30] <stashbot>	 T341696: Zero traffic on bare metal ORES servers - https://phabricator.wikimedia.org/T341696
[10:02:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/959159 (owner: 10Filippo Giunchedi)
[10:02:59] <klausman>	 !log Running authdns-update to activate change 957689 (T341696)
[10:03:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: read-only access for thanos.w.o/bucket [puppet] - 10https://gerrit.wikimedia.org/r/959159 (owner: 10Filippo Giunchedi)
[10:04:08] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons.
[10:07:37] <wikibugs>	 (03PS2) 10David Caro: replica_cnf_api: handle the mysql hashing only at the api layer [puppet] - 10https://gerrit.wikimedia.org/r/959172 (https://phabricator.wikimedia.org/T345742)
[10:09:27] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: load allowmethods httpd module [puppet] - 10https://gerrit.wikimedia.org/r/959176
[10:12:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: load allowmethods httpd module [puppet] - 10https://gerrit.wikimedia.org/r/959176 (owner: 10Filippo Giunchedi)
[10:13:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:14:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:16:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] conntrackd: Add explicit check [puppet] - 10https://gerrit.wikimedia.org/r/958918 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[10:18:29] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner2004.codfw.wmnet with OS bullseye
[10:19:36] <wikibugs>	 (03PS5) 10Muehlenhoff: Switch cloudgw/codfw1dev to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497)
[10:21:38] <wikibugs>	 10SRE, 10Cloud-VPS, 10User-aborrero: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney)
[10:22:10] <wikibugs>	 10SRE, 10Cloud-VPS, 10User-aborrero: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney)
[10:22:16] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[10:22:49] <wikibugs>	 (03CR) 10Elukey: [C: 04-1] "The change would probably be a no-op as Tobias pointed out, we'd need a HTTP redirect of sort in this case. Or we should change the follow" [dns] - 10https://gerrit.wikimedia.org/r/957690 (owner: 10Elukey)
[10:23:32] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[10:23:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudweb2002-dev.wikimedia.org
[10:25:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Epic: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (10aborrero)
[10:26:26] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 3 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) 05Open→03Resolved a:03aborrero
[10:27:54] <wikibugs>	 10SRE, 10Cloud-VPS, 10User-aborrero: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10aborrero)
[10:29:09] <wikibugs>	 (03PS1) 10Muehlenhoff: profile::cumin::cloud_target: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/959179
[10:30:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb2002-dev.wikimedia.org
[10:34:02] <wikibugs>	 10SRE, 10Cloud-VPS, 10User-aborrero: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10aborrero) I did `update domains set master="172.20.1.5:5354 172.20.2.4:5354 185.15.56.162:5354 185.15.56.163:5354";` on the pdns DB in both cloudser...
[10:35:03] <wikibugs>	 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero)
[10:36:30] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959179 (owner: 10Muehlenhoff)
[10:37:53] <wikibugs>	 10SRE, 10Cloud-VPS, 10User-aborrero: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney)
[10:37:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:40:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:42:37] <wikibugs>	 (03PS1) 10Clément Goubert: eventgate: Update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345243)
[10:45:08] <wikibugs>	 (03CR) 10Vgutierrez: allow to specify buffer size for backend, frontend or both (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur)
[10:45:17] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:45:30] <wikibugs>	 (03PS1) 10Kamila Součková: geo-maps: reorder codfw/eqiad in the default [dns] - 10https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474)
[10:46:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:46:51] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (10kamila)
[10:47:24] <wikibugs>	 (03CR) 10Kamila Součková: [C: 04-2] "to be merged after the DC switchover" [dns] - 10https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková)
[10:47:49] <icinga-wm>	 PROBLEM - Memcached on cloudweb1003 is CRITICAL: connect to address 208.80.154.150 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[10:48:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudweb1003.wikimedia.org
[10:49:39] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:51:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:51:46] <wikibugs>	 (03CR) 10Kamila Součková: [C: 04-2] "@bblack: please let me know in case I should reorder any not-top-level things in addition to this" [dns] - 10https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková)
[10:52:01] <icinga-wm>	 RECOVERY - Memcached on cloudweb1003 is OK: TCP OK - 0.000 second response time on 208.80.154.150 port 11000 https://wikitech.wikimedia.org/wiki/Memcached
[10:52:30] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices1006: stop announcing ns0.openstack.eqiad1.wikimediacloud.org [puppet] - 10https://gerrit.wikimedia.org/r/959175 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[10:52:36] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:53:01] <wikibugs>	 (03PS2) 10Muehlenhoff: bastion: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/952459
[10:53:17] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952459 (owner: 10Muehlenhoff)
[10:55:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:55:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb1003.wikimedia.org
[10:56:01] <wikibugs>	 (03PS1) 10Ladsgroup: Add note that this repo has been archived [software/schema-changes] - 10https://gerrit.wikimedia.org/r/959183
[10:56:21] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Add note that this repo has been archived [software/schema-changes] - 10https://gerrit.wikimedia.org/r/959183 (owner: 10Ladsgroup)
[10:56:27] <wikibugs>	 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) 05In progress→03Resolved
[10:56:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:56:37] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[10:57:36] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:58:58] <wikibugs>	 (03PS1) 10Clément Goubert: eventgate-logging-external: Remove CPU limit for tls-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/959184 (https://phabricator.wikimedia.org/T345243)
[10:59:05] <wikibugs>	 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon) 05Open→03Resolved For reference, we ended up also having to deal with a spurious "config file changed" from openssh-server, so the rune used was of the form ` sudo cumin -b...
[10:59:37] <wikibugs>	 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon)
[11:00:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:02:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:02:20] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: remove CNAME for openstack.eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/959186 (https://phabricator.wikimedia.org/T346439)
[11:02:43] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] geo-maps: reorder codfw/eqiad in the default [dns] - 10https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková)
[11:05:29] <wikibugs>	 (03PS2) 10Clément Goubert: eventgate: Update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244)
[11:05:31] <wikibugs>	 (03PS2) 10Clément Goubert: eventgate-logging-external: Remove CPU limit for tls-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/959184 (https://phabricator.wikimedia.org/T345244)
[11:07:06] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: introduce openstack.eqiad1 endpoint with cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/959187 (https://phabricator.wikimedia.org/T346439)
[11:07:36] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:10:45] <wikibugs>	 (03PS1) 10Gmodena: mw-page-content-change: fix swift egress rules. [deployment-charts] - 10https://gerrit.wikimedia.org/r/959189 (https://phabricator.wikimedia.org/T346877)
[11:10:47] <wikibugs>	 (03PS3) 10Stevemunene: druid: Bring druid1009.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/959147 (https://phabricator.wikimedia.org/T336042)
[11:11:07] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimediacloud.org: remove CNAME for openstack.eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/959186 (https://phabricator.wikimedia.org/T346439) (owner: 10Arturo Borrero Gonzalez)
[11:11:14] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: introduce openstack.eqiad1 endpoint with cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/959187 (https://phabricator.wikimedia.org/T346439) (owner: 10Arturo Borrero Gonzalez)
[11:11:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mw-page-content-change: fix swift egress rules. [deployment-charts] - 10https://gerrit.wikimedia.org/r/959189 (https://phabricator.wikimedia.org/T346877) (owner: 10Gmodena)
[11:11:34] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[11:13:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[11:13:41] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack.eqiad1 - aborrero@cumin1001"
[11:14:04] <wikibugs>	 (03CR) 10David Caro: openstack: eqiad1: introduce openstack.eqiad1 endpoint with cloudlb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959187 (https://phabricator.wikimedia.org/T346439) (owner: 10Arturo Borrero Gonzalez)
[11:14:30] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack.eqiad1 - aborrero@cumin1001"
[11:14:30] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:17:04] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.wipe-cache openstack.eqiad1.wikimediacloud.org on all recursors
[11:17:08] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) openstack.eqiad1.wikimediacloud.org on all recursors
[11:18:22] <wikibugs>	 (03CR) 10Btullis: "Looks good. I would do a pcc run against the new host, plus I would check whether there is any immediate impact on the LVS servers like lv" [puppet] - 10https://gerrit.wikimedia.org/r/959147 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene)
[11:20:07] <logmsgbot>	 !log gmodena@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[11:20:15] <logmsgbot>	 !log gmodena@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[11:21:24] <wikibugs>	 10SRE, 10User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (10Volans) @aborrero was the host that you decommissioned reachable (as in, was the wipefs performed)? This is the current wipefs command that we exe...
[11:24:42] <wikibugs>	 (03PS1) 10Slyngshede: C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194
[11:24:58] <wikibugs>	 (03PS2) 10Gmodena: mw-page-content-change: fix swift egress rules. [deployment-charts] - 10https://gerrit.wikimedia.org/r/959189 (https://phabricator.wikimedia.org/T346877)
[11:25:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede)
[11:25:43] <wikibugs>	 (03PS2) 10Slyngshede: C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194
[11:26:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede)
[11:29:07] <wikibugs>	 (03PS3) 10Slyngshede: C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194
[11:29:49] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43416/console" [puppet] - 10https://gerrit.wikimedia.org/r/959147 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene)
[11:29:51] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/959189 (https://phabricator.wikimedia.org/T346877) (owner: 10Gmodena)
[11:33:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudweb1004.wikimedia.org
[11:33:35] <wikibugs>	 10SRE, 10User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (10aborrero) >>! In T346875#9182537, @Volans wrote: >  > Could you give me the hostname of the decommissioned host so I can have a look at the logs?...
[11:33:37] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43418/console" [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede)
[11:37:30] <wikibugs>	 (03CR) 10Fabfur: allow to specify buffer size for backend, frontend or both (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur)
[11:39:32] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol)
[11:39:57] <wikibugs>	 (03PS4) 10Slyngshede: C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194
[11:40:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb1004.wikimedia.org
[11:41:12] <wikibugs>	 (03CR) 10Muehlenhoff: C:idm:jobs Use bitu command for systemd jobs. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede)
[11:42:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede)
[11:42:08] <wikibugs>	 (03PS11) 10Brouberol: Define a script in charge of checking the kafka broker in sync status [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741)
[11:42:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:42:38] <wikibugs>	 (03PS3) 10Fabfur: allow to specify buffer size for backend and frontend [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (https://phabricator.wikimedia.org/T346874)
[11:43:03] <wikibugs>	 (03CR) 10Gmodena: [C: 03+2] mw-page-content-change: fix swift egress rules. [deployment-charts] - 10https://gerrit.wikimedia.org/r/959189 (https://phabricator.wikimedia.org/T346877) (owner: 10Gmodena)
[11:43:31] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:43:52] <wikibugs>	 (03Merged) 10jenkins-bot: mw-page-content-change: fix swift egress rules. [deployment-charts] - 10https://gerrit.wikimedia.org/r/959189 (https://phabricator.wikimedia.org/T346877) (owner: 10Gmodena)
[11:44:07] <wikibugs>	 (03CR) 10Brouberol: "cc-ing Brian as we were talking about similarities in design between kafka & ES, and about the fact that kafka does not give you any built" [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol)
[11:44:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:45:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:48:17] <wikibugs>	 (03PS1) 10Majavah: hieradata: fix eqiad1 rabbitmq firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/959198 (https://phabricator.wikimedia.org/T345610)
[11:49:18] <logmsgbot>	 !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[11:49:24] <logmsgbot>	 !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[11:49:30] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43420/console" [puppet] - 10https://gerrit.wikimedia.org/r/959198 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah)
[11:51:03] <wikibugs>	 (03CR) 10David Caro: "LGTM, though pcc only changes cloudservices, not cloudcontrol, is that ok?" [puppet] - 10https://gerrit.wikimedia.org/r/959198 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah)
[11:51:42] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] hieradata: fix eqiad1 rabbitmq firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/959198 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah)
[11:52:14] <wikibugs>	 (03PS5) 10Slyngshede: C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194
[11:54:41] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] hieradata: fix eqiad1 rabbitmq firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/959198 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah)
[11:54:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:55:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:56:07] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:56:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:56:45] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[11:56:49] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:04:18] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43421/console" [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede)
[12:04:55] <wikibugs>	 (03PS1) 10Majavah: cloudlb: add hack to grant cloudcontrol1006/7 database access [puppet] - 10https://gerrit.wikimedia.org/r/959199 (https://phabricator.wikimedia.org/T346439)
[12:06:10] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43422/console" [puppet] - 10https://gerrit.wikimedia.org/r/959199 (https://phabricator.wikimedia.org/T346439) (owner: 10Majavah)
[12:06:31] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloudlb: add hack to grant cloudcontrol1006/7 database access [puppet] - 10https://gerrit.wikimedia.org/r/959199 (https://phabricator.wikimedia.org/T346439) (owner: 10Majavah)
[12:06:50] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] "This is very ugly but also very temporary." [puppet] - 10https://gerrit.wikimedia.org/r/959199 (https://phabricator.wikimedia.org/T346439) (owner: 10Majavah)
[12:06:54] <wikibugs>	 10ops-eqiad, 10cloud-services-team (Hardware): cloudservices1004: decomission - https://phabricator.wikimedia.org/T346033 (10aborrero) a:03Jclark-ctr
[12:07:20] <wikibugs>	 10ops-eqiad, 10cloud-services-team (Hardware): cloudservices1004: decomission - https://phabricator.wikimedia.org/T346033 (10aborrero)
[12:07:23] <wikibugs>	 (03PS1) 10Muehlenhoff: Configure ldap-rw1001/2001 as LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/959200 (https://phabricator.wikimedia.org/T331699)
[12:07:25] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[12:07:28] <wikibugs>	 (03PS1) 10Muehlenhoff: Extend acmechief config with new names of Bookworm hosts [puppet] - 10https://gerrit.wikimedia.org/r/959201 (https://phabricator.wikimedia.org/T331699)
[12:07:39] <wikibugs>	 (03PS2) 10Muehlenhoff: Extend acmechief config with new names of Bookworm hosts [puppet] - 10https://gerrit.wikimedia.org/r/959201 (https://phabricator.wikimedia.org/T331699)
[12:08:31] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43423/console" [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede)
[12:09:27] <wikibugs>	 (03PS1) 10Muehlenhoff: Create DNS records for new LDAP Bookworm cluster [dns] - 10https://gerrit.wikimedia.org/r/959203 (https://phabricator.wikimedia.org/T331699)
[12:09:41] <wikibugs>	 (03PS2) 10Muehlenhoff: Create DNS records for new LDAP Bookworm cluster [dns] - 10https://gerrit.wikimedia.org/r/959203 (https://phabricator.wikimedia.org/T331699)
[12:13:38] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (10aborrero)
[12:14:12] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: use 'up' for swagger probes failures too [alerts] - 10https://gerrit.wikimedia.org/r/959206
[12:14:47] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[12:15:24] <wikibugs>	 10SRE, 10Cloud-VPS, 10User-aborrero: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10aborrero)
[12:15:34] <wikibugs>	 10ops-eqiad, 10cloud-services-team (Hardware): cloudservices1004: decomission - https://phabricator.wikimedia.org/T346033 (10aborrero)
[12:16:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:16:55] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10aborrero)
[12:17:22] <wikibugs>	 (03PS1) 10David Caro: cloudlb: add hack to grant cloudbackup2002.codfw.wmnet database access [puppet] - 10https://gerrit.wikimedia.org/r/959208 (https://phabricator.wikimedia.org/T346439)
[12:17:30] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (10aborrero)
[12:17:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:18:12] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10aborrero)
[12:18:47] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43424/console" [puppet] - 10https://gerrit.wikimedia.org/r/959208 (https://phabricator.wikimedia.org/T346439) (owner: 10David Caro)
[12:19:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:19:40] <wikibugs>	 (03PS6) 10Slyngshede: C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194
[12:20:02] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10aborrero) a:05Jclark-ctr→03taavi
[12:20:07] <wikibugs>	 (03PS2) 10Filippo Giunchedi: sre: use 'up' for swagger probes failures too [alerts] - 10https://gerrit.wikimedia.org/r/959206 (https://phabricator.wikimedia.org/T346893)
[12:20:27] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:20:42] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[12:21:12] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10aborrero)
[12:24:11] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43425/console" [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede)
[12:25:42] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero)
[12:26:43] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "Given that the VM runs nothing but the test installation of Bitu, I see little reason to keep using the virtualenv. This way we also ensur" [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede)
[12:30:41] <wikibugs>	 (03PS2) 10Muehlenhoff: Configure ldap-rw1001/2001 as LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/959200 (https://phabricator.wikimedia.org/T331699)
[12:32:49] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:32:52] <wikibugs>	 (03CR) 10Gehel: "minor comments inline. I haven't looked at the python script itself yet." [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol)
[12:34:24] <wikibugs>	 (03PS3) 10Muehlenhoff: Configure ldap-rw1001/2001 as LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/959200 (https://phabricator.wikimedia.org/T331699)
[12:35:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:36:09] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959200 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff)
[12:40:47] <logmsgbot>	 !log akosiaris@deploy2002 Started deploy [restbase/deploy@e8a6ae4]: (no justification provided)
[12:41:18] <akosiaris>	 !log T346354 deploy RESTBase after bug is fixed
[12:41:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:23] <stashbot>	 T346354: restbase deploys via scap lead to all hosts being disabled in conftool  - https://phabricator.wikimedia.org/T346354
[12:41:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:41:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:42:41] <akosiaris>	 the deploy is supposed to fix these ^ once and for all
[12:42:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:43:42] <wikibugs>	 (03CR) 10Gehel: "This change is ready for review." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol)
[12:44:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:45:21] <logmsgbot>	 !log akosiaris@deploy2002 Finished deploy [restbase/deploy@e8a6ae4]: (no justification provided) (duration: 04m 34s)
[12:46:51] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede)
[12:47:19] <logmsgbot>	 !log akosiaris@deploy2002 Started deploy [restbase/deploy@e8a6ae4]: (no justification provided)
[12:49:37] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:49:42] <wikibugs>	 (03PS7) 10Slyngshede: C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194
[12:49:59] <wikibugs>	 (03CR) 10Slyngshede: C:IDM Rework git install to function with new repo layout. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede)
[12:50:50] <wikibugs>	 (03Abandoned) 10David Caro: cloudlb: add hack to grant cloudbackup2002.codfw.wmnet database access [puppet] - 10https://gerrit.wikimedia.org/r/959208 (https://phabricator.wikimedia.org/T346439) (owner: 10David Caro)
[12:51:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:52:03] <logmsgbot>	 !log akosiaris@deploy2002 Finished deploy [restbase/deploy@e8a6ae4]: (no justification provided) (duration: 04m 43s)
[12:52:25] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:52:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:52:45] <logmsgbot>	 !log akosiaris@deploy2002 Started deploy [restbase/deploy@e8a6ae4]: (no justification provided)
[12:54:55] <logmsgbot>	 !log akosiaris@deploy2002 Finished deploy [restbase/deploy@e8a6ae4]: (no justification provided) (duration: 02m 10s)
[12:58:33] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/958968 (owner: 10PipelineBot)
[12:59:24] <wikibugs>	 (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/958968 (owner: 10PipelineBot)
[12:59:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:13] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:00:24] <TheresNoTime>	 no patches in the queue :)
[13:01:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:01:23] <icinga-wm>	 RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:01:26] <wikibugs>	 (03PS1) 10Slyngshede: Enable SSH key management for all users. [software/bitu] - 10https://gerrit.wikimedia.org/r/959211
[13:01:29] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (10Volans) p:05Triage→03Medium Ok, from logs I see that: ` ["lsblk --all --output 'NAME,TYPE' --pa...
[13:01:49] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] build: Update eslint-config-wikimedia to 0.25.1 [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959014 (https://phabricator.wikimedia.org/T346629) (owner: 10Urbanecm)
[13:01:57] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Change CSS selector for Minerva mobile menu icon [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) (owner: 10Jdlrobson)
[13:02:00] <logmsgbot>	 !log akosiaris@deploy2002 Started deploy [restbase/deploy@e8a6ae4]: (no justification provided)
[13:02:18] <urbanecm>	 TheresNoTime: in that case, good time to add patches
[13:02:27] <logmsgbot>	 !log akosiaris@deploy2002 Finished deploy [restbase/deploy@e8a6ae4]: (no justification provided) (duration: 00m 27s)
[13:03:16] <wikibugs>	 (03PS1) 10FNegri: Package for Debian Bookworm [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762)
[13:03:29] <wikibugs>	 (03PS1) 10Muehlenhoff: Add new Bookworm LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/959213 (https://phabricator.wikimedia.org/T331699)
[13:03:40] <wikibugs>	 10SRE, 10serviceops: Memcached, mcrouter in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki)
[13:08:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede)
[13:09:12] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959213 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff)
[13:11:45] <wikibugs>	 (03CR) 10Slyngshede: C:IDM Rework git install to function with new repo layout. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede)
[13:11:47] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede)
[13:12:55] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.hosts.reimage for host idm-test1001.wikimedia.org with OS bookworm
[13:13:01] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host idm-test1001.wikimedia.org with OS bookworm
[13:14:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] C:IDM Rework git install to function with new repo layout. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede)
[13:16:32] <claime>	 Reminder that we'll start locking things down in about 15 minutes for the switchover
[13:18:05] <wikibugs>	 (03Merged) 10jenkins-bot: build: Update eslint-config-wikimedia to 0.25.1 [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959014 (https://phabricator.wikimedia.org/T346629) (owner: 10Urbanecm)
[13:18:07] <wikibugs>	 (03CR) 10Herron: [C: 03+1] sre: use 'up' for swagger probes failures too [alerts] - 10https://gerrit.wikimedia.org/r/959206 (https://phabricator.wikimedia.org/T346893) (owner: 10Filippo Giunchedi)
[13:18:38] <urbanecm>	 claime: wdym by locking? I'm doing a MW deployment as part of the scheduled B&C window
[13:18:42] <urbanecm>	 do you want me to abort?
[13:18:59] <claime>	 urbanecm: That window should have been removed, my fault
[13:19:03] <claime>	 Finish up
[13:19:16] <urbanecm>	 ack, ty. i need about ~20 minutes, hopefully.
[13:19:28] <claime>	 ack
[13:20:16] <claime>	 We don't need to lock scap right at the beginning, we do it just to be safe, so that should be ok, but it's cutting it kinda close
[13:20:35] <claime>	 I'll add to remove surrounding deployment windows to the scheduling doc
[13:21:00] <urbanecm>	 yep. i'm waiting on CI rn (it says ETA 0 min, so hopefully should merge too) and then it'll be one scap sync and that's all i have for today. 
[13:21:09] <urbanecm>	 will ping once done
[13:21:12] <wikibugs>	 (03Merged) 10jenkins-bot: Change CSS selector for Minerva mobile menu icon [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) (owner: 10Jdlrobson)
[13:21:13] <claime>	 ok thanks
[13:21:31] <wikibugs>	 (03Abandoned) 10Stevemunene: airflow-wmde: create analytics-wmde users class for wmde services [puppet] - 10https://gerrit.wikimedia.org/r/947714 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[13:21:48] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:959014|build: Update eslint-config-wikimedia to 0.25.1 (T346629)]], [[gerrit:959007|Change CSS selector for Minerva mobile menu icon (T346459)]]
[13:21:56] <stashbot>	 T346629: GrowthExperiments CI fails on master: mwgate-node16-docker - https://phabricator.wikimedia.org/T346629
[13:21:56] <stashbot>	 T346459: Mobile main menu icon missing when Growth home page enabled - https://phabricator.wikimedia.org/T346459
[13:22:49] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "dbproxy1018: depool clouddb1019 in favor of clouddb1015" [puppet] - 10https://gerrit.wikimedia.org/r/959018
[13:24:03] <icinga-wm>	 PROBLEM - Check systemd state on clouddb1019 is CRITICAL: CRITICAL - degraded: The following units failed: mariadb.service,prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:24:22] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Revert "dbproxy1018: depool clouddb1019 in favor of clouddb1015" [puppet] - 10https://gerrit.wikimedia.org/r/959018 (owner: 10Andrew Bogott)
[13:26:51] <icinga-wm>	 RECOVERY - Check systemd state on clouddb1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:26:54] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on idm-test1001.wikimedia.org with reason: host reimage
[13:27:18] <wikibugs>	 (03CR) 10FNegri: Package for Debian Bookworm (034 comments) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762) (owner: 10FNegri)
[13:29:33] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] Replace yaml load() calls with safe_load() [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/959080 (owner: 10Elukey)
[13:29:53] <wikibugs>	 (03CR) 10Eevans: [C: 04-1] Be explicit about the yaml loader class [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/959038 (owner: 10Eevans)
[13:30:35] <_joe_>	 jouncebot: now
[13:30:35] <jouncebot>	 For the next 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T1300)
[13:30:36] <wikibugs>	 (03Abandoned) 10Eevans: Be explicit about the yaml loader class [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/959038 (owner: 10Eevans)
[13:31:14] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idm-test1001.wikimedia.org with reason: host reimage
[13:31:24] <wikibugs>	 (03PS2) 10Muehlenhoff: Add new Bookworm LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/959213 (https://phabricator.wikimedia.org/T331699)
[13:31:54] <wikibugs>	 (03CR) 10Herron: [V: 03+1 C: 03+2] dispatch::web: add ensure param and ensure => absent (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/957749 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron)
[13:31:57] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (10kamila)
[13:32:23] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] Extend acmechief config with new names of Bookworm hosts [puppet] - 10https://gerrit.wikimedia.org/r/959201 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff)
[13:32:34] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959213 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff)
[13:32:54] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] Create DNS records for new LDAP Bookworm cluster [dns] - 10https://gerrit.wikimedia.org/r/959203 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff)
[13:33:39] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] Configure ldap-rw1001/2001 as LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/959200 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff)
[13:33:52] <wikibugs>	 (03PS1) 10Muehlenhoff: conftool: Add new LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/959215 (https://phabricator.wikimedia.org/T331699)
[13:33:55] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Extend acmechief config with new names of Bookworm hosts [puppet] - 10https://gerrit.wikimedia.org/r/959201 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff)
[13:34:20] <wikibugs>	 (03PS1) 10Majavah: Connect eqiad1 cloudvirts to cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/959216 (https://phabricator.wikimedia.org/T346651)
[13:39:38] <wikibugs>	 (03PS1) 10Herron: dispatch::web: correct /usr/local/bin/dispatch ensure [puppet] - 10https://gerrit.wikimedia.org/r/959220 (https://phabricator.wikimedia.org/T344937)
[13:41:13] <wikibugs>	 (03PS3) 10Muehlenhoff: Add new Bookworm LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/959213 (https://phabricator.wikimedia.org/T331699)
[13:41:41] <wikibugs>	 (03PS1) 10Stevemunene: airflow-wmde: Remove statsd analytics-wmde user [puppet] - 10https://gerrit.wikimedia.org/r/959222 (https://phabricator.wikimedia.org/T340648)
[13:42:16] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm and jdlrobson: Backport for [[gerrit:959014|build: Update eslint-config-wikimedia to 0.25.1 (T346629)]], [[gerrit:959007|Change CSS selector for Minerva mobile menu icon (T346459)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:42:24] <stashbot>	 T346629: GrowthExperiments CI fails on master: mwgate-node16-docker - https://phabricator.wikimedia.org/T346629
[13:42:24] <stashbot>	 T346459: Mobile main menu icon missing when Growth home page enabled - https://phabricator.wikimedia.org/T346459
[13:43:08] <wikibugs>	 (03CR) 10Herron: [C: 03+2] dispatch::web: correct /usr/local/bin/dispatch ensure [puppet] - 10https://gerrit.wikimedia.org/r/959220 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron)
[13:43:23] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm and jdlrobson: Continuing with sync
[13:44:26] <wikibugs>	 (03PS1) 10JHathaway: httpd: ensure mod commands are available [puppet] - 10https://gerrit.wikimedia.org/r/959224 (https://phabricator.wikimedia.org/T344868)
[13:45:24] <wikibugs>	 (03PS1) 10JHathaway: puppet agent: protect against missing client bucket path [puppet] - 10https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970)
[13:46:27] <wikibugs>	 (03PS1) 10JHathaway: nginx: add toggle for mounting lib on tmpfs vol [puppet] - 10https://gerrit.wikimedia.org/r/959226 (https://phabricator.wikimedia.org/T346842)
[13:46:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:47:02] <wikibugs>	 (03PS1) 10JHathaway: apt: fix use of alternative mirror [puppet] - 10https://gerrit.wikimedia.org/r/959227
[13:47:06] <wikibugs>	 (03PS3) 10Kamila Součková: db: Switch DNS master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474)
[13:47:23] <wikibugs>	 (03CR) 10Hashar: "Side track the `deployment-ssh` resource title has the dash replaced by an underscore which is reflected by changes in the catalogues:" [puppet] - 10https://gerrit.wikimedia.org/r/958962 (owner: 10Muehlenhoff)
[13:47:50] <wikibugs>	 (03PS1) 10JHathaway: postgresql: fix ordering on a new install [puppet] - 10https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842)
[13:47:56] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959213 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff)
[13:48:24] <wikibugs>	 (03PS1) 10JHathaway: puppetdb: add ability to configure db_ro_host [puppet] - 10https://gerrit.wikimedia.org/r/959229 (https://phabricator.wikimedia.org/T346842)
[13:48:26] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] db: Switch DNS master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková)
[13:49:04] <wikibugs>	 (03PS1) 10JHathaway: prometheus-postgres-exporter: install configs before service [puppet] - 10https://gerrit.wikimedia.org/r/959230 (https://phabricator.wikimedia.org/T346842)
[13:49:25] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idm-test1001.wikimedia.org with OS bookworm
[13:49:30] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host idm-test1001.wikimedia.org with OS bookworm completed: - idm-test1001 (**PASS**)   -...
[13:49:48] <wikibugs>	 (03PS1) 10JHathaway: puppetdb: preseed to avoid creating database users [puppet] - 10https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842)
[13:49:56] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet
[13:49:58] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0)
[13:50:08] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks
[13:50:17] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0)
[13:50:21] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl
[13:50:35] <wikibugs>	 (03PS1) 10JHathaway: puppetdb prometheus exporter: in a container listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/959232 (https://phabricator.wikimedia.org/T346842)
[13:51:19] <wikibugs>	 (03PS1) 10JHathaway: pki: disable mysql specific scripts when using sqlite [puppet] - 10https://gerrit.wikimedia.org/r/959234 (https://phabricator.wikimedia.org/T344868)
[13:51:33] <wikibugs>	 (03CR) 10Muehlenhoff: scap:ferm: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958962 (owner: 10Muehlenhoff)
[13:51:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:52:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:52:07] <wikibugs>	 (03PS1) 10JHathaway: puppetserver: fix perma-diff on /var/lib/puppet/ssl [puppet] - 10https://gerrit.wikimedia.org/r/959235 (https://phabricator.wikimedia.org/T337970)
[13:52:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:52:48] <wikibugs>	 (03PS1) 10JHathaway: ferm: fix ferm-status on container bullseye instances [puppet] - 10https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868)
[13:53:17] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: dispatch-scheduler.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:53:19] <wikibugs>	 (03PS1) 10JHathaway: pki::multirootca: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/959237
[13:53:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:54:12] <wikibugs>	 (03PS1) 10JHathaway: puppetserver: Serve the full cert chain via jetty [puppet] - 10https://gerrit.wikimedia.org/r/959238 (https://phabricator.wikimedia.org/T337970)
[13:55:08] <wikibugs>	 (03PS1) 10JHathaway: pki dev: cfssl configs for the dev env pki image [puppet] - 10https://gerrit.wikimedia.org/r/959241 (https://phabricator.wikimedia.org/T344868)
[13:56:03] <claime>	 urbanecm: cutting it close, where is it at?
[13:56:09] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0)
[13:56:09] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:959014|build: Update eslint-config-wikimedia to 0.25.1 (T346629)]], [[gerrit:959007|Change CSS selector for Minerva mobile menu icon (T346459)]] (duration: 34m 21s)
[13:56:12] <claime>	 lol
[13:56:14] <claime>	 k
[13:56:19] <stashbot>	 T346629: GrowthExperiments CI fails on master: mwgate-node16-docker - https://phabricator.wikimedia.org/T346629
[13:56:20] <stashbot>	 T346459: Mobile main menu icon missing when Growth home page enabled - https://phabricator.wikimedia.org/T346459
[13:56:33] <urbanecm>	 claime: I think that's your answer. Sorry, scap was a bit slower. 
[13:56:34] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] "Thanks Eric! Do you want to create the new deb change + package or should I?" [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/959080 (owner: 10Elukey)
[13:56:36] <logmsgbot>	 !log kamila@deploy2002 Locking from deployment [ALL REPOSITORIES]: Datacenter Switchover: MediaWiki - T346474
[13:56:37] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:56:39] <urbanecm>	 I'm done, thanks for waiting. 
[13:56:40] <claime>	 urbanecm: no worries
[13:56:41] <stashbot>	 T346474: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474
[13:57:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:57:20] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance
[13:57:31] <logmsgbot>	 !log kamila@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=99)
[13:57:48] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance
[13:57:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:57:59] <logmsgbot>	 !log kamila@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=99)
[13:58:09] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10LSobanski) p:05Low→03Medium
[13:58:16] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959238 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway)
[13:58:34] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] pki::multirootca: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/959237 (owner: 10JHathaway)
[13:58:54] <wikibugs>	 (03CR) 10DCausse: Draft: cirrus streaming updater service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (owner: 10Ebernhardson)
[14:00:06] <jouncebot>	 kamila_: gettimeofday() says it's time for Datacenter switchover: MediaWiki. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T1400)
[14:00:07] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T1400)
[14:00:29] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_growthexperiments-refreshLinkRecommendations-s2.service,mediawiki_job_growthexperiments-refreshLinkRecommendations-s3.service,mediawiki_job_growthexperiments-refreshLinkRecommendations-s5.service,mediawiki_job_growthexperiments-refreshLinkRecommendations-s6.service,mediawiki_job_growthexperiments-refreshLinkRecommendati
[14:00:29] <icinga-wm>	 ervice https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:32] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly
[14:00:32] <logmsgbot>	 !log kamila@cumin1001 MediaWiki read-only period starts at: 2023-09-20 14:00:32.114116
[14:00:40] <claime>	 mwmaint alert expected
[14:00:47] <wikibugs>	 (CR) Bking: [C: +2] rdf-streaming-updater: start adding per-env ZK path root [deployment-charts] - https://gerrit.wikimedia.org/r/957967 (https://phabricator.wikimedia.org/T342149) (owner: Bking)
[14:00:49] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0)
[14:00:51] <stashbot>	 kamila@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[14:01:01] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly
[14:01:02] <taavi>	 stashbot failing is expected
[14:01:02] <stashbot>	 kamila@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[14:01:03] <stashbot>	 See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[14:01:31] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0)
[14:01:33] <stashbot>	 kamila@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[14:01:40] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki
[14:01:42] <stashbot>	 kamila@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[14:01:58] <marostegui>	 that is expected
[14:02:02] <_joe_>	 yes lol
[14:02:19] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0)
[14:02:21] <stashbot>	 kamila@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[14:02:27] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite
[14:02:29] <stashbot>	 kamila@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[14:02:30] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0)
[14:02:32] <stashbot>	 kamila@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[14:02:37] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite
[14:02:39] <stashbot>	 kamila@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[14:02:53] <logmsgbot>	 !log kamila@cumin1001 MediaWiki read-only period ends at: 2023-09-20 14:02:53.790615
[14:02:53] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0)
[14:02:58] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite
[14:02:59] <logmsgbot>	 !log kamila@cumin1001 MediaWiki read-only period ends at: 2023-09-20 14:02:59.798838
[14:02:59] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0)
[14:03:07] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners
[14:03:09] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners (exit_code=0)
[14:03:17] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance
[14:04:09] <marostegui>	 !log Testing
[14:04:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:04:16] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) (owner: JHathaway)
[14:04:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:04:30] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959235 (https://phabricator.wikimedia.org/T337970) (owner: JHathaway)
[14:04:38] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959234 (https://phabricator.wikimedia.org/T344868) (owner: JHathaway)
[14:04:52] <wikibugs>	 (PS2) Marostegui: wmnet: Update pc cnames to codfw [dns] - https://gerrit.wikimedia.org/r/958464 (https://phabricator.wikimedia.org/T346474)
[14:04:56] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959232 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[14:05:08] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[14:05:20] <Amir1>	 \o/
[14:05:21] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959229 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[14:05:38] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959227 (owner: JHathaway)
[14:05:49] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959226 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[14:06:05] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0)
[14:06:12] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl
[14:06:33] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970) (owner: JHathaway)
[14:06:49] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959230 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[14:06:52] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0)
[14:07:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:07:08] <wikibugs>	 (CR) Kamila Součková: [C: +2] db: Switch DNS master alias to codfw [dns] - https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: Kamila Součková)
[14:07:09] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[14:07:25] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959224 (https://phabricator.wikimedia.org/T344868) (owner: JHathaway)
[14:07:36] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:07:42] <kamila_>	 !log Phase 9.5 Update DNS records for new database masters - T346474
[14:07:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:48] <stashbot>	 T346474: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474
[14:08:31] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[14:09:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:09:31] <logmsgbot>	 !log kamila@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchover: MediaWiki - T346474 (duration: 12m 54s)
[14:09:49] <wikibugs>	 (CR) JHathaway: [C: +1] conftool: Add new LDAP replicas [puppet] - https://gerrit.wikimedia.org/r/959215 (https://phabricator.wikimedia.org/T331699) (owner: Muehlenhoff)
[14:09:53] <wikibugs>	 (PS1) Marostegui: wmnet: Update pc cnames to codfw [dns] - https://gerrit.wikimedia.org/r/959243 (https://phabricator.wikimedia.org/T346474)
[14:10:13] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters
[14:10:15] <wikibugs>	 (CR) Marostegui: "Required a manual rebase and I am lazy so: https://gerrit.wikimedia.org/r/c/operations/dns/+/959243/" [dns] - https://gerrit.wikimedia.org/r/958464 (https://phabricator.wikimedia.org/T346474) (owner: Marostegui)
[14:10:23] <wikibugs>	 (Abandoned) Marostegui: wmnet: Update pc cnames to codfw [dns] - https://gerrit.wikimedia.org/r/958464 (https://phabricator.wikimedia.org/T346474) (owner: Marostegui)
[14:13:50] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Update pc cnames to codfw [dns] - https://gerrit.wikimedia.org/r/959243 (https://phabricator.wikimedia.org/T346474) (owner: Marostegui)
[14:15:34] <wikibugs>	 (PS2) JHathaway: postgresql: fix ordering on a new install [puppet] - https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842)
[14:15:42] <wikibugs>	 (CR) Muehlenhoff: ferm: fix ferm-status on container bullseye instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) (owner: JHathaway)
[14:15:55] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[14:16:03] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (Trizek-WMF) The banners were set and read: someone took the opportunity to [[ https://meta.wikimedia.org/w/inde...
[14:16:06] <wikibugs>	 SRE, serviceops, Datacenter-Switchover, Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (kamila)
[14:16:27] <wikibugs>	 (CR) Kamila Součková: [C: +2] debug.json: List primary DC servers first [mediawiki-config] - https://gerrit.wikimedia.org/r/957996 (https://phabricator.wikimedia.org/T346472) (owner: Kamila Součková)
[14:16:41] <wikibugs>	 (CR) JHathaway: [C: +1] Add new Bookworm LDAP replicas [puppet] - https://gerrit.wikimedia.org/r/959213 (https://phabricator.wikimedia.org/T331699) (owner: Muehlenhoff)
[14:16:45] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0)
[14:17:04] <wikibugs>	 SRE, serviceops, Datacenter-Switchover, Patch-For-Review: Sept 2023 Switchover: list new primary DC servers first in debug.json - https://phabricator.wikimedia.org/T346472 (kamila)
[14:17:36] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:18:32] <wikibugs>	 (CR) JHathaway: ferm: fix ferm-status on container bullseye instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) (owner: JHathaway)
[14:19:52] <wikibugs>	 (CR) BBlack: [C: +1] "Looks great, nice work!" [software/purged] - https://gerrit.wikimedia.org/r/959050 (https://phabricator.wikimedia.org/T346874) (owner: Fabfur)
[14:21:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:21:14] <wikibugs>	 (CR) Muehlenhoff: ferm: fix ferm-status on container bullseye instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) (owner: JHathaway)
[14:22:22] <wikibugs>	 (CR) Vgutierrez: [C: +1] "LGTM!" [software/purged] - https://gerrit.wikimedia.org/r/959050 (https://phabricator.wikimedia.org/T346874) (owner: Fabfur)
[14:22:36] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:23:21] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ssh-gitlab.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:23] <wikibugs>	 (CR) Fabfur: [C: +2] allow to specify buffer size for backend and frontend (2 comments) [software/purged] - https://gerrit.wikimedia.org/r/959050 (https://phabricator.wikimedia.org/T346874) (owner: Fabfur)
[14:23:28] <wikibugs>	 (PS2) JHathaway: puppetdb: preseed to avoid creating database users [puppet] - https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842)
[14:23:58] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[14:24:07] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] Connect eqiad1 cloudvirts to cloud-private [puppet] - https://gerrit.wikimedia.org/r/959216 (https://phabricator.wikimedia.org/T346651) (owner: Majavah)
[14:25:52] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2044 for high load - bking@cumin1001
[14:25:52] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic2044 for high load - bking@cumin1001
[14:26:03] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2044.codfw.wmnet for high load - bking@cumin1001
[14:26:05] <wikibugs>	 (PS17) Brouberol: Define a script in charge of checking the kafka broker in sync status [puppet] - https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741)
[14:26:07] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2044.codfw.wmnet for high load - bking@cumin1001
[14:26:49] <wikibugs>	 (CR) Fabfur: [V: +2 C: +2] allow to specify buffer size for backend and frontend [software/purged] - https://gerrit.wikimedia.org/r/959050 (https://phabricator.wikimedia.org/T346874) (owner: Fabfur)
[14:27:05] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:14] <wikibugs>	 SRE, Data-Engineering, Infrastructure-Foundations: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (MoritzMuehlenhoff) The rotation/compression appears to work fine and usual day chunks are in the 2.5G ballpark, was there any unusual extra traffic which made it spike t...
[14:28:25] <wikibugs>	 (CR) Fabfur: [C: +2] "trivial change" [software/purged] - https://gerrit.wikimedia.org/r/959165 (owner: Fabfur)
[14:28:29] <wikibugs>	 (CR) JHathaway: ferm: fix ferm-status on container bullseye instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) (owner: JHathaway)
[14:28:31] <wikibugs>	 (CR) Fabfur: [V: +2 C: +2] makefile: do not fail on already removed files [software/purged] - https://gerrit.wikimedia.org/r/959165 (owner: Fabfur)
[14:30:58] <wikibugs>	 (PS2) Kamila Součková: wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002 [dns] - https://gerrit.wikimedia.org/r/958472 (https://phabricator.wikimedia.org/T346474)
[14:31:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:31:34] <wikibugs>	 (CR) Kamila Součková: [C: +2] wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002 [dns] - https://gerrit.wikimedia.org/r/958472 (https://phabricator.wikimedia.org/T346474) (owner: Kamila Součková)
[14:34:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:35:46] <wikibugs>	 (CR) Hashar: scap:ferm: Avoid Ferm-specific syntax (1 comment) [puppet] - https://gerrit.wikimedia.org/r/958962 (owner: Muehlenhoff)
[14:35:57] <kamila_>	 !log update maintenance.eqiad.wmnet to point to mwmaint2002
[14:36:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:23] <wikibugs>	 (PS1) Clément Goubert: mw-api-ext: Raise main replicas to 12 [deployment-charts] - https://gerrit.wikimedia.org/r/959248
[14:38:59] <wikibugs>	 (PS1) JMeybohm: k8s: Scrape controller-manager and scheduler metrics [puppet] - https://gerrit.wikimedia.org/r/959249 (https://phabricator.wikimedia.org/T324959)
[14:39:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:39:23] <wikibugs>	 (CR) CI reject: [V: -1] k8s: Scrape controller-manager and scheduler metrics [puppet] - https://gerrit.wikimedia.org/r/959249 (https://phabricator.wikimedia.org/T324959) (owner: JMeybohm)
[14:40:31] <wikibugs>	 SRE, serviceops, Datacenter-Switchover, Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (kamila)
[14:41:59] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[14:42:36] <wikibugs>	 (PS2) Kamila Součková: geo-maps: reorder codfw/eqiad in the default [dns] - https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474)
[14:44:11] <wikibugs>	 (PS1) Majavah: O:wmcs::openstack::instance_backups: enable cloud-private [puppet] - https://gerrit.wikimedia.org/r/959252
[14:44:38] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1005 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[14:44:54] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new cloud-private records - cmooney@cumin1001"
[14:45:03] <wikibugs>	 (PS2) JMeybohm: k8s: Scrape controller-manager and scheduler metrics [puppet] - https://gerrit.wikimedia.org/r/959249 (https://phabricator.wikimedia.org/T324959)
[14:45:30] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] O:wmcs::openstack::instance_backups: enable cloud-private [puppet] - https://gerrit.wikimedia.org/r/959252 (owner: Majavah)
[14:45:37] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new cloud-private records - cmooney@cumin1001"
[14:45:37] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:45:38] <wikibugs>	 (CR) Majavah: [C: +2] O:wmcs::openstack::instance_backups: enable cloud-private [puppet] - https://gerrit.wikimedia.org/r/959252 (owner: Majavah)
[14:48:49] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43429/console" [puppet] - https://gerrit.wikimedia.org/r/959249 (https://phabricator.wikimedia.org/T324959) (owner: JMeybohm)
[14:54:41] <nemo-yiannis>	 Can we continue with helm deployments after the switchover, or are things still in flight? I would like to deploy some changes on wikifeeds
[14:56:27] <wikibugs>	 (CR) Elukey: [C: +1] k8s: Scrape controller-manager and scheduler metrics [puppet] - https://gerrit.wikimedia.org/r/959249 (https://phabricator.wikimedia.org/T324959) (owner: JMeybohm)
[14:56:57] <wikibugs>	 (CR) Vgutierrez: vrts: add ticket-cert.crt (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959026 (owner: AOkoth)
[14:57:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:18] <cdanis>	 nemo-yiannis: please proceed :)
[14:58:15] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[14:58:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:59:03] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[15:00:03] <wikibugs>	 (CR) Muehlenhoff: scap:ferm: Avoid Ferm-specific syntax (1 comment) [puppet] - https://gerrit.wikimedia.org/r/958962 (owner: Muehlenhoff)
[15:02:45] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[15:02:48] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[15:03:23] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw
[15:03:25] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw
[15:04:42] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[15:05:10] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[15:05:49] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[15:06:04] <wikibugs>	 (PS1) Fabfur: Release 0.21 [software/purged] - https://gerrit.wikimedia.org/r/959255
[15:06:16] <wikibugs>	 (CR) CI reject: [V: -1] Release 0.21 [software/purged] - https://gerrit.wikimedia.org/r/959255 (owner: Fabfur)
[15:06:56] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons.
[15:08:19] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[15:08:56] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[15:09:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:09:15] <moritzm>	 !log added Taavi and Effie (new key) to pwstore
[15:09:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:49] <wikibugs>	 (PS1) AOkoth: ticket-test: add dummy cert [labs/private] - https://gerrit.wikimedia.org/r/959256
[15:12:17] <wikibugs>	 SRE, Continuous-Integration-Infrastructure: Puppet package_builder module should have the apt cache auto cleaned - https://phabricator.wikimedia.org/T339251 (hashar) Open→Resolved After a quick check on `integration-agent-pkgbuilder-1001` and `integration-agent-pkgbuilder-1002` it looks like the...
[15:12:33] <wikibugs>	 (PS1) JMeybohm: prometheus::k8s: Scape scheduler and controller-manager [puppet] - https://gerrit.wikimedia.org/r/959257 (https://phabricator.wikimedia.org/T324959)
[15:12:42] <wikibugs>	 (CR) Vgutierrez: [C: +1] ticket-test: add dummy cert [labs/private] - https://gerrit.wikimedia.org/r/959256 (owner: AOkoth)
[15:13:18] <wikibugs>	 (CR) AOkoth: [C: +2] ticket-test: add dummy cert [labs/private] - https://gerrit.wikimedia.org/r/959256 (owner: AOkoth)
[15:13:25] <wikibugs>	 (CR) AOkoth: [V: +2 C: +2] ticket-test: add dummy cert [labs/private] - https://gerrit.wikimedia.org/r/959256 (owner: AOkoth)
[15:14:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:16:20] <wikibugs>	 (CR) Vgutierrez: [C: -1] vrts: add ticket-cert.crt (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959026 (owner: AOkoth)
[15:16:33] <wikibugs>	 (CR) Elukey: [C: +1] decorators: fix set_tries [software/spicerack] - https://gerrit.wikimedia.org/r/956972 (https://phabricator.wikimedia.org/T346134) (owner: Volans)
[15:16:39] <icinga-wm>	 PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: superset.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:19:15] <wikibugs>	 (PS2) Fabfur: Release 0.21+deb11u1 for debian bullseye import [software/purged] - https://gerrit.wikimedia.org/r/959255
[15:20:53] <wikibugs>	 (CR) CI reject: [V: -1] Release 0.21+deb11u1 for debian bullseye import [software/purged] - https://gerrit.wikimedia.org/r/959255 (owner: Fabfur)
[15:22:37] <icinga-wm>	 PROBLEM - puppet last run on an-tool1005 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:23:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:23:22] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] k8s: Scrape controller-manager and scheduler metrics [puppet] - https://gerrit.wikimedia.org/r/959249 (https://phabricator.wikimedia.org/T324959) (owner: JMeybohm)
[15:23:30] <wikibugs>	 (PS3) Fabfur: Release 0.21+deb11u1 for debian bullseye import [software/purged] - https://gerrit.wikimedia.org/r/959255
[15:24:03] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:24:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:25:22] <wikibugs>	 (PS1) AOkoth: ticket-test: add all dummy files [labs/private] - https://gerrit.wikimedia.org/r/959259
[15:26:48] <wikibugs>	 (PS1) Clément Goubert: mw-web: Raise main replicas to 12 [deployment-charts] - https://gerrit.wikimedia.org/r/959260
[15:27:45] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] mw-web: Raise main replicas to 12 [deployment-charts] - https://gerrit.wikimedia.org/r/959260 (owner: Clément Goubert)
[15:27:57] <wikibugs>	 (CR) Clément Goubert: [C: +2] mw-web: Raise main replicas to 12 [deployment-charts] - https://gerrit.wikimedia.org/r/959260 (owner: Clément Goubert)
[15:28:47] <wikibugs>	 (Merged) jenkins-bot: mw-web: Raise main replicas to 12 [deployment-charts] - https://gerrit.wikimedia.org/r/959260 (owner: Clément Goubert)
[15:29:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:29:29] <wikibugs>	 (CR) Vgutierrez: [C: -2] "fix the commit tree?" [software/purged] - https://gerrit.wikimedia.org/r/959255 (owner: Fabfur)
[15:29:32] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[15:29:39] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[15:29:42] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[15:29:57] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[15:30:01] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[15:30:10] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[15:32:11] <wikibugs>	 (CR) Volans: [C: +2] decorators: fix set_tries [software/spicerack] - https://gerrit.wikimedia.org/r/956972 (https://phabricator.wikimedia.org/T346134) (owner: Volans)
[15:33:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:33:52] <Skynet>	 I'm getting a swarm of HTTP 400 responses on mowiki.  Can anyone help me figure out what's causing it?
[15:35:09] <wikibugs>	 (03PS11) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026
[15:35:45] <wikibugs>	 (03PS12) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026
[15:36:04] <Skynet>	 action=sitematrix on mo.wikipedia.org is failing with 400 responses.
[15:36:57] <wikibugs>	 (03Merged) 10jenkins-bot: decorators: fix set_tries [software/spicerack] - 10https://gerrit.wikimedia.org/r/956972 (https://phabricator.wikimedia.org/T346134) (owner: 10Volans)
[15:38:14] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:41:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:43:17] <_joe_>	 Skynet: mo.wikipedia now redirects to ro.wikipedia I'd say?
[15:43:29] <_joe_>	 not sure if that's new
[15:43:36] <Skynet>	 https://mo.wikipedia.org/w/api.php?action=sitematrix&format=json&smtype=language&smstate=all&smlangprop=code%7Cname%7Csite%7Cdir%7Clocalname&smsiteprop=url%7Cdbname%7Ccode%7Csitename%7Clang&smlimit=max&formatversion=2
[15:43:50] <Skynet>	 Doesn't seem to redirect properly then.
[15:44:03] <Skynet>	 And is fairly new since I haven't seen this before.
[15:44:24] <Skynet>	 Has the wiki changed from mowiki to rowiki now?
[15:45:02] <_joe_>	 can you open a task, please? I don't think we can help you here
[15:45:56] <wikibugs>	 (03PS1) 10AOkoth: vrts: add ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959272 (https://phabricator.wikimedia.org/T340027)
[15:46:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:46:14] <wikibugs>	 (03PS2) 10AOkoth: vrts: add ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959272 (https://phabricator.wikimedia.org/T340027)
[15:46:18] <akosiaris>	 mowiki has been redirecting to ro.wikipedia.org since 2019
[15:46:33] <akosiaris>	 no, more. 
[15:46:55] <Skynet>	 Well it broke recently.  Seems like it's not redirecting properly anymore.
[15:47:03] <akosiaris>	 https://phabricator.wikimedia.org/T169450
[15:47:06] <akosiaris>	 2017
[15:47:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:47:18] <_joe_>	 Skynet: again it's a software bug maybe, please open a task
[15:47:27] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] vrts: add ticket-test cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959272 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[15:48:05] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43432/console" [puppet] - 10https://gerrit.wikimedia.org/r/959257 (https://phabricator.wikimedia.org/T324959) (owner: 10JMeybohm)
[15:48:09] <wikibugs>	 (03PS1) 10Elukey: Revert "Lower ores.wikimedia.org's TTL to 5M" [dns] - 10https://gerrit.wikimedia.org/r/959286
[15:48:17] <wikibugs>	 (03PS2) 10Elukey: Revert "Lower ores.wikimedia.org's TTL to 5M" [dns] - 10https://gerrit.wikimedia.org/r/959286
[15:48:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:48:43] <wikibugs>	 (03CR) 10AOkoth: [C: 03+2] vrts: add ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959272 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[15:49:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/959257 (https://phabricator.wikimedia.org/T324959) (owner: 10JMeybohm)
[15:50:09] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] prometheus::k8s: Scape scheduler and controller-manager [puppet] - 10https://gerrit.wikimedia.org/r/959257 (https://phabricator.wikimedia.org/T324959) (owner: 10JMeybohm)
[15:51:14] <wikibugs>	 (03PS7) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638)
[15:51:42] <wikibugs>	 (03CR) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[15:52:57] <wikibugs>	 (03PS1) 10Cwhite: prometheus: add service_name_override parameter [puppet] - 10https://gerrit.wikimedia.org/r/958973 (https://phabricator.wikimedia.org/T346893)
[15:53:04] <wikibugs>	 (03CR) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[15:55:29] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[15:57:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:57:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:59:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:59:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:59:20] <wikibugs>	 (03Abandoned) 10Elukey: modules: copy configuration 1.4.1 to 1.5.0 for mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/956440 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[15:59:24] <wikibugs>	 (03Abandoned) 10Elukey: modules: add configuration 1.5.0 to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/956441 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[16:02:21] <wikibugs>	 (03PS2) 10Cwhite: prometheus: add service_name_override parameter [puppet] - 10https://gerrit.wikimedia.org/r/958973 (https://phabricator.wikimedia.org/T346893)
[16:02:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:03:07] <wikibugs>	 (03PS1) 10Elukey: modules: copy mesh:configuration 1.4.1 to 1.4.2 to facilitate reviews [deployment-charts] - 10https://gerrit.wikimedia.org/r/959279 (https://phabricator.wikimedia.org/T346638)
[16:03:09] <wikibugs>	 (03PS1) 10Elukey: modules: rename uses_ingress to uses_sni in mesh:configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/959280 (https://phabricator.wikimedia.org/T346638)
[16:04:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:08:11] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Urbanecm)
[16:09:59] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[16:11:26] <wikibugs>	 (03PS1) 10Elukey: ml-services: upgrade docker images for revscoring-based isvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/959309 (https://phabricator.wikimedia.org/T346445)
[16:12:13] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] modules: rename uses_ingress to uses_sni in mesh:configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/959280 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[16:12:51] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[16:14:34] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: fix memory leak in revscoring servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/959310 (https://phabricator.wikimedia.org/T346445)
[16:19:24] <wikibugs>	 (03Abandoned) 10Elukey: ml-services: upgrade docker images for revscoring-based isvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/959309 (https://phabricator.wikimedia.org/T346445) (owner: 10Elukey)
[16:20:05] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: fix memory leak in revscoring servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/959310 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos)
[16:21:08] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] modules: copy mesh:configuration 1.4.1 to 1.4.2 to facilitate reviews [deployment-charts] - 10https://gerrit.wikimedia.org/r/959279 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[16:21:15] <wikibugs>	 (03PS2) 10Elukey: modules: rename uses_ingress to uses_sni in mesh:configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/959280 (https://phabricator.wikimedia.org/T346638)
[16:24:30] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Revert "Lower ores.wikimedia.org's TTL to 5M" [dns] - 10https://gerrit.wikimedia.org/r/959286 (owner: 10Elukey)
[16:24:32] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[16:24:40] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] modules: rename uses_ingress to uses_sni in mesh:configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/959280 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[16:25:37] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] Revert "Lower ores.wikimedia.org's TTL to 5M" [dns] - 10https://gerrit.wikimedia.org/r/959286 (owner: 10Elukey)
[16:26:28] <klausman>	 !log pushing revert of ORES TTL change
[16:26:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:55] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[16:28:11] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[16:29:13] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[16:31:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:31:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:32:29] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Urbanecm) Hi @LSobanski, @taavi mentioned to me privately that if we want the stewards machine to run `ircservserv`, as di...
[16:36:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:36:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:39:17] <wikibugs>	 (03CR) 10AOkoth: [C: 03+2] ticket-test: add all dummy files [labs/private] - 10https://gerrit.wikimedia.org/r/959259 (owner: 10AOkoth)
[16:39:36] <wikibugs>	 (03CR) 10AOkoth: [V: 03+2 C: 03+2] ticket-test: add all dummy files [labs/private] - 10https://gerrit.wikimedia.org/r/959259 (owner: 10AOkoth)
[16:45:26] <wikibugs>	 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search: 2023-09-20 Elasticsearch unavailable incident ticket - https://phabricator.wikimedia.org/T346945 (10bking)
[16:47:57] <wikibugs>	 (03PS2) 10JHathaway: puppetdb: add ability to configure db_ro_host [puppet] - 10https://gerrit.wikimedia.org/r/959229 (https://phabricator.wikimedia.org/T346842)
[16:48:19] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959229 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway)
[16:48:44] <wikibugs>	 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10bking)
[16:48:52] <wikibugs>	 (03CR) 10FNegri: "This successfully builds a .deb package, if that package works as expected is harder to say. :D" [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762) (owner: 10FNegri)
[16:53:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:54:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:55:47] <wikibugs>	 (03PS1) 10David Caro: d/changelog: bump version [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959316
[16:57:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T1700)
[17:00:11] <wikibugs>	 (03PS3) 10Herron: titan: add pyrra/slo envoy/cfssl config [puppet] - 10https://gerrit.wikimedia.org/r/956902
[17:02:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:02:23] <wikibugs>	 (03PS7) 10Herron: dispatch: remove puppetization [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937)
[17:02:50] <wikibugs>	 (03PS8) 10Herron: dispatch: remove puppetization [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937)
[17:03:27] <jinxer-wm>	 (PrometheusRuleEvaluationFailures) firing: Prometheus rule evaluation failures (instance titan2001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[17:03:55] <wikibugs>	 (03CR) 10David Caro: Package for Debian Bookworm (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762) (owner: 10FNegri)
[17:08:27] <jinxer-wm>	 (PrometheusRuleEvaluationFailures) resolved: Prometheus rule evaluation failures (instance titan2001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[17:09:40] <wikibugs>	 (03PS3) 10AOkoth: ats: add ticket-test [puppet] - 10https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027)
[17:09:44] <wikibugs>	 (03PS13) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026
[17:09:46] <wikibugs>	 (03PS1) 10AOkoth: ssl: update ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959317
[17:10:04] <wikibugs>	 (03PS2) 10AOkoth: ssl: update ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959317
[17:12:36] <wikibugs>	 (03CR) 10AOkoth: [C: 03+2] ssl: update ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959317 (owner: 10AOkoth)
[17:12:39] <wikibugs>	 (03CR) 10AOkoth: [V: 03+2 C: 03+2] ssl: update ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959317 (owner: 10AOkoth)
[17:16:47] <icinga-wm>	 RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:18:33] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "caught one already! https://phabricator.wikimedia.org/T346950" [alerts] - 10https://gerrit.wikimedia.org/r/958929 (owner: 10Filippo Giunchedi)
[17:21:09] <icinga-wm>	 RECOVERY - puppet last run on an-tool1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:23:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:26:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:30:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:31:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:35:11] <wikibugs>	 (03PS4) 10AOkoth: ats: add ticket-test [puppet] - 10https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027)
[17:35:13] <wikibugs>	 (03PS14) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026
[17:35:16] <wikibugs>	 (03PS1) 10AOkoth: ssl: update ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959318
[17:35:45] <wikibugs>	 (03CR) 10AOkoth: [V: 03+2 C: 03+2] ssl: update ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959318 (owner: 10AOkoth)
[17:35:53] <wikibugs>	 (03PS2) 10AOkoth: ssl: update ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959318
[17:35:59] <wikibugs>	 (03CR) 10AOkoth: [V: 03+2] ssl: update ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959318 (owner: 10AOkoth)
[17:38:12] <wikibugs>	 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10bking)
[17:38:23] <wikibugs>	 (03CR) 10Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/958973/43435/" [puppet] - 10https://gerrit.wikimedia.org/r/958973 (https://phabricator.wikimedia.org/T346893) (owner: 10Cwhite)
[17:41:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:42:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:43:40] <wikibugs>	 (03PS4) 10AOkoth: vrts: add ticket-test on wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/957747 (https://phabricator.wikimedia.org/T340027)
[17:43:59] <wikibugs>	 (03CR) 10AOkoth: [C: 03+2] vrts: add ticket-test on wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/957747 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[17:46:17] <wikibugs>	 (03CR) 10AOkoth: [V: 03+2 C: 03+2] vrts: add ticket-test on wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/957747 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[17:47:08] <wikibugs>	 (03Abandoned) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 (owner: 10AOkoth)
[17:48:28] <wikibugs>	 (03PS5) 10AOkoth: ats: add ticket-test [puppet] - 10https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027)
[17:59:48] <wikibugs>	 (03PS1) 10Samtar: .well-known: Add F-Droid signature to assetlinks.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959327 (https://phabricator.wikimedia.org/T346951)
[18:00:05] <jouncebot>	 brennen and jnuche: OwO what's this, a deployment window?? Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T1800). nyaa~
[18:00:05] <jouncebot>	 brennen and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T1800).
[18:01:42] <brennen>	 o/
[18:01:58] <wikibugs>	 (03Abandoned) 10Fabfur: Release 0.21+deb11u1 for debian bullseye import [software/purged] - 10https://gerrit.wikimedia.org/r/959255 (owner: 10Fabfur)
[18:02:18] <brennen>	 !log train 1.41.0-wmf.27 (T345888): no current blockers, logs clean, rolling to group1
[18:02:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:24] <stashbot>	 T345888: 1.41.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T345888
[18:05:38] <wikibugs>	 (03PS1) 10Fabfur: Release 0.21+deb11u1 for bullseye [software/purged] - 10https://gerrit.wikimedia.org/r/959328
[18:06:15] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959329 (https://phabricator.wikimedia.org/T345888)
[18:06:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959329 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot)
[18:07:00] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959329 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot)
[18:07:05] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:09:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:10:01] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:12:13] <brennen>	 i note that PHPFPMTooBusy seems to be a recurring thing with deploys currently.
[18:14:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:14:34] <logmsgbot>	 !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.27  refs T345888
[18:14:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:14:41] <stashbot>	 T345888: 1.41.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T345888
[18:19:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:21:53] <logmsgbot>	 !log brennen@deploy2002 Synchronized php: group1 wikis to 1.41.0-wmf.27  refs T345888 (duration: 07m 17s)
[18:22:00] <stashbot>	 T345888: 1.41.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T345888
[18:23:09] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:26:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:26:18] <wikibugs>	 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search, 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10Aklapper)
[18:31:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:34:57] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:36:04] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] scap:ferm: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958962 (owner: 10Muehlenhoff)
[18:36:21] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt-wdqs1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:57:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:58:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:04:35] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:04:41] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:08:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:12:25] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:13:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:14:45] <wikibugs>	 (CR) Dbrant: [C: -1] "Nice, thanks! Google's documentation seems to be a little ambiguous about this, but it looks like some people have reported difficulties w" [mediawiki-config] - https://gerrit.wikimedia.org/r/959327 (https://phabricator.wikimedia.org/T346951) (owner: Samtar)
[19:15:47] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:24:58] <logmsgbot>	 !log bearloga@deploy2002 Started deploy [airflow-dags/analytics_product@80496b8]: (no justification provided)
[19:25:07] <logmsgbot>	 !log bearloga@deploy2002 Finished deploy [airflow-dags/analytics_product@80496b8]: (no justification provided) (duration: 00m 09s)
[19:26:40] <logmsgbot>	 !log bearloga@deploy2002 Started deploy [airflow-dags/analytics_product@80496b8]: (no justification provided)
[19:26:46] <logmsgbot>	 !log bearloga@deploy2002 Finished deploy [airflow-dags/analytics_product@80496b8]: (no justification provided) (duration: 00m 05s)
[19:46:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:47:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:52:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:53:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T2000).
[20:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:02:21] <urbanecm>	 indeed
[20:10:43] <wikibugs>	 SRE, AQS2.0, Cassandra, serviceops, Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (Htriedman) > Some of them are just artifacts of starting from a fork of one of the legacy services. For example, we'll want to adop...
[20:20:53] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:21:01] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:37:49] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:39:15] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:42:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:45:07] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:46:31] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:47:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:00:06] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T2100)
[21:03:41] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:04:21] <icinga-wm>	 PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:14:01] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 144, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:14:07] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:44:39] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:46:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:48:43] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:50:07] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:00:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:02:25] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:13:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:15:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:19:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:23:09] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:23:46] <wikibugs>	 (PS1) Varnent: [foundationwiki] Grant translation admin rights to 'sysop' group [mediawiki-config] - https://gerrit.wikimedia.org/r/959354 (https://phabricator.wikimedia.org/T346187)
[22:24:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:24:26] <wikibugs>	 (CR) CI reject: [V: -1] [foundationwiki] Grant translation admin rights to 'sysop' group [mediawiki-config] - https://gerrit.wikimedia.org/r/959354 (https://phabricator.wikimedia.org/T346187) (owner: Varnent)
[22:30:07] <wikibugs>	 (PS2) Varnent: [foundationwiki] Grant translation admin rights to 'sysop' and 'global-sysop' groups [mediawiki-config] - https://gerrit.wikimedia.org/r/959354 (https://phabricator.wikimedia.org/T346187)
[22:30:47] <wikibugs>	 (CR) CI reject: [V: -1] [foundationwiki] Grant translation admin rights to 'sysop' and 'global-sysop' groups [mediawiki-config] - https://gerrit.wikimedia.org/r/959354 (https://phabricator.wikimedia.org/T346187) (owner: Varnent)
[22:48:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:50:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:50:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:52:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:59:17] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:00:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:23:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[23:23:52] <wikibugs>	 (PS1) Jclark-ctr: Add new servers db1226-33 [puppet] - https://gerrit.wikimedia.org/r/959359 (https://phabricator.wikimedia.org/T342176)
[23:26:43] <wikibugs>	 (PS2) Jclark-ctr: Add new servers db1226-33 [puppet] - https://gerrit.wikimedia.org/r/959359 (https://phabricator.wikimedia.org/T342176)
[23:28:10] <wikibugs>	 (PS3) Jclark-ctr: Add new servers db1226-33 [puppet] - https://gerrit.wikimedia.org/r/959359 (https://phabricator.wikimedia.org/T342176)
[23:28:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[23:29:13] <wikibugs>	 (CR) Jclark-ctr: [C: +2] Add new servers db1226-33 [puppet] - https://gerrit.wikimedia.org/r/959359 (https://phabricator.wikimedia.org/T342176) (owner: Jclark-ctr)
[23:44:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[23:47:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt pc1016 - jclark@cumin1001"
[23:48:30] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt pc1016 - jclark@cumin1001"
[23:48:30] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:48:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host pc1016
[23:49:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host pc1016
[23:49:09] <logmsgbot>	 !log jclark@cumin1001 END (ERROR) - Cookbook sre.network.configure-switch-interfaces (exit_code=97) for host pc1016
[23:49:13] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host pc1015
[23:49:15] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc1015
[23:50:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED
[23:50:04] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc1016
[23:50:06] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED
[23:50:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host pc1016.mgmt.eqiad.wmnet with reboot policy FORCED
[23:51:15] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED
[23:51:18] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED
[23:53:24] <wikibugs>	 (CR) RLazarus: [C: +1] "I haven't tested this but LGTM in principle. Thanks for this." [grafana-grizzly] - https://gerrit.wikimedia.org/r/956878 (https://phabricator.wikimedia.org/T346144) (owner: Elukey)
[23:54:50] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (Jclark-ctr) @VRiley-WMF   Serial was entered into netbox incorrectly   if you are not onsite sometimes you can look at procurement ticket packing slip that is attached.
[23:54:55] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1016.mgmt.eqiad.wmnet with reboot policy FORCED
[23:55:15] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host pc1016.mgmt.eqiad.wmnet with reboot policy FORCED