[00:00:25] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1226'] [00:00:27] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1227'] [00:00:55] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1231'] [00:00:59] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1232'] [00:01:37] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1228'] [00:01:40] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1230'] [00:01:48] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1233'] [00:02:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1229'] [00:02:57] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:04:21] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:09:36] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1231'] [00:10:10] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1233'] [00:10:36] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1232'] [00:13:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install 
db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jclark-ctr) [00:26:53] (03PS2) 10Ebernhardson: Draft: Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) [00:26:55] (03PS1) 10Ebernhardson: Draft: flink-app: Provide kafka hosts as properties file [deployment-charts] - 10https://gerrit.wikimedia.org/r/959066 [00:38:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pki1002.eqiad.wmnet with OS bullseye [00:38:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pki1002.eqiad.wmnet with OS bullseye [00:38:40] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/958970 [00:38:46] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/958970 (owner: 10TrainBranchBot) [00:42:05] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:43:31] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:47:45] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links 
to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:47:45] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:09] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:53] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:50:35] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:51:19] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:51:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pki1002.eqiad.wmnet with reason: host reimage [00:52:57] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:53:34] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
10https://gerrit.wikimedia.org/r/958970 (owner: 10TrainBranchBot) [00:53:43] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:54:09] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:54:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki1002.eqiad.wmnet with reason: host reimage [00:54:57] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:57:25] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:58:27] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:08:41] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [01:12:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [01:12:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pki1002.eqiad.wmnet with OS bullseye [01:12:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install pki1002 - 
https://phabricator.wikimedia.org/T342892 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pki1002.eqiad.wmnet with OS bullseye completed: - pki1002 (**PASS**) - R... [01:13:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (10Jhancock.wm) [01:13:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (10Jhancock.wm) 05Open→03Resolved @joanna_borun all finished [01:40:13] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:44:27] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:46:57] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:48:09] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:48:23] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:49:19] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is 
CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:49:33] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:50:43] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:55:09] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:59:23] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:07:36] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:31] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:12:21] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:22:36] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:23:32] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [02:23:45] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:25:09] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:36] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:42:28] (03Abandoned) 10Krinkle: speed-tests: Test selector changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912366 (owner: 10Krinkle) [02:46:51] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:48:03] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test 
Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:49:41] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:50:51] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:51:05] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:52:31] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:53:55] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:55:21] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:03:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:39:01] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:40:25] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:40:31] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:41:57] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:55:11] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:55:59] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:56:37] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:57:25] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:15:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:23:19] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (cloudservices2004-dev), Fresh: 125 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:52:33] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:52:57] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:53:43] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:54:11] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 
200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:55:07] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:55:23] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:55:37] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:55:47] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:00:25] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:03:13] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:03:33] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:04:59] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:06:23] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:07:49] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:15:39] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100% [05:15:51] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:19:15] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [05:25:43] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 38, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:26:43] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 76.96 ms [05:30:03] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:30:17] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 72.19 ms [05:30:55] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:31:27] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:32:19] RECOVERY - restbase endpoints health on restbase1017 is 
OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:36:27] 10SRE-OnFire, 10Traffic, 10conftool, 10serviceops, and 2 others: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (10Joe) 05Open→03Resolved AIUI this is now resolved [05:45:47] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [05:45:47] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:47:11] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:49:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:54:53] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:55:39] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: 
/en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve anno [05:55:39] s returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:57:05] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:57:43] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T0600) [06:00:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:03:41] PROBLEM - Check systemd state on idm2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync_bitu_username_block.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:04:51] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:06:17] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:07:29] PROBLEM - restbase endpoints health on 
restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:08:55] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:23:32] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [06:28:59] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:30:25] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:38:09] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:40:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1011.eqiad.wmnet [06:40:57] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: 
/en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:42:21] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:42:21] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:42:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1011.eqiad.wmnet [06:43:47] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:45:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1010.eqiad.wmnet [06:47:11] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:50:01] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:50:32] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:idm::redis bind to both IPv4 and IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/958936 (owner: 10Slyngshede) [06:50:50] (03PS1) 10Elukey: Replace yaml load() calls 
with safe_load() [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/959080 [06:51:19] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:51:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1010.eqiad.wmnet [06:51:42] (03PS2) 10Elukey: Replace yaml load() calls with safe_load() [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/959080 [06:52:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1009.eqiad.wmnet [06:54:09] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:55:25] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:55:35] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:55:57] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:56:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1009.eqiad.wmnet [06:56:26] (03Abandoned) 10Slyngshede: C:idm::redis Allow replication via IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/958907 (owner: 10Slyngshede) [06:57:31] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page 
returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:58:57] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:58:58] (03PS1) 10Slyngshede: R:IDM Switch idm1001 to install as package. [puppet] - 10https://gerrit.wikimedia.org/r/959145 [07:00:05] Amir1, Urbanecm, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:25] morning. I'll deploy some patches of my own [07:00:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1008.eqiad.wmnet [07:02:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959042 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [07:02:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959043 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [07:02:56] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43407/console" [puppet] - 10https://gerrit.wikimedia.org/r/959145 (owner: 10Slyngshede) [07:03:25] (03Merged) 10jenkins-bot: Set READ_NEW for Wikitech on OATHAuth multiple devices migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959042 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [07:03:29] (03Merged) 10jenkins-bot: Set WRITE_NEW for OATHAuth multiple devices on fishbowls/privates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959043 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [07:03:59] PROBLEM - 
Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:05:06] !log taavi@deploy2002 Started scap: Backport for [[gerrit:959042|Set READ_NEW for Wikitech on OATHAuth multiple devices migration (T242031)]], [[gerrit:959043|Set WRITE_NEW for OATHAuth multiple devices on fishbowls/privates (T242031)]] [07:05:16] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [07:05:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1008.eqiad.wmnet [07:06:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1007.eqiad.wmnet [07:06:53] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:08:17] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:08:36] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] R:IDM Switch idm1001 to install as package. 
[puppet] - 10https://gerrit.wikimedia.org/r/959145 (owner: 10Slyngshede) [07:09:19] !log slyngshede@cumin1001 START - Cookbook sre.hosts.reimage for host idm1001.wikimedia.org with OS bookworm [07:09:28] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host idm1001.wikimedia.org with OS bookworm [07:10:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1007.eqiad.wmnet [07:14:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1005.eqiad.wmnet [07:15:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1005.eqiad.wmnet [07:16:35] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:22:03] !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on idm1001.wikimedia.org with reason: host reimage [07:22:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-web1001.eqiad.wmnet [07:24:35] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idm1001.wikimedia.org with reason: host reimage [07:24:54] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.458 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:25:14] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:26:54] !log taavi@deploy2002 taavi: Backport for [[gerrit:959042|Set READ_NEW for Wikitech on OATHAuth multiple devices migration (T242031)]], [[gerrit:959043|Set WRITE_NEW for OATHAuth multiple devices on fishbowls/privates 
(T242031)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental X [07:26:54] WD option) [07:26:59] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [07:28:21] !log taavi@deploy2002 taavi: Continuing with sync [07:28:41] (03PS1) 10Stevemunene: Bring druid1009.equad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/959147 (https://phabricator.wikimedia.org/T336042) [07:28:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-web1001.eqiad.wmnet [07:29:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-test-ui1001.eqiad.wmnet [07:30:28] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:31:28] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:33:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-ui1001.eqiad.wmnet [07:34:22] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:34:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:34:39] !log jmm@cumin2002 START - 
Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2014.codfw.wmnet [07:34:50] RECOVERY - Check systemd state on idm1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:35:28] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:36:40] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:38:58] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:39:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2014.codfw.wmnet [07:39:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:40:12] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:41:16] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:959042|Set READ_NEW for Wikitech on OATHAuth multiple devices migration (T242031)]], [[gerrit:959043|Set WRITE_NEW for OATHAuth multiple devices on fishbowls/privates (T242031)]] (duration: 36m 09s) [07:41:21] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [07:41:36] 10SRE, 10ops-codfw: 
ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (10MoritzMuehlenhoff) >>! In T341546#9178623, @Jhancock.wm wrote: > yes, it's a bios setting. so it would require a reboot to apply. I should have caught that when I was fixing it the first time around so that's... [07:42:34] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:42:46] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idm1001.wikimedia.org with OS bookworm [07:42:52] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host idm1001.wikimedia.org with OS bookworm completed: - idm1001 (**PASS**) - Downtimed... 
[07:43:50] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:44:36] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/959148 [07:44:54] (03PS2) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/959148 [07:45:59] (PuppetDisabled) firing: Puppet disabled on puppetdb2002:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [07:46:18] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:46:20] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:46:50] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:47:49] (03PS12) 10Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) [07:47:51] (03PS1) 10Jcrespo: bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring [puppet] - 10https://gerrit.wikimedia.org/r/959149 (https://phabricator.wikimedia.org/T339894) [07:47:53] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 
10https://gerrit.wikimedia.org/r/959148 (owner: 10Muehlenhoff) [07:48:15] (03PS13) 10Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) [07:48:27] (03PS1) 10Slyngshede: idm: switch back to idm1001 as primary. [dns] - 10https://gerrit.wikimedia.org/r/959150 [07:48:56] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:50:46] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:50:50] (03PS8) 10Stevemunene: admin: Create analytics-wmde system user and airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) [07:50:59] (PuppetDisabled) firing: (2) Puppet disabled on puppetdb1002:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [07:51:02] (03PS1) 10Slyngshede: P:IDM Switch production back to idm1001 [puppet] - 10https://gerrit.wikimedia.org/r/959151 [07:57:17] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43408/console" [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [07:57:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (we don't strictly need to move back except validating the new server works fine, the active IDM can be floating freely between" [puppet] - 10https://gerrit.wikimedia.org/r/959151 (owner: 10Slyngshede) [07:59:00] (03CR) 10Stevemunene: [V: 03+1] admin: Create analytics-wmde system user and airflow admin group (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [08:00:05] brennen and jnuche: That opportune time is upon us again. Time for a MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T0800). [08:01:21] (03CR) 10Muehlenhoff: [C: 03+1] "All approvals are in and the patch looks fine" [puppet] - 10https://gerrit.wikimedia.org/r/955940 (https://phabricator.wikimedia.org/T345877) (owner: 10Vgutierrez) [08:02:11] !log installing libwebp security updates on buster [08:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:14] (03CR) 10Marostegui: [C: 03+1] dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [08:04:47] (03CR) 10Slyngshede: [C: 03+2] idm: switch back to idm1001 as primary. 
[dns] - 10https://gerrit.wikimedia.org/r/959150 (owner: 10Slyngshede) [08:04:56] (03PS2) 10Stevemunene: Bring druid1009.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/959147 (https://phabricator.wikimedia.org/T336042) [08:07:04] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:07:41] (03CR) 10Slyngshede: [C: 03+2] P:IDM Switch production back to idm1001 [puppet] - 10https://gerrit.wikimedia.org/r/959151 (owner: 10Slyngshede) [08:08:38] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: complement prometheus alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/958929 (owner: 10Filippo Giunchedi) [08:08:50] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the quick reviews folks" [alerts] - 10https://gerrit.wikimedia.org/r/958929 (owner: 10Filippo Giunchedi) [08:08:52] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] o11y: complement prometheus alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/958929 (owner: 10Filippo Giunchedi) [08:09:02] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:09:20] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:10:24] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints 
are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:10:31] !log restarting FPM on mw* to pick up libwebp security updates [08:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:18] (03CR) 10DCausse: [C: 03+1] rdf-streaming-updater: start adding per-env ZK path root [deployment-charts] - 10https://gerrit.wikimedia.org/r/957967 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking) [08:12:08] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:13:11] ACKNOWLEDGEMENT - Check systemd state on idm2001 is CRITICAL: CRITICAL - degraded: The following units failed: expire_bitu_signups.service,sync_bitu_username_block.service Slyngshede Switch over https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10hashar) My intent was to let @Mabualruz run a backport during the training which in turns require access to the deployment group hence why I came back... [08:15:52] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on puppetdb2002.codfw.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway [08:16:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on puppetdb2002.codfw.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway [08:16:19] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=11ec6d55-6d8f-4537-a398-4863d7f38c9c) set by jmm@cumin2002 for... 
[08:16:22] RECOVERY - Check systemd state on idm2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:50] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on puppetdb1002.eqiad.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway [08:17:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on puppetdb1002.eqiad.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway [08:17:17] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=708cd0d4-307e-4f35-acfa-ddae4ae88236) set by jmm@cumin2002 for... [08:17:44] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:19:10] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:20:15] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry [08:20:33] (03PS1) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 [08:20:35] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [08:21:00] (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [08:21:07] (03PS1) 10KartikMistry: Update cxserver to 2023-09-13-074325-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/959156 (https://phabricator.wikimedia.org/T346045) [08:21:59] (03PS1) 10Phedenskog: alertmanager: setup QTE mailing group. [puppet] - 10https://gerrit.wikimedia.org/r/959157 (https://phabricator.wikimedia.org/T346870) [08:22:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry [08:23:07] (03CR) 10Vgutierrez: [C: 03+2] admin: Grant shell access to acooper [puppet] - 10https://gerrit.wikimedia.org/r/955940 (https://phabricator.wikimedia.org/T345877) (owner: 10Vgutierrez) [08:23:57] (03PS2) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 [08:24:19] (03PS1) 10Filippo Giunchedi: thanos: read-only access for thanos.w.o/bucket [puppet] - 10https://gerrit.wikimedia.org/r/959159 [08:24:28] (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [08:25:09] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43410/console" [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [08:27:23] (03PS4) 10Arturo Borrero Gonzalez: cloudservices1005: prepare for reimage and back into service [puppet] - 10https://gerrit.wikimedia.org/r/958915 (https://phabricator.wikimedia.org/T346042) [08:28:20] (03CR) 10JMeybohm: "Nice, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) (owner: 10Ebernhardson) [08:28:36] (03PS3) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/959155 [08:29:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices1005: prepare for reimage and back into service [puppet] - 10https://gerrit.wikimedia.org/r/958915 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [08:30:02] !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudservices1005 [08:30:14] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudservices1005 [08:30:45] (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [08:30:59] 10SRE, 10SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Patch has been merged, it should be effective in ~30 minutes when puppet runs. @acooper should h... 
[08:31:13] !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudservices1005 [08:31:33] !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudservices1005 [08:31:38] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [08:32:02] 10SRE, 10SRE-Access-Requests: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (10Vgutierrez) [08:32:59] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:33:21] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1005.eqiad.wmnet with OS bullseye [08:33:29] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye [08:34:22] (03CR) 10Fabfur: [V: 03+2 C: 03+2] add simple Makefile and README [software/purged] - 10https://gerrit.wikimedia.org/r/958477 (owner: 10Fabfur) [08:36:14] (03CR) 10Jcrespo: [C: 03+2] bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring [puppet] - 10https://gerrit.wikimedia.org/r/959149 (https://phabricator.wikimedia.org/T339894) (owner: 10Jcrespo) [08:36:57] !log stop benthos@webrequest_live.service on centrallog1002 to test redudancy/capacity - T346871 [08:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:03] T346871: Test benthos webrequest_live with only one host - https://phabricator.wikimedia.org/T346871 [08:39:22] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 125 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:39:22] (03PS4) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/959155 [08:40:17] !log Draining ml-serve1002 for kubelet partition increase (T339231) [08:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:22] !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudservices1005.eqiad.wmnet with OS bullseye [08:40:23] T339231: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 [08:40:30] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye exe... [08:40:34] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1005.eqiad.wmnet with OS bullseye [08:40:43] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye [08:41:11] (03PS2) 10Fabfur: add Dockerfile just for build [software/purged] - 10https://gerrit.wikimedia.org/r/959049 [08:41:28] (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [08:42:36] (JobUnavailable) firing: (7) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:43:26] (03PS5) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/959155 [08:45:35] (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [08:46:53] (03PS3) 10Fabfur: add Dockerfile just for build [software/purged] - 10https://gerrit.wikimedia.org/r/959049 (https://phabricator.wikimedia.org/T346874) [08:47:09] !log temp bump threads to 15 for benthos@webrequest_live on centrallog2002 - T346871 [08:47:10] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://w [08:47:10] wikimedia.org/wiki/Services/Monitoring/restbase [08:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:14] T346871: Test benthos webrequest_live with only one host - https://phabricator.wikimedia.org/T346871 [08:47:20] 10SRE, 10User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (10aborrero) [08:47:50] !log Draining ml-serve1003 for kubelet partition increase (T339231) [08:47:54] (03PS6) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/959155 [08:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:55] T339231: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 [08:48:42] (03PS1) 10JMeybohm: kubernetes::node: Reserve CPU resources for system daemons [puppet] - 10https://gerrit.wikimedia.org/r/959164 (https://phabricator.wikimedia.org/T277876) [08:48:48] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:49:58] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:50:02] (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [08:50:08] (03PS1) 10Fabfur: makefile: do not fail on already removed files [software/purged] - 10https://gerrit.wikimedia.org/r/959165 [08:50:12] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:50:48] 10SRE, 10Cloud-VPS, 10User-aborrero: cloudservices1006 using 10. 
address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10aborrero) [08:51:06] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43411/console" [puppet] - 10https://gerrit.wikimedia.org/r/959164 (https://phabricator.wikimedia.org/T277876) (owner: 10JMeybohm) [08:51:06] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:52:32] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:53:18] (03CR) 10Vgutierrez: [C: 04-1] add Dockerfile just for build (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/959049 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur) [08:53:40] (03PS1) 10Jcrespo: dbbackups: Separate ensure file and 'ensure' job as followup to 51607b8 [puppet] - 10https://gerrit.wikimedia.org/r/959166 (https://phabricator.wikimedia.org/T346233) [08:54:50] (03PS2) 10Jcrespo: dbbackups: Separate ensure file and 'ensure' job as followup to 51607b8 [puppet] - 10https://gerrit.wikimedia.org/r/959166 (https://phabricator.wikimedia.org/T346233) [08:54:53] (03PS7) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 [08:55:04] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959166 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [08:57:01] (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [08:57:03] !log Draining ml-serve1004 for kubelet partition increase (T339231) [08:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:09] T339231: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 [08:57:58] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Separate ensure file and 'ensure' job as followup to 51607b8 [puppet] - 10https://gerrit.wikimedia.org/r/959166 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [08:58:51] (03PS4) 10Fabfur: add Dockerfile just for build [software/purged] - 10https://gerrit.wikimedia.org/r/959049 (https://phabricator.wikimedia.org/T346874) [08:58:55] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) [08:59:45] (03CR) 10Fabfur: add Dockerfile just for build (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/959049 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur) [08:59:59] !log restore benthos@webrequest_live running on both centrallog hosts - T346871 [09:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:10] T346871: Test benthos webrequest_live with only one host - https://phabricator.wikimedia.org/T346871 [09:00:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack::util::patch: add define [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro) [09:01:02] (03PS8) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/959155 [09:01:18] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:01:42] (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [09:02:36] (JobUnavailable) firing: (7) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:02:42] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:03:01] (03PS9) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/959155 [09:03:44] (03PS1) 10Jcrespo: dbbackups: Fix bug on config directory path: defaults -> default [puppet] - 10https://gerrit.wikimedia.org/r/959167 (https://phabricator.wikimedia.org/T346233) [09:04:08] (03CR) 10CI reject: [V: 04-1] dbbackups: Fix bug on config directory path: defaults -> default [puppet] - 10https://gerrit.wikimedia.org/r/959167 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [09:04:27] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: recursor: fix list of pdns hosts [puppet] - 10https://gerrit.wikimedia.org/r/959168 (https://phabricator.wikimedia.org/T346042) [09:04:40] (03PS2) 10Fabfur: makefile: do not fail on already removed files [software/purged] - 10https://gerrit.wikimedia.org/r/959165 [09:04:47] (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: recursor: fix list of pdns hosts [puppet] - 10https://gerrit.wikimedia.org/r/959168 (https://phabricator.wikimedia.org/T346042) [09:05:05] (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [09:05:25] (03PS2) 10Jcrespo: dbbackups: Fix bug on config directory path: defaults -> default [puppet] - 10https://gerrit.wikimedia.org/r/959167 (https://phabricator.wikimedia.org/T346233) [09:05:41] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959167 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [09:06:00] !log Draining ml-serve1005 for kubelet partition increase (T339231) [09:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:06] T339231: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 [09:06:13] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959168 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [09:06:23] (03CR) 10Fabfur: [C: 03+2] varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [09:08:00] (03CR) 10Vgutierrez: [C: 03+1] add Dockerfile just for build [software/purged] - 10https://gerrit.wikimedia.org/r/959049 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur) [09:08:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: pdns: recursor: fix list of pdns hosts [puppet] - 10https://gerrit.wikimedia.org/r/959168 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [09:08:47] !log applied patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/957292 (T344175) to add new mobile redirect domains to Varnish. 
Changes will be applied automatically by puppet on all cp hosts [09:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:53] T344175: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 [09:09:16] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Fix bug on config directory path: defaults -> default [puppet] - 10https://gerrit.wikimedia.org/r/959167 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [09:09:37] (03CR) 10Fabfur: [C: 03+2] add Dockerfile just for build [software/purged] - 10https://gerrit.wikimedia.org/r/959049 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur) [09:09:39] (03PS10) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 [09:09:43] (03CR) 10Fabfur: [V: 03+2 C: 03+2] add Dockerfile just for build [software/purged] - 10https://gerrit.wikimedia.org/r/959049 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur) [09:09:57] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices1005.eqiad.wmnet with reason: host reimage [09:11:52] (03CR) 10CI reject: [V: 04-1] C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [09:12:09] 10SRE, 10Traffic: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (10Fabfur) Hi @nshahquinn-wmf , changes to the first batch of domains (https://gerrit.wikimedia.org/r/c/operations/puppet/+/957292) should be applied during the next 30'. If you notice something strange p... 
[09:12:15] (03PS1) 10JMeybohm: kubernetes: Make control_plane_class_name mandatory [puppet] - 10https://gerrit.wikimedia.org/r/959170 [09:12:26] (03CR) 10Vgutierrez: allow to specify buffer size for backend, frontend or both (033 comments) [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (owner: 10Fabfur) [09:12:47] (03CR) 10Vgutierrez: allow to specify buffer size for backend, frontend or both (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (owner: 10Fabfur) [09:13:03] (03PS11) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 [09:13:04] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1005.eqiad.wmnet with reason: host reimage [09:15:29] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:15:38] !log Draining ml-serve1006 for kubelet partition increase (T339231) [09:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:44] T339231: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 [09:16:30] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43413/console" [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [09:16:31] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:17:11] (03CR) 10Filippo Giunchedi: alertmanager: setup QTE mailing group. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959157 (https://phabricator.wikimedia.org/T346870) (owner: 10Phedenskog) [09:17:38] (03CR) 10Slyngshede: C:idm:jobs Use bitu command for systemd jobs. [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [09:18:24] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43412/console" [puppet] - 10https://gerrit.wikimedia.org/r/959170 (owner: 10JMeybohm) [09:21:42] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: services: enable cloud-private for cloudservices1005 [puppet] - 10https://gerrit.wikimedia.org/r/959171 (https://phabricator.wikimedia.org/T346042) [09:22:21] (03CR) 10Fabfur: allow to specify buffer size for backend, frontend or both (034 comments) [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (owner: 10Fabfur) [09:22:23] (03PS17) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 [09:22:59] (03PS1) 10David Caro: replica_cnf_api: handle the mysql hashing only at the api layer [puppet] - 10https://gerrit.wikimedia.org/r/959172 (https://phabricator.wikimedia.org/T345742) [09:23:01] (03PS2) 10Fabfur: allow to specify buffer size for backend, frontend or both [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (https://phabricator.wikimedia.org/T346874) [09:24:00] !log Draining ml-serve1007 for kubelet partition increase (T339231) [09:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:06] T339231: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 [09:25:45] (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: services: enable cloud-private for cloudservices1005 [puppet] - 10https://gerrit.wikimedia.org/r/959171 (https://phabricator.wikimedia.org/T346042) [09:27:27] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, 
deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10Ladsgroup) Today we have the datacenter switchover. [09:27:42] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959171 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [09:29:23] !log Draining ml-serve1008 for kubelet partition increase (T339231) [09:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:30] T339231: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 [09:29:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: services: enable cloud-private for cloudservices1005 [puppet] - 10https://gerrit.wikimedia.org/r/959171 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [09:30:32] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 (owner: 10Slyngshede) [09:31:55] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloudswitch: codfw: figure out procurement - https://phabricator.wikimedia.org/T346724 (10cmooney) Hi @aborrero We should order the same as we already have for cloudsw1-b1-codfw. Which is Juniper QFX5120 (Broadcom Trident 3). To be... 
[09:32:10] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [09:33:12] (03CR) 10Elukey: [V: 03+2 C: 03+2] Improve ML team's SLO calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [09:33:29] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:34:02] !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [09:34:08] !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:34:24] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudservices1005 - aborrero@cumin1001 - T346042" [09:34:27] (03CR) 10Jaime Nuche: [C: 03+1] "LGTM based on other similar changes to remove Ferm syntax." [puppet] - 10https://gerrit.wikimedia.org/r/958962 (owner: 10Muehlenhoff) [09:34:29] T346042: cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 [09:35:11] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudservices1005 - aborrero@cumin1001 - T346042" [09:36:14] (03CR) 10Elukey: [V: 03+2 C: 03+2] "elukey@grafana1002:/srv/grafana-grizzly$ grr apply slo_dashboards.jsonnet" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [09:38:46] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudservices1005 - aborrero@cumin1001 - T346042" [09:39:46] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudservices1005 - aborrero@cumin1001 - T346042" [09:39:52] T346042: cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 [09:40:11] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: 
/en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:40:46] (03CR) 10Majavah: [C: 03+1] replica_cnf_api: handle the mysql hashing only at the api layer [puppet] - 10https://gerrit.wikimedia.org/r/959172 (https://phabricator.wikimedia.org/T345742) (owner: 10David Caro) [09:41:04] !log jelto@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab-runner2004.codfw.wmnet with OS bullseye [09:41:33] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:41:40] (03PS1) 10Filippo Giunchedi: envoyproxy: make sure 'envoy' user exists on log directory creation [puppet] - 10https://gerrit.wikimedia.org/r/959174 (https://phabricator.wikimedia.org/T346129) [09:46:27] (03CR) 10David Caro: [C: 03+2] replica_cnf_api: handle the mysql hashing only at the api layer [puppet] - 10https://gerrit.wikimedia.org/r/959172 (https://phabricator.wikimedia.org/T345742) (owner: 10David Caro) [09:48:07] !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on kafka-jumbo1003.eqiad.wmnet with reason: investigation by brouberol and elukey about kafka ACL issues that might be fixed by a broker restart [09:48:30] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/959170 (owner: 10JMeybohm) [09:48:31] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on kafka-jumbo1003.eqiad.wmnet with reason: investigation by brouberol and elukey about kafka ACL issues that might be fixed by a broker restart [09:48:49] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes: Make control_plane_class_name mandatory [puppet] - 10https://gerrit.wikimedia.org/r/959170 (owner: 10JMeybohm) [09:49:07] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:50:31] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:50:55] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:51:48] (03PS3) 10Elukey: Lower ores.wikimedia.org's TTL to 5M [dns] - 10https://gerrit.wikimedia.org/r/957689 (https://phabricator.wikimedia.org/T341696) [09:51:50] (03PS2) 10Elukey: Set ores.wikimedia.org as CNAME for ores-legacy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/957690 [09:51:58] (03CR) 10Elukey: Lower ores.wikimedia.org's TTL to 5M (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/957689 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey) [09:52:19] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:54:32] (03CR) 10Ilias Sarantopoulos: [C: 03+1] Lower ores.wikimedia.org's TTL to 5M [dns] - 
10https://gerrit.wikimedia.org/r/957689 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey) [09:54:46] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin1001" [09:55:13] (03PS1) 10Arturo Borrero Gonzalez: cloudservices1006: stop announcing ns0.openstack.eqiad1.wikimediacloud.org [puppet] - 10https://gerrit.wikimedia.org/r/959175 (https://phabricator.wikimedia.org/T346042) [09:55:33] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin1001" [09:55:34] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudservices1005.eqiad.wmnet with OS bullseye [09:55:47] 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmne... [09:56:07] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:56:38] (03CR) 10Cathal Mooney: [C: 03+1] "Let's merge when we are happy cloudservices1005 is ready to take over, just before we configure cloudsw1-d5-eqiad to speak BGP to it." 
[puppet] - 10https://gerrit.wikimedia.org/r/959175 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [09:56:58] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner2004.codfw.wmnet with reason: host reimage [09:57:31] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:57:48] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoyproxy: make sure 'envoy' user exists on log directory creation [puppet] - 10https://gerrit.wikimedia.org/r/959174 (https://phabricator.wikimedia.org/T346129) (owner: 10Filippo Giunchedi) [09:58:01] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:58:36] (03CR) 10Filippo Giunchedi: [C: 03+2] envoyproxy: make sure 'envoy' user exists on log directory creation [puppet] - 10https://gerrit.wikimedia.org/r/959174 (https://phabricator.wikimedia.org/T346129) (owner: 10Filippo Giunchedi) [09:59:41] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:59:41] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:00:02] !log ms-be10[61-75] swift package updates T346730 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T1000) [10:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:17] T346730: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 [10:00:39] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:01:23] (03CR) 10Klausman: [C: 03+2] Lower ores.wikimedia.org's TTL to 5M [dns] - 10https://gerrit.wikimedia.org/r/957689 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey) [10:01:55] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner2004.codfw.wmnet with reason: host reimage [10:02:23] !log Merging change 957689 (T341696) to lower DNS TTL to 5m for ORES name. [10:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:30] T341696: Zero traffic on bare metal ORES servers - https://phabricator.wikimedia.org/T341696 [10:02:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/959159 (owner: 10Filippo Giunchedi) [10:02:59] !log Running authdns-update to activate change 957689 (T341696) [10:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:37] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: read-only access for thanos.w.o/bucket [puppet] - 10https://gerrit.wikimedia.org/r/959159 (owner: 10Filippo Giunchedi) [10:04:08] !log brouberol@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons. 
[10:07:37] (03PS2) 10David Caro: replica_cnf_api: handle the mysql hashing only at the api layer [puppet] - 10https://gerrit.wikimedia.org/r/959172 (https://phabricator.wikimedia.org/T345742) [10:09:27] (03PS1) 10Filippo Giunchedi: thanos: load allowmethods httpd module [puppet] - 10https://gerrit.wikimedia.org/r/959176 [10:12:34] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: load allowmethods httpd module [puppet] - 10https://gerrit.wikimedia.org/r/959176 (owner: 10Filippo Giunchedi) [10:13:35] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:14:45] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:16:39] (03CR) 10Muehlenhoff: [C: 03+2] conntrackd: Add explicit check [puppet] - 10https://gerrit.wikimedia.org/r/958918 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:18:29] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner2004.codfw.wmnet with OS bullseye [10:19:36] (03PS5) 10Muehlenhoff: Switch cloudgw/codfw1dev to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) [10:21:38] 10SRE, 10Cloud-VPS, 10User-aborrero: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) [10:22:10] 10SRE, 10Cloud-VPS, 10User-aborrero: cloudservices1006 using 10. 
address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) [10:22:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:22:49] (03CR) 10Elukey: [C: 04-1] "The change would probably be a no-op as Tobias pointed out, we'd need a HTTP redirect of sort in this case. Or we should change the follow" [dns] - 10https://gerrit.wikimedia.org/r/957690 (owner: 10Elukey) [10:23:32] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:23:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudweb2002-dev.wikimedia.org [10:25:55] 10SRE, 10Infrastructure-Foundations, 10netops, 10Epic: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (10aborrero) [10:26:26] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 3 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) 05Open→03Resolved a:03aborrero [10:27:54] 10SRE, 10Cloud-VPS, 10User-aborrero: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10aborrero) [10:29:09] (03PS1) 10Muehlenhoff: profile::cumin::cloud_target: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/959179 [10:30:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb2002-dev.wikimedia.org [10:34:02] 10SRE, 10Cloud-VPS, 10User-aborrero: cloudservices1006 using 10. 
address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10aborrero) I did `update domains set master="172.20.1.5:5354 172.20.2.4:5354 185.15.56.162:5354 185.15.56.163:5354";` on the pdns DB in both cloudser... [10:35:03] 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) [10:36:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959179 (owner: 10Muehlenhoff) [10:37:53] 10SRE, 10Cloud-VPS, 10User-aborrero: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) [10:37:55] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:40:43] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:42:37] (03PS1) 10Clément Goubert: eventgate: Update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345243) [10:45:08] (03CR) 10Vgutierrez: allow to specify buffer size for backend, frontend or both (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur) [10:45:17] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:45:30] (03PS1) 10Kamila 
Součková: geo-maps: reorder codfw/eqiad in the default [dns] - 10https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474) [10:46:41] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:46:51] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (10kamila) [10:47:24] (03CR) 10Kamila Součková: [C: 04-2] "to be merged after the DC switchover" [dns] - 10https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [10:47:49] PROBLEM - Memcached on cloudweb1003 is CRITICAL: connect to address 208.80.154.150 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:48:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudweb1003.wikimedia.org [10:49:39] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:51:03] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:51:46] (03CR) 10Kamila Součková: [C: 04-2] "@bblack: please let me know in case I should reorder any not-top-level things in addition to this" [dns] - 10https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [10:52:01] RECOVERY - Memcached on cloudweb1003 is OK: TCP OK - 0.000 second response time on 208.80.154.150 port 11000 https://wikitech.wikimedia.org/wiki/Memcached [10:52:30] 
(03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices1006: stop announcing ns0.openstack.eqiad1.wikimediacloud.org [puppet] - 10https://gerrit.wikimedia.org/r/959175 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [10:52:36] (JobUnavailable) firing: (8) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:53:01] (03PS2) 10Muehlenhoff: bastion: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/952459 [10:53:17] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952459 (owner: 10Muehlenhoff) [10:55:03] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:55:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb1003.wikimedia.org [10:56:01] (03PS1) 10Ladsgroup: Add note that this repo has been archived [software/schema-changes] - 10https://gerrit.wikimedia.org/r/959183 [10:56:21] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Add note that this repo has been archived [software/schema-changes] - 10https://gerrit.wikimedia.org/r/959183 (owner: 10Ladsgroup) [10:56:27] 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): 
cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) 05In progress→03Resolved [10:56:29] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:56:37] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [10:57:36] (JobUnavailable) firing: (8) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:58:58] (03PS1) 10Clément Goubert: eventgate-logging-external: Remove CPU limit for tls-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/959184 (https://phabricator.wikimedia.org/T345243) [10:59:05] 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon) 05Open→03Resolved For reference, we ended up also having to deal with a spurious "config file changed" from openssh-server, so the rune used was of the form ` sudo cumin -b... 
[10:59:37] 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon) [11:00:41] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:02:05] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:02:20] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: remove CNAME for openstack.eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/959186 (https://phabricator.wikimedia.org/T346439) [11:02:43] (03CR) 10Clément Goubert: [C: 03+1] geo-maps: reorder codfw/eqiad in the default [dns] - 10https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [11:05:29] (03PS2) 10Clément Goubert: eventgate: Update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244) [11:05:31] (03PS2) 10Clément Goubert: eventgate-logging-external: Remove CPU limit for tls-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/959184 (https://phabricator.wikimedia.org/T345244) [11:07:06] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: introduce openstack.eqiad1 endpoint with cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/959187 (https://phabricator.wikimedia.org/T346439) [11:07:36] (JobUnavailable) firing: (7) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:10:45] (03PS1) 10Gmodena: mw-page-content-change: fix swift egress rules. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/959189 (https://phabricator.wikimedia.org/T346877) [11:10:47] (03PS3) 10Stevemunene: druid: Bring druid1009.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/959147 (https://phabricator.wikimedia.org/T336042) [11:11:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimediacloud.org: remove CNAME for openstack.eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/959186 (https://phabricator.wikimedia.org/T346439) (owner: 10Arturo Borrero Gonzalez) [11:11:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: introduce openstack.eqiad1 endpoint with cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/959187 (https://phabricator.wikimedia.org/T346439) (owner: 10Arturo Borrero Gonzalez) [11:11:32] (03CR) 10CI reject: [V: 04-1] mw-page-content-change: fix swift egress rules. [deployment-charts] - 10https://gerrit.wikimedia.org/r/959189 (https://phabricator.wikimedia.org/T346877) (owner: 10Gmodena) [11:11:34] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:13:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [11:13:41] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack.eqiad1 - aborrero@cumin1001" [11:14:04] (03CR) 10David Caro: openstack: eqiad1: introduce openstack.eqiad1 endpoint with cloudlb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959187 (https://phabricator.wikimedia.org/T346439) (owner: 10Arturo Borrero Gonzalez) [11:14:30] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: openstack.eqiad1 - aborrero@cumin1001" [11:14:30] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:17:04] !log aborrero@cumin1001 START - Cookbook sre.dns.wipe-cache openstack.eqiad1.wikimediacloud.org on all recursors [11:17:08] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) openstack.eqiad1.wikimediacloud.org on all recursors [11:18:22] (03CR) 10Btullis: "Looks good. I would do a pcc run against the new host, plus I would check whether there is any immediate impact on the LVS servers like lv" [puppet] - 10https://gerrit.wikimedia.org/r/959147 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [11:20:07] !log gmodena@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [11:20:15] !log gmodena@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [11:21:24] 10SRE, 10User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (10Volans) @aborrero was the host that you decommissioned reachable (as in, was the wipefs performed)? This is the current wipefs command that we exe... [11:24:42] (03PS1) 10Slyngshede: C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194 [11:24:58] (03PS2) 10Gmodena: mw-page-content-change: fix swift egress rules. [deployment-charts] - 10https://gerrit.wikimedia.org/r/959189 (https://phabricator.wikimedia.org/T346877) [11:25:04] (03CR) 10CI reject: [V: 04-1] C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede) [11:25:43] (03PS2) 10Slyngshede: C:IDM Rework git install to function with new repo layout. 
[puppet] - 10https://gerrit.wikimedia.org/r/959194 [11:26:07] (03CR) 10CI reject: [V: 04-1] C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede) [11:29:07] (03PS3) 10Slyngshede: C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194 [11:29:49] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43416/console" [puppet] - 10https://gerrit.wikimedia.org/r/959147 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [11:29:51] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/959189 (https://phabricator.wikimedia.org/T346877) (owner: 10Gmodena) [11:33:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudweb1004.wikimedia.org [11:33:35] 10SRE, 10User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (10aborrero) >>! In T346875#9182537, @Volans wrote: > > Could you give me the hostname of the decommissioned host so I can have a look at the logs?... [11:33:37] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43418/console" [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede) [11:37:30] (03CR) 10Fabfur: allow to specify buffer size for backend, frontend or both (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur) [11:39:32] (03CR) 10Brouberol: [V: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [11:39:57] (03PS4) 10Slyngshede: C:IDM Rework git install to function with new repo layout. 
[puppet] - 10https://gerrit.wikimedia.org/r/959194 [11:40:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb1004.wikimedia.org [11:41:12] (03CR) 10Muehlenhoff: C:idm:jobs Use bitu command for systemd jobs. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959155 (owner: 10Slyngshede) [11:42:01] (03CR) 10CI reject: [V: 04-1] C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede) [11:42:08] (03PS11) 10Brouberol: Define a script in charge of checking the kafka broker in sync status [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) [11:42:09] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:42:38] (03PS3) 10Fabfur: allow to specify buffer size for backend and frontend [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (https://phabricator.wikimedia.org/T346874) [11:43:03] (03CR) 10Gmodena: [C: 03+2] mw-page-content-change: fix swift egress rules. [deployment-charts] - 10https://gerrit.wikimedia.org/r/959189 (https://phabricator.wikimedia.org/T346877) (owner: 10Gmodena) [11:43:31] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:43:52] (03Merged) 10jenkins-bot: mw-page-content-change: fix swift egress rules.
[deployment-charts] - 10https://gerrit.wikimedia.org/r/959189 (https://phabricator.wikimedia.org/T346877) (owner: 10Gmodena) [11:44:07] (03CR) 10Brouberol: "cc-ing Brian as we were talking about similarities in design between kafka & ES, and about the fact that kafka does not give you any built" [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [11:44:33] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:45:57] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:48:17] (03PS1) 10Majavah: hieradata: fix eqiad1 rabbitmq firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/959198 (https://phabricator.wikimedia.org/T345610) [11:49:18] !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [11:49:24] !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [11:49:30] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43420/console" [puppet] - 10https://gerrit.wikimedia.org/r/959198 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah) [11:51:03] (03CR) 10David Caro: "LGTM, though pcc only changes cloudservices, not cloudcontrol, is that ok?" 
[puppet] - 10https://gerrit.wikimedia.org/r/959198 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah) [11:51:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] hieradata: fix eqiad1 rabbitmq firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/959198 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah) [11:52:14] (03PS5) 10Slyngshede: C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194 [11:54:41] (03CR) 10Majavah: [V: 03+1 C: 03+2] hieradata: fix eqiad1 rabbitmq firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/959198 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah) [11:54:43] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:55:09] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:56:07] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:56:35] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:56:45] !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [11:56:49] !log gmodena@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: 
apply [12:04:18] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43421/console" [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede) [12:04:55] (03PS1) 10Majavah: cloudlb: add hack to grant cloudcontrol1006/7 database access [puppet] - 10https://gerrit.wikimedia.org/r/959199 (https://phabricator.wikimedia.org/T346439) [12:06:10] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43422/console" [puppet] - 10https://gerrit.wikimedia.org/r/959199 (https://phabricator.wikimedia.org/T346439) (owner: 10Majavah) [12:06:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloudlb: add hack to grant cloudcontrol1006/7 database access [puppet] - 10https://gerrit.wikimedia.org/r/959199 (https://phabricator.wikimedia.org/T346439) (owner: 10Majavah) [12:06:50] (03CR) 10Majavah: [V: 03+1 C: 03+2] "This is very ugly but also very temporary." 
[puppet] - 10https://gerrit.wikimedia.org/r/959199 (https://phabricator.wikimedia.org/T346439) (owner: 10Majavah) [12:06:54] 10ops-eqiad, 10cloud-services-team (Hardware): cloudservices1004: decomission - https://phabricator.wikimedia.org/T346033 (10aborrero) a:03Jclark-ctr [12:07:20] 10ops-eqiad, 10cloud-services-team (Hardware): cloudservices1004: decomission - https://phabricator.wikimedia.org/T346033 (10aborrero) [12:07:23] (03PS1) 10Muehlenhoff: Configure ldap-rw1001/2001 as LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/959200 (https://phabricator.wikimedia.org/T331699) [12:07:25] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [12:07:28] (03PS1) 10Muehlenhoff: Extend acmechief config with new names of Bookworm hosts [puppet] - 10https://gerrit.wikimedia.org/r/959201 (https://phabricator.wikimedia.org/T331699) [12:07:39] (03PS2) 10Muehlenhoff: Extend acmechief config with new names of Bookworm hosts [puppet] - 10https://gerrit.wikimedia.org/r/959201 (https://phabricator.wikimedia.org/T331699) [12:08:31] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43423/console" [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede) [12:09:27] (03PS1) 10Muehlenhoff: Create DNS records for new LDAP Bookworm cluster [dns] - 10https://gerrit.wikimedia.org/r/959203 (https://phabricator.wikimedia.org/T331699) [12:09:41] (03PS2) 10Muehlenhoff: Create DNS records for new LDAP Bookworm cluster [dns] - 10https://gerrit.wikimedia.org/r/959203 (https://phabricator.wikimedia.org/T331699) [12:13:38] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (10aborrero) [12:14:12] (03PS1) 10Filippo Giunchedi: sre: use 'up' for 
swagger probes failures too [alerts] - 10https://gerrit.wikimedia.org/r/959206 [12:14:47] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [12:15:24] 10SRE, 10Cloud-VPS, 10User-aborrero: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10aborrero) [12:15:34] 10ops-eqiad, 10cloud-services-team (Hardware): cloudservices1004: decomission - https://phabricator.wikimedia.org/T346033 (10aborrero) [12:16:25] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:16:55] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10aborrero) [12:17:22] (03PS1) 10David Caro: cloudlb: add hack to grant cloudbackup2002.codfw.wmnet database access [puppet] - 10https://gerrit.wikimedia.org/r/959208 (https://phabricator.wikimedia.org/T346439) [12:17:30] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (10aborrero) [12:17:49] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:18:12] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10aborrero) [12:18:47] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43424/console" [puppet] - 10https://gerrit.wikimedia.org/r/959208 (https://phabricator.wikimedia.org/T346439) (owner: 10David Caro) [12:19:05] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:19:40] (03PS6) 10Slyngshede: C:IDM Rework git install to function with new repo layout. 
[puppet] - 10https://gerrit.wikimedia.org/r/959194 [12:20:02] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10aborrero) a:05Jclark-ctr→03taavi [12:20:07] (03PS2) 10Filippo Giunchedi: sre: use 'up' for swagger probes failures too [alerts] - 10https://gerrit.wikimedia.org/r/959206 (https://phabricator.wikimedia.org/T346893) [12:20:27] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:20:42] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [12:21:12] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10aborrero) [12:24:11] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43425/console" [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede) [12:25:42] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) [12:26:43] (03CR) 10Slyngshede: [V: 03+1] "Given that the VM runs nothing but the test installation of Bitu, I see little reason to keep using the virtualenv. 
This way we also ensur" [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede) [12:30:41] (03PS2) 10Muehlenhoff: Configure ldap-rw1001/2001 as LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/959200 (https://phabricator.wikimedia.org/T331699) [12:32:49] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:52] (03CR) 10Gehel: "minor comments inline. I haven't looked at the python script itself yet." [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [12:34:24] (03PS3) 10Muehlenhoff: Configure ldap-rw1001/2001 as LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/959200 (https://phabricator.wikimedia.org/T331699) [12:35:39] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:36:09] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959200 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [12:40:47] !log akosiaris@deploy2002 Started deploy [restbase/deploy@e8a6ae4]: (no justification provided) [12:41:18] !log T346354 deploy RESTBase after bug is fixed [12:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:23] T346354: restbase deploys via scap lead to all hosts being disabled in conftool - https://phabricator.wikimedia.org/T346354 [12:41:33] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:41:35] PROBLEM - 
restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:42:41] the deploy is supposed to fix these ^ once and for all [12:42:59] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:43:42] (03CR) 10Gehel: "This change is ready for review." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [12:44:23] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:45:21] !log akosiaris@deploy2002 Finished deploy [restbase/deploy@e8a6ae4]: (no justification provided) (duration: 04m 34s) [12:46:51] (03CR) 10Muehlenhoff: "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede) [12:47:19] !log akosiaris@deploy2002 Started deploy [restbase/deploy@e8a6ae4]: (no justification provided) [12:49:37] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:49:42] (03PS7) 10Slyngshede: C:IDM Rework git install to function with new repo layout. [puppet] - 10https://gerrit.wikimedia.org/r/959194 [12:49:59] (03CR) 10Slyngshede: C:IDM Rework git install to function with new repo layout. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede) [12:50:50] (03Abandoned) 10David Caro: cloudlb: add hack to grant cloudbackup2002.codfw.wmnet database access [puppet] - 10https://gerrit.wikimedia.org/r/959208 (https://phabricator.wikimedia.org/T346439) (owner: 10David Caro) [12:51:13] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:52:03] !log akosiaris@deploy2002 Finished deploy [restbase/deploy@e8a6ae4]: (no justification provided) (duration: 04m 43s) [12:52:25] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:52:37] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:52:45] !log akosiaris@deploy2002 Started deploy [restbase/deploy@e8a6ae4]: (no justification provided) [12:54:55] !log akosiaris@deploy2002 Finished deploy [restbase/deploy@e8a6ae4]: (no justification provided) (duration: 02m 10s) [12:58:33] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/958968 (owner: 10PipelineBot) [12:59:24] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/958968 (owner: 10PipelineBot) [12:59:45] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to 
accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:13] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:24] no patches in the queue :) [13:01:09] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:01:23] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:26] (03PS1) 10Slyngshede: Enable SSH key management for all users. [software/bitu] - 10https://gerrit.wikimedia.org/r/959211 [13:01:29] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (10Volans) p:05Triage→03Medium Ok, from logs I see that: ` ["lsblk --all --output 'NAME,TYPE' --pa... 
[13:01:49] (03CR) 10Urbanecm: [C: 03+2] build: Update eslint-config-wikimedia to 0.25.1 [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959014 (https://phabricator.wikimedia.org/T346629) (owner: 10Urbanecm) [13:01:57] (03CR) 10Urbanecm: [C: 03+2] Change CSS selector for Minerva mobile menu icon [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) (owner: 10Jdlrobson) [13:02:00] !log akosiaris@deploy2002 Started deploy [restbase/deploy@e8a6ae4]: (no justification provided) [13:02:18] TheresNoTime: in that case, good time to add patches [13:02:27] !log akosiaris@deploy2002 Finished deploy [restbase/deploy@e8a6ae4]: (no justification provided) (duration: 00m 27s) [13:03:16] (03PS1) 10FNegri: Package for Debian Bookworm [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762) [13:03:29] (03PS1) 10Muehlenhoff: Add new Bookworm LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/959213 (https://phabricator.wikimedia.org/T331699) [13:03:40] 10SRE, 10serviceops: Memcached, mcrouter in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki) [13:08:52] (03CR) 10Muehlenhoff: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede) [13:09:12] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959213 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [13:11:45] (03CR) 10Slyngshede: C:IDM Rework git install to function with new repo layout. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede) [13:11:47] (03CR) 10Slyngshede: [C: 03+2] C:IDM Rework git install to function with new repo layout. 
[puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede) [13:12:55] !log slyngshede@cumin1001 START - Cookbook sre.hosts.reimage for host idm-test1001.wikimedia.org with OS bookworm [13:13:01] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host idm-test1001.wikimedia.org with OS bookworm [13:14:19] (03CR) 10Muehlenhoff: [C: 03+1] C:IDM Rework git install to function with new repo layout. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959194 (owner: 10Slyngshede) [13:16:32] Reminder that we'll start locking things down in about 15 minutes for the switchover [13:18:05] (03Merged) 10jenkins-bot: build: Update eslint-config-wikimedia to 0.25.1 [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959014 (https://phabricator.wikimedia.org/T346629) (owner: 10Urbanecm) [13:18:07] (03CR) 10Herron: [C: 03+1] sre: use 'up' for swagger probes failures too [alerts] - 10https://gerrit.wikimedia.org/r/959206 (https://phabricator.wikimedia.org/T346893) (owner: 10Filippo Giunchedi) [13:18:38] claime: wdym by locking? I'm doing a MW deployment as part of the scheduled B&C window [13:18:42] do you want me to abort? [13:18:59] urbanecm: That window should have been removed, my fault [13:19:03] Finish up [13:19:16] ack, ty. i need about ~20 minutes, hopefully. [13:19:28] ack [13:20:16] We don't need to lock scap right at the beginning, we do it just to be safe, so that should be ok, but it's cutting it kinda close [13:20:35] I'll add to remove surrounding deployment windows to the scheduling doc [13:21:00] yep. i'm waiting on CI rn (it says ETA 0 min, so hopefully should merge too) and then it'll be one scap sync and that's all i have for today. 
[13:21:09] will ping once done [13:21:12] (03Merged) 10jenkins-bot: Change CSS selector for Minerva mobile menu icon [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) (owner: 10Jdlrobson) [13:21:13] ok thanks [13:21:31] (03Abandoned) 10Stevemunene: airflow-wmde: create analytics-wmde users class for wmde services [puppet] - 10https://gerrit.wikimedia.org/r/947714 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [13:21:48] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:959014|build: Update eslint-config-wikimedia to 0.25.1 (T346629)]], [[gerrit:959007|Change CSS selector for Minerva mobile menu icon (T346459)]] [13:21:56] T346629: GrowthExperiments CI fails on master: mwgate-node16-docker - https://phabricator.wikimedia.org/T346629 [13:21:56] T346459: Mobile main menu icon missing when Growth home page enabled - https://phabricator.wikimedia.org/T346459 [13:22:49] (03PS1) 10Andrew Bogott: Revert "dbproxy1018: depool clouddb1019 in favor of clouddb1015" [puppet] - 10https://gerrit.wikimedia.org/r/959018 [13:24:03] PROBLEM - Check systemd state on clouddb1019 is CRITICAL: CRITICAL - degraded: The following units failed: mariadb.service,prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:22] (03CR) 10Andrew Bogott: [C: 03+2] Revert "dbproxy1018: depool clouddb1019 in favor of clouddb1015" [puppet] - 10https://gerrit.wikimedia.org/r/959018 (owner: 10Andrew Bogott) [13:26:51] RECOVERY - Check systemd state on clouddb1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:54] !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on idm-test1001.wikimedia.org with reason: host reimage [13:27:18] (03CR) 10FNegri: Package for Debian Bookworm (034 comments) [debs/mcrouter] - 
10https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762) (owner: 10FNegri) [13:29:33] (03CR) 10Eevans: [C: 03+1] Replace yaml load() calls with safe_load() [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/959080 (owner: 10Elukey) [13:29:53] (03CR) 10Eevans: [C: 04-1] Be explicit about the yaml loader class [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/959038 (owner: 10Eevans) [13:30:35] <_joe_> jouncebot: now [13:30:35] For the next 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T1300) [13:30:36] (03Abandoned) 10Eevans: Be explicit about the yaml loader class [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/959038 (owner: 10Eevans) [13:31:14] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idm-test1001.wikimedia.org with reason: host reimage [13:31:24] (03PS2) 10Muehlenhoff: Add new Bookworm LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/959213 (https://phabricator.wikimedia.org/T331699) [13:31:54] (03CR) 10Herron: [V: 03+1 C: 03+2] dispatch::web: add ensure param and ensure => absent (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/957749 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [13:31:57] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (10kamila) [13:32:23] (03CR) 10JHathaway: [C: 03+1] Extend acmechief config with new names of Bookworm hosts [puppet] - 10https://gerrit.wikimedia.org/r/959201 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [13:32:34] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959213 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [13:32:54] (03CR) 10JHathaway: [C: 03+1] Create DNS records for new LDAP Bookworm cluster [dns] - 
10https://gerrit.wikimedia.org/r/959203 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [13:33:39] (03CR) 10JHathaway: [C: 03+1] Configure ldap-rw1001/2001 as LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/959200 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [13:33:52] (03PS1) 10Muehlenhoff: conftool: Add new LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/959215 (https://phabricator.wikimedia.org/T331699) [13:33:55] (03CR) 10Vgutierrez: [C: 03+1] Extend acmechief config with new names of Bookworm hosts [puppet] - 10https://gerrit.wikimedia.org/r/959201 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [13:34:20] (03PS1) 10Majavah: Connect eqiad1 cloudvirts to cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/959216 (https://phabricator.wikimedia.org/T346651) [13:39:38] (03PS1) 10Herron: dispatch::web: correct /usr/local/bin/dispatch ensure [puppet] - 10https://gerrit.wikimedia.org/r/959220 (https://phabricator.wikimedia.org/T344937) [13:41:13] (03PS3) 10Muehlenhoff: Add new Bookworm LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/959213 (https://phabricator.wikimedia.org/T331699) [13:41:41] (03PS1) 10Stevemunene: airflow-wmde: Remove statsd analytics-wmde user [puppet] - 10https://gerrit.wikimedia.org/r/959222 (https://phabricator.wikimedia.org/T340648) [13:42:16] !log urbanecm@deploy2002 urbanecm and jdlrobson: Backport for [[gerrit:959014|build: Update eslint-config-wikimedia to 0.25.1 (T346629)]], [[gerrit:959007|Change CSS selector for Minerva mobile menu icon (T346459)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:42:24] T346629: GrowthExperiments CI fails on master: mwgate-node16-docker - https://phabricator.wikimedia.org/T346629 [13:42:24] T346459: Mobile main menu icon missing when Growth home page 
enabled - https://phabricator.wikimedia.org/T346459 [13:43:08] (03CR) 10Herron: [C: 03+2] dispatch::web: correct /usr/local/bin/dispatch ensure [puppet] - 10https://gerrit.wikimedia.org/r/959220 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [13:43:23] !log urbanecm@deploy2002 urbanecm and jdlrobson: Continuing with sync [13:44:26] (03PS1) 10JHathaway: httpd: ensure mod commands are available [puppet] - 10https://gerrit.wikimedia.org/r/959224 (https://phabricator.wikimedia.org/T344868) [13:45:24] (03PS1) 10JHathaway: puppet agent: protect against missing client bucket path [puppet] - 10https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970) [13:46:27] (03PS1) 10JHathaway: nginx: add toggle for mounting lib on tmpfs vol [puppet] - 10https://gerrit.wikimedia.org/r/959226 (https://phabricator.wikimedia.org/T346842) [13:46:53] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:47:02] (03PS1) 10JHathaway: apt: fix use of alternative mirror [puppet] - 10https://gerrit.wikimedia.org/r/959227 [13:47:06] (03PS3) 10Kamila Součková: db: Switch DNS master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) [13:47:23] (03CR) 10Hashar: "Side track the `deployment-ssh` resource title has the dash replaced by an underscore which is reflected by changes in the catalogues:" [puppet] - 10https://gerrit.wikimedia.org/r/958962 (owner: 10Muehlenhoff) [13:47:50] (03PS1) 10JHathaway: postgresql: fix ordering on a new install [puppet] - 10https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842) [13:47:56] (03CR) 
10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959213 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [13:48:24] (03PS1) 10JHathaway: puppetdb: add ability to configure db_ro_host [puppet] - 10https://gerrit.wikimedia.org/r/959229 (https://phabricator.wikimedia.org/T346842) [13:48:26] (03CR) 10Clément Goubert: [C: 03+1] db: Switch DNS master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [13:49:04] (03PS1) 10JHathaway: prometheus-postgres-exporter: install configs before service [puppet] - 10https://gerrit.wikimedia.org/r/959230 (https://phabricator.wikimedia.org/T346842) [13:49:25] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idm-test1001.wikimedia.org with OS bookworm [13:49:30] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host idm-test1001.wikimedia.org with OS bookworm completed: - idm-test1001 (**PASS**) -... 
[13:49:48] (03PS1) 10JHathaway: puppetdb: preseed to avoid creating database users [puppet] - 10https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) [13:49:56] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet [13:49:58] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) [13:50:08] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks [13:50:17] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0) [13:50:21] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [13:50:35] (03PS1) 10JHathaway: puppetdb prometheus exporter: in a container listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/959232 (https://phabricator.wikimedia.org/T346842) [13:51:19] (03PS1) 10JHathaway: pki: disable mysql specific scripts when using sqlite [puppet] - 10https://gerrit.wikimedia.org/r/959234 (https://phabricator.wikimedia.org/T344868) [13:51:33] (03CR) 10Muehlenhoff: scap:ferm: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958962 (owner: 10Muehlenhoff) [13:51:49] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:52:07] (03PS1) 10JHathaway: puppetserver: fix perma-diff on /var/lib/puppet/ssl [puppet] - 10https://gerrit.wikimedia.org/r/959235 (https://phabricator.wikimedia.org/T337970) [13:52:35] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: 
/en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:48] (03PS1) 10JHathaway: ferm: fix ferm-status on container bullseye instances [puppet] - 10https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) [13:53:17] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: dispatch-scheduler.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:19] (03PS1) 10JHathaway: pki::multirootca: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/959237 [13:53:53] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:54:12] (03PS1) 10JHathaway: puppetserver: Serve the full cert chain via jetty [puppet] - 10https://gerrit.wikimedia.org/r/959238 (https://phabricator.wikimedia.org/T337970) [13:55:08] (03PS1) 10JHathaway: pki dev: cfssl configs for the dev env pki image [puppet] - 10https://gerrit.wikimedia.org/r/959241 (https://phabricator.wikimedia.org/T344868) [13:56:03] urbanecm: cutting it close, where is it at? 
[13:56:09] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [13:56:09] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:959014|build: Update eslint-config-wikimedia to 0.25.1 (T346629)]], [[gerrit:959007|Change CSS selector for Minerva mobile menu icon (T346459)]] (duration: 34m 21s) [13:56:12] lol [13:56:14] k [13:56:19] T346629: GrowthExperiments CI fails on master: mwgate-node16-docker - https://phabricator.wikimedia.org/T346629 [13:56:20] T346459: Mobile main menu icon missing when Growth home page enabled - https://phabricator.wikimedia.org/T346459 [13:56:33] claime: I think that's your answer. Sorry, scap was a bit slower. [13:56:34] (03CR) 10Elukey: [V: 03+2 C: 03+2] "Thanks Eric! Do you want to create the new deb change + package or should I?" [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/959080 (owner: 10Elukey) [13:56:36] !log kamila@deploy2002 Locking from deployment [ALL REPOSITORIES]: Datacenter Switchover: MediaWiki - T346474 [13:56:37] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:56:39] I'm done, thanks for waiting. 
[13:56:40] urbanecm: no worries [13:56:41] T346474: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 [13:57:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:57:20] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [13:57:31] !log kamila@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=99) [13:57:48] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [13:57:59] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:57:59] !log kamila@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=99) [13:58:09] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10LSobanski) p:05Low→03Medium [13:58:16] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959238 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [13:58:34] (03CR) 10JHathaway: [C: 03+2] pki::multirootca: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/959237 (owner: 10JHathaway) [13:58:54] (03CR) 10DCausse: Draft: cirrus streaming updater service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (owner: 10Ebernhardson) [14:00:06] kamila_: gettimeofday() says it's time for Datacenter switchover: MediaWiki. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T1400) [14:00:07] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T1400) [14:00:29] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_growthexperiments-refreshLinkRecommendations-s2.service,mediawiki_job_growthexperiments-refreshLinkRecommendations-s3.service,mediawiki_job_growthexperiments-refreshLinkRecommendations-s5.service,mediawiki_job_growthexperiments-refreshLinkRecommendations-s6.service,mediawiki_job_growthexperiments-refreshLinkRecommendati [14:00:29] ervice https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:32] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly [14:00:32] !log kamila@cumin1001 MediaWiki read-only period starts at: 2023-09-20 14:00:32.114116 [14:00:40] mwmaint alert expected [14:00:47] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: start adding per-env ZK path root [deployment-charts] - 10https://gerrit.wikimedia.org/r/957967 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking) [14:00:49] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) [14:00:51] kamila@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:01:01] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [14:01:02] stashbot failing is expected [14:01:02] kamila@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:01:03] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [14:01:31] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) [14:01:33] kamila@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[14:01:40] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [14:01:42] kamila@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:01:58] that is expected [14:02:02] <_joe_> yes lol [14:02:19] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) [14:02:21] kamila@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:02:27] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [14:02:29] kamila@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:02:30] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [14:02:32] kamila@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:02:37] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [14:02:39] kamila@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[14:02:53] !log kamila@cumin1001 MediaWiki read-only period ends at: 2023-09-20 14:02:53.790615 [14:02:53] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [14:02:58] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [14:02:59] !log kamila@cumin1001 MediaWiki read-only period ends at: 2023-09-20 14:02:59.798838 [14:02:59] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [14:03:07] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners [14:03:09] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners (exit_code=0) [14:03:17] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance [14:04:09] !log Testing [14:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:15] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:04:16] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [14:04:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:04:30] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959235 
(https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [14:04:38] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959234 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [14:04:52] (03PS2) 10Marostegui: wmnet: Update pc cnames to codfw [dns] - 10https://gerrit.wikimedia.org/r/958464 (https://phabricator.wikimedia.org/T346474) [14:04:56] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959232 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [14:05:08] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [14:05:20] \o/ [14:05:21] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959229 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [14:05:38] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959227 (owner: 10JHathaway) [14:05:49] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959226 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [14:06:05] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) [14:06:12] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl [14:06:33] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [14:06:49] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959230 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [14:06:52] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0) [14:07:03] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:07:08] (03CR) 10Kamila Součková: [C: 03+2] db: Switch DNS master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [14:07:09] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [14:07:25] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959224 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [14:07:36] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:42] !log Phase 9.5 Update DNS records for new database masters - T346474 [14:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:48] T346474: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 [14:08:31] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [14:09:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:09:31] !log kamila@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchover: MediaWiki - T346474 (duration: 12m 54s) [14:09:49] (03CR) 10JHathaway: [C: 
03+1] conftool: Add new LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/959215 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [14:09:53] (03PS1) 10Marostegui: wmnet: Update pc cnames to codfw [dns] - 10https://gerrit.wikimedia.org/r/959243 (https://phabricator.wikimedia.org/T346474) [14:10:13] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters [14:10:15] (03CR) 10Marostegui: "Required a manual rebase and I am lazy so: https://gerrit.wikimedia.org/r/c/operations/dns/+/959243/" [dns] - 10https://gerrit.wikimedia.org/r/958464 (https://phabricator.wikimedia.org/T346474) (owner: 10Marostegui) [14:10:23] (03Abandoned) 10Marostegui: wmnet: Update pc cnames to codfw [dns] - 10https://gerrit.wikimedia.org/r/958464 (https://phabricator.wikimedia.org/T346474) (owner: 10Marostegui) [14:13:50] (03CR) 10Marostegui: [C: 03+2] wmnet: Update pc cnames to codfw [dns] - 10https://gerrit.wikimedia.org/r/959243 (https://phabricator.wikimedia.org/T346474) (owner: 10Marostegui) [14:15:34] (03PS2) 10JHathaway: postgresql: fix ordering on a new install [puppet] - 10https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842) [14:15:42] (03CR) 10Muehlenhoff: ferm: fix ferm-status on container bullseye instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [14:15:55] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [14:16:03] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) The banners were set and read: someone took the opportunity to [[ https://meta.wikimedia.org/w/inde... 
[14:16:06] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (10kamila) [14:16:27] (03CR) 10Kamila Součková: [C: 03+2] debug.json: List primary DC servers first [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957996 (https://phabricator.wikimedia.org/T346472) (owner: 10Kamila Součková) [14:16:41] (03CR) 10JHathaway: [C: 03+1] Add new Bookworm LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/959213 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [14:16:45] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0) [14:17:04] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover: list new primary DC servers first in debug.json - https://phabricator.wikimedia.org/T346472 (10kamila) [14:17:36] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:32] (03CR) 10JHathaway: ferm: fix ferm-status on container bullseye instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [14:19:52] (03CR) 10BBlack: [C: 03+1] "Looks great, nice work!" 
[software/purged] - 10https://gerrit.wikimedia.org/r/959050 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur) [14:21:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:21:14] (03CR) 10Muehlenhoff: ferm: fix ferm-status on container bullseye instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [14:22:22] (03CR) 10Vgutierrez: [C: 03+1] "LGTM!" [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur) [14:22:36] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:23:21] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ssh-gitlab.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:23] (03CR) 10Fabfur: [C: 03+2] allow to specify buffer size for backend and frontend (032 comments) [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur) [14:23:28] (03PS2) 10JHathaway: puppetdb: preseed to avoid creating database users [puppet] - 10https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) [14:23:58] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) 
[14:24:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Connect eqiad1 cloudvirts to cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/959216 (https://phabricator.wikimedia.org/T346651) (owner: 10Majavah) [14:25:52] !log bking@cumin1001 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2044 for high load - bking@cumin1001 [14:25:52] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic2044 for high load - bking@cumin1001 [14:26:03] !log bking@cumin1001 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2044.codfw.wmnet for high load - bking@cumin1001 [14:26:05] (03PS17) 10Brouberol: Define a script in charge of checking the kafka broker in sync status [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) [14:26:07] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2044.codfw.wmnet for high load - bking@cumin1001 [14:26:49] (03CR) 10Fabfur: [V: 03+2 C: 03+2] allow to specify buffer size for backend and frontend [software/purged] - 10https://gerrit.wikimedia.org/r/959050 (https://phabricator.wikimedia.org/T346874) (owner: 10Fabfur) [14:27:05] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:14] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10MoritzMuehlenhoff) The rotation/compression appears to work fine and usual day chunks are in the 2.5G ballpark, was there any unusual extra traffic which made it spike t... 
[14:28:25] (03CR) 10Fabfur: [C: 03+2] "trivial change" [software/purged] - 10https://gerrit.wikimedia.org/r/959165 (owner: 10Fabfur) [14:28:29] (03CR) 10JHathaway: ferm: fix ferm-status on container bullseye instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [14:28:31] (03CR) 10Fabfur: [V: 03+2 C: 03+2] makefile: do not fail on already removed files [software/purged] - 10https://gerrit.wikimedia.org/r/959165 (owner: 10Fabfur) [14:30:58] (03PS2) 10Kamila Součková: wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002 [dns] - 10https://gerrit.wikimedia.org/r/958472 (https://phabricator.wikimedia.org/T346474) [14:31:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:31:34] (03CR) 10Kamila Součková: [C: 03+2] wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002 [dns] - 10https://gerrit.wikimedia.org/r/958472 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [14:34:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:35:46] (03CR) 10Hashar: scap:ferm: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958962 (owner: 10Muehlenhoff) [14:35:57] !log update maintenance.eqiad.wmnet to point to mwmaint2002 [14:36:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:23] (03PS1) 10Clément Goubert: mw-api-ext: Raise main replicas to 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/959248 [14:38:59] (03PS1) 10JMeybohm: k8s: Scrape controller-manager and scheduler metrics [puppet] - 10https://gerrit.wikimedia.org/r/959249 (https://phabricator.wikimedia.org/T324959) [14:39:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:39:23] (03CR) 10CI reject: [V: 04-1] k8s: Scrape controller-manager and scheduler metrics [puppet] - 10https://gerrit.wikimedia.org/r/959249 (https://phabricator.wikimedia.org/T324959) (owner: 10JMeybohm) [14:40:31] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (10kamila) [14:41:59] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [14:42:36] (03PS2) 10Kamila Součková: geo-maps: reorder codfw/eqiad in the default [dns] - 10https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474) [14:44:11] (03PS1) 10Majavah: O:wmcs::openstack::instance_backups: enable cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/959252 [14:44:38] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1005 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [14:44:54] !log cmooney@cumin1001 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new cloud-private records - cmooney@cumin1001" [14:45:03] (03PS2) 10JMeybohm: k8s: Scrape controller-manager and scheduler metrics [puppet] - 10https://gerrit.wikimedia.org/r/959249 (https://phabricator.wikimedia.org/T324959) [14:45:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] O:wmcs::openstack::instance_backups: enable cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/959252 (owner: 10Majavah) [14:45:37] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new cloud-private records - cmooney@cumin1001" [14:45:37] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:45:38] (03CR) 10Majavah: [C: 03+2] O:wmcs::openstack::instance_backups: enable cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/959252 (owner: 10Majavah) [14:48:49] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43429/console" [puppet] - 10https://gerrit.wikimedia.org/r/959249 (https://phabricator.wikimedia.org/T324959) (owner: 10JMeybohm) [14:54:41] Can we continue with helm deployments after the switchover or still things are in flight? 
I would like to deploy some changes on wikifeeds [14:56:27] (03CR) 10Elukey: [C: 03+1] k8s: Scrape controller-manager and scheduler metrics [puppet] - 10https://gerrit.wikimedia.org/r/959249 (https://phabricator.wikimedia.org/T324959) (owner: 10JMeybohm) [14:56:57] (03CR) 10Vgutierrez: vrts: add ticket-cert.crt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959026 (owner: 10AOkoth) [14:57:00] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:18] nemo-yiannis: please proceed :) [14:58:15] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [14:58:24] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:59:03] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [15:00:03] (03CR) 10Muehlenhoff: scap:ferm: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958962 (owner: 10Muehlenhoff) [15:02:45] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [15:02:48] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [15:03:23] !log bking@cumin1001 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw [15:03:25] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw [15:04:42] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:05:10] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [15:05:49] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: 
apply [15:06:04] (03PS1) 10Fabfur: Release 0.21 [software/purged] - 10https://gerrit.wikimedia.org/r/959255 [15:06:16] (03CR) 10CI reject: [V: 04-1] Release 0.21 [software/purged] - 10https://gerrit.wikimedia.org/r/959255 (owner: 10Fabfur) [15:06:56] !log brouberol@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons. [15:08:19] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [15:08:56] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [15:09:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:09:15] !log added Taavi and Effie (new key) to pwstore [15:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:49] (03PS1) 10AOkoth: ticket-test: add dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/959256 [15:12:17] 10SRE, 10Continuous-Integration-Infrastructure: Puppet package_builder module should have the apt cache auto cleaned - https://phabricator.wikimedia.org/T339251 (10hashar) 05Open→03Resolved After a quick check on `integration-agent-pkgbuilder-1001` and `integration-agent-pkgbuilder-1002` it looks like the... 
[15:12:33] (03PS1) 10JMeybohm: prometheus::k8s: Scape scheduler and controller-manager [puppet] - 10https://gerrit.wikimedia.org/r/959257 (https://phabricator.wikimedia.org/T324959) [15:12:42] (03CR) 10Vgutierrez: [C: 03+1] ticket-test: add dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/959256 (owner: 10AOkoth) [15:13:18] (03CR) 10AOkoth: [C: 03+2] ticket-test: add dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/959256 (owner: 10AOkoth) [15:13:25] (03CR) 10AOkoth: [V: 03+2 C: 03+2] ticket-test: add dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/959256 (owner: 10AOkoth) [15:14:09] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:16:20] (03CR) 10Vgutierrez: [C: 04-1] vrts: add ticket-cert.crt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959026 (owner: 10AOkoth) [15:16:33] (03CR) 10Elukey: [C: 03+1] decorators: fix set_tries [software/spicerack] - 10https://gerrit.wikimedia.org/r/956972 (https://phabricator.wikimedia.org/T346134) (owner: 10Volans) [15:16:39] PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: superset.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:09] (MediaWikiLatencyExceeded) firing: Average latency high: codfw mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:19:15] (03PS2) 10Fabfur: Release 0.21+deb11u1 for debian bullseye import [software/purged] - 10https://gerrit.wikimedia.org/r/959255 [15:20:53] (03CR) 10CI reject: [V: 04-1] Release 0.21+deb11u1 for debian bullseye import [software/purged] - 
10https://gerrit.wikimedia.org/r/959255 (owner: 10Fabfur) [15:22:37] PROBLEM - puppet last run on an-tool1005 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:23:09] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:23:22] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s: Scrape controller-manager and scheduler metrics [puppet] - 10https://gerrit.wikimedia.org/r/959249 (https://phabricator.wikimedia.org/T324959) (owner: 10JMeybohm) [15:23:30] (03PS3) 10Fabfur: Release 0.21+deb11u1 for debian bullseye import [software/purged] - 10https://gerrit.wikimedia.org/r/959255 [15:24:03] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:24:09] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:25:22] (03PS1) 10AOkoth: ticket-test: add all dummy files [labs/private] - 10https://gerrit.wikimedia.org/r/959259 [15:26:48] (03PS1) 10Clément Goubert: mw-web: Raise main replicas to 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/959260 [15:27:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] mw-web: Raise main replicas to 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/959260 (owner: 10Clément Goubert) [15:27:57] (03CR) 10Clément 
Goubert: [C: 03+2] mw-web: Raise main replicas to 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/959260 (owner: 10Clément Goubert) [15:28:47] (03Merged) 10jenkins-bot: mw-web: Raise main replicas to 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/959260 (owner: 10Clément Goubert) [15:29:09] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:29:29] (03CR) 10Vgutierrez: [C: 04-2] "fix the commit tree?" [software/purged] - 10https://gerrit.wikimedia.org/r/959255 (owner: 10Fabfur) [15:29:32] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [15:29:39] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [15:29:42] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [15:29:57] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [15:30:01] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [15:30:10] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [15:32:11] (03CR) 10Volans: [C: 03+2] decorators: fix set_tries [software/spicerack] - 10https://gerrit.wikimedia.org/r/956972 (https://phabricator.wikimedia.org/T346134) (owner: 10Volans) [15:33:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:33:52] I'm getting a swarm of HTTP 400 responses on mowiki. Can anyone help me figure out what's causing it? 
[15:35:09] (03PS11) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 [15:35:45] (03PS12) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 [15:36:04] action=sitematrix on mo.wikipedia.org is failing with 400 responses. [15:36:57] (03Merged) 10jenkins-bot: decorators: fix set_tries [software/spicerack] - 10https://gerrit.wikimedia.org/r/956972 (https://phabricator.wikimedia.org/T346134) (owner: 10Volans) [15:38:14] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:41:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:43:17] <_joe_> Skynet: mo.wikipedia now redirects to ro.wikipedia I'd say? [15:43:29] <_joe_> not sure if that's new [15:43:36] https://mo.wikipedia.org/w/api.php?action=sitematrix&format=json&smtype=language&smstate=all&smlangprop=code%7Cname%7Csite%7Cdir%7Clocalname&smsiteprop=url%7Cdbname%7Ccode%7Csitename%7Clang&smlimit=max&formatversion=2 [15:43:50] Doesn't seem to redirect properly then. [15:44:03] And is fairly new since I haven't seen this before. [15:44:24] Has the wiki changed from mowiki to rowiki now? [15:45:02] <_joe_> can you open a task, please? 
I don't think we can help you here [15:45:56] (03PS1) 10AOkoth: vrts: add ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959272 (https://phabricator.wikimedia.org/T340027) [15:46:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:46:14] (03PS2) 10AOkoth: vrts: add ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959272 (https://phabricator.wikimedia.org/T340027) [15:46:18] mowiki has been redirecting to ro.wikipedia.org since 2019 [15:46:33] no, more. [15:46:55] Well it broke recently. Seems like it's not redirecting properly anymore. [15:47:03] https://phabricator.wikimedia.org/T169450 [15:47:06] 2017 [15:47:15] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:47:18] <_joe_> Skynet: again it's a software bug maybe, please open a task [15:47:27] (03CR) 10Vgutierrez: [C: 03+1] vrts: add ticket-test cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959272 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [15:48:05] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43432/console" [puppet] - 10https://gerrit.wikimedia.org/r/959257 (https://phabricator.wikimedia.org/T324959) (owner: 10JMeybohm) [15:48:09] (03PS1) 10Elukey: Revert "Lower ores.wikimedia.org's TTL to 5M" [dns] - 10https://gerrit.wikimedia.org/r/959286 [15:48:17] 
(03PS2) 10Elukey: Revert "Lower ores.wikimedia.org's TTL to 5M" [dns] - 10https://gerrit.wikimedia.org/r/959286 [15:48:37] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:48:43] (03CR) 10AOkoth: [C: 03+2] vrts: add ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959272 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [15:49:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/959257 (https://phabricator.wikimedia.org/T324959) (owner: 10JMeybohm) [15:50:09] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] prometheus::k8s: Scape scheduler and controller-manager [puppet] - 10https://gerrit.wikimedia.org/r/959257 (https://phabricator.wikimedia.org/T324959) (owner: 10JMeybohm) [15:51:14] (03PS7) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) [15:51:42] (03CR) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [15:52:57] (03PS1) 10Cwhite: prometheus: add service_name_override parameter [puppet] - 10https://gerrit.wikimedia.org/r/958973 (https://phabricator.wikimedia.org/T346893) [15:53:04] (03CR) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [15:55:29] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:57:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:57:53] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:59:09] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:59:19] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:59:20] (03Abandoned) 10Elukey: modules: copy configuration 1.4.1 to 1.5.0 for mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/956440 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [15:59:24] (03Abandoned) 10Elukey: modules: add configuration 1.5.0 to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/956441 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [16:02:21] (03PS2) 10Cwhite: prometheus: add service_name_override parameter [puppet] - 10https://gerrit.wikimedia.org/r/958973 (https://phabricator.wikimedia.org/T346893) [16:02:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:03:07] (03PS1) 10Elukey: modules: copy mesh:configuration 1.4.1 to 1.4.2 to facilitate reviews [deployment-charts] - 10https://gerrit.wikimedia.org/r/959279 (https://phabricator.wikimedia.org/T346638) [16:03:09] (03PS1) 10Elukey: modules: rename uses_ingress to uses_sni in 
mesh:configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/959280 (https://phabricator.wikimedia.org/T346638) [16:04:09] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:08:11] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Urbanecm) [16:09:59] (03CR) 10JMeybohm: [C: 03+1] profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [16:11:26] (03PS1) 10Elukey: ml-services: upgrade docker images for revscoring-based isvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/959309 (https://phabricator.wikimedia.org/T346445) [16:12:13] (03CR) 10JMeybohm: [C: 03+1] modules: rename uses_ingress to uses_sni in mesh:configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/959280 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [16:12:51] (03CR) 10Elukey: [C: 03+2] profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [16:14:34] (03PS1) 10Ilias Sarantopoulos: ml-services: fix memory leak in revscoring servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/959310 (https://phabricator.wikimedia.org/T346445) [16:19:24] (03Abandoned) 10Elukey: ml-services: upgrade docker images for revscoring-based isvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/959309 (https://phabricator.wikimedia.org/T346445) (owner: 10Elukey) [16:20:05] (03CR) 10Elukey: [C: 03+2] ml-services: fix memory leak in revscoring servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/959310 (https://phabricator.wikimedia.org/T346445) 
(owner: 10Ilias Sarantopoulos) [16:21:08] (03CR) 10Elukey: [C: 03+2] modules: copy mesh:configuration 1.4.1 to 1.4.2 to facilitate reviews [deployment-charts] - 10https://gerrit.wikimedia.org/r/959279 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [16:21:15] (03PS2) 10Elukey: modules: rename uses_ingress to uses_sni in mesh:configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/959280 (https://phabricator.wikimedia.org/T346638) [16:24:30] (03CR) 10Klausman: [C: 03+1] Revert "Lower ores.wikimedia.org's TTL to 5M" [dns] - 10https://gerrit.wikimedia.org/r/959286 (owner: 10Elukey) [16:24:32] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [16:24:40] (03CR) 10Elukey: [C: 03+2] modules: rename uses_ingress to uses_sni in mesh:configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/959280 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [16:25:37] (03CR) 10Klausman: [C: 03+2] Revert "Lower ores.wikimedia.org's TTL to 5M" [dns] - 10https://gerrit.wikimedia.org/r/959286 (owner: 10Elukey) [16:26:28] !log pushing revert of ORES TTL change [16:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:55] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [16:28:11] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [16:29:13] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[16:31:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:31:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:32:29] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Urbanecm) Hi @LSobanski, @taavi mentioned to me privately that if we want the stewards machine to run `ircservserv`, as di... 
[16:36:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:36:38] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:39:17] (03CR) 10AOkoth: [C: 03+2] ticket-test: add all dummy files [labs/private] - 10https://gerrit.wikimedia.org/r/959259 (owner: 10AOkoth) [16:39:36] (03CR) 10AOkoth: [V: 03+2 C: 03+2] ticket-test: add all dummy files [labs/private] - 10https://gerrit.wikimedia.org/r/959259 (owner: 10AOkoth) [16:45:26] 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search: 2023-09-20 Elasticsearch unavailable incident ticket - https://phabricator.wikimedia.org/T346945 (10bking) [16:47:57] (03PS2) 10JHathaway: puppetdb: add ability to configure db_ro_host [puppet] - 10https://gerrit.wikimedia.org/r/959229 (https://phabricator.wikimedia.org/T346842) [16:48:19] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959229 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [16:48:44] 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10bking) [16:48:52] (03CR) 10FNegri: "This successfully builds a .deb package, if that package works as expected is harder to say. 
:D" [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762) (owner: 10FNegri) [16:53:09] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:54:35] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:55:47] (03PS1) 10David Caro: d/changelog: bump version [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959316 [16:57:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T1700) [17:00:11] (03PS3) 10Herron: titan: add pyrra/slo envoy/cfssl config [puppet] - 10https://gerrit.wikimedia.org/r/956902 [17:02:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:02:23] (03PS7) 10Herron: dispatch: remove puppetization [puppet] - 10https://gerrit.wikimedia.org/r/957756 
(https://phabricator.wikimedia.org/T344937) [17:02:50] (03PS8) 10Herron: dispatch: remove puppetization [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) [17:03:27] (PrometheusRuleEvaluationFailures) firing: Prometheus rule evaluation failures (instance titan2001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [17:03:55] (03CR) 10David Caro: Package for Debian Bookworm (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762) (owner: 10FNegri) [17:08:27] (PrometheusRuleEvaluationFailures) resolved: Prometheus rule evaluation failures (instance titan2001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [17:09:40] (03PS3) 10AOkoth: ats: add ticket-test [puppet] - 10https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027) [17:09:44] (03PS13) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 [17:09:46] (03PS1) 10AOkoth: ssl: update ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959317 [17:10:04] (03PS2) 10AOkoth: ssl: update ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959317 [17:12:36] (03CR) 10AOkoth: [C: 03+2] ssl: update ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959317 (owner: 10AOkoth) [17:12:39] (03CR) 10AOkoth: [V: 03+2 C: 03+2] ssl: update ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959317 (owner: 10AOkoth) [17:16:47] RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:33] (03CR) 10Herron: [C: 03+1] "caught one already! https://phabricator.wikimedia.org/T346950" [alerts] - 10https://gerrit.wikimedia.org/r/958929 (owner: 10Filippo Giunchedi) [17:21:09] RECOVERY - puppet last run on an-tool1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:23:11] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:26:01] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:30:27] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:31:53] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:35:11] (03PS4) 10AOkoth: ats: add ticket-test [puppet] - 10https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027) [17:35:13] (03PS14) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 [17:35:16] (03PS1) 10AOkoth: ssl: update ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959318 [17:35:45] (03CR) 10AOkoth: [V: 03+2 C: 03+2] ssl: update ticket-test cert [puppet] - 
10https://gerrit.wikimedia.org/r/959318 (owner: 10AOkoth) [17:35:53] (03PS2) 10AOkoth: ssl: update ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959318 [17:35:59] (03CR) 10AOkoth: [V: 03+2] ssl: update ticket-test cert [puppet] - 10https://gerrit.wikimedia.org/r/959318 (owner: 10AOkoth) [17:38:12] 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10bking) [17:38:23] (03CR) 10Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/958973/43435/" [puppet] - 10https://gerrit.wikimedia.org/r/958973 (https://phabricator.wikimedia.org/T346893) (owner: 10Cwhite) [17:41:15] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:42:39] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:43:40] (03PS4) 10AOkoth: vrts: add ticket-test on wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/957747 (https://phabricator.wikimedia.org/T340027) [17:43:59] (03CR) 10AOkoth: [C: 03+2] vrts: add ticket-test on wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/957747 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [17:46:17] (03CR) 10AOkoth: [V: 03+2 C: 03+2] vrts: add ticket-test on wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/957747 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [17:47:08] (03Abandoned) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 (owner: 10AOkoth) [17:48:28] (03PS5) 10AOkoth: ats: add ticket-test [puppet] - 
10https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027) [17:59:48] (03PS1) 10Samtar: .well-known: Add F-Droid signature to assetlinks.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959327 (https://phabricator.wikimedia.org/T346951) [18:00:05] brennen and jnuche: OwO what's this, a deployment window?? Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T1800). nyaa~ [18:00:05] brennen and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T1800). [18:01:42] o/ [18:01:58] (03Abandoned) 10Fabfur: Release 0.21+deb11u1 for debian bullseye import [software/purged] - 10https://gerrit.wikimedia.org/r/959255 (owner: 10Fabfur) [18:02:18] !log train 1.41.0-wmf.27 (T345888): no current blockers, logs clean, rolling to group1 [18:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:24] T345888: 1.41.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T345888 [18:05:38] (03PS1) 10Fabfur: Release 0.21+deb11u1 for bullseye [software/purged] - 10https://gerrit.wikimedia.org/r/959328 [18:06:15] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959329 (https://phabricator.wikimedia.org/T345888) [18:06:17] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959329 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot) [18:07:00] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959329 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot) [18:07:05] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: 
httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:09:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:10:01] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:12:13] i note that PHPFPMTooBusy seems to be a recurring thing with deploys currently. [18:14:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:14:34] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.27 refs T345888 [18:14:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:14:41] T345888: 1.41.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T345888 [18:19:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:21:53] !log brennen@deploy2002 Synchronized php: group1 wikis to 1.41.0-wmf.27 refs T345888 (duration: 07m 17s) [18:22:00] T345888: 1.41.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T345888 [18:23:09] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:26:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:26:18] 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search, 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10Aklapper) [18:31:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:34:57] PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:36:04] (03CR) 10Hashar: [C: 03+1] scap:ferm: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958962 (owner: 10Muehlenhoff) [18:36:21] RECOVERY - nova-compute proc minimum on 
cloudvirt-wdqs1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:57:33] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:58:59] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:04:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:41] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:12:25] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:13:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:14:45] (03CR) 10Dbrant: [C: 04-1] "Nice, thanks! Google's documentation seems to be a little ambiguous about this, but it looks like some people have reported difficulties w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959327 (https://phabricator.wikimedia.org/T346951) (owner: 10Samtar) [19:15:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:24:58] !log bearloga@deploy2002 Started deploy [airflow-dags/analytics_product@80496b8]: (no justification provided) [19:25:07] !log bearloga@deploy2002 Finished deploy [airflow-dags/analytics_product@80496b8]: (no justification provided) (duration: 00m 09s) [19:26:40] !log bearloga@deploy2002 Started deploy [airflow-dags/analytics_product@80496b8]: (no justification provided) [19:26:46] !log bearloga@deploy2002 Finished deploy [airflow-dags/analytics_product@80496b8]: (no justification provided) (duration: 00m 05s) [19:46:15] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:47:41] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:52:09] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline 
resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:53:35] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:02:21] indeed [20:10:43] 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10Htriedman) > Some of them are just artifacts of starting from a fork of one of the legacy services. For example, we'll want to adop... 
[20:20:53] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:21:01] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:37:49] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:39:15] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:42:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:45:07] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:46:31] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:47:09] (PHPFPMTooBusy) resolved: Not 
enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:00:06] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T2100) [21:03:41] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:04:21] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:14:01] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 144, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:14:07] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:44:39] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:46:03] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[21:48:43] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:50:07] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:00:59] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:02:25] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:13:55] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:15:19] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:19:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:23:09] (JobUnavailable) firing: (6) Reduced 
availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:23:46] (03PS1) 10Varnent: [foundationwiki] Grant translation admin rights to 'sysop' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959354 (https://phabricator.wikimedia.org/T346187) [22:24:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:24:26] (03CR) 10CI reject: [V: 04-1] [foundationwiki] Grant translation admin rights to 'sysop' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959354 (https://phabricator.wikimedia.org/T346187) (owner: 10Varnent) [22:30:07] (03PS2) 10Varnent: [foundationwiki] Grant translation admin rights to 'sysop' and 'global-sysop' groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959354 (https://phabricator.wikimedia.org/T346187) [22:30:47] (03CR) 10CI reject: [V: 04-1] [foundationwiki] Grant translation admin rights to 'sysop' and 'global-sysop' groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959354 (https://phabricator.wikimedia.org/T346187) (owner: 10Varnent) [22:48:57] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:50:23] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:50:35] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:52:01] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:59:17] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:00:43] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:23:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:23:52] (03PS1) 10Jclark-ctr: Add new servers db1226-33 [puppet] - 10https://gerrit.wikimedia.org/r/959359 (https://phabricator.wikimedia.org/T342176) [23:26:43] (03PS2) 10Jclark-ctr: Add new servers db1226-33 [puppet] - 10https://gerrit.wikimedia.org/r/959359 (https://phabricator.wikimedia.org/T342176) [23:28:10] (03PS3) 10Jclark-ctr: Add new servers db1226-33 [puppet] - 
10https://gerrit.wikimedia.org/r/959359 (https://phabricator.wikimedia.org/T342176) [23:28:16] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:29:13] (03CR) 10Jclark-ctr: [C: 03+2] Add new servers db1226-33 [puppet] - 10https://gerrit.wikimedia.org/r/959359 (https://phabricator.wikimedia.org/T342176) (owner: 10Jclark-ctr) [23:44:05] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [23:47:39] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt pc1016 - jclark@cumin1001" [23:48:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt pc1016 - jclark@cumin1001" [23:48:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:48:54] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host pc1016 [23:49:07] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host pc1016 [23:49:09] !log jclark@cumin1001 END (ERROR) - Cookbook sre.network.configure-switch-interfaces (exit_code=97) for host pc1016 [23:49:13] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host pc1015 [23:49:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc1015 [23:50:03] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED [23:50:04] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc1016 
[23:50:06] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED [23:50:39] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host pc1016.mgmt.eqiad.wmnet with reboot policy FORCED [23:51:15] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED [23:51:18] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED [23:53:24] (03CR) 10RLazarus: [C: 03+1] "I haven't tested this but LGTM in principle. Thanks for this." [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/956878 (https://phabricator.wikimedia.org/T346144) (owner: 10Elukey) [23:54:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10Jclark-ctr) @VRiley-WMF Serial was entered into netbox incorrectly. If you are not onsite, sometimes you can look at the procurement ticket packing slip that is attached. [23:54:55] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1016.mgmt.eqiad.wmnet with reboot policy FORCED [23:55:15] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host pc1016.mgmt.eqiad.wmnet with reboot policy FORCED