[00:04:47] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-31, tools-k8s-worker-nfs-32, tools-k8s-worker-nfs-33, tools-k8s-worker-nfs-36 (T359641) [00:04:50] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [00:09:11] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-60, tools-k8s-worker-nfs-61, tools-k8s-worker-nfs-62, tools-k8s-worker-nfs-63 (T359641) [00:09:18] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-38, tools-k8s-worker-nfs-46, tools-k8s-worker-nfs-49, tools-k8s-worker-nfs-50 (T359641) [00:10:09] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-50, tools-k8s-worker-nfs-56, tools-k8s-worker-nfs-57, tools-k8s-worker-nfs-6 (T359641) [00:10:13] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [00:13:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [00:19:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [00:23:46] 10Tool-schedule-deployment: ScheduleDeploymentBot edits being marked as bot means they break watchlisting of the Deployments page - https://phabricator.wikimedia.org/T374735#10151256 (10bd808) Apparently [[https://github.com/mwclient/mwclient/blob/8fee0dfa4ec1322f0d4824034752bcea9fb26e26/mwclient/page.py#L203|mw... [00:26:54] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-50, tools-k8s-worker-nfs-56, tools-k8s-worker-nfs-57, tools-k8s-worker-nfs-6 (T359641) [00:26:59] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [00:30:03] 10Tool-schedule-deployment: ScheduleDeploymentBot edits being marked as bot means they break watchlisting of the Deployments page - https://phabricator.wikimedia.org/T374735#10151267 (10bd808) p:05Triage→03Medium My local working tree is full of a half done implementation for {T372059}, so for now I'll just... [00:30:18] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-60, tools-k8s-worker-nfs-61, tools-k8s-worker-nfs-62, tools-k8s-worker-nfs-63 (T359641) [00:31:42] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-21 from 1.26.15 to 1.27.16 (T359641) [00:32:38] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-21 from 1.26.15 to 1.27.16 (T359641) [00:32:39] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-22 from 1.26.15 to 1.27.16 (T359641) [00:32:42] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [00:33:37] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-22 from 1.26.15 to 1.27.16 (T359641) [00:33:38] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-26 from 1.26.15 to 1.27.16 (T359641) [00:34:38] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-26 from 1.26.15 to 1.27.16 (T359641) [00:34:39] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-27 from 1.26.15 to 1.27.16 (T359641) [00:34:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [00:35:36] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-27 from 1.26.15 to 1.27.16 (T359641) [00:35:37] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-28 from 1.26.15 to 1.27.16 (T359641) [00:38:30] (03open) 10bd808: mediawiki: Don't mark edits as bot edits [toolforge-repos/schedule-deployment] - 10https://gitlab.wikimedia.org/toolforge-repos/schedule-deployment/-/merge_requests/11 (https://phabricator.wikimedia.org/T374735) [00:40:12] 10Tool-schedule-deployment, 13Patch-For-Review: ScheduleDeploymentBot edits being marked as bot means they break watchlisting of the Deployments page - https://phabricator.wikimedia.org/T374735#10151295 (10bd808) a:03bd808 [00:41:28] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=99) for node tools-k8s-worker-nfs-28 from 1.26.15 to 1.27.16 (T359641) [00:41:29] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-29 from 1.26.15 to 1.27.16 (T359641) [00:41:32] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [00:41:51] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-28 (T359641) [00:47:05] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-28 (T359641) [00:47:09] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [00:47:20] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=99) for node tools-k8s-worker-nfs-29 from 1.26.15 to 1.27.16 (T359641) [00:47:21] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-30 from 1.26.15 to 1.27.16 (T359641) [00:47:32] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-29 (T359641) [00:52:46] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-29 (T359641) [00:52:50] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [00:53:13] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=99) for node tools-k8s-worker-nfs-30 from 1.26.15 to 1.27.16 (T359641) [00:53:15] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-31 from 1.26.15 to 1.27.16 (T359641) [00:53:27] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-30 (T359641) [00:58:42] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-30 (T359641) [00:58:46] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [00:59:06] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=99) for node tools-k8s-worker-nfs-31 from 1.26.15 to 1.27.16 (T359641) [00:59:07] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-32 from 1.26.15 to 1.27.16 (T359641) [00:59:21] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-31 (T359641) [01:00:01] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-32 from 1.26.15 to 1.27.16 (T359641) [01:00:02] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-36 from 1.26.15 to 1.27.16 (T359641) [01:00:57] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-36 from 1.26.15 to 1.27.16 (T359641) [01:00:58] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-38 from 1.26.15 to 1.27.16 (T359641) [01:01:56] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-38 from 1.26.15 to 1.27.16 (T359641) [01:01:57] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-46 from 1.26.15 to 1.27.16 (T359641) [01:02:51] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-46 from 1.26.15 to 1.27.16 (T359641) [01:02:52] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-49 from 1.26.15 to 1.27.16 (T359641) [01:04:37] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-31 (T359641) [01:04:41] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [01:08:39] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=99) for node tools-k8s-worker-nfs-49 from 1.26.15 to 1.27.16 (T359641) [01:08:40] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-50 from 1.26.15 to 1.27.16 (T359641) [01:09:24] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-49 (T359641) [01:09:37] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-50 from 1.26.15 to 1.27.16 (T359641) [01:09:38] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-56 from 1.26.15 to 1.27.16 (T359641) [01:09:43] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [01:14:34] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster (T359641) [01:14:40] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-49 (T359641) [01:15:26] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=99) for node tools-k8s-worker-nfs-56 from 1.26.15 to 1.27.16 (T359641) [01:15:27] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-57 from 1.26.15 to 1.27.16 (T359641) [01:15:30] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [01:15:57] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-56 (T359641) [01:16:25] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-57 from 1.26.15 to 1.27.16 (T359641) [01:16:26] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-6 from 1.26.15 to 1.27.16 (T359641) [01:21:14] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-56 (T359641) [01:21:18] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [01:22:16] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=99) for node tools-k8s-worker-nfs-6 from 1.26.15 to 1.27.16 (T359641) [01:22:19] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-60 from 1.26.15 to 1.27.16 (T359641) [01:23:02] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-6 (T359641) [01:23:09] !log raymond-ndibe@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-66.tools.eqiad1.wikimedia.cloud to the cluster [01:23:09] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [01:23:41] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster (T359641) [01:28:02] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=99) for node tools-k8s-worker-nfs-60 from 1.26.15 to 1.27.16 (T359641) [01:28:03] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-61 from 1.26.15 to 1.27.16 (T359641) [01:28:06] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [01:28:16] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-6 (T359641) [01:28:43] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-60 (T359641) [01:28:59] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-61 from 1.26.15 to 1.27.16 (T359641) [01:29:00] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-62 from 1.26.15 to 1.27.16 (T359641) [01:32:45] !log raymond-ndibe@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-67.tools.eqiad1.wikimedia.cloud to the cluster [01:32:45] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [01:33:17] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster (T359641) [01:33:20] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [01:33:59] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-60 (T359641) [01:34:53] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=99) for node tools-k8s-worker-nfs-62 from 1.26.15 to 1.27.16 (T359641) [01:34:54] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-63 from 1.26.15 to 1.27.16 (T359641) [01:35:13] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-62 (T359641) [01:40:06] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-28 (T359641) [01:40:28] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-62 (T359641) [01:40:46] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=99) for node tools-k8s-worker-nfs-63 from 1.26.15 to 1.27.16 (T359641) [01:40:47] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-64 from 1.26.15 to 1.27.16 (T359641) [01:40:51] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [01:40:52] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-63 (T359641) [01:42:14] !log raymond-ndibe@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-68.tools.eqiad1.wikimedia.cloud to the cluster [01:42:14] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [01:45:21] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=99) for host tools-k8s-worker-nfs-28 [01:46:08] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-63 (T359641) [01:46:12] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [01:46:37] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=99) for node tools-k8s-worker-nfs-64 from 1.26.15 to 1.27.16 (T359641) [01:48:10] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-29 (T359641) [01:48:22] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster (T359641) [01:49:10] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-64 (T359641) [01:50:50] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-30 (T359641) [01:53:27] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=99) for host tools-k8s-worker-nfs-29 [01:54:26] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-64 (T359641) [01:54:29] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [01:56:08] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=99) for host tools-k8s-worker-nfs-30 [01:57:22] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-31 (T359641) [01:57:36] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-49 (T359641) [01:57:58] !log raymond-ndibe@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-69.tools.eqiad1.wikimedia.cloud to the cluster [01:57:59] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [01:58:15] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster (T359641) [01:58:22] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster (T359641) [02:02:39] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=99) for host tools-k8s-worker-nfs-31 [02:02:55] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=99) for host tools-k8s-worker-nfs-49 [02:04:52] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-56 (T359641) [02:04:55] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [02:05:11] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-6 (T359641) [02:08:17] !log raymond-ndibe@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-70.tools.eqiad1.wikimedia.cloud to the cluster [02:08:17] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [02:10:10] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=99) for host tools-k8s-worker-nfs-56 [02:10:29] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=99) for host tools-k8s-worker-nfs-6 [02:12:35] !log raymond-ndibe@cloudcumin1001 tools END (ERROR) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=97) for a worker-nfs role in the tools cluster (T359641) [02:12:39] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [02:15:22] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster (T359641) [02:22:33] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster (T359641) [02:22:37] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [02:24:10] !log raymond-ndibe@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-71.tools.eqiad1.wikimedia.cloud to the cluster [02:24:10] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [02:29:28] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster (T359641) [02:29:32] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [02:32:22] !log raymond-ndibe@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-72.tools.eqiad1.wikimedia.cloud to the cluster [02:32:22] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [02:36:06] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster (T359641) [02:36:10] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [02:38:19] !log raymond-ndibe@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-73.tools.eqiad1.wikimedia.cloud to the cluster [02:38:19] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [02:40:07] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-60 (T359641) [02:40:10] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-62 (T359641) [02:45:23] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=99) for host tools-k8s-worker-nfs-60 [02:45:28] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=99) for host tools-k8s-worker-nfs-62 [02:46:24] !log raymond-ndibe@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-74.tools.eqiad1.wikimedia.cloud to the cluster [02:46:25] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [02:49:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [02:50:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-worker-nfs-70 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [02:50:30] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster (T359641) [02:50:34] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [02:57:32] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster (T359641) [02:57:37] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [03:00:20] !log raymond-ndibe@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-75.tools.eqiad1.wikimedia.cloud to the cluster [03:00:21] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [03:00:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-k8s-worker-nfs-75 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [03:04:40] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-63 (T359641) [03:04:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [03:04:45] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [03:05:28] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-k8s-worker-nfs-75 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [03:07:36] !log raymond-ndibe@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-76.tools.eqiad1.wikimedia.cloud to the cluster [03:07:36] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [03:08:20] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-64 (T359641) [03:09:59] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=99) for host tools-k8s-worker-nfs-63 [03:13:37] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=99) for host tools-k8s-worker-nfs-64 [03:18:01] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-ingress-7 from 1.26.15 to 1.27.16 (T359641) [03:18:05] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [03:18:59] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-ingress-7 from 1.26.15 to 1.27.16 (T359641) [03:19:51] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-ingress-8 from 1.26.15 to 1.27.16 (T359641) [03:20:49] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-ingress-8 from 1.26.15 to 1.27.16 (T359641) [03:21:29] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-64 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [03:23:11] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-ingress-9 from 1.26.15 to 1.27.16 (T359641) [03:23:16] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [03:23:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-k8s-worker-nfs-75 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [03:24:19] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-ingress-9 from 1.26.15 to 1.27.16 (T359641) [03:26:29] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-64 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [03:40:28] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-worker-nfs-70 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [03:48:28] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-k8s-worker-nfs-75 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [04:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [04:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [06:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [06:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [08:12:37] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [builds-builder,builds-api] Upgrade tekton - https://phabricator.wikimedia.org/T374908 (10dcaro) 03NEW [08:13:40] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [builds-builder,builds-api] Upgrade tekton - https://phabricator.wikimedia.org/T374908#10151757 (10dcaro) 05Open→03In progress a:05Raymond_Ndibe→03dcaro [08:13:50] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [builds-builder,builds-api] Upgrade tekton - https://phabricator.wikimedia.org/T374908#10151773 (10dcaro) p:05Triage→03High [08:17:21] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Project: [cookbooks.ceph] Add a cookbook to drain a ceph osd in a safe manner - https://phabricator.wikimedia.org/T329709#10151792 (10dcaro) 05In progress→03Resolved [08:22:52] (03PS2) 10David Caro: toolforge.component.deploy: run tests by default [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1072162 [08:26:18] (03CR) 10CI reject: [V:04-1] toolforge.component.deploy: run tests by default [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1072162 (owner: 10David Caro) [08:31:39] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641#10151833 (10dcaro) There were a couple worker nodes this morning having puppet issues, one of them was stuck with the dpkg error... [08:32:42] !log dcaro@urcuchillay tools START - Cookbook wmcs.vps.refresh_puppet_certs on tools-k8s-worker-nfs-75.tools.eqiad1.wikimedia.cloud (T359641) [08:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [08:32:46] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [08:35:24] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=0) on tools-k8s-worker-nfs-75.tools.eqiad1.wikimedia.cloud (T359641) [08:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [08:35:54] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641#10151860 (10dcaro) Running the refresh certs cookbook did the trick: ` dcaro@urcuchillay$ cookbooks wmcs.vps.refresh_puppet_certs... [08:40:19] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-75 (T359641) [08:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [08:40:24] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [08:40:49] !log dcaro@urcuchillay tools START - Cookbook wmcs.vps.refresh_puppet_certs on tools-k8s-worker-nfs-70.tools.eqiad1.wikimedia.cloud (T359641) [08:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [08:41:24] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-75 (T359641) [08:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [08:43:31] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=0) on tools-k8s-worker-nfs-70.tools.eqiad1.wikimedia.cloud (T359641) [08:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [08:43:40] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-70 (T359641) [08:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [08:45:28] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-worker-nfs-75 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [08:46:35] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-70 (T359641) [08:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [08:46:41] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [08:47:06] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641#10151914 (10dcaro) I found tools-k8s-worker-nfs-75 and 70 that were missing a full puppet run (I'm guessing after creation?), the... [08:58:58] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-65 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [09:01:47] 06cloud-services-team, 06Infrastructure-Foundations, 10SRE-tools, 07IPv6: Some WMCS clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271139#10151973 (10Volans) This is the update list as of today: `clouddb2002-dev,cloudlb2004-dev,clouddb[1013-1020]`. I guess that the clouddb are... [09:04:21] (03PS3) 10David Caro: toolforge.component.deploy: run tests by default [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1072162 [09:05:32] (03open) 10dcaro: all: upgrade to tekton 0.60.X [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/111 (https://phabricator.wikimedia.org/T374908) [09:28:28] 06cloud-services-team, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q1): Retire anycast_healthchecker Icinga check - https://phabricator.wikimedia.org/T374842#10152066 (10fgiunchedi) [09:28:53] 06cloud-services-team, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q1): Retire anycast_healthchecker Icinga check - https://phabricator.wikimedia.org/T374842#10152061 (10fgiunchedi) >>! In T374842#10149670, @ssingh wrote: >>>! In T374842#10149337, @fgiunchedi wrote: >> Thank you @ssingh, there... [09:35:17] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641#10152106 (10dcaro) It seems also that when removing the old workers, the certs were not cleaned up properly (maybe they were remo... [09:50:59] RESOLVED: PuppetStaleCertificates: Found non-revoked Puppet certificates for 11 deleted instances on tools-puppetserver-01 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [10:07:15] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [10:08:56] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [10:09:57] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [10:10:15] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [10:11:36] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [10:14:03] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [10:26:58] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [10:29:00] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [10:37:10] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [10:57:27] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [10:58:56] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [10:59:36] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [11:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [11:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [11:34:16] 10wikitech.wikimedia.org, 10MW-on-K8s, 06serviceops: Communication for Wikitech/Wikimedia Developer Account migration - https://phabricator.wikimedia.org/T373615#10152624 (10jijiki) [11:38:09] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [11:39:07] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [11:40:07] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [11:41:45] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [11:47:37] 10Data-Services, 06Data-Engineering, 10GlobalBlocking, 06Stewards-and-global-tools, and 3 others: Hide the value of gbw_address in the public replicas if the associated globalblocks table row has gb_autoblock_parent_id as not null - https://phabricator.wikimedia.org/T374855#10152646 (10Dreamy_Jazz) 05... [12:24:19] 10Data-Services, 06Data-Engineering, 10GlobalBlocking, 06Stewards-and-global-tools, and 3 others: Hide the value of gbw_address in the public replicas if the associated globalblocks table row has gb_autoblock_parent_id as not null - https://phabricator.wikimedia.org/T374855#10152817 (10Dreamy_Jazz) [12:24:21] 10Data-Services, 06Data-Engineering, 06Trust and Safety Product Team, 13Patch-For-Review, and 2 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10152818 (10Dreamy_Jazz) [12:34:18] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [12:35:15] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [12:36:24] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [12:37:33] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [12:42:21] (03update) 10dcaro: all: upgrade to tekton 0.60.X [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/111 (https://phabricator.wikimedia.org/T374908) [12:43:47] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [12:49:57] (03PS1) 10Elukey: README: add dots to trigger a change [labs/private] - 10https://gerrit.wikimedia.org/r/1073449 (https://phabricator.wikimedia.org/T374443) [12:51:56] (03CR) 10Elukey: [V:03+2 C:03+2] README: add dots to trigger a change [labs/private] - 10https://gerrit.wikimedia.org/r/1073449 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [12:57:57] (03update) 10dcaro: all: upgrade to tekton 0.60.X [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/111 (https://phabricator.wikimedia.org/T374908) [12:59:56] (03update) 10dcaro: [toolforge-deploy] test multi-replica support for continuous jobs [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/521 (https://phabricator.wikimedia.org/T341066) (owner: 10raymond-ndibe) [13:05:38] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [13:10:00] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [13:10:57] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [13:12:03] (03update) 10dcaro: all: upgrade to tekton 0.60.X [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/111 (https://phabricator.wikimedia.org/T374908) [13:12:14] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS: tofu-infra: replace wmcs-wikireplica-dns.py with tofu - https://phabricator.wikimedia.org/T374953 (10fnegri) 03NEW [13:12:27] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS: tofu-infra: replace wmcs-wikireplica-dns.py with tofu - https://phabricator.wikimedia.org/T374953#10153075 (10fnegri) p:05Triage→03Medium [13:35:19] 10Data-Services, 06Data-Engineering, 06SRE, 06Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10153174 (10Dreamy_Jazz) [13:43:56] FIRING: SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [13:48:56] RESOLVED: SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:04:42] (03CR) 10Stevemunene: [V:03+2 C:03+2] Add new an worker keytabs [labs/private] - 10https://gerrit.wikimedia.org/r/1072655 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene) [14:27:13] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 [14:27:15] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [14:27:59] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 [14:29:47] 10cloud-services-team (FY2024/2025-Q1-Q2): cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10153481 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6ca197a8-f8fb-4fad-8834-f4d89337e282) set by cmooney@cumin1002 for 1:30:00 on 3 host(s) and their services with reason:... [14:31:35] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 [14:31:36] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [14:32:02] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 [14:32:46] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [14:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [14:52:52] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [15:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [15:02:47] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 [15:03:23] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 [15:04:43] 10cloud-services-team (FY2024/2025-Q1-Q2): cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10153653 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b046581a-9da3-41c8-8d93-e0eeee44732e) set by cmooney@cumin1002 for 1:30:00 on 1 host(s) and their services with reason:... [15:05:17] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [15:05:18] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 [15:05:49] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 [15:05:50] 10cloud-services-team (FY2024/2025-Q1-Q2): cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10153658 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5b4f18de-e6fc-4fd9-ace2-e29e419c51b0) set by cmooney@cumin1002 for 1:30:00 on 24 host(s) and their services with reason... [15:07:28] (03update) 10aborrero: tofu-infra: add support for neutron security groups [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 (https://phabricator.wikimedia.org/T374835) [15:07:43] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 [15:08:13] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46 [15:15:43] (03update) 10dcaro: tekton: upgrade to v0.60.2 [repos/cloud/toolforge/builds-builder] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/61 [15:15:44] (03merge) 10bd808: mediawiki: Don't mark edits as bot edits [toolforge-repos/schedule-deployment] - 10https://gitlab.wikimedia.org/toolforge-repos/schedule-deployment/-/merge_requests/11 (https://phabricator.wikimedia.org/T374735) [15:21:10] FIRING: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [15:21:17] PROBLEM - Host checker.tools.wmflabs.org is DOWN: CRITICAL - Host Unreachable (checker.tools.wmflabs.org) [15:21:20] PROBLEM - Host cloudcephmon1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:22] FIRING: [14x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [15:24:09] FIRING: CephSlowOps: Ceph cluster in eqiad has 149 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [15:25:47] FIRING: NodeDown: The node cloudcephmon1001 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephmon1001 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown [15:26:10] FIRING: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [15:26:58] FIRING: [3x] MetricsinfraAlertmanagerDown: Metricsinfra alertmanager is unreachable #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MetricsinfraAlertmanagerDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DMetricsinfraAlertmanagerDown [15:27:43] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 31.49 ms [15:27:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:31:58] FIRING: [4x] MetricsinfraAlertmanagerDown: Metricsinfra alertmanager is unreachable #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MetricsinfraAlertmanagerDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DMetricsinfraAlertmanagerDown [15:32:04] 06cloud-services-team: MetricsinfraAlertmanagerDown - https://phabricator.wikimedia.org/T373277#10153800 (10phaultfinder) [15:43:12] 10Tool-schedule-deployment: ScheduleDeploymentBot edits being marked as bot means they break watchlisting of the Deployments page - https://phabricator.wikimedia.org/T374735#10153840 (10bd808) 05Open→03Resolved https://wikitech.wikimedia.org/w/index.php?title=Special:Log&logid=973398 [15:47:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:49:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:51:49] RECOVERY - Host cloudcephmon1001 is UP: PING OK - Packet loss = 0%, RTA = 30.48 ms [15:53:28] RESOLVED: [4x] MetricsinfraAlertmanagerDown: Metricsinfra alertmanager is unreachable #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MetricsinfraAlertmanagerDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DMetricsinfraAlertmanagerDown [15:54:45] PROBLEM - Host cloudcephmon1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:55:47] RESOLVED: NodeDown: The node cloudcephmon1001 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephmon1001 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown [15:56:39] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 55 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [16:00:37] 06cloud-services-team, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q1): Retire anycast_healthchecker Icinga check - https://phabricator.wikimedia.org/T374842#10153975 (10ssingh) On perhaps a related note, while it is true that many of the things the script is doing are taken care by the self-h... [16:01:47] RECOVERY - Host cloudcephmon1001 is UP: PING OK - Packet loss = 0%, RTA = 30.26 ms [16:04:07] RESOLVED: Toolforge Kyverno low policy resources: Toolforge Kyverno has low amount of policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_low_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+low+policy+resources [16:11:27] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node (T374043) [16:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:11:32] T374043: Drain C8 rack - https://phabricator.wikimedia.org/T374043 [16:11:37] !log dcaro@urcuchillay admin END (ERROR) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=97) (T374043) [16:11:40] RESOLVED: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [16:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:13:37] RESOLVED: [9x] HAProxyBackendUnavailable: HAProxy service designate-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [16:24:14] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_node (T374043) [16:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [16:24:20] T374043: Drain C8 rack - https://phabricator.wikimedia.org/T374043 [16:29:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:39:00] FIRING: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [16:49:29] 10cloud-services-team (FY2024/2025-Q1-Q2): cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10154217 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=dda4c44a-52fb-46f0-8e02-d2d4b5a2ee3e) set by cmooney@cumin1002 for 0:30:00 on 24 host(s) and their services with reason... [16:49:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:50:01] (03update) 10dcaro: all: upgrade to tekton 0.60.X [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/111 (https://phabricator.wikimedia.org/T374908) [16:51:11] FIRING: [2x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:52:30] (03open) 10dcaro: build.Start: initialize the envvars array so it does not send nil [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/112 [16:52:52] (03open) 10dcaro: celery: upgrade to a version without the timeout bug [toolforge-repos/sample-complex-app-backend] - 10https://gitlab.wikimedia.org/toolforge-repos/sample-complex-app-backend/-/merge_requests/4 [17:01:31] 10Cloud-VPS, 10Data-Services, 06Data-Engineering, 06SRE, and 4 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10154270 (10fnegri) [17:02:00] 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544#10154255 (10cmooney) 05Open→03Resolved Upgrade was successful today on cloudsw1-c8-codfw, the last of th... [17:02:15] 10Data-Services, 06Data-Engineering, 06SRE, 06Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10154274 (10fnegri) [17:04:18] (03update) 10dcaro: all: upgrade to tekton 0.60.X [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/111 (https://phabricator.wikimedia.org/T374908) [17:14:36] (03update) 10dcaro: tekton: upgrade to v0.60.2 [repos/cloud/toolforge/builds-builder] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/61 [17:14:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:19:17] 10cloud-services-team (FY2024/2025-Q1-Q2): cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10154356 (10cmooney) The switch upgrade / reboot was successful earlier today which hopefully will mean we don't have any repeat of this incident. All protocols established and looking stable but... [17:19:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:20:23] (03update) 10dcaro: all: upgrade to tekton 0.60.X [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/111 (https://phabricator.wikimedia.org/T374908) [17:30:22] (03update) 10dcaro: all: upgrade to tekton 0.60.X [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/111 (https://phabricator.wikimedia.org/T374908) [17:41:12] 06Toolforge-standards-committee, 07User-notice: Refresh membership of Toolforge standards committee - https://phabricator.wikimedia.org/T370474#10154451 (10bd808) [17:44:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:49:52] (03update) 10dcaro: all: upgrade to tekton 0.60.X [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/111 (https://phabricator.wikimedia.org/T374908) [17:51:02] (03open) 10dcaro: builds-api: pull always for local environment [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/522 [17:53:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:54:00] FIRING: [2x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [17:56:34] 06Toolforge-standards-committee: Facilitate Volunteer NDA application process for 2024 Toolforge standards committee appointees - https://phabricator.wikimedia.org/T374993 (10bd808) 03NEW [17:58:55] 06Toolforge-standards-committee: Facilitate Volunteer NDA application process for 2024 Toolforge standards committee appointees - https://phabricator.wikimedia.org/T374993#10154509 (10bd808) 05Open→03In progress p:05Triage→03Medium [17:59:00] FIRING: [3x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [18:13:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:16:43] 06Toolforge-standards-committee: Facilitate Volunteer NDA application process for 2024 Toolforge standards committee appointees - https://phabricator.wikimedia.org/T374993#10154575 (10bd808) [18:22:02] 06Toolforge-standards-committee: Facilitate Volunteer NDA application process for 2024 Toolforge standards committee appointees - https://phabricator.wikimedia.org/T374993#10154597 (10bd808) [18:23:21] 06Toolforge-standards-committee, 07User-notice: Refresh membership of Toolforge standards committee - https://phabricator.wikimedia.org/T370474#10154613 (10bd808) [18:23:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:23:57] 06Toolforge-standards-committee, 07User-notice: Refresh membership of Toolforge standards committee - https://phabricator.wikimedia.org/T370474#10154617 (10bd808) 05In progress→03Stalled Currently blocked on {T374993} [18:33:39] 06Toolforge-standards-committee, 06WMF-NDA-Requests: Volunteer NDA for Antonin Delpeuch (Pintoch) - https://phabricator.wikimedia.org/T374995 (10Pintoch) 03NEW [18:39:47] 10wikitech.wikimedia.org, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: MVP: Privately serve wikitech via mwdebug1001 - https://phabricator.wikimedia.org/T371537#10154673 (10dancy) The updated Firefox add-on is available at https://addons.mozilla.org/en-US/firefox/addon/wikimedia-debug-header/. The up... [18:43:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:49:47] 06Toolforge-standards-committee, 06WMF-NDA-Requests: Volunteer NDA for SD0001 - https://phabricator.wikimedia.org/T374998 (10SD0001) 03NEW [18:57:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:04:03] 06Toolforge-standards-committee, 06WMF-NDA-Requests: Volunteer NDA for Lucas Werkmeister - https://phabricator.wikimedia.org/T375001 (10LucasWerkmeister) 03NEW [19:06:06] 06Toolforge-standards-committee, 06WMF-NDA-Requests: Volunteer NDA for Lucas Werkmeister - https://phabricator.wikimedia.org/T375001#10154826 (10LucasWerkmeister) As mentioned in T370474#10098474, @Lucas_Werkmeister_WMDE has signed some kind of NDA already. I don’t know if that also covers me (same legal perso... [19:17:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:20:01] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:23:43] 06Toolforge-standards-committee: Facilitate Volunteer NDA application process for 2024 Toolforge standards committee appointees - https://phabricator.wikimedia.org/T374993#10154846 (10Legoktm) [19:26:13] 06Toolforge-standards-committee: Facilitate Volunteer NDA application process for 2024 Toolforge standards committee appointees - https://phabricator.wikimedia.org/T374993#10154848 (10LucasWerkmeister) [19:43:00] 06Toolforge-standards-committee: Facilitate Volunteer NDA application process for 2024 Toolforge standards committee appointees - https://phabricator.wikimedia.org/T374993#10154901 (10bd808) [19:44:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:47:08] 06Toolforge-standards-committee: Facilitate Volunteer NDA application process for 2024 Toolforge standards committee appointees - https://phabricator.wikimedia.org/T374993#10154910 (10MusikAnimal) [19:49:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:55:28] 06Toolforge-standards-committee, 06WMF-NDA-Requests: Volunteer NDA for TheProtonade - https://phabricator.wikimedia.org/T375007 (10theprotonade) 03NEW [19:58:34] 06Toolforge-standards-committee: Facilitate Volunteer NDA application process for 2024 Toolforge standards committee appointees - https://phabricator.wikimedia.org/T374993#10154966 (10bd808) [20:14:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:19:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:24:00] FIRING: [4x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [20:25:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [20:25:39] 10Toolforge (Toolforge iteration 14), 07Upstream: [builds-builder,jobs-api,upstream] Calling nontrivial Procfile commands with arguments results in confusing error (“no such file or directory”) - https://phabricator.wikimedia.org/T356016#10155045 (10bd808) 05Stalled→03Open >>! In T356016#9542053, @dcaro wr... [20:29:18] 06cloud-services-team, 10Toolforge: Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#10155063 (10bd808) I have a mostly working solution for this issue in a [[https://gitlab.wikimedia.org/toolforge-repos/bd808-buildpack-perl-bastion/|custom containe... [20:39:00] FIRING: [5x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [20:44:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:49:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:09:00] FIRING: [6x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [21:14:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:19:00] FIRING: [7x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [21:24:55] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) (T374043) [21:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [21:25:00] T374043: Drain C8 rack - https://phabricator.wikimedia.org/T374043 [21:27:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:47:11] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:53:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:13:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:23:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:43:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:49:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [23:04:27] (03update) 10raymond-ndibe: [jobs-cli] multi-replica support for continuous jobs [repos/cloud/toolforge/jobs-cli] (remove_unknown_keys_in_dump) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/63 (https://phabricator.wikimedia.org/T341066) [23:14:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [23:19:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [23:44:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [23:49:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown