[00:17:45] cloud-services-team, Cloud-VPS: Import Fedora CoreOS 42 image for use with Magnum - https://phabricator.wikimedia.org/T396912 (bd808) NEW
[00:17:58] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T309789)
[00:18:04] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[00:20:50] cloud-services-team, Cloud-VPS: Import Fedora CoreOS 42 image for use with Magnum - https://phabricator.wikimedia.org/T396912#10914903 (bd808) @Andrew has been poking at some Magnum related things, so maybe he would be interested in picking this up? Fedora has a 6 month release cycle for it's CoreOS vers...
[01:36:17] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=97)
[01:36:30] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[01:36:33] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0)
[01:37:06] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[01:37:36] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[01:38:44] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T309789)
[01:38:49] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[02:07:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/245
[02:08:09] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/245
[02:08:56] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/248
[02:09:13] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/248
[03:14:18] RESOLVED: KernelErrors: Server cloudcephosd1017 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1017 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[03:51:35] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97) (T309789)
[03:51:42] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[03:54:39] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[03:54:42] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[03:55:25] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T309789)
[03:56:17] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T309789)
[03:56:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T309789)
[03:56:53] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[03:57:47] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789)
[03:58:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[03:59:09] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99)
[04:02:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[04:02:58] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99)
[04:06:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[04:20:30] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[04:21:10] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0)
[04:23:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T309789)
[04:23:06] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[04:23:51] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T309789)
[04:25:54] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T309789)
[04:38:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[05:12:54] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[05:12:57] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)
[05:13:56] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[05:13:58] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)
[05:14:39] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[05:14:41] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)
[06:58:17] Quarry: [bug] Query results do not appear due to JS error - https://phabricator.wikimedia.org/T396904#10914991 (Liz) Any progress with this problem?
[07:29:48] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789)
[07:29:55] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[07:52:33] supertassu opened https://github.com/toolforge/quarry/pull/88
[07:56:10] supertassu closed https://github.com/toolforge/quarry/pull/88
[07:57:42] cloud-services-team (FY2024/2025-Q3-Q4), Quarry: [bug] Query results do not appear due to JS error - https://phabricator.wikimedia.org/T396904#10915001 (taavi) Open→Resolved p:Triage→High a:taavi At first I thought this was the same issue as T396893#10914253, i.e. someone trying t...
[08:42:33] Toolforge (Toolforge iteration 21): [jobs-emailer] stops processing k8s events - https://phabricator.wikimedia.org/T396850#10915026 (dcaro) I got stuck today also, the last log of the event fetching task does not show any error, but there's a log for connection error and retrying right after (from the config...
[08:47:09] Toolforge (Toolforge iteration 21): [jobs-emailer] stops processing k8s events - https://phabricator.wikimedia.org/T396850#10915027 (dcaro) This might be a good place to start to get example of setting timeouts and recovering from connection issues: https://github.com/kubernetes-client/python/tree/master/exa...
[08:57:02] Toolforge (Toolforge iteration 21): [jobs-emailer] stops processing k8s events - https://phabricator.wikimedia.org/T396850#10915039 (dcaro) This seems very similar https://github.com/kubernetes-client/python/issues/1148
[09:05:38] (close) dcaro: emailer: run webserver in a different thread [repos/cloud/toolforge/jobs-emailer] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-emailer/-/merge_requests/9 (https://phabricator.wikimedia.org/T379924) (owner: aborrero)
[09:59:18] FIRING: KernelErrors: Server cloudcephosd1020 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1020 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[09:59:30] cloud-services-team: KernelErrors Server cloudcephosd1020 logged kernel errors - https://phabricator.wikimedia.org/T396917 (phaultfinder) NEW
[11:49:00] FIRING: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:59:00] FIRING: [3x] OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:59:10] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for service: project,heat
[11:59:17] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for service: project,heat
[12:00:09] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[12:01:07] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0)
[12:03:27] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[12:06:31] PROBLEM - Host cloudcephosd1020 is DOWN: PING CRITICAL - Packet loss = 100%
[12:08:09] RECOVERY - Host cloudcephosd1020 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[12:10:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[12:11:42] andrew@cloudcumin1001 bootstrap_and_add (PID 2984097) is awaiting input
[12:15:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0)
[12:29:18] FIRING: KernelErrors: Server cloudcephosd1022 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1022 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[12:29:28] cloud-services-team: KernelErrors Server cloudcephosd1022 logged kernel errors - https://phabricator.wikimedia.org/T396921 (phaultfinder) NEW
[12:30:54] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[12:36:35] PROBLEM - Host cloudcephosd1022 is DOWN: PING CRITICAL - Packet loss = 100%
[12:40:03] RECOVERY - Host cloudcephosd1022 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[12:44:18] RESOLVED: KernelErrors: Server cloudcephosd1018 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1018 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[13:18:22] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[13:18:25] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)
[13:19:05] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[13:19:08] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)
[13:19:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[13:23:33] PROBLEM - Host cloudcephosd1022 is DOWN: PING CRITICAL - Packet loss = 100%
[13:24:09] RECOVERY - Host cloudcephosd1022 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[13:27:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[13:28:25] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0)
[13:28:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[13:30:05] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1023.eqiad.wmnet' (T394727)
[13:30:11] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727
[13:30:31] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1023.eqiad.wmnet' (T394727)
[13:30:57] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T309789)
[13:31:02] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[13:32:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:26:43] (PS1) Essa237: [Fix] added a landing page [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1157586
[15:36:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[16:08:06] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-37
[16:11:30] RESOLVED: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[16:12:22] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-37
[16:25:59] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[16:26:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[17:16:26] Quarry: [bug] Another problem with Quarry - https://phabricator.wikimedia.org/T396910#10915302 (Aklapper) For future reference, please summarize the actual problem in the task title - thanks!]
[18:44:34] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789)
[18:44:40] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[18:45:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[18:46:08] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0)
[19:24:18] FIRING: KernelErrors: Server cloudcephosd1023 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1023 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[19:24:28] cloud-services-team: KernelErrors Server cloudcephosd1023 logged kernel errors - https://phabricator.wikimedia.org/T396929 (phaultfinder) NEW
[19:29:02] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[19:54:18] RESOLVED: KernelErrors: Server cloudcephosd1019 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1019 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[20:27:32] cloud-services-team, Cloud-VPS: ZuulDevOpsBot user can create but not delete a cluster template - https://phabricator.wikimedia.org/T396932 (bd808) NEW
[20:41:12] PAWS: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T394614#10915437 (LibUp-bot) A new upstream version of Pywikibot is now available: 10.2.0. * https://gerrit.wikimedia.org/g/pywikibot/core/+/refs/tags/10.2.0 * https://doc.wikimedia.org/pywikibot/stable/changelog.html
[20:41:13] cloud-services-team, Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T396933#10915438 (LibUp-bot)
[20:42:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures
[20:42:07] cloud-services-team: NovafullstackSustainedFailures Novafullstack tests have been failing for more than 5hours in eqiad - https://phabricator.wikimedia.org/T396934 (phaultfinder) NEW
[20:43:34] cloud-services-team, Cloud-VPS, Continuous-Integration-Infrastructure (Zuul upgrade): Magnum created instances failing to talk to OpenStack user_data service - https://phabricator.wikimedia.org/T396935 (bd808) NEW
[20:44:48] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0)