[02:13:37] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10907270 (10Andrew)
[02:18:21] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10907285 (10Andrew) a:05cmooney→03Jclark-ctr @jclark-ctr, we would like to wait until the 25G dacs come in, and then have each of these hosts reconnect...
[02:18:43] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10907288 (10Andrew)
[02:19:02] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T309789)
[02:19:09] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[02:28:41] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10907303 (10Andrew)
[02:32:16] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE: Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10907306 (10Andrew)
[02:32:32] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE: Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10907307 (10Andrew) Note that we need two ports for each of these, I've just updated the task description. Does that make fitting them even harder?
[03:08:19] 06cloud-services-team, 10Horizon, 05Cloud-Services-Origin-User, 07Upstream: Horizon: network topology panel ignores user policy, suggests deleting networks and instances - https://phabricator.wikimedia.org/T389965#10907378 (10Andrew) 05Open→03Resolved This is now fixed in our deployment and merged...
[05:23:14] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789)
[05:23:21] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[07:08:41] FIRING: PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[07:13:41] FIRING: [2x] PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[07:33:41] RESOLVED: [2x] PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[07:33:56] FIRING: [2x] PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[07:34:11] RESOLVED: PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[07:48:42] 06cloud-services-team, 10Cloud-VPS, 10Data-Engineering (Q4 2025 April 1st - June 30th), 07IPv6: Add new WMCS IP ranges to analytics - https://phabricator.wikimedia.org/T392468#10907888 (10JAllemandou)
[07:49:45] 06cloud-services-team, 10Cloud-VPS, 10Data-Engineering (Q4 2025 April 1st - June 30th), 07IPv6: Add new WMCS IP ranges to analytics - https://phabricator.wikimedia.org/T392468#10907889 (10JAllemandou) 05Open→03Resolved Sorry I forgot to follow up. The boxes are ticked, the code is live, I'm resolvi...
[08:20:56] 06cloud-services-team, 10Cloud-VPS: Support keystone role management with tofu-infra - https://phabricator.wikimedia.org/T396671#10908005 (10taavi)
[08:20:57] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review: Create OpenStack role that allows object storage access only - https://phabricator.wikimedia.org/T396594#10908006 (10taavi)
[08:37:40] 10Data-Services, 06Data-Engineering: Create a view for existencelinks table - https://phabricator.wikimedia.org/T394898#10908068 (10Tacsipacsi) Combined with {T395366}, this is very bad and urgent. Useful and previously-accessible data is currently **COMPLETELY** inaccessible for anyone without an NDA signed:...
[09:08:03] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724 (10fnegri) 03NEW
[09:08:10] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908222 (10fnegri) p:05Triage→03High a:03fnegri
[09:08:26] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908225 (10fnegri) 05Open→03In progress
[09:10:28] FIRING: InstanceDown: Project tools instance tools-prometheus-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:10:38] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908244 (10fnegri) The same instance crashed last year because the disk filled up: {T355138} This time it looks slightly different, the `wal_ar...
[09:12:22] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908260 (10fnegri) I'm not sure if I can just delete files from `wal_archive`, so I will attempt resizing the disk first.
[09:19:47] FIRING: TektonDown: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown
[09:19:47] FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[09:20:17] FIRING: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[09:20:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:24:47] RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[09:24:47] RESOLVED: TektonDown: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown
[09:25:17] RESOLVED: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[09:28:28] FIRING: InstanceDown: Project tools instance tools-prometheus-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:32:03] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908381 (10fnegri) ` sudo OS_PROJECT_ID=glamwikidashboard wmcs-openstack database instance resize volume ee0c90b0-5d21-4d41-9abf-cdabca2787c3 55...
[09:35:23] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908390 (10fnegri) I tried `database instance restart`, that moved it to status `REBOOT` and then `ACTIVE / ERROR`. I then tried rebooting the V...
[09:39:21] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908404 (10YochayCO) Does it make sense that I still don't get a response when sending a ping command to the host?
[09:40:33] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908408 (10fnegri) @YochayCO not sure, let me try fixing that error first. This is similar to what @taavi saw in {T355138} and apparently deleti...
[09:50:43] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908462 (10fnegri) ` root@dbapp:/# ls -l /var/lib/postgresql/data/wal_archive/0000000100001577000000A1 -rw------- 1 database database 9940992 J...
[10:02:21] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908505 (10fnegri) @YochayCO I can successfully connect to the database now. `ping` does not work, but I think some firewall is blocking that. C...
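For readers following the Trove incident above: the resize command fnegri quotes at 09:32 and the `database instance restart` mentioned at 09:35 are the wmcs-openstack forms used here. A minimal sketch of the check-and-grow sequence, with placeholder project, UUID, and size (the log truncates the actual target size, so the values below are illustrative, not the incident's):

    # inspect usage from inside the guest (the 09:50 shell above shows wal_archive filling the disk)
    df -h /var/lib/postgresql
    # grow the Trove-managed volume; <uuid> and <size-gb> are placeholders
    sudo OS_PROJECT_ID=<project> wmcs-openstack database instance resize volume <uuid> <size-gb>
    # restart the database instance afterwards if it does not recover on its own
    sudo OS_PROJECT_ID=<project> wmcs-openstack database instance restart <uuid>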
[10:04:16] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908511 (10fnegri) p:05High→03Medium I'll keep this task open to monitor the disk space in the next few days.
[10:13:08] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908526 (10YochayCO) Awesome, many thanks! Does it make sense that maybe I don't have permissions to ssh to the db host? I can see it as DbApp...
[10:23:06] (03CR) 10Majavah: [C:03+2] build: Upgrade Codex to 2.1.0 [labs/striker] - 10https://gerrit.wikimedia.org/r/1155739 (owner: 10Majavah)
[10:25:35] (03Merged) 10jenkins-bot: build: Upgrade Codex to 2.1.0 [labs/striker] - 10https://gerrit.wikimedia.org/r/1155739 (owner: 10Majavah)
[10:28:16] !log dcaro@acme tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[10:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:30:10] (03CR) 10Majavah: [C:03+2] Switch username validation to Bitu API [labs/striker] - 10https://gerrit.wikimedia.org/r/1134724 (https://phabricator.wikimedia.org/T364605) (owner: 10Arendpieter)
[10:31:39] (03Merged) 10jenkins-bot: Switch username validation to Bitu API [labs/striker] - 10https://gerrit.wikimedia.org/r/1134724 (https://phabricator.wikimedia.org/T364605) (owner: 10Arendpieter)
[10:34:45] !log dcaro@acme tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[10:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:38:28] FIRING: [2x] InstanceDown: Project tools instance tools-k8s-worker-nfs-46 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:40:23] FIRING: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-46 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[10:45:24] RESOLVED: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-46 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[10:48:28] RESOLVED: [2x] InstanceDown: Project tools instance tools-k8s-worker-nfs-46 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:51:39] (03approved) 10dcaro: [deploy] add force-build and force-run query params [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/80 (https://phabricator.wikimedia.org/T389044) (owner: 10raymond-ndibe)
[10:51:54] (03merge) 10dcaro: [deploy] add force-build and force-run query params [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/80 (https://phabricator.wikimedia.org/T389044) (owner: 10raymond-ndibe)
[10:54:37] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.116-20250612105200-81744f77 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/814 (https://phabricator.wikimedia.org/T389044)
[10:58:45] 10Cloud-Services, 06collaboration-services, 10GitLab (Infrastructure): Volume is stuck to deleted instance in devtools project - https://phabricator.wikimedia.org/T396739 (10Jelto) 03NEW The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedi...
[10:59:45] 10VPS-Projects, 06collaboration-services, 10GitLab (Infrastructure): Volume is stuck to deleted instance in devtools project - https://phabricator.wikimedia.org/T396739#10908647 (10Jelto)
[11:00:56] 06cloud-services-team, 10Cloud-VPS, 06collaboration-services, 10GitLab (Infrastructure): Volume is stuck to deleted instance in devtools project - https://phabricator.wikimedia.org/T396739#10908653 (10taavi)
[11:41:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T309789)
[11:41:41] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[11:47:46] (03open) 10addshore: Fix README typo (actual) [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/108
[12:06:32] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE: Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10908834 (10Andrew)
[12:08:43] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-api
[12:11:45] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T396363)
[12:11:51] T396363: Moving extra 1G port to make 10G space on cloud rack. - https://phabricator.wikimedia.org/T396363
[12:12:27] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api
[12:17:14] (03approved) 10dcaro: builds: show also the pending state builds [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/107
[12:17:30] (03approved) 10dcaro: components-api: bump to 0.0.116-20250612105200-81744f77 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/814 (https://phabricator.wikimedia.org/T389044) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[12:17:33] (03merge) 10dcaro: components-api: bump to 0.0.116-20250612105200-81744f77 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/814 (https://phabricator.wikimedia.org/T389044) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[12:20:46] (03update) 10dcaro: builds: show also the pending state builds [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/107
[12:20:48] (03merge) 10dcaro: builds: show also the pending state builds [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/107
[12:36:43] (03update) 10dcaro: Fix README typo (actual) [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/108 (owner: 10addshore)
[12:45:08] (03open) 10addshore: Introduce _get_status_style with default fallback [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/109
[12:46:10] (03update) 10addshore: Introduce _get_status_style with default fallback [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/109
[12:49:12] (03update) 10addshore: Fix README typo (actual) [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/108
[12:49:30] (03update) 10addshore: README: Fix typo (actual) [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/108
[12:50:15] (03approved) 10dcaro: README: Fix typo (actual) [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/108 (owner: 10addshore)
[12:50:20] (03update) 10addshore: README: Fix typo (actual) [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/108
[12:50:41] (03update) 10addshore: README: Fix typo (actual) [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/108
[12:53:24] (03update) 10addshore: Introduce _get_status_style with default fallback [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/109
[12:53:41] (03update) 10addshore: builds: Introduce _get_status_style with default fallback [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/109
[12:53:56] (03merge) 10addshore: README: Fix typo (actual) [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/108
[12:54:08] (03update) 10addshore: builds: Introduce _get_status_style with default fallback [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/109
[13:23:06] 06cloud-services-team, 10Toolforge, 06Infrastructure-Foundations, 10netops: [infra] Reports of slow connectivity from APAC - https://phabricator.wikimedia.org/T395135#10909085 (10cmooney) The latency is also reduced when I check for it here (there are no manual overrides of the traffic path in place either...
[13:44:50] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Striker, 10Bitu, 06Infrastructure-Foundations, 13Patch-For-Review: Move Striker to Bitu username validation API - https://phabricator.wikimedia.org/T364605#10909209 (10taavi) 05In progress→03Resolved This change is now live. Thank you @Arendpieter for...
[13:44:55] 06cloud-services-team, 10Striker: [toolsadmin] Striker cannot create Developer accounts or tools with names matching existing SUL accounts - https://phabricator.wikimedia.org/T380384#10909212 (10taavi)
[13:46:06] 06cloud-services-team, 10Striker: [toolsadmin] Striker cannot create Developer accounts or tools with names matching existing SUL accounts - https://phabricator.wikimedia.org/T380384#10909217 (10taavi) 05Open→03Resolved a:03Arendpieter I believe this is fixed with {T364605}.
[13:59:33] (03approved) 10dcaro: builds: Introduce _get_status_style with default fallback [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/109 (owner: 10addshore)
[14:06:22] !log fnegri@cloudcumin1001 zuul START - Cookbook wmcs.vps.create_project for project zuul in eqiad1
[14:06:24] fnegri@cloudcumin1001: Unknown project "zuul"
[14:06:37] !log fnegri@cloudcumin1001 zuul END (ERROR) - Cookbook wmcs.vps.create_project (exit_code=97) for project zuul in eqiad1
[14:06:37] fnegri@cloudcumin1001: Unknown project "zuul"
[14:06:42] !log fnegri@cloudcumin1001 zuul START - Cookbook wmcs.vps.create_project for project zuul in eqiad1 (T396540)
[14:06:43] fnegri@cloudcumin1001: Unknown project "zuul"
[14:06:43] T396540: Request creation of zuul VPS project - https://phabricator.wikimedia.org/T396540
[14:07:21] (03open) 10group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49: projects: added project zuul [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/247 (https://phabricator.wikimedia.org/T396540)
[14:07:53] !log fnegri@cloudcumin1001 zuul END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project zuul in eqiad1 (T396540)
[14:07:53] fnegri@cloudcumin1001: Unknown project "zuul"
[14:09:23] (03open) 10dcaro: deploy_task: force reruning when there was a build [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/86 (https://phabricator.wikimedia.org/T389044)
[14:10:36] (03update) 10dcaro: deploy_task: force reruning when there was a build [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/86 (https://phabricator.wikimedia.org/T389044)
[14:13:17] !log fnegri@cloudcumin1001 zuul START - Cookbook wmcs.vps.create_project for project zuul in eqiad1 (T396540)
[14:13:18] fnegri@cloudcumin1001: Unknown project "zuul"
[14:13:20] T396540: Request creation of zuul VPS project - https://phabricator.wikimedia.org/T396540
[14:13:49] !log fnegri@cloudcumin1001 zuul END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project zuul in eqiad1 (T396540)
[14:13:49] fnegri@cloudcumin1001: Unknown project "zuul"
[14:16:10] (03approved) 10taavi: projects: added project zuul [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/247 (https://phabricator.wikimedia.org/T396540) (owner: 10group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49)
[14:23:40] (03merge) 10fnegri: projects: added project zuul [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/247 (https://phabricator.wikimedia.org/T396540) (owner: 10group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49)
[14:23:59] !log fnegri@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[14:25:23] !log fnegri@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch
[14:27:13] !log fnegri@cloudcumin1001 zuul START - Cookbook wmcs.vps.create_project for project zuul in eqiad1 (T396540)
[14:27:15] fnegri@cloudcumin1001: Unknown project "zuul"
[14:27:16] T396540: Request creation of zuul VPS project - https://phabricator.wikimedia.org/T396540
[14:28:39] !log fnegri@cloudcumin1001 zuul END (PASS) - Cookbook wmcs.vps.create_project (exit_code=0) for project zuul in eqiad1 (T396540)
[14:28:39] fnegri@cloudcumin1001: Unknown project "zuul"
[14:32:37] 06cloud-services-team, 10Cloud-VPS (Project-requests), 10Continuous-Integration-Infrastructure (Zuul upgrade): Request creation of zuul VPS project - https://phabricator.wikimedia.org/T396540#10909397 (10fnegri) 05Open→03Resolved Project created! Please double check that the permissions and quotas ar...
[14:38:28] (03update) 10dcaro: deploy_task: force reruning when there was a build [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/86 (https://phabricator.wikimedia.org/T389044)
[14:48:32] (03update) 10dcaro: functional_tests: use the right webservice tag for the tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/813
[14:52:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[14:52:50] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0)
[14:53:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[14:55:20] (03update) 10dcaro: functional_tests: use the right webservice tag for the tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/813
[14:56:44] PROBLEM - Host cloudcephosd1014 is DOWN: PING CRITICAL - Packet loss = 100%
[14:57:09] (03update) 10dcaro: run_functional_tests: add extra logs with filters/components [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/812
[14:57:23] (03approved) 10dcaro: run_functional_tests: add extra logs with filters/components [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/812
[14:57:24] PROBLEM - Host cloudcephosd1015 is DOWN: PING CRITICAL - Packet loss = 100%
[14:57:29] (03merge) 10dcaro: run_functional_tests: add extra logs with filters/components [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/812
[14:57:58] (03update) 10dcaro: functional_tests: use the right webservice tag for the tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/813
[14:58:12] RECOVERY - Host cloudcephosd1014 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[14:58:24] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=97)
[14:58:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[15:00:57] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10909526 (10fnegri) @YochayCO We don't have a way at the moment to grant SSH access to the Trove DB hosts, the only way we can SSH is through a s...
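The zuul exchange above is a compact illustration of the current Cloud VPS project-creation flow: wmcs.vps.create_project keeps failing with "Unknown project" until the project is defined in tofu-infra and applied, so the working order is the tofu-infra MR first, then the tofu apply cookbook, then the creation cookbook. A rough sketch of that order (cookbook names are from the log; exact arguments are not shown there, so they are omitted):

    # 1. merge the tofu-infra change defining the project (MR 247 above), then apply it
    cookbook wmcs.openstack.tofu               # runs tofu plan+apply for the main branch
    # 2. only now does the project exist and creation succeed
    cookbook wmcs.vps.create_project ...       # project zuul in eqiad1; flags omitted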
[15:02:00] RECOVERY - Host cloudcephosd1015 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[15:04:44] PROBLEM - Host cloudcephosd1014 is DOWN: PING CRITICAL - Packet loss = 100%
[15:06:12] RECOVERY - Host cloudcephosd1014 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[15:09:36] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0)
[15:11:11] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[15:12:14] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0)
[15:27:54] (03approved) 10dcaro: functional_tests: use the right webservice tag for the tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/813
[15:27:58] (03merge) 10dcaro: functional_tests: use the right webservice tag for the tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/813
[15:30:35] 10Toolforge (Toolforge iteration 21), 07good first task: [components-api] add `GET` endpoint `/v1/tool//deployments/latest` - https://phabricator.wikimedia.org/T394990#10909790 (10Chuckonwumelu) 05Open→03In progress
[15:33:14] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10909819 (10Andrew)
[15:39:14] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10909868 (10dcaro) I did some "hacks" for harbor on this too, might be a better way of doing the same, but helped reduce the usage with the drawb...
[15:50:48] FIRING: PuppetFailure: Puppet has failed on cloudcontrol2010-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:50:53] 06cloud-services-team: PuppetFailure Puppet has failed on cloudcontrol2010-dev:9100 - https://phabricator.wikimedia.org/T396769 (10phaultfinder) 03NEW
[15:51:57] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Quarry: Improve Quarry's observability - https://phabricator.wikimedia.org/T396770 (10taavi) 03NEW p:05Triage→03High
[15:52:22] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Quarry: Improve Quarry's observability - https://phabricator.wikimedia.org/T396770#10909953 (10taavi) a:03taavi
[15:53:59] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Quarry: Deploy prometheus-redis-exporter - https://phabricator.wikimedia.org/T396771 (10taavi) 03NEW
[15:54:06] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Quarry: Deploy prometheus-redis-exporter - https://phabricator.wikimedia.org/T396771#10909972 (10taavi) p:05Triage→03High
[15:54:16] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Quarry: Deploy prometheus-redis-exporter - https://phabricator.wikimedia.org/T396771#10909975 (10taavi) a:03taavi
[15:57:00] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Quarry: quarry is leaking tmp files - https://phabricator.wikimedia.org/T395237#10909983 (10taavi) a:03taavi I am planning to "fix" this by disabling the Excel export feature.
[16:00:49] 10Quarry: worker nodes issue with garbage collection - https://phabricator.wikimedia.org/T375997#10910003 (10taavi) I suspect this is a duplicate of T395237 so merging this there.
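T396771 above tracks deploying prometheus-redis-exporter for Quarry as part of the observability push in T396770. The log does not show the intended deployment; as a point of reference only, the upstream exporter (oliver006/redis_exporter, which the Debian prometheus-redis-exporter package is built from) serves metrics on port 9121 and is pointed at a Redis instance like the sketch below. All values are illustrative defaults, not Quarry's actual configuration:

    # run the exporter next to a Redis instance, then scrape :9121/metrics from Prometheus
    docker run -d --name redis-exporter -p 9121:9121 \
        oliver006/redis_exporter --redis.addr=redis://localhost:6379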
[16:00:56] 10Quarry: worker nodes issue with garbage collection - https://phabricator.wikimedia.org/T375997#10910006 (10taavi) →14Duplicate dup:03T395237
[16:01:01] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Quarry: quarry is leaking tmp files - https://phabricator.wikimedia.org/T395237#10910008 (10taavi)
[16:01:46] 10Quarry: Setup an easy way to have Quarry dump information / results on a wiki page - https://phabricator.wikimedia.org/T137179#10910011 (10taavi) 05Open→03Invalid I'm boldly declining this, as there's already a bot that does essentially the same thing.
[16:08:46] 06cloud-services-team, 10Data-Services, 10Wikifunctions, 10Abstract Wikipedia team (25Q4 (Apr–Jun)), 07Essential-Work: Make wikifunctionsclient_usage table available on cloud wiki replicas - https://phabricator.wikimedia.org/T392475#10910079 (10Jdforrester-WMF) a:03Jdforrester-WMF
[16:13:18] RESOLVED: PuppetFailure: Puppet has failed on cloudcontrol2010-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:36:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node
[16:43:05] FIRING: HostBGPDown: BGP session for cloudservices2004-dev (172.20.5.8) is down - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cloudsw1-b1-codfw:9804&var-bgp_group=cloud_host - https://alerts.wikimedia.org/?q=alertname%3DHostBGPDown
[16:43:14] 06cloud-services-team: HostBGPDown BGP session for cloudservices2004-dev (172.20.5.8) is down - https://phabricator.wikimedia.org/T396782 (10phaultfinder) 03NEW
[16:44:19] 06cloud-services-team: HostBGPDown BGP session for cloudservices2004-dev (172.20.5.8) is down - https://phabricator.wikimedia.org/T396782#10910312 (10taavi) 05Open→03Resolved a:03taavi
[16:47:12] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Quarry: Fix Quarry's Redis pod exiting causing frequent outages - https://phabricator.wikimedia.org/T396785 (10taavi) 03NEW p:05Triage→03High
[16:48:05] RESOLVED: HostBGPDown: BGP session for cloudservices2004-dev (172.20.5.8) is down - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cloudsw1-b1-codfw:9804&var-bgp_group=cloud_host - https://alerts.wikimedia.org/?q=alertname%3DHostBGPDown
[16:48:16] 10Toolforge (Toolforge iteration 21), 07Documentation: [components-api] Add admin documentation page - https://phabricator.wikimedia.org/T394280#10910350 (10dcaro) a:03dcaro
[16:48:21] 10Toolforge (Toolforge iteration 21), 07Documentation: [components-api] Add admin documentation page - https://phabricator.wikimedia.org/T394280#10910352 (10dcaro) 05Open→03In progress
[16:49:36] supertassu opened https://github.com/toolforge/quarry/pull/84
[16:50:13] 10Toolforge (Toolforge iteration 21), 07Documentation: [components-api] Add admin documentation page - https://phabricator.wikimedia.org/T394280#10910361 (10dcaro)
[16:50:20] (03open) 10dhardy: About screen initial implementation [toolforge-repos/wikirun-game] - 10https://gitlab.wikimedia.org/toolforge-repos/wikirun-game/-/merge_requests/1
[16:57:44] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10910380 (10RobH) Things I checked: * The PXE boot setting is indeed set to the 10G NIC's...
[16:58:59] supertassu opened https://github.com/toolforge/quarry/pull/85
[17:03:21] 10Toolforge (Toolforge iteration 21), 13Patch-For-Review: [components-api] Add alerts and runbooks for basic service health - https://phabricator.wikimedia.org/T394275#10910396 (10dcaro) Hmm... it seems that the alerts are not being deployed in tools (they did in toolsbeta). The alerts-deploy service seems to...
[17:04:39] 10Toolforge (Toolforge iteration 21), 13Patch-For-Review: [components-api] Add alerts and runbooks for basic service health - https://phabricator.wikimedia.org/T394275#10910404 (10dcaro) Just chowning seemed to do the trick: ` root@tools-prometheus-8:/srv# chown -R alerts-deploy:alerts-deploy /srv/alerts `
[17:06:33] 10Toolforge (Toolforge iteration 21), 13Patch-For-Review: [components-api] Add alerts and runbooks for basic service health - https://phabricator.wikimedia.org/T394275#10910413 (10dcaro) Now showing up on tools too \o/ {F62304702}
[17:12:51] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T396363)
[17:12:57] T396363: Moving extra 1G port to make 10G space on cloud rack. - https://phabricator.wikimedia.org/T396363
[17:15:52] 10Tool-kuwikibot, 10Toolhub: Invalid source code and issues URL on https://toolsadmin.wikimedia.org/tools/id/kuwikibot - https://phabricator.wikimedia.org/T361553#10910442 (10bd808) https://ldap.toolforge.org/user/roj1 is the Developer account associated with the tool. In the #striker db that Developer account...
[17:51:05] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789)
[17:51:12] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[17:55:53] (03merge) 10jdrewniak: About screen initial implementation [toolforge-repos/wikirun-game] - 10https://gitlab.wikimedia.org/toolforge-repos/wikirun-game/-/merge_requests/1 (owner: 10dhardy)
[18:04:57] 10Quarry: Setup an easy way to have Quarry dump information / results on a wiki page - https://phabricator.wikimedia.org/T137179#10910587 (10Stevietheman) See also the [[ https://en.wikipedia.org/wiki/Template:Database_report | Database report ]] template in the English Wikipedia. It's a brilliant way of usi...
[18:44:35] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10910760 (10cmooney) >>! In T309789#10910380, @RobH wrote: > So it appears its sending in t...
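The alerts-deploy fix dcaro logs at 17:04 above is a plain ownership repair on tools-prometheus-8. A minimal before/after sketch (the chown line is the one quoted in the log; the systemd unit name is an assumption, since the log only says "the alerts-deploy service"):

    ls -ld /srv/alerts                                  # confirm the current owner of the alerts tree
    chown -R alerts-deploy:alerts-deploy /srv/alerts    # the fix quoted in the log
    systemctl status alerts-deploy.service              # unit name assumed, not shown above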
[19:14:18] FIRING: KernelErrors: Server cloudcephosd1015 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1015 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[19:14:29] 06cloud-services-team: KernelErrors Server cloudcephosd1015 logged kernel errors - https://phabricator.wikimedia.org/T396796 (10phaultfinder) 03NEW
[19:18:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T309789)
[19:18:06] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[19:32:58] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T396363)
[19:33:04] T396363: Moving extra 1G port to make 10G space on cloud rack. - https://phabricator.wikimedia.org/T396363
[19:33:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T396363)
[19:49:02] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789)
[19:49:09] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[19:49:11] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0)
[19:51:26] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[19:52:15] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T396363)
[19:52:22] T396363: Moving extra 1G port to make 10G space on cloud rack. - https://phabricator.wikimedia.org/T396363
[19:52:59] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T396363)
[20:17:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[20:18:24] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0)
[20:18:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[20:22:52] PROBLEM - Host cloudcephosd1015 is DOWN: PING CRITICAL - Packet loss = 100%
[20:23:20] RECOVERY - Host cloudcephosd1015 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[20:25:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[20:27:11] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0)
[20:27:57] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[20:29:40] PROBLEM - Host cloudcephosd1016 is DOWN: PING CRITICAL - Packet loss = 100%
[20:29:40] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10911034 (10Andrew)
[20:30:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[20:31:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[20:31:48] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99)
[20:34:47] FIRING: NodeDown: Node cloudcephosd1016 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1016 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:34:50] RECOVERY - Host cloudcephosd1016 is UP: PING WARNING - Packet loss = 80%, RTA = 0.71 ms
[20:35:38] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node
[20:38:56] PROBLEM - Host cloudcephosd1016 is DOWN: PING CRITICAL - Packet loss = 100%
[20:39:18] FIRING: KernelErrors: Server cloudcephosd1016 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1016 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[20:39:21] 06cloud-services-team: KernelErrors Server cloudcephosd1016 logged kernel errors - https://phabricator.wikimedia.org/T396801 (10phaultfinder) 03NEW
[20:39:47] RESOLVED: NodeDown: Node cloudcephosd1016 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1016 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:40:49] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97)
[20:42:24] RECOVERY - Host cloudcephosd1016 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[21:32:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[21:35:55] PROBLEM - Host cloudcephosd1016 is DOWN: PING CRITICAL - Packet loss = 100%
[21:36:25] RECOVERY - Host cloudcephosd1016 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[21:38:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[21:39:46] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0)
[21:40:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[21:40:58] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[21:43:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[21:43:33] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10911206 (10Andrew)
[21:54:18] FIRING: KernelErrors: Server cloudcephosd1015 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1015 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[22:01:28] FIRING: InstanceDown: Project tools instance tools-puppetserver-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[22:08:10] (03open) 10chuckonwumelu: GET the latest deployment for a particular tool [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/87 (https://phabricator.wikimedia.org/T394990)
[22:20:07] (03update) 10chuckonwumelu: GET the latest deployment for a particular tool [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/87 (https://phabricator.wikimedia.org/T394990)
[22:24:43] (03update) 10chuckonwumelu: GET the latest deployment for a particular tool [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/87 (https://phabricator.wikimedia.org/T394990)
[22:29:51] 06cloud-services-team, 10Data-Services, 10Wikifunctions, 10Abstract Wikipedia team (25Q4 (Apr–Jun)), 07Essential-Work: Make wikifunctionsclient_usage table available on cloud wiki replicas - https://phabricator.wikimedia.org/T392475#10911342 (10Ladsgroup) Soon once we switchover the maintain-views of pub...
[22:31:28] RESOLVED: InstanceDown: Project tools instance tools-puppetserver-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[22:43:28] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-ingress-8 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[22:58:28] RESOLVED: [2x] PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-ingress-8 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[23:04:18] FIRING: [2x] KernelErrors: Server cloudcephosd1015 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[23:04:24] 06cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T396810 (10phaultfinder) 03NEW
[23:14:29] (03open) 10andrew: Add 'magnum' service project in codfw1dev [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/248 (https://phabricator.wikimedia.org/T393782)
[23:14:38] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/248
[23:14:58] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/248
[23:15:04] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/248
[23:15:22] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/248
[23:50:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-bastionless.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
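Taken together, the day's SAL entries trace the per-host cycle used for the cloudcephosd bullseye reimages (T309789): drain, destroy, reimage, re-add, undrain. A sketch of the inferred order, with arguments omitted since the log records only cookbook names and exit codes:

    cookbook wmcs.ceph.osd.drain_node          # move placement groups off the host
    cookbook wmcs.ceph.osd.depool_and_destroy  # remove its OSDs from the cluster
    # ...host reimaged to bullseye; the PROBLEM/RECOVERY pings and the brief
    # CephClusterInWarning windows above correspond to this step...
    cookbook wmcs.ceph.osd.bootstrap_and_add   # recreate OSDs on the fresh install
    cookbook wmcs.ceph.osd.undrain_node        # rebalance data back onto the host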