[02:13:37] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10907270 (10Andrew)
[02:18:21] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10907285 (10Andrew) a:05cmooney→03Jclark-ctr @jclark-ctr, we would like to wait until the 25G dacs come in, and then have each of these hosts reconnect...
[02:18:43] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10907288 (10Andrew)
[02:19:02] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T309789)
[02:19:09] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[02:28:41] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10907303 (10Andrew)
[02:32:16] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE: Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10907306 (10Andrew)
[02:32:32] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE: Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10907307 (10Andrew) Note that we need two ports for each of these, I've just updated the task description. Does that make fitting them even harder?
[03:08:19] 06cloud-services-team, 10Horizon, 05Cloud-Services-Origin-User, 07Upstream: Horizon: network topology panel ignores user policy, suggests deleting networks and instances - https://phabricator.wikimedia.org/T389965#10907378 (10Andrew) 05Open→03Resolved This is now fixed in our deployment and merged...
[05:23:14] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789)
[05:23:21] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[07:08:41] FIRING: PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[07:13:41] FIRING: [2x] PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[07:33:41] RESOLVED: [2x] PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[07:33:56] FIRING: [2x] PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[07:34:11] RESOLVED: PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[07:48:42] 06cloud-services-team, 10Cloud-VPS, 10Data-Engineering (Q4 2025 April 1st - June 30th), 07IPv6: Add new WMCS IP ranges to analytics - https://phabricator.wikimedia.org/T392468#10907888 (10JAllemandou)
[07:49:45] 06cloud-services-team, 10Cloud-VPS, 10Data-Engineering (Q4 2025 April 1st - June 30th), 07IPv6: Add new WMCS IP ranges to analytics - https://phabricator.wikimedia.org/T392468#10907889 (10JAllemandou) 05Open→03Resolved Sorry I forgot to follow up. The boxes are ticked, the code is live, I'm resolvi...
[08:20:56] 06cloud-services-team, 10Cloud-VPS: Support keystone role management with tofu-infra - https://phabricator.wikimedia.org/T396671#10908005 (10taavi)
[08:20:57] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review: Create OpenStack role that allows object storage access only - https://phabricator.wikimedia.org/T396594#10908006 (10taavi)
[08:37:40] 10Data-Services, 06Data-Engineering: Create a view for existencelinks table - https://phabricator.wikimedia.org/T394898#10908068 (10Tacsipacsi) Combined with {T395366}, this is very bad and urgent. Useful and previously-accessible data is currently **COMPLETELY** inaccessible for anyone without an NDA signed:...
[09:08:03] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724 (10fnegri) 03NEW
[09:08:10] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908222 (10fnegri) p:05Triage→03High a:03fnegri
[09:08:26] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908225 (10fnegri) 05Open→03In progress
[09:10:28] FIRING: InstanceDown: Project tools instance tools-prometheus-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:10:38] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908244 (10fnegri) The same instance crashed last year because the disk filled up: {T355138} This time it looks slightly different, the `wal_ar...
[09:12:22] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908260 (10fnegri) I'm not sure if I can just delete files from `wal_archive`, so I will attempt resizing the disk first.
[09:19:47] FIRING: TektonDown: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown
[09:19:47] FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[09:20:17] FIRING: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[09:20:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:24:47] RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[09:24:47] RESOLVED: TektonDown: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown
[09:25:17] RESOLVED: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[09:28:28] FIRING: InstanceDown: Project tools instance tools-prometheus-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:32:03] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908381 (10fnegri) ` sudo OS_PROJECT_ID=glamwikidashboard wmcs-openstack database instance resize volume ee0c90b0-5d21-4d41-9abf-cdabca2787c3 55...
[09:35:23] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908390 (10fnegri) I tried `database instance restart`, that moved it to status `REBOOT` and then `ACTIVE / ERROR`. I then tried rebooting the V...
[09:39:21] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908404 (10YochayCO) Does it make sense that I still don't get a response when sending a ping command to the host?
[09:40:33] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908408 (10fnegri) @YochayCO not sure, let me try fixing that error first. This is similar to what @taavi saw in {T355138} and apparently deleti...
[09:50:43] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908462 (10fnegri) ` root@dbapp:/# ls -l /var/lib/postgresql/data/wal_archive/0000000100001577000000A1 -rw------- 1 database database 9940992 J...
[10:02:21] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908505 (10fnegri) @YochayCO I can successfully connect to the database now. `ping` does not work, but I think some firewall is blocking that. C...
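For readers following the Trove incident above: the resize command fnegri quotes at 09:32 and the `database instance restart` mentioned at 09:35 are the wmcs-openstack forms used here. A minimal sketch of the check-and-grow sequence, with placeholder project, UUID, and size (the log truncates the actual target size, so the values below are illustrative, not the incident's):

    # inspect usage from inside the guest (the 09:50 shell above shows wal_archive filling the disk)
    df -h /var/lib/postgresql
    # grow the Trove-managed volume; <uuid> and <size-gb> are placeholders
    sudo OS_PROJECT_ID=<project> wmcs-openstack database instance resize volume <uuid> <size-gb>
    # restart the database instance afterwards if it does not recover on its own
    sudo OS_PROJECT_ID=<project> wmcs-openstack database instance restart <uuid>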
[10:04:16] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908511 (10fnegri) p:05High→03Medium I'll keep this task open to monitor the disk space in the next few days.
[10:13:08] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10908526 (10YochayCO) Awesome, many thanks! Does it make sense that maybe I don't have permissions to ssh to the db host? I can see it as DbApp...
[10:23:06] (03CR) 10Majavah: [C:03+2] build: Upgrade Codex to 2.1.0 [labs/striker] - 10https://gerrit.wikimedia.org/r/1155739 (owner: 10Majavah)
[10:25:35] (03Merged) 10jenkins-bot: build: Upgrade Codex to 2.1.0 [labs/striker] - 10https://gerrit.wikimedia.org/r/1155739 (owner: 10Majavah)
[10:28:16] !log dcaro@acme tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[10:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:30:10] (03CR) 10Majavah: [C:03+2] Switch username validation to Bitu API [labs/striker] - 10https://gerrit.wikimedia.org/r/1134724 (https://phabricator.wikimedia.org/T364605) (owner: 10Arendpieter)
[10:31:39] (03Merged) 10jenkins-bot: Switch username validation to Bitu API [labs/striker] - 10https://gerrit.wikimedia.org/r/1134724 (https://phabricator.wikimedia.org/T364605) (owner: 10Arendpieter)
[10:34:45] !log dcaro@acme tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[10:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:38:28] FIRING: [2x] InstanceDown: Project tools instance tools-k8s-worker-nfs-46 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:40:23] FIRING: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-46 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[10:45:24] RESOLVED: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-46 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[10:48:28] RESOLVED: [2x] InstanceDown: Project tools instance tools-k8s-worker-nfs-46 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:51:39] (03approved) 10dcaro: [deploy] add force-build and force-run query params [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/80 (https://phabricator.wikimedia.org/T389044) (owner: 10raymond-ndibe)
[10:51:54] (03merge) 10dcaro: [deploy] add force-build and force-run query params [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/80 (https://phabricator.wikimedia.org/T389044) (owner: 10raymond-ndibe)
[10:54:37] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.116-20250612105200-81744f77 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/814 (https://phabricator.wikimedia.org/T389044)
[10:58:45] 10Cloud-Services, 06collaboration-services, 10GitLab (Infrastructure): Volume is stuck to deleted instance in devtools project - https://phabricator.wikimedia.org/T396739 (10Jelto) 03NEW The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedi...
[10:59:45] 10VPS-Projects, 06collaboration-services, 10GitLab (Infrastructure): Volume is stuck to deleted instance in devtools project - https://phabricator.wikimedia.org/T396739#10908647 (10Jelto)
[11:00:56] 06cloud-services-team, 10Cloud-VPS, 06collaboration-services, 10GitLab (Infrastructure): Volume is stuck to deleted instance in devtools project - https://phabricator.wikimedia.org/T396739#10908653 (10taavi)
[11:41:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T309789)
[11:41:41] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[11:47:46] (03open) 10addshore: Fix README typo (actual) [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/108
[12:06:32] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE: Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10908834 (10Andrew)
[12:08:43] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-api
[12:11:45] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T396363)
[12:11:51] T396363: Moving extra 1G port to make 10G space on cloud rack. - https://phabricator.wikimedia.org/T396363
[12:12:27] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api
[12:17:14] (03approved) 10dcaro: builds: show also the pending state builds [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/107
[12:17:30] (03approved) 10dcaro: components-api: bump to 0.0.116-20250612105200-81744f77 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/814 (https://phabricator.wikimedia.org/T389044) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[12:17:33] (03merge) 10dcaro: components-api: bump to 0.0.116-20250612105200-81744f77 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/814 (https://phabricator.wikimedia.org/T389044) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[12:20:46] (03update) 10dcaro: builds: show also the pending state builds [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/107
[12:20:48] (03merge) 10dcaro: builds: show also the pending state builds [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/107
[12:36:43] (03update) 10dcaro: Fix README typo (actual) [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/108 (owner: 10addshore)
[12:45:08] (03open) 10addshore: Introduce _get_status_style with default fallback [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/109
[12:46:10] (03update) 10addshore: Introduce _get_status_style with default fallback [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/109
[12:49:12] (03update) 10addshore: Fix README typo (actual) [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/108
[12:49:30] (03update) 10addshore: README: Fix typo (actual) [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/108
[12:50:15] (03approved) 10dcaro: README: Fix typo (actual) [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/108 (owner: 10addshore)
[12:50:20] (03update) 10addshore: README: Fix typo (actual) [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/108
[12:50:41] (03update) 10addshore: README: Fix typo (actual) [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/108
[12:53:24] (03update) 10addshore: Introduce _get_status_style with default fallback [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/109
[12:53:41] (03update) 10addshore: builds: Introduce _get_status_style with default fallback [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/109
[12:53:56] (03merge) 10addshore: README: Fix typo (actual) [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/108
[12:54:08] (03update) 10addshore: builds: Introduce _get_status_style with default fallback [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/109
[13:23:06] 06cloud-services-team, 10Toolforge, 06Infrastructure-Foundations, 10netops: [infra] Reports of slow connectivity from APAC - https://phabricator.wikimedia.org/T395135#10909085 (10cmooney) The latency is also reduced when I check for it here (there are no manual overrides of the traffic path in place either...
[13:44:50] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Striker, 10Bitu, 06Infrastructure-Foundations, 13Patch-For-Review: Move Striker to Bitu username validation API - https://phabricator.wikimedia.org/T364605#10909209 (10taavi) 05In progress→03Resolved This change is now live. Thank you @Arendpieter for...
[13:44:55] 06cloud-services-team, 10Striker: [toolsadmin] Striker cannot create Developer accounts or tools with names matching existing SUL accounts - https://phabricator.wikimedia.org/T380384#10909212 (10taavi)
[13:46:06] 06cloud-services-team, 10Striker: [toolsadmin] Striker cannot create Developer accounts or tools with names matching existing SUL accounts - https://phabricator.wikimedia.org/T380384#10909217 (10taavi) 05Open→03Resolved a:03Arendpieter I believe this is fixed with {T364605}.
[13:59:33] (03approved) 10dcaro: builds: Introduce _get_status_style with default fallback [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/109 (owner: 10addshore)
[14:06:22] !log fnegri@cloudcumin1001 zuul START - Cookbook wmcs.vps.create_project for project zuul in eqiad1
[14:06:24] fnegri@cloudcumin1001: Unknown project "zuul"
[14:06:37] !log fnegri@cloudcumin1001 zuul END (ERROR) - Cookbook wmcs.vps.create_project (exit_code=97) for project zuul in eqiad1
[14:06:37] fnegri@cloudcumin1001: Unknown project "zuul"
[14:06:42] !log fnegri@cloudcumin1001 zuul START - Cookbook wmcs.vps.create_project for project zuul in eqiad1 (T396540)
[14:06:43] fnegri@cloudcumin1001: Unknown project "zuul"
[14:06:43] T396540: Request creation of zuul VPS project - https://phabricator.wikimedia.org/T396540
[14:07:21] (03open) 10group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49: projects: added project zuul [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/247 (https://phabricator.wikimedia.org/T396540)
[14:07:53] !log fnegri@cloudcumin1001 zuul END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project zuul in eqiad1 (T396540)
[14:07:53] fnegri@cloudcumin1001: Unknown project "zuul"
[14:09:23] (03open) 10dcaro: deploy_task: force reruning when there was a build [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/86 (https://phabricator.wikimedia.org/T389044)
[14:10:36] (03update) 10dcaro: deploy_task: force reruning when there was a build [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/86 (https://phabricator.wikimedia.org/T389044)
[14:13:17] !log fnegri@cloudcumin1001 zuul START - Cookbook wmcs.vps.create_project for project zuul in eqiad1 (T396540)
[14:13:18] fnegri@cloudcumin1001: Unknown project "zuul"
[14:13:20] T396540: Request creation of zuul VPS project - https://phabricator.wikimedia.org/T396540
[14:13:49] !log fnegri@cloudcumin1001 zuul END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project zuul in eqiad1 (T396540)
[14:13:49] fnegri@cloudcumin1001: Unknown project "zuul"
[14:16:10] (03approved) 10taavi: projects: added project zuul [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/247 (https://phabricator.wikimedia.org/T396540) (owner: 10group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49)
[14:23:40] (03merge) 10fnegri: projects: added project zuul [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/247 (https://phabricator.wikimedia.org/T396540) (owner: 10group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49)
[14:23:59] !log fnegri@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[14:25:23] !log fnegri@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch
[14:27:13] !log fnegri@cloudcumin1001 zuul START - Cookbook wmcs.vps.create_project for project zuul in eqiad1 (T396540)
[14:27:15] fnegri@cloudcumin1001: Unknown project "zuul"
[14:27:16] T396540: Request creation of zuul VPS project - https://phabricator.wikimedia.org/T396540
[14:28:39] !log fnegri@cloudcumin1001 zuul END (PASS) - Cookbook wmcs.vps.create_project (exit_code=0) for project zuul in eqiad1 (T396540)
[14:28:39] fnegri@cloudcumin1001: Unknown project "zuul"
[14:32:37] 06cloud-services-team, 10Cloud-VPS (Project-requests), 10Continuous-Integration-Infrastructure (Zuul upgrade): Request creation of zuul VPS project - https://phabricator.wikimedia.org/T396540#10909397 (10fnegri) 05Open→03Resolved Project created! Please double check that the permissions and quotas ar...
[14:38:28] (03update) 10dcaro: deploy_task: force reruning when there was a build [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/86 (https://phabricator.wikimedia.org/T389044)
[14:48:32] (03update) 10dcaro: functional_tests: use the right webservice tag for the tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/813
[14:52:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[14:52:50] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0)
[14:53:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[14:55:20] (03update) 10dcaro: functional_tests: use the right webservice tag for the tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/813
[14:56:44] PROBLEM - Host cloudcephosd1014 is DOWN: PING CRITICAL - Packet loss = 100%
[14:57:09] (03update) 10dcaro: run_functional_tests: add extra logs with filters/components [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/812
[14:57:23] (03approved) 10dcaro: run_functional_tests: add extra logs with filters/components [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/812
[14:57:24] PROBLEM - Host cloudcephosd1015 is DOWN: PING CRITICAL - Packet loss = 100%
[14:57:29] (03merge) 10dcaro: run_functional_tests: add extra logs with filters/components [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/812
[14:57:58] (03update) 10dcaro: functional_tests: use the right webservice tag for the tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/813
[14:58:12] RECOVERY - Host cloudcephosd1014 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[14:58:24] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=97)
[14:58:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[15:00:57] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10909526 (10fnegri) @YochayCO We don't have a way at the moment to grant SSH access to the Trove DB hosts, the only way we can SSH is through a s...
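The zuul exchange above is a compact illustration of the current Cloud VPS project-creation flow: wmcs.vps.create_project keeps failing with "Unknown project" until the project is defined in tofu-infra and applied, so the working order is the tofu-infra MR first, then the tofu apply cookbook, then the creation cookbook. A rough sketch of that order (cookbook names are from the log; exact arguments are not shown there, so they are omitted):

    # 1. merge the tofu-infra change defining the project (MR 247 above), then apply it
    cookbook wmcs.openstack.tofu               # runs tofu plan+apply for the main branch
    # 2. only now does the project exist and creation succeed
    cookbook wmcs.vps.create_project ...       # project zuul in eqiad1; flags omitted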
[15:02:00] RECOVERY - Host cloudcephosd1015 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[15:04:44] PROBLEM - Host cloudcephosd1014 is DOWN: PING CRITICAL - Packet loss = 100%
[15:06:12] RECOVERY - Host cloudcephosd1014 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[15:09:36] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0)
[15:11:11] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[15:12:14] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0)
[15:27:54] (03approved) 10dcaro: functional_tests: use the right webservice tag for the tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/813
[15:27:58] (03merge) 10dcaro: functional_tests: use the right webservice tag for the tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/813
[15:30:35] 10Toolforge (Toolforge iteration 21), 07good first task: [components-api] add `GET` endpoint `/v1/tool//deployments/latest` - https://phabricator.wikimedia.org/T394990#10909790 (10Chuckonwumelu) 05Open→03In progress
[15:33:14] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10909819 (10Andrew)
[15:39:14] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10909868 (10dcaro) I did some "hacks" for harbor on this too, might be a better way of doing the same, but helped reduce the usage with the drawb...
[15:50:48] FIRING: PuppetFailure: Puppet has failed on cloudcontrol2010-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:50:53] 06cloud-services-team: PuppetFailure Puppet has failed on cloudcontrol2010-dev:9100 - https://phabricator.wikimedia.org/T396769 (10phaultfinder) 03NEW
[15:51:57] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Quarry: Improve Quarry's observability - https://phabricator.wikimedia.org/T396770 (10taavi) 03NEW p:05Triage→03High
[15:52:22] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Quarry: Improve Quarry's observability - https://phabricator.wikimedia.org/T396770#10909953 (10taavi) a:03taavi
[15:53:59] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Quarry: Deploy prometheus-redis-exporter - https://phabricator.wikimedia.org/T396771 (10taavi) 03NEW
[15:54:06] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Quarry: Deploy prometheus-redis-exporter - https://phabricator.wikimedia.org/T396771#10909972 (10taavi) p:05Triage→03High
[15:54:16] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Quarry: Deploy prometheus-redis-exporter - https://phabricator.wikimedia.org/T396771#10909975 (10taavi) a:03taavi
[15:57:00] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Quarry: quarry is leaking tmp files - https://phabricator.wikimedia.org/T395237#10909983 (10taavi) a:03taavi I am planning to "fix" this by disabling the Excel export feature.
[16:00:49] 10Quarry: worker nodes issue with garbage collection - https://phabricator.wikimedia.org/T375997#10910003 (10taavi) I suspect this is a duplicate of T395237 so merging this there.
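T396771 above tracks deploying prometheus-redis-exporter for Quarry as part of the observability push in T396770. The log does not show the intended deployment; as a point of reference only, the upstream exporter (oliver006/redis_exporter, which the Debian prometheus-redis-exporter package is built from) serves metrics on port 9121 and is pointed at a Redis instance like the sketch below. All values are illustrative defaults, not Quarry's actual configuration:

    # run the exporter next to a Redis instance, then scrape :9121/metrics from Prometheus
    docker run -d --name redis-exporter -p 9121:9121 \
        oliver006/redis_exporter --redis.addr=redis://localhost:6379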
[16:00:56] 10Quarry: worker nodes issue with garbage collection - https://phabricator.wikimedia.org/T375997#10910006 (10taavi) →14Duplicate dup:03T395237
[16:01:01] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Quarry: quarry is leaking tmp files - https://phabricator.wikimedia.org/T395237#10910008 (10taavi)
[16:01:46] 10Quarry: Setup an easy way to have Quarry dump information / results on a wiki page - https://phabricator.wikimedia.org/T137179#10910011 (10taavi) 05Open→03Invalid I'm boldly declining this, as there's already a bot that does essentially the same thing.
[16:08:46] 06cloud-services-team, 10Data-Services, 10Wikifunctions, 10Abstract Wikipedia team (25Q4 (Apr–Jun)), 07Essential-Work: Make wikifunctionsclient_usage table available on cloud wiki replicas - https://phabricator.wikimedia.org/T392475#10910079 (10Jdforrester-WMF) a:03Jdforrester-WMF
[16:13:18] RESOLVED: PuppetFailure: Puppet has failed on cloudcontrol2010-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:36:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node
[16:43:05] FIRING: HostBGPDown: BGP session for cloudservices2004-dev (172.20.5.8) is down - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cloudsw1-b1-codfw:9804&var-bgp_group=cloud_host - https://alerts.wikimedia.org/?q=alertname%3DHostBGPDown
[16:43:14] 06cloud-services-team: HostBGPDown BGP session for cloudservices2004-dev (172.20.5.8) is down - https://phabricator.wikimedia.org/T396782 (10phaultfinder) 03NEW
[16:44:19] 06cloud-services-team: HostBGPDown BGP session for cloudservices2004-dev (172.20.5.8) is down - https://phabricator.wikimedia.org/T396782#10910312 (10taavi) 05Open→03Resolved a:03taavi
[16:47:12] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Quarry: Fix Quarry's Redis pod exiting causing frequent outages - https://phabricator.wikimedia.org/T396785 (10taavi) 03NEW p:05Triage→03High
[16:48:05] RESOLVED: HostBGPDown: BGP session for cloudservices2004-dev (172.20.5.8) is down - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cloudsw1-b1-codfw:9804&var-bgp_group=cloud_host - https://alerts.wikimedia.org/?q=alertname%3DHostBGPDown
[16:48:16] 10Toolforge (Toolforge iteration 21), 07Documentation: [components-api] Add admin documentation page - https://phabricator.wikimedia.org/T394280#10910350 (10dcaro) a:03dcaro
[16:48:21] 10Toolforge (Toolforge iteration 21), 07Documentation: [components-api] Add admin documentation page - https://phabricator.wikimedia.org/T394280#10910352 (10dcaro) 05Open→03In progress
[16:49:36] supertassu opened https://github.com/toolforge/quarry/pull/84
[16:50:13] 10Toolforge (Toolforge iteration 21), 07Documentation: [components-api] Add admin documentation page - https://phabricator.wikimedia.org/T394280#10910361 (10dcaro)
[16:50:20] (03open) 10dhardy: About screen initial implementation [toolforge-repos/wikirun-game] - 10https://gitlab.wikimedia.org/toolforge-repos/wikirun-game/-/merge_requests/1
[16:57:44] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10910380 (10RobH) Things I checked: * The PXE boot setting is indeed set to the 10G NIC's...
[16:58:59] supertassu opened https://github.com/toolforge/quarry/pull/85
[17:03:21] 10Toolforge (Toolforge iteration 21), 13Patch-For-Review: [components-api] Add alerts and runbooks for basic service health - https://phabricator.wikimedia.org/T394275#10910396 (10dcaro) Hmm... it seems that the alerts are not being deployed in tools (they did in toolsbeta). The alerts-deploy service seems to...
[17:04:39] 10Toolforge (Toolforge iteration 21), 13Patch-For-Review: [components-api] Add alerts and runbooks for basic service health - https://phabricator.wikimedia.org/T394275#10910404 (10dcaro) Just chowning seemed to do the trick: ` root@tools-prometheus-8:/srv# chown -R alerts-deploy:alerts-deploy /srv/alerts `
[17:06:33] 10Toolforge (Toolforge iteration 21), 13Patch-For-Review: [components-api] Add alerts and runbooks for basic service health - https://phabricator.wikimedia.org/T394275#10910413 (10dcaro) Now showing up on tools too \o/ {F62304702}
[17:12:51] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T396363)
[17:12:57] T396363: Moving extra 1G port to make 10G space on cloud rack. - https://phabricator.wikimedia.org/T396363
[17:15:52] 10Tool-kuwikibot, 10Toolhub: Invalid source code and issues URL on https://toolsadmin.wikimedia.org/tools/id/kuwikibot - https://phabricator.wikimedia.org/T361553#10910442 (10bd808) https://ldap.toolforge.org/user/roj1 is the Developer account associated with the tool. In the #striker db that Developer account...
[17:51:05] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789)
[17:51:12] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[17:55:53] (03merge) 10jdrewniak: About screen initial implementation [toolforge-repos/wikirun-game] - 10https://gitlab.wikimedia.org/toolforge-repos/wikirun-game/-/merge_requests/1 (owner: 10dhardy)
[18:04:57] 10Quarry: Setup an easy way to have Quarry dump information / results on a wiki page - https://phabricator.wikimedia.org/T137179#10910587 (10Stevietheman) See also the [[ https://en.wikipedia.org/wiki/Template:Database_report | Database report ]] template in the English Wikipedia. It's a brilliant way of usi...
[18:44:35] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10910760 (10cmooney) >>! In T309789#10910380, @RobH wrote: > So it appears its sending in t...
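The alerts-deploy fix dcaro logs at 17:04 above is a plain ownership repair on tools-prometheus-8. A minimal before/after sketch (the chown line is the one quoted in the log; the systemd unit name is an assumption, since the log only says "the alerts-deploy service"):

    ls -ld /srv/alerts                                  # confirm the current owner of the alerts tree
    chown -R alerts-deploy:alerts-deploy /srv/alerts    # the fix quoted in the log
    systemctl status alerts-deploy.service              # unit name assumed, not shown above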
[19:14:18] FIRING: KernelErrors: Server cloudcephosd1015 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1015 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[19:14:29] 06cloud-services-team: KernelErrors Server cloudcephosd1015 logged kernel errors - https://phabricator.wikimedia.org/T396796 (10phaultfinder) 03NEW
[19:18:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T309789)
[19:18:06] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[19:32:58] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T396363)
[19:33:04] T396363: Moving extra 1G port to make 10G space on cloud rack. - https://phabricator.wikimedia.org/T396363
[19:33:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T396363)
[19:49:02] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789)
[19:49:09] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[19:49:11] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0)
[19:51:26] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[19:52:15] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T396363)
[19:52:22] T396363: Moving extra 1G port to make 10G space on cloud rack. - https://phabricator.wikimedia.org/T396363
[19:52:59] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T396363)
[20:17:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[20:18:24] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0)
[20:18:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[20:22:52] PROBLEM - Host cloudcephosd1015 is DOWN: PING CRITICAL - Packet loss = 100%
[20:23:20] RECOVERY - Host cloudcephosd1015 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[20:25:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[20:27:11] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0)
[20:27:57] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[20:29:40] PROBLEM - Host cloudcephosd1016 is DOWN: PING CRITICAL - Packet loss = 100%
[20:29:40] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10911034 (10Andrew)
[20:30:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[20:31:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[20:31:48] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99)
[20:34:47] FIRING: NodeDown: Node cloudcephosd1016 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1016 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:34:50] RECOVERY - Host cloudcephosd1016 is UP: PING WARNING - Packet loss = 80%, RTA = 0.71 ms
[20:35:38] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node
[20:38:56] PROBLEM - Host cloudcephosd1016 is DOWN: PING CRITICAL - Packet loss = 100%
[20:39:18] FIRING: KernelErrors: Server cloudcephosd1016 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1016 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[20:39:21] 06cloud-services-team: KernelErrors Server cloudcephosd1016 logged kernel errors - https://phabricator.wikimedia.org/T396801 (10phaultfinder) 03NEW
[20:39:47] RESOLVED: NodeDown: Node cloudcephosd1016 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1016 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:40:49] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97)
[20:42:24] RECOVERY - Host cloudcephosd1016 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[21:32:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[21:35:55] PROBLEM - Host cloudcephosd1016 is DOWN: PING CRITICAL - Packet loss = 100%
[21:36:25] RECOVERY - Host cloudcephosd1016 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[21:38:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[21:39:46] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0)
[21:40:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[21:40:58] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[21:43:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[21:43:33] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10911206 (10Andrew)
[21:54:18] FIRING: KernelErrors: Server cloudcephosd1015 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1015 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[22:01:28] FIRING: InstanceDown: Project tools instance tools-puppetserver-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[22:08:10] (03open) 10chuckonwumelu: GET the latest deployment for a particular tool [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/87 (https://phabricator.wikimedia.org/T394990)
[22:20:07] (03update) 10chuckonwumelu: GET the latest deployment for a particular tool [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/87 (https://phabricator.wikimedia.org/T394990)
[22:24:43] (03update) 10chuckonwumelu: GET the latest deployment for a particular tool [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/87 (https://phabricator.wikimedia.org/T394990)
[22:29:51] 06cloud-services-team, 10Data-Services, 10Wikifunctions, 10Abstract Wikipedia team (25Q4 (Apr–Jun)), 07Essential-Work: Make wikifunctionsclient_usage table available on cloud wiki replicas - https://phabricator.wikimedia.org/T392475#10911342 (10Ladsgroup) Soon once we switchover the maintain-views of pub...
[22:31:28] RESOLVED: InstanceDown: Project tools instance tools-puppetserver-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[22:43:28] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-ingress-8 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[22:58:28] RESOLVED: [2x] PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-ingress-8 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[23:04:18] FIRING: [2x] KernelErrors: Server cloudcephosd1015 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[23:04:24] 06cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T396810 (10phaultfinder) 03NEW
[23:14:29] (03open) 10andrew: Add 'magnum' service project in codfw1dev [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/248 (https://phabricator.wikimedia.org/T393782)
[23:14:38] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/248
[23:14:58] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/248
[23:15:04] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/248
[23:15:22] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/248
[23:50:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-bastionless.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
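Taken together, the day's SAL entries trace the per-host cycle used for the cloudcephosd bullseye reimages (T309789): drain, destroy, reimage, re-add, undrain. A sketch of the inferred order, with arguments omitted since the log records only cookbook names and exit codes:

    cookbook wmcs.ceph.osd.drain_node          # move placement groups off the host
    cookbook wmcs.ceph.osd.depool_and_destroy  # remove its OSDs from the cluster
    # ...host reimaged to bullseye; the PROBLEM/RECOVERY pings and the brief
    # CephClusterInWarning windows above correspond to this step...
    cookbook wmcs.ceph.osd.bootstrap_and_add   # recreate OSDs on the fresh install
    cookbook wmcs.ceph.osd.undrain_node        # rebalance data back onto the host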