[00:05:56] FIRING: MaxConntrack: Max conntrack at 83.3% on cloudvirt1067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:12:23] FIRING: OOM: OOM killer active on cloudcephmon2004-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[00:17:23] RESOLVED: OOM: OOM killer active on cloudcephmon2004-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[00:41:08] FIRING: OOM: OOM killer active on cloudcephmon2004-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[00:44:53] RESOLVED: OOM: OOM killer active on cloudcephmon2004-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[00:55:55] RESOLVED: MaxConntrack: Max conntrack at 82.84% on cloudvirt1067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:57:05] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[01:42:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[02:03:35] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10965932 (Andrew)
[02:07:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[02:10:27] cloud-services-team, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: ceph-mon 16.2.15+ds-0+deb12u1 uses all the RAM - https://phabricator.wikimedia.org/T398389 (Andrew) NEW
[02:14:23] cloud-services-team, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: ceph-mon 16.2.15+ds-0+deb12u1 uses all the RAM - https://phabricator.wikimedia.org/T398389#10965976 (JJMC89)
[02:57:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-39 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[05:13:22] FIRING: [2x] HAProxyBackendUnavailable: HAProxy service keystone-admin-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[05:18:22] RESOLVED: [2x] HAProxyBackendUnavailable: HAProxy service keystone-admin-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[06:32:39] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:33:29] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29769 bytes in 0.223 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:54:39] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:58:31] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29772 bytes in 0.686 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:01:30] FIRING: PuppetStaleCertificates: Found non-revoked Puppet certificates for 1 deleted instances on gitlab-runners-puppetserver-01 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates
[08:19:01] (update) dcaro: cli: drop outadated comment [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/70 (owner: aborrero)
[08:19:27] (merge) dcaro: components: add test for the generate feature [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/822
[08:19:55] (update) dcaro: bash-completion: Add file system recognition to autocomplete [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/46 (https://phabricator.wikimedia.org/T395077) (owner: chuckonwumelu)
[08:22:31] (approved) dcaro: build: Upgrade Poetry dependencies [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/177 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[08:31:28] (update) fnegri: bash-completion: Add file system recognition to autocomplete [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/46 (https://phabricator.wikimedia.org/T395077) (owner: chuckonwumelu)
[08:32:27] (update) fnegri: bash-completion: Add file system recognition to autocomplete [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/46 (https://phabricator.wikimedia.org/T395077) (owner: chuckonwumelu)
[08:34:39] Cloud-VPS (Project-requests): Request creation of content-transformers VPS project - https://phabricator.wikimedia.org/T398405 (Jgiannelos) NEW
[08:36:19] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-30 - https://phabricator.wikimedia.org/T398170#10966403 (fnegri) The new replica caught up with the primary late yesterday, and then started lagging again: {F62776250} This time it's for a differ...
[08:36:49] Cloud-VPS (Project-requests): Request creation of content-transformers VPS project - https://phabricator.wikimedia.org/T398405#10966405 (taavi) From #cloud-vps-project-requests: > == Project scope == > Cloud VPS projects should be scoped based around concrete products or software projects, rather than the te...
[08:38:16] Cloud-VPS (Project-requests): Request creation of content-transformers VPS project - https://phabricator.wikimedia.org/T398405#10966406 (Jgiannelos)
[08:38:46] Cloud-VPS (Project-requests): Request creation of mobileapps VPS project - https://phabricator.wikimedia.org/T398405#10966408 (Jgiannelos) Thanks @taavi I updated the request.
[08:38:48] Cloud-VPS (Project-requests): Request creation of mobileapps VPS project - https://phabricator.wikimedia.org/T398405#10966410 (Jgiannelos)
[08:40:02] (merge) dcaro: build: Upgrade Poetry dependencies [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/177 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[08:40:31] RESOLVED: ToolsToolsDBReplicationLagIsTooHigh: ToolsDB replication on tools-db-6 is lagging behind the primary, the current lag is 13h 58m 25s - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[08:41:31] FIRING: ToolsToolsDBReplicationLagIsTooHigh: ToolsDB replication on tools-db-6 is lagging behind the primary, the current lag is 14h 2m 9s - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[08:42:44] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: jobs-api: bump to 0.0.384-20250702084012-39dd7c77 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/859
[08:43:10] cloud-services-team, DC-Ops, decommission-hardware, ops-codfw, SRE: decommission cloudcontrol2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T396396#10966425 (cmooney) >>! In T396396#10955048, @Andrew wrote: >>>! In T396396#10954940, @cmooney wrote: >> Folks you need to delete th...
[08:44:00] (approved) dcaro: cli: drop outadated comment [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/70 (owner: aborrero)
[08:44:08] (merge) dcaro: cli: drop outadated comment [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/70 (owner: aborrero)
[08:45:06] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10966452 (taavi) Open→Resolved
[08:46:34] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: maintain-kubeusers: bump to 0.0.178-20250702084425-15f2dd20 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/860
[08:49:53] Cloud-VPS (Project-requests): Request creation of mobileapps VPS project - https://phabricator.wikimedia.org/T398405#10966465 (SLopes-WMF) Hey @taavi, thanks for your clarification! I'd just convey some urgency in this request — this can have broad impact, especially on mobile apps, which we'd like to avoid.
[08:51:21] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-30 - https://phabricator.wikimedia.org/T398170#10966480 (fnegri) Replication has restarted, but there are some other heavy queries on the same table (`s51698__yetkin.visited_pages_agg`), it's curre...
[08:55:40] (merge) dcaro: scheduled: add scheduled component support [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/94 (https://phabricator.wikimedia.org/T395071)
[08:56:34] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[08:58:33] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.130-20250702085600-91391589 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/861 (https://phabricator.wikimedia.org/T395071)
[09:05:55] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[09:06:10] (update) dcaro: builder: allow specifying the arch [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/64
[09:06:13] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[09:09:40] (update) dcaro: build: enable ci builds [repos/cloud/toolforge/toolforge-misctools] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-misctools/-/merge_requests/1 (https://phabricator.wikimedia.org/T398202)
[09:14:43] (update) dcaro: build: enable ci builds [repos/cloud/toolforge/toolforge-misctools] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-misctools/-/merge_requests/1 (https://phabricator.wikimedia.org/T398202)
[09:15:16] (update) dcaro: build: enable ci builds [repos/cloud/toolforge/toolforge-misctools] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-misctools/-/merge_requests/1 (https://phabricator.wikimedia.org/T398202)
[09:16:15] !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-api
[09:16:54] (update) dcaro: builder: allow specifying the arch [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/64
[09:16:55] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-30 - https://phabricator.wikimedia.org/T398170#10966619 (fnegri) The big INSERT completed, but now there's another big DELETE that will likely take hours because of the missing primary index: ` |...
[09:17:13] (update) dcaro: builder: allow specifying the arch [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/64
[09:18:11] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[09:18:15] (update) dcaro: builder: allow specifying the arch [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/64
[09:24:31] FIRING: ToolsToolsDBReplicationError: ToolsDB replication is broken on tools-db-6 (errno 1927) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationError
[09:24:31] FIRING: [2x] ToolsToolsDBReplicationMissing: ToolsDB replication is not running on tools-db-4 (errno 0) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationMissing
[09:28:53] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[09:29:01] RESOLVED: ToolsToolsDBReplicationLagIsTooHigh: ToolsDB replication on tools-db-6 is lagging behind the primary, the current lag is 14h 21m 57s - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[09:32:16] FIRING: ToolsToolsDBReplicationLagIsTooHigh: ToolsDB replication on tools-db-6 is lagging behind the primary, the current lag is 14h 35m 35s - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[09:34:31] RESOLVED: ToolsToolsDBReplicationError: ToolsDB replication is broken on tools-db-6 (errno 1927) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationError
[09:35:10] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-30 - https://phabricator.wikimedia.org/T398170#10966676 (fnegri) This delete was so massive that even running manually it took 10 minutes to complete: ` MariaDB [s51698__yetkin]> DELETE FROM visit...
[09:37:42] cloud-services-team, Toolforge: [toolsdb] Replica is frequently lagging behind the primary - https://phabricator.wikimedia.org/T357624#10966694 (fnegri)
[09:40:01] RESOLVED: [2x] ToolsToolsDBReplicationMissing: ToolsDB replication is not running on tools-db-4 (errno 0) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationMissing
[09:41:29] (update) dcaro: build: enable ci builds [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/1 (https://phabricator.wikimedia.org/T398202)
[09:42:49] (update) dcaro: build: enable ci builds [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/1 (https://phabricator.wikimedia.org/T398202)
[09:46:41] (update) dcaro: build: enable ci builds [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/1 (https://phabricator.wikimedia.org/T398202)
[09:46:43] (update) dcaro: build: enable ci builds [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/1 (https://phabricator.wikimedia.org/T398202)
[09:46:45] (update) dcaro: build: enable ci builds [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/1 (https://phabricator.wikimedia.org/T398202)
[09:51:32] (approved) dcaro: jobs-api: bump to 0.0.384-20250702084012-39dd7c77 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/859 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[09:51:35] (merge) dcaro: jobs-api: bump to 0.0.384-20250702084012-39dd7c77 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/859 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[09:51:48] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-api
[09:56:36] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api
[09:56:42] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component components-api
[09:58:55] cloud-services-team, Toolforge (Toolforge iteration 21): [tools-static,infra] NFS issues should not bring tools-static down - https://phabricator.wikimedia.org/T397634#10966776 (taavi) a:taavi Plan is to put HAProxy in front of the Nginx instance to handle CDNjs and FontCDN requests so that Nginx goi...
[10:01:38] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api
[10:03:54] (approved) dcaro: components-api: bump to 0.0.130-20250702085600-91391589 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/861 (https://phabricator.wikimedia.org/T395071) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[10:03:56] (update) dcaro: components-api: bump to 0.0.130-20250702085600-91391589 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/861 (https://phabricator.wikimedia.org/T395071) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[10:04:31] Toolforge (Toolforge iteration 21), Patch-For-Review: [components-api] Add all missing options for scheduled components - https://phabricator.wikimedia.org/T395071#10966819 (dcaro) In progress→Resolved
[10:04:32] Toolforge (Toolforge iteration 21): [components-api] Add support for scheduled components - https://phabricator.wikimedia.org/T395065#10966821 (dcaro) In progress→Resolved
[10:05:12] (merge) dcaro: components-api: bump to 0.0.130-20250702085600-91391589 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/861 (https://phabricator.wikimedia.org/T395071) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[10:05:31] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maiantain-kubeusers
[10:05:35] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component maiantain-kubeusers
[10:07:01] (update) dcaro: maintain-kubeusers: bump to 0.0.178-20250702084425-15f2dd20 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/860 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[10:07:05] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers
[10:09:25] (update) dcaro: tool-config: export the config schema [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/98 (https://phabricator.wikimedia.org/T397724)
[10:15:45] cloud-services-team, Toolforge, Patch-For-Review: Migrate misctools package to GitLab - https://phabricator.wikimedia.org/T398202#10966856 (dcaro) p:Low→Medium
[10:16:55] cloud-services-team, Toolforge: [misctools-cli] generate arm64 packages for lima-kilo/macos combination - https://phabricator.wikimedia.org/T398419 (dcaro) NEW
[10:17:34] cloud-services-team, Toolforge: [misctools-cli] generate arm64 packages for lima-kilo/macos combination - https://phabricator.wikimedia.org/T398419#10966870 (dcaro) →Duplicate dup:T398016
[10:17:35] Toolforge (Toolforge iteration 21), Patch-For-Review: [lima-kilo,misctools] no arm64 version for mac-os based installations - https://phabricator.wikimedia.org/T398016#10966872 (dcaro)
[10:17:52] Toolforge (Toolforge iteration 21), Patch-For-Review: [lima-kilo,misctools] no arm64 version for mac-os based installations - https://phabricator.wikimedia.org/T398016#10966875 (dcaro)
[10:17:52] cloud-services-team, Toolforge, Patch-For-Review: Migrate misctools package to GitLab - https://phabricator.wikimedia.org/T398202#10966876 (dcaro)
[10:21:58] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-kubeusers
[10:23:09] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers
[10:23:33] Cloud-VPS (Project-requests): Request creation of dumpstorrents VPS project - https://phabricator.wikimedia.org/T397861#10966887 (TheresNoTime) >>! In T397861#10964875, @Andrew wrote: >>>! In T397861#10962913, @TheresNoTime wrote: > >> - the [[ https://phabricator.wikimedia.org/T29653#4486036 | concerns rai...
[10:38:15] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-kubeusers
[10:48:29] Toolforge (Toolforge iteration 21): [components-cli] Allow reading tool configuration from stdin - https://phabricator.wikimedia.org/T398424 (taavi) NEW
[10:49:20] Toolforge (Toolforge iteration 21): [components-cli] Allow reading tool configuration from stdin - https://phabricator.wikimedia.org/T398424#10967073 (taavi)
[10:49:23] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 21), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 2 others: [Hypothesis] WE6.3.10 start a beta for the push-to-deploy features - https://phabricator.wikimedia.org/T393564#10967074 (taavi)
[10:50:32] cloud-services-team, Toolforge: [components-cli] Invalid YAML file error should not encourage reporting the issue to admins - https://phabricator.wikimedia.org/T398425 (taavi) NEW
[10:55:28] FIRING: NfsAlmostFull: The NFS drive is over 85% capacity at host paws-nfs-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull
[11:13:22] FIRING: [3x] HAProxyBackendUnavailable: HAProxy service keystone-admin-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[11:15:00] FIRING: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:18:22] RESOLVED: [3x] HAProxyBackendUnavailable: HAProxy service keystone-admin-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[11:22:05] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-39 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[11:24:49] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-30 - https://phabricator.wikimedia.org/T398170#10967185 (fnegri) Open→Resolved The replica is back in sync. 🎉 {F62778600}
[11:28:01] RESOLVED: ToolsToolsDBReplicationLagIsTooHigh: ToolsDB replication on tools-db-6 is lagging behind the primary, the current lag is 1h 3m 16s - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[12:02:44] cloud-services-team, Toolforge: toolforge jobs API, schedule vs schedule_actual, and API behaviour change between Feb and June 2025 - https://phabricator.wikimedia.org/T398281#10967349 (dcaro) The metadata is in the pods: ` tools.sample-complex-app@tools-bastion-13:~$ kubectl get cronjobs -o json | jq '....
[12:03:37] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, collaboration-services, GitLab (Infrastructure): Volume is stuck to deleted instance in devtools project - https://phabricator.wikimedia.org/T396739#10967353 (Jelto) Open→Resolved Yes! I delete the unused volume. So we can resolve th...
[12:07:58] (open) taavi: Deploy logging stack by default [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/253 (https://phabricator.wikimedia.org/T386480)
[12:08:05] (update) taavi: Deploy logging stack by default [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/253 (https://phabricator.wikimedia.org/T386480)
[12:13:31] (open) dcaro: api: return the configured schedule [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/178
[12:13:43] (update) dcaro: Draft: api: return the configured schedule [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/178
[12:26:25] (update) dcaro: Draft: api: return the configured schedule [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/178
[12:31:31] (update) dcaro: Draft: api: return the configured schedule [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/178
[12:39:37] (update) dcaro: Draft: api: return the configured schedule [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/178
[12:51:44] (open) taavi: cloudinfra: Cleanup Puppetserver security group [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/253
[12:52:14] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/253
[12:52:50] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/253
[12:52:56] (update) taavi: cloudinfra: Cleanup Puppetserver security group [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/253
[13:30:16] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-74, tools-k8s-worker-nfs-39, tools-k8s-worker-nfs-55
[13:43:55] Cloud-VPS (Project-requests): Request creation of mobileapps VPS project - https://phabricator.wikimedia.org/T398405#10967877 (taavi) +1
[13:44:26] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-74, tools-k8s-worker-nfs-39, tools-k8s-worker-nfs-55
[13:47:51] !log andrew@cloudcumin1001 mobileapps START - Cookbook wmcs.vps.create_project for project mobileapps in eqiad1
[13:47:56] !log andrew@cloudcumin1001 mobileapps END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project mobileapps in eqiad1
[13:50:49] Cloud-VPS (Quota-requests): Increase tools-logging object storage quotes - https://phabricator.wikimedia.org/T398447 (taavi) NEW
[13:51:23] cloud-services-team, Cloud-VPS (Quota-requests), Toolforge: Increase tools-logging object storage quotes - https://phabricator.wikimedia.org/T398447#10967915 (taavi)
[13:51:39] cloud-services-team, Toolforge, Patch-For-Review: [o11y,logging,infra] Deploy Loki to store Toolforge tool log data - https://phabricator.wikimedia.org/T386480#10967917 (taavi)
[13:51:43] cloud-services-team, Cloud-VPS (Quota-requests), Toolforge: Increase tools-logging object storage quotes - https://phabricator.wikimedia.org/T398447#10967918 (taavi)
[13:52:45] Cloud-VPS (Project-requests): Request creation of mobileapps VPS project - https://phabricator.wikimedia.org/T398405#10967931 (Andrew) Hi @Jgiannelos and @SLopes-WMF I'm preparing to create this project, but there's one wrinkle: there is already a proxy within the deployment-prep project named 'mobileapps....
[13:55:39] Cloud-VPS (Project-requests): Request creation of mobileapps VPS project - https://phabricator.wikimedia.org/T398405#10967941 (Jgiannelos) Sure, lets use "mobileapps-perf-test"
[13:58:37] cloud-services-team, Cloud-VPS: Create OpenStack role that allows object storage access only - https://phabricator.wikimedia.org/T396594#10967958 (Andrew) Open→Resolved
[13:59:48] (update) dcaro: Draft: api: return the configured schedule [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/178
[14:04:49] Cloud Services Proposals, cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 21), Cloud-Services-Origin-Team, and 3 others: [builds-api,components-api,webservice,jobs-api] Make Toolforge a proper platform as a service with push... - https://phabricator.wikimedia.org/T194332#10967981
[14:08:27] Cloud-VPS (Project-requests): Request creation of mobileapps VPS project - https://phabricator.wikimedia.org/T398405#10967992 (Andrew) Can I just jam that together into mobleappspertest? I know it's uglier but simple strings tend to behave better when adapted into dns domains &c. (We had a serious bug with...
[14:22:35] (update) dcaro: Draft: api: return the configured schedule [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/178 [14:22:53] (update) dcaro: api: return the configured schedule [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/178 [14:35:09] Cloud-VPS (Project-requests): Request creation of mobileapps VPS project - https://phabricator.wikimedia.org/T398405#10968105 (Andrew) or maybe 'mobileperformance' [14:47:34] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-39 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce [14:47:48] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Toolforge: If the inactive clouddumps host goes down, it causes a ripple effect on Cloud VPS and Toolforge - https://phabricator.wikimedia.org/T391369#10968189 (Andrew) My favorite fix for this would be to mount the dumps as a 'multi-attach' RO cind... [14:48:04] (PS1) Elukey: profile::thanos::swift: rename machinetranslation account [labs/private] - https://gerrit.wikimedia.org/r/1165909 [14:53:11] cloud-services-team, Cloud-VPS (Quota-requests), Toolforge: Increase tools-logging object storage quotas - https://phabricator.wikimedia.org/T398447#10968205 (taavi) [15:15:15] FIRING: OpenstackAPIResponse: Openstack API average response time is too high.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [15:55:57] (Abandoned) Elukey: profile::thanos::swift: rename machinetranslation account [labs/private] - https://gerrit.wikimedia.org/r/1165909 (owner: Elukey) [16:01:12] (update) dcaro: api: return the configured schedule [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/178 [16:01:28] Cloud-VPS (Project-requests): Request creation of mobileapps VPS project - https://phabricator.wikimedia.org/T398405#10968552 (Jgiannelos) Lets use `mobileappsperformance` [16:02:48] !log andrew@cloudcumin1001 mobileappsperformance START - Cookbook wmcs.vps.create_project for project mobileappsperformance in eqiad1 [16:02:49] andrew@cloudcumin1001: Unknown project "mobileappsperformance" [16:03:27] (open) group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49: projects: added project mobileappsperformance [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/254 [16:05:48] (merge) andrew: projects: added project mobileappsperformance [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/254 (owner: group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49) [16:08:35] !log andrew@cloudcumin1001 mobileappsperformance END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project mobileappsperformance in eqiad1 [16:09:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [16:09:59] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan+apply for main branch [16:10:26] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main
branch [16:10:46] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan+apply for main branch [16:11:32] !log andrew@cloudcumin1001 mobileappsperformance START - Cookbook wmcs.vps.create_project for project mobileappsperformance in eqiad1 [16:12:06] !log andrew@cloudcumin1001 mobileappsperformance END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project mobileappsperformance in eqiad1 [16:12:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [16:13:15] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch [16:13:20] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [16:13:50] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch [16:26:37] (update) dcaro: build: enable ci builds [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/1 (https://phabricator.wikimedia.org/T398202) [16:26:52] Cloud-VPS (Project-requests): Request creation of mobileapps VPS project - https://phabricator.wikimedia.org/T398405#10968677 (Andrew) Open→Resolved a:Andrew OK -- I've created the project and added jgiannelos -- you should be able to add additional members as appropriate. There was one misf...
[16:27:09] (update) dcaro: builder: allow specifying the arch [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/64 [16:29:26] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_ceph_node [16:29:27] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.upgrade_ceph_node (exit_code=97) [16:29:32] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_ceph_node [16:31:54] (update) dcaro: builder: allow specifying the arch [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/64 [16:32:44] (update) dcaro: builder: allow specifying the arch [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/64 [16:36:18] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.upgrade_ceph_node (exit_code=0) [16:36:30] (approved) taavi: builder: allow specifying the arch [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/64 (owner: dcaro) [16:37:52] !log andrew@cloudcumin1001 dumpstorrents START - Cookbook wmcs.vps.create_project for project dumpstorrents in eqiad1 [16:37:52] andrew@cloudcumin1001: Unknown project "dumpstorrents" [16:38:24] !log andrew@cloudcumin1001 dumpstorrents END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project dumpstorrents in eqiad1 [16:38:24] andrew@cloudcumin1001: Unknown project "dumpstorrents" [16:42:35] (update) dcaro: builder: allow specifying the arch [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/64 [16:42:46] Cloud-VPS (Project-requests): Request creation of dumpstorrents VPS project - https://phabricator.wikimedia.org/T397861#10968744 (Andrew) [16:43:54] !log andrew@cloudcumin1001 dumpstorrents START - Cookbook wmcs.vps.create_project for project dumpstorrents in eqiad1 [16:43:56]
andrew@cloudcumin1001: Unknown project "dumpstorrents" [16:44:34] (update) group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49: projects: added project dumpstorrents [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/255 [16:44:38] (open) group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49: projects: added project dumpstorrents [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/255 [16:46:09] (merge) andrew: projects: added project dumpstorrents [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/255 (owner: group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49) [16:50:37] !log andrew@cloudcumin1001 dumpstorrents END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project dumpstorrents in eqiad1 [16:52:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [16:52:30] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan+apply for main branch [16:56:42] (update) dcaro: builder: allow specifying the arch [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/64 [16:57:36] (update) dcaro: builder: allow specifying the arch [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/64 [16:57:58] (merge) dcaro: builder: allow specifying the arch [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/64 [16:58:31] (update) dcaro: build: enable ci builds [repos/cloud/toolforge/misctools-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/1 (https://phabricator.wikimedia.org/T398202) [17:00:57] (open) sascha: Configure GitLab pipeline to auto-deploy
fetcher script [toolforge-repos/transmodel-ids] - https://gitlab.wikimedia.org/toolforge-repos/transmodel-ids/-/merge_requests/1 [17:01:14] (merge) sascha: Configure GitLab pipeline to auto-deploy fetcher script [toolforge-repos/transmodel-ids] - https://gitlab.wikimedia.org/toolforge-repos/transmodel-ids/-/merge_requests/1 [17:03:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [17:04:31] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan+apply for main branch [17:10:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance syslog-server-audit01 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [17:13:13] (open) sascha: Try to fix CI deployment [toolforge-repos/transmodel-ids] - https://gitlab.wikimedia.org/toolforge-repos/transmodel-ids/-/merge_requests/2 [17:13:26] (merge) sascha: Try to fix CI deployment [toolforge-repos/transmodel-ids] - https://gitlab.wikimedia.org/toolforge-repos/transmodel-ids/-/merge_requests/2 [17:15:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [17:16:32] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch [17:17:41] !log andrew@cloudcumin1001 dumpstorrents START - Cookbook wmcs.vps.create_project for project dumpstorrents in eqiad1 [17:18:50] !log andrew@cloudcumin1001 dumpstorrents END (PASS) - Cookbook wmcs.vps.create_project (exit_code=0) for project dumpstorrents in eqiad1 [17:20:28] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance syslog-server-audit01 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [17:26:28] (open) dcaro: debian-builder-bullseye: add missing packages for arm64
[repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/65 [17:29:32] Cloud-VPS (Project-requests): Request creation of dumpstorrents VPS project - https://phabricator.wikimedia.org/T397861#10969006 (Andrew) Open→Resolved a:Andrew [17:34:36] (approved) dcaro: debian-builder-bullseye: add missing packages for arm64 [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/65 [17:34:37] Wikibugs: Wikibugs not reporting Phabricator activity to #wikimedia-zuul as hoped - https://phabricator.wikimedia.org/T396387#10969066 (Aklapper) Open→Resolved Works for me [17:34:39] (merge) dcaro: debian-builder-bullseye: add missing packages for arm64 [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/65 [17:44:05] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-24 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [17:44:59] cloud-services-team, Cloud-VPS (Quota-requests), Toolforge: Increase tools-logging object storage quotas - https://phabricator.wikimedia.org/T398447#10969106 (Andrew) Open→Resolved a:Andrew [17:45:00] RESOLVED: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [17:46:30] FIRING: OpenstackAPIResponse: Openstack API average response time is too high.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [17:51:30] RESOLVED: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [17:52:30] FIRING: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [17:52:35] cloud-services-team, Data-Services, VPS-Projects: Requesting access to NFS mount /public/dumps for dumpstorrents Cloud VPS project - https://phabricator.wikimedia.org/T398477 (TheresNoTime) NEW [17:52:45] RESOLVED: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [17:54:22] FIRING: HAProxyBackendUnavailable: HAProxy service keystone-public-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [17:57:30] FIRING: OpenstackAPIResponse: Openstack API average response time is too high.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [17:59:22] RESOLVED: HAProxyBackendUnavailable: HAProxy service keystone-public-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:12:30] RESOLVED: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [18:23:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_ceph_node [18:28:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.upgrade_ceph_node (exit_code=0) [18:29:30] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_ceph_node [18:32:47] cloud-services-team, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: ceph-mon 16.2.15+ds-0+deb12u1 uses all the RAM - https://phabricator.wikimedia.org/T398389#10969284 (Andrew) Upgrading an existing mon node to 16.2.15-1~bpo11+1 on Bullseye does not produce this problem.
[18:34:51] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.upgrade_ceph_node (exit_code=0) [18:36:04] (PS1) Krinkle: IPInfo: Simplify and improve getAsnInfo implementation [labs/tools/guc] - https://gerrit.wikimedia.org/r/1165966 [18:36:04] (PS1) Krinkle: build: Update phpcs and require PHP 8.1 [labs/tools/guc] - https://gerrit.wikimedia.org/r/1165967 [18:36:52] cloud-services-team, Data-Services, VPS-Projects, Patch-For-Review: Requesting access to NFS mount /public/dumps for dumpstorrents Cloud VPS project - https://phabricator.wikimedia.org/T398477#10969308 (TheresNoTime) Open→Resolved [18:37:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_osds [18:38:52] (CR) Krinkle: [C:+2] IPInfo: Simplify and improve getAsnInfo implementation [labs/tools/guc] - https://gerrit.wikimedia.org/r/1165966 (owner: Krinkle) [18:38:58] (CR) Krinkle: [C:+2] build: Update phpcs and require PHP 8.1 [labs/tools/guc] - https://gerrit.wikimedia.org/r/1165967 (owner: Krinkle) [18:39:29] (Merged) jenkins-bot: IPInfo: Simplify and improve getAsnInfo implementation [labs/tools/guc] - https://gerrit.wikimedia.org/r/1165966 (owner: Krinkle) [18:39:40] (Merged) jenkins-bot: build: Update phpcs and require PHP 8.1 [labs/tools/guc] - https://gerrit.wikimedia.org/r/1165967 (owner: Krinkle) [18:40:47] Toolforge (Toolforge iteration 21): [components-api,beta] CI pipelines should wait until Toolforge deployment is 100% successful - https://phabricator.wikimedia.org/T398485 (Sascha) NEW [18:44:45] cloud-services-team, Toolforge: [components-api,beta] CI pipelines should wait until Toolforge deployment is 100% successful - https://phabricator.wikimedia.org/T398485#10969362 (JJMC89) [18:48:51] cloud-services-team, Toolforge: [components-api,beta] CI pipelines should wait until Toolforge deployment is 100% successful - https://phabricator.wikimedia.org/T398485#10969403 (Sascha)
[19:07:08] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.upgrade_osds (exit_code=99) [19:23:49] cloud-services-team, Toolforge: [components-api,beta] Generated configs should contain cpu values as numbers, not strings - https://phabricator.wikimedia.org/T398497#10969515 (JJMC89) [19:31:03] (CR) Bovimacoco: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1157764 (owner: Bovimacoco) [20:01:25] (PS1) Bovimacoco: fix: fixing language switching in LanguageSelector [labs/tools/mostvisitedarticle] - https://gerrit.wikimedia.org/r/1165977 (https://phabricator.wikimedia.org/T390664) [20:02:56] FIRING: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:27:56] RESOLVED: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:37:51] cloud-services-team, Toolforge: [components-api,beta] CI pipelines should wait until Toolforge deployment is 100% successful - https://phabricator.wikimedia.org/T398485#10969825 (bd808) I agree that is should be possible to block the CI result on either a deployment success or a hard failure. From a pro... [20:42:56] FIRING: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:49:36] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade to v16 - https://phabricator.wikimedia.org/T306820#10969891 (Andrew) ceph codfw1 is now running 16.2.15 on all nodes. One of the mons is on Bookworm, the o... [20:51:52] cloud-services-team, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: ceph-mon 16.2.15+ds-0+deb12u1 uses all the RAM - https://phabricator.wikimedia.org/T398389#10969897 (Andrew) I upgraded everything in-place to 16.2.15 on Bullseye, then upgraded one node to Bookworm a... [20:52:05] cloud-services-team, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: ceph-mon 16.2.15+ds-0+deb12u1 uses all the RAM - https://phabricator.wikimedia.org/T398389#10969898 (Andrew) p:Triage→Low [20:57:56] RESOLVED: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:05:56] FIRING: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:55:56] RESOLVED: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:10:34] (PS2) Bovimacoco: fix: add search on different selects (continent, country, date and platform) [labs/tools/mostvisitedarticle] - https://gerrit.wikimedia.org/r/1165977 (https://phabricator.wikimedia.org/T390440) [22:12:56] FIRING: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:22:47] (PS1) Bovimacoco: fix: add search on different selects (continent, country, date and platform) [labs/tools/mostvisitedarticle] - https://gerrit.wikimedia.org/r/1166002 (https://phabricator.wikimedia.org/T390440) [22:27:56] RESOLVED: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:35:56] FIRING: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:53:38] (PS1) Bovimacoco: fix: fixing language switching in LanguageSelector [labs/tools/mostvisitedarticle] - https://gerrit.wikimedia.org/r/1166013 (https://phabricator.wikimedia.org/T390664) [22:55:03] (Abandoned) Bovimacoco: fix: add search on different selects (continent, country, date and platform) [labs/tools/mostvisitedarticle] - https://gerrit.wikimedia.org/r/1165977 (https://phabricator.wikimedia.org/T390440) (owner: Bovimacoco) [22:55:56] RESOLVED: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [23:22:56] FIRING: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [23:27:56] RESOLVED: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [23:36:56] FIRING: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [23:56:56] RESOLVED: SystemdUnitDown: The service unit maintain-dbusers.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown