[00:15:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-bastionless.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [00:18:32] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) [00:18:35] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) [00:36:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T309789) [00:36:53] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 [00:57:38] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97) (T309789) [00:57:44] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 [00:58:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node [00:59:24] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) [01:00:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node [02:32:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [02:33:25] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [02:34:09] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [02:34:13] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [02:37:04] PROBLEM - Host cloudcephosd1017 is DOWN: PING CRITICAL - Packet loss = 100% [02:40:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - 
https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [02:40:50] RECOVERY - Host cloudcephosd1017 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [02:48:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [02:49:29] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) [02:50:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [02:52:18] PROBLEM - Host cloudcephosd1017 is DOWN: PING CRITICAL - Packet loss = 100% [02:56:24] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) [02:56:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node [02:57:20] RECOVERY - Host cloudcephosd1017 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [03:51:31] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [03:55:38] PROBLEM - Host cloudcephosd1017 is DOWN: PING CRITICAL - Packet loss = 100% [03:57:20] RECOVERY - Host cloudcephosd1017 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [03:59:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [04:00:35] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) [04:01:24] !log andrew@cloudcumin1001 admin 
START - Cookbook wmcs.ceph.osd.undrain_node [04:01:31] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) [04:04:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [04:07:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T309789) [04:07:07] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 [04:07:34] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10911783 (Andrew) [04:29:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-46 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [04:34:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-46 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [06:31:21] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T309789) [06:31:34] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 [06:57:21] (CR) Abijeet Patro: "recheck" [labs/tools/intuition-web] - https://gerrit.wikimedia.org/r/1156333 (owner: L10n-bot) [06:57:29] (Abandoned) Abijeet Patro: Localisation updates from https://translatewiki.net. 
[labs/tools/intuition-web] - https://gerrit.wikimedia.org/r/1154813 (owner: L10n-bot) [06:59:18] FIRING: KernelErrors: Server cloudcephosd1017 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1017 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [06:59:22] cloud-services-team: KernelErrors Server cloudcephosd1017 logged kernel errors - https://phabricator.wikimedia.org/T396832 (phaultfinder) NEW [08:23:36] Data-Services, Data-Engineering: Create a view for existencelinks table - https://phabricator.wikimedia.org/T394898#10912120 (Tacsipacsi) Half of this was done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1156466, now only the table itself needs to be replicated. [08:27:20] Data-Services, Data-Engineering: Create a view for existencelinks table - https://phabricator.wikimedia.org/T394898#10912130 (taavi) Are there plans for MediaWiki to expose this somewhere? Traditionally the Wiki Replicas has only exposed data that is already publicly available somewhere else. [08:30:49] Data-Services, Data-Engineering: Create a view for existencelinks table - https://phabricator.wikimedia.org/T394898#10912137 (Tacsipacsi) {T395366} is about exposing it somewhere. (Disclaimer: I’m the author of that task, so I’m probably a bit biased. 
😛) [08:56:24] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence, Data-Platform-SRE (2025.06.13 - 2025.07.04): Create wiki replicas views for globaljsonlinks tables - https://phabricator.wikimedia.org/T387419#10912394 (Gehel) [08:56:42] Cloud-VPS (Quota-requests), affects-Kiwix-and-openZIM: Increase RAM quota of mwoffliner project - https://phabricator.wikimedia.org/T396840 (Benoit74) NEW [09:12:40] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) [09:31:23] cloud-services-team, Toolforge, Patch-For-Review: [o11y,logging,infra] Deploy Loki to store Toolforge log data - https://phabricator.wikimedia.org/T386480#10912538 (taavi) p:High→Medium [09:31:30] cloud-services-team, Toolforge: Provision object storage volumes for Loki - https://phabricator.wikimedia.org/T396574#10912541 (taavi) p:High→Medium [10:01:53] (PS1) Alexandros Kosiaris: registry: Add hiera for the new hierarchy [labs/private] - https://gerrit.wikimedia.org/r/1156761 (https://phabricator.wikimedia.org/T390251) [10:01:54] (PS1) Alexandros Kosiaris: Remove old docker_registry_ha hiera keys [labs/private] - https://gerrit.wikimedia.org/r/1156762 (https://phabricator.wikimedia.org/T390251) [10:07:29] (CR) Alexandros Kosiaris: [V:+2 C:+2] registry: Add hiera for the new hierarchy [labs/private] - https://gerrit.wikimedia.org/r/1156761 (https://phabricator.wikimedia.org/T390251) (owner: Alexandros Kosiaris) [10:12:21] Toolforge (Toolforge iteration 21): [jobs-emailer] stops processing k8s events - https://phabricator.wikimedia.org/T396850 (dcaro) NEW [10:19:02] cloud-services-team (FY2024/2025-Q3-Q4), Quarry: quarry is leaking tmp files - https://phabricator.wikimedia.org/T395237#10912687 (github-toolforge-bot) supertassu closed https://github.com/toolforge/quarry/pull/85 [10:19:20] supertassu closed https://github.com/toolforge/quarry/pull/85 [10:21:40] Toolforge 
(Toolforge iteration 21): [jobs-emailer] stops processing k8s events - https://phabricator.wikimedia.org/T396850#10912694 (dcaro) p:Triage→High [10:22:34] cloud-services-team (FY2024/2025-Q3-Q4), Quarry: Fix Quarry's Redis pod exiting causing frequent outages - https://phabricator.wikimedia.org/T396785#10912698 (github-toolforge-bot) supertassu closed https://github.com/toolforge/quarry/pull/84 [10:23:29] supertassu closed https://github.com/toolforge/quarry/pull/84 [10:27:40] cloud-services-team (FY2024/2025-Q3-Q4), Quarry: quarry is leaking tmp files - https://phabricator.wikimedia.org/T395237#10912734 (taavi) Open→Resolved [10:38:05] Quarry: Add line numbers in SQL input textarea - https://phabricator.wikimedia.org/T315066#10912764 (taavi) In progress→Resolved a:SD0001 [10:38:13] supertassu closed https://github.com/toolforge/quarry/pull/54 [10:44:08] supertassu closed https://github.com/toolforge/quarry/pull/72 [10:45:00] Quarry, Patch-For-Review: Add statistics to the homepage - https://phabricator.wikimedia.org/T204157#10912785 (taavi) Open→Resolved a:Framawiki [10:51:38] (Abandoned) Majavah: app.py: EXPLAIN needs to be executed on the good server [analytics/quarry/web] - https://gerrit.wikimedia.org/r/462286 (https://phabricator.wikimedia.org/T205214) (owner: Framawiki) [10:51:43] (Abandoned) Majavah: models/user: rewrite UserGroup to use int values [analytics/quarry/web] - https://gerrit.wikimedia.org/r/512546 (owner: Framawiki) [10:51:47] (Abandoned) Majavah: app: Add rate limiting on queries execution [analytics/quarry/web] - https://gerrit.wikimedia.org/r/517177 (https://phabricator.wikimedia.org/T225869) (owner: Framawiki) [10:51:51] (Abandoned) Majavah: Create simple CLI management tool [analytics/quarry/web] - https://gerrit.wikimedia.org/r/512614 (https://phabricator.wikimedia.org/T224376) (owner: Framawiki) [10:51:57] (Abandoned) Majavah: minikube helm chart 
[analytics/quarry/web] - https://gerrit.wikimedia.org/r/761631 (https://phabricator.wikimedia.org/T301469) (owner: Michael DiPietro) [10:52:00] (Abandoned) Majavah: view.html: Change toggle highlighting button from btn-sm to btn-xs [analytics/quarry/web] - https://gerrit.wikimedia.org/r/561339 (https://phabricator.wikimedia.org/T317222) (owner: Zhuyifei1999) [10:52:38] (open) nokibsarkar: Added campwiz [toolforge-repos/wmde-github-phab-bot] - https://gitlab.wikimedia.org/toolforge-repos/wmde-github-phab-bot/-/merge_requests/1 [10:54:16] Quarry: Add rate limiting on queries execution - https://phabricator.wikimedia.org/T225869#10912812 (taavi) [10:56:02] (open) taavi: gerrit-channels: Remove Quarry [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/57 [10:56:09] (update) taavi: gerrit-channels: Remove Quarry [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/57 [10:59:33] FIRING: KernelErrors: Server cloudcephosd1017 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1017 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [11:02:10] Quarry: Quarry down - web service unreachable - https://phabricator.wikimedia.org/T395201#10912816 (taavi) [11:02:11] cloud-services-team (FY2024/2025-Q3-Q4), Quarry: Fix Quarry's Redis pod exiting causing frequent outages - https://phabricator.wikimedia.org/T396785#10912817 (taavi) [11:05:21] cloud-services-team, Toolforge, Infrastructure-Foundations, netops: [infra] Reports of slow connectivity from APAC - https://phabricator.wikimedia.org/T395135#10912818 (Nokib_Sarkar) ` $ curl https://upload.wikimedia.org/wikipedia/commons/e/eb/SMS_Arcona_NH_65764_-_Restoration.jpg --output test.... 
[11:05:22] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#10912819 (fnegri) `pgdata` size is constant, but `wal_archive` is increasing quite rapidly: ` ubuntu@dbapp:~$ sudo du -hs /var/lib/postgresql... [11:05:38] Quarry, Patch-For-Review: Change toggle highlighting button from btn-sm to btn-xs - https://phabricator.wikimedia.org/T317222#10912821 (taavi) Open→Declined If someone feels strongly about this and wants to submit a patch, that's fine, but otherwise I don't see much point for this. [11:09:27] Quarry: Possible wikidata merges - https://phabricator.wikimedia.org/T63881#10912832 (taavi) Open→Invalid I don't see anything to do for #Quarry here, and there are no other project tags, so closing. [11:11:36] Quarry: On first visit to Quarry in that browser session, error 500 (intermittent) - https://phabricator.wikimedia.org/T345685#10912835 (taavi) Is this still happening? [11:11:43] Quarry: [bug] "Internal Server Error" when logging into Quarry - https://phabricator.wikimedia.org/T333043#10912837 (taavi) Is this still happening? 
[11:13:20] Quarry: Remove quarry.wsgi on move to k8s - https://phabricator.wikimedia.org/T349605#10912838 (github-toolforge-bot) supertassu opened https://github.com/toolforge/quarry/pull/86 [11:14:15] supertassu opened https://github.com/toolforge/quarry/pull/86 [11:22:58] supertassu opened https://github.com/toolforge/quarry/pull/87 [11:24:06] supertassu closed https://github.com/toolforge/quarry/pull/87 [11:46:18] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T309789) [11:46:24] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 [11:51:50] PROBLEM - Host cloudcephosd1018 is DOWN: PING CRITICAL - Packet loss = 100% [11:55:18] RECOVERY - Host cloudcephosd1018 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [11:59:18] FIRING: KernelErrors: Server cloudcephosd1018 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1018 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [11:59:22] cloud-services-team: KernelErrors Server cloudcephosd1018 logged kernel errors - https://phabricator.wikimedia.org/T396859 (phaultfinder) NEW [12:29:39] cloud-services-team, Striker: Rotate StrikerBot GitLab PAT before it expires on 2025-07-29 - https://phabricator.wikimedia.org/T395694#10912998 (taavi) Open→Resolved This is complete. 
[12:32:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-41 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [12:33:56] cloud-services-team, Striker: Update StrikerBot Developer, SUL, and related accounts to email folks besides just bd808 - https://phabricator.wikimedia.org/T395697#10913007 (taavi) @bd808: Apparently I can't log in to the StrikerBot SUL account because the #EmailAuth token is being sent to your inbox. Cou... [12:35:25] cloud-services-team, Striker: Update StrikerBot Developer, SUL, and related accounts to email folks besides just bd808 - https://phabricator.wikimedia.org/T395697#10913009 (taavi) The developer account and GitLab account have been updated. [12:36:20] cloud-services-team, Striker: Update StrikerBot Developer, SUL, and related accounts to email folks besides just bd808 - https://phabricator.wikimedia.org/T395697#10913012 (taavi) I did not find any UI for updating the email set for the @StrikerBot account. 
[12:47:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-41 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [13:31:28] (update) taavi: toolforge: create an 'admin' tool account, with a fake human user [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/243 (https://phabricator.wikimedia.org/T394786) (owner: aborrero) [13:46:02] Data-Services, Data-Engineering: Create a view for existencelinks table - https://phabricator.wikimedia.org/T394898#10913180 (Bugreporter) >>! In T394898#10912130, @taavi wrote: > Are there plans for MediaWiki to expose this somewhere? Traditionally the Wiki Replicas has only exposed data that is already... 
[13:51:33] (merge) jforrester: gerrit-channels: Remove Quarry [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/57 (owner: taavi) [13:53:27] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [13:56:57] PROBLEM - Host cloudcephosd1018 is DOWN: PING CRITICAL - Packet loss = 100% [13:58:25] RECOVERY - Host cloudcephosd1018 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [14:00:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [14:03:05] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=97) [14:04:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [14:04:04] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [14:04:30] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [14:06:13] (open) taavi: shared: Provision storage buckets for Loki [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/49 (https://phabricator.wikimedia.org/T396574) [14:06:16] (update) taavi: shared: Provision storage buckets for Loki [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/49 (https://phabricator.wikimedia.org/T396574) [14:07:30] (update) taavi: shared: Provision storage buckets for Loki [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/49 (https://phabricator.wikimedia.org/T396574) 
[14:10:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [14:10:58] (update) taavi: shared: Provision storage buckets for Loki [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/49 (https://phabricator.wikimedia.org/T396574) [14:12:58] (update) taavi: shared: Provision storage buckets for Loki [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/49 (https://phabricator.wikimedia.org/T396574) [14:15:15] (update) taavi: shared: Provision storage buckets for Loki [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/49 (https://phabricator.wikimedia.org/T396574) [14:30:10] (update) taavi: shared: Provision storage buckets for Loki [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/49 (https://phabricator.wikimedia.org/T396574) [14:31:15] supertassu closed https://github.com/toolforge/quarry/pull/86 [14:32:56] cloud-services-team, Toolforge, Pywikibot: pywikibot: unable to run the nightly batch because of missing venv - https://phabricator.wikimedia.org/T396873 (Xqt) NEW [14:35:41] cloud-services-team, Toolforge, Pywikibot: pywikibot: unable to run the nightly batch because of missing venv - https://phabricator.wikimedia.org/T396873#10913370 (taavi) Open→Invalid Such scripts must be run via the https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework. 
[14:40:33] Quarry: Remove quarry.wsgi on move to k8s - https://phabricator.wikimedia.org/T349605#10913375 (taavi) Open→Resolved a:taavi [14:43:20] cloud-services-team, Toolforge, Pywikibot: pywikibot: unable to run the nightly batch because of missing venv - https://phabricator.wikimedia.org/T396873#10913393 (Xqt) >>! In T396873#10913370, @taavi wrote: > Such scripts must be run via the https://wikitech.wikimedia.org/wiki/Help:Toolforge/Job... [14:44:32] Quarry: Emtpy Quarry query names are not linked/should not be allowed - https://phabricator.wikimedia.org/T375292#10913395 (taavi) →Duplicate dup:T197029 [14:44:37] Quarry, good first task: Define in a single place the pseudoname of unnamed queries - https://phabricator.wikimedia.org/T197029#10913397 (taavi) [14:45:07] Quarry: Define in a single place the pseudoname of unnamed queries - https://phabricator.wikimedia.org/T197029#10913402 (taavi) [14:46:57] Quarry: "Running" query being displayed as unsubmitted - https://phabricator.wikimedia.org/T71176#10913410 (taavi) Open→Invalid Closing unless someone can identify a specific recent query where this is happening. [14:48:51] Quarry: "KeyError: 'request_token'" in /oauth-callback on local instance - https://phabricator.wikimedia.org/T211917#10913416 (taavi) Open→Resolved Please re-open if this is still happening. [14:49:03] Quarry, Documentation: admin docs: quarry - https://phabricator.wikimedia.org/T206710#10913419 (taavi) →Duplicate dup:T392181 [14:49:06] cloud-services-team, Quarry, Documentation: [[wikitech:Portal:Data Services/Admin/Quarry]] update quarry docs to reflect the current setup - https://phabricator.wikimedia.org/T392181#10913421 (taavi) [14:53:25] cloud-services-team, Striker: Update StrikerBot Developer, SUL, and related accounts to email folks besides just bd808 - https://phabricator.wikimedia.org/T395697#10913463 (bd808) >>! 
In T395697#10913007, @taavi wrote: > @bd808: Apparently I can't log in to the StrikerBot SUL account because the #EmailAu... [14:53:30] Quarry: Create 'reports' feature - https://phabricator.wikimedia.org/T78593#10913466 (taavi) Open→Invalid This is lacking any sort of detail on what's wanted here, so closing. [14:54:05] Quarry: Database dump for analysis - https://phabricator.wikimedia.org/T93907#10913475 (taavi) →Duplicate dup:T367415 [14:54:11] cloud-services-team (FY2024/2025-Q1-Q2), Quarry: Allow Quarry to query its own database - https://phabricator.wikimedia.org/T367415#10913477 (taavi) [14:55:30] Quarry: UI: Allow downloading output via CLI - https://phabricator.wikimedia.org/T325683#10913481 (taavi) Open→Invalid Per above. [14:56:49] Quarry: Check browser user agent and provide line endings \n for Linux based browsers - https://phabricator.wikimedia.org/T327682#10913486 (taavi) Open→Declined I've no plans for implementing features based on U-A sniffing. [14:58:08] cloud-services-team, Striker: Update StrikerBot Developer, SUL, and related accounts to email folks besides just bd808 - https://phabricator.wikimedia.org/T395697#10913488 (bd808) >>! In T395697#10913463, @bd808 wrote: > I forwarded you the code before seeing this. Yes, I will login and change the accoun... [14:59:06] Quarry: Quarry won't load results for large queries - https://phabricator.wikimedia.org/T341722#10913492 (taavi) →Duplicate dup:T71076 [14:59:07] Quarry: Only load 'head' of result set - https://phabricator.wikimedia.org/T71076#10913494 (taavi) [15:00:45] Quarry: Add page to discover user profiles and their queries - https://phabricator.wikimedia.org/T287462#10913498 (taavi) Open→Declined I think the recent queries feature is good enough for this. 
[15:21:22] cloud-services-team, Striker: Update StrikerBot Developer, SUL, and related accounts to email folks besides just bd808 - https://phabricator.wikimedia.org/T395697#10913591 (taavi) [15:22:10] cloud-services-team, Striker: Update StrikerBot Developer, SUL, and related accounts to email folks besides just bd808 - https://phabricator.wikimedia.org/T395697#10913594 (taavi) Open→Stalled [15:22:41] FIRING: [2x] ProbeDown: Service toolsbeta-test-k8s-haproxy-5:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:22:42] cloud-services-team, Cloud-VPS, IPv6, Patch-For-Review: Enable IPv6 for the Cloud VPS web proxy - https://phabricator.wikimedia.org/T379175#10913597 (taavi) Open→Resolved [15:31:24] cloud-services-team, Toolforge, Cloud-Services-Origin-User, Cloud-Services-Worktype-Maintenance: [webservice shell] Allow a user to delete/stop all running shell pods - https://phabricator.wikimedia.org/T349733#10913642 (taavi) →Duplicate dup:T315735 [15:31:27] cloud-services-team, Toolforge, Kubernetes: Shell pods continue running after ssh session exits - https://phabricator.wikimedia.org/T315735#10913644 (taavi) [15:46:57] cloud-services-team, Cloud-VPS (Project-requests), Continuous-Integration-Infrastructure (Zuul upgrade): Request creation of zuul VPS project - https://phabricator.wikimedia.org/T396540#10913697 (hashar) As a follow up, I have made a proof of concept for Zuul and object storage (in the zuul3 tena... 
[15:50:14] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789) [15:50:21] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 [16:15:00] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) [16:29:09] Tools: Setup Cebuano UI langauge wiki in Toolforge for gadget testing - https://phabricator.wikimedia.org/T396888 (bd808) NEW [16:40:27] Tools: Setup Cebuano UI langauge wiki in Toolforge for gadget testing - https://phabricator.wikimedia.org/T396888#10913963 (bd808) p:Triage→Medium a:Nhu_Gay_Me @Nhu_Gay_Me are you a member of #toolforge yet? That seems like the first thing to start with if not. https://wikitech.wikimedia.org/wiki... [16:41:22] Tools: Setup Cebuano UI langauge wiki in Toolforge for gadget testing - https://phabricator.wikimedia.org/T396888#10913972 (bd808) [16:52:37] cloud-services-team, Cloud-VPS (Project-requests), Continuous-Integration-Infrastructure (Zuul upgrade): Request creation of zuul VPS project - https://phabricator.wikimedia.org/T396540#10913992 (Dzahn) Hah! That's kind of funny. Yea, good that it's just zuul then. [17:09:38] cloud-services-team, Cloud-VPS (Project-requests), Continuous-Integration-Infrastructure (Zuul upgrade): Request creation of zuul VPS project - https://phabricator.wikimedia.org/T396540#10914017 (bd808) >>! In T396540#10913697, @hashar wrote: > As a follow up, I have made a proof of concept for Z... 
[17:10:49] cloud-services-team, Data-Services, Wikifunctions, Abstract Wikipedia team (25Q4 (Apr–Jun)), Essential-Work: Make wikifunctionsclient_usage table available on cloud wiki replicas - https://phabricator.wikimedia.org/T392475#10914019 (DSantamaria) Open→In progress [17:16:13] cloud-services-team, Striker: Update StrikerBot Developer, SUL, and related accounts to email folks besides just bd808 - https://phabricator.wikimedia.org/T395697#10914027 (bd808) [17:28:38] Quarry: [bug] "Internal Server Error" when logging into Quarry - https://phabricator.wikimedia.org/T333043#10914049 (Novem_Linguae) Unable to reproduce just now. This does appear to be an intermittent bug though so take with a grain of salt. [17:28:46] Quarry: On first visit to Quarry in that browser session, error 500 (intermittent) - https://phabricator.wikimedia.org/T345685#10914050 (Novem_Linguae) Unable to reproduce just now. This does appear to be an intermittent bug though so take with a grain of salt. [18:13:51] Quarry: Quarry doesn't show any query results - https://phabricator.wikimedia.org/T396893 (Bdijkstra) NEW [18:28:44] Quarry: Quarry doesn't show any query results - https://phabricator.wikimedia.org/T396893#10914252 (taavi) Open→Resolved a:taavi Thanks for the report! It appears that a Google Cloud customer is running a very inconsiderate crawler against all of Quarry, and the general Cloud VPS rate limits... 
[18:42:55] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [18:46:57] PROBLEM - Host cloudcephosd1018 is DOWN: PING CRITICAL - Packet loss = 100% [18:47:25] RECOVERY - Host cloudcephosd1018 is UP: PING OK - Packet loss = 0%, RTA = 4.75 ms [18:50:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [18:51:26] andrew@cloudcumin1001 bootstrap_and_add (PID 2878771) is awaiting input [18:53:56] FIRING: SystemdUnitDown: The service unit ceph-osd@113.service is in failed status on host cloudcephosd1018. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1018 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:06:02] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) [19:06:44] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [19:06:47] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) [19:07:42] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [19:07:45] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) [19:08:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [19:08:19] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) [19:08:56] RESOLVED: SystemdUnitDown: The service unit ceph-osd@113.service is in failed status on host cloudcephosd1018. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1018 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:09:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [19:09:33] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [19:10:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [19:10:17] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) [19:10:41] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [19:10:46] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [19:11:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [19:12:01] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=97) [19:12:13] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node [19:12:16] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=97) [19:12:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [19:15:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [19:44:41] cloud-services-team, Cloud-VPS (Project-requests), Continuous-Integration-Infrastructure (Zuul upgrade): Request creation of zuul VPS project - https://phabricator.wikimedia.org/T396540#10914522 (Andrew) >>! In T396540#10914017, @bd808 wrote: >>>! 
In T396540#10913697, @hashar wrote: >> As a follo... [20:05:59] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) [20:06:21] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [20:06:21] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [20:06:27] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [20:06:28] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [20:08:06] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [20:13:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T309789) [20:13:22] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 [20:20:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [20:20:55] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [20:22:05] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [20:22:05] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [20:22:10] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [20:23:26] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [20:26:06] cloud-services-team, Toolforge, Pywikibot: pywikibot: unable to run the nightly batch because of missing venv - https://phabricator.wikimedia.org/T396873#10914596 (bd808) >>! In T396873#10913393, @Xqt wrote: > Thank you. I rember this worked prevously but that might be ~ 18 months ago. It would... 
[20:38:22] Tool-sitesampler: Simplify `modify_html` by using a `` tag - https://phabricator.wikimedia.org/T385247#10914605 (bd808) In progress→Declined [21:01:47] Quarry: [bug] Query results do not appear due to JS error - https://phabricator.wikimedia.org/T396904 (Catrope) NEW [21:11:37] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [21:15:05] PROBLEM - Host cloudcephosd1019 is DOWN: PING CRITICAL - Packet loss = 100% [21:16:33] RECOVERY - Host cloudcephosd1019 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [21:19:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [21:19:41] andrew@cloudcumin1001 bootstrap_and_add (PID 2894641) is awaiting input [21:44:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) [21:46:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node [21:46:15] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) [21:49:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [22:27:22] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10914793 (Andrew) [22:44:18] FIRING: KernelErrors: Server cloudcephosd1019 logged kernel errors 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1019 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [22:44:24] cloud-services-team: KernelErrors Server cloudcephosd1019 logged kernel errors - https://phabricator.wikimedia.org/T396909 (phaultfinder) NEW [23:36:13] Quarry: [bug] Another problem with Quarry - https://phabricator.wikimedia.org/T396910 (Liz) NEW [23:38:40] Quarry: [bug] Another problem with Quarry - https://phabricator.wikimedia.org/T396910#10914844 (Liz) [23:45:33] Quarry: [bug] Another problem with Quarry - https://phabricator.wikimedia.org/T396910#10914847 (JJMC89) →Duplicate dup:T396904 [23:45:35] Quarry: [bug] Query results do not appear due to JS error - https://phabricator.wikimedia.org/T396904#10914849 (JJMC89)