[02:01:43] (update) raymond-ndibe: get_job_from_k8s: remove correctly the default filelog [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/212 (owner: dcaro)
[02:08:02] (update) raymond-ndibe: [diff_with_running_job]: do not use exclude_unset while comparing jobs dump [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/212 (owner: dcaro)
[02:08:18] (approved) raymond-ndibe: [diff_with_running_job]: do not use exclude_unset while comparing jobs dump [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/212 (owner: dcaro)
[03:12:36] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T403796 (Awkwafaba) NEW
[07:58:49] Tool-global-search, Discovery-Search (2025.08.15 - 2025.09.05), Essential-Work: Global Search displays most search results twice - https://phabricator.wikimedia.org/T391175#11151074 (Gehel)
[08:26:05] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T403796#11151197 (Novem_Linguae) Open→Resolved a:Novem_Linguae I restarted it. Appears to be fixed. Thanks for reporting.
[08:37:57] cloud-services-team, Toolforge: [jobs-api] Allow configuring health check timeout - https://phabricator.wikimedia.org/T403733#11151217 (fnegri) p:Triage→Medium
[08:38:09] cloud-services-team, Toolforge: [jobs-api] inconsistent command modification for continuous job - https://phabricator.wikimedia.org/T403735#11151218 (fnegri) p:Triage→Medium
[08:38:20] cloud-services-team, Toolforge: [jobs-cli] jobs are being updated (deleted/created) when no changes present - https://phabricator.wikimedia.org/T403760#11151232 (fnegri) p:Triage→Medium
[08:39:02] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [cookbook,ceph] bootstrap_and_add ceph cookbook failed to add a new single osd 66 on host cloudcephosd1004 - https://phabricator.wikimedia.org/T402516#11151238 (fnegri) p:Triage→Medium
[08:39:21] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [cookbook,ceph] depool_and_destroy ceph cookbook failed to destroy a single osd - https://phabricator.wikimedia.org/T402515#11151243 (fnegri) p:Triage→Medium
[08:40:25] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24): [lima-kilo] toolforge components are reported as changed on every ansible run - https://phabricator.wikimedia.org/T402689#11151244 (fnegri)
[08:40:28] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24): [lima-kilo] Only download artefacts if target binary checksum does not match - https://phabricator.wikimedia.org/T402684#11151246 (fnegri)
[08:40:36] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24): [k8s,infra] Upgrade tools to Uwubernetes 1.30 - https://phabricator.wikimedia.org/T402378#11151248 (fnegri)
[08:40:46] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24): [components-api,beta] Config not updated from remote source - https://phabricator.wikimedia.org/T401868#11151250 (fnegri)
[08:41:02] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24), Patch-For-Review: [jobs-api] make job status an enum, with clearly defined states - https://phabricator.wikimedia.org/T401172#11151252 (fnegri)
[08:41:16] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Patch-For-Review: [tofu-cloudvps] cloudvps_puppet_prefix.hiera settings show dirty diffs based on YAML canonicalization - https://phabricator.wikimedia.org/T398643#11151254 (fnegri)
[08:41:29] cloud-services-team (FY2025/26-Q1), Toolforge, Patch-For-Review: [cicd] create cicd flow for non repo owners - https://phabricator.wikimedia.org/T394595#11151256 (fnegri)
[08:41:41] cloud-services-team (FY2025/26-Q1), Toolforge, Epic: [cicd] Streamline toolforge cli deployment and external contributor ci flows - https://phabricator.wikimedia.org/T392524#11151258 (fnegri)
[08:41:50] Cloud Services Proposals, cloud-services-team (FY2025/26-Q1), Data-Services, Data-Persistence, Data-Platform-SRE (2025.08.16 - 2025.09.05): Decision request - Who runs wikireplicas cookbooks - https://phabricator.wikimedia.org/T382607#11151260 (fnegri)
[08:41:58] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Documentation: [tofu-cloudvps] Document using `cloudvps_puppet_project` to manage project-wide and instance specific puppet classes and hiera settings - https://phabricator.wikimedia.org/T397994#11151262 (fnegri)
[08:42:02] cloud-services-team (FY2025/26-Q1), Cloud-VPS, DC-Ops, SRE: cloudcephosd10[48-52] service implementation - https://phabricator.wikimedia.org/T395910#11151264 (fnegri)
[08:42:05] cloud-services-team (FY2025/26-Q1), GitLab (CI & Job Runners), Release-Engineering-Team (Priority Backlog 📥): Recent incidents of buildkitd's storage volume filling up - https://phabricator.wikimedia.org/T395097#11151266 (fnegri)
[08:42:11] cloud-services-team (FY2025/26-Q1), Toolforge: [builds-cli] No obvious way to delete individual `toolforge build` generated artifacts other than `toolforge clean` - https://phabricator.wikimedia.org/T368317#11151274 (fnegri)
[08:42:14] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24), Patch-For-Review: [jobs-api] when running a command with wrong quoting, no logs nor useful feedback is given to the user - https://phabricator.wikimedia.org/T356267#11151270 (fnegri)
[08:42:16] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24), Epic: [jobs-api] expose jobs-api continuous jobs to the internet via `toolname.toolforge.org`, just like webservice - https://phabricator.wikimedia.org/T388092#11151276 (fnegri)
[08:42:17] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24), Patch-For-Review: [jobs-api] Split the `*Job` API models into three - https://phabricator.wikimedia.org/T390136#11151272 (fnegri)
[08:42:26] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24), Patch-For-Review: [harbor,infra] Find a way to manage toolforge project policies with code - https://phabricator.wikimedia.org/T360509#11151278 (fnegri)
[08:42:27] cloud-services-team (FY2025/26-Q1), serviceops, SRE: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#11151284 (fnegri)
[08:42:29] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24), Patch-For-Review: [jobs-api,infra] upgrade all the existing toolforge jobs to the latest job version - https://phabricator.wikimedia.org/T359649#11151280 (fnegri)
[08:42:33] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24), Patch-For-Review: [k8s,infra] Upgrade Toolforge to Uwubernetes (1.30) - https://phabricator.wikimedia.org/T362869#11151282 (fnegri)
[08:42:41] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24): [builds-builder] Upgrade python buildpack to v0.17.0 or newer for Poetry support - https://phabricator.wikimedia.org/T374056#11151288 (fnegri)
[08:42:45] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24), Patch-For-Review: [builds-builder] Add support for Heroku's "24" builder stack based on Ubuntu 2024.04 noble - https://phabricator.wikimedia.org/T380127#11151286 (fnegri)
[08:42:49] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24), Patch-For-Review: Toolforge: Replace all bastion with grid-less bookworm based bastion hosts - https://phabricator.wikimedia.org/T314665#11151290 (fnegri)
[08:45:54] cloud-services-team, Cloud-VPS, Ceph: Investigate big spikes up in wmcs ceph dashboards - https://phabricator.wikimedia.org/T403390#11151314 (fnegri)
[08:46:07] cloud-services-team, Cloud-VPS, Upstream: Horizon: Selected server groups do not get cleared after deleting them - https://phabricator.wikimedia.org/T403026#11151315 (fnegri)
[08:47:00] cloud-services-team, Horizon, Upstream: Horizon: Selected server groups do not get cleared after deleting them - https://phabricator.wikimedia.org/T403026#11151319 (fnegri)
[08:59:15] cloud-services-team, PAWS: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T401076#11151350 (fnegri) a:fnegri
[10:14:58] Tools, Project-Admins: Request to create project: Wikidata Reference Validator - https://phabricator.wikimedia.org/T403556#11151436 (JosefAnthony)
[10:15:36] Tools, Project-Admins: Request to create project: Wikidata Reference Validator - https://phabricator.wikimedia.org/T403556#11151437 (JosefAnthony) >>! In T403556#11147007, @Bugreporter wrote: > You will be able to create a project yourself once you have a Phabricator tool account (you can create one if y...
[10:19:13] Tools, Project-Admins: Request to create project: Wikidata Reference Validator - https://phabricator.wikimedia.org/T403556#11151440 (fnegri) Stalled→Declined
[11:02:40] Cloud-VPS (Project-requests): Request creation of VPS project - https://phabricator.wikimedia.org/T401619#11151562 (fnegri) Open→Declined @AlvinDulle Declining this request for now, if Toolforge does not work out for you, please reopen!
[12:21:31] Toolforge-standards-committee: Adoption request for fireflytools - https://phabricator.wikimedia.org/T403814 (Tenshi_Hinanawi) NEW
[12:56:38] cloud-services-team, Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T403348#11151828 (fnegri) a:fnegri
[13:20:19] (CR) Abijeet Patro: "recheck" [labs/tools/wikiinfo] - https://gerrit.wikimedia.org/r/1184763 (owner: L10n-bot)
[13:25:00] PROBLEM - Host cloudcephosd1052 is DOWN: PING CRITICAL - Packet loss = 100%
[13:26:41] (merge) raymond-ndibe: [diff_with_running_job]: do not use exclude_unset while comparing jobs dump [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/212 (owner: dcaro)
[13:26:47] FIRING: NodeDown: Node cloudcephosd1052 has been down for long. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephosd1052 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[13:26:53] cloud-services-team: NodeDown Node cloudcephosd1052 has been down for long. - https://phabricator.wikimedia.org/T403821 (phaultfinder) NEW
[13:29:53] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: jobs-api: bump to 0.0.413-20250905132653-0a493ae5 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/957 (https://phabricator.wikimedia.org/T403760)
[13:36:10] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[13:47:15] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[13:48:32] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[14:00:36] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[14:01:32] (approved) raymond-ndibe: jobs-api: bump to 0.0.413-20250905132653-0a493ae5 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/957 (https://phabricator.wikimedia.org/T403760) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[14:01:35] (merge) raymond-ndibe: jobs-api: bump to 0.0.413-20250905132653-0a493ae5 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/957 (https://phabricator.wikimedia.org/T403760) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[14:05:34] Cloud-VPS (Quota-requests), Content-Transform-Team (Work In Progress): Quote increase request for wikitextexp - https://phabricator.wikimedia.org/T403114#11152075 (fnegri) In progress→Resolved
[14:23:27] cloud-services-team, Data-Services, Data-Engineering, Data-Engineering-Radar, and 2 others: Create wiki replicas views for globaljsonlinks tables - https://phabricator.wikimedia.org/T387419#11152179 (Gehel)
[14:23:35] Cloud Services Proposals, cloud-services-team (FY2025/26-Q1), Data-Services, Data-Persistence, Data-Platform-SRE (2025.09.05 - 2025.09.26): Decision request - Who runs wikireplicas cookbooks - https://phabricator.wikimedia.org/T382607#11152185 (Gehel)
[14:58:47] cloud-services-team, Data-Services, DBA, DiscussionTools, and 5 others: Deleted data available in DiscussionTools tables - https://phabricator.wikimedia.org/T400420#11152450 (Gehel)
[15:08:44] PROBLEM - Host cloudcephosd1052 is DOWN: PING CRITICAL - Packet loss = 100%
[15:09:12] RECOVERY - Host cloudcephosd1052 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[15:10:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T401693)
[15:10:31] T401693: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693
[15:13:28] PROBLEM - Host cloudcephosd1049 is DOWN: PING CRITICAL - Packet loss = 100%
[15:13:31] (PS1) Andrew Bogott: ceph osds: increase drain/fill timeout from 5 hours to 8 [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1185114
[15:14:56] RECOVERY - Host cloudcephosd1049 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[15:15:12] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T401693)
[15:17:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[15:18:40] PROBLEM - Host cloudcephosd1052 is DOWN: PING CRITICAL - Packet loss = 100%
[15:19:27] (CR) Andrew Bogott: [C:+2] ceph osds: increase drain/fill timeout from 5 hours to 8 [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1185114 (owner: Andrew Bogott)
[15:23:53] (Merged) jenkins-bot: ceph osds: increase drain/fill timeout from 5 hours to 8 [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1185114 (owner: Andrew Bogott)
[15:30:29] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T401693)
[15:30:37] T401693: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693
[15:33:12] RECOVERY - Host cloudcephosd1052 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[15:33:28] PROBLEM - Host cloudcephosd1049 is DOWN: PING CRITICAL - Packet loss = 100%
[15:34:16] RECOVERY - Host cloudcephosd1049 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[15:37:10] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T401693)
[15:37:18] T401693: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693
[15:37:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T401693)
[15:37:51] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) (T401693)
[15:37:57] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T401693)
[15:37:59] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) (T401693)
[15:38:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[15:38:38] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[15:41:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node
[15:41:38] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97)
[15:41:41] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node
[15:42:47] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0)
[15:43:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node
[15:44:21] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0)
[15:46:17] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[15:46:37] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=97)
[15:46:45] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node
[15:46:50] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97)
[16:07:50] (CR) Agamyasamuel: "recheck" [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1009402 (owner: AgnesAbah)
[16:17:55] cloud-services-team (FY2025/26-Q1), Data-Services: [wikireplicas] slow query runs every hour, but never completes - https://phabricator.wikimedia.org/T403639#11152868 (fnegri) In progress→Resolved I double checked and there were no more queries getting killed because they reached 3 hours, whi...
[16:25:33] cloud-services-team (FY2025/26-Q1), Data-Services: [wikireplicas] slow query runs every hour, but never completes - https://phabricator.wikimedia.org/T403639#11152900 (dschwen) Ok, the total runtime right now is ` START Fri Sep 5 02:17:01 UTC 2025 SUCCESS Fri Sep 5 06:06:43 UTC 2025 ` ~230min and...
[16:27:08] cloud-services-team (FY2025/26-Q1), Data-Services: [wikireplicas] slow query runs every hour, but never completes - https://phabricator.wikimedia.org/T403639#11152907 (fnegri) It's hard to predict if larger batch windows would be faster overall, feel free to experiment if you have time. Otherwise I t...
[16:47:18] FIRING: KernelErrors: Server cloudcephosd1052 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1052 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[16:47:24] cloud-services-team: KernelErrors Server cloudcephosd1052 logged kernel errors - https://phabricator.wikimedia.org/T403842 (phaultfinder) NEW
[18:33:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-78 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[18:43:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-78 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[19:04:10] wikitech.wikimedia.org: [Bug] Wikitech loads as mobile site on desktop in Chrome private browsing mode - https://phabricator.wikimedia.org/T190384#11153446 (Krinkle) Open→Resolved a:Krinkle This is fixed today as side-effect of {T214998}, specifically T401595. ## Root cause The root cause f...
[19:06:18] wikitech.wikimedia.org, serviceops: Wikitech displays desktop site on mobile devices - https://phabricator.wikimedia.org/T383656#11153468 (Krinkle) Open→Resolved a:Krinkle >>! In T190384#11153446, @Krinkle wrote: > This is fixed today as side-effect of {T214998}, specifically T401595. >...
[21:14:20] VPS-Projects, Content-Transform-Team (Work In Progress), Essential-Work: Set up new cloud VPS server for Content Transform Team Visual Diff testing - https://phabricator.wikimedia.org/T402836#11153846 (ssastry) Open→Resolved We now have ctt-prv-04 set up and configured and a first round o...
[21:19:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-78 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[21:36:37] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node