[00:01:35] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[00:02:34] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[00:07:35] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[00:08:35] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[00:23:35] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[07:57:29] (CR) Eugene233: [V:+2 C:+2] added endpoints.ts [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1151745 (owner: Martindevelops)
[08:45:52] (merge) taavi: Remove toolsbeta-prometheus-1 volume [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/45
[08:58:21] (CR) Majavah: "recheck" [labs/striker] - https://gerrit.wikimedia.org/r/1134724 (https://phabricator.wikimedia.org/T364605) (owner: Arendpieter)
[09:00:39] (CR) CI reject: [V:-1] Switch username validation to Bitu API [labs/striker] - https://gerrit.wikimedia.org/r/1134724 (https://phabricator.wikimedia.org/T364605) (owner: Arendpieter)
[09:23:09] (PS11) Majavah: Switch username validation to Bitu API [labs/striker] - https://gerrit.wikimedia.org/r/1134724 (https://phabricator.wikimedia.org/T364605) (owner: Arendpieter)
[09:36:24] cloud-services-team (FY2024/2025-Q3-Q4), Striker, Bitu, Infrastructure-Foundations, Patch-For-Review: Move Striker to Bitu username validation API - https://phabricator.wikimedia.org/T364605#10886930 (taavi) a:Arendpieter
[09:37:47] (CR) Majavah: "The patch LGTM now, thanks! Only problem that remains is that the `/signup/api/username/` API doesn't seem to be enabled on `idm.wikimedia" [labs/striker] - https://gerrit.wikimedia.org/r/1134724 (https://phabricator.wikimedia.org/T364605) (owner: Arendpieter)
[09:37:54] (CR) Majavah: [C:+1] Switch username validation to Bitu API [labs/striker] - https://gerrit.wikimedia.org/r/1134724 (https://phabricator.wikimedia.org/T364605) (owner: Arendpieter)
[09:39:04] cloud-services-team (FY2024/2025-Q3-Q4), Striker, Bitu, Infrastructure-Foundations, Patch-For-Review: Move Striker to Bitu username validation API - https://phabricator.wikimedia.org/T364605#10886947 (taavi) @SLyngshede-WMF is enabling the `/signup/api/username/` API on `idm.wikimedia.org` ju...
[10:47:52] FIRING: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:52:52] RESOLVED: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:15:06] cloud-services-team, Cloud-VPS: Is it allowed to expose HTTPS services targeting machines without web proxies? - https://phabricator.wikimedia.org/T395721#10887310 (taavi) Open→Invalid Sorry for the delay here.. as https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_communication states w...
[11:28:54] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence, Privacy Engineering: Set up x1 replication to Wiki Replicas - https://phabricator.wikimedia.org/T395881#10887329 (fnegri)
[11:29:25] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence, Privacy Engineering: Set up x1 replication to Wiki Replicas - https://phabricator.wikimedia.org/T395881#10887330 (fnegri)
[11:34:11] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence, Privacy Engineering: Set up x1 replication to Wiki Replicas - https://phabricator.wikimedia.org/T395881#10887344 (fnegri) @VirginiaPoundstone I updated the task description with more context. I will also ask in the user-c...
[11:35:08] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence, Data-Platform-SRE (2025.05.24 - 2025.06.13): Create wiki replicas views for globaljsonlinks tables - https://phabricator.wikimedia.org/T387419#10887347 (fnegri) Does anybody have a specific use case that needs to access t...
[11:35:14] cloud-services-team, Data-Services, Data-Persistence, Privacy Engineering, SecTeam-Processed: Add "wikishared" database to wiki replicas - https://phabricator.wikimedia.org/T395072#10887348 (fnegri) Does anybody have a specific use case that needs to access these tables? If yes, that might he...
[11:45:51] cloud-services-team, Data-Services, Data-Persistence, Privacy Engineering, SecTeam-Processed: Add "wikishared" database to wiki replicas - https://phabricator.wikimedia.org/T395072#10887391 (Marostegui) We still have to get a review from Security before we can even think about implementing this.
[12:08:38] Tool-python-toolforge: Support connecting to extension databases - https://phabricator.wikimedia.org/T396115 (taavi) NEW
[12:47:05] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-72 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[13:07:05] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-72 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[13:09:06] PAWS: /home/paws is 100% - https://phabricator.wikimedia.org/T396051#10887604 (Andrew) The cleanup cron job was running, but crashed out when it encountered a file it didn't have permission to move. I added some very naive exception handling and re-ran it; it moved 310G of files into /srv/paws/files-to-remov...
[13:12:40] cloud-services-team, Cloud-VPS, Patch-For-Review: Understand Octavia network needs - https://phabricator.wikimedia.org/T394099#10887612 (Andrew) Open→Resolved a:Andrew
[13:41:34] cloud-services-team (FY2024/2025-Q3-Q4), Data-Services, Data-Persistence, Patch-For-Review: [wikireplicas] Remove maintainviews and maintainindexes users - https://phabricator.wikimedia.org/T395432#10887687 (fnegri) In progress→Declined I'm gonna decline this as it's not as straightfo...
[14:20:24] Tool-Pageviews: frwiki pageview - https://phabricator.wikimedia.org/T396122 (Spartan.arbinger) NEW
[14:20:56] (merge) chuckonwumelu: [build] Return image_name when retrieving build info [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/135 (https://phabricator.wikimedia.org/T395035)
[14:26:23] cloud-services-team, ContentTranslation: Setup Wiki Family on CX / SX staging - https://phabricator.wikimedia.org/T345340#10887852 (Nikerabbit)
[14:27:48] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: builds-api: bump to 0.0.194-20250605142106-6bd54bba [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/806 (https://phabricator.wikimedia.org/T395035)
[14:29:05] PROBLEM - Host cloudcephosd1039 is DOWN: PING CRITICAL - Packet loss = 100%
[14:30:15] !log chuckonwumelu@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component builds-api
[14:31:05] FIRING: [2x] HostBGPDown: BGP session for cloudlb2003-dev (2a02:ec80:a100:205::4) is down - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cloudsw1-b1-codfw:9804&var-bgp_group=cloud_host6 - https://alerts.wikimedia.org/?q=alertname%3DHostBGPDown
[14:31:10] cloud-services-team: HostBGPDown - https://phabricator.wikimedia.org/T396123 (phaultfinder) NEW
[14:34:10] cloud-services-team: NodeDown Node cloudcephosd1039 has been down for long. - https://phabricator.wikimedia.org/T395811#10887898 (Andrew) Open→Resolved a:Andrew This was down for work related to T394333 which we are probably no longer doing.
[14:37:57] RECOVERY - Host cloudcephosd1039 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[14:41:25] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[14:43:01] !log chuckonwumelu@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component builds-api
[14:48:51] RESOLVED: [2x] HostBGPDown: BGP session for cloudlb2003-dev (2a02:ec80:a100:205::4) is down - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cloudsw1-b1-codfw:9804&var-bgp_group=cloud_host6 - https://alerts.wikimedia.org/?q=alertname%3DHostBGPDown
[14:51:59] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=97)
[14:53:30] !log chuckonwumelu@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component builds-api
[14:56:36] (approved) fnegri: builds-api: bump to 0.0.194-20250605142106-6bd54bba [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/806 (https://phabricator.wikimedia.org/T395035) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[15:05:59] !log chuckonwumelu@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component builds-api
[15:25:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node
[15:34:08] Cloud-VPS (Quota-requests): Pixel project "disk40" flavor, and perhaps a few more cores? - https://phabricator.wikimedia.org/T395837#10888171 (Andrew) Is there a reason that Cinder volumes can't address your storage needs?
[16:02:27] (merge) chuckonwumelu: builds-api: bump to 0.0.194-20250605142106-6bd54bba [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/806 (https://phabricator.wikimedia.org/T395035) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[16:05:50] Toolforge (Toolforge iteration 20), good first task, Patch-For-Review: [builds-api] populate the `image_name` for the builds returned - https://phabricator.wikimedia.org/T395035#10888274 (Chuckonwumelu) In progress→Resolved
[16:45:47] cloud-services-team, Cloud-VPS (Quota-requests), collaboration-services: Request to increase RAM and VCPU quota in codesearch - https://phabricator.wikimedia.org/T396073#10888418 (Dzahn) Thank you for the quick response! :)
[16:47:41] cloud-services-team, Cloud-VPS (Quota-requests), collaboration-services: Request to increase RAM and VCPU quota in codesearch - https://phabricator.wikimedia.org/T396073#10888420 (Dzahn)
[16:47:42] VPS-project-Codesearch, collaboration-services: Graduate codesearch to production - https://phabricator.wikimedia.org/T268199#10888421 (Dzahn)
[16:48:58] VPS-project-Codesearch, collaboration-services: Graduate codesearch to production - https://phabricator.wikimedia.org/T268199#10888422 (Dzahn) In the linked task above I requested (and was granted) more quote on the codesearch cloud VPS project. I am going to double RAM on the sourcebot1 test instance b...
[17:01:31] FIRING: ToolsNfsAlmostFull: Toolforge NFS is 86.66% full - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNfsAlmostFull - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsNfsAlmostFull
[17:24:04] cloud-services-team, Cloud-VPS, Patch-For-Review: Octavia infra followup - https://phabricator.wikimedia.org/T395864#10888537 (Andrew)
[17:26:50] (PS1) Andrew Bogott: restart_openstack: include octavia services [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1154086 (https://phabricator.wikimedia.org/T395864)
[17:28:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment codfw1dev for service: project,octavia
[17:28:46] (PS2) Andrew Bogott: restart_openstack: include octavia services [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1154086 (https://phabricator.wikimedia.org/T395864)
[17:28:50] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment codfw1dev for service: project,octavia
[17:28:54] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment codfw1dev for service: project,octavia
[17:29:17] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment codfw1dev for service: project,octavia
[17:29:37] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment codfw1dev for all services
[17:31:46] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment codfw1dev for all services
[17:32:53] (CR) Andrew Bogott: [C:+2] restart_openstack: include octavia services [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1154086 (https://phabricator.wikimedia.org/T395864) (owner: Andrew Bogott)
[17:33:15] cloud-services-team, Cloud-VPS, Patch-For-Review: Octavia infra followup - https://phabricator.wikimedia.org/T395864#10888553 (Andrew)
[17:36:28] (Merged) jenkins-bot: restart_openstack: include octavia services [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1154086 (https://phabricator.wikimedia.org/T395864) (owner: Andrew Bogott)
[18:09:07] cloud-services-team, Cloud-VPS (Quota-requests), collaboration-services: Request to increase RAM and VCPU quota in codesearch - https://phabricator.wikimedia.org/T396073#10888647 (Dzahn) Yes, doubling the RAM on the sourcebot-codesearch made a HUGE difference. It's very usable now and before it w...
[18:10:30] VPS-project-Codesearch, collaboration-services: Graduate codesearch to production - https://phabricator.wikimedia.org/T268199#10888648 (Dzahn) After resizing the instance to 16GB RAM, and upgrading to v4.0.1, now the instance does not swap anymore and the UI is usable and so far does not show any errors....
[18:33:06] cloud-services-team, Cloud-VPS, Patch-For-Review: Octavia infra followup - https://phabricator.wikimedia.org/T395864#10888696 (Andrew) WIP: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Octavia
[20:15:14] (open) lucaswerkmeister: Add support for extension databases [toolforge-repos/python-toolforge] - https://gitlab.wikimedia.org/toolforge-repos/python-toolforge/-/merge_requests/25 (https://phabricator.wikimedia.org/T396115)
[20:16:10] Tool-python-toolforge, Patch-For-Review: Support connecting to extension databases - https://phabricator.wikimedia.org/T396115#10888918 (LucasWerkmeister) a:LucasWerkmeister
[20:19:57] (update) lucaswerkmeister: Add support for extension databases [toolforge-repos/python-toolforge] - https://gitlab.wikimedia.org/toolforge-repos/python-toolforge/-/merge_requests/25 (https://phabricator.wikimedia.org/T396115)
[20:26:30] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99)
[20:43:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[20:53:56] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[21:06:25] (open) lucaswerkmeister: Update pyupgrade from v3.3.1 to v3.20.0 [toolforge-repos/python-toolforge] - https://gitlab.wikimedia.org/toolforge-repos/python-toolforge/-/merge_requests/26
[21:19:38] (update) bd808: Add support for extension databases [toolforge-repos/python-toolforge] - https://gitlab.wikimedia.org/toolforge-repos/python-toolforge/-/merge_requests/25 (https://phabricator.wikimedia.org/T396115) (owner: lucaswerkmeister)
[21:22:19] (approved) lucaswerkmeister: Add support for extension databases [toolforge-repos/python-toolforge] - https://gitlab.wikimedia.org/toolforge-repos/python-toolforge/-/merge_requests/25 (https://phabricator.wikimedia.org/T396115)
[21:22:46] (close) lucaswerkmeister: Update pyupgrade from v3.3.1 to v3.20.0 [toolforge-repos/python-toolforge] - https://gitlab.wikimedia.org/toolforge-repos/python-toolforge/-/merge_requests/26
[21:25:45] (update) bd808: Add support for extension databases [toolforge-repos/python-toolforge] - https://gitlab.wikimedia.org/toolforge-repos/python-toolforge/-/merge_requests/25 (https://phabricator.wikimedia.org/T396115) (owner: lucaswerkmeister)
[21:36:52] (update) bd808: Add support for extension databases [toolforge-repos/python-toolforge] - https://gitlab.wikimedia.org/toolforge-repos/python-toolforge/-/merge_requests/25 (https://phabricator.wikimedia.org/T396115) (owner: lucaswerkmeister)
[21:55:31] (merge) bd808: Add support for extension databases [toolforge-repos/python-toolforge] - https://gitlab.wikimedia.org/toolforge-repos/python-toolforge/-/merge_requests/25 (https://phabricator.wikimedia.org/T396115) (owner: lucaswerkmeister)
[22:20:52] cloud-services-team, Cloud-VPS, Patch-For-Review: Octavia infra followup - https://phabricator.wikimedia.org/T395864#10889438 (Andrew)
[22:21:00] cloud-services-team, Cloud-VPS, Patch-For-Review: Octavia infra followup - https://phabricator.wikimedia.org/T395864#10889439 (Andrew) Open→Resolved