[00:40:21] RESOLVED: MaintainKubeusersHang: maintain-kubeusers last finished run is 29.56M minutes old - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersHang
[00:42:21] FIRING: MaintainKubeusersHang: maintain-kubeusers last finished run is 29.56M minutes old - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersHang
[02:07:47] (open) raymond-ndibe: job: clean up orphaned s3 objects [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/63 (https://phabricator.wikimedia.org/T418528)
[02:07:51] (update) raymond-ndibe: job: clean up orphaned s3 objects [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/63 (https://phabricator.wikimedia.org/T418528)
[02:08:08] (update) raymond-ndibe: Draft: job: clean up orphaned s3 objects [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/63 (https://phabricator.wikimedia.org/T418528)
[03:07:38] Tool-wikimonitor, good first task, patch-welcome: Unable to Edit SpEL Condition from Mobile - https://phabricator.wikimedia.org/T420409#11721512 (SheetalPro) Hi, I’d like to take up the task “Unable to Edit SpEL Condition from Mobile''. I’ve gone through the issue and understand that it’s related to...
[03:10:17] Cloud-Services: [Toolforge Sustainability Framework]Percentage scoring of framework subcategories - https://phabricator.wikimedia.org/T420425 (komla) NEW The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and...
[03:11:04] Toolforge (Toolforge iteration 26): [Toolforge Sustainability Framework]Percentage scoring of framework subcategories - https://phabricator.wikimedia.org/T420425#11721526 (komla)
[04:42:21] RESOLVED: MaintainKubeusersHang: maintain-kubeusers last finished run is 29.56M minutes old - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersHang
[07:23:26] !log filippo@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster (T419824)
[07:23:31] T419824: Add new k8s toolforge workers to cater for memory requests - https://phabricator.wikimedia.org/T419824
[07:37:11] !log filippo@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-83.tools.eqiad1.wikimedia.cloud to the cluster
[07:37:11] !log filippo@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[08:07:47] FIRING: NodeDown: Node cloudgw1003 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudgw1003 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[08:13:17] FIRING: JobUnavailable: Reduced availability for job cloudlb-haproxy in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:14:23] Toolforge (Toolforge iteration 26): [Toolforge Sustainability Framework]Percentage scoring of framework subcategories - https://phabricator.wikimedia.org/T420425#11721807 (dcaro) @komla for both those categories, you'll have to do an inventory of "actions" that can be done on toolforge (deciding also what is...
[08:17:05] FIRING: HostBGPDown: BGP session for cloudlb1001 (2a02:ec80:a000:201::2) is down - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cloudsw1-c8-eqiad:9804&var-bgp_group=cloud_host6 - https://alerts.wikimedia.org/?q=alertname%3DHostBGPDown
[08:17:42] (CR) Akoopal: [C:+2] Detect and report duplicate monument IDs during harvest [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1251595 (https://phabricator.wikimedia.org/T420019) (owner: Lokal Profil)
[08:19:54] (Merged) jenkins-bot: Detect and report duplicate monument IDs during harvest [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1251595 (https://phabricator.wikimedia.org/T420019) (owner: Lokal Profil)
[08:22:40] PROBLEM - Host cloudnet1005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:47] FIRING: NodeDown: Node cloudlb1001 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudlb1001 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[08:24:49] FIRING: [4x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[08:25:04] PROBLEM - Host cloudservices1006 is DOWN: PING CRITICAL - Packet loss = 100%
[08:27:47] FIRING: [2x] NodeDown: Node cloudlb1001 is down.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[08:28:17] FIRING: [2x] JobUnavailable: Reduced availability for job pdns in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:29:58] PROBLEM - Host cloudrabbit1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:32:05] FIRING: [2x] HostBGPDown: BGP session for cloudservices1006 (172.20.1.5) is down - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DHostBGPDown
[08:32:47] FIRING: [3x] NodeDown: Node cloudlb1001 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[08:34:49] FIRING: [5x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[08:37:47] FIRING: [4x] NodeDown: Node cloudlb1001 is down.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[08:39:49] FIRING: [5x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[08:40:20] RECOVERY - Host cloudrabbit1001 is UP: PING WARNING - Packet loss = 80%, RTA = 0.35 ms
[08:42:07] (PS1) Lokal Profil: Detect and report monuments with missing IDs during harvest [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1254828 (https://phabricator.wikimedia.org/T420013)
[08:42:47] FIRING: [4x] NodeDown: Node cloudlb1001 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[08:46:16] Tool-inteGraality: Retrieving labels via SPARQL tanks query performance - https://phabricator.wikimedia.org/T400480#11721854 (JeanFred) Open→Resolved a:JeanFred I think 69669c0 is good enough as a fix.
[08:51:56] !log filippo@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services
[08:52:51] FIRING: RabbitmqNetworkPartition: A Rabbitmq Network partition has been detected. 1 hosts marked as partitioned. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/RabbitmqNetworkPartition - https://grafana.wikimedia.org/d/tn5yHr44k/wmcs-rabbitmq-health - https://alerts.wikimedia.org/?q=alertname%3DRabbitmqNetworkPartition
[08:55:26] !log filippo@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.restart_openstack (exit_code=97) on deployment eqiad1 for all services
[08:57:51] FIRING: [3x] RabbitmqNetworkPartition: A Rabbitmq Network partition has been detected. 1 hosts marked as partitioned.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/RabbitmqNetworkPartition - https://grafana.wikimedia.org/d/tn5yHr44k/wmcs-rabbitmq-health - https://alerts.wikimedia.org/?q=alertname%3DRabbitmqNetworkPartition
[08:59:49] FIRING: [5x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[09:02:51] RESOLVED: [3x] RabbitmqNetworkPartition: A Rabbitmq Network partition has been detected. 1 hosts marked as partitioned. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/RabbitmqNetworkPartition - https://grafana.wikimedia.org/d/tn5yHr44k/wmcs-rabbitmq-health - https://alerts.wikimedia.org/?q=alertname%3DRabbitmqNetworkPartition
[09:04:49] FIRING: [5x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[09:26:19] RECOVERY - Host cloudservices1006 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[09:26:26] RECOVERY - Host cloudnet1005 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[09:27:47] RESOLVED: [3x] NodeDown: Node cloudlb1001 is down.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[09:28:17] RESOLVED: [2x] JobUnavailable: Reduced availability for job pdns in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:29:49] RESOLVED: [4x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[09:32:35] RESOLVED: [2x] HostBGPDown: BGP session for cloudservices1006 (172.20.1.5) is down - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DHostBGPDown
[10:02:25] (open) dcaro: core: update jobs in storage too [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/277
[10:04:23] (update) dcaro: core: update jobs in storage too [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/277
[10:23:15] cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS: Carry out controlled network switch down tests in cloud - https://phabricator.wikimedia.org/T417393#11722180 (fgiunchedi) Tests today went significantly better: cloud vps networking stayed intact, I did start with failing over cloudgw which meant hosts...
[10:28:19] cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS: Increased openstack latency and rabbitmq rolling restarts on certificate update - https://phabricator.wikimedia.org/T418444#11722215 (fgiunchedi) Today during {T417393} the same failure happened, namely cloudrabbit1001 was disconnected from the network...
[11:10:33] Tool-campwiz-nxt, Google-Summer-of-Code (2026): GSoC 2026: CampWiz NxT Redesign - https://phabricator.wikimedia.org/T414269#11722415 (Vickova) Hello, I’m interested in working on the CampWiz NxT Redesign project for GSoC 2026. I’m a React developer with experience building modular and scalable user inte...
[11:21:25] Toolforge (Toolforge iteration 26), Patch-For-Review: [harbor,tools] Harbor object usage in S3 is steadily increasing - https://phabricator.wikimedia.org/T418528#11722426 (Raymond_Ndibe) ==TL;DR:== * maintain-harbor job to handle harbor vs s3 orphaned objects cleanup (is this safe? are their safer altern...
[12:44:11] Toolforge (Toolforge iteration 26), Patch-For-Review: [harbor,tools] Harbor object usage in S3 is steadily increasing - https://phabricator.wikimedia.org/T418528#11722841 (Raymond_Ndibe) I added documentation for the s3 lifecycle configuration part here https://wikitech.wikimedia.org/wiki/Help:Object_sto...
[13:52:29] (open) dcaro: images: don't set the digest if it was not passed [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/278
[14:39:00] (update) kavaljeetsingh: Fix self closing tags [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/9 (https://phabricator.wikimedia.org/T417483)
[14:47:33] cloud-services-team, DC-Ops, ops-codfw, SRE: cloudcephmon2007-dev service implementation - https://phabricator.wikimedia.org/T420282#11723394 (Andrew) p:Triage→Medium
[14:48:05] cloud-services-team, Cloud-VPS: Deprecate and remove 'bastion-restricted' hosts - https://phabricator.wikimedia.org/T420213#11723399 (Andrew) p:Triage→Low
[14:50:09] cloud-services-team, Cloud-VPS: Deprecate and remove 'bastion-restricted' hosts - https://phabricator.wikimedia.org/T420213#11723421 (Andrew) >>!
In T420213#11719139, @taavi wrote: > FWIW I see some value in having the cumin authorized_keys entries have an IP restriction on a host not accessible to most...
[14:50:56] cloud-services-team, Cloud-VPS (Quota-requests): Add floating IP and vanity domain for azwikimedia project - https://phabricator.wikimedia.org/T419582#11723422 (Andrew) p:Triage→Medium
[15:10:57] cloud-services-team, Toolforge: Add new k8s toolforge workers to cater for memory requests - https://phabricator.wikimedia.org/T419824#11723532 (fnegri) Open→In progress p:Triage→High
[15:14:10] cloud-services-team (FY2025/2026-Q3-Q4), Toolforge: Add new k8s toolforge workers to cater for memory requests - https://phabricator.wikimedia.org/T419824#11723570 (fnegri)
[15:14:16] cloud-services-team (FY2025/2026-Q3-Q4), Toolforge: [clis] standardize the package names - https://phabricator.wikimedia.org/T399080#11723572 (fnegri)
[15:14:25] cloud-services-team (FY2025/2026-Q3-Q4), Toolforge, Documentation: Create a "my first React app" tutorial for Toolforge - https://phabricator.wikimedia.org/T231950#11723574 (fnegri)
[15:14:28] cloud-services-team (FY2025/2026-Q3-Q4), Toolforge (Toolforge iteration 26): Replace ingress-nginx before upstream EOL date - https://phabricator.wikimedia.org/T392356#11723576 (fnegri)
[15:26:33] (update) dcaro: core: update jobs in storage too [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/277
[15:31:48] (update) dcaro: core: update jobs in storage too [repos/cloud/toolforge/jobs-api] (add_job_type_as_set) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/277
[15:32:08] (update) dcaro: images: don't set the digest if it was not passed [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/278
[15:32:18] (open) dcaro: models: make job_type always show as set
[repos/cloud/toolforge/jobs-api] (fix_images) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/279
[15:33:10] (update) dcaro: models: make job_type always show as set [repos/cloud/toolforge/jobs-api] (fix_images) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/279
[15:33:41] (update) dcaro: core: update jobs in storage too [repos/cloud/toolforge/jobs-api] (add_job_type_as_set) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/277
[15:34:43] (update) dcaro: models: make job_type always show as set [repos/cloud/toolforge/jobs-api] (fix_images) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/279
[15:37:17] (update) dcaro: core: update jobs in storage too [repos/cloud/toolforge/jobs-api] (add_job_type_as_set) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/277
[15:49:26] (update) dcaro: images: don't set the digest if it was not passed [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/278
[15:54:31] cloud-services-team (FY2025/2026-Q3-Q4), Toolforge (Toolforge iteration 26), Patch-For-Review: [builds-builder] Add support for Heroku's "24" builder stack based on Ubuntu 2024.04 noble - https://phabricator.wikimedia.org/T380127#11723753 (dcaro) >>! In T380127#11721215, @bd808 wrote: > @dcaro, I see...
[16:06:13] (update) dcaro: models: make job_type always show as set [repos/cloud/toolforge/jobs-api] (fix_images) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/279
[16:08:19] (update) dcaro: core: update jobs in storage too [repos/cloud/toolforge/jobs-api] (add_job_type_as_set) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/277
[16:11:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[16:20:34] cloud-services-team, Toolforge, Phabricator, GitLab (Auth & Access), Release-Engineering-Team (Radar): Look for ways to consolidate "we trust this human" access lists - https://phabricator.wikimedia.org/T364516#11723934 (brennen)
[16:41:46] (update) dcaro: core: update jobs in storage too [repos/cloud/toolforge/jobs-api] (add_job_type_as_set) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/277
[16:43:21] (PS1) Akoopal: Added hamburg to erfgoedbot config. [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1254973 (https://phabricator.wikimedia.org/T420019)
[16:44:06] Tools, All-and-every-Wikisource: ws-page-game.toolforge.org does not receive page scan thumbnails (HTTP 429) - https://phabricator.wikimedia.org/T419239#11724134 (ShakespeareFan00) This now at least fails with a proper error - ` 16:42:31.120 GET https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5...
[16:47:15] cloud-services-team (FY2025/2026-Q3-Q4), Data-Services, Data-Persistence: clouddb1013 crashed after the upgrade to mariadb 10.11.16 - https://phabricator.wikimedia.org/T420177#11724145 (fnegri) I have repooled the host since it's worked fine for the past 48 hours. Now let's see if it continues to wor...
[16:50:23] Toolforge (Toolforge iteration 26): [Toolforge Sustainability Framework]Percentage scoring of framework subcategories - https://phabricator.wikimedia.org/T420425#11724165 (komla) Cool. I will start off by using the user journey to list these actions (I have some data on this). I should have this ready by nex...
[17:05:33] FIRING: [4x] ProbeDown: Service tools-k8s-haproxy-8:30004 has failed probes (http_infra_tracing_loki_svc_tools_eqiad1_wikimedia_cloud_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:10:33] RESOLVED: [4x] ProbeDown: Service tools-k8s-haproxy-8:30004 has failed probes (http_infra_tracing_loki_svc_tools_eqiad1_wikimedia_cloud_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:10:47] (PS2) Akoopal: Added hamburg to erfgoedbot config.
[labs/tools/heritage] - https://gerrit.wikimedia.org/r/1254973 (https://phabricator.wikimedia.org/T55271)
[17:11:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[17:13:09] cloud-services-team (FY2025/2026-Q3-Q4), Data-Services, Data-Persistence: clouddb1013 crashed after the upgrade to mariadb 10.11.16 - https://phabricator.wikimedia.org/T420177#11724310 (fnegri) In case there are more segfaults and this need to be depooled while I'm not around, the command is: `lang=...
[17:38:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[17:46:32] cloud-services-team, Cloud-VPS, Lingua-Libre: Configure vanity domain for lingualibre - https://phabricator.wikimedia.org/T419525#11724631 (Yug) >>! In T419525#11719328, @dcaro wrote: > Hi @Yug, can you follow the steps detailed here https://wikitech.wikimedia.org/wiki/Help:Using_a_web_proxy_to_reach...
[17:47:23] (update) r4356th: Fix self closing tags [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/9 (https://phabricator.wikimedia.org/T417483) (owner: kavaljeetsingh)
[17:58:46] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-8:443 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:58:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[18:03:46] RESOLVED: [4x] ProbeDown: Service tools-k8s-haproxy-8:30004 has failed probes (http_infra_tracing_loki_svc_tools_eqiad1_wikimedia_cloud_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:43:33] FIRING: [4x] ProbeDown: Service tools-k8s-haproxy-8:30004 has failed probes (http_infra_tracing_loki_svc_tools_eqiad1_wikimedia_cloud_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:46:32] FIRING: InstanceDown: Project tools instance tools-k8s-haproxy-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[18:51:32] RESOLVED: InstanceDown: Project
tools instance tools-k8s-haproxy-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[18:53:33] RESOLVED: [4x] ProbeDown: Service tools-k8s-haproxy-8:30004 has failed probes (http_infra_tracing_loki_svc_tools_eqiad1_wikimedia_cloud_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[19:58:38] (CR) Lokal Profil: "retest" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1254973 (https://phabricator.wikimedia.org/T55271) (owner: Akoopal)
[20:02:07] (CR) Lokal Profil: "recheck" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1254973 (https://phabricator.wikimedia.org/T55271) (owner: Akoopal)
[20:03:25] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment codfw1dev for all services
[20:04:06] (CR) CI reject: [V:-1] Added hamburg to erfgoedbot config. [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1254973 (https://phabricator.wikimedia.org/T55271) (owner: Akoopal)
[20:07:38] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment codfw1dev for all services
[20:11:48] VPS-project-Extdist: extdist is not rotating logs - https://phabricator.wikimedia.org/T253588#11725299 (taavi) Open→Resolved `lang=shell-session root@extdist-06:~# tail /var/log/extdist.1 2026-03-14 23:15:45,620 DEBUG:No updates to branch, tarball already exists. 2026-03-14 23:15:45,620 INFO:Cre...
[20:31:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[20:34:34] (PS3) Akoopal: Added hamburg to erfgoedbot config. [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1254973 (https://phabricator.wikimedia.org/T55271)
[20:37:30] (CR) Lokal Profil: "recheck" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1254973 (https://phabricator.wikimedia.org/T55271) (owner: Akoopal)
[20:39:26] (CR) CI reject: [V:-1] Added hamburg to erfgoedbot config. [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1254973 (https://phabricator.wikimedia.org/T55271) (owner: Akoopal)
[20:40:50] (PS4) Lokal Profil: Added hamburg to erfgoedbot config. [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1254973 (https://phabricator.wikimedia.org/T55271) (owner: Akoopal)
[21:36:04] Cloud-VPS (Project-requests): Request creation of s3etherpad VPS project - https://phabricator.wikimedia.org/T420532 (bd808) NEW
[21:38:07] Tool-campwiz-nxt, Google-Summer-of-Code (2026): GSoC 2026: CampWiz NxT Redesign - https://phabricator.wikimedia.org/T414269#11725640 (jahin) Hi @Nokib_Sarkar and @Tiven2240 , I’m interested in working on the CampWiz NxT redesign project for GSoC 2026. I have experience with React and frontend developmen...
[21:44:39] Cloud-VPS (Project-requests): Request creation of s3etherpad VPS project - https://phabricator.wikimedia.org/T420532#11725664 (bd808)
[23:58:00] Cloud-VPS (Quota-requests), Catalyst: Disk quota increase for catalyst - https://phabricator.wikimedia.org/T420544 (thcipriani) NEW