[00:10:00] FIRING: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:16:56] FIRING: SystemdUnitDown: The service unit opentofu-infra-diff.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[03:19:19] FIRING: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling
[03:24:19] RESOLVED: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling
[03:33:19] FIRING: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling
[03:43:19] RESOLVED: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling
[03:54:19] FIRING: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling
[03:59:19] RESOLVED: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling
[04:10:00] FIRING: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[05:11:56] FIRING: SystemdUnitDown: The systemd unit opentofu-infra-diff.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[05:12:03] cloud-services-team: SystemdUnitDown The systemd unit opentofu-infra-diff.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T396187 (phaultfinder) NEW
[06:52:05] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-5 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[06:52:56] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T396189 (Referentis) NEW
[06:59:24] cloud-services-team, Cloud-VPS: SystemdUnitDown The systemd unit opentofu-infra-diff.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T396187#10889980 (taavi) a:Andrew ` Jun 06 03:10:12 cloudcontrol1007 tofu[1114487]: # module.project["t...
[07:15:30] cloud-services-team: HostBGPDown - https://phabricator.wikimedia.org/T396123#10889999 (taavi) Open→Resolved
[07:23:31] FIRING: ToolsToolsDBReplicationLagIsTooHigh: ToolsDB replication on tools-db-5 is lagging behind the primary, the current lag is 3675 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[08:10:00] FIRING: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[08:20:46] Tool-techcontribs: visiting techcontribs.toolforge.org/uid/ (with an incomplete URI) redirects to https://localhost:8000 and gives ERR_CONNECTION_REFUSED - https://phabricator.wikimedia.org/T393620#10890068 (Chlod) Open→Resolved p:Triage→Low a:Chlod Seems like I was using the wrong wa...
[08:30:36] Tool-techcontribs: Too little spacing between columns in Phabricator projects list - https://phabricator.wikimedia.org/T393328#10890088 (Chlod) Open→In progress p:Triage→Medium a:Chlod 🎉 Fixed with [636b184](https://gitlab.wikimedia.org/toolforge-repos/techcontribs/-/commit/636b184aa1fe2c...
[08:43:13] Tool-techcontribs: Emojis used in Phabricator project description causes extra space along with the emoji not being displayed - https://phabricator.wikimedia.org/T393324#10890108 (Chlod) Open→In progress p:Triage→Low a:Chlod The emoji here is actually an icon (e.g. `{icon lock color=red}`...
[09:08:52] Tool-techcontribs: clearer explanation of the difference between "developer account shell name" and "developer account name" - https://phabricator.wikimedia.org/T393622#10890162 (Chlod) Open→In progress p:Triage→Low a:Chlod 🎉 Done with [a3deeaea](https://gitlab.wikimedia.org/toolforge-rep...
[09:10:27] Tool-techcontribs: Emojis used in Phabricator project description causes extra space along with the emoji not being displayed - https://phabricator.wikimedia.org/T393324#10890181 (Chlod) In progress→Resolved
[09:10:30] Tool-techcontribs: Too little spacing between columns in Phabricator projects list - https://phabricator.wikimedia.org/T393328#10890184 (Chlod) In progress→Resolved
[09:18:20] Tool-techcontribs: Add some of the data from ldap.toolforge.org tool - https://phabricator.wikimedia.org/T393536#10890229 (Chlod) p:Triage→Low Hmm, general LDAP data (i.e. your int ID, account creation, SSH keys, etc.) is a bit out of scope for Tech Contribs. I'm currently not aiming to replace the `...
[09:44:13] Tool-techcontribs: Add a link to Phabricator to the footer - https://phabricator.wikimedia.org/T393532#10890300 (Chlod) Open→In progress p:Triage→Low a:Chlod 🎉 Added with [4a9d43c](https://gitlab.wikimedia.org/toolforge-repos/techcontribs/-/commit/4a9d43c2ee53a5f44ba4ab71e3a59a3c31107352)...
[09:46:26] Tool-techcontribs: Remove the "uploaded groups" section from Gerrit groups - https://phabricator.wikimedia.org/T393535#10890314 (Chlod) p:Triage→Low
[09:52:32] Tool-techcontribs: clearer explanation of the difference between "developer account shell name" and "developer account name" - https://phabricator.wikimedia.org/T393622#10890323 (Chlod) In progress→Resolved
[09:52:35] Tool-techcontribs: Add a link to Phabricator to the footer - https://phabricator.wikimedia.org/T393532#10890325 (Chlod) In progress→Resolved
[10:01:13] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-06 - https://phabricator.wikimedia.org/T396199 (fnegri) NEW
[10:02:05] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-5 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[10:02:25] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-06 - https://phabricator.wikimedia.org/T396199#10890344 (fnegri) Open→In progress p:Triage→High
[10:10:21] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-06 - https://phabricator.wikimedia.org/T396199#10890369 (fnegri) @Naorleizer your miss-search tool is causing some performance issues to ToolsDB. The problematic query was ` DELETE FROM qid_rank O...
[10:12:24] Tool-techcontribs: Reduce amount of scrolling and clicks needed to get to the search form - https://phabricator.wikimedia.org/T393537#10890370 (Chlod) Open→In progress a:Chlod 🎉 Done with [e79664e](https://gitlab.wikimedia.org/toolforge-repos/techcontribs/-/commit/e79664e15b457147893b20061c554f87...
[10:20:00] RESOLVED: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[10:20:52] Tool-techcontribs: Reduce amount of scrolling and clicks needed to get to the search form - https://phabricator.wikimedia.org/T393537#10890403 (Chlod) In progress→Resolved
[10:22:10] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-06 - https://phabricator.wikimedia.org/T396199#10890407 (fnegri) I'm gonna stop the replication, manually run the DELETE query on the replica host, then restart replication. Plan: ` STOP SLAVE; S...
[10:29:38] cloud-services-team (FY2024/2025-Q3-Q4), Striker, Bitu, Infrastructure-Foundations, Patch-For-Review: Move Striker to Bitu username validation API - https://phabricator.wikimedia.org/T364605#10890422 (SLyngshede-WMF) @taavi Yes, so half of everything in the IDM is an "option" of sorts. I wasn...
[11:21:37] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-06 - https://phabricator.wikimedia.org/T396199#10890596 (fnegri) `STOP REPLICA` is not enough, because it [waits for all active transactions to complete](https://mariadb.com/kb/en/stop-replica/):...
[11:43:48] cloud-services-team, Toolforge: `toolforge jobs dump` fails for tools.stewardsbot - https://phabricator.wikimedia.org/T396210 (taavi) NEW
[11:46:11] cloud-services-team, Toolforge: `toolforge jobs dump` fails for tools.stewardsbot - https://phabricator.wikimedia.org/T396210#10890646 (taavi) The API was seemingly changed in https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/139. cc @Raymond_Ndibe @dcaro.
[11:46:16] cloud-services-team, Toolforge: `toolforge jobs dump` fails for tools.stewardsbot - https://phabricator.wikimedia.org/T396210#10890647 (taavi)
[11:46:19] Toolforge (Toolforge iteration 20), Patch-For-Review: [jobs-api] refactor models - https://phabricator.wikimedia.org/T389118#10890648 (taavi)
[12:05:29] (update) taavi: registry-admission: local: Exempt local-path-storage [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/795
[12:26:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-5 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:43:05] (update) taavi: registry-admission: local: Exempt local-path-storage [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/795
[12:43:06] (update) taavi: logging: Init component [repos/cloud/toolforge/toolforge-deploy] (main-Icb012f1ad81b582b65a569bb493095e12d3fbd72) - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/796 (https://phabricator.wikimedia.org/T386480)
[12:43:08] (update) taavi: logging: Add basic rate limiting and retention config [repos/cloud/toolforge/toolforge-deploy] (main-If8f503514316703ce91f966fb6ad40b04ef8fdd0) - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/807 (https://phabricator.wikimedia.org/T386480)
[12:43:09] (open) taavi: logging: Add basic rate limiting and retention config [repos/cloud/toolforge/toolforge-deploy] (main-If8f503514316703ce91f966fb6ad40b04ef8fdd0) - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/807 (https://phabricator.wikimedia.org/T386480)
[12:43:15] (update) taavi: registry-admission: local: Exempt local-path-storage [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/795
[12:43:19] (update) taavi: logging: Add basic rate limiting and retention config [repos/cloud/toolforge/toolforge-deploy] (main-If8f503514316703ce91f966fb6ad40b04ef8fdd0) - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/807 (https://phabricator.wikimedia.org/T386480)
[12:43:22] (update) taavi: logging: Init component [repos/cloud/toolforge/toolforge-deploy] (main-Icb012f1ad81b582b65a569bb493095e12d3fbd72) - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/796 (https://phabricator.wikimedia.org/T386480)
[13:07:16] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T396189#10890837 (Curb_Safe_Charmer) Hi @Referentis - are you still experiencing this?
[13:10:50] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T396189#10890843 (Referentis) Hi @Curb_Safe_Charmer, I've just checked and it's now working again – thank you.
[13:11:30] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T396189#10890844 (Curb_Safe_Charmer) Open→Resolved a:Curb_Safe_Charmer
[13:12:01] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T396189#10890848 (Curb_Safe_Charmer) I didn't do anything; problem seems to have gone away of its own accord.
[13:44:31] cloud-services-team, Toolforge: 2025-06-06 Toolforge NFS cleanup - https://phabricator.wikimedia.org/T396220#10890971 (taavi)
[13:52:30] Tools: Reduce disk space usage for 'medwiki' database backups - https://phabricator.wikimedia.org/T396222 (taavi) NEW
[13:54:59] cloud-services-team, Toolforge: 2025-05-22 Toolforge NFS cleanup - https://phabricator.wikimedia.org/T395000#10891016 (taavi)
[14:00:17] Tools: Automatically cleanup old logs from 'paste' tool - https://phabricator.wikimedia.org/T396224 (taavi) NEW
[14:06:00] Grid-Engine-to-K8s-Migration, Tools, All-and-every-Wikisource: Migrate phetools from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319965#10891075 (taavi) Hello folks - this tool has a cache directory (`/data/project/phetools/cache`) that's over 100G in size. Is th...
[14:06:08] cloud-services-team, Toolforge: 2025-06-06 Toolforge NFS cleanup - https://phabricator.wikimedia.org/T396220#10891077 (taavi) T319965#10891075
[14:11:44] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-06 - https://phabricator.wikimedia.org/T396199#10891095 (fnegri) My understanding is after my `KILL`, the replication thread just retried replicating the same transaction. I'm not sure if there's a...
[14:12:44] Tool-fault-tolerance: Low priority: new elastic hosts not showing in web UI - https://phabricator.wikimedia.org/T390902#10891099 (Ladsgroup) Open→Resolved a:Ladsgroup done.
[14:30:36] cloud-services-team, Cloud-VPS, Patch-For-Review: openstack: consider removing labs-ip-aliaser - https://phabricator.wikimedia.org/T374129#10891164 (Andrew) If we make this change, users will have to adjust firewalls on floating-IP-attached VMs in two ways: - They'll have to allow traffic originatin...
[14:46:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-5 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[14:59:41] cloud-services-team, Cloud-VPS, Patch-For-Review: openstack: consider removing labs-ip-aliaser - https://phabricator.wikimedia.org/T374129#10891296 (Andrew) After quite a lot of discussion and testing, @taavi has pointed out a real problem with switching to direct traffic from a VM to a floating IP,...
[15:01:47] Tools: zoomviewer uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T395020#10891317 (taavi) The `/srv/tools/project/zoomviewer` directory has grown by about a terabyte in the two weeks since I filed this task. Is the cleanup script working as expected? Can I help with this somehow?
[15:02:19] Tools: zoomviewer uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T395020#10891318 (taavi)
[15:02:20] cloud-services-team, Toolforge: 2025-06-06 Toolforge NFS cleanup - https://phabricator.wikimedia.org/T396220#10891319 (taavi)
[15:02:24] cloud-services-team, Cloud-VPS, Patch-For-Review: openstack: consider removing labs-ip-aliaser - https://phabricator.wikimedia.org/T374129#10891320 (Andrew) Open→Stalled
[15:03:38] cloud-services-team, Cloud-VPS, Patch-For-Review: openstack: consider removing labs-ip-aliaser - https://phabricator.wikimedia.org/T374129#10891322 (Andrew) If/when we make that change, here's a bit of wikitext for the announcement page: ` == Summary == Currently most cloud-internal traffic using...
[15:05:37] cloud-services-team, Toolforge: `toolforge jobs dump` fails for tools.stewardsbot - https://phabricator.wikimedia.org/T396210#10891327 (Raymond_Ndibe) thanks @taavi for pointing this out. It was my fault to not change the cli as well. Easiest thing to do here would be to change the cli, but maybe getting...
[15:08:21] (Abandoned) Andrew Bogott: Specify project with --project rather than in the environment [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/923605 (owner: Andrew Bogott)
[15:08:41] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-06 - https://phabricator.wikimedia.org/T396199#10891351 (fnegri) `systemctl stop mariadb` was stuck, so after 20 minutes I killed the process with `kill -9`. Even after restarting MariaDB, `STOP R...
[15:10:05] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-5 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:10:34] cloud-services-team, Cloud-VPS: SystemdUnitDown The systemd unit opentofu-infra-diff.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T396187#10891358 (Andrew) Open→Resolved this was for an experiment, now done. I've reverted the q...
[15:14:32] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-5
[15:20:05] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-5 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:20:26] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-5
[15:56:30] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-06 - https://phabricator.wikimedia.org/T396199#10891547 (fnegri) In progress→Resolved Replication is back in sync! {F61815743}
[16:04:00] cloud-services-team, Toolforge, Tools: Flickr blocking image requests from Toolforge k8s, breaking multiple tools - https://phabricator.wikimedia.org/T384468#10891585 (Andrew) Stalled→Resolved ` Hi Andrew, Becki from Flickr Support here again. Our engineers have made some adjustments...
[16:11:36] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10891620 (Andrew) Sorry @Jclark-ctr, I've made a bit of a mess of this. Ideally each of these hosts would have 2x25G connections, each connected to a cl...
[16:28:55] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10891644 (Andrew) Looks like I'm getting ahead of things a bit. We definitely do need 2 connections per host, but it's unclear on if we're skipping to 25...
[16:31:14] Toolforge (Toolforge iteration 20): [jobs-api] expose health_check.type deprication metrics - https://phabricator.wikimedia.org/T396236 (Raymond_Ndibe) NEW
[16:31:57] Toolforge (Toolforge iteration 20): [jobs-api] expose health_check.type deprecation metrics - https://phabricator.wikimedia.org/T396236#10891662 (Raymond_Ndibe)
[16:41:03] cloud-services-team, Cloud-VPS, VPS-Project-wikicommunityhealth: [cinder] Volume failing to attach/detach - https://phabricator.wikimedia.org/T392089#10891686 (Andrew) Hi @CristianCantoro, I neglected this ticket for long enough that I'm no longer feeling confident about cleanup. Can you delete any u...
[16:50:18] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910#10891695 (10Andrew) 05Open→03Stalled p:05Triage→03Medium [16:53:55] 10Toolforge (Quota-requests): Elasticsearch credential request for sdzerobot - https://phabricator.wikimedia.org/T396237 (10SD0001) 03NEW [16:54:11] (03open) 10raymond-ndibe: [api.jobs] health_check.type deprecation patch [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/173 (https://phabricator.wikimedia.org/T396210 https://phabricator.wikimedia.org/T396236) [16:54:40] (03open) 10raymond-ndibe: [cli] use health_check_type [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/105 (https://phabricator.wikimedia.org/T396210) [16:54:59] (03update) 10raymond-ndibe: [api.jobs] health_check.type deprecation patch [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/173 (https://phabricator.wikimedia.org/T396210 https://phabricator.wikimedia.org/T396236) [17:01:31] FIRING: ToolsNfsAlmostFull: Toolforge NFS is 85.21% full - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNfsAlmostFull - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsNfsAlmostFull [17:18:12] (03approved) 10raymond-ndibe: [cli] use health_check_type [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/105 (https://phabricator.wikimedia.org/T396210) [17:18:15] (03update) 10raymond-ndibe: [cli] use health_check_type [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/105 (https://phabricator.wikimedia.org/T396210) [17:36:56] (03update) 10raymond-ndibe: [api.jobs] health_check.type deprecation patch [repos/cloud/toolforge/jobs-api] - 
10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/173 (https://phabricator.wikimedia.org/T396210 https://phabricator.wikimedia.org/T396236) [17:38:44] (03update) 10raymond-ndibe: [cli] use health_check_type [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/105 (https://phabricator.wikimedia.org/T396210) [17:46:18] (03merge) 10raymond-ndibe: [cli] use health_check_type [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/105 (https://phabricator.wikimedia.org/T396210) [17:50:03] 06cloud-services-team, 10Phabricator, 07Developer Productivity, 10Release-Engineering-Team (Seen), 07SecTeam-Processed: Some very specific Maniphest search queries by RelEng, Sec Team and WMCS are global and shown for all users - https://phabricator.wikimedia.org/T214579#10892076 (10A_smart_kitten) Gentl... [17:52:09] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: jobs-api: bump to 0.0.380-20250602174717-b1b0f757 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/808 [18:04:21] (03open) 10raymond-ndibe: d/changelog: bump to 16.1.13 [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/106 (https://phabricator.wikimedia.org/T396210) [18:05:40] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [18:10:03] !log raymond-ndibe@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli [18:11:17] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [18:15:08] !log raymond-ndibe@cloudcumin1001 toolsbeta END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component jobs-cli [18:15:12] !log 
raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [18:19:36] !log raymond-ndibe@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli [18:19:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:20:13] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [18:27:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-74 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [18:33:38] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli [18:33:50] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [18:44:36] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli [18:44:49] (03update) 10raymond-ndibe: d/changelog: bump to 16.1.13 [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/106 (https://phabricator.wikimedia.org/T396210) [18:44:52] (03approved) 10raymond-ndibe: d/changelog: bump to 16.1.13 [repos/cloud/toolforge/jobs-cli] - 
10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/106 (https://phabricator.wikimedia.org/T396210) [18:44:56] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:44:59] (03merge) 10raymond-ndibe: d/changelog: bump to 16.1.13 [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/106 (https://phabricator.wikimedia.org/T396210) [18:45:39] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: `toolforge jobs dump` fails for tools.stewardsbot - https://phabricator.wikimedia.org/T396210#10892248 (10Raymond_Ndibe) a:03Raymond_Ndibe [18:46:04] 06cloud-services-team, 10Toolforge (Toolforge iteration 20), 13Patch-For-Review: `toolforge jobs dump` fails for tools.stewardsbot - https://phabricator.wikimedia.org/T396210#10892250 (10Raymond_Ndibe) [18:46:14] 06cloud-services-team, 10Toolforge (Toolforge iteration 20), 13Patch-For-Review: `toolforge jobs dump` fails for tools.stewardsbot - https://phabricator.wikimedia.org/T396210#10892252 (10Raymond_Ndibe) 05Open→03In progress [18:46:18] 10Toolforge (Toolforge iteration 20), 13Patch-For-Review: [jobs-api] expose health_check.type deprecation metrics - https://phabricator.wikimedia.org/T396236#10892255 (10Raymond_Ndibe) 05Open→03In progress [18:46:24] 06cloud-services-team, 10Toolforge (Toolforge iteration 20), 13Patch-For-Review: `toolforge jobs dump` fails for tools.stewardsbot - https://phabricator.wikimedia.org/T396210#10892257 (10Raymond_Ndibe) 05In progress→03Resolved [18:48:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1007. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:53:04] (03update) 10raymond-ndibe: [api.jobs] health_check.type deprecation patch [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/173 (https://phabricator.wikimedia.org/T396210 https://phabricator.wikimedia.org/T396236) [19:03:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:13:56] RESOLVED: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:15:56] FIRING: SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:20:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:30:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:35:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:45:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:50:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:00:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:05:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:15:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:20:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:24:45] Grid-Engine-to-K8s-Migration, Tools, All-and-every-Wikisource: Migrate phetools from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319965#10892490 (Xover) >>! In T319965#10891075, @taavi wrote: > Hello folks - this tool has a cache directory (`/data/project/phetool... [20:27:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-54 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [20:30:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:35:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:38:13] Tool-openstack-browser: openstack-browser timing out trying to fetch dns zones in multiple projects - https://phabricator.wikimedia.org/T396256 (bd808) NEW [20:45:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:50:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:00:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:02:22] FIRING: [3x] HAProxyBackendUnavailable: HAProxy service designate-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [21:03:22] FIRING: HAProxyServiceUnavailable: HAProxy service designate-api_backend has no available backends on cloudlb1002:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable [21:03:35] cloud-services-team: HAProxyServiceUnavailable HAProxy service designate-api_backend has no available backends on cloudlb1002:9900 - https://phabricator.wikimedia.org/T396257 (phaultfinder) NEW [21:04:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [21:04:29] Tool-openstack-browser: openstack-browser timing out trying to fetch dns zones in multiple projects - https://phabricator.wikimedia.org/T396256#10892558 (Andrew) Open→Resolved a:Andrew This is not very satisfying but restarting designate services seems to have resolved this. Let's see if it a... [21:05:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:07:22] RESOLVED: [3x] HAProxyBackendUnavailable: HAProxy service designate-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [21:08:22] RESOLVED: HAProxyServiceUnavailable: HAProxy service designate-api_backend has no available backends on cloudlb1002:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable [21:15:56] RESOLVED: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:19:26] FIRING: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. 
HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown [21:19:26] FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown [21:23:49] (open) bd808: channels: Send some bugs to #wikimedia-zuul [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/56 [21:28:09] FIRING: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown [21:28:31] FIRING: TektonUpMetricUnknown: Tekton might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonUpMetricUnknown [21:28:35] FIRING: ToolforgeKubernetesNodeNotReady: Multiple Kubernetes nodes are not ready #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady [21:28:37] FIRING: JobsApiUpMetricUnknown: JobsApi might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsApiUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsApiUpMetricUnknown [21:28:38] FIRING: BuildsApiUpMetricUnknown: BuildsApi might be down -
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/BuildsApiUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DBuildsApiUpMetricUnknown [21:28:46] FIRING: EnvvarsAdmissionDown: EnvvarsAdmission is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/EnvvarsAdmissionDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DEnvvarsAdmissionDown [21:30:10] FIRING: JobsEmailerUpMetricUnknown: JobsEmailer might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsEmailerUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsEmailerUpMetricUnknown [21:34:09] FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown [21:34:30] FIRING: EnvvarsApiUpMetricUnknown: EnvvarsApi might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/EnvvarsApiUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DEnvvarsApiUpMetricUnknown [21:34:33] FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown [21:37:57] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-54, tools-k8s-worker-nfs-74 [21:38:41] FIRING: CloudVPSDesignateLeaks: Detected 28 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:41:49] cloud-services-team, Cloud-VPS, Beta-Cluster-Infrastructure: Consider setting up an https://github.com/knyar/phalerts instance in
metricsinfra - https://phabricator.wikimedia.org/T394446#10892658 (bd808) @wmcs-alerts is the Phab bot that will create the tasks for this. [21:47:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-54 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [21:48:46] RESOLVED: EnvvarsAdmissionDown: EnvvarsAdmission is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/EnvvarsAdmissionDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DEnvvarsAdmissionDown [21:49:26] RESOLVED: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown [21:49:26] RESOLVED: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state.
HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown [21:49:37] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-54, tools-k8s-worker-nfs-74 [21:50:10] RESOLVED: JobsEmailerUpMetricUnknown: JobsEmailer might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsEmailerUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsEmailerUpMetricUnknown [21:53:09] RESOLVED: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown [21:53:31] RESOLVED: TektonUpMetricUnknown: Tekton might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonUpMetricUnknown [21:53:35] RESOLVED: ToolforgeKubernetesNodeNotReady: Multiple Kubernetes nodes are not ready #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady [21:53:37] RESOLVED: JobsApiUpMetricUnknown: JobsApi might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsApiUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsApiUpMetricUnknown [21:53:38] RESOLVED: BuildsApiUpMetricUnknown: BuildsApi might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/BuildsApiUpMetricUnknown - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DBuildsApiUpMetricUnknown [21:54:09] RESOLVED: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown [21:54:30] RESOLVED: EnvvarsApiUpMetricUnknown: EnvvarsApi might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/EnvvarsApiUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DEnvvarsApiUpMetricUnknown [21:54:33] RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown [21:57:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-74 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [22:05:30] RESOLVED: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [22:18:41] RESOLVED: CloudVPSDesignateLeaks: Detected 28 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:22:45] (update) bd808: channels: Send some
bugs to #wikimedia-zuul [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/56 [22:27:05] (approved) bd808: channels: Send some bugs to #wikimedia-zuul [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/56 [22:27:10] (merge) bd808: channels: Send some bugs to #wikimedia-zuul [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/56 [22:29:15] (close) bd808: tests: Update for Phorge project rename [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/55 (owner: taavi) [22:31:30] (update) bd808: gitlab: Ignore Gerritlab artifacts in target branch [toolforge-repos/wikibugs2] (main-I4473512c1e65dd70208244c51c0cfffba390c37a) - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/54 (owner: taavi) [22:31:57] (update) bd808: gitlab: Ignore Gerritlab artifacts in target branch [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/54 (owner: taavi) [22:40:05] (approved) bd808: gitlab: Ignore Gerritlab artifacts in target branch [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/54 (owner: taavi) [23:31:21] cloud-services-team, Phabricator, Developer Productivity, Release-Engineering-Team (Seen), SecTeam-Processed: Some very specific Maniphest search queries by RelEng, Sec Team and WMCS are global and shown for all users - https://phabricator.wikimedia.org/T214579#10892794 (thcipriani) >>! In T2...
[23:34:49] cloud-services-team, Phabricator, Developer Productivity, Release-Engineering-Team (Seen), SecTeam-Processed: Some very specific Maniphest search queries by RelEng, Sec Team and WMCS are global and shown for all users - https://phabricator.wikimedia.org/T214579#10892795 (thcipriani) Also, thi...