[00:23:14] FIRING: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown
[00:23:15] FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[00:31:19] FIRING: TektonUpMetricUnknown: Tekton might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonUpMetricUnknown
[00:31:25] FIRING: JobsApiUpMetricUnknown: JobsApi might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsApiUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsApiUpMetricUnknown
[00:31:26] FIRING: BuildsApiUpMetricUnknown: BuildsApi might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/BuildsApiUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DBuildsApiUpMetricUnknown
[00:31:57] FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[00:31:57] FIRING: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[00:31:58] FIRING: JobsEmailerUpMetricUnknown: JobsEmailer might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsEmailerUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsEmailerUpMetricUnknown
[00:32:18] FIRING: EnvvarsApiUpMetricUnknown: EnvvarsApi might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/EnvvarsApiUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DEnvvarsApiUpMetricUnknown
[00:32:21] FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[00:32:23] FIRING: ToolforgeKubernetesNodeNotReady: Multiple Kubernetes nodes are not ready #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[00:32:34] FIRING: EnvvarsAdmissionDown: EnvvarsAdmission is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/EnvvarsAdmissionDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DEnvvarsAdmissionDown
[01:26:55] Tool-campwiz-nxt, translatewiki.net, LPL Essential (LPL Essential 2025 Apr-Jun: CX), Unplanned-Sprint-Work: Add CampWiz NXT to translatewiki.net - https://phabricator.wikimedia.org/T393850#10892859 (Nokib_Sarkar) Hi, any updates regarding when the translation thing might start? I wanted to add a...
[05:47:04] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[06:27:04] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[09:25:34] FIRING: DiskSpace: Disk space cloudbackup1004:9100:/srv 5.992% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:39:16] (open) lucaswerkmeister: Don’t show tool name in element [toolforge-repos/fourohfour] - https://gitlab.wikimedia.org/toolforge-repos/fourohfour/-/merge_requests/13
[12:43:58] (open) lucaswerkmeister: Use toolforge envvars if available [toolforge-repos/python-toolforge] - https://gitlab.wikimedia.org/toolforge-repos/python-toolforge/-/merge_requests/27 (https://phabricator.wikimedia.org/T339940)
[12:50:28] Tool-python-toolforge, Patch-For-Review: Support reading Wiki Replica/ToolsDB credentials from envvars - https://phabricator.wikimedia.org/T339940#10893114 (LucasWerkmeister) a:LucasWerkmeister
[13:25:34] FIRING: DiskSpace: Disk space cloudbackup1004:9100:/srv 5.393% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[14:14:58] cloud-services-team, Phabricator, Developer Productivity, Release-Engineering-Team (Seen), SecTeam-Processed: Some very specific Maniphest search queries by RelEng, Sec Team and WMCS are global and shown for all users - https://phabricator.wikimedia.org/T214579#10893142 (A_smart_kitten) >>! I...
[15:08:04] RESOLVED: DiskSpace: Disk space cloudbackup1004:9100:/srv 5.965% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:53:33] cloud-services-team, Toolforge: [toolsbeta,tofu,infra] There's some discrepancy between the volumes in toolsbeta and tofu - https://phabricator.wikimedia.org/T396276 (dcaro) NEW
[17:01:32] FIRING: ToolsNfsAlmostFull: Toolforge NFS is 86.44% full - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNfsAlmostFull - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsNfsAlmostFull
[17:26:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures
[18:44:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for service: project,nova
[18:49:36] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for service: project,nova
[18:53:49] FIRING: [24x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[18:55:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services
[19:08:00] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for all services
[19:20:19] RESOLVED: [24x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[19:38:41] FIRING: CloudVPSDesignateLeaks: Detected 19 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:49:36] cloud-services-team, Cloud-VPS: Nova metadata service failing for all VMs - https://phabricator.wikimedia.org/T395742#10893289 (Andrew) Resolved→Open This is happening again, going to see if it's resolved the same way.
[19:52:09] cloud-services-team, Cloud-VPS: Nova metadata service failing for all VMs - https://phabricator.wikimedia.org/T395742#10893302 (Andrew) Actually, restarting the neutron metadata agent seems to have done the trick: `root@cloudnet1006:~# systemctl restart neutron-metadata-agent.service`
[20:18:41] RESOLVED: CloudVPSDesignateLeaks: Detected 19 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[21:06:30] RESOLVED: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures