[00:00:59] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11033383 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm executed with...
[01:31:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[02:05:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures
[02:05:08] cloud-services-team: NovafullstackSustainedFailures Novafullstack tests have been failing for more than 5hours in eqiad - https://phabricator.wikimedia.org/T400432 (phaultfinder) NEW
[02:41:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[03:05:00] RESOLVED: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures
[03:14:20] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [labs/tools/weapon-of-mass-description] - https://gerrit.wikimedia.org/r/1172307 (owner: L10n-bot)
[03:14:36] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [labs/tools/massmailer] - https://gerrit.wikimedia.org/r/1172304 (owner: L10n-bot)
[03:15:05] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [labs/tools/map-of-monuments] - https://gerrit.wikimedia.org/r/1172305 (owner: L10n-bot)
[03:15:10] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [labs/tools/commons-mass-description] - https://gerrit.wikimedia.org/r/1172303 (owner: L10n-bot)
[07:08:42] cloud-services-team, Data-Services: [wikireplicas] Views flaggedpage_pending and flaggedtemplates are broken - https://phabricator.wikimedia.org/T368939#11033687 (Pppery) Anything left to do here?
[07:10:51] cloud-services-team, Data-Services: Denormalize user_groups to contain actor information - https://phabricator.wikimedia.org/T238497#11033695 (Pppery)
[09:47:21] (update) vriaa: Draft: Basic banner implementation [toolforge-repos/centralnotice-banner-editor] - https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/1
[09:56:38] cloud-services-team, Data-Services: [wikireplicas] Automatically check for missing tables - https://phabricator.wikimedia.org/T378470#11034074 (fnegri) p:Medium→Low I think I would still like to have a list of "partially public" tables that are missing in the replicas. But now that the public tab...
[10:14:21] cloud-services-team, Data-Services: [wikireplicas] Views flaggedpage_pending and flaggedtemplates are broken - https://phabricator.wikimedia.org/T368939#11034170 (fnegri) Open→Resolved a:fnegri @Pppery sorry, this task slipped through the cracks. We no longer need to remove those tables f...
[14:03:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[14:12:37] cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE-OnFire, Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11034774 (Andrew) I broke the cluster again, but now it's working. The main thing I did was a version...
[14:13:27] Tool-inteGraality: Retrieving labels via SPARQL tanks query performance - https://phabricator.wikimedia.org/T400480 (JeanFred) NEW
[14:14:52] Tool-inteGraality: Retrieving labels via SPARQL tanks query performance - https://phabricator.wikimedia.org/T400480#11034801 (JeanFred) One potential idea: using subqueries: ` SELECT ?grouping ?higher_grouping ?grouping_link_value (COUNT(DISTINCT ?entity) as ?count) WITH { SELECT ?grouping (SAMPLE(?_highe...
[14:33:43] Cloud-VPS (Project-requests): Request creation of SimpleProject VPS project - https://phabricator.wikimedia.org/T400482 (0000abcd1234) NEW
[14:36:30] cloud-services-team, Cloud-VPS: Neutron metadata service failing for all VMs - https://phabricator.wikimedia.org/T395742#11034860 (Andrew) The fix for T395255 did not resolve the intermittent crashes here.
[14:47:16] Cloud-VPS (Project-requests): Request creation of SimpleProject VPS project - https://phabricator.wikimedia.org/T400482#11034880 (Aklapper) Open→Declined a:0000abcd1234→None Hi, the purpose is too broad and the project name is vague. We generally do not grant Cloud VPS projects for single...
[15:03:33] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-26 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[15:25:16] PAWS: /home/paws is 100% - https://phabricator.wikimedia.org/T396051#11035053 (Andrew) I now have lists of large home directories (more than 1G usage total) that have no date stamps after 2021. Is there any reason to not just delete all of those?
[15:28:33] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-26 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[15:34:29] RESOLVED: NfsAlmostFull: The NFS drive is over 85% capacity (currently 85.68%) at host paws-nfs-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull
[15:43:33] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-26 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[16:18:33] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[17:50:29] Tool-globalcontribution: Check if timeout error for bulk requests - https://phabricator.wikimedia.org/T382658#11035537 (Gnoeee) Open→Resolved
[18:38:33] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[18:39:04] cloud-services-team, Toolforge: tools-login intermittently has broken networking? - https://phabricator.wikimedia.org/T400502 (DamianZaremba) NEW
[18:40:38] cloud-services-team, Toolforge: tools-login intermittently has broken networking? - https://phabricator.wikimedia.org/T400502#11035728 (DamianZaremba) And here is a traceroute when working ` traceroute to login.tools.wmflabs.org (185.15.56.57), 64 hops max, 40 byte packets 1 172.16.0.254 (172.16.0.254)...
[19:12:32] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11035837 (VRiley-WMF) While attempting to image this server (clouddb1022) and got this error. {F65673966}
[19:23:33] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[19:50:46] cloud-services-team, Cloud-VPS, VPS-Projects, Catalyst: metricsinfra: send alerts for the catalyst project to catalyst@w.o email - https://phabricator.wikimedia.org/T386416#11035924 (thcipriani) >>! In T386416#11027704, @taavi wrote: > This doesn't seem to have ever worked; the notification email...
[20:01:41] (update) vriaa: Draft: Basic banner implementation [toolforge-repos/centralnotice-banner-editor] - https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/1
[20:03:47] cloud-services-team, Data-Services: Denormalize user_groups to contain actor information - https://phabricator.wikimedia.org/T238497#11035959 (Bugreporter) Such change is meaningless as long as copy in cloud replica are just views instead of real copy (or materialized views) - so any queries on such "den...
[20:05:55] (update) vriaa: Draft: Basic banner implementation [toolforge-repos/centralnotice-banner-editor] - https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/1
[20:08:29] cloud-services-team, Toolforge: tools-login intermittently has broken networking? - https://phabricator.wikimedia.org/T400502#11035964 (DamianZaremba) Hung again in the middle of typing ` traceroute to login.tools.wmflabs.org (185.15.56.57), 64 hops max, 40 byte packets 1 172.16.0.254 (172.16.0.254) 6...
[20:55:16] cloud-services-team, Cloud-VPS, Toolforge: Block web crawlers from accessing Cloud Services - https://phabricator.wikimedia.org/T226688#11036017 (MusikAnimal) I agree robots.txt is useless. I have had everything blocked for years and it doesn't stop anything: https://xtools.wmcloud.org/robots.txt I...
[21:18:33] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[21:25:29] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance toolsbeta-harbor-2 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[21:28:33] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[22:13:33] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess