[01:19:28] FIRING: NodeTextfileStale: Stale textfile for cloudcontrol2005-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:24:28] RESOLVED: [2x] NodeTextfileStale: Stale textfile for cloudcontrol2005-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:29:06] cloud-services-team, Toolforge: Toolforge bastion sssd/LDAP flakiness (May 2025) - https://phabricator.wikimedia.org/T393732#10809686 (Andrew) This long thread relates to the behavior we're seeing, although it's not identical: https://github.com/SSSD/sssd/issues/6219 The one suggestion there that se...
[02:31:09] cloud-services-team, Toolforge: Toolforge bastion sssd/LDAP flakiness (May 2025) - https://phabricator.wikimedia.org/T393732#10809687 (Andrew) We also badly need metrics on our ldap servers (rw and ro) -- it would be nice to know if these outages correspond to high ldap traffic. As best I can tell we are...
[07:54:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-bastion-13 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[08:21:32] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate Puppet CA: project-proxy-puppetmaster-01.project-proxy.eqiad.wmflabs is about to expire in 14d 18h 14m 54s - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetCertificateAboutToExpire - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:22:29] cloud-services-team, Toolforge: Toolforge bastion sssd/LDAP flakiness (May 2025) - https://phabricator.wikimedia.org/T393732#10809817 (Andrew) If the problem is ldap responsiveness, why does `watch -e ldapsearch -x uid=andrew` never show any errors? Has anyone else gotten direct evidence of ldap fai...
[13:34:28] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-bastion-13 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[13:37:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for service: project,nova
[13:42:45] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for service: project,nova
[13:44:58] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-bastion-13 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[14:02:22] cloud-services-team, Toolforge: Toolforge bastion sssd/LDAP flakiness (May 2025) - https://phabricator.wikimedia.org/T393732#10809830 (Fnielsen) Sometimes I can look into Toolforge. When then trying `become` I get `sudo: a password is required`.
[14:09:58] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-bastion-13 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[14:15:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-bastion-13 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[14:18:29] cloud-services-team, Toolforge: Toolforge bastion sssd/LDAP flakiness (May 2025) - https://phabricator.wikimedia.org/T393732#10809837 (Andrew) `Restart=always` doesn't seem to help. I also migrated the host to a different less-busy cloudvirt, which also doesn't seem to have helped. Now I'm trying to...
[14:55:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-79 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:00:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-79 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:06:43] cloud-services-team, Toolforge, Patch-For-Review: Toolforge bastion sssd/LDAP flakiness (May 2025) - https://phabricator.wikimedia.org/T393732#10809853 (Andrew) Here's a new theory to consider: the problem is not the ldap server being slow, but toolforge ldap queries being slow because there are a zi...
[15:12:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-73 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:27:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[15:42:03] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce
[15:55:17] (update) ttaylor: Draft: Refactored project structure to add Python API to relay events [toolforge-repos/listen-to-wiki-changes] - https://gitlab.wikimedia.org/toolforge-repos/listen-to-wiki-changes/-/merge_requests/1
[16:03:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-19 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[17:30:35] cloud-services-team, Toolforge, Patch-For-Review: Toolforge bastion sssd/LDAP flakiness (May 2025) - https://phabricator.wikimedia.org/T393732#10809906 (bd808) 2025-04-24 was maybe the first of this series of problems. That failure was on the dev.toolforge.org bastion rather than the login.toolforge.o...
[20:10:23] Tool-campwiz-nxt, translatewiki.net: Add CampWiz NXT to translatewiki.net - https://phabricator.wikimedia.org/T393850#10809999 (Nokib_Sarkar)
[20:37:00] Tool-campwiz-nxt, translatewiki.net: Add CampWiz NXT to translatewiki.net - https://phabricator.wikimedia.org/T393850#10810003 (Nokib_Sarkar)
[20:43:00] Tool-campwiz-nxt, translatewiki.net: Add CampWiz NXT to translatewiki.net - https://phabricator.wikimedia.org/T393850#10810004 (Nokib_Sarkar)