[01:19:28] <jinxer-wm>	 FIRING: NodeTextfileStale: Stale textfile for cloudcontrol2005-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:24:28] <jinxer-wm>	 RESOLVED: [2x] NodeTextfileStale: Stale textfile for cloudcontrol2005-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:29:06] <wikibugs>	 cloud-services-team, Toolforge: Toolforge bastion sssd/LDAP flakiness (May 2025) - https://phabricator.wikimedia.org/T393732#10809686 (Andrew) This long thread is related to the behavior we're seeing, although it's not identical:  https://github.com/SSSD/sssd/issues/6219  The one suggestion there that se...
[02:31:09] <wikibugs>	 cloud-services-team, Toolforge: Toolforge bastion sssd/LDAP flakiness (May 2025) - https://phabricator.wikimedia.org/T393732#10809687 (Andrew) We also badly need metrics on our ldap servers (rw and ro) -- it would be nice to know if these outages correspond to high ldap traffic. As best I can tell we are...
[07:54:28] <wmcs-alerts>	 FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-bastion-13 in project tools   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[08:21:32] <wmcs-alerts>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate Puppet CA: project-proxy-puppetmaster-01.project-proxy.eqiad.wmflabs is about to expire in 14d 18h 14m 54s - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetCertificateAboutToExpire  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:22:29] <wikibugs>	 cloud-services-team, Toolforge: Toolforge bastion sssd/LDAP flakiness (May 2025) - https://phabricator.wikimedia.org/T393732#10809817 (Andrew) If the problem is ldap responsiveness, why does   ` watch -e ldapsearch -x uid=andrew `  never show any errors? Has anyone else gotten direct evidence of ldap fai...
[13:34:28] <wmcs-alerts>	 RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-bastion-13 in project tools   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[13:37:24] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for service: project,nova
[13:42:45] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for service: project,nova
[13:44:58] <wmcs-alerts>	 FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-bastion-13 in project tools   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[14:02:22] <wikibugs>	 cloud-services-team, Toolforge: Toolforge bastion sssd/LDAP flakiness (May 2025) - https://phabricator.wikimedia.org/T393732#10809830 (Fnielsen) Sometimes I can log into Toolforge. When I then try `become`, I get `sudo: a password is required`.
[14:09:58] <wmcs-alerts>	 RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-bastion-13 in project tools   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[14:15:28] <wmcs-alerts>	 FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-bastion-13 in project tools   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[14:18:29] <wikibugs>	 cloud-services-team, Toolforge: Toolforge bastion sssd/LDAP flakiness (May 2025) - https://phabricator.wikimedia.org/T393732#10809837 (Andrew) ` Restart=always `  doesn't seem to help.  I also migrated the host to a different less-busy cloudvirt, which also doesn't seem to have helped.  Now I'm trying to...
[14:55:03] <wmcs-alerts>	 FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-79 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:00:03] <wmcs-alerts>	 RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-79 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:06:43] <wikibugs>	 cloud-services-team, Toolforge, Patch-For-Review: Toolforge bastion sssd/LDAP flakiness (May 2025) - https://phabricator.wikimedia.org/T393732#10809853 (Andrew) Here's a new theory to consider: the problem is not the ldap server being slow, but toolforge ldap queries being slow because there are a zi...
[15:12:03] <wmcs-alerts>	 FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-73 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:27:03] <wmcs-alerts>	 FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[15:42:03] <wmcs-alerts>	 RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce
[15:55:17] <wikibugs>	 (update) ttaylor: Draft: Refactored project structure to add Python API to relay events [toolforge-repos/listen-to-wiki-changes] - https://gitlab.wikimedia.org/toolforge-repos/listen-to-wiki-changes/-/merge_requests/1
[16:03:03] <wmcs-alerts>	 FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-19 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[17:30:35] <wikibugs>	 cloud-services-team, Toolforge, Patch-For-Review: Toolforge bastion sssd/LDAP flakiness (May 2025) - https://phabricator.wikimedia.org/T393732#10809906 (bd808) 2025-04-24 was maybe the first of this series of problems. That failure was on the dev.toolforge.org bastion rather than the login.toolforge.o...
[20:10:23] <wikibugs>	 Tool-campwiz-nxt, translatewiki.net: Add CampWiz NXT to translatewiki.net - https://phabricator.wikimedia.org/T393850#10809999 (Nokib_Sarkar)
[20:37:00] <wikibugs>	 Tool-campwiz-nxt, translatewiki.net: Add CampWiz NXT to translatewiki.net - https://phabricator.wikimedia.org/T393850#10810003 (Nokib_Sarkar)
[20:43:00] <wikibugs>	 Tool-campwiz-nxt, translatewiki.net: Add CampWiz NXT to translatewiki.net - https://phabricator.wikimedia.org/T393850#10810004 (Nokib_Sarkar)