[00:45:09] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-7 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[00:45:17] (update) raymond-ndibe: Draft: [maintain-kubeusers] kyverno do not validate DELETE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[02:20:56] (update) raymond-ndibe: Draft: [maintain-kubeusers] kyverno do not validate DELETE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[02:41:10] (update) raymond-ndibe: Draft: [maintain-kubeusers] kyverno do not validate DELETE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[03:15:55] (update) raymond-ndibe: Draft: [maintain-kubeusers] kyverno do not validate DELETE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[03:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[03:52:14] (update) raymond-ndibe: Draft: [maintain-kubeusers] kyverno do not validate DELETE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[03:52:48] (update) raymond-ndibe: Draft: [maintain-kubeusers] kyverno do not validate UPDATE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[03:53:05] (update) raymond-ndibe: [maintain-kubeusers] kyverno do not validate UPDATE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[04:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:07:49] (PS1) Elukey: requestctl: change comment for post_docroot.yaml [labs/private] - https://gerrit.wikimedia.org/r/1074849
[07:08:07] (CR) Elukey: [V:+2 C:+2] requestctl: change comment for post_docroot.yaml [labs/private] - https://gerrit.wikimedia.org/r/1074849 (owner: Elukey)
[08:11:27] FIRING: ProbeDown: Service virt.cloudgw.eqiad1.wikimediacloud.org:0 has failed probes (icmp_virt_cloudgw_eqiad1_wikimediacloud_org_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:11:31] cloud-services-team: ProbeDown virt.cloudgw.eqiad1.wikimediacloud.org:0 failed when probed by icmp_virt_cloudgw_eqiad1_wikimediacloud_org_ip4 from codfw. Availability is 50%. - https://phabricator.wikimedia.org/T375362 (phaultfinder) NEW
[08:15:06] cloud-services-team: ProbeDown virt.cloudgw.eqiad1.wikimediacloud.org:0 failed when probed by icmp_virt_cloudgw_eqiad1_wikimediacloud_org_ip4 from codfw. Availability is 50%. - https://phabricator.wikimedia.org/T375362#10166579 (aborrero) Open→Resolved a:aborrero something happened with the...
[08:16:27] RESOLVED: ProbeDown: Service virt.cloudgw.eqiad1.wikimediacloud.org:0 has failed probes (icmp_virt_cloudgw_eqiad1_wikimediacloud_org_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:42:15] Toolforge: restarting a continuous jobs causes for some seconds two jobs are running side by side - https://phabricator.wikimedia.org/T375366 (Wurgl) NEW
[08:53:34] wikitech.wikimedia.org, Gerrit, LDAP: Rename account Zoranzoki21 to Kizule on Gerrit - https://phabricator.wikimedia.org/T260647#10166724 (Kizule) Declined→Open I'm reopening this task per https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/5NBCVPPOXB4O3KI7B4YJB...
[09:18:31] Toolforge: restarting a continuous jobs causes for some seconds two jobs are running side by side - https://phabricator.wikimedia.org/T375366#10166755 (aborrero)
[09:19:58] Toolforge: restarting a continuous jobs causes for some seconds two jobs are running side by side - https://phabricator.wikimedia.org/T375366#10166754 (aborrero) Kubernetes creates the replacement pod as soon as the first pod enters termination state, without waiting for the first pod to actually disappear....
[09:20:44] Toolforge: restarting a continuous jobs causes for some seconds two jobs are running side by side - https://phabricator.wikimedia.org/T375366#10166757 (aborrero) Open→In progress p:Triage→Low
[09:30:52] Toolforge: restarting a continuous jobs causes for some seconds two jobs are running side by side - https://phabricator.wikimedia.org/T375366#10166775 (aborrero) I think this is the patch I'm proposing: ` diff --git a/tjf/runtimes/k8s/jobs.py b/tjf/runtimes/k8s/jobs.py index fd3e85c..22eff5d 100644 --- a/tj...
[09:32:53] (open) aborrero: jobs: continuous: set strategy based on number of replicas [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/124 (https://phabricator.wikimedia.org/T375366)
[09:34:44] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10166785 (elukey) Hi @dcaro! I know that you have been battling with some issues on cloud nod...
[09:37:29] Toolforge (Toolforge iteration 14), Patch-For-Review: restarting a continuous jobs causes for some seconds two jobs are running side by side - https://phabricator.wikimedia.org/T375366#10166793 (aborrero)
[09:43:25] (approved) dcaro: toolforge_depoly_mr: set the latest MR as default [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/191
[09:43:29] (merge) dcaro: toolforge_depoly_mr: set the latest MR as default [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/191
[09:48:32] RECOVERY - Host cloudvirt1063 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[09:49:04] PROBLEM - ensure kvm processes are running on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:51:24] cloud-services-team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10166816 (dcaro) >>! In T372814#10165304, @Jclark-ctr wrote: > @Andrew i see this ticket is in my name. is there something i need to do for this?...
[09:53:59] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375223#10166820 (fnegri) I restarted the server from the mgmt interface, I could ssh to it and check the syslog at the time of the crash. It's not much helpful but it's similar to the log entry...
[09:54:29] cloud-services-team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10166817 (dcaro) a:Jclark-ctr→dcaro
[10:06:05] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375223#10166885 (fnegri)
[10:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[12:21:44] (merge) aborrero: secgroups: add optional default security group [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/50 (https://phabricator.wikimedia.org/T375111)
[12:22:00] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[12:23:35] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan+apply for main branch
[12:24:14] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[12:26:30] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[12:27:10] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[12:28:41] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan+apply for main branch
[12:28:52] Tool-lexeme-forms: Lexeme-forms on Toolforge returns error - https://phabricator.wikimedia.org/T374344#10167261 (Fnielsen) Fine. I haven't seen the problem for a while.
[12:29:17] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[12:30:58] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan+apply for main branch
[12:34:58] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[12:35:37] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan+apply for main branch
[12:36:58] (open) aborrero: secgroups: codfw1dev-r_default: fix protocol casing [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/52
[12:38:21] (merge) aborrero: secgroups: codfw1dev-r_default: fix protocol casing [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/52
[12:38:28] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[12:39:07] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch
[12:41:55] (update) aborrero: secgroups: enable delete_default_rules [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/51 (https://phabricator.wikimedia.org/T375111)
[12:47:17] cloud-services-team, Cloud-VPS: tofu-infra: refactor repo structure - https://phabricator.wikimedia.org/T375283#10167307 (aborrero) Open→In progress p:Triage→Medium
[12:57:36] cloud-services-team, Cloud-VPS: tofu-infra: refactor repo structure - https://phabricator.wikimedia.org/T375283#10167348 (aborrero) I like the idea of refactoring the repo to be per-project. However, I'm not that sure about requiring a single tofu plan/apply per tenant. On the other hand, if it is the c...
[14:01:03] (open) raymond-ndibe: [toolforge.kyverno] update kubeVersion [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/528 (https://phabricator.wikimedia.org/T359641)
[14:02:20] (update) raymond-ndibe: [toolforge.kyverno] update kubeVersion [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/528 (https://phabricator.wikimedia.org/T359641)
[14:10:43] (update) dcaro: [jobs-cli] remove _display_messages [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/62 (owner: raymond-ndibe)
[14:10:48] (update) dcaro: [envvars-cli] remove display_messages [repos/cloud/toolforge/envvars-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/57 (owner: raymond-ndibe)
[14:10:51] (update) dcaro: [builds-cli] remove _display_messages [repos/cloud/toolforge/builds-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/69 (owner: raymond-ndibe)
[14:11:35] Tool-Global-user-contributions, Special:GlobalContributions, Stewards-and-global-tools, Epic, and 2 others: [Epic] Implement global contributions feature - https://phabricator.wikimedia.org/T337089#10167579 (KColeman-WMF)
[14:33:56] FIRING: SystemdUnitDown: The service unit wmf_auto_restart_virtlogd.service is in failed status on host cloudvirt1063. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:37:33] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_rack
[14:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:27:31] (merge) aborrero: secgroups: enable delete_default_rules [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/51 (https://phabricator.wikimedia.org/T375111)
[15:27:40] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[15:36:42] (PS1) David Caro: ceph.undrain_rack: undrain from different hosts in parallel [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1075042
[15:36:43] (PS1) David Caro: ceph.osd.undrain_rack: undrain osds from different hosts when able [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1075043
[15:37:02] (PS2) David Caro: ceph.undrain_rack: undrain from different hosts in parallel [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1075042
[15:40:18] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan+apply for main branch
[15:40:55] (open) aborrero: Revert "secgroups: enable delete_default_rules" [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/53 (https://phabricator.wikimedia.org/T375111)
[15:41:41] (merge) aborrero: Revert "secgroups: enable delete_default_rules" [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/53 (https://phabricator.wikimedia.org/T375111)
[15:42:07] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[15:43:00] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch
[15:43:08] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for main branch
[15:43:28] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for main branch
[15:47:50] Toolforge (Toolforge iteration 14): [jobs-api,jobs-cli] Support multiple replicas of continuous jobs - https://phabricator.wikimedia.org/T341066#10168072 (Raymond_Ndibe) In progress→Resolved
[15:48:18] (open) aborrero: secgroups: default: remove port ranges and fix protocol [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/54
[15:52:17] (merge) aborrero: secgroups: default: remove port ranges and fix protocol [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/54
[15:52:30] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for main branch
[15:52:54] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for main branch
[15:53:04] Striker: Concatenated URLs in toolinfo.json - https://phabricator.wikimedia.org/T345776#10168125 (TBurmeister) I think I just encountered this bug when looking at the record in Toolhub for https://toolhub.wikimedia.org/tools/toolforge-tool-watch and reading through https://phabricator.wikimedia.org/T341379#9...
[15:54:42] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[15:55:18] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch
[16:07:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[16:09:26] (approved) dcaro: [toolforge-weld] move _display_message into toolforge weld [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/46 (owner: raymond-ndibe)
[16:10:16] (update) dcaro: [toolforge-weld] move _display_message into toolforge weld [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/46 (owner: raymond-ndibe)
[16:10:44] cloud-services-team: update labtestwiki user and password - https://phabricator.wikimedia.org/T328289#10168243 (Ladsgroup) >>! In T328289#10157496, @fnegri wrote: > @Ladsgroup I stumbled upon this old task, not sure if it's still relevant, if yes I need more guidance :) It depends on what are the plans for...
[16:18:11] cloud-services-team: update labtestwiki user and password - https://phabricator.wikimedia.org/T328289#10168271 (fnegri) > Are there still use cases for it after removal of ldap from wikitech? From my understanding, labtestwiki should no longer be needed after we complete the removal of LDAP. /cc @bd808 who...
[16:23:14] (approved) dcaro: [toolforge.kyverno] update kubeVersion [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/528 (https://phabricator.wikimedia.org/T359641) (owner: raymond-ndibe)
[16:23:15] (update) dcaro: [toolforge.kyverno] update kubeVersion [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/528 (https://phabricator.wikimedia.org/T359641) (owner: raymond-ndibe)
[16:26:09] cloud-services-team: update labtestwiki user and password - https://phabricator.wikimedia.org/T328289#10168330 (fnegri) > or just for testing LDAP. To clarify, the only usage of labtestwikitech that I am aware of is to manage LDAP users in the [testing deployment for Cloud VPS](https://wikitech.wikimedia.or...
[16:28:25] Tool-video-answer-tool, Future-Audiences: FA community call video demo - https://phabricator.wikimedia.org/T374878#10168335 (Maryana) Goal: have 3 videos to show (in order to demonstrate range of topics this tool could cover). Looking for well-performing DYKs well-suited to a young audience, with >3 imag...
[16:28:56] FIRING: SystemdUnitDown: The systemd unit wmf_auto_restart_virtlogd.service on node cloudvirt1063 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:29:02] cloud-services-team: SystemdUnitDown Unit wmf_auto_restart_virtlogd.service on node cloudvirt1063 has been down for long. - https://phabricator.wikimedia.org/T375403 (phaultfinder) NEW
[16:31:58] (unapproved) dcaro: [toolforge-weld] move _display_message into toolforge weld [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/46 (owner: raymond-ndibe)
[16:48:08] cloud-services-team, wikitech.wikimedia.org, Epic: Set up a bitu instance for codfw1dev - https://phabricator.wikimedia.org/T360795#10168408 (Ladsgroup) FWIW, getting this deployed simplifies a lot of database stack see {T328289} for example.
[16:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:58:43] (approved) dcaro: [toolforge-weld] move _display_message into toolforge weld [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/46 (owner: raymond-ndibe)
[17:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[17:11:28] (Abandoned) David Caro: ceph.osd.undrain_rack: undrain osds from different hosts when able [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1075043 (owner: David Caro)
[17:13:23] (CR) David Caro: "@legoktm@debian.org is there anything I can help with to get this patch merged?" [labs/codesearch] - https://gerrit.wikimedia.org/r/1053538 (owner: David Caro)
[17:14:53] Tool-video-answer-tool, Future-Audiences, Spike: Investigate different options for animation of images - https://phabricator.wikimedia.org/T374367#10168494 (Maryana) Make a dev endpoint to deploy these changes
[17:18:04] Tool-video-answer-tool, Future-Audiences, Spike: Investigate different options for animation of images - https://phabricator.wikimedia.org/T374367#10168509 (Maryana) Ping @Maryana & Lucas on Slack when ready to get feedback
[17:28:51] Tool-video-answer-tool, Future-Audiences, Spike: Investigate options for pulling more relevant images for video - https://phabricator.wikimedia.org/T374557#10168593 (Maryana) Open→Resolved
[17:32:39] Tool-video-answer-tool, Future-Audiences: FA community call video demo - https://phabricator.wikimedia.org/T374878#10168639 (Maryana)
[17:32:45] Tool-video-answer-tool, Future-Audiences: FA community call video demo - https://phabricator.wikimedia.org/T374878#10168637 (Maryana) a:Maryana→None
[17:32:50] (PS1) Majavah: t5: Fix condition [labs/tools/majavah-bot] - https://gerrit.wikimedia.org/r/1075057
[17:33:40] Tool-video-answer-tool, Future-Audiences: Improvements to video server-side rendering - https://phabricator.wikimedia.org/T375408 (Maryana) NEW
[17:35:03] (CR) Majavah: [C:+2] t5: Fix condition [labs/tools/majavah-bot] - https://gerrit.wikimedia.org/r/1075057 (owner: Majavah)
[17:36:47] (Merged) jenkins-bot: t5: Fix condition [labs/tools/majavah-bot] - https://gerrit.wikimedia.org/r/1075057 (owner: Majavah)
[18:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[18:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:38:50] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_rack (exit_code=99)
[19:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[22:00:34] FIRING: DiskSpace: Disk space cloudbackup1004:9100:/srv 5.948% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:06:25] RECOVERY - ensure kvm processes are running on cloudvirt1063 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:08:25] PROBLEM - ensure kvm processes are running on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:21:21] Tool-Global-user-contributions, Special:GlobalContributions, Stewards-and-global-tools, Epic, and 2 others: [Epic] Implement global contributions feature - https://phabricator.wikimedia.org/T337089#10169562 (KColeman-WMF)
[22:23:43] Toolforge (Quota-requests): Request increased quota for video-answer-tool-staging Toolforge tool - https://phabricator.wikimedia.org/T375446 (etz) NEW
[23:05:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-65 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[23:10:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-65 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[23:15:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-65 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[23:20:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-65 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess