[00:14:36] RECOVERY - ensure kvm processes are running on cloudvirt1063 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:25:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-7 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[00:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:32:36] PROBLEM - ensure kvm processes are running on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:38:08] Cloud-VPS (Project-requests): Request creation of createwikitest VPS project - https://phabricator.wikimedia.org/T375454 (Xaloria) NEW
[01:22:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudcephosd2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:27:10] PROBLEM - Host cloudvirt1063 is DOWN: PING CRITICAL - Packet loss = 100%
[01:27:48] FIRING: [3x] PuppetZeroResources: Puppet has failed generate resources on cloudcephmon2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:31:47] FIRING: NodeDown: Cloudvirt node cloudvirt1063 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[01:31:54] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T375458 (phaultfinder) NEW
[01:32:48] FIRING: [4x] PuppetZeroResources: Puppet has failed generate resources on cloudcephmon2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:33:34] RECOVERY - Host cloudvirt1063 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[01:33:38] PROBLEM - ensure kvm processes are running on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[01:37:48] FIRING: [5x] PuppetZeroResources: Puppet has failed generate resources on cloudcephmon2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:47:48] FIRING: [6x] PuppetZeroResources: Puppet has failed generate resources on cloudcephmon2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:49:17] RESOLVED: NodeDown: Cloudvirt node cloudvirt1063 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[02:00:49] FIRING: DiskSpace: Disk space cloudbackup1004:9100:/srv 5.538% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[03:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:02:35] Toolforge: lighttpd does not record logs in $HOME/error.log - https://phabricator.wikimedia.org/T298322#10169949 (bd808) Open→Declined Closing for no known reproduction case.
[05:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:48:03] FIRING: [6x] PuppetZeroResources: Puppet has failed generate resources on cloudcephmon2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:01:04] FIRING: DiskSpace: Disk space cloudbackup1004:9100:/srv 5.135% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:11:47] Data-Services, DBA: Prepare and check storage layer for shnwikinews - https://phabricator.wikimedia.org/T375432#10169974 (ABran-WMF) a:ABran-WMF
[06:29:46] (CR) Sebastian Berlin (WMSE): "@agboreugene@gmail.com it would be good to have this merged and deployed to see if it alleviates the issue in https://phabricator.wikimedi" [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1055949 (https://phabricator.wikimedia.org/T367397) (owner: Sebastian Berlin (WMSE))
[07:28:14] (PS1) Slyngshede: Dummy Gitlab tokens for IDM. [labs/private] - https://gerrit.wikimedia.org/r/1075115 (https://phabricator.wikimedia.org/T359820)
[07:47:10] (CR) Slyngshede: [V:+2 C:+2] Dummy Gitlab tokens for IDM. [labs/private] - https://gerrit.wikimedia.org/r/1075115 (https://phabricator.wikimedia.org/T359820) (owner: Slyngshede)
[08:06:12] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375223#10170123 (dcaro) The server went unexpectedly down tonight at ~1am UTC (got a page): ` Sep 24 01:25:22 cloudvirt1063 systemd-logind[1392]: Power key pressed short. Sep 24 01:25:22 cloudv...
[08:06:12] cloud-services-team: SystemdUnitDown Unit wmf_auto_restart_virtlogd.service on node cloudvirt1063 has been down for long. - https://phabricator.wikimedia.org/T375403#10170124 (dcaro)
[08:06:52] cloud-services-team: 2024-09-24 NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375458#10170128 (dcaro)
[08:07:13] cloud-services-team: 2024-09-24 NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375458#10170129 (dcaro)
[08:07:59] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: 2024-09-21 NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375223#10170134 (dcaro)
[08:09:58] cloud-services-team: 2024-09-24 NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375458#10170146 (dcaro) This time there seems not to have been a thermal trip: ` dcaro@cloudvirt1063:~$ sudo ipmi-sel ID | Date | Time | Name | Type | Event 1 | Jan-11-...
[08:36:54] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: 2024-09-21 NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375223#10170256 (fnegri) > Sep 24 01:25:22 cloudvirt1063 systemd-logind[1392]: Power key pressed short. Was it maybe dcops folks powering down to check it? @dcaro Sorry about the p...
[08:54:00] (update) dcaro: tekton: upgrade to v0.60.2 [repos/cloud/toolforge/builds-builder] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/61
[09:10:25] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_rack
[09:10:26] !log dcaro@urcuchillay admin END (ERROR) - Cookbook wmcs.ceph.osd.undrain_rack (exit_code=97)
[09:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[09:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[09:11:47] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.undrain_rack
[09:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[09:18:46] cloud-services-team, VPS-Projects, Puppet (Puppet 7.0): Migrate per-project Puppet servers to Puppet 7 - https://phabricator.wikimedia.org/T351452#10170363 (MoritzMuehlenhoff)
[09:21:32] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, DC-Ops, ops-eqiad, SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10170362 (dcaro) Okok, let's take the 8 drives from cloudcephosd1025 on rack E4 to send them, let me drain it firs...
[09:28:03] (update) dcaro: tekton: upgrade to v0.60.2 [repos/cloud/toolforge/builds-builder] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/61
[09:30:18] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on cloudcephosd2003-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:44:06] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: Improve WMCS NodeDown alerts - https://phabricator.wikimedia.org/T375479 (fnegri) NEW
[09:52:40] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: Improve WMCS NodeDown alerts - https://phabricator.wikimedia.org/T375479#10170531 (dcaro) I think that the paging alert has a wrong regex, it should be `up{job="node", cluster="wmcs", instance=~"cloudvirt(-wdqs|local)1.*"} == 0`, it's a page because the...
[10:01:04] FIRING: DiskSpace: Disk space cloudbackup1004:9100:/srv 4.585% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[10:11:00] (update) dcaro: all: upgrade to tekton 0.60.X [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/111 (https://phabricator.wikimedia.org/T374908)
[10:11:53] (update) dcaro: all: upgrade to tekton 0.59.X LTS [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/111 (https://phabricator.wikimedia.org/T374908)
[11:10:48] FIRING: PuppetFailure: Puppet has failed on cloudcephosd1040:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:10:55] cloud-services-team: PuppetFailure Puppet failure on cloudcephosd1040:9100 - https://phabricator.wikimedia.org/T375484 (phaultfinder) NEW
[11:11:12] wikitech.wikimedia.org, Gerrit, LDAP: Rename account Zoranzoki21 to Kizule on Gerrit - https://phabricator.wikimedia.org/T260647#10170714 (Ladsgroup) Open→Declined Renaming shell/idm/gerrit accounts is out of the scope of wikitech SULification so I'm not sure reopening this ticket makes s...
[11:13:35] (PS1) Slyngshede: Dummy secrets for IDM account blocking. [labs/private] - https://gerrit.wikimedia.org/r/1075174
[11:20:48] FIRING: [2x] PuppetFailure: Puppet has failed on cloudcephosd1040:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:20:55] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T375485 (phaultfinder) NEW
[11:27:16] (update) raymond-ndibe: [maintain-kubeusers] kyverno do not validate UPDATE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[11:27:27] (CR) Slyngshede: [V:+2 C:+2] Dummy secrets for IDM account blocking. [labs/private] - https://gerrit.wikimedia.org/r/1075174 (owner: Slyngshede)
[11:55:22] (approved) aborrero: [maintain-kubeusers] kyverno do not validate UPDATE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157) (owner: raymond-ndibe)
[12:22:22] (open) aborrero: secgroups: enable delete_default_rules [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/55 (https://phabricator.wikimedia.org/T375111)
[12:28:26] (update) aborrero: secgroups: enable delete_default_rules [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/55 (https://phabricator.wikimedia.org/T375111)
[12:29:30] (update) aborrero: secgroups: enable delete_default_rules [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/55 (https://phabricator.wikimedia.org/T375111)
[12:43:03] (update) aborrero: jobs: continuous: set strategy based on number of replicas [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/124 (https://phabricator.wikimedia.org/T375366)
[12:46:51] (merge) aborrero: secgroups: enable delete_default_rules [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/55 (https://phabricator.wikimedia.org/T375111)
[12:46:54] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[12:46:55] (update) raymond-ndibe: [maintain-kubeusers] kyverno do not validate UPDATE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[12:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[12:53:28] (update) aborrero: secgroups: have a common description suffix string to all resources [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/49
[12:53:37] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch
[12:57:50] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[13:00:44] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[13:03:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudcephosd1039:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[13:09:59] (merge) aborrero: secgroups: have a common description suffix string to all resources [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/49
[13:10:21] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[13:14:29] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-prometheus-7 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[13:14:32] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch
[13:24:29] FIRING: [2x] PuppetAgentFailure: Puppet agent failure detected on instance tools-prometheus-6 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[13:30:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-prometheus-1 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[14:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:01:04] FIRING: DiskSpace: Disk space cloudbackup1004:9100:/srv 4.188% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[14:21:58] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-prometheus-1 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[14:33:36] (update) aborrero: cloudinfra-codfw1dev: ntp secgroup: fix service rules [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/48
[14:35:09] (merge) aborrero: cloudinfra-codfw1dev: ntp secgroup: fix service rules [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/48
[14:35:14] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[14:36:13] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.drain_node (T348643)
[14:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:36:18] T348643: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643
[14:36:32] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T348643)
[14:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:41:59] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance tools-prometheus-7 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[14:47:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudidm2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[15:07:08] (update) raymond-ndibe: [maintain-kubeusers] kyverno do not validate UPDATE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[15:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:37:22] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: Improve WMCS NodeDown alerts - https://phabricator.wikimedia.org/T375479#10172059 (taavi) I tried to fix part of this in https://gerrit.wikimedia.org/r/c/operations/alerts/+/977743 (as Prometheus will [[ https://prometheus.io/docs/alerting/latest/alertm...
[15:47:15] (update) raymond-ndibe: [maintain-kubeusers] kyverno do not validate UPDATE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[15:47:27] Toolforge: [gateway-api] something is caching the openapi docs - https://phabricator.wikimedia.org/T371033#10172093 (dcaro) p:Medium→Low
[15:47:41] Toolforge (Toolforge iteration 14): [toolforge deploy] direct-api tests fail intermittently on toolsbeta - https://phabricator.wikimedia.org/T369891#10172097 (dcaro) Open→Resolved a:dcaro
[15:53:06] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 14), Patch-For-Review: [builds-builder,builds-api] Upgrade tekton - https://phabricator.wikimedia.org/T374908#10172145 (taavi) duplicate of {T370869}?
[15:53:57] PROBLEM - Disk space on cloudbackup1004 is CRITICAL: DISK CRITICAL - free space: /srv 647210MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudbackup1004&var-datasource=eqiad+prometheus/ops
[15:57:38] cloud-services-team, Cloud-VPS, Patch-For-Review: tofu-infra: extend coverage to Designate DNS data - https://phabricator.wikimedia.org/T374338#10172168 (aborrero) In progress→Resolved
[15:58:21] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: openstack: updates to horizon for vxlan migration - https://phabricator.wikimedia.org/T374824#10172155 (aborrero) In progress→Stalled next is eqiad1, when we get there
[16:00:45] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T348643)
[16:00:52] T348643: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643
[16:04:02] Toolforge (Quota-requests): Request increased quota for video-answer-tool-staging Toolforge tool - https://phabricator.wikimedia.org/T375446#10172199 (aborrero) Open→In progress p:Triage→Medium approved.
[16:04:50] Toolforge (Quota-requests): Request increased quota for video-answer-tool-staging Toolforge tool - https://phabricator.wikimedia.org/T375446#10172203 (etz) Thank you!
[16:05:48] FIRING: PuppetFailure: Puppet has failed on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:05:53] cloud-services-team: PuppetFailure Puppet failure on cloudcontrol2004-dev:9100 - https://phabricator.wikimedia.org/T375534 (phaultfinder) NEW
[16:09:03] (PS2) David Caro: wmcs: add the new gitlab repos [labs/codesearch] - https://gerrit.wikimedia.org/r/1053538
[16:15:21] Toolforge (Quota-requests): Request increased quota for video-answer-tool-staging Toolforge tool - https://phabricator.wikimedia.org/T375446#10172238 (dcaro) In progress→Resolved a:dcaro Done: {F57534431}
[16:30:05] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: Improve WMCS NodeDown alerts - https://phabricator.wikimedia.org/T375479#10172290 (dcaro) >>! In T375479#10172059, @taavi wrote: > I tried to fix part of this in https://gerrit.wikimedia.org/r/c/operations/alerts/+/977743 (as Prometheus will [[ https://...
[16:32:40] (update) dcaro: change setup order [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/190
[16:36:04] (update) raymond-ndibe: [maintain-kubeusers] kyverno do not validate UPDATE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[16:40:35] RESOLVED: DiskSpace: Disk space cloudbackup1004:9100:/srv 5.855% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:53:57] RECOVERY - Disk space on cloudbackup1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudbackup1004&var-datasource=eqiad+prometheus/ops
[17:18:17] Quarry: Quarry login fails due to redirect to plaintext HTTP URL - https://phabricator.wikimedia.org/T361471#10172447 (LucasWerkmeister) Still happening. I got redirected like this: - https://quarry.wmcloud.org/login?next=/ - https://meta.wikimedia.org/w/index.php?title=Special%3AOAuth%2Fauthenticate&oauth_t...
[17:28:14] Quarry: Quarry login fails due to redirect to plaintext HTTP URL - https://phabricator.wikimedia.org/T361471#10172471 (github-toolforge-bot) supertassu opened https://github.com/toolforge/quarry/pull/70
[17:28:25] supertassu opened https://github.com/toolforge/quarry/pull/70
[17:32:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-7 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[17:49:21] (update) raymond-ndibe: [maintain-kubeusers] kyverno do not validate UPDATE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[18:11:14] (update) raymond-ndibe: [maintain-kubeusers] kyverno do not validate UPDATE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[18:48:03] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudidm2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[19:33:07] !log dcaro@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T348643)
[19:33:15] T348643: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643
[19:53:10] Tool-video-answer-tool, Future-Audiences, Spike: Investigate different options for animation of images - https://phabricator.wikimedia.org/T374367#10172924 (etz) @Maryana checking in on point 2 from this ticket: I did check out the videos mentioned in that doc. Is part of the expectation of this tick...
[20:06:03] FIRING: PuppetFailure: Puppet has failed on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:30:30] (update) raymond-ndibe: [maintain-kubeusers] kyverno do not validate UPDATE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[20:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:59:18] (update) raymond-ndibe: [maintain-kubeusers.kyverno] only validate CREATE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[21:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[21:09:19] cloud-services-team: update labtestwiki user and password - https://phabricator.wikimedia.org/T328289#10173145 (bd808) >>! In T328289#10168269, @fnegri wrote: >> Are there still use cases for it after removal of ldap from wikitech? > > From my understanding, labtestwiki should no longer be needed after we c...
[21:18:28] (update) raymond-ndibe: [maintain-kubeusers.kyverno] only validate CREATE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[21:18:50] (approved) raymond-ndibe: [maintain-kubeusers.kyverno] only validate CREATE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[21:18:54] (update) raymond-ndibe: [maintain-kubeusers.kyverno] only validate CREATE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157)
[21:26:38] Striker: Concatenated URLs in toolinfo.json - https://phabricator.wikimedia.org/T345776#10173211 (bd808) >>! In T345776#10168125, @TBurmeister wrote: > I think I just encountered this bug when looking at the record in Toolhub for https://toolhub.wikimedia.org/tools/toolforge-tool-watch and reading through ht...
[21:34:26] 10Data-Services, 06DBA: Prepare and check storage layer for madwiktionary - https://phabricator.wikimedia.org/T375023#10173239 (10Zabe) wiki has been created [21:35:51] !log raymondndibe@wmf3402 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component kyverno (T359641) [21:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [21:35:57] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [21:40:59] !log raymondndibe@wmf3402 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component kyverno (T359641) [21:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [21:41:07] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [21:41:19] 10Data-Services, 06DBA: Prepare and check storage layer for kgewiki - https://phabricator.wikimedia.org/T374814#10173261 (10Zabe) wiki has been created [21:41:48] !log raymondndibe@wmf3402 tools START - Cookbook wmcs.toolforge.component.deploy for component kyverno (T359641) [21:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [21:48:25] !log raymondndibe@wmf3402 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component kyverno (T359641) [21:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [21:48:31] T359641: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641 [21:48:48] (03approved) 10raymond-ndibe: [toolforge.kyverno] update kubeVersion [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/528 (https://phabricator.wikimedia.org/T359641) [21:49:05] (03merge) 10raymond-ndibe: [toolforge.kyverno] update kubeVersion [repos/cloud/toolforge/toolforge-deploy] - 
https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/528 (https://phabricator.wikimedia.org/T359641) [21:50:28] (merge) raymond-ndibe: [maintain-kubeusers.kyverno] only validate CREATE operations [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/62 (https://phabricator.wikimedia.org/T375157) [21:52:47] (open) project_1317_bot_df3177307bed93c3f34e421e26c86e38: maintain-kubeusers: bump to 0.0.169-20240924215037-64da2c2e [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/529 (https://phabricator.wikimedia.org/T375157) [21:55:59] !log raymondndibe@wmf3402 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers (T375157) [21:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [21:56:05] T375157: kyverno prevents deletion of pods that violates its policies - https://phabricator.wikimedia.org/T375157 [22:03:02] !log raymondndibe@wmf3402 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-kubeusers (T375157) [22:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [22:03:09] T375157: kyverno prevents deletion of pods that violates its policies - https://phabricator.wikimedia.org/T375157 [22:03:47] !log raymondndibe@wmf3402 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers (T375157) [22:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [22:06:44] Data-Services, DBA: Prepare and check storage layer for gorwikiquote - https://phabricator.wikimedia.org/T375094#10173394 (Zabe) wiki has been created [22:11:22] !log raymondndibe@wmf3402 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component 
maintain-kubeusers (T375157) [22:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [22:11:28] T375157: kyverno prevents deletion of pods that violates its policies - https://phabricator.wikimedia.org/T375157 [22:12:14] Toolforge (Toolforge iteration 14), Patch-For-Review: kyverno prevents deletion of pods that violates its policies - https://phabricator.wikimedia.org/T375157#10173453 (Raymond_Ndibe) In progress→Resolved [22:24:06] Data-Services, DBA: Prepare and check storage layer for shnwikinews - https://phabricator.wikimedia.org/T375432#10173488 (Zabe) wiki has been created [22:48:18] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudidm2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:58:21] wikitech.wikimedia.org, Gerrit, LDAP: Rename account Zoranzoki21 to Kizule on Gerrit - https://phabricator.wikimedia.org/T260647#10173636 (Kizule) >>! In T260647#10170714, @Ladsgroup wrote: > Renaming shell/idm/gerrit accounts is out of the scope of wikitech SULification so I'm not sure reopening... [23:00:32] Tool-video-answer-tool, Future-Audiences: FA community call video demo - https://phabricator.wikimedia.org/T374878#10173639 (derenrich) generated videos based on above and put them here https://drive.google.com/drive/folders/1_LHILChMO0GvqopWaiEgva5ioCNJeMQ3?usp=sharing Comments on the results: We have... 
[23:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [23:26:31] FIRING: ToolsToolsDBReplicationLagIsTooHigh: ToolsDB replication on tools-db-3 is lagging behind the primary, the current lag is 3681 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh [23:27:47] Toolforge-standards-committee (Maintainer needed): wikistream.toolforge.org needs new maintainers - https://phabricator.wikimedia.org/T251555#10173753 (bd808) >>! In T251555#10160564, @Pintoch wrote: > I'm tagging this as low priority given that there are alternatives Weird flex, but ok. > For the same reas... [23:31:31] RESOLVED: ToolsToolsDBReplicationLagIsTooHigh: ToolsDB replication on tools-db-3 is lagging behind the primary, the current lag is 3681 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh