[00:37:19] FIRING: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling [02:32:19] RESOLVED: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling [02:35:41] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [02:45:41] RESOLVED: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [08:32:37] (03merge) 10aborrero: kubeconfig: support updating the file [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/68 (https://phabricator.wikimedia.org/T262562) [08:34:59] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: maintain-kubeusers: bump to 0.0.175-20250519083249-6ee18335 [repos/cloud/toolforge/toolforge-deploy] - 
10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/789 (https://phabricator.wikimedia.org/T262562) [08:45:39] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers [08:46:51] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-kubeusers [08:50:39] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers [08:51:42] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-kubeusers [09:03:32] (03merge) 10aborrero: maintain-kubeusers: bump to 0.0.175-20250519083249-6ee18335 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/789 (https://phabricator.wikimedia.org/T262562) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [09:09:53] 10Toolforge (Toolforge iteration 20): mypy x509 invalid syntax while running CI tests - https://phabricator.wikimedia.org/T394593#10833808 (10dcaro) We upgraded pre-commit deps too, we can try regenerating the ci image see if that solves the versioning issue [09:52:42] 06cloud-services-team, 10Toolforge: toolforge: tofu-provisioning: reorganize DNS records in the state - https://phabricator.wikimedia.org/T394645 (10aborrero) 03NEW [09:53:00] 06cloud-services-team, 10Toolforge: toolforge: tofu-provisioning: reorganize DNS records in the state - https://phabricator.wikimedia.org/T394645#10834027 (10aborrero) 05Open→03In progress p:05Triage→03Medium work started here: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merg... 
[10:05:41] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [10:10:21] (03update) 10aborrero: dns: use zonename_recordname as opentofu state key [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/22 (https://phabricator.wikimedia.org/T394645) [10:10:50] (03update) 10aborrero: dns: use zonename_recordname as opentofu state key [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/22 (https://phabricator.wikimedia.org/T394645) [10:11:28] (03update) 10aborrero: dns: use zonename_recordname as opentofu state key [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/22 (https://phabricator.wikimedia.org/T394645) [10:13:45] (03update) 10aborrero: dns: use zonename_recordname as opentofu state key [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/22 (https://phabricator.wikimedia.org/T394645) [10:15:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-2 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [10:15:41] RESOLVED: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - 
https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [10:40:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-2 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [10:40:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-2 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [10:44:47] (03update) 10aborrero: dns: use zonename_recordname as opentofu state key [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/22 (https://phabricator.wikimedia.org/T394645) [10:45:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-2 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [10:49:12] (03update) 10dcaro: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112) (owner: 
10raymond-ndibe) [10:57:41] 10Cloud Services Proposals, 06cloud-services-team, 10Striker: Decision request - Tool account management and Striker - https://phabricator.wikimedia.org/T394035#10834265 (10taavi) >>! In T394035#10823194, @fnegri wrote: > Option Purple is my favourite so far, but I'm still a bit confused about how the new se... [11:05:04] 06cloud-services-team, 10Toolforge: Decouple Toolforge API gateway authentication from Kubernetes certificates - https://phabricator.wikimedia.org/T332478#10834273 (10taavi) >>! In T332478#10747989, @dcaro wrote: > @taavi should I close this as duplicate? Or do you want to refresh/extend the oauth+dedicated au... [11:18:01] 10Cloud Services Proposals, 06cloud-services-team, 10Striker: Decision request - Tool account management and Striker - https://phabricator.wikimedia.org/T394035#10834319 (10dcaro) >>! In T394035#10820913, @fnegri wrote: >>> Could it be a generic LDAP adapter, with some minimal logic to restrict the damage yo... [12:13:20] (03update) 10dcaro: [envvars-api] return custom message for invalid EnvvarName [repos/cloud/toolforge/envvars-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/53 (https://phabricator.wikimedia.org/T360147) (owner: 10raymond-ndibe) [12:16:29] (03open) 10aborrero: tofu-infra: introduce gitlab CI/CD workflow [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/236 (https://phabricator.wikimedia.org/T370652) [12:18:51] (03update) 10aborrero: tofu-infra: introduce gitlab CI/CD workflow [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/236 (https://phabricator.wikimedia.org/T370652) [12:33:51] 10Toolforge (Toolforge iteration 20): [builds-api] Store the commit hash that was used for the build - https://phabricator.wikimedia.org/T389043#10834525 (10dcaro) 05In progress→03Resolved [12:34:12] 10Toolforge (Toolforge 
iteration 20), 07Epic: [cicd] create cicd flow for non repo owners - https://phabricator.wikimedia.org/T394594#10834529 (10dcaro) 05Duplicate→03Resolved [12:34:17] 10Toolforge (Toolforge iteration 20): [jobs-api] prepend date and pod name to filelog lines - https://phabricator.wikimedia.org/T372025#10834532 (10dcaro) 05Duplicate→03Resolved [12:48:40] (03approved) 10dcaro: api.metrics: add deprecation metrics [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/166 (https://phabricator.wikimedia.org/T390137) (owner: 10raymond-ndibe) [12:48:43] (03update) 10dcaro: api.metrics: add deprecation metrics [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/166 (https://phabricator.wikimedia.org/T390137) (owner: 10raymond-ndibe) [12:57:16] (03update) 10dcaro: [jobs-api] check services diff [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/158 (https://phabricator.wikimedia.org/T392717) (owner: 10raymond-ndibe) [13:05:05] 06cloud-services-team, 10Data-Services, 06Data-Engineering: Create existencelinks table in production - https://phabricator.wikimedia.org/T394617#10834617 (10Ladsgroup) From the #DBA sign off, it had and has my sign off (I reviewed the schema, etc.). You can deploy it yourself. I added it to the table catalo... [13:13:05] 06cloud-services-team, 10Bitu, 06Infrastructure-Foundations, 07LDAP: Allocate more available UNIX UIDs for human users - https://phabricator.wikimedia.org/T355663#10834648 (10Andrew) Hello @MoritzMuehlenhoff and @SLyngshede-WMF -- this re-allocation/bitu change needs to happen soon. We have a corresponding... 
[13:17:50] 06cloud-services-team, 10Toolforge: Decouple Toolforge API gateway authentication from Kubernetes certificates - https://phabricator.wikimedia.org/T332478#10834658 (10taavi) [13:17:54] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review: Understand Octavia network needs - https://phabricator.wikimedia.org/T394099#10834659 (10Andrew) @aborrero, can you stage patches for this same change in eqiad1? Also: I'm pretty sure that everything you did was in tofu and/or puppet but want to be 10... [13:23:56] (03update) 10raymond-ndibe: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] (use_pydantic_for_core_job_model) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) [13:30:32] 06cloud-services-team, 10Cloud-VPS: Service implementation for Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T394671 (10Andrew) 03NEW [13:41:42] 10Toolforge (Toolforge iteration 20): [components-api] Add support for port/helathcheck for continuous jobs in tool config/depolyment - https://phabricator.wikimedia.org/T362072#10834795 (10dcaro) [13:41:42] 06cloud-services-team, 10Toolforge: [components-api] add order to the components deployment - https://phabricator.wikimedia.org/T362075#10834796 (10dcaro) [13:49:09] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review: Understand Octavia network needs - https://phabricator.wikimedia.org/T394099#10834835 (10aborrero) I can confirm I didn't do anything "hidden". All the changes/commits were references with this ticket (puppet, gitlab), so it should be fairly simple to r... 
[13:51:13] 10Cloud Services Proposals, 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 20), 05Cloud-Services-Origin-Team, and 3 others: [Hypothesis] WE6.3.10 start a beta for the push-to-deploy features - https://phabricator.wikimedia.org/T393564#10834837 (10dcaro) [13:54:10] 10wikitech.wikimedia.org, 06serviceops-radar, 06SRE, 13Patch-For-Review, 07SRE-Unowned: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10834856 (10Andrew) >>! In T376400#10825452, @taavi wrote: > The site at http://ec2-54-81-201-239.compute-1.amazonaws.com/ seems to embed images fro... [13:55:20] 10Toolforge (Toolforge iteration 20): [components-api,buildsa-api] When building and deploying, if none of the settings changed, the jobs are not restarted - https://phabricator.wikimedia.org/T389044#10834867 (10dcaro) Now that we have the `resolved_ref` property returned by the builds-api for each build, we can... [13:57:32] 10wikitech.wikimedia.org, 06serviceops-radar, 06SRE, 13Patch-For-Review, 07SRE-Unowned: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10834875 (10taavi) >>! In T376400#10834856, @Andrew wrote: > Can you point me to some specific examples? My half-baked spot checks (e.g. http://ec2-... [14:00:30] 10wikitech.wikimedia.org, 06serviceops-radar, 06SRE, 13Patch-For-Review, 07SRE-Unowned: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10834892 (10Andrew) yep, I see it now. 
[14:04:04] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on clouddumps1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [14:27:32] (03update) 10dcaro: [jobs-api] check services diff [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/158 (https://phabricator.wikimedia.org/T392717) (owner: 10raymond-ndibe) [14:27:58] (03update) 10dcaro: [envvars-cli] print error string and not dict [repos/cloud/toolforge/envvars-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/81 (https://phabricator.wikimedia.org/T360147) (owner: 10raymond-ndibe) [14:29:26] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] (use_pydantic_for_core_job_model) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe) [14:30:32] (03update) 10raymond-ndibe: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] (use_pydantic_for_core_job_model) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) [14:34:52] 10Toolforge (Toolforge iteration 20), 07Epic: [cicd] create cicd flow for non repo owners - https://phabricator.wikimedia.org/T394594#10835043 (10JJMC89) →14Duplicate dup:03T394595 [14:34:53] 10Toolforge (Toolforge iteration 20), 13Patch-For-Review: [cicd] create cicd flow for non repo owners - https://phabricator.wikimedia.org/T394595#10835044 (10JJMC89) [14:35:27] 10Toolforge (Toolforge iteration 20): [jobs-api] prepend date and pod name to filelog lines - 
https://phabricator.wikimedia.org/T372025#10835050 (10JJMC89) →14Duplicate dup:03T127367 [14:35:32] 06cloud-services-team, 10Toolforge, 07Epic: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools - https://phabricator.wikimedia.org/T127367#10835051 (10JJMC89) [15:10:25] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Data-Services, 06Data-Persistence: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372#10835170 (10fnegri) @joanna_borun as we agreed, I sent an email to cloud-announce with the following text: ` Starting next month, we are going t... [15:18:32] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Data-Services, 06Data-Persistence: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372#10835232 (10fnegri) @Marostegui is there any page on wikitech with the procedure that you usually follow for major-version upgrades? I was think... [15:19:16] (03open) 10aborrero: eqiad1: introduce openstack octavia network support [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/237 (https://phabricator.wikimedia.org/T394099) [15:21:01] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Data-Services, 06Data-Persistence, 10Data-Platform: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372#10835253 (10dr0ptp4kt) Looping Data Platform with a tag addition here for tracking. 
[15:21:58] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Data-Services, 06Data-Engineering, 06Data-Persistence, and 2 others: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372#10835264 (10dr0ptp4kt) [15:23:13] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [15:23:46] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Data-Services, 06Data-Engineering, 06Data-Persistence, and 2 others: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372#10835274 (10fnegri) @dr0ptp4kt thanks! We might even start doing this for `an-redacteddb1001`, before clouddb... [15:24:45] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch [15:39:09] 06cloud-services-team, 10Cloud-VPS: puppet-enc issue with Hiera values starting with a colon due to PyYAML and Ruby YAML parsing differences - https://phabricator.wikimedia.org/T394691 (10taavi) 03NEW [15:40:51] 06cloud-services-team, 10Cloud-VPS: puppet-enc issue with Hiera values starting with a colon due to PyYAML and Ruby YAML parsing differences - https://phabricator.wikimedia.org/T394691#10835469 (10taavi) p:05Triage→03High a:03taavi [15:44:32] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review: Understand Octavia network needs - https://phabricator.wikimedia.org/T394099#10835484 (10aborrero) >>! In T394099#10834659, @Andrew wrote: > @aborrero, can you stage patches for this same change in eqiad1? > Done, see: * https://gitlab.wikimedia.org/r... [15:50:14] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review: puppet-enc issue with Hiera values starting with a colon due to PyYAML and Ruby YAML parsing differences - https://phabricator.wikimedia.org/T394691#10835509 (10taavi) 05Open→03Resolved [15:54:56] FIRING: SystemdUnitDown: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1076. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1076 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:03:01] (03open) 10raymond-ndibe: [runtimes.k8s.jobs] fix default resource bug [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/168 [16:03:50] PROBLEM - ensure kvm processes are running on cloudvirt1075 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:08:08] PROBLEM - ensure kvm processes are running on cloudvirt1072 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:12:24] PROBLEM - ensure kvm processes are running on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:14:53] (03approved) 10dcaro: [runtimes.k8s.jobs] fix default resource bug [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/168 (owner: 10raymond-ndibe) [16:16:42] PROBLEM - ensure kvm processes are running on cloudvirt1076 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:18:53] (03open) 10raymond-ndibe: [runtimes.k8s.jobs] fix default resource bug [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/169 [16:21:00] PROBLEM - ensure kvm processes are running on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:22:35] !log 
andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1065'] [16:22:41] (03update) 10raymond-ndibe: [runtimes.k8s.runtime] testing diff bug fix [repos/cloud/toolforge/jobs-api] (fix_default_resource_bug) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/169 [16:22:46] (03update) 10raymond-ndibe: [runtimes.k8s.runtime] testing diff bug fix [repos/cloud/toolforge/jobs-api] (fix_default_resource_bug) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/169 [16:23:07] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1065'] [16:24:24] PROBLEM - nova-compute proc minimum on cloudvirt1075 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:25:16] PROBLEM - ensure kvm processes are running on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:25:24] RECOVERY - nova-compute proc minimum on cloudvirt1075 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:27:26] RESOLVED: SystemdUnitDown: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1076. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1076 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:29:34] PROBLEM - ensure kvm processes are running on cloudvirt1068 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:30:36] (03update) 10dcaro: [runtimes.k8s.jobs] fix default resource bug [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/168 (owner: 10raymond-ndibe) [16:35:03] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Data-Services, 06Data-Engineering, 06Data-Persistence, and 2 others: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372#10835772 (10Marostegui) >>! In T394372#10835232, @fnegri wrote: > @Marostegui is there any page on wikitech w... 
[16:35:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [16:56:21] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1068', 'cloudvirt1069', 'cloudvirt1070', 'cloudvirt1071'] [16:57:25] RECOVERY - ensure kvm processes are running on cloudvirt1069 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:57:34] RECOVERY - ensure kvm processes are running on cloudvirt1068 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:58:02] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1068', 'cloudvirt1069', 'cloudvirt1070', 'cloudvirt1071'] [16:58:16] RECOVERY - ensure kvm processes are running on cloudvirt1071 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:02:32] PROBLEM - Host cloudvirt1072 is DOWN: PING CRITICAL - Packet loss = 100% [17:05:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1072 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [17:06:00] RECOVERY - Host cloudvirt1072 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [17:06:10] PROBLEM - ensure kvm processes are running on cloudvirt1072 is 
CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:10:48] FIRING: PuppetFailure: Puppet has failed on cloudvirt1076:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:10:59] 06cloud-services-team: PuppetFailure Puppet has failed on cloudvirt1076:9100 - https://phabricator.wikimedia.org/T394706 (10phaultfinder) 03NEW [17:11:56] FIRING: SystemdUnitDown: The service unit networking.service is in failed status on host cloudvirt1072. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1072 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:15:48] RESOLVED: PuppetFailure: Puppet has failed on cloudvirt1076:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:26:17] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1074'] [17:26:47] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1074'] [17:26:56] RESOLVED: [2x] SystemdUnitDown: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1074. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1074 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:27:54] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1073.eqiad.wmnet}' [17:30:34] PROBLEM - Host cloudvirt1073 is DOWN: PING CRITICAL - Packet loss = 100% [17:31:18] RECOVERY - Host cloudvirt1073 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [17:31:27] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1073.eqiad.wmnet}' [17:32:00] PROBLEM - ensure kvm processes are running on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:32:11] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1076.eqiad.wmnet}' [17:32:49] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10836173 (10Dzahn) ` 17:32 <+icinga-wm> PROBLEM - ensure kvm processes are running on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex ar... 
[17:34:18] PROBLEM - Host cloudvirt1076 is DOWN: PING CRITICAL - Packet loss = 100% [17:35:29] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1076.eqiad.wmnet}' [17:35:30] 06cloud-services-team: PuppetFailure Puppet has failed on cloudvirt1076:9100 - https://phabricator.wikimedia.org/T394706#10836193 (10Dzahn) alerts for host being completely down now: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=cloudvirt1076 [17:35:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1073 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [17:36:00] RECOVERY - Host cloudvirt1076 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [17:36:44] PROBLEM - ensure kvm processes are running on cloudvirt1076 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:40:28] FIRING: [2x] TargetDown: Job app is unreachable in project quarry instance quarry.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [17:40:39] FIRING: QuarryDown: Quarry application is unreachable - https://prometheus-alerts.wmcloud.org/?q=alertname%3DQuarryDown [17:40:49] FIRING: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1073 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [17:41:56] FIRING: [4x] SystemdUnitDown: The service unit networking.service is in failed status on host cloudvirt1073. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:47:40] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1072.eqiad.wmnet}' [17:50:18] PROBLEM - Host cloudvirt1072 is DOWN: PING CRITICAL - Packet loss = 100% [17:52:38] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1072.eqiad.wmnet}' [17:52:46] RECOVERY - Host cloudvirt1072 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [17:53:10] PROBLEM - ensure kvm processes are running on cloudvirt1072 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:55:26] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1075.eqiad.wmnet}' [17:55:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1072 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [17:57:58] PROBLEM - Host cloudvirt1075 is DOWN: PING CRITICAL - Packet loss = 100% [17:58:59] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1075.eqiad.wmnet}' [17:59:12] RECOVERY - Host cloudvirt1075 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [17:59:52] PROBLEM - ensure kvm processes are running on cloudvirt1075 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:05:33] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook 
wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1073'] [18:05:49] RESOLVED: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1073 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [18:06:00] RECOVERY - ensure kvm processes are running on cloudvirt1073 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:06:03] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1073'] [18:06:39] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1075'] [18:06:56] FIRING: SystemdUnitDown: The service unit networking.service is in failed status on host cloudvirt1075. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1075 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:07:09] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1075'] [18:07:45] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1072', 'cloudvirt1076'] [18:07:52] RECOVERY - ensure kvm processes are running on cloudvirt1075 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:08:10] RECOVERY - ensure kvm processes are running on cloudvirt1072 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:08:42] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1072', 'cloudvirt1076'] [18:08:44] RECOVERY - ensure kvm processes are running on cloudvirt1076 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:24:52] 06cloud-services-team: Rename cloudcontrol200[789]-dev.codfw to cloudrabbit200[123]-dev.codfw - https://phabricator.wikimedia.org/T392539#10836545 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirt1072.eqiad.wmnet with OS bookworm [18:33:09] RESOLVED: QuarryDown: Quarry application is unreachable - https://prometheus-alerts.wmcloud.org/?q=alertname%3DQuarryDown [18:37:58] RESOLVED: [2x] TargetDown: Job app is unreachable in project quarry 
instance quarry.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [18:40:39] 06cloud-services-team, 10decommission-hardware: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 (10Andrew) 03NEW [18:43:59] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1031.eqiad.wmnet' (T394727) [18:44:05] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 [18:44:29] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1031.eqiad.wmnet' (T394727) [18:44:37] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1032.eqiad.wmnet' (T394727) [18:45:06] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1032.eqiad.wmnet' (T394727) [18:45:53] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1033.eqiad.wmnet' (T394727) [18:46:23] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1033.eqiad.wmnet' (T394727) [18:46:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1034.eqiad.wmnet' (T394727) [18:47:20] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1034.eqiad.wmnet' (T394727) [18:47:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1035.eqiad.wmnet' (T394727) [18:47:57] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1035.eqiad.wmnet' (T394727) [18:48:52] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [18:48:55] !log andrew@cloudcumin1001 admin END 
(FAIL) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=99) [18:50:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [18:50:31] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=99) [18:50:56] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.set_maintenance [18:51:27] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.set_maintenance (exit_code=0) [18:51:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [18:51:39] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [18:51:45] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1036.eqiad.wmnet' (T394727) [18:51:50] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 [18:52:24] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1036.eqiad.wmnet' (T394727) [18:54:33] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1068'] [18:54:57] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1068'] [18:56:23] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1069'] [18:56:47] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1069'] [18:57:21] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook 
wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1070'] [18:57:46] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1070'] [18:58:09] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1037.eqiad.wmnet' (T394727) [18:58:14] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 [19:00:28] (03update) 10raymond-ndibe: [runtimes.k8s.runtime] testing diff bug fix [repos/cloud/toolforge/jobs-api] (fix_default_resource_bug) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/169 [19:00:31] 06cloud-services-team: Rename cloudcontrol200[789]-dev.codfw to cloudrabbit200[123]-dev.codfw - https://phabricator.wikimedia.org/T392539#10836690 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudvirt1072.eqiad.wmnet with OS bookworm executed with errors:... 
[19:09:21] (03PS1) 10Amire80: Consistent spelling of "metadata" in a message [labs/tools/intuition] - 10https://gerrit.wikimedia.org/r/1147856 [19:09:30] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1073', 'cloudvirt1076'] [19:09:34] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1073', 'cloudvirt1076'] [19:15:23] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1073', 'cloudvirt1076'] [19:15:32] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1073', 'cloudvirt1076'] [19:17:20] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1075'] [19:17:27] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1075'] [19:28:03] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1037.eqiad.wmnet' (T394727) [19:28:09] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 [19:28:42] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1037.eqiad.wmnet' (T394727) [19:38:13] 10Tool-campwiz-nxt: Implement Reverse proxy and Failover server into campwiz nxt - https://phabricator.wikimedia.org/T394730 (10Nokib_Sarkar) 03NEW [19:38:50] 10Tool-campwiz-nxt: Implement Reverse proxy and Failover server into 
campwiz nxt - https://phabricator.wikimedia.org/T394730#10836795 (Nokib_Sarkar) [19:38:51] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1037.eqiad.wmnet' (T394727) [19:38:52] Tool-campwiz-nxt: Migration of CampWiz NXT to toolforge - https://phabricator.wikimedia.org/T394515#10836796 (Nokib_Sarkar) [19:38:57] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 [19:39:10] Tool-campwiz-nxt: Implement Reverse proxy and Failover server into campwiz nxt - https://phabricator.wikimedia.org/T394730#10836803 (Nokib_Sarkar) Open→In progress [19:40:51] Tool-campwiz-nxt: Migration of CampWiz NXT to toolforge - https://phabricator.wikimedia.org/T394515#10836813 (Nokib_Sarkar) [19:43:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1038.eqiad.wmnet' (T394727) [19:59:12] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1038.eqiad.wmnet' (T394727) [19:59:21] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 [20:31:51] (update) bd808: Convert project to golang [toolforge-repos/gitlab-content] - https://gitlab.wikimedia.org/toolforge-repos/gitlab-content/-/merge_requests/6 [20:37:45] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1039.eqiad.wmnet' (T394727) [20:37:51] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 [20:58:31] FIRING: ToolsNfsAlmostFull: Toolforge NFS is 0.8646319687934529/1 full - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNfsAlmostFull - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsNfsAlmostFull [21:05:16] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host
'cloudvirt1039.eqiad.wmnet' (T394727) [21:05:22] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 [21:07:17] 10Toolforge (Toolforge iteration 20): [jobs-api] bug in runtime diff_with_running_job function - https://phabricator.wikimedia.org/T394734 (10Raymond_Ndibe) 03NEW [21:07:27] 10Toolforge (Toolforge iteration 20): [jobs-api] bug in runtime diff_with_running_job function - https://phabricator.wikimedia.org/T394734#10837225 (10Raymond_Ndibe) a:03Raymond_Ndibe [21:08:42] (03update) 10raymond-ndibe: [runtimes.k8s.runtime] testing diff bug fix [repos/cloud/toolforge/jobs-api] (fix_default_resource_bug) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/169 (https://phabricator.wikimedia.org/T394734) [21:14:24] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1071'] [21:14:49] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1071'] [21:16:28] 10Toolforge (Toolforge iteration 20): [jobs-api] bug in runtime diff_with_running_job function - https://phabricator.wikimedia.org/T394734#10837250 (10Raymond_Ndibe) 05Open→03In progress [21:16:42] PROBLEM - ensure kvm processes are running on cloudvirt1076 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:16:44] PROBLEM - ensure kvm processes are running on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:16:56] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1073', 
'cloudvirt1074', 'cloudvirt1075', 'cloudvirt1076'] [21:17:14] PROBLEM - ensure kvm processes are running on cloudvirt1075 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:17:44] RECOVERY - ensure kvm processes are running on cloudvirt1073 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:17:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1039.eqiad.wmnet' (T394727) [21:17:55] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 [21:18:14] RECOVERY - ensure kvm processes are running on cloudvirt1075 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:18:24] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1073', 'cloudvirt1074', 'cloudvirt1075', 'cloudvirt1076'] [21:18:42] RECOVERY - ensure kvm processes are running on cloudvirt1076 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:28:00] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1039.eqiad.wmnet' (T394727) [21:28:07] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 [21:28:33] (03update) 10raymond-ndibe: [runtimes.k8s.runtime] testing diff bug fix [repos/cloud/toolforge/jobs-api] (fix_default_resource_bug) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/169 (https://phabricator.wikimedia.org/T394734) [21:34:12] (03update) 10raymond-ndibe: [runtimes.k8s.runtime] 
testing diff bug fix [repos/cloud/toolforge/jobs-api] (fix_default_resource_bug) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/169 (https://phabricator.wikimedia.org/T394734) [21:37:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1037.eqiad.wmnet' (T394727) [21:37:43] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 [21:45:27] FIRING: ToolsbetaNFSDown: No toolsbeta nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsbetaNFSDown [21:47:46] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1037.eqiad.wmnet' (T394727) [21:47:51] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 [21:50:27] RESOLVED: ToolsbetaNFSDown: No toolsbeta nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsbetaNFSDown [21:56:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1039.eqiad.wmnet' (T394727) [21:56:06] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 [21:59:36] (03update) 10raymond-ndibe: [runtimes.k8s.jobs] fix default resource bug [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/168 [22:06:02] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1039.eqiad.wmnet' (T394727) [22:06:09] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 [22:15:54] PROBLEM - nova-compute proc minimum on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 
processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:16:02] PROBLEM - nova-compute proc minimum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:16:10] PROBLEM - nova-compute proc minimum on cloudvirt1068 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:16:54] RECOVERY - nova-compute proc minimum on cloudvirt1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:17:49] (03update) 10raymond-ndibe: [runtimes.k8s.jobs] fix default resource bug [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/168 [22:18:48] (03update) 10raymond-ndibe: [runtimes.k8s.runtime] testing diff bug fix [repos/cloud/toolforge/jobs-api] (fix_default_resource_bug) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/169 (https://phabricator.wikimedia.org/T394734) [22:19:46] PROBLEM - nova-compute proc maximum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:20:34] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1072'] [22:21:10] RECOVERY - nova-compute proc minimum on cloudvirt1068 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:21:55] !log 
andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99) on eqiad1, with recreate False, for hosts list: ['cloudvirt1072'] [22:22:10] PROBLEM - nova-compute proc minimum on cloudvirt1068 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:22:36] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1072'] [22:23:07] !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99) on eqiad1, with recreate False, for hosts list: ['cloudvirt1072'] [22:23:54] PROBLEM - nova-compute proc maximum on cloudvirt1068 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:24:59] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1072'] [22:25:07] !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99) on eqiad1, with recreate False, for hosts list: ['cloudvirt1072'] [22:25:46] RECOVERY - nova-compute proc maximum on cloudvirt1069 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:26:17] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for service: project,nova [22:26:23] (03update) 10bd808: Convert project to golang [toolforge-repos/gitlab-content] - 10https://gitlab.wikimedia.org/toolforge-repos/gitlab-content/-/merge_requests/6 [22:26:40] 
!log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.restart_openstack (exit_code=97) on deployment eqiad1 for service: project,nova [22:27:10] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1072'] [22:27:21] !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99) on eqiad1, with recreate False, for hosts list: ['cloudvirt1072'] [22:28:45] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1072'] [22:29:10] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1072'] [22:29:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for service: project,nova [22:30:45] PROBLEM - nova-compute proc maximum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:30:46] (03update) 10raymond-ndibe: [runtimes.k8s.runtime] testing diff bug fix [repos/cloud/toolforge/jobs-api] (fix_default_resource_bug) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/169 (https://phabricator.wikimedia.org/T394734) [22:30:53] PROBLEM - nova-compute proc minimum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:31:27] PROBLEM - nova-compute proc maximum on cloudvirt1070 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:31:33] PROBLEM - nova-compute proc maximum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:31:45] PROBLEM - nova-compute proc minimum on cloudvirt1070 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:34:33] RECOVERY - nova-compute proc maximum on cloudvirt1071 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:34:45] RECOVERY - nova-compute proc minimum on cloudvirt1070 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:34:45] RECOVERY - nova-compute proc maximum on cloudvirt1069 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:34:53] RECOVERY - nova-compute proc maximum on cloudvirt1068 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:35:01] RECOVERY - nova-compute proc minimum on cloudvirt1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:35:26] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for service: project,nova [22:35:45] PROBLEM - nova-compute proc minimum on cloudvirt1070 is CRITICAL: PROCS CRITICAL: 0 processes with regex 
args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:36:01] PROBLEM - nova-compute proc minimum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:36:49] (03update) 10raymond-ndibe: [runtimes.k8s.runtime] fix bug in diff_with_running_job method [repos/cloud/toolforge/jobs-api] (fix_default_resource_bug) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/169 (https://phabricator.wikimedia.org/T394734) [22:37:15] (03update) 10raymond-ndibe: [runtimes.k8s.runtime] fix bug in diff_with_running_job method [repos/cloud/toolforge/jobs-api] (fix_default_resource_bug) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/169 (https://phabricator.wikimedia.org/T394734) [22:39:33] PROBLEM - nova-compute proc maximum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:39:45] PROBLEM - nova-compute proc maximum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:39:53] PROBLEM - nova-compute proc maximum on cloudvirt1068 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:42:53] RECOVERY - nova-compute proc maximum on cloudvirt1068 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:43:09] !log andrew@cloudcumin1001 admin START - Cookbook 
wmcs.openstack.restart_openstack on deployment eqiad1 for service: project,nova [22:43:09] RECOVERY - nova-compute proc minimum on cloudvirt1068 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:43:28] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.restart_openstack (exit_code=97) on deployment eqiad1 for service: project,nova [22:44:33] RECOVERY - nova-compute proc maximum on cloudvirt1071 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:44:53] RECOVERY - nova-compute proc minimum on cloudvirt1071 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:45:27] RECOVERY - nova-compute proc maximum on cloudvirt1070 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:45:45] RECOVERY - nova-compute proc minimum on cloudvirt1070 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:46:15] (03update) 10bd808: Convert project to golang [toolforge-repos/gitlab-content] - 10https://gitlab.wikimedia.org/toolforge-repos/gitlab-content/-/merge_requests/6 [22:46:45] RECOVERY - nova-compute proc maximum on cloudvirt1069 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:47:01] RECOVERY - nova-compute proc minimum on cloudvirt1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:47:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1039.eqiad.wmnet' (T394727) [22:47:53] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 [22:58:01] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1039.eqiad.wmnet' (T394727) [22:58:08] T394727: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727 [23:26:11] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [23:36:22] (03update) 10bd808: Convert project to golang [toolforge-repos/gitlab-content] - 10https://gitlab.wikimedia.org/toolforge-repos/gitlab-content/-/merge_requests/6 [23:52:29] (03update) 10bd808: Convert project to golang [toolforge-repos/gitlab-content] - 10https://gitlab.wikimedia.org/toolforge-repos/gitlab-content/-/merge_requests/6