[00:03:55] FIRING: MaxConntrack: Max conntrack at 80.14% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:07:29] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:10:50] FIRING: TfInfraTestDestroyFailed: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[00:15:50] RESOLVED: TfInfraTestDestroyFailed: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[00:17:29] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:19:56] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:28:55] RESOLVED: MaxConntrack: Max conntrack at 80.58% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:40:25] FIRING: MaxConntrack: Max conntrack at 81.03% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:50:25] RESOLVED: MaxConntrack: Max conntrack at 80.74% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[01:08:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[01:08:09] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[01:08:21] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[01:08:22] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[01:08:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[01:08:29] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[01:08:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[01:08:37] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[01:08:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[01:08:49] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[01:09:51] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[01:10:01] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[01:10:06] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[01:10:18] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[01:12:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[01:12:44] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[01:14:32] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[01:14:42] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[01:17:09] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[01:17:20] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[03:30:23] (open) raymond-ndibe: Draft: [envvars-api] DO_NOT_MERGE: schedule all pods on toolforge-worker [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/45 (https://phabricator.wikimedia.org/T358203)
[03:31:10] (open) raymond-ndibe: [builds-api] add topologySpreadConstraints to deployment [repos/cloud/toolforge/envvars-api] (node-selector-to-test-topology-constraints) - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/46 (https://phabricator.wikimedia.org/T358203)
[03:32:11] (update) raymond-ndibe: [builds-api] add topologySpreadConstraints to deployment [repos/cloud/toolforge/envvars-api] (node-selector-to-test-topology-constraints) - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/46 (https://phabricator.wikimedia.org/T358203)
[03:40:15] (update) raymond-ndibe: Draft: [toolforge-deploy] DO_NOT_MERGE : increase builds-api replicas in local env [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/481 (https://phabricator.wikimedia.org/T358203)
[03:45:25] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10068420 (Andrew) (meanwhile I am draining and rebuilding cloudcephosd1035 because it was built with improper drive assignments.)
[03:45:26] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T363344)
[03:45:31] T363344: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344
[03:50:40] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T363344)
[03:50:46] T363344: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344
[03:51:27] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T363344)
[03:52:54] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T363344)
[03:53:40] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T363344)
[03:57:51] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T363344)
[03:57:56] T363344: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344
[03:58:37] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T363344)
[04:09:06] (update) raymond-ndibe: Draft: [envvars-api] DO_NOT_MERGE: schedule all pods on toolforge-worker [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/45 (https://phabricator.wikimedia.org/T358203)
[04:09:37] (update) raymond-ndibe: Draft: [envvars-api] DO_NOT_MERGE: schedule all pods on toolforge-worker3 [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/45 (https://phabricator.wikimedia.org/T358203)
[04:19:56] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:24:50] FIRING: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[04:25:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T363344)
[04:25:39] T363344: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344
[04:26:20] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T363344)
[04:32:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T363344)
[04:32:38] T363344: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344
[04:33:19] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T363344)
[04:36:05] (update) raymond-ndibe: Draft: [envvars-api] DO_NOT_MERGE: schedule all pods on toolforge-worker3 [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/45 (https://phabricator.wikimedia.org/T358203)
[04:42:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T363344)
[04:42:06] T363344: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344
[04:42:47] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T363344)
[04:45:54] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T363344)
[04:46:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T363344)
[04:54:50] RESOLVED: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[04:57:41] (update) raymond-ndibe: Draft: [envvars-api] DO_NOT_MERGE: schedule all pods on toolforge-worker3 [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/45 (https://phabricator.wikimedia.org/T358203)
[04:58:04] (update) raymond-ndibe: Draft: [envvars-api] DO_NOT_MERGE: schedule all pods on toolforge-worker [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/45 (https://phabricator.wikimedia.org/T358203)
[05:04:34] (update) raymond-ndibe: Draft: [envvars-api] DO_NOT_MERGE: schedule all pods on toolforge-worker [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/45 (https://phabricator.wikimedia.org/T358203)
[05:11:57] (update) raymond-ndibe: [builds-api] add topologySpreadConstraints to deployment [repos/cloud/toolforge/envvars-api] (node-selector-to-test-topology-constraints) - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/46 (https://phabricator.wikimedia.org/T358203)
[05:12:04] (update) raymond-ndibe: [builds-api] add topologySpreadConstraints to deployment [repos/cloud/toolforge/envvars-api] (node-selector-to-test-topology-constraints) - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/46 (https://phabricator.wikimedia.org/T358203)
[05:40:30] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T363344)
[05:40:35] T363344: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344
[06:55:24] (update) raymond-ndibe: [envvars-api] add topologySpreadConstraints to deployment [repos/cloud/toolforge/envvars-api] (node-selector-to-test-topology-constraints) - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/46 (https://phabricator.wikimedia.org/T358203)
[06:58:09] (update) raymond-ndibe: Draft: [toolforge-deploy] DO_NOT_MERGE : increase envvars-api replicas in local env [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/481 (https://phabricator.wikimedia.org/T358203)
[06:58:40] (update) raymond-ndibe: Draft: [toolforge-deploy] DO_NOT_MERGE : increase envvars-api replicas in local env [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/481 (https://phabricator.wikimedia.org/T358203)
[07:36:54] (open) raymond-ndibe: Draft: [lima-kilo] DO_NOT_MERGE: enable node inclusion policy feature gate [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/183 (https://phabricator.wikimedia.org/T358203)
[07:39:41] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:41:37] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29687 bytes in 4.219 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:19:56] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:23:55] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "wikipathways" project Buster deprecation - https://phabricator.wikimedia.org/T367563#10068615 (EgonWillighagen) @Andrew, hi, the email replies by Alex seems to get bounced, but he already replied July 19th. Since I received that email, I did not know you had...
[08:39:19] (update) raymond-ndibe: Draft: [envvars-api] DO_NOT_MERGE: schedule all pods on toolforge-worker [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/45 (https://phabricator.wikimedia.org/T358203)
[08:42:31] (update) raymond-ndibe: [envvars-api] add topologySpreadConstraints to deployment [repos/cloud/toolforge/envvars-api] (node-selector-to-test-topology-constraints) - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/46 (https://phabricator.wikimedia.org/T358203)
[08:42:36] (update) raymond-ndibe: [envvars-api] add topologySpreadConstraints to deployment [repos/cloud/toolforge/envvars-api] (node-selector-to-test-topology-constraints) - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/46 (https://phabricator.wikimedia.org/T358203)
[09:01:31] (update) raymond-ndibe: [envvars-api] add topologySpreadConstraints to deployment [repos/cloud/toolforge/envvars-api] (node-selector-to-test-topology-constraints) - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/46 (https://phabricator.wikimedia.org/T358203)
[09:04:08] Toolforge (Toolforge iteration 14), Patch-For-Review: [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#10068690 (Raymond_Ndibe) Tested on lima-kilo: https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api...
[09:13:29] (open) raymond-ndibe: [api-gateway] add topologySpreadConstraints to deployment [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/35 (https://phabricator.wikimedia.org/T358203)
[09:21:32] Toolforge (Toolforge iteration 14), Patch-For-Review: [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#10068726 (Raymond_Ndibe)
[09:25:20] (PS1) Klausman: hiera/k8s: Update ML Swift secrets sections for consistency [labs/private] - https://gerrit.wikimedia.org/r/1063162
[09:25:33] (open) raymond-ndibe: [envvars-admission] add topologySpreadConstraints to deployment [repos/cloud/toolforge/envvars-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-admission/-/merge_requests/10 (https://phabricator.wikimedia.org/T358203)
[09:25:45] (update) raymond-ndibe: [envvars-admission] add topologySpreadConstraints to deployment [repos/cloud/toolforge/envvars-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-admission/-/merge_requests/10 (https://phabricator.wikimedia.org/T358203)
[09:26:23] Toolforge (Toolforge iteration 14), Patch-For-Review: [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#10068741 (Raymond_Ndibe)
[09:27:34] (CR) Klausman: "check experimental" [labs/private] - https://gerrit.wikimedia.org/r/1063162 (owner: Klausman)
[09:28:06] (update) raymond-ndibe: [api-gateway] add topologySpreadConstraints to deployment [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/35 (https://phabricator.wikimedia.org/T358203)
[09:31:07] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:33:03] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29693 bytes in 5.808 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:36:27] (open) raymond-ndibe: [volume-admission] add topologySpreadConstraints to deployment [repos/cloud/toolforge/volume-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/15 (https://phabricator.wikimedia.org/T358203)
[09:36:37] (update) raymond-ndibe: [volume-admission] add topologySpreadConstraints to deployment [repos/cloud/toolforge/volume-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/15 (https://phabricator.wikimedia.org/T358203)
[09:37:16] Toolforge (Toolforge iteration 14), Patch-For-Review: [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#10068797 (Raymond_Ndibe)
[09:44:12] (open) raymond-ndibe: [jobs-api] add topologySpreadConstraints to deployment [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/117 (https://phabricator.wikimedia.org/T358203)
[09:44:52] (update) raymond-ndibe: [jobs-api] add topologySpreadConstraints to deployment [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/117 (https://phabricator.wikimedia.org/T358203)
[09:47:28] Toolforge (Toolforge iteration 14), Patch-For-Review: [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#10068895 (Raymond_Ndibe)
[10:04:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-55 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[10:09:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-55 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[10:14:42] (open) raymond-ndibe: [builds-builder] add topologySpreadConstraints to deployment [repos/cloud/toolforge/builds-builder] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/58 (https://phabricator.wikimedia.org/T358203)
[10:16:18] Toolforge (Toolforge iteration 14), Patch-For-Review: [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#10069012 (Raymond_Ndibe)
[10:18:18] (update) raymond-ndibe: [jobs-api] add topologySpreadConstraints to deployment [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/117 (https://phabricator.wikimedia.org/T358203)
[10:21:43] (open) raymond-ndibe: [ingress-admission] add topologySpreadConstraints to deployment [repos/cloud/toolforge/ingress-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-admission/-/merge_requests/8 (https://phabricator.wikimedia.org/T358203)
[10:22:13] Toolforge (Toolforge iteration 14), Patch-For-Review: [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#10069032 (Raymond_Ndibe)
[10:32:28] (open) raymond-ndibe: [jobs-emailer] add topologySpreadConstraints to deployment [repos/cloud/toolforge/jobs-emailer] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-emailer/-/merge_requests/6 (https://phabricator.wikimedia.org/T358203)
[10:32:52] Toolforge (Toolforge iteration 14), Patch-For-Review: [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#10069058 (Raymond_Ndibe)
[10:36:09] (update) raymond-ndibe: [jobs-api] add topologySpreadConstraints to deployment [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/117 (https://phabricator.wikimedia.org/T358203)
[10:39:23] (open) raymond-ndibe: [registry-admission] add topologySpreadConstraints to deployment [repos/cloud/toolforge/registry-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/registry-admission/-/merge_requests/11 (https://phabricator.wikimedia.org/T358203)
[10:39:53] Toolforge (Toolforge iteration 14), Patch-For-Review: [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#10069085 (Raymond_Ndibe)
[10:40:04] (update) raymond-ndibe: [api-gateway] add topologySpreadConstraints to deployment [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/35 (https://phabricator.wikimedia.org/T358203)
[10:43:59] (update) raymond-ndibe: [builds-api] add topologySpreadConstraints to deployment [repos/cloud/toolforge/builds-api] (node-selector-to-test-topology-spread) - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/82 (https://phabricator.wikimedia.org/T358203)
[10:45:52] (update) raymond-ndibe: [envvars-admission] add topologySpreadConstraints to deployment [repos/cloud/toolforge/envvars-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-admission/-/merge_requests/10 (https://phabricator.wikimedia.org/T358203)
[10:46:52] Tools: Flickr2 Commons is currently down - https://phabricator.wikimedia.org/T372451#10069096 (DaxServer) For the moment, I enabled a copy at https://flickr2commons-ng.toolforge.org/ Code cloned from Magnus' repos and hosted at https://gitlab.wikimedia.org/toolforge-repos/flickr2commons-ng
[10:47:45] (update) raymond-ndibe: [envvars-api] add topologySpreadConstraints to deployment [repos/cloud/toolforge/envvars-api] (node-selector-to-test-topology-constraints) - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/46 (https://phabricator.wikimedia.org/T358203)
[10:51:34] (update) raymond-ndibe: [envvars-api] add topologySpreadConstraints to deployment [repos/cloud/toolforge/envvars-api] (node-selector-to-test-topology-constraints) - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/46 (https://phabricator.wikimedia.org/T358203)
[10:55:17] (update) raymond-ndibe: [volume-admission] add topologySpreadConstraints to deployment [repos/cloud/toolforge/volume-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/15 (https://phabricator.wikimedia.org/T358203)
[10:59:47] Toolforge (Toolforge iteration 14), Patch-For-Review: [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#10069120 (Raymond_Ndibe) >>! In T358203#10068690, @Raymond_Ndibe wrote: > Tested on lima-kilo: > https://git...
[11:00:34] (close) raymond-ndibe: Draft: [builds-api] DO_NOT_MERGE: schedule all pods on toolforge-worker [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/108 (https://phabricator.wikimedia.org/T358203)
[11:01:47] (update) raymond-ndibe: [builds-api] add topologySpreadConstraints to deployment [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/82 (https://phabricator.wikimedia.org/T358203)
[11:04:30] (close) raymond-ndibe: DO_NOT_MERGE: testing _display_messages move to toolforge-weld [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/116
[11:04:52] (close) raymond-ndibe: DO_NOT_MERGE: testing _display_messages move to toolforge-weld [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/44
[11:06:20] (close) raymond-ndibe: DO_NOT_MERGE: testing _display_messages move to toolforge-weld [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/90
[11:24:13] (update) raymond-ndibe: [toolforge-weld] move _display_message into toolforge weld [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/46
[11:28:31] (CR) Pwangai: Exempt Test Group Repositories (1 comment) [labs/tools/sonarqubebot] - https://gerrit.wikimedia.org/r/1063008 (https://phabricator.wikimedia.org/T372565) (owner: Pwangai)
[11:45:17] (update) raymond-ndibe: [toolforge-weld] move _display_message into toolforge weld [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/46
[11:46:33] (update) raymond-ndibe: [toolforge-weld] move _display_message into toolforge weld [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/46
[11:47:54] (update) raymond-ndibe: [toolforge-weld] move _display_message into toolforge weld [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/46
[11:48:12] (update) raymond-ndibe: [toolforge-weld] move _display_message into toolforge weld [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/46
[11:54:34] (update) raymond-ndibe: [builds-cli] remove _display_messages [repos/cloud/toolforge/builds-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/69
[11:59:15] (update) raymond-ndibe: [jobs-cli] remove _display_messages [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/62
[11:59:59] (update) raymond-ndibe: [envvars-cli] remove display_messages [repos/cloud/toolforge/envvars-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/57
[12:00:06] (update) raymond-ndibe: [jobs-cli] remove _display_messages [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/62
[12:03:17] Tools: Flickr2 Commons is currently down - https://phabricator.wikimedia.org/T372451#10069255 (Prototyperspective) Thanks DaxServer. This would need to be discussed elsewhere but I think it would be great if the repos from BitBucket were moved to GitLab which is the established standard for Wikimedia tools t...
[12:03:46] Toolforge (Toolforge iteration 14): something is wrong with pre-commit on builds-api - https://phabricator.wikimedia.org/T372601#10069256 (Raymond_Ndibe)
[12:19:56] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:23:36] Toolforge: [toolforge] [envvars] - https://phabricator.wikimedia.org/T372640 (fnegri) NEW
[13:23:57] Toolforge: [toolforge] [envvars] TOOL_REPLICA_USER and TOOL_TOOLSDB_USER missing for new tool - https://phabricator.wikimedia.org/T372640#10069419 (fnegri)
[13:24:42] Toolforge: [toolforge] [envvars] TOOL_REPLICA_USER and TOOL_TOOLSDB_USER missing for new tool - https://phabricator.wikimedia.org/T372640#10069426 (fnegri) I can see some errors in the logs for the service, not just for that tool but for a few other tools as well. This is the part of the log about that speci...
[13:26:12] Toolforge: [toolforge] [envvars] TOOL_REPLICA_USER and TOOL_TOOLSDB_USER missing for new tool - https://phabricator.wikimedia.org/T372640#10069441 (fnegri)
[13:26:50] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10069443 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1035.eq...
[13:27:27] Toolforge: [toolforge] [envvars] TOOL_REPLICA_USER and TOOL_TOOLSDB_USER missing for new tool - https://phabricator.wikimedia.org/T372640#10069446 (fnegri)
[13:33:59] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Goal: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424#10069470 (fnegri)
[13:43:20] (CR) Kosta Harlan: Exempt Test Group Repositories (1 comment) [labs/tools/sonarqubebot] - https://gerrit.wikimedia.org/r/1063008 (https://phabricator.wikimedia.org/T372565) (owner: Pwangai)
[13:45:16] Toolforge: [toolforge] [envvars] TOOL_REPLICA_USER and TOOL_TOOLSDB_USER missing for new tool - https://phabricator.wikimedia.org/T372640#10069499 (dcaro) I merged some patches to maintain dbusers yesterday, you can try reverting those (puppet repo), though that specific error seems to be a race condition be...
[13:51:57] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Goal: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424#10069501 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1017.eqiad.wmnet with OS bookworm
[14:02:24] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services: [wikireplicas] frequent replag spikes in clouddb hosts - https://phabricator.wikimedia.org/T367778#10069525 (fnegri) In progress→Resolved Replication lag is now back to zero on all clouddb* hosts. Upgrading all of them to bookworm and to...
[14:04:06] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10069542 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1035.eqiad....
[14:08:24] PROBLEM - Host cloudcephosd1035 is DOWN: PING CRITICAL - Packet loss = 100% [14:11:56] RECOVERY - Host cloudcephosd1035 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:14:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T363344) [14:17:28] PROBLEM - Host cloudcephosd1035 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:00] RECOVERY - Host cloudcephosd1035 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [14:23:36] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE: Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10069607 (Jhancock.wm) [14:25:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [14:27:58] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T363344) [14:28:06] T363344: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344 [14:29:41] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [14:29:42] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [14:29:56] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [14:30:07] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [14:30:12] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [14:33:05] !log
andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [14:33:22] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [14:33:54] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [14:34:04] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [14:34:31] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [14:34:41] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [14:35:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [14:35:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [14:35:26] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [14:35:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [14:36:00] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [14:36:06] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [14:36:16] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [14:36:20] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [14:36:30] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [14:36:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [14:36:44] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy 
(exit_code=99) [14:37:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node [14:37:26] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) [14:38:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node [14:40:13] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) [14:40:29] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node [14:40:39] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) [14:42:02] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [14:48:49] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Goal: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424#10069684 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1017.eqiad.wmnet with OS bookworm executed with e... [14:53:11] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Goal: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424#10069691 (fnegri) The reimage cookbook for clouddb1017 failed only because MariaDB is taking a bit longer to catch up with the primary, and the cookbook did not...
[14:53:43] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Goal: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424#10069692 (fnegri) [15:02:37] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [15:03:06] FIRING: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:07:15] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [15:07:36] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [15:07:53] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [15:08:03] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) [15:08:06] RESOLVED: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:08:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [15:08:15] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [15:08:26] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [15:08:33] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [15:08:38] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [15:08:48] !log andrew@cloudcumin1001
admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) [15:08:51] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [15:09:01] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) [15:09:05] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [15:09:12] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [15:09:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [15:09:26] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) [15:09:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [15:09:44] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) [15:10:02] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [15:10:09] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [15:15:31] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [15:15:38] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) [15:19:55] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [15:20:08] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=97) [15:30:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [15:30:45] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [15:49:41] FIRING: 
CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [15:51:41] (CR) Klausman: [C:+1] hiera/k8s: Update ML Swift secrets sections for consistency [labs/private] - https://gerrit.wikimedia.org/r/1063162 (owner: Klausman) [15:51:45] (CR) Klausman: [C:+2] hiera/k8s: Update ML Swift secrets sections for consistency [labs/private] - https://gerrit.wikimedia.org/r/1063162 (owner: Klausman) [15:51:47] (CR) Klausman: [V:+2 C:+2] hiera/k8s: Update ML Swift secrets sections for consistency [labs/private] - https://gerrit.wikimedia.org/r/1063162 (owner: Klausman) [16:00:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [16:07:39] (PS5) Pwangai: Exempt Test Group Repositories [labs/tools/sonarqubebot] - https://gerrit.wikimedia.org/r/1063008 (https://phabricator.wikimedia.org/T372565) [16:10:09] (CR) Pwangai: Exempt Test Group Repositories (1 comment) [labs/tools/sonarqubebot] - https://gerrit.wikimedia.org/r/1063008 (https://phabricator.wikimedia.org/T372565) (owner: Pwangai) [16:28:04] (CR) Kosta Harlan: Exempt Test Group Repositories (3 comments) [labs/tools/sonarqubebot] - https://gerrit.wikimedia.org/r/1063008 (https://phabricator.wikimedia.org/T372565) (owner: Pwangai) [16:28:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [16:28:23] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
(exit_code=0) [16:29:44] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [16:29:53] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [16:30:52] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [16:31:02] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [16:31:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [16:31:13] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [16:31:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [16:31:24] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [16:31:25] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [16:31:35] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [16:31:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [16:31:46] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [16:31:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [16:31:56] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [16:31:57] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [16:32:07] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [16:32:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [16:32:18] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook 
wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [16:32:19] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [16:32:28] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [16:32:29] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [16:32:39] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [16:34:32] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878#10070011 (Andrew) Open→Resolved a:Andrew I've now repooled all affected ceph nodes (and rebuilt cloudcephosd1035) and repooled all cloudvirts. Until the switch fl... [16:36:58] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "wikipathways" project Buster deprecation - https://phabricator.wikimedia.org/T367563#10070063 (Andrew) Open→Resolved a:Andrew Thanks @EgonWillighagen. This project is now up to date. [16:38:48] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "commons-corruption-checker" project Buster deprecation - https://phabricator.wikimedia.org/T367525#10070067 (Andrew) Emailed today, again: ` Hello again! If you would like to preserve your cloud-vps project, please read the email below and take action. On... [16:44:48] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "wikicommunityhealth" project Buster deprecation - https://phabricator.wikimedia.org/T367560#10070095 (Andrew) Emailed today: > Hello! > > Your cloud-vps project 'wikicommunityhealth' is currently under multiple threats. > > 1) It contains Buster VMs, and... [16:49:37] Cloud-VPS (Debian Buster Deprecation), Machine-Learning-Team, Wikilabels: Cloud VPS "wikilabels" project Buster deprecation - https://phabricator.wikimedia.org/T367562#10070111 (Andrew) emailed today: > Hello!
> > Your cloud-vps project 'wikilabels' is currently under multiple threats. > > I... [16:49:57] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "schematreerecommender" project Buster deprecation - https://phabricator.wikimedia.org/T367552#10070113 (Andrew) Hi! I don't see evidence that there has been any progress with this project. Please respond with a plan and timeline if you would like me to not d... [16:54:29] (PS6) Pwangai: Exempt Test Group Repositories [labs/tools/sonarqubebot] - https://gerrit.wikimedia.org/r/1063008 (https://phabricator.wikimedia.org/T372565) [16:57:31] (CR) Pwangai: Exempt Test Group Repositories (3 comments) [labs/tools/sonarqubebot] - https://gerrit.wikimedia.org/r/1063008 (https://phabricator.wikimedia.org/T372565) (owner: Pwangai) [17:16:08] Cloud-VPS: Cloud-VPS OpenTofu provider is not working on M1 Macs - https://phabricator.wikimedia.org/T353019#10070180 (taavi) [17:16:54] cloud-services-team, Cloud-VPS, ARM support: Support terraform.wmcloud.org/registry/cloudvps on MacOS arm64 clients - https://phabricator.wikimedia.org/T372361#10070178 (taavi) →Duplicate dup:T353019 [17:41:34] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [17:43:26] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29688 bytes in 0.790 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [17:56:09] VPS-project-devtools, collaboration-services, GitLab, Release-Engineering-Team: https://gitlab.devtools.wmcloud.org is being indexed by google (and scoring pretty high) - https://phabricator.wikimedia.org/T372538#10070228 (brennen) p:Triage→Medium [17:57:40] Cloud-VPS, ARM support: Cloud-VPS OpenTofu provider is not working on M1 Macs - https://phabricator.wikimedia.org/T353019#10070244
(bd808) [18:54:38] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [18:55:30] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29688 bytes in 0.411 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [18:59:41] RESOLVED: CloudVPSDesignateLeaks: Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:00:25] (CR) Kosta Harlan: Exempt Test Group Repositories (1 comment) [labs/tools/sonarqubebot] - https://gerrit.wikimedia.org/r/1063008 (https://phabricator.wikimedia.org/T372565) (owner: Pwangai) [19:18:45] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "striker" project Buster deprecation - https://phabricator.wikimedia.org/T367555#10070449 (Andrew) I created a new Bookworm VM and moved the cinder volume over. I expected to be able to launch striker with 'docker compose' but... ` root@striker-docker-02:/sr...
[21:53:33] Tool-schedule-deployment: Specify the deploy window via URL in the scheduler tool - https://phabricator.wikimedia.org/T372059#10070688 (bd808) Open→In progress a:bd808 [23:24:52] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [23:25:48] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29689 bytes in 5.722 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [23:38:14] cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure: Remove or replace deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation) - https://phabricator.wikimedia.org/T370460#10070859 (Eevans) By way of an update: I built a new instance —de...