[01:50:41] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [02:25:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [03:10:36] RESOLVED: Toolforge Kyverno low policy resources: Toolforge Kyverno has low amount of policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_low_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+low+policy+resources [03:35:22] FIRING: HAProxyBackendUnavailable: HAProxy service mysql backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [03:40:22] RESOLVED: HAProxyBackendUnavailable: HAProxy service mysql backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [04:15:36] FIRING: Toolforge Kyverno low policy resources: Toolforge Kyverno has low amount of policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_low_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+low+policy+resources [04:35:36] RESOLVED: Toolforge Kyverno low policy resources: Toolforge Kyverno has low amount of policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_low_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+low+policy+resources [04:41:22] FIRING: [2x] HAProxyBackendUnavailable: HAProxy service magnum-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [04:46:22] RESOLVED: [2x] HAProxyBackendUnavailable: HAProxy service magnum-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [05:05:36] FIRING: Toolforge Kyverno low policy resources: Toolforge Kyverno has low amount of policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_low_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+low+policy+resources [05:20:36] RESOLVED: Toolforge Kyverno low policy resources: Toolforge Kyverno has low amount of policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_low_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+low+policy+resources [05:36:29] 10Tool-yearinreview, 07good first task: Center Align Username Placeholder for Consistency - 
https://phabricator.wikimedia.org/T373919#10115861 (10Aklapper) @ChandraPratap25: Hi and thank you for your interest! Please check https://www.mediawiki.org/wiki/New_Developers (and all of its communication section!). T... [05:50:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [06:25:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [06:30:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [07:31:29] (03PS1) 10Slyngshede: cloudidp-dev: Rename Horizon OIDC service. [labs/private] - 10https://gerrit.wikimedia.org/r/1070530 [07:36:06] (03CR) 10Slyngshede: [V:03+2 C:03+2] cloudidp-dev: Rename Horizon OIDC service. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1070530 (owner: 10Slyngshede) [07:39:54] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972 (10dcaro) 03NEW [07:45:24] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10116104 (10dcaro) From `kyverno-admission-controller-7cb7c68647-zwrvv` only, first trough: ` 1764 2024-09-04T02:48:29Z... [07:45:25] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10116103 (10dcaro) those errors happen more often than just during the troughs, I see this also around one of the troughs... [07:49:29] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10116112 (10dcaro) btm9l pod (the first to change leader election) restarted by itself: ` root@tools-k8s-control-7:~/tool... [07:54:07] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10116131 (10dcaro) [07:58:34] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10116144 (10dcaro) Similar error from the other controller pods that restarted, they lost the leadership and restarted th... 
[07:59:40] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10116148 (10dcaro) kube-controller-manager seems to be restarting somewhat too: ` kube-apiserver-tools-k8s-control-7... [08:01:22] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10116151 (10dcaro) Interesting, same issue, timeout and leader lost: ` root@tools-k8s-control-7:~/toolforge-deploy/compon... [08:27:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [08:27:05] (03CR) 10Raymond Ndibe: [C:03+2] kyverno.copy_images_to_registry: add missing kyverno-cli image [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1070262 (https://phabricator.wikimedia.org/T359641) (owner: 10David Caro) [08:31:26] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10116227 (10aborrero) There was a network switch problem overnight as well: ` 06:51 <+icinga-wm> PROBLEM - BGP status on... 
[08:31:47] (03Merged) 10jenkins-bot: kyverno.copy_images_to_registry: add missing kyverno-cli image [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1070262 (https://phabricator.wikimedia.org/T359641) (owner: 10David Caro) [08:35:17] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10116231 (10Raymond_Ndibe) What failed upgrade are we talking about @dcaro ? The upgrade to 1.26 or did we attempt doing... [08:36:00] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10116242 (10Raymond_Ndibe) oooh this https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests... [08:38:20] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10116250 (10aborrero) >>! In T373972#10116242, @Raymond_Ndibe wrote: > oooh this https://gitlab.wikimedia.org/repos/cloud... [09:05:41] 06cloud-services-team, 10Cloud-VPS: Cloud VPS: investigate conntrack table usage on cloudvirt1050 - https://phabricator.wikimedia.org/T373816#10116354 (10fnegri) @aborrero I think this task can be resolved, unless you want to do further investigations. 
[09:50:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [09:53:54] 06cloud-services-team, 10Cloud-VPS: Cloud VPS: investigate conntrack table usage on cloudvirt1050 - https://phabricator.wikimedia.org/T373816#10116541 (10aborrero) 05In progress→03Resolved [09:57:22] 06cloud-services-team: cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986 (10aborrero) 03NEW [09:59:11] 06cloud-services-team: cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10116567 (10aborrero) [09:59:27] 06cloud-services-team: cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10116568 (10aborrero) [09:59:29] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10116569 (10aborrero) [10:01:08] 06cloud-services-team: cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10116574 (10aborrero) Another somewhat related (maybe) ticket: {T371879} [10:01:54] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10116565 (10aborrero) I think the current theory is that this is caused by the api-server being unrealiable, which is bei... 
[10:02:04] 06cloud-services-team: cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10116580 (10aborrero) Also somewhat related: {T316544} [10:02:31] 06cloud-services-team: cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10116584 (10aborrero) p:05Triage→03High [10:35:29] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10116718 (10aborrero) I just checked the etcd logs on server `tools-k8s-etcd-22`. There are a few leader elections around... [10:35:30] 06cloud-services-team: cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10116715 (10cmooney) [10:41:11] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10116730 (10aborrero) p:05Triage→03High [10:42:28] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10116740 (10aborrero) There are definitely network instabilities in etcd: `lines=10 Sep 04 04:18:25 tools-k8s-etcd-23 et... [10:47:36] 10Tool-yearinreview, 07good first task: Center Align Username Placeholder for Consistency - https://phabricator.wikimedia.org/T373919#10116749 (10ChandraPratap25) Unable TO Create a Account On GitLab **Your account is pending approval from your GitLab administrator and hence blocked. Please contact your GitL... 
[10:48:56] 06cloud-services-team: cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10116750 (10cmooney) [10:48:57] (03update) 10dcaro: k8s: upgrade to 1.27.16 [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/188 (https://phabricator.wikimedia.org/T359641) [10:56:52] 06cloud-services-team, 10Toolforge, 07Epic: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge webservices and bots - https://phabricator.wikimedia.org/T127367#10116779 (10aborrero) [10:58:04] 06cloud-services-team, 10Toolforge: Toolforge: systemd monitoring - https://phabricator.wikimedia.org/T215155#10116797 (10aborrero) 05Open→03Resolved a:03aborrero this is present already. [11:01:09] (03approved) 10raymond-ndibe: kyverno_pod_policy: remove kyverno versions from annotations [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/61 (https://phabricator.wikimedia.org/T359641) (owner: 10dcaro) [11:02:48] 06cloud-services-team, 10Cloud-VPS, 07Epic: Build Prometheus service for use by all Cloud VPS projects and their instances - https://phabricator.wikimedia.org/T266050#10116819 (10aborrero) 05Resolved→03Open I'm boldly reopen to keep this "epic" task as the entry point for all the other subtasks. 
[11:06:58] (03merge) 10raymond-ndibe: kyverno_pod_policy: remove kyverno versions from annotations [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/61 (https://phabricator.wikimedia.org/T359641) (owner: 10dcaro) [11:07:57] (03update) 10raymond-ndibe: jobs,cronjobs: add clarifying note on why the limits [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/60 (https://phabricator.wikimedia.org/T372720) (owner: 10dcaro) [11:09:31] (03open) 10project_1317_bot_df3177307bed93c3f34e421e26c86e38: maintain-kubeusers: bump to 0.0.167-20240904110710-671baa77 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/512 (https://phabricator.wikimedia.org/T359641) [11:12:40] !log raymond@ubuntu toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers [11:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [11:20:02] !log raymond@ubuntu toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-kubeusers [11:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [11:58:55] (03update) 10dcaro: k8s: upgrade to 1.27.16 [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/188 (https://phabricator.wikimedia.org/T359641) [12:14:20] 10Tool-yearinreview, 07good first task: Center Align Username Placeholder for Consistency - https://phabricator.wikimedia.org/T373919#10117072 (10ChandraPratap25) {F57457148} [12:22:44] (03PS1) 10Slyngshede: P:idp Add Keystone dummy secret [labs/private] - 10https://gerrit.wikimedia.org/r/1070588 [12:26:28] (03PS2) 10Slyngshede: P:idp Add Keystone dummy secret [labs/private] - 
10https://gerrit.wikimedia.org/r/1070588 [12:27:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [12:33:22] 06cloud-services-team: codfw1dev: rabbitmq is not working because some auth failures - https://phabricator.wikimedia.org/T374002 (10aborrero) 03NEW [12:37:08] (03merge) 10sstefanova: pre-commit: Autoupdate [repos/cloud/toolforge/ingress-admission] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-admission/-/merge_requests/10 (owner: 10group_203_bot_4866fc124f4b41659f667468a6115cf3) [12:37:26] (03merge) 10sstefanova: poetry: Autoupdate [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/122 (owner: 10group_203_bot_4866fc124f4b41659f667468a6115cf3) [12:37:38] (03update) 10sstefanova: pre-commit: Autoupdate [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/123 (owner: 10group_203_bot_4866fc124f4b41659f667468a6115cf3) [12:37:46] (03merge) 10sstefanova: pre-commit: Autoupdate [repos/cloud/toolforge/volume-admission] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/17 (owner: 10group_203_bot_4866fc124f4b41659f667468a6115cf3) [12:40:00] (03open) 10project_1317_bot_df3177307bed93c3f34e421e26c86e38: ingress-admission: bump to 0.0.50-20240904123720-27c26acf [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/513 [12:40:41] (03open) 10project_1317_bot_df3177307bed93c3f34e421e26c86e38: jobs-api: bump to 0.0.333-20240904123738-79dd7646 [repos/cloud/toolforge/toolforge-deploy] - 
10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/514 [12:41:44] (03merge) 10sstefanova: pre-commit: Autoupdate [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/123 (owner: 10group_203_bot_4866fc124f4b41659f667468a6115cf3) [12:42:01] (03open) 10project_1317_bot_df3177307bed93c3f34e421e26c86e38: volume-admission: bump to 0.0.55-20240904123757-4b68dd89 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/515 [12:44:25] (03update) 10project_1317_bot_df3177307bed93c3f34e421e26c86e38: jobs-api: bump to 0.0.333-20240904123738-79dd7646 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/514 [12:44:55] !log sstefanova@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component ingress-admission [12:45:13] !log sstefanova@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component ingress-admission [12:45:55] 10cloud-services-team (FY2024/2025-Q1-Q2): [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements - https://phabricator.wikimedia.org/T374005 (10dcaro) 03NEW [12:46:04] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component ingress-admission [12:47:29] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10117265 (10dcaro) [12:47:30] 10cloud-services-team (FY2024/2025-Q1-Q2): [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements - https://phabricator.wikimedia.org/T374005#10117266 (10dcaro) [12:47:44] 10cloud-services-team (FY2024/2025-Q1-Q2): [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements - 
https://phabricator.wikimedia.org/T374005#10117279 (10dcaro) We should start with 1005, as we have 2 mons already on C8 [12:51:26] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component ingress-admission [13:02:02] !log sstefanova@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component ingress-admission [13:02:08] (03update) 10sstefanova: ingress-admission: bump to 0.0.50-20240904123720-27c26acf [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/513 (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [13:02:18] !log sstefanova@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component ingress-admission [13:03:23] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component ingress-admission [13:07:36] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component ingress-admission [13:08:11] (03merge) 10sstefanova: ingress-admission: bump to 0.0.50-20240904123720-27c26acf [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/513 (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [13:10:52] (03update) 10sstefanova: volume-admission: bump to 0.0.55-20240904123757-4b68dd89 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/515 (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [13:18:32] !log sstefanova@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component volume-admission [13:18:49] !log sstefanova@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component volume-admission [13:28:57] !log dcaro@cloudcumin1001 
toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component volume-admission [13:33:58] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component volume-admission [13:35:24] (03approved) 10fnegri: k8s: upgrade to 1.27.16 [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/188 (https://phabricator.wikimedia.org/T359641) (owner: 10dcaro) [13:35:59] !log sstefanova@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component volume-admission [13:36:15] !log sstefanova@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component volume-admission [13:37:11] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component volume-admission [13:41:37] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component volume-admission [13:43:35] (03approved) 10sstefanova: volume-admission: bump to 0.0.55-20240904123757-4b68dd89 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/515 (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [13:47:46] (03update) 10raymond-ndibe: maintain-kubeusers: bump to 0.0.167-20240904110710-671baa77 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/512 (https://phabricator.wikimedia.org/T359641) (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [13:49:05] !log raymond@ubuntu toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers [13:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:50:42] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [13:51:13] (03merge) 10sstefanova: volume-admission: bump to 0.0.55-20240904123757-4b68dd89 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/515 (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [13:52:59] (03approved) 10raymond-ndibe: k8s: upgrade to 1.27.16 [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/188 (https://phabricator.wikimedia.org/T359641) (owner: 10dcaro) [13:53:09] (03update) 10sstefanova: jobs-api: bump to 0.0.333-20240904123738-79dd7646 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/514 (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [13:53:13] (03update) 10sstefanova: jobs-api: bump to 0.0.333-20240904123738-79dd7646 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/514 (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [13:55:16] !log raymond@ubuntu toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-kubeusers [13:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:56:12] !log raymond@ubuntu tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers [13:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [13:56:35] !log sstefanova@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [13:56:52] (03update) 10raymond-ndibe: k8s: upgrade to 1.27.16 
[repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/188 (https://phabricator.wikimedia.org/T359641) (owner: 10dcaro) [13:57:13] !log sstefanova@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api [13:57:50] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [14:01:47] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api [14:02:38] !log raymond@ubuntu tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-kubeusers [14:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [14:02:56] !log sstefanova@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [14:03:33] (03update) 10raymond-ndibe: maintain-kubeusers: bump to 0.0.167-20240904110710-671baa77 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/512 (https://phabricator.wikimedia.org/T359641) (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [14:03:42] !log sstefanova@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api [14:03:54] (03merge) 10raymond-ndibe: maintain-kubeusers: bump to 0.0.167-20240904110710-671baa77 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/512 (https://phabricator.wikimedia.org/T359641) (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [14:04:38] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14), 13Patch-For-Review: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641#10117535 (10Raymond_Ndibe) [14:04:44] 
!log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [14:05:26] (03update) 10sstefanova: jobs-api: bump to 0.0.333-20240904123738-79dd7646 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/514 (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [14:08:48] 06cloud-services-team, 10Cloud-VPS, 07IPv6: Some WMCS clusters have inconsistent AAAA DNS records for the primary IPv6 of the hosts - https://phabricator.wikimedia.org/T312557#10117550 (10joanna_borun) p:05Triage→03Low [14:08:49] 10cloud-services-team (Hardware), 06DC-Ops, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Project: [cookbooks.ceph] create a script to get the list of rbd images affected by stuck/inactive PGs - https://phabricator.wikimedia.org/T331636#10117553 (10joanna_borun) p:05Triage→03Low [14:08:58] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api [14:09:38] (03merge) 10sstefanova: jobs-api: bump to 0.0.333-20240904123738-79dd7646 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/514 (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [14:10:10] 06cloud-services-team, 10Toolforge, 10Sustainability (Incident Followup): Create an alert for K8s cronjobs - https://phabricator.wikimedia.org/T308203#10117558 (10joanna_borun) p:05Triage→03Low [14:15:45] 06cloud-services-team, 10Toolforge: LDAP cleanup after tool deletion does not always work - https://phabricator.wikimedia.org/T334040#10117571 (10joanna_borun) p:05Triage→03Low [14:16:36] 06cloud-services-team: ceph: Update netbox status when bootstrapping a new osd node - https://phabricator.wikimedia.org/T295132#10117581 (10joanna_borun) p:05Triage→03Low [14:16:41] 06cloud-services-team: ceph: Update netbox status when bootstrapping a new osd node 
- https://phabricator.wikimedia.org/T295132#10117586 (10joanna_borun) 05Open→03Declined [14:16:56] 06cloud-services-team, 10Toolforge, 07Kubernetes: Pods getting stuck in "Terminating" status - https://phabricator.wikimedia.org/T335543#10117591 (10joanna_borun) 05Open→03Resolved [14:18:11] 06cloud-services-team, 10Cloud-VPS: Detect and alert on rabbitmq splitbrain/partition - https://phabricator.wikimedia.org/T335304#10117593 (10joanna_borun) 05Open→03Resolved [14:18:35] 06cloud-services-team, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Unplanned: [registry-admission-webhook] Investigate why helm did not override the selector on the service on deployment - https://phabricator.wikimedia.org/T320665#10117595 (10Raymond_Ndibe) [14:19:32] 06cloud-services-team, 10Cloud-VPS, 07Puppet: Remove prod-specific bits from cloud puppetmasters - https://phabricator.wikimedia.org/T309281#10117597 (10joanna_borun) p:05Triage→03Low [14:21:32] 06cloud-services-team, 10wikitech.wikimedia.org: Find or recreate tool to "grep" wikitech using CirrusSearch data - https://phabricator.wikimedia.org/T308154#10117599 (10joanna_borun) p:05Triage→03Low [14:22:26] 06cloud-services-team, 10Cloud-VPS: mysterious oom issues on VMs - https://phabricator.wikimedia.org/T337806#10117601 (10aborrero) 05Open→03Invalid the VMs no longer exists. 
[14:22:55] 06cloud-services-team, 10Toolforge: Revisit Toolforge automated package updates and version pinnings - https://phabricator.wikimedia.org/T290494#10117603 (10joanna_borun) p:05Triage→03Low [14:27:37] 06cloud-services-team, 10Toolforge: Toolforge - add support for SSH on port 443 - https://phabricator.wikimedia.org/T337241#10117610 (10joanna_borun) p:05Triage→03Low [14:28:34] 06cloud-services-team, 10Cloud-VPS: codfw1dev ldap tls certificate names do not match dns used by labtestwikitech - https://phabricator.wikimedia.org/T342185#10117617 (10joanna_borun) p:05Triage→03Low [14:29:34] 06cloud-services-team, 10Cloud-VPS: The upgrade_openstack_node cookbook doesn't silence everything that needs silencing - https://phabricator.wikimedia.org/T323087#10117620 (10joanna_borun) p:05Triage→03Low [14:30:04] 06cloud-services-team, 10Cloud-VPS: [wmcs-cookbooks] Tidy up the new repo - https://phabricator.wikimedia.org/T326978#10117621 (10joanna_borun) p:05Triage→03Low [14:30:15] 06cloud-services-team: codfw1dev: rabbitmq is not working because some auth failures - https://phabricator.wikimedia.org/T374002#10117622 (10aborrero) p:05Triage→03Medium a:03Andrew [14:30:32] 06cloud-services-team, 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Project: [cookbooks] Refactor the specific wmcs libraries into spicerack module - https://phabricator.wikimedia.org/T319450#10117626 (10fnegri) p:05High→03Low [14:30:40] 06cloud-services-team, 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Project: [alerting][cookbooks] When removing silences, give some time for the alerts to clear - https://phabricator.wikimedia.org/T318654#10117628 (10joanna_borun) p:05Triage→03Low [14:31:12] 06cloud-services-team, 10Toolforge: [infra,k8s] remove deprecated kubelet flags before 1.27 upgrade - https://phabricator.wikimedia.org/T370245#10117632 (10Raymond_Ndibe) [14:31:14] 06cloud-services-team, 10Cloud-VPS, 05Cloud-Services-Origin-Team, 
07Cloud-Services-Worktype-Project: [alerting] Use the `services` label in cookbook to silence groups of alerts - https://phabricator.wikimedia.org/T318653#10117633 (10joanna_borun) p:05Triage→03Medium [14:32:02] 06cloud-services-team, 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Project: cookbooks: for --interactive flags, add an option to skip the rest - https://phabricator.wikimedia.org/T315341#10117641 (10joanna_borun) p:05Triage→03Low [14:32:04] 10cloud-services-team (Hardware): cloudservices1006: debian bullseye installer hangs when partitioning /srv - https://phabricator.wikimedia.org/T345731#10117638 (10aborrero) 05Open→03Invalid the server is now Debian bookworm. [14:32:39] 06cloud-services-team, 10Toolforge: [infra,k8s] remove deprecated kubelet flags before 1.27 upgrade - https://phabricator.wikimedia.org/T370245#10117645 (10Slst2020) a:05Slst2020→03None [14:32:49] 06cloud-services-team, 10Cloud-VPS: Add some monitoring/non-paging alerts to codfw1dev - https://phabricator.wikimedia.org/T344440#10117649 (10joanna_borun) p:05Triage→03High [14:32:52] 06cloud-services-team: openstack: consider automating DB grants - https://phabricator.wikimedia.org/T346619#10117650 (10aborrero) p:05Triage→03Low [14:34:03] 06cloud-services-team, 10Cloud-VPS, 10Spicerack, 10SRE-tools, and 2 others: cookbooks: for --interactive flags, add an option to skip the rest - https://phabricator.wikimedia.org/T315341#10117647 (10fnegri) [14:35:54] 10Cloud Services Proposals, 06cloud-services-team: Split cloud-announce into two lists: toolforge-announce and cloudvps-announce - https://phabricator.wikimedia.org/T334748#10117657 (10joanna_borun) 05Stalled→03Declined [14:35:55] 06cloud-services-team, 13Patch-For-Review: cloudgw: replace keepalived with BGP - https://phabricator.wikimedia.org/T347687#10117654 (10aborrero) 05Open→03Stalled p:05Triage→03Low [14:38:11] 06cloud-services-team, 10Cloud-VPS: rabbitmq: missing heartbeats issue - 
https://phabricator.wikimedia.org/T347017#10117667 (10joanna_borun) p:05Triage→03Low [14:38:36] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE: cloud: codfw: decide on new ceph cluster details - https://phabricator.wikimedia.org/T346725#10117669 (10joanna_borun) p:05Triage→03Low [14:39:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-idp-1 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [14:39:32] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE: cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661#10117671 (10joanna_borun) p:05Triage→03Low [14:39:33] 06cloud-services-team: Remove Icinga checks for Cloud VPS projects (not: infrastructure) - https://phabricator.wikimedia.org/T345983#10117673 (10joanna_borun) p:05Triage→03Medium [14:41:57] 06cloud-services-team: cloud: consider creating a reproducible local development environment for openstack-helm-based Cloud VPS - https://phabricator.wikimedia.org/T346785#10117678 (10joanna_borun) 05Open→03Declined [14:42:01] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE: cloud: codfw: decide on new ceph cluster details - https://phabricator.wikimedia.org/T346725#10117686 (10aborrero) 05Open→03Declined not working on this at the moment. 
[14:42:09] 06cloud-services-team, 10Toolforge, 10Sustainability (Incident Followup): Create an alert for K8s cronjobs - https://phabricator.wikimedia.org/T308203#10117682 (10dcaro) →14Duplicate dup:03T357977 [14:42:12] 06cloud-services-team, 10Cloud-VPS, 06SRE-OnFire, 05Cloud-Services-Origin-Team, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681#10117692 (10fnegri) [14:42:22] 06cloud-services-team, 10Cloud-VPS, 06SRE-OnFire, 05Cloud-Services-Origin-Team, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681#10117696 (10joanna_borun) p:05Triage→03Medium [14:42:33] 10Toolforge, 13Patch-For-Review: [toolforge.infra] create fullstack tests - https://phabricator.wikimedia.org/T357977#10117684 (10dcaro) [14:42:36] 06cloud-services-team: cloud: introduce eqiad2dev region for openstack-in-kubernetes PoC via openstack-helm - https://phabricator.wikimedia.org/T346665#10117689 (10aborrero) 05Open→03Declined not working on this at the moment. [14:42:44] 06cloud-services-team: cloud: introduce a kubernetes undercloud to run openstack (via openstack-helm) - https://phabricator.wikimedia.org/T342750#10117693 (10aborrero) 05Open→03Declined not working on this at the moment. [14:42:48] 06cloud-services-team: clouddb1019 memory alert - https://phabricator.wikimedia.org/T346826#10117705 (10joanna_borun) 05Open→03Resolved [14:42:59] 06cloud-services-team, 10Cloud-VPS: systemd-machined crashing on some cloudvirts - https://phabricator.wikimedia.org/T351203#10117707 (10joanna_borun) 05Open→03Resolved [14:43:07] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE: cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661#10117702 (10aborrero) 05Open→03Declined not working on this at the moment. 
[14:43:20] 06cloud-services-team: haproxy: install some command line interface - https://phabricator.wikimedia.org/T367956#10117724 (10aborrero) p:05Triage→03Low [14:44:01] 06cloud-services-team: [wmf-sre-laptop] fetch public keys for Cloud bastions - https://phabricator.wikimedia.org/T329322#10117727 (10joanna_borun) p:05Triage→03Low [14:44:34] 06cloud-services-team, 06Tech-Docs-Team, 07Documentation: What are our documentation wikis for? - https://phabricator.wikimedia.org/T324210#10117739 (10joanna_borun) 05Open→03Invalid [14:44:49] 06cloud-services-team, 10Cloud-VPS: neutron agents losing RabbitMQ connectivity don't crash properly - https://phabricator.wikimedia.org/T311149#10117740 (10aborrero) p:05Triage→03Low [14:44:57] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE: cloudswitch: codfw: figure out procurement - https://phabricator.wikimedia.org/T346724#10117697 (10aborrero) 05Open→03Declined not working on this at the moment. [14:45:00] 06cloud-services-team, 10Toolforge: Consider improving quota workflow - https://phabricator.wikimedia.org/T306324#10117712 (10aborrero) 05Open→03Resolved a:03aborrero [14:45:20] 06cloud-services-team, 10Observability-Alerting: Automatically close stale alertmanager created tasks - https://phabricator.wikimedia.org/T352079#10117744 (10joanna_borun) p:05Triage→03Medium [14:45:31] 06cloud-services-team, 10Cloud-VPS: SPF records for wmcloud.org and wmflabs.org are out of sync - https://phabricator.wikimedia.org/T352555#10117746 (10aborrero) p:05Triage→03Low [14:46:26] 06cloud-services-team, 10Toolforge: Do something to Toolforge tools with no non-blocked maintainers - https://phabricator.wikimedia.org/T320342#10117759 (10joanna_borun) p:05Triage→03Medium [14:47:47] 06cloud-services-team, 10Cloud-VPS: Neutron policy does not allow the admin role to modify security groups - https://phabricator.wikimedia.org/T348582#10117765 (10aborrero) p:05Triage→03Low [14:49:02] 06cloud-services-team, 10Cloud-VPS: Neutron policy 
does not allow the admin role to modify security groups - https://phabricator.wikimedia.org/T348582#10117767 (10aborrero) cc @Andrew [14:50:31] 06cloud-services-team, 10Toolforge, 07Epic: Encrypt all Toolforge internal traffic - https://phabricator.wikimedia.org/T329667#10117762 (10aborrero) p:05Triage→03Low [14:53:09] 10Tools: QuickStatements anti-abuse measure (rate limit?) - Cannot automatically assign ID - https://phabricator.wikimedia.org/T350262#10117775 (10M2k_dewiki) Related: https://phabricator.wikimedia.org/T272032 [14:53:39] 10Tools: QuickStatements anti-abuse measure (rate limit?) - Cannot automatically assign ID - https://phabricator.wikimedia.org/T350262#10117777 (10M2k_dewiki) Also see * https://www.wikidata.org/wiki/Wikidata:Project_chat#Mass-import_policy (31st of August 2024) * https://www.wikidata.org/w/index.php?title=Wiki... [14:57:31] 06cloud-services-team, 10Cloud-VPS: Neutron policy does not allow the admin role to modify security groups - https://phabricator.wikimedia.org/T348582#10117788 (10aborrero) a:03Andrew [14:58:41] (03open) 10sstefanova: calico: correct kubeVersion [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/516 [15:01:19] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641#10117793 (10dcaro) [15:01:22] 10Toolforge: k8s-status tool crashes because of assumption that all container images will have a tag associated - https://phabricator.wikimedia.org/T374017 (10bd808) 03NEW [15:02:14] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641#10117796 (10dcaro) [15:03:27] 06cloud-services-team: cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10117835 (10dcaro) Related, we have 
to install one of the mons that's out of C8 to be able to drain the rack {T374005} [15:11:11] 06cloud-services-team, 10Cloud-VPS: CloudVPS: research VXLAN implementation for neutron - https://phabricator.wikimedia.org/T248881#10117880 (10aborrero) 05Stalled→03Resolved a:03aborrero done with {T373869}. [15:12:22] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS: Cloud VPS: design target vxlan setup - https://phabricator.wikimedia.org/T373869#10117873 (10aborrero) 05In progress→03Resolved in a meeting, I shared this design today with @cmooney @dcaro and @fnegri with no major objections and a few comments:... [15:13:44] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS: openstack: double check VXLAN-based flat network implementation - https://phabricator.wikimedia.org/T374020 (10aborrero) 03NEW [15:15:28] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS: Cloud VPS: design target vxlan setup - https://phabricator.wikimedia.org/T373869#10117922 (10aborrero) follow up in {T374020} [15:17:02] (03open) 10aborrero: codfw1dev: instrument VXLAN-based flat network [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/30 (https://phabricator.wikimedia.org/T374020) [15:17:14] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/30 [15:17:33] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS, 13Patch-For-Review: openstack: instrument VXLAN-based flat network - https://phabricator.wikimedia.org/T374020#10117924 (10aborrero) p:05Triage→03Medium [15:17:49] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/30 [15:18:00] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS, 13Patch-For-Review: openstack: instrument VXLAN-based flat network - 
https://phabricator.wikimedia.org/T374020#10117938 (10aborrero) [15:22:58] (03PS6) 10David Caro: Revert^2 "openstack.tofu: use run_script instead of reimplementing it" [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1070020 [15:24:40] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/30 [15:25:26] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/30 [15:29:24] 06cloud-services-team, 10Cloud-VPS, 07Epic: tofu-infra: the cookbook should use a different git tree copy than the main one - https://phabricator.wikimedia.org/T374022 (10aborrero) 03NEW [15:29:58] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance cloudinfra-idp-1 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:30:20] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/30 [15:30:55] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/30 [15:33:01] (03CR) 10Arturo Borrero Gonzalez: Revert^2 "openstack.tofu: use run_script instead of reimplementing it" (032 comments) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1070020 (owner: 10David Caro) [15:36:32] 06cloud-services-team: openstack: eqiad1: designate is maybe not working as expected (2024-09-04) - https://phabricator.wikimedia.org/T374023 (10aborrero) 03NEW [15:46:25] 06cloud-services-team: ceph: Update netbox status when bootstrapping a new osd node - https://phabricator.wikimedia.org/T295132#10118070 (10fnegri) 05Declined→03Invalid The state 
was set correctly for the latest osd node that was added. [15:51:50] 10Cloud Services Proposals, 06cloud-services-team: Split cloud-announce into two lists: toolforge-announce and cloudvps-announce - https://phabricator.wikimedia.org/T334748#10118101 (10fnegri) This idea never got much traction and I personally think it makes sense to keep a single list, as most communicati... [15:53:56] FIRING: SystemdUnitDown: The service unit ceph-mon@cloudcephmon1005.service is in failed status on host cloudcephmon1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephmon1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:55:28] FIRING: PuppetAgentDisabled: Puppet agent disabled on instance tools-prometheus-6 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentDisabled [15:56:07] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS, 13Patch-For-Review: openstack: instrument VXLAN-based flat network - https://phabricator.wikimedia.org/T374020#10118104 (10aborrero) I feel blocked by {T374002} at the moment. [15:56:37] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS, 13Patch-For-Review: openstack: instrument VXLAN-based flat network - https://phabricator.wikimedia.org/T374020#10118108 (10aborrero) 05Open→03In progress [16:12:26] RESOLVED: SystemdUnitDown: The service unit ceph-mon@cloudcephmon1005.service is in failed status on host cloudcephmon1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcephmon1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:18:36] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Toolforge (Toolforge iteration 14): [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972#10118225 (10dcaro) 05Open→03In progress [16:32:35] 10cloud-services-team (FY2024/2025-Q1-Q2): [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements - https://phabricator.wikimedia.org/T374005#10118348 (10dcaro) [16:33:27] 10cloud-services-team (FY2024/2025-Q1-Q2): [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements - https://phabricator.wikimedia.org/T374005#10118359 (10dcaro) For cloudcephmon1005 I had to downgrade puppet from 5 to 7, cleanup the certs, run puppet, sign the cert on the puppetmaster1001, a... 
[16:35:30] !log dcaro@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (T373986) [16:35:36] T373986: cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986 [16:40:58] (03PS7) 10David Caro: Revert^2 "openstack.tofu: use run_script instead of reimplementing it" [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1070020 [16:41:27] (03CR) 10David Caro: Revert^2 "openstack.tofu: use run_script instead of reimplementing it" (032 comments) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1070020 (owner: 10David Caro) [16:42:08] 06cloud-services-team: cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10118451 (10dcaro) a:03dcaro [16:42:57] 10cloud-services-team (FY2024/2025-Q1-Q2): cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986#10118469 (10dcaro) [17:13:52] 06cloud-services-team, 06Data-Engineering, 05Cloud-Services-Origin-User: WMCS-roots paging responsibilities - https://phabricator.wikimedia.org/T344608#10118768 (10fnegri) [17:16:03] (03open) 10lucaswerkmeister: shell: drop --wait [repos/cloud/toolforge/tools-webservice] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/55 (https://phabricator.wikimedia.org/T373866) [17:16:08] 06cloud-services-team, 10Cloud-VPS: cloudcephosd1017 /dev/sdg (osd.132) failed - https://phabricator.wikimedia.org/T358945#10118771 (10dcaro) 05In progress→03Resolved Back up and running [17:16:13] 10cloud-services-team (FY2024/2025-Q1-Q2): Drain C8 rack - https://phabricator.wikimedia.org/T374043 (10dcaro) 03NEW [17:18:36] 10cloud-services-team (FY2024/2025-Q1-Q2): Drain C8 rack - https://phabricator.wikimedia.org/T374043#10118807 (10dcaro) [17:19:23] 10cloud-services-team (FY2024/2025-Q1-Q2): Drain C8 rack - https://phabricator.wikimedia.org/T374043#10118825 (10dcaro) @aborrero, @fnegri, @Andrew can you give the list a look and add any notes about if they need draining or not? 
[17:20:37] 10Toolforge: tools-webservice repo does not support merge requests from forks properly - https://phabricator.wikimedia.org/T374045 (10LucasWerkmeister) 03NEW [17:21:22] 10Toolforge: tools-webservice repo does not support merge requests from forks properly - https://phabricator.wikimedia.org/T374045#10118838 (10LucasWerkmeister) Note: I don’t know what the pipeline does and have no idea if pointing the script at the fork’s pipeline run would be enough. (E.g. maybe the original r... [17:21:43] (03update) 10lucaswerkmeister: shell: drop --wait [repos/cloud/toolforge/tools-webservice] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/54 (https://phabricator.wikimedia.org/T373866) [17:21:44] (03close) 10lucaswerkmeister: shell: drop --wait [repos/cloud/toolforge/tools-webservice] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/54 (https://phabricator.wikimedia.org/T373866) [17:24:33] 10Tool-Global-user-contributions, 10Special:GlobalContributions, 06Stewards-and-global-tools, 07Epic, 10Temporary accounts (Create/update essential tools/anti-abuse management): [Epic] Implement global user contributions feature - https://phabricator.wikimedia.org/T337089#10118844 (10Niharika) [17:25:04] 10Tool-Global-user-contributions, 10Special:GlobalContributions, 06Stewards-and-global-tools, 07Design, 10Temporary accounts (Blockers to pilot wiki deployment): [Design EPIC] Global User Contributions - https://phabricator.wikimedia.org/T349901#10118845 (10Niharika) [17:29:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [17:30:47] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.restart_openstack (exit_code=99) [17:33:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [17:36:29] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [17:50:42] FIRING: 
CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:56:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [17:59:02] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [18:14:31] 10wikitech.wikimedia.org: Requesting administrator access for Andrea_Denisse - https://phabricator.wikimedia.org/T374052 (10andrea.denisse) 03NEW [18:24:21] (03approved) 10sstefanova: shell: drop --wait [repos/cloud/toolforge/tools-webservice] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/55 (https://phabricator.wikimedia.org/T373866) (owner: 10lucaswerkmeister) [18:53:45] (03open) 10hunsvotti: Rename "Suggested images" to "Pics for Wikts" in titles [toolforge-repos/imgs-for-wikt] - 10https://gitlab.wikimedia.org/toolforge-repos/imgs-for-wikt/-/merge_requests/1 [18:59:45] (03update) 10hunsvotti: Rename "Suggested images" to "Pics for Wikts" in titles [toolforge-repos/imgs-for-wikt] - 10https://gitlab.wikimedia.org/toolforge-repos/imgs-for-wikt/-/merge_requests/1 [19:00:27] (03update) 10hunsvotti: Rename "Suggested images" to "Pics for Wikts" [toolforge-repos/imgs-for-wikt] - 10https://gitlab.wikimedia.org/toolforge-repos/imgs-for-wikt/-/merge_requests/1 [19:01:57] (03merge) 10hunsvotti: Rename "Suggested images" to "Pics for Wikts" [toolforge-repos/imgs-for-wikt] - 10https://gitlab.wikimedia.org/toolforge-repos/imgs-for-wikt/-/merge_requests/1 [19:06:02] (03open) 10hunsvotti: Update project name in tests also [toolforge-repos/imgs-for-wikt] - 10https://gitlab.wikimedia.org/toolforge-repos/imgs-for-wikt/-/merge_requests/2 [19:08:41] 10wikitech.wikimedia.org: Requesting administrator access for Andrea_Denisse - 
https://phabricator.wikimedia.org/T374052#10119258 (10Aklapper) [19:09:48] (03merge) 10hunsvotti: Update project name in tests also [toolforge-repos/imgs-for-wikt] - 10https://gitlab.wikimedia.org/toolforge-repos/imgs-for-wikt/-/merge_requests/2 [19:14:45] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) [19:19:37] (03open) 10hunsvotti: Add favicon [toolforge-repos/imgs-for-wikt] - 10https://gitlab.wikimedia.org/toolforge-repos/imgs-for-wikt/-/merge_requests/3 [19:19:41] (03merge) 10hunsvotti: Add favicon [toolforge-repos/imgs-for-wikt] - 10https://gitlab.wikimedia.org/toolforge-repos/imgs-for-wikt/-/merge_requests/3 [19:29:55] 10wikitech.wikimedia.org: Requesting administrator access for Andrea_Denisse - https://phabricator.wikimedia.org/T374052#10119300 (10bd808) 05Open→03Resolved a:03bd808 `{{Done}}` https://wikitech.wikimedia.org/w/index.php?title=Special:Log&logid=973292 [19:35:48] !log dcaro@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986) [19:35:55] T373986: cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986 [20:06:23] 10Toolforge: Upgrade python buildpack to v0.17.0 or newer for Poetry support - https://phabricator.wikimedia.org/T374056 (10bd808) 03NEW [20:47:39] 10Toolforge: k8s-status tool crashes because of assumption that all container images will have a tag associated - https://phabricator.wikimedia.org/T374017#10119532 (10bd808) `lang=python >>> import k8s.client >>> images = k8s.client.get_images(cached=True) >>> for image in images["items"].keys(): ... repo, na... 
[21:05:58] (03open) 10bd808: images: guard against images without a tag [toolforge-repos/k8s-status] - 10https://gitlab.wikimedia.org/toolforge-repos/k8s-status/-/merge_requests/4 (https://phabricator.wikimedia.org/T374017) [21:06:31] (03merge) 10bd808: images: guard against images without a tag [toolforge-repos/k8s-status] - 10https://gitlab.wikimedia.org/toolforge-repos/k8s-status/-/merge_requests/4 (https://phabricator.wikimedia.org/T374017) [21:17:14] (03open) 10bd808: bin: Raise limits and requests for increased data size [toolforge-repos/k8s-status] - 10https://gitlab.wikimedia.org/toolforge-repos/k8s-status/-/merge_requests/5 [21:17:54] (03merge) 10bd808: bin: Raise limits and requests for increased data size [toolforge-repos/k8s-status] - 10https://gitlab.wikimedia.org/toolforge-repos/k8s-status/-/merge_requests/5 [21:21:11] 10Tool-k8s-status, 10Toolforge: k8s-status tool crashes because of assumption that all container images will have a tag associated - https://phabricator.wikimedia.org/T374017#10119600 (10bd808) [21:25:19] 10Tool-k8s-status, 10Toolforge: k8s-status tool crashes because of assumption that all container images will have a tag associated - https://phabricator.wikimedia.org/T374017#10119596 (10bd808) 05Open→03Resolved a:03bd808 [21:37:12] 10Toolforge: Support running a job using an alternate service account - https://phabricator.wikimedia.org/T374062 (10bd808) 03NEW [21:50:42] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:55:10] 10cloud-services-team (FY2024/2025-Q1-Q2): Drain C8 rack - https://phabricator.wikimedia.org/T374043#10119678 (10Andrew) We should drain the osds and cloudvirts. The few others should be fine. 
[21:56:52] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [21:58:22] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [22:06:22] 06cloud-services-team: openstack: eqiad1: designate is maybe not working as expected (2024-09-04) - https://phabricator.wikimedia.org/T374023#10119705 (10Andrew) This was cleared up with sudo cookbook wmcs.openstack.restart_openstack --designate --cluster-name eqiad1 I don't know if it's a resource leak or wh... [22:06:28] 10Tool-k8s-status: k8s-status: List inactive cron jobs as image users - https://phabricator.wikimedia.org/T342848#10119706 (10bd808) 05Open→03In progress p:05Triage→03Medium a:03bd808 [22:06:58] 06cloud-services-team: openstack: eqiad1: designate is maybe not working as expected (2024-09-04) - https://phabricator.wikimedia.org/T374023#10119711 (10Andrew) a:05Andrew→03None [22:30:42] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [23:06:11] (03PS1) 10BCornwall: corto: Add gdrive-creds.json [labs/private] - 10https://gerrit.wikimedia.org/r/1070679 [23:06:48] (03CR) 10BCornwall: [V:03+2 C:03+2] corto: Add gdrive-creds.json [labs/private] - 10https://gerrit.wikimedia.org/r/1070679 (owner: 10BCornwall) [23:29:43] (03open) 10bd808: images: track images used by cronjobs [toolforge-repos/k8s-status] - 10https://gitlab.wikimedia.org/toolforge-repos/k8s-status/-/merge_requests/6 (https://phabricator.wikimedia.org/T342848) [23:34:55] (03merge) 10bd808: images: track images used by cronjobs [toolforge-repos/k8s-status] - 10https://gitlab.wikimedia.org/toolforge-repos/k8s-status/-/merge_requests/6 (https://phabricator.wikimedia.org/T342848) [23:39:08] 10Tool-k8s-status, 13Patch-For-Review: 
k8s-status: List inactive cron jobs as image users - https://phabricator.wikimedia.org/T342848#10119893 (10bd808) 05In progress→03Resolved [23:42:29] 10Tool-k8s-status, 13Patch-For-Review: k8s-status: List inactive cron jobs as image users - https://phabricator.wikimedia.org/T342848#10119892 (10bd808) Before: {F57461889,size=full} After: {F57461894,size=full}