[00:39:15] 10Tool-gitlab-account-approval: Approval job can get stuck and prevent subsequent jobs from firing - https://phabricator.wikimedia.org/T379130 (10bd808) 03NEW [00:46:02] 10Tool-gitlab-account-approval: Approval job can get stuck and prevent subsequent jobs from firing - https://phabricator.wikimedia.org/T379130#10294929 (10bd808) {T377781} would be a potential solution for this situation, but there are other things that can be done without platform support for timeouts or replac... [01:22:37] FIRING: CloudVPSDesignateLeaks: Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [01:42:38] 06cloud-services-team, 10Toolforge: Jobs hang on toolforge - https://phabricator.wikimedia.org/T379132 (10Leloiandudu) 03NEW [02:01:14] 06cloud-services-team, 10Toolforge: Jobs hang on toolforge - https://phabricator.wikimedia.org/T379132#10295012 (10JJMC89) Likely the same issue as {T379130} [02:35:22] FIRING: HAProxyBackendUnavailable: HAProxy service keystone-public-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [02:40:22] RESOLVED: HAProxyBackendUnavailable: HAProxy service keystone-public-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [03:15:56] FIRING: SystemdUnitDown: The service unit opentofu-infra-diff.service is in failed status on host cloudcontrol1007. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [03:56:11] 06cloud-services-team, 10Cloud-VPS: Frequent radosgw 500 errors with OpenTofu - https://phabricator.wikimedia.org/T360626#10295088 (10Raymond_Ndibe) To reproduce this using s3 bucket * ssh into cloudcontrol1005, cloudcontrol1006, cloudcontrol1007 (i.e. ssh cloudcontrol1006.eqiad.wmnet) * for each cloudcontrol,... [03:58:27] 06cloud-services-team, 10Cloud-VPS: Frequent radosgw 500 errors with OpenTofu - https://phabricator.wikimedia.org/T360626#10295089 (10Raymond_Ndibe) >>! In T360626#10292238, @fnegri wrote: > @Raymond_Ndibe can you paste one or more example commands that are failing? > > It's interesting that for OpenTofu it s... [04:07:45] 06cloud-services-team, 10Cloud-VPS: Frequent radosgw 500 errors with OpenTofu - https://phabricator.wikimedia.org/T360626#10295093 (10Raymond_Ndibe) >>! In T360626#10295088, @Raymond_Ndibe wrote: > To reproduce this using s3 bucket > * ssh into cloudcontrol1005, cloudcontrol1006, cloudcontrol1007 (i.e. ssh clo... [04:11:48] 06cloud-services-team, 10Cloud-VPS: Frequent radosgw 500 errors with OpenTofu - https://phabricator.wikimedia.org/T360626#10295094 (10Raymond_Ndibe) >>! In T360626#10292422, @dcaro wrote: > This might be related to this errors on logstash https://logstash.wikimedia.org/goto/c7fa935688ccd6ccda0e11b420b747d1 >... [04:14:05] 06cloud-services-team, 10Cloud-VPS: Frequent radosgw 500 errors with OpenTofu - https://phabricator.wikimedia.org/T360626#10295095 (10Raymond_Ndibe) It might also be worth looking at the sql driver. I am not sure how that part of it works, but if we are reading stuffs from sql, it might be worth it to look for... [04:46:45] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. 
[labs/tools/Isa] - https://gerrit.wikimedia.org/r/1087476 (owner: L10n-bot)
[05:11:56] FIRING: SystemdUnitDown: The systemd unit opentofu-infra-diff.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[05:12:02] cloud-services-team: SystemdUnitDown cloudcontrol1007:9100 The systemd unit opentofu-infra-diff.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T379133 (phaultfinder) NEW
[05:14:43] Cloud-VPS (Quota-requests): Request floating IP for wikiwho project - https://phabricator.wikimedia.org/T376637#10295113 (taavi) >>! In T376637#10293654, @MusikAnimal wrote: > I guess the more important question for now, is 185.15.56.49 a stable IP? I know as per the docs it's better to use the service name...
[05:22:37] FIRING: CloudVPSDesignateLeaks: Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [07:05:17] (03update) 10sstefanova: maintain-kubeusers: bump to 0.0.171-20241105173021-bf5186a3 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/580 (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [07:07:35] !log sstefanova@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers [07:12:40] !log sstefanova@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-kubeusers [07:14:06] !log sstefanova@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers [07:20:32] !log sstefanova@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-kubeusers [07:21:09] (03approved) 10sstefanova: maintain-kubeusers: bump to 0.0.171-20241105173021-bf5186a3 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/580 (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [07:21:12] (03merge) 10sstefanova: maintain-kubeusers: bump to 0.0.171-20241105173021-bf5186a3 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/580 (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [07:21:52] (03CR) 10Hashar: "I have no idea how the Heritage application is deployed/maintained/updated. 
I'd +2 it but the code might be auto pulled which would break " [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/1082452 (https://phabricator.wikimedia.org/T377939) (owner: 10Awight) [07:31:55] !log sstefanova@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component tools-webservice [07:31:57] !log sstefanova@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component tools-webservice [07:37:47] 10PAWS, 06Community-Tech, 10PHP-API-for-Wikisource, 10Wikimedia OCR: PAWS Code for Google OCR - https://phabricator.wikimedia.org/T379134 (10Akbarali) 03NEW [07:41:26] (03open) 10sstefanova: d/changelog: bump to 0.103.12 [repos/cloud/toolforge/tools-webservice] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/58 [07:46:38] !log sstefanova@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component tools-webservice [07:51:58] !log sstefanova@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component tools-webservice [07:52:31] !log sstefanova@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component tools-webservice [07:57:50] !log sstefanova@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component tools-webservice [07:57:51] 10PAWS, 06Community-Tech, 10PHP-API-for-Wikisource, 10Wikimedia OCR: PAWS Code for Google OCR - https://phabricator.wikimedia.org/T379134#10295172 (10Samwilson) Could you explain a bit more about why PAWS is relevant to this feature? The technical side of bulk OCR additions is not that tricky — it's the f... 
[08:01:33] (03approved) 10sstefanova: d/changelog: bump to 0.103.12 [repos/cloud/toolforge/tools-webservice] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/58 [08:01:36] (03merge) 10sstefanova: d/changelog: bump to 0.103.12 [repos/cloud/toolforge/tools-webservice] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/58 [08:08:38] 06cloud-services-team, 10Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T378676#10295208 (10Slst2020) a:03Slst2020 [08:08:43] 06cloud-services-team, 10Toolforge: [infra, k8s, webservice] remove deprecated kubectl --wait flag before k8s 1.29 upgrade - https://phabricator.wikimedia.org/T373866#10295193 (10Slst2020) 05Open→03Resolved done – the deprecation warning is gone now. ` tools.automated-toolforge-tests@tools-bastion-12:~... [08:22:21] 06cloud-services-team, 10Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T378676#10295231 (10Slst2020) 05Open→03In progress [08:31:10] 10PAWS, 06Community-Tech, 10PHP-API-for-Wikisource, 10Wikimedia OCR: PAWS Code for Google OCR - https://phabricator.wikimedia.org/T379134#10295242 (10Akbarali) >>! In T379134#10295172, @Samwilson wrote: > Could you explain a bit more about why PAWS is relevant to this feature? > > The technical side of bu... [08:34:37] 10PAWS, 06Community-Tech, 10PHP-API-for-Wikisource, 10Wikimedia OCR: PAWS Code for Google OCR - https://phabricator.wikimedia.org/T379134#10295244 (10Samwilson) Oh sure, that does make sense. Sorry for my scepticism! :-) [08:47:56] 06cloud-services-team, 10Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T378676#10295270 (10Slst2020) might this be related to the wikitech SUL migration? ` tools.wikitech-double-redirect-bot@tools-sgebastion-10:~$ toolforge jobs run --image tool-pywikibot/pywikibot-scripts-... 
[08:54:46] 10Tool-gitlab-account-approval: Approval job can get stuck and prevent subsequent jobs from firing - https://phabricator.wikimedia.org/T379130#10295287 (10dcaro) > I killed the stuck job Can you elaborate a bit on how did you kill the stuck job? (that might help pinpoint the underlying issue, and help trim down... [08:55:32] 06cloud-services-team, 10Toolforge: Jobs hang on toolforge - https://phabricator.wikimedia.org/T379132#10295289 (10dcaro) This also coincides with the upgrade to 1.28, the job is still stuck, that will help us troubleshoot the root issue, looking. [09:02:10] 06cloud-services-team, 10Toolforge: Jobs hang on toolforge - https://phabricator.wikimedia.org/T379132#10295293 (10dcaro) Hmm... I do see a bunch of stuck processes (in D status), so there might be a reporting issue too :/ [09:02:50] 06cloud-services-team, 10Toolforge: Jobs hang on toolforge - https://phabricator.wikimedia.org/T379132#10295292 (10dcaro) Looking a bit, the job is running on tools-k8s-worker-nfs-24, that does not seem to be reporting stuck processes: {F57685257} But I'm timing out trying to ssh to it, so definitely having... [09:08:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [09:11:56] FIRING: SystemdUnitDown: The systemd unit opentofu-infra-diff.service on node cloudcontrol1007 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [09:16:48] 10Tool-gitlab-account-approval: Approval job can get stuck and prevent subsequent jobs from firing - https://phabricator.wikimedia.org/T379130#10295309 (10dcaro) :/, adding a silly livenessProbe like 'echo "I'm alive"' does not help, as the container is actually able to execute that without issues (in this case,... [09:22:37] FIRING: CloudVPSDesignateLeaks: Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [09:27:14] 06cloud-services-team, 10Toolforge: Jobs hang on toolforge - https://phabricator.wikimedia.org/T379132#10295332 (10Leloiandudu) >>! In T379132#10295289, @dcaro wrote: > This also coincides with the upgrade to 1.28, the job is still stuck, that will help us troubleshoot the root issue, looking. this started ha... [09:32:35] 10Tool-gitlab-account-approval: Approval job can get stuck and prevent subsequent jobs from firing - https://phabricator.wikimedia.org/T379130#10295346 (10dcaro) Results are: * `concurrencyPolicy: Replace` seems to be the only one able to get NFS stuck jobs to retrigger (potentially in a different worker). * wra... [09:34:39] 06cloud-services-team, 10Toolforge: Add support for replacing a running scheduled job when an overlapping schedule fires (`concurrencyPolicy: Replace`) - https://phabricator.wikimedia.org/T377781#10295351 (10dcaro) Added some exploration here {https://phabricator.wikimedia.org/T379130#10295346}, it seems that... 
[09:37:52] (03CR) 10Kosta Harlan: [C:03+1] Handle temporary accounts as anons [labs/countervandalism/CVNBot] - 10https://gerrit.wikimedia.org/r/1084298 (https://phabricator.wikimedia.org/T378530) (owner: 10AntiCompositeNumber) [09:38:48] 10Toolforge (Toolforge iteration 16): [infra,k8s] node tools-k8s-worker-nfs-24 stopped reporting processes in D state - https://phabricator.wikimedia.org/T379139 (10dcaro) 03NEW [09:45:28] 10Toolforge (Toolforge iteration 16): [infra,k8s] node tools-k8s-worker-nfs-24 stopped reporting processes in D state - https://phabricator.wikimedia.org/T379139#10295374 (10dcaro) Hmm, the node does not show up in the list of targets for prometheus: https://prometheus.svc.toolforge.org/tools/targets?search=&scr... [09:49:12] 06cloud-services-team, 10Toolforge: Add support for replacing a running scheduled job when an overlapping schedule fires (`concurrencyPolicy: Replace`) - https://phabricator.wikimedia.org/T377781#10295383 (10aborrero) I would set this unconditionally, document the behavior, and let the users deal with their co... [09:50:20] 10Toolforge (Toolforge iteration 16): [infra,k8s] node tools-k8s-worker-nfs-24 stopped reporting processes in D state - https://phabricator.wikimedia.org/T379139#10295395 (10dcaro) Oh, I think it might be because the status of the VM is pending confirming a resize/migrate operation: {F57685338} That started on... [09:53:26] 10Toolforge (Toolforge iteration 16): [infra,k8s] node tools-k8s-worker-nfs-24 stopped reporting processes in D state - https://phabricator.wikimedia.org/T379139#10295404 (10dcaro) Manually confirmed the migration, and the server is back in 'active' state, let's see if now it shows up in prometheus. 
[09:54:20] 10Toolforge (Toolforge iteration 16): [infra,k8s] node tools-k8s-worker-nfs-24 stopped reporting processes in D state - https://phabricator.wikimedia.org/T379139#10295406 (10dcaro) There you go, waiting for the first scrape: {F57685352} [09:55:21] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [09:56:03] 10Toolforge (Toolforge iteration 16): [infra,k8s] node tools-k8s-worker-nfs-24 stopped reporting processes in D state - https://phabricator.wikimedia.org/T379139#10295411 (10dcaro) And first data coming in: {F57685357} [09:57:12] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch [09:59:44] 06cloud-services-team, 10Cloud-VPS: tofu-infra: conflict with tf-infra-test - https://phabricator.wikimedia.org/T379141 (10aborrero) 03NEW [09:59:49] 10Toolforge (Toolforge iteration 16): [infra,k8s] node tools-k8s-worker-nfs-24 stopped reporting processes in D state - https://phabricator.wikimedia.org/T379139#10295429 (10dcaro) The alert should trigger soon: {F57685362} [10:00:05] 06cloud-services-team, 10Cloud-VPS: tofu-infra: conflict with tf-infra-test - https://phabricator.wikimedia.org/T379141#10295431 (10aborrero) p:05Triage→03Medium [10:13:24] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-24 (T379139) [10:13:28] T379139: [infra,k8s] node tools-k8s-worker-nfs-24 stopped reporting processes in D state - https://phabricator.wikimedia.org/T379139 [10:14:08] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-24 (T379139) [10:20:21] 10Toolforge (Toolforge iteration 16): [infra,k8s] node tools-k8s-worker-nfs-24 stopped reporting processes in D state - https://phabricator.wikimedia.org/T379139#10295507 (10dcaro) p:05Triage→03High a:03dcaro [10:22:12] 10Toolforge (Toolforge iteration 16): [components-api] 
Add support for pre-built images (ex. python3.11, to refine) - https://phabricator.wikimedia.org/T362076#10295516 (10dcaro) 05In progress→03Resolved [10:23:32] 06cloud-services-team, 10Cloud-VPS: tofu-infra: conflict with tf-infra-test - https://phabricator.wikimedia.org/T379141#10295510 (10fnegri) I think this is not caused by the "tf-infra-test mechanism", but more simply by the fact the project was deleted manually in {T379076}, but has not been removed from the l... [10:24:18] 06cloud-services-team, 10Cloud-VPS: tofu-infra: conflict with tf-infra-test - https://phabricator.wikimedia.org/T379141#10295518 (10aborrero) 05Open→03Invalid [10:27:51] 06cloud-services-team, 10Cloud-VPS: Remove tf-infra-test project - https://phabricator.wikimedia.org/T379076#10295521 (10aborrero) We track project existence via opentofu for Cloud VPS. The `tf-infra-test` project needs to be deleted from the tofu-infra repository, see https://wikitech.wikimedia.org/wiki/Por... [10:28:20] 06cloud-services-team, 10Cloud-VPS: Remove tf-infra-test project - https://phabricator.wikimedia.org/T379076#10295523 (10fnegri) > I don't see that file on either cloudcontrol1005.eqiad.wmnet or cloudcontrol1007.eqiad.wmnet My bad, I thought it was installed on all cloudcontrols, but apparently it's only inst... [10:43:19] 10Cloud Services Proposals, 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS: Decision Request - How to do the Cloud VPS VXLAN/IPv6 migration - https://phabricator.wikimedia.org/T377467#10295557 (10aborrero) 05Open→03Resolved a:03aborrero We will move forward with option 2. I have created h... [10:44:31] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS, 13Patch-For-Review: openstack: develop a script to migrate a VM instance from the old network setting (vlan) to the new (vxlan, IPv6) - https://phabricator.wikimedia.org/T377346#10295564 (10aborrero) 05Stalled→03Declined We wont be working on th... 
[10:49:43] (03open) 10fnegri: Remove project tf-infra-test [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/116 (https://phabricator.wikimedia.org/T379076) [10:52:37] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review: Remove tf-infra-test project - https://phabricator.wikimedia.org/T379076#10295582 (10fnegri) @rook We also have `tf-infra-dev` in codfw, should that one be deleted as well? [10:55:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-24 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [10:56:34] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review: Remove tf-infra-test project - https://phabricator.wikimedia.org/T379076#10295583 (10fnegri) 05Open→03In progress p:05Triage→03Low [11:01:56] 06cloud-services-team, 10Cloud-VPS: Frequent radosgw 500 errors with Object Storage - https://phabricator.wikimedia.org/T360626#10295618 (10fnegri) [11:02:50] 06cloud-services-team, 10Cloud-VPS: Frequent radosgw 500 errors with Object Storage - https://phabricator.wikimedia.org/T360626#10295620 (10fnegri) p:05Medium→03High [11:04:07] 06cloud-services-team, 10Cloud-VPS: Frequent radosgw 500 errors with Object Storage - https://phabricator.wikimedia.org/T360626#10295622 (10aborrero) >>! In T360626#10292422, @dcaro wrote: > We might want to try this https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Procedures_and_operations#Rotating_... 
[11:04:45] 06cloud-services-team, 10Cloud-VPS: openstack: automate fernet token renewal - https://phabricator.wikimedia.org/T379143 (10aborrero) 03NEW [11:05:26] 06cloud-services-team, 10Cloud-VPS: openstack: automate fernet token renewal - https://phabricator.wikimedia.org/T379143#10295636 (10aborrero) p:05Triage→03Low [11:06:25] (03update) 10aborrero: Remove project tf-infra-test [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/116 (https://phabricator.wikimedia.org/T379076) (owner: 10fnegri) [11:06:29] 10Toolforge (Toolforge iteration 16): [usage] Try to get an idea of the amount of tools that were created, but never started anything - https://phabricator.wikimedia.org/T379144 (10dcaro) 03NEW [11:06:51] (03approved) 10aborrero: Remove project tf-infra-test [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/116 (https://phabricator.wikimedia.org/T379076) (owner: 10fnegri) [11:09:50] !log fnegri@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/116 [11:10:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-24 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [11:10:18] !log fnegri@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/116 [11:14:21] (03merge) 10fnegri: Remove project tf-infra-test [repos/cloud/cloud-vps/tofu-infra] - 
10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/116 (https://phabricator.wikimedia.org/T379076) [11:14:27] !log fnegri@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [11:15:00] !log fnegri@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch [11:18:47] 06cloud-services-team, 10Toolforge: Add --timeout to toolforge jobs - https://phabricator.wikimedia.org/T377782#10295675 (10dcaro) Note that this solution would not help with jobs that get stuck due to NFS misbehaving (so far all the instances I've seen), as those jobs are considered 'active' by k8s. [11:22:01] 06cloud-services-team, 10Toolforge: Add --timeout to toolforge jobs - https://phabricator.wikimedia.org/T377782#10295677 (10dcaro) >>! In T377782#10295675, @dcaro wrote: > Note that this solution would not help with jobs that get stuck due to NFS misbehaving (so far all the instances I've seen), as those jobs... [11:32:37] RESOLVED: CloudVPSDesignateLeaks: Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [11:33:14] (03open) 10dcaro: toolforge_deploy_mr: use always toolsbeta as repo for mr deploys [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/209 [11:35:50] 10wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10295702 (10roti_WMDE) >>! In T376267#10293251, @Ladsgroup wrote: > Hi, can you try the 2fa value for your SUL account? I do not have 2FA enabled in my SUL account AFAIK. 
[11:39:17] (03approved) 10sstefanova: toolforge_deploy_mr: use always toolsbeta as repo for mr deploys [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/209 (owner: 10dcaro) [11:53:40] 06cloud-services-team, 10Cloud-VPS: Remove tf-infra-test project - https://phabricator.wikimedia.org/T379076#10295753 (10rook) >>! In T379076#10295582, @fnegri wrote: > @rook We also have `tf-infra-dev` in codfw, should that one be deleted as well? It doesn't have a replacement with no - symbol. Though it has... [11:59:38] (03merge) 10dcaro: toolforge_deploy_mr: use always toolsbeta as repo for mr deploys [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/209 [12:02:09] 06cloud-services-team, 10Toolforge: Jobs hang on toolforge - https://phabricator.wikimedia.org/T379132#10295800 (10dcaro) @Leloiandudu can you check that this is fixed for you? This should have been temporarily fixed (the worker that was having issues was restarted, so the jobs got unblocked), there's still th... [12:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [13:00:38] 06cloud-services-team, 10Cloud-VPS: Remove tf-infra-test project - https://phabricator.wikimedia.org/T379076#10295928 (10fnegri) > It doesn't have a replacement with no - symbol. Though it has also never really been implemented. So both keeping and removing it are an appropriate choice. Thanks, I vote for rem... 
[13:03:10] (03open) 10fnegri: Remove tf-infra-dev project in codfw [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/117 [13:03:35] !log fnegri@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/117 [13:04:04] !log fnegri@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/117 [13:04:34] !log fnegri@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/117 [13:05:15] !log fnegri@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/117 [13:05:23] (03update) 10fnegri: Remove tf-infra-dev project in codfw [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/117 (https://phabricator.wikimedia.org/T379076) [13:05:36] (03approved) 10aborrero: Remove tf-infra-dev project in codfw [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/117 (https://phabricator.wikimedia.org/T379076) (owner: 10fnegri) [13:05:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [13:08:20] (03merge) 10fnegri: Remove tf-infra-dev project in codfw [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/117 
(https://phabricator.wikimedia.org/T379076)
[13:08:31] !log fnegri@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[13:09:22] !log fnegri@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch
[13:09:23] cloud-services-team, Cloud-VPS: Remove tf-infra-test project - https://phabricator.wikimedia.org/T379076#10295946 (fnegri) In progress→Resolved
[13:38:32] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: openstack: vxlan: potential changes to cloudvirt MTU to enable jumbo frames - https://phabricator.wikimedia.org/T379154 (aborrero) NEW
[13:40:46] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: openstack: vxlan: potential changes to cloudvirt MTU to enable jumbo frames - https://phabricator.wikimedia.org/T379154#10296051 (aborrero) p:Triage→Medium
[13:49:08] vivian-rook closed https://github.com/toolforge/paws/pull/461
[13:51:21] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: openstack: vxlan: potential changes to cloudvirt MTU to enable jumbo frames - https://phabricator.wikimedia.org/T379154#10296107 (cmooney) Thanks @aborrero Yeah there is a potential problem for cloud VMs talking to UDP (or other non-TCP) services on t...
[13:52:40] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate deployment-poolcounter06.deployment-prep.eqiad.wmflabs is about to expire in 21d 23h 58m 30s - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetCertificateAboutToExpire - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetCertificateAboutToExpire
[14:03:21] cloud-services-team, Toolforge (Toolforge iteration 16): [infra,k8s] Upgrade Toolforge Kubernetes to version 1.28 - https://phabricator.wikimedia.org/T362867#10296144 (dcaro) >>! In T362867#10292196, @fnegri wrote: >> might be related to this upgrade? > > Potentially yes. Looking at the logs you pasted...
[14:20:27] cloud-services-team (FY2024/2025-Q1-Q2): Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159 (fnegri) NEW
[14:25:06] cloud-services-team (FY2024/2025-Q1-Q2): Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10296287 (fnegri) p:Triage→Medium
[14:54:36] cloud-services-team (FY2024/2025-Q1-Q2), Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10296455 (SLyngshede-WMF) @joanna_borun you're listed as the approver, for wmcs-roots.
[14:57:10] (open) dcaro: fix chart version [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/210
[15:00:52] (approved) sstefanova: fix chart version [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/210 (owner: dcaro)
[15:01:51] (merge) dcaro: api: fix merged version changing on every call [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/49
[15:01:54] (update) dcaro: fix url [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/50
[15:04:27] (open) project_1317_bot_df3177307bed93c3f34e421e26c86e38: api-gateway: bump to 0.0.52-20241106150204-ce148784 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/581
[15:20:38] cloud-services-team, Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T378676#10296522 (dcaro) Testing with the previous old code also failed, so probably yes.
[15:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:31:43] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component api-gateway
[15:36:34] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component api-gateway
[15:42:57] cloud-services-team, Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T378676#10296628 (dcaro) It did work when I use my credentials: ` tools.wm-what@tools-bastion-13:~$ toolforge jobs run --image tool-pywikibot/pywikibot-scripts-stable:latest --command "pwb -family:wikit...
[15:43:25] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component api-gateway
[15:48:53] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component api-gateway
[15:50:06] (approved) dcaro: api-gateway: bump to 0.0.52-20241106150204-ce148784 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/581 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[15:50:09] (merge) dcaro: api-gateway: bump to 0.0.52-20241106150204-ce148784 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/581 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[15:50:10] (update) dcaro: api-gateway: bump to 0.0.52-20241106150204-ce148784 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/581 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[15:52:31] (merge) dcaro: fix url [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/50
[15:52:33] (update) dcaro: auth: allow pass through for deploy urls with tokens [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/51 (https://phabricator.wikimedia.org/T362066)
[15:53:20] (merge) dcaro: fix chart version [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/210
[15:53:20] (update) dcaro: fix chart version [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/210
[15:53:52] Tool-gitlab-account-approval: Approval job can get stuck and prevent subsequent jobs from firing - 
https://phabricator.wikimedia.org/T379130#10296652 (bd808) >>! In T379130#10295287, @dcaro wrote: >> I killed the stuck job > > Can you elaborate a bit on how did you kill the stuck job? > (that might help pin...
[15:54:26] (open) project_1317_bot_df3177307bed93c3f34e421e26c86e38: api-gateway: bump to 0.0.53-20241106155244-4e77dfbe [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/582
[16:01:25] cloud-services-team, Toolforge: Add --timeout to toolforge jobs - https://phabricator.wikimedia.org/T377782#10296681 (bd808)
[16:06:02] cloud-services-team, Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T378676#10296689 (taavi) {T376224}
[16:08:02] cloud-services-team, Toolforge: Add support for replacing a running scheduled job when an overlapping schedule fires (`concurrencyPolicy: Replace`) - https://phabricator.wikimedia.org/T377781#10296694 (dcaro) I find this feature very common on all cron-like systems (just search for how to avoid cron over...
[16:11:24] Tool-gitlab-account-approval: Approval job can get stuck and prevent subsequent jobs from firing - https://phabricator.wikimedia.org/T379130#10296703 (dcaro) > I did not attempt to capture where the job was running when it got stuck unfortunately, so there probably is not a lot to learn here other than that...
[16:12:53] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component api-gateway
[16:14:36] Toolforge (Toolforge iteration 16): [usage] Try to get an idea of the amount of tools that were created, but never started anything - https://phabricator.wikimedia.org/T379144#10296709 (dcaro) p:Triage→Medium
[16:15:01] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component api-gateway
[16:16:22] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component api-gateway
[16:21:29] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component api-gateway
[16:22:49] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component api-gateway
[16:27:57] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component api-gateway
[16:31:06] (approved) dcaro: api-gateway: bump to 0.0.53-20241106155244-4e77dfbe [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/582 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[16:31:10] (merge) dcaro: api-gateway: bump to 0.0.53-20241106155244-4e77dfbe [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/582 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[16:44:58] Toolforge (Toolforge iteration 16): [toolforge,grafana,infra] Grafana stopped showing the namespaces (and other stats) for toolforge namespaces - https://phabricator.wikimedia.org/T378981#10296809 (dcaro) a:dcaro Yep, that was the issue, though I sorted it out by using a different metric to get the n...
[16:45:03] Toolforge (Toolforge iteration 16): [toolforge,grafana,infra] Grafana stopped showing the namespaces (and other stats) for toolforge namespaces - https://phabricator.wikimedia.org/T378981#10296812 (dcaro) Open→Resolved
[16:45:25] cloud-services-team, Toolforge: Add --timeout to toolforge jobs - https://phabricator.wikimedia.org/T377782#10296805 (AntiCompositeNumber) Duplicate of {T306391}?
[16:46:41] Cloud-VPS (Quota-requests): Request floating IP for wikiwho project - https://phabricator.wikimedia.org/T376637#10296819 (taavi) And just to clarify: are you looking to redirect `wikiwho.net`, `www.wikiwho.net`, or both?
[16:47:59] cloud-services-team, Cloud-VPS: Enable use of web proxy for wikiwho.net domain - https://phabricator.wikimedia.org/T376637#10296820 (taavi)
[16:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:52:37] Cloud-VPS, Wikispore: vanity domain for Wikispore - https://phabricator.wikimedia.org/T368236#10296838 (taavi) https://wikitech.wikimedia.org/wiki/Help:Using_a_web_proxy_to_reach_Cloud_VPS_servers_from_the_internet#Vanity_domains is now a possibility.
[16:57:00] cloud-services-team, Cloud-VPS: Enable IPv6 for the Cloud VPS web proxy - https://phabricator.wikimedia.org/T379175 (taavi) NEW
[16:57:28] cloud-services-team, Cloud-VPS: Enable IPv6 for the Cloud VPS web proxy - https://phabricator.wikimedia.org/T379175#10296865 (taavi)
[16:57:38] cloud-services-team, Cloud-VPS, Epic, IPv6: Enable IPv6 on CloudVPS - https://phabricator.wikimedia.org/T37947#10296866 (taavi)
[17:01:21] cloud-services-team, Cloud-VPS: DNS resolution chosing IPv6 addrs on hosts with only link-local IPv6 addresses - https://phabricator.wikimedia.org/T176891#10296877 (taavi) Open→Declined Boldly declining given that those hosts will soon(TM) have v6 connectivity!
[17:01:51] cloud-services-team, Cloud-VPS, LDAP: LDAP: review domain and TLS setup - https://phabricator.wikimedia.org/T339909#10296893 (taavi)
[17:02:47] cloud-services-team, Cloud-VPS: Remove tf-infra-test project - https://phabricator.wikimedia.org/T379076#10296897 (fnegri) Of course I did the same mistake and I deleted `tf-infra-dev` without checking if it contained servers of other resources. @rook found two vms (8bb53a94-dee2-4aec-8aeb-e752b8bb0...
[17:04:04] (open) aborrero: WIP: tofu-infra: add code to validate no leaking VMs exist [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/118
[17:05:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[17:20:48] Toolforge (Toolforge iteration 16): [usage] Try to get an idea of the amount of tools that were created, but never started anything - https://phabricator.wikimedia.org/T379144#10296974 (Aklapper) One partial lamppost figure could be empty (apart from `.gitreview` etc) Git repositories in GitLab created by St...
[17:51:26] Tool-gitlab-account-approval: Approval job can get stuck and prevent subsequent jobs from firing - https://phabricator.wikimedia.org/T379130#10297198 (bd808) This has all happened before: T306391#9436882
[18:03:55] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10297262 (bd808) >>! In T376267#10295702, @roti_WMDE wrote: >>>! In T376267#10293251, @Ladsgroup wrote: >> Hi, can you try the 2fa value for your SUL account? > > I do not have 2FA enabled...
[18:27:59] cloud-services-team, Web Team Visual Regression Framework, Quality-and-Test-Engineering-Team (Test Infrastructure): Move disk space and other Pixel metrics from Graphite to Prometheus - https://phabricator.wikimedia.org/T363969#10297299 (Peter) Open→Declined Lets keep it in Graphite now w...
[18:40:56] FIRING: [2x] SystemdUnitDown: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:42:27] Tool-lexeme-forms, translatewiki.net: translatewiki export for Wikidata Lexeme Forms tries to remove sh-latn translations - https://phabricator.wikimedia.org/T379188 (LucasWerkmeister) NEW
[18:44:11] cloud-services-team, Cloud-VPS: Frequent radosgw 500 errors with Object Storage - https://phabricator.wikimedia.org/T360626#10297371 (Raymond_Ndibe) @dcaro I think I figured out where this problem is from. There are two files `/etc/keystone/credential-keys/0` and `/etc/keystone/credential-keys/1` with `u...
[19:03:45] cloud-services-team, Cloud-VPS: Frequent radosgw 500 errors with Object Storage - https://phabricator.wikimedia.org/T360626#10297420 (Raymond_Ndibe) according to https://docs.openstack.org/keystone/zed/admin/credential-encryption.html, the configuration for this is (also happens to be the default): ` [cr...
[19:35:56] RESOLVED: [2x] SystemdUnitDown: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:39:28] FIRING: [5x] NodeTextfileStale: Stale textfile for cloudvirt1063:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[20:34:48] cloud-services-team, Cloud-VPS: Frequent radosgw 500 errors with Object Storage - https://phabricator.wikimedia.org/T360626#10297614 (Raymond_Ndibe) a:Raymond_Ndibe
[20:36:38] cloud-services-team, Cloud-VPS: Frequent radosgw 500 errors with Object Storage - https://phabricator.wikimedia.org/T360626#10297619 (Raymond_Ndibe) >>! 
In T360626#10292422, @dcaro wrote: > This might be related to this errors on logstash https://logstash.wikimedia.org/goto/c7fa935688ccd6ccda0e11b420b747...
[20:57:04] cloud-services-team, Toolforge: Jobs hang on toolforge - https://phabricator.wikimedia.org/T379132#10297682 (Leloiandudu) I'm going to continue monitoring, thank you
[21:11:58] Tool-Pageviews, Tool-wikistatistics2-0, Data Products, Data-Engineering, and 2 others: Pageviews Analysis 3.0 (Vue + Codex) - https://phabricator.wikimedia.org/T378549#10297728 (Ottomata)
[21:18:13] FIRING: PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-harbor-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[22:06:16] (approved) bd808: add #wikimedia-cloud* channels [toolforge-repos/ircservserv-config] - https://gitlab.wikimedia.org/toolforge-repos/ircservserv-config/-/merge_requests/12 (https://phabricator.wikimedia.org/T377744) (owner: jjmc89)
[22:06:20] (merge) bd808: add #wikimedia-cloud* channels [toolforge-repos/ircservserv-config] - https://gitlab.wikimedia.org/toolforge-repos/ircservserv-config/-/merge_requests/12 (https://phabricator.wikimedia.org/T377744) (owner: jjmc89)
[22:08:10] !issync
[22:08:11] Syncing #wikimedia-cloud-feed (requested by bd808)
[22:08:16] Error: Unable to get opped in #wikimedia-cloud-feed
[22:08:52] !issync
[22:08:52] Syncing #wikimedia-cloud-feed (requested by bd808)
[22:08:57] Error: Unable to get opped in #wikimedia-cloud-feed
[22:09:34] !issync
[22:09:34] Syncing #wikimedia-cloud-feed (requested by bd808)
[22:09:39] Error: Unable to get opped in #wikimedia-cloud-feed
[22:11:29] !issync
[22:11:29] Syncing #wikimedia-cloud-feed (requested by bd808)
[22:11:31] Set /cs flags #wikimedia-cloud-feed TheresNoTime +Afiortv
[22:11:33] Set /cs flags #wikimedia-cloud-feed litharge +Vov
[22:11:35] Set /cs flags #wikimedia-cloud-feed icinga-wm +v
[22:11:37] Set /cs flags #wikimedia-cloud-feed 
logmsgbot +v
[22:11:39] Set /cs flags #wikimedia-cloud-feed dhinus -es
[22:11:41] Set /cs flags #wikimedia-cloud-feed *!*@libera/staff/* -Airtv
[22:11:43] Set /cs flags #wikimedia-cloud-feed dcaro +AFRefiorstv
[22:11:45] Set /cs flags #wikimedia-cloud-feed blancadesal +Afiortv
[22:11:47] Set /cs flags #wikimedia-cloud-feed taavi +Afiortv
[22:11:49] Set /cs flags #wikimedia-cloud-feed komla +Afiortv
[22:11:51] Set /cs flags #wikimedia-cloud-feed rook +Afiortv
[22:11:53] Set /cs flags #wikimedia-cloud-feed wikibugs +v
[22:11:55] Set /cs flags #wikimedia-cloud-feed arturo +Afiortv
[22:11:57] Set /cs flags #wikimedia-cloud-feed stashbot +v
[22:11:59] Set /cs flags #wikimedia-cloud-feed wmopbot -Ae
[22:12:01] Set /cs flags #wikimedia-cloud-feed wm-bot +v
[22:12:03] Set /cs flags #wikimedia-cloud-feed ircservserv-wm +V
[22:12:05] Set /cs flags #wikimedia-cloud-feed Az1568 -ARefiorstv
[22:12:07] Set /cs flags #wikimedia-cloud-feed jinxer-wm +v
[22:12:09] Set /cs flags #wikimedia-cloud-feed bstorm -AFRefiorstv
[22:12:11] Set /cs flags #wikimedia-cloud-feed Raymond_Ndibe +Afiortv
[22:12:13] Set /cs flags #wikimedia-cloud-feed wmcs-alerts +v
[22:12:22] that was more work than I expected
[22:25:38] (open) bd808: fix: Use Taavi's account name rather than his preferred nick [toolforge-repos/ircservserv-config] - https://gitlab.wikimedia.org/toolforge-repos/ircservserv-config/-/merge_requests/13
[22:33:54] (PS1) Eevans: Add (fake) corto bot password [labs/private] - https://gerrit.wikimedia.org/r/1087979 (https://phabricator.wikimedia.org/T379204)
[22:34:17] (CR) Eevans: [V:+2 C:+2] Add (fake) corto bot password [labs/private] - https://gerrit.wikimedia.org/r/1087979 (https://phabricator.wikimedia.org/T379204) (owner: Eevans)
[22:55:59] (merge) bd808: fix: Use Taavi's account name rather than his preferred nick [toolforge-repos/ircservserv-config] - https://gitlab.wikimedia.org/toolforge-repos/ircservserv-config/-/merge_requests/13
[22:57:17] 
!issync
[22:57:17] Syncing #wikimedia-cloud-feed (requested by bd808)
[22:57:19] Set /cs flags #wikimedia-cloud-feed dcaro +AFRefiorstv
[23:03:08] cloud-services-team, ircservserv: Use ircservserv to manage permissions for #wikimedia-cloud* channels - https://phabricator.wikimedia.org/T377744#10298031 (bd808) Open→Resolved a:JJMC89 Thanks for writing the patch @JJMC89. I would have gotten to this someday™, but I have no idea when.
[23:39:28] FIRING: [5x] NodeTextfileStale: Stale textfile for cloudvirt1063:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks