[00:03:55] FIRING: MaxConntrack: Max conntrack at 88.47% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:04:56] FIRING: MaxConntrack: Max conntrack at 90.42% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:05:03] cloud-services-team: MaxConntrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1050:9100 - https://phabricator.wikimedia.org/T373281 (phaultfinder) NEW
[00:09:55] RESOLVED: MaxConntrack: Max conntrack at 90.49% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:12:55] FIRING: MaxConntrack: Max conntrack at 90.78% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:13:02] cloud-services-team: MaxConntrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1050:9100 - https://phabricator.wikimedia.org/T373281#10090280 (phaultfinder)
[00:13:50] RESOLVED: TfInfraTestDestroyFailed: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[00:21:51] FIRING: TfInfraTestApplyFailed: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[00:27:55] RESOLVED: MaxConntrack: Max conntrack at 90.03% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:29:55] FIRING: MaxConntrack: Max conntrack at 90.57% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:30:01] cloud-services-team: MaxConntrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1050:9100 - https://phabricator.wikimedia.org/T373281#10090297 (phaultfinder)
[00:33:22] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090300 (Andrew) Are people still seeing this issue? I'm unable to produce the specific failure mentioned in the task description.
[00:34:55] RESOLVED: MaxConntrack: Max conntrack at 90.06% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:35:29] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090302 (AntiCompositeNumber) The last one I got was 2024-08-25 22:07:47Z. But it's been intermittent the whole time.
[00:40:10] FIRING: MaxConntrack: Max conntrack at 90.52% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:40:22] cloud-services-team: MaxConntrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1050:9100 - https://phabricator.wikimedia.org/T373281#10090303 (phaultfinder)
[00:42:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[00:45:10] RESOLVED: MaxConntrack: Max conntrack at 90.05% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:47:55] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[00:48:55] RESOLVED: MaxConntrack: Max conntrack at 87.48% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:53:04] Toolforge: ChieBot: Intermittent connection reset by peer errors - https://phabricator.wikimedia.org/T356163#10090304 (Leloiandudu) @dcaro hey any ideas?
[00:53:05] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090307 (Andrew) by 'intermittent' do you mean that it's always failing a little bit, or that every few hours it fails a lot, for a few minutes?
[00:55:46] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090308 (Stuartyeates) I'm seeing failures of URLs like https://orcid-scraper.toolforge.org/results?qid=Q112671057 "Internal Server Error / The server encountered an inte...
[01:29:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:46:17] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-17 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[03:13:09] (open) samwilson: Flush jobs before loading [toolforge-repos/wishlist] - https://gitlab.wikimedia.org/toolforge-repos/wishlist/-/merge_requests/2
[03:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[03:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:08:57] FIRING: SystemdUnitDown: The service unit backup_vms.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[05:10:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:15:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:20:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:37:29] FIRING: PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-harbor-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[05:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:55:29] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance toolsbeta-harbor-2 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[05:58:35] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090504 (Don-vip) For me the errors are gone (toolforge job service works, I was able to build and deploy my tool. No more DNS errors, everything looks fine).
[06:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:50:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:03:57] FIRING: SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:04:01] cloud-services-team: SystemdUnitDown Unit backup_vms.service on node cloudbackup1003 has been down for long. - https://phabricator.wikimedia.org/T373292 (phaultfinder) NEW
[07:13:41] (open) dcaro: jobs,cronjobs: add clarifying note on why the limits [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/60 (https://phabricator.wikimedia.org/T372720)
[07:15:54] Toolforge (Toolforge iteration 14), Patch-For-Review: Possible error in jobs and cronjobs quotas in maintain-kubeusers - https://phabricator.wikimedia.org/T372720#10090570 (dcaro) p:Triage→Low
[07:15:57] Toolforge (Toolforge iteration 14), Patch-For-Review: Possible error in jobs and cronjobs quotas in maintain-kubeusers - https://phabricator.wikimedia.org/T372720#10090569 (dcaro) @Raymond_Ndibe feel free to take over that MR and change the message so it's clear for you (and others!)
[07:17:17] Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293 (dcaro) NEW
[07:17:23] Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090586 (dcaro) p:Triage→High
[07:20:40] Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090590 (dcaro) p:High→Medium Manually running `toolforge build quota` for the test tool seems to work too: ` dcaro@tools-bastion-13:~$ become automated-too...
[07:23:35] Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090602 (dcaro) Oh, no, it fails, just not all the time: ` tools.automated-toolforge-tests@tools-bastion-13:~$ toolforge build quota ReadTimeout: HTTPSConnectionPoo...
[07:23:37] Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090604 (dcaro) p:Medium→High
[07:28:39] Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090610 (dcaro) This does not show much on the builds-api python app level: ` dcaro@tools-bastion-13:~$ kubectl-sudo -n builds-api logs --tail 1000 deployment/build...
[07:29:58] Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090611 (dcaro) builds-api nginx does not seem to get it either, looking at the api-gateway: ` dcaro@tools-bastion-13:~$ kubectl-sudo -n builds-api logs --tail 1000...
[07:33:01] Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090613 (dcaro) There's something there: ` 192.168.254.192 - - [26/Aug/2024:07:21:50 +0000] "GET /builds/v1/tool/automated-toolforge-tests/quotas HTTP/1.1" 499 0 "-...
[07:35:30] Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090617 (dcaro) It happens in both pods: ` dcaro@tools-bastion-13:~$ kubectl-sudo -n api-gateway logs --timestamps --prefix --tail 1000 --container nginx -l name=ap...
[07:42:17] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090627 (dcaro)
[07:42:36] Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090625 (dcaro) →Duplicate dup:T373243
[07:43:42] Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090633 (dcaro) The full command: ` tools.automated-toolforge-tests@tools-bastion-13:~$ curl -v https://api.svc.tools.eqiad1.wikimedia.cloud:30003/builds/v1/too...
[07:54:09] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090637 (dcaro) Coredns does not seem to have spikes in usage, cpu: {F57294340} Mem {F57294342} Looking
[08:06:16] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090640 (dcaro) hmm... from a webservice shell, we get sometimes a `non authoritative answer`: ` I have no name!@shell-1724659470:~$ nslookup tools-harbor.wmcloud.org Serv...
[08:16:16] cloud-services-team, Toolforge, Sustainability (Incident Followup): [k8s,infra] scale up coredns replicas - https://phabricator.wikimedia.org/T333934#10090648 (dcaro) We are currently having issues with the DNS resolution, though I suspect they are not load issues, let me try scaling up manually and...
[08:17:19] Tool-techcontribs: Tech Contribs does not support parentheses in user names - https://phabricator.wikimedia.org/T373269#10090650 (Chlod) Open→Resolved a:Chlod Deployed!
[08:18:17] Tool-techcontribs: Tech Contribs does not support parentheses in user names - https://phabricator.wikimedia.org/T373269#10090653 (LucasWerkmeister) Thanks \o/
[08:21:03] (PS5) Jean-Frédéric: Use toolforge-jobs to install requirements during deployment [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065124 (https://phabricator.wikimedia.org/T319787)
[08:21:04] (PS5) Jean-Frédéric: Remove `composer update` step from build-php script [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065125
[08:21:36] (CR) CI reject: [V:-1] Remove `composer update` step from build-php script [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065125 (owner: Jean-Frédéric)
[08:24:59] cloud-services-team, Toolforge, Sustainability (Incident Followup): [k8s,infra] scale up coredns replicas - https://phabricator.wikimedia.org/T333934#10090665 (dcaro) CPU usage lowered from ~0.8 to ~0.5, running tests
[08:28:05] cloud-services-team, Toolforge, Sustainability (Incident Followup): [k8s,infra] scale up coredns replicas - https://phabricator.wikimedia.org/T333934#10090669 (dcaro) Tests seem to be passing \o/ So it might have been load, though it was using 0.8 CPU, and had no limit, maybe we have to increase the...
[08:29:41] cloud-services-team, Toolforge, Sustainability (Incident Followup): [k8s,infra] scale up coredns replicas - https://phabricator.wikimedia.org/T333934#10090674 (dcaro) Oh, tests started failing on the openapi.json endpoint, like in toolsbeta (failing to fetch it from builds-api): ` < HTTP/1.1 500 I...
[08:30:20] cloud-services-team, Toolforge (Toolforge iteration 14): toolforge: Refresh certs that are not controlled by kubeadm (mid 2024 edition) - https://phabricator.wikimedia.org/T309782#10090677 (Aklapper) @dcaro: Hi, the `Due Date` set for this open task passed a while ago. Could you please either update or r...
[08:33:31] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090708 (dcaro) Just manually scaled up the number of replicas for the coredns deployment from 2 to 4, and things seem to be improving, is anyone still seeing issues?
[08:38:34] (open) dcaro: toolforge_get_versions: add calico [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/489
[08:43:07] (PS2) Lokal Profil: Add LICENSE to repo [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1064471 (https://phabricator.wikimedia.org/T174633)
[08:43:13] (CR) CI reject: [V:-1] Add LICENSE to repo [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1064471 (https://phabricator.wikimedia.org/T174633) (owner: Lokal Profil)
[08:49:11] (approved) sstefanova: toolforge_get_versions: add calico [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/489 (owner: dcaro)
[08:50:33] (approved) sstefanova: jobs,cronjobs: add clarifying note on why the limits [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/60 (https://phabricator.wikimedia.org/T372720) (owner: dcaro)
[08:50:33] (open) dcaro: openapi.json: add backend errors to the message [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/38
[08:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:51:26] (merge) dcaro: toolforge_get_versions: add calico [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/489
[08:52:00] (update) sstefanova: [jobs-cli] update autocomplete and man files [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/66 (owner: raymond-ndibe)
[08:54:18] (update) sstefanova: openapi.json: add backend errors to the message [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/38 (owner: dcaro)
[08:54:36] (approved) sstefanova: openapi.json: add backend errors to the message [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/38 (owner: dcaro)
[08:55:26] (merge) dcaro: openapi.json: add backend errors to the message [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/38
[08:57:39] (open) project_1317_bot_df3177307bed93c3f34e421e26c86e38: api-gateway: bump to 0.0.41-20240826085537-205b142a [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/490
[08:58:22] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component api-gateway
[09:00:08] cloud-services-team, Toolforge (Toolforge iteration 14): toolforge: Refresh certs that are not controlled by kubeadm (mid 2024 edition) - https://phabricator.wikimedia.org/T309782#10090916 (dcaro)
[09:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:00:42] !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component api-gateway
[09:01:09] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component api-gateway
[09:02:36] !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component api-gateway
[09:03:19] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component api-gateway
[09:08:42] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component api-gateway
[09:08:48] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component api-gateway
[09:13:49] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component api-gateway
[09:14:38] (approved) dcaro: api-gateway: bump to 0.0.41-20240826085537-205b142a [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/490 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[09:14:40] (merge) dcaro: api-gateway: bump to 0.0.41-20240826085537-205b142a [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/490 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[09:16:21] (update) dcaro: kind: upgrade k8s to 1.26 [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/177 (https://phabricator.wikimedia.org/T370244) (owner: sstefanova)
[09:18:19] FIRING: TektonDown: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown
[09:41:10] Grid-Engine-to-K8s-Migration, Wiki-Loves-Monuments-Database, Patch-For-Review: Migrate heritage from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319787#10091157 (dcaro) >>! In T319787#10082843, @JeanFred wrote: > I got something working, will wait overnight to se...
[09:54:31] (open) dcaro: toolforge: add calico to deployment list [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/185
[10:10:14] Striker: toolsadmin.wikimedia.org is unavailable (2024-08-24) - https://phabricator.wikimedia.org/T373250#10091210 (Vgutierrez) toolsadmin.wikimedia.org is handled by labweb.svc.eqiad.wmnet which has two pooled backend servers, cloudweb1003 and cloudweb1004, cloudweb1003 is struggling for some reason (timeou...
[10:15:13] Striker: toolsadmin.wikimedia.org is unavailable (2024-08-24) - https://phabricator.wikimedia.org/T373250#10091229 (dcaro) a:dcaro
[10:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:26:20] Striker: toolsadmin.wikimedia.org is unavailable (2024-08-24) - https://phabricator.wikimedia.org/T373250#10091268 (dcaro) The last log I see in cloudweb1003 is from two days ago: ` root@cloudweb1003:~# systemctl status striker.service ● striker.service - Systemd runner for striker Loaded: loaded (/lib...
[10:27:37] Striker: toolsadmin.wikimedia.org is unavailable (2024-08-24) - https://phabricator.wikimedia.org/T373250#10091272 (dcaro) that seemed to help: ` dcaro@cp6001:~$ curl -v -s -4 --connect-to toolsadmin.wikimedia.org:443:$(dig +short cloudweb1003.wikimedia.org):7443 https://toolsadmin.wikimedia.org -o /dev/null...
[10:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:31:02] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10091290 (dcaro) Yep, still having issues, looking
[11:04:12] FIRING: SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[11:10:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:14:59] FIRING: Toolforge Kyverno no policy resources: Toolforge Kyverno has no policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_no_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+no+policy+resources
[11:15:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:15:54] Striker: toolsadmin.wikimedia.org is unavailable (2024-08-24) - https://phabricator.wikimedia.org/T373250#10091402 (dcaro) p:Unbreak!→High Service seems restored, monitoring for a bit before closing, @Vgutierrez your comment was very helpful, thanks!
[11:19:59] RESOLVED: Toolforge Kyverno no policy resources: Toolforge Kyverno has no policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_no_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+no+policy+resources
[11:20:33] Tool-Global-user-contributions: GUC displays a database error - https://phabricator.wikimedia.org/T373319 (Melos) NEW
[11:21:19] Striker: toolsadmin.wikimedia.org is unavailable (2024-08-24) - https://phabricator.wikimedia.org/T373250#10091425 (Vgutierrez) @dcaro thanks! at the moment healthchecks only validate wikitech.wm.org: `yaml labweb-ssl: description: "lvs for cloudweb services: horizon, striker, wikitech - HTTPS" enc...
[11:22:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:25:19] Tool-Global-user-contributions: GUC displays a database error - https://phabricator.wikimedia.org/T373319#10091429 (RhinosF1) →Duplicate dup:T373243
[11:25:34] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10091432 (RhinosF1)
[11:46:28] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10091535 (dcaro) Querying from a webservice shell fails pretty frequently, even for internal names (and without domain searching, ie. with trailing `.`): ` I have no name!@...
[11:49:08] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10091538 (dcaro) I can reproduce with nsenter on the worker: ` root@tools-k8s-worker-104:~# time nsenter -t 578510 -n nslookup api.svc.tools.eqiad1.wikimedia.cloud. 10.96.0...
[12:19:23] Quarry: Update cluster to 1.26 - https://phabricator.wikimedia.org/T373093#10091623 (rook) Doesn't appear to be fully deploying. Cluster deploys, but kube-system pods seem to have some issues. Main issue is maybe k8s-keystone-auth which is giving: ` Warning Failed 11m (x4 over 13m) kubelet...
[12:26:45] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10091656 (MBH) When I'm trying to build an image from my github repo, I got this strange issue: `unable to access 'https://github.com/Saisengen/wikibots/': Could not resol...
[12:40:41] Quarry: Update cluster to 1.26 - https://phabricator.wikimedia.org/T373093#10091722 (rook) Looks like the same kinds of things are happening in tf-infra-test ` NAME READY STATUS RESTARTS AGE pod/coredns-745687fb66-8jw96 1/1 R...
[12:42:54] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-104 (T373243)
[12:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:42:59] T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243
[12:44:11] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-104 (T373243)
[12:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:44:45] cloud-services-team: Replace or deprecate WMCS uses of report updater - https://phabricator.wikimedia.org/T357856#10091732 (lbowmaker) @Andrew - this was migrated from ReportUpdater to Airflow https://phabricator.wikimedia.org/T357938 Let us know if there are any issues.
[12:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[12:53:14] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-104 (T373243)
[12:53:18] T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243
[12:53:19] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-104 (T373243)
[12:53:32] cloud-services-team, wikitech.wikimedia.org, Infrastructure-Foundations, serviceops: LdapAuthentication: Disable extension from Wikitech - https://phabricator.wikimedia.org/T371592#10091775 (akosiaris) >>! In T371592#10081906, @bd808 wrote: >>>! In T371592#10081520, @Andrew wrote: >> Will this be...
[13:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [13:05:06] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-4, tools-k8s-worker-nfs-15, tools-k8s-worker-nfs-18, tools-k8s-worker-nfs-25, tools-k8s-worker-nfs-51, tools-k8s-worker-nfs-52, tools-k8s-worker-104 (T373243) [13:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [13:05:10] T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243 [13:12:41] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-4, tools-k8s-worker-nfs-15, tools-k8s-worker-nfs-18, tools-k8s-worker-nfs-25, tools-k8s-worker-nfs-51, tools-k8s-worker-nfs-52, tools-k8s-worker-104 (T373243) [13:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [13:12:47] T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243 [13:14:12] 10Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10091898 (10dcaro) So going around with cumin, we found some workers that fail often: ` tools-k8s-worker-{nfs-{4,15,18,25,51,52},104} ` ` # running this many times to get al... 
[13:17:21] 10Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10091925 (10dcaro) The reboot did not help xd, the VMs are all running on different cloudvirts: ` root@cloudcontrol1007:~# for node in tools-k8s-worker-{nfs-{4,15,18,25,51,52... [13:33:36] 10VPS-project-Codesearch, 10GitLab (Integrations): Figure out the future of codesearch in a GitLab world - https://phabricator.wikimedia.org/T268196#10091975 (10hashar) In case someone is looking at having our Code Search to index repositories hosted in GitLab, see T371992 / https://gerrit.wikimedia.org/r/... [13:33:50] FIRING: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [13:38:50] RESOLVED: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [13:46:07] (03update) 10sstefanova: calico: bump to 0.0.8-20240731084636-9937ff2a [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/462 (https://phabricator.wikimedia.org/T370046) (owner: 10project_1317_bot_df3177307bed93c3f34e421e26c86e38) [13:51:24] FIRING: HarborComponentDown: No data about Harbor components found. 
#page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown [13:51:29] FIRING: TektonUpMetricUnknown: Tekton might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonUpMetricUnknown [13:56:24] RESOLVED: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown [13:56:29] RESOLVED: TektonUpMetricUnknown: Tekton might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonUpMetricUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonUpMetricUnknown [13:57:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:03:23] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-4 (T373243) [14:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [14:03:28] T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243 [14:04:07] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-nfs-4 (T373243) [14:04:09] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-15 (T373243) [14:04:55] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-nfs-15 (T373243) [14:04:56] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-18 (T373243) [14:05:37] !log 
dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-nfs-18 (T373243) [14:05:39] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-25 (T373243) [14:06:22] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-nfs-25 (T373243) [14:06:24] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-51 (T373243) [14:07:03] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-nfs-51 (T373243) [14:07:05] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-52 (T373243) [14:07:45] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-nfs-52 (T373243) [14:07:46] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-104 (T373243) [14:08:03] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-104 (T373243) [14:17:34] 10Striker, 13Patch-For-Review: toolsadmin.wikimedia.org is unavailable (2024-08-24) - https://phabricator.wikimedia.org/T373250#10092171 (10dcaro) I got that patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1066784, but I'm not sure if that will be forwarded to cloud services (team=wmcs) or not, do y... 
[14:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [14:29:25] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10092244 (dcaro) I have cordoned all the misbehaving workers, users should stop seeing issues right now, will try to debug in more detail and add new nodes if I can't find... [14:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [14:40:02] Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10092314 (ArthurPSmith) Just to confirm I've done a few dozen actions that would have triggered this problem a few days ago, and everything is working. Thanks! [15:04:12] FIRING: SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:04:42] VPS-Projects: Quota leak for dbs? 
- https://phabricator.wikimedia.org/T373348 (10rook) 03NEW [15:10:03] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [15:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:11:13] 10Cloud-VPS, 06Infrastructure-Foundations, 10SRE-tools: Update offboard-user script to use Keystone API - https://phabricator.wikimedia.org/T306788#10092464 (10SLyngshede-WMF) a:03SLyngshede-WMF [15:12:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [15:15:47] !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster [15:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [15:31:58] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [15:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:33:51] !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster [15:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:34:58] 10PAWS: PAWS Share button moved/removed? 
- https://phabricator.wikimedia.org/T373357 (10Akaibu1) 03NEW [15:35:05] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [15:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:35:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [15:38:19] !log dcaro@urcuchillay tools END (ERROR) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=97) for a worker-nfs role in the tools cluster [15:38:20] 10PAWS: PAWS Share button moved/removed? - https://phabricator.wikimedia.org/T373357#10092723 (10rook) →14Duplicate dup:03T358604 [15:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:38:34] 10PAWS: PAWS Share button moved/removed? - https://phabricator.wikimedia.org/T373357#10092740 (10rook) 05Duplicate→03Open [15:38:54] 10PAWS: Update labpawspublic extension to jupyterlab 4 system - https://phabricator.wikimedia.org/T358604#10092725 (10rook) [15:39:08] 10PAWS: Update labpawspublic extension to jupyterlab 4 system - https://phabricator.wikimedia.org/T358604#10092733 (10rook) Sorry T373357 wasn't a duplicate of this. [15:39:13] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [15:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:41:19] 10PAWS: PAWS Share button moved/removed? - https://phabricator.wikimedia.org/T373357#10092759 (10rook) This was removed as the project stopped getting supported and the feature stopped working https://github.com/jupyterlab-contrib/jupyterlab-link-share/commits/main/ I'll update the documentation, thank you for b... 
[15:44:42] 10PAWS: PAWS Share button moved/removed? - https://phabricator.wikimedia.org/T373357#10092783 (10rook) 05Open→03Resolved a:03rook [15:44:57] !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster [15:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:45:22] FIRING: HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [15:48:46] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [15:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:49:10] !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster [15:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:49:12] 10Cloud-Services: Prepare "What's new with Wikimedia Cloud Services" presentation for WikiConNA 2024 - https://phabricator.wikimedia.org/T373159#10092811 (10bd808) The tech spike I have been working on in {T372498} could be a good hook for talking about Cloud VPS changes in the last couple of years. That spike i... 
[15:50:10] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [15:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [15:50:22] RESOLVED: HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [15:52:01] 10VPS-Projects: magnum to 1.27 - https://phabricator.wikimedia.org/T373360 (10rook) 03NEW [16:02:09] !log dcaro@urcuchillay tools Added a new k8s worker-nfs tools-k8s-worker-nfs-57.tools.eqiad1.wikimedia.cloud to the cluster [16:02:09] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [16:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:02:41] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [16:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:03:57] FIRING: SystemdUnitDown: The service unit prometheus-openstack-exporter.service is in failed status on host cloudcontrol1007. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:06:00] 10Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10092877 (10dcaro) New nodes seem to not have the issue, so will continue adding new ones (added worker-nfs-57) [16:08:29] FIRING: InstanceDown: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [16:10:28] 10Cloud-Services, 10Catalyst (PatchDemo GoLive): Moving proxies across wmcs projects for patchdemo.wmflabs.org - https://phabricator.wikimedia.org/T370080#10092893 (10jnuche) 05Open→03Resolved a:03jnuche Done during the switchover on 2024-08-19 [16:12:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [16:13:29] RESOLVED: InstanceDown: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [16:14:05] !log dcaro@urcuchillay tools Added a new k8s worker-nfs tools-k8s-worker-nfs-58.tools.eqiad1.wikimedia.cloud to the cluster [16:14:05] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [16:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:20:40] 06cloud-services-team, 10wikitech.wikimedia.org, 
06Infrastructure-Foundations, 06serviceops: LdapAuthentication: Disable extension from Wikitech - https://phabricator.wikimedia.org/T371592#10092958 (10bd808) >>! In T371592#10091775, @akosiaris wrote: >>>! In T371592#10081906, @bd808 wrote: >> This is block... [16:26:07] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [16:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:30:46] !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster [16:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:36:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-k8s-worker-nfs-59 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [16:37:13] 10VPS-Projects: magnum to 1.27 - https://phabricator.wikimedia.org/T373360#10093032 (10rook) https://github.com/toolforge/tf-infra-test/pull/18 [16:37:38] 10VPS-Projects: magnum to 1.27 - https://phabricator.wikimedia.org/T373360#10093033 (10rook) 05Open→03Resolved a:03rook [16:42:18] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [16:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:42:28] FIRING: InstanceDown: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [16:47:28] RESOLVED: InstanceDown: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [16:50:42] 10Quarry: Update cluster to 1.26 - https://phabricator.wikimedia.org/T373093#10093066 (10rook) Looks like 1.26 isn't working anymore as some of the image tags that 1.26 wants have been removed. 
Upgrading to 1.27 [16:54:01] !log dcaro@urcuchillay tools Added a new k8s worker-nfs tools-k8s-worker-nfs-60.tools.eqiad1.wikimedia.cloud to the cluster [16:54:01] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [16:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:54:15] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [16:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:57:58] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-k8s-worker-nfs-59 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [17:04:16] !log dcaro@urcuchillay tools Added a new k8s worker-nfs tools-k8s-worker-nfs-61.tools.eqiad1.wikimedia.cloud to the cluster [17:04:16] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [17:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:08:46] 06cloud-services-team, 10Cloud-VPS, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10093181 (10bd808) >>! In T373227#10089302, @Andrew wrote: > I didn't do a rabbit... [17:08:57] RESOLVED: SystemdUnitDown: The service unit prometheus-openstack-exporter.service is in failed status on host cloudcontrol1007. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:09:29] FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [17:10:03] 06cloud-services-team, 10Cloud-VPS, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10093183 (10bd808) a:03Andrew [17:11:07] 10PAWS: Upgrade to k8s 1.27 - https://phabricator.wikimedia.org/T373372 (10rook) 03NEW [17:14:50] 10Quarry: Update cluster to 1.26 - https://phabricator.wikimedia.org/T373093#10093234 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/quarry/pull/67 [17:14:56] vivian-rook closed https://github.com/toolforge/quarry/pull/67 [17:15:35] 10Quarry: Remove quarry-124 cluster - https://phabricator.wikimedia.org/T373375 (10rook) 03NEW [17:17:10] 10Quarry: Remove quarry-124 cluster - https://phabricator.wikimedia.org/T373375#10093268 (10rook) [17:17:11] 10Quarry: Update cluster to 1.26 - https://phabricator.wikimedia.org/T373093#10093269 (10rook) [17:17:24] 10Quarry: Update cluster to 1.26 - https://phabricator.wikimedia.org/T373093#10093272 (10rook) 05Open→03Resolved [17:18:18] (03PS3) 10Lokal Profil: Add LICENSE to repo [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/1064471 (https://phabricator.wikimedia.org/T174633) [17:18:59] 10PAWS: Remove paws-127 cluster - https://phabricator.wikimedia.org/T373036#10093280 (10github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/450 [17:19:11] vivian-rook opened https://github.com/toolforge/paws/pull/450 [17:19:29] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [17:20:20] (03PS4) 10Lokal Profil: Add LICENSE to repo [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/1064471 (https://phabricator.wikimedia.org/T174633) [17:21:58] FIRING: Toolforge Kyverno no policy resources: Toolforge Kyverno has no policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_no_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+no+policy+resources [17:24:29] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [17:26:58] RESOLVED: Toolforge Kyverno no policy resources: Toolforge Kyverno has no policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_no_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+no+policy+resources [17:29:29] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [17:29:39] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [17:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:29:57] vivian-rook closed https://github.com/toolforge/paws/pull/450 [17:30:07] !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster [17:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:30:15] 10PAWS: Remove paws-127 cluster - https://phabricator.wikimedia.org/T373036#10093338 (10rook) 
05Open→03Resolved a:03rook [17:33:14] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.quota_increase [17:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:33:21] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0) [17:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:33:33] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [17:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:33:58] !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster [17:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:34:29] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [17:38:19] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.quota_increase [17:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:38:26] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0) [17:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:38:31] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [17:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:48:50] vivian-rook opened https://github.com/toolforge/paws/pull/451 [17:49:26] !log dcaro@urcuchillay tools Added a new k8s worker-nfs tools-k8s-worker-nfs-62.tools.eqiad1.wikimedia.cloud to the cluster [17:49:26] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [17:49:28] Logged 
the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:49:29] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [17:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:53:54] 06cloud-services-team, 10Cloud-VPS, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10093444 (10rook) Looks like some of the images for the 1.26 deploy of magnum k8s... [18:15:19] 06cloud-services-team, 10Cloud-VPS, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10093508 (10bd808) [18:15:20] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS, 13Patch-For-Review: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044#10093509 (10bd808) [18:15:21] 10VPS-Projects: magnum to 1.27 - https://phabricator.wikimedia.org/T373360#10093512 (10bd808) [18:15:26] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS, 13Patch-For-Review: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044#10093513 (10bd808) [18:16:15] 10Quarry: Update cluster to 1.26 - https://phabricator.wikimedia.org/T373093#10093518 (10bd808) [18:16:19] 10cloud-services-team (FY2024/2025-Q1-Q2), 10Cloud-VPS, 13Patch-For-Review: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044#10093519 (10bd808) [18:31:56] FIRING: SystemdUnitDown: The service unit neutron-openvswitch-agent.service is in failed status on host cloudvirt1062. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:33:11] FIRING: SystemdUnitDown: The service unit backup_vms.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:34:38] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [18:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [18:35:06] !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster [18:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [19:05:09] 10Cloud-VPS (Project-requests): Request creation of udstest VPS project - https://phabricator.wikimedia.org/T373386#10093729 (10sbassett) [19:06:41] FIRING: SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:07:20] Cloud-VPS (Project-requests): Request creation of udstest VPS project - https://phabricator.wikimedia.org/T373386#10093744 (sbassett) [19:07:28] Cloud-VPS (Project-requests): Request creation of udstest VPS project - https://phabricator.wikimedia.org/T373386#10093745 (sbassett) [19:08:38] Cloud-VPS (Project-requests): Request creation of udstest VPS project - https://phabricator.wikimedia.org/T373386#10093743 (JJMC89) Do you mean `usdtest`? [19:11:29] Cloud-VPS (Project-requests): Request creation of usdtest VPS project - https://phabricator.wikimedia.org/T373386#10093752 (sbassett) [19:11:31] Cloud-VPS (Project-requests): Request creation of usdtest VPS project - https://phabricator.wikimedia.org/T373386#10093754 (sbassett) >>! In T373386#10093743, @JJMC89 wrote: > Do you mean `usdtest`? I did, thanks for catching that. 
[19:12:59] 10Cloud-VPS (Project-requests): Request creation of usdtest VPS project - https://phabricator.wikimedia.org/T373386#10093757 (10sbassett) [19:17:53] 10VPS-Projects: magnum clusters not deploying in eqiad1 - https://phabricator.wikimedia.org/T373207#10093800 (10rook) →14Duplicate dup:03T373360 [19:17:55] 10VPS-Projects: magnum to 1.27 - https://phabricator.wikimedia.org/T373360#10093802 (10rook) [19:43:53] 06cloud-services-team, 10Cloud-VPS (Project-requests): Request creation of usdtest VPS project - https://phabricator.wikimedia.org/T373386#10093914 (10bd808) +1 [19:44:29] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:12:59] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.quota_increase [20:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [20:13:09] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0) [20:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [20:13:15] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [20:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [20:23:39] !log dcaro@urcuchillay tools Added a new k8s worker-nfs tools-k8s-worker-nfs-63.tools.eqiad1.wikimedia.cloud to the cluster [20:23:39] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [20:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [20:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [20:26:41] RESOLVED: SystemdUnitDown: The service unit backup_vms.service is in failed status on host cloudbackup1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:26:41] RESOLVED: SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:34:29] RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:03:40] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [21:03:42] !log dcaro@urcuchillay tools END (ERROR) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=97) for a worker-nfs role in the tools cluster [21:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [21:03:44] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [21:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [21:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [21:13:56] !log dcaro@urcuchillay tools Added a new k8s worker-nfs tools-k8s-worker-nfs-64.tools.eqiad1.wikimedia.cloud to the cluster [21:13:56] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [21:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [21:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [21:19:19] 10Cloud-VPS (Debian Buster Deprecation), 10Humaniki: Cloud 
VPS "wikidumpparse" project Buster deprecation - https://phabricator.wikimedia.org/T367561#10094195 (10Maximilianklein) update for 2024-08-19 [x] create cinder volume. [x] move project code [x] move mysql-db files [x] create a new debian bookworm inst... [21:21:55] 10Cloud-VPS (Debian Buster Deprecation), 10Humaniki: Cloud VPS "wikidumpparse" project Buster deprecation - https://phabricator.wikimedia.org/T367561#10094203 (10Maximilianklein) 05Open→03Resolved a:03Maximilianklein done and deleted [21:50:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [23:09:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:14:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:21:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown