[00:03:55] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 88.47% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:04:56] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 90.42% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:05:03] <wikibugs>	 cloud-services-team: MaxConntrack  Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1050:9100 - https://phabricator.wikimedia.org/T373281 (phaultfinder) NEW
[00:09:55] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 90.49% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:12:55] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 90.78% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:13:02] <wikibugs>	 cloud-services-team: MaxConntrack  Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1050:9100 - https://phabricator.wikimedia.org/T373281#10090280 (phaultfinder)
[00:13:50] <wmcs-alerts>	 RESOLVED: TfInfraTestDestroyFailed: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[00:21:51] <wmcs-alerts>	 FIRING: TfInfraTestApplyFailed: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[00:27:55] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 90.03% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:29:55] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 90.57% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:30:01] <wikibugs>	 cloud-services-team: MaxConntrack  Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1050:9100 - https://phabricator.wikimedia.org/T373281#10090297 (phaultfinder)
[00:33:22] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090300 (Andrew) Are people still seeing this issue?  I'm unable to produce the specific failure mentioned in the task description.
[00:34:55] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 90.06% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:35:29] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090302 (AntiCompositeNumber) The last one I got was 2024-08-25 22:07:47Z. But it's been intermittent the whole time.
[00:40:10] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 90.52% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:40:22] <wikibugs>	 cloud-services-team: MaxConntrack  Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1050:9100 - https://phabricator.wikimedia.org/T373281#10090303 (phaultfinder)
[00:42:55] <wmcs-alerts>	 FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[00:45:10] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 90.05% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:47:55] <wmcs-alerts>	 RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[00:48:55] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 87.48% on cloudvirt1050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:53:04] <wikibugs>	 Toolforge: ChieBot: Intermittent connection reset by peer errors - https://phabricator.wikimedia.org/T356163#10090304 (Leloiandudu) @dcaro hey any ideas?
[00:53:05] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090307 (Andrew) by 'intermittent' do you mean that it's always failing a little bit, or that every few hours it fails a lot, for a few minutes?
[00:55:46] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090308 (Stuartyeates) I'm seeing failures of URLs like https://orcid-scraper.toolforge.org/results?qid=Q112671057  "Internal Server Error / The server encountered an inte...
[01:29:28] <wmcs-alerts>	 RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:46:17] <wmcs-alerts>	 RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-17 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[03:13:09] <wikibugs>	 (open) samwilson: Flush jobs before loading [toolforge-repos/wishlist] - https://gitlab.wikimedia.org/toolforge-repos/wishlist/-/merge_requests/2
[03:20:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[03:30:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:50:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:00:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:08:57] <jinxer-wm>	 FIRING: SystemdUnitDown: The service unit backup_vms.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[05:10:28] <wmcs-alerts>	 FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:15:28] <wmcs-alerts>	 FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:20:28] <wmcs-alerts>	 FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:37:29] <wmcs-alerts>	 FIRING: PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-harbor-2 on project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[05:50:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:55:29] <wmcs-alerts>	 FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance toolsbeta-harbor-2 in project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[05:58:35] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090504 (Don-vip) For me the errors are gone (toolforge job service works, I was able to build and deploy my tool. No more DNS errors, everything looks fine).
[06:00:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:50:28] <wmcs-alerts>	 RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:03:57] <jinxer-wm>	 FIRING: SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:04:01] <wikibugs>	 cloud-services-team: SystemdUnitDown  Unit backup_vms.service on node cloudbackup1003 has been down for long. - https://phabricator.wikimedia.org/T373292 (phaultfinder) NEW
[07:13:41] <wikibugs>	 (open) dcaro: jobs,cronjobs: add clarifying note on why the limits [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/60 (https://phabricator.wikimedia.org/T372720)
[07:15:54] <wikibugs>	 Toolforge (Toolforge iteration 14), Patch-For-Review: Possible error in jobs and cronjobs quotas in maintain-kubeusers - https://phabricator.wikimedia.org/T372720#10090570 (dcaro) p:Triage→Low
[07:15:57] <wikibugs>	 Toolforge (Toolforge iteration 14), Patch-For-Review: Possible error in jobs and cronjobs quotas in maintain-kubeusers - https://phabricator.wikimedia.org/T372720#10090569 (dcaro) @Raymond_Ndibe feel free to take over that MR and change the message so it's clear for you (and others!)
[07:17:17] <wikibugs>	 Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293 (dcaro) NEW
[07:17:23] <wikibugs>	 Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090586 (dcaro) p:Triage→High
[07:20:40] <wikibugs>	 Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090590 (dcaro) p:High→Medium Manually running `toolforge build quota` for the test tool seems to work too: ` dcaro@tools-bastion-13:~$ become automated-too...
[07:23:35] <wikibugs>	 Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090602 (dcaro) Oh, no, it fails, just not all the time: ` tools.automated-toolforge-tests@tools-bastion-13:~$ toolforge build quota ReadTimeout: HTTPSConnectionPoo...
[07:23:37] <wikibugs>	 Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090604 (dcaro) p:Medium→High
[07:28:39] <wikibugs>	 Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090610 (dcaro) This does not show much on the builds-api python app level: ` dcaro@tools-bastion-13:~$ kubectl-sudo -n builds-api logs --tail 1000 deployment/build...
[07:29:58] <wikibugs>	 Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090611 (dcaro) builds-api nginx does not seem to get it either, looking at the api-gateway: ` dcaro@tools-bastion-13:~$ kubectl-sudo -n builds-api logs --tail 1000...
[07:33:01] <wikibugs>	 Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090613 (dcaro) There's something there: ` 192.168.254.192 - - [26/Aug/2024:07:21:50 +0000] "GET /builds/v1/tool/automated-toolforge-tests/quotas HTTP/1.1" 499 0 "-...
[07:35:30] <wikibugs>	 Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090617 (dcaro) It happens in both pods: ` dcaro@tools-bastion-13:~$ kubectl-sudo -n api-gateway logs --timestamps --prefix --tail 1000 --container nginx -l name=ap...
[07:42:17] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090627 (dcaro)
[07:42:36] <wikibugs>	 Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090625 (dcaro) →Duplicate dup:T373243
[07:43:42] <wikibugs>	 Toolforge (Toolforge iteration 14): [builds-api] quota command failing on functional tests on tools - https://phabricator.wikimedia.org/T373293#10090633 (dcaro) The full command: ` tools.automated-toolforge-tests@tools-bastion-13:~$ curl -v https://api.svc.tools.eqiad1.wikimedia.cloud:30003/builds/v1/too...
[07:54:09] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090637 (dcaro) Coredns does not seem to have spikes in usage, cpu: {F57294340}  Mem {F57294342}  Looking
[08:06:16] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090640 (dcaro) hmm... from a webservice shell, we get sometimes a `non authoritative answer`: ` I have no name!@shell-1724659470:~$ nslookup tools-harbor.wmcloud.org Serv...
[08:16:16] <wikibugs>	 cloud-services-team, Toolforge, Sustainability (Incident Followup): [k8s,infra] scale up coredns replicas - https://phabricator.wikimedia.org/T333934#10090648 (dcaro) We are currently having issues with the DNS resolution, though I suspect they are not load issues, let me try scaling up manually and...
[08:17:19] <wikibugs>	 Tool-techcontribs: Tech Contribs does not support parentheses in user names - https://phabricator.wikimedia.org/T373269#10090650 (Chlod) Open→Resolved a:Chlod Deployed!
[08:18:17] <wikibugs>	 Tool-techcontribs: Tech Contribs does not support parentheses in user names - https://phabricator.wikimedia.org/T373269#10090653 (LucasWerkmeister) Thanks \o/
[08:21:03] <wikibugs>	 (PS5) Jean-Frédéric: Use toolforge-jobs to install requirements during deployment [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065124 (https://phabricator.wikimedia.org/T319787)
[08:21:04] <wikibugs>	 (PS5) Jean-Frédéric: Remove `composer update` step from build-php script [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065125
[08:21:36] <wikibugs>	 (CR) CI reject: [V:-1] Remove `composer update` step from build-php script [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065125 (owner: Jean-Frédéric)
[08:24:59] <wikibugs>	 cloud-services-team, Toolforge, Sustainability (Incident Followup): [k8s,infra] scale up coredns replicas - https://phabricator.wikimedia.org/T333934#10090665 (dcaro) CPU usage lowered from ~0.8 to ~0.5, running tests
[08:28:05] <wikibugs>	 cloud-services-team, Toolforge, Sustainability (Incident Followup): [k8s,infra] scale up coredns replicas - https://phabricator.wikimedia.org/T333934#10090669 (dcaro) Tests seem to be passing \o/  So it might have been load, though it was using 0.8 CPU, and had no limit, maybe we have to increase the...
[08:29:41] <wikibugs>	 cloud-services-team, Toolforge, Sustainability (Incident Followup): [k8s,infra] scale up coredns replicas - https://phabricator.wikimedia.org/T333934#10090674 (dcaro) Oh, tests started failing on the openapi.json endpoint, like in toolsbeta (failing to fetch it from builds-api): `    < HTTP/1.1 500 I...
[08:30:20] <wikibugs>	 cloud-services-team, Toolforge (Toolforge iteration 14): toolforge: Refresh certs that are not controlled by kubeadm (mid 2024 edition) - https://phabricator.wikimedia.org/T309782#10090677 (Aklapper) @dcaro: Hi, the `Due Date` set for this open task passed a while ago. Could you please either update or r...
[08:33:31] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090708 (dcaro) Just manually scaled up the number of replicas for the coredns deployment from 2 to 4, and things seem to be improving, is anyone still seeing issues?
[08:38:34] <wikibugs>	 (open) dcaro: toolforge_get_versions: add calico [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/489
[08:43:07] <wikibugs>	 (PS2) Lokal Profil: Add LICENSE to repo [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1064471 (https://phabricator.wikimedia.org/T174633)
[08:43:13] <wikibugs>	 (CR) CI reject: [V:-1] Add LICENSE to repo [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1064471 (https://phabricator.wikimedia.org/T174633) (owner: Lokal Profil)
[08:49:11] <wikibugs>	 (approved) sstefanova: toolforge_get_versions: add calico [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/489 (owner: dcaro)
[08:50:33] <wikibugs>	 (approved) sstefanova: jobs,cronjobs: add clarifying note on why the limits [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/60 (https://phabricator.wikimedia.org/T372720) (owner: dcaro)
[08:50:33] <wikibugs>	 (open) dcaro: openapi.json: add backend errors to the message [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/38
[08:50:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:51:26] <wikibugs>	 (merge) dcaro: toolforge_get_versions: add calico [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/489
[08:52:00] <wikibugs>	 (update) sstefanova: [jobs-cli] update autocomplete and man files [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/66 (owner: raymond-ndibe)
[08:54:18] <wikibugs>	 (update) sstefanova: openapi.json: add backend errors to the message [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/38 (owner: dcaro)
[08:54:36] <wikibugs>	 (approved) sstefanova: openapi.json: add backend errors to the message [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/38 (owner: dcaro)
[08:55:26] <wikibugs>	 (merge) dcaro: openapi.json: add backend errors to the message [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/38
[08:57:39] <wikibugs>	 (open) project_1317_bot_df3177307bed93c3f34e421e26c86e38: api-gateway: bump to 0.0.41-20240826085537-205b142a [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/490
[08:58:22] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component api-gateway
[09:00:08] <wikibugs>	 cloud-services-team, Toolforge (Toolforge iteration 14): toolforge: Refresh certs that are not controlled by kubeadm (mid 2024 edition) - https://phabricator.wikimedia.org/T309782#10090916 (dcaro)
[09:00:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:00:42] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component api-gateway
[09:01:09] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component api-gateway
[09:02:36] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component api-gateway
[09:03:19] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component api-gateway
[09:08:42] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component api-gateway
[09:08:48] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component api-gateway
[09:13:49] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component api-gateway
[09:14:38] <wikibugs>	 (approved) dcaro: api-gateway: bump to 0.0.41-20240826085537-205b142a [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/490 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[09:14:40] <wikibugs>	 (merge) dcaro: api-gateway: bump to 0.0.41-20240826085537-205b142a [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/490 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[09:16:21] <wikibugs>	 (update) dcaro: kind: upgrade k8s to 1.26 [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/177 (https://phabricator.wikimedia.org/T370244) (owner: sstefanova)
[09:18:19] <wmcs-alerts>	 FIRING: TektonDown: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown
[09:41:10] <wikibugs>	 Grid-Engine-to-K8s-Migration, Wiki-Loves-Monuments-Database, Patch-For-Review: Migrate heritage from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319787#10091157 (dcaro) >>! In T319787#10082843, @JeanFred wrote: > I got something working, will wait overnight to se...
[09:54:31] <wikibugs>	 (open) dcaro: toolforge: add calico to deployment list [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/185
[10:10:14] <wikibugs>	 Striker: toolsadmin.wikimedia.org is unavailable (2024-08-24) - https://phabricator.wikimedia.org/T373250#10091210 (Vgutierrez) toolsadmin.wikimedia.org is handled by labweb.svc.eqiad.wmnet which has two pooled backend servers, cloudweb1003 and cloudweb1004, cloudweb1003 is struggling for some reason (timeou...
[10:15:13] <wikibugs>	 Striker: toolsadmin.wikimedia.org is unavailable (2024-08-24) - https://phabricator.wikimedia.org/T373250#10091229 (dcaro) a:dcaro
[10:20:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:26:20] <wikibugs>	 Striker: toolsadmin.wikimedia.org is unavailable (2024-08-24) - https://phabricator.wikimedia.org/T373250#10091268 (dcaro) The last log I see in cloudweb1003 is from two days ago: ` root@cloudweb1003:~# systemctl status striker.service  ● striker.service - Systemd runner for striker      Loaded: loaded (/lib...
[10:27:37] <wikibugs>	 Striker: toolsadmin.wikimedia.org is unavailable (2024-08-24) - https://phabricator.wikimedia.org/T373250#10091272 (dcaro) that seemed to help: ` dcaro@cp6001:~$ curl -v -s -4 --connect-to toolsadmin.wikimedia.org:443:$(dig +short cloudweb1003.wikimedia.org):7443 https://toolsadmin.wikimedia.org -o /dev/null...
[10:30:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:31:02] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10091290 (dcaro) Yep, still having issues, looking
[11:04:12] <jinxer-wm>	 FIRING: SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[11:10:28] <wmcs-alerts>	 FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:14:59] <wmcs-alerts>	 FIRING: Toolforge Kyverno no policy resources: Toolforge Kyverno has no policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_no_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+no+policy+resources
[11:15:28] <wmcs-alerts>	 RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:15:54] <wikibugs>	 Striker: toolsadmin.wikimedia.org is unavailable (2024-08-24) - https://phabricator.wikimedia.org/T373250#10091402 (dcaro) p:Unbreak!→High Service seems restored, monitoring for a bit before closing, @Vgutierrez your comment was very helpful, thanks!
[11:19:59] <wmcs-alerts>	 RESOLVED: Toolforge Kyverno no policy resources: Toolforge Kyverno has no policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_no_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+no+policy+resources
[11:20:33] <wikibugs>	 Tool-Global-user-contributions: GUC displays a database error - https://phabricator.wikimedia.org/T373319 (Melos) NEW
[11:21:19] <wikibugs>	 Striker: toolsadmin.wikimedia.org is unavailable (2024-08-24) - https://phabricator.wikimedia.org/T373250#10091425 (Vgutierrez) @dcaro thanks! at the moment healthchecks only validate wikitech.wm.org: `yaml   labweb-ssl:     description: "lvs for cloudweb services: horizon, striker, wikitech - HTTPS"     enc...
[11:22:28] <wmcs-alerts>	 FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:25:19] <wikibugs>	 Tool-Global-user-contributions: GUC displays a database error - https://phabricator.wikimedia.org/T373319#10091429 (RhinosF1) →Duplicate dup:T373243
[11:25:34] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10091432 (RhinosF1)
[11:46:28] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10091535 (dcaro) Querying from a webservice shell fails pretty frequently, even for internal names (and without domain searching, ie. with trailing `.`): ` I have no name!@...
[11:49:08] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10091538 (dcaro) I can reproduce with nsenter on the worker: ` root@tools-k8s-worker-104:~# time nsenter -t 578510 -n nslookup api.svc.tools.eqiad1.wikimedia.cloud. 10.96.0...
[12:19:23] <wikibugs>	 Quarry: Update cluster to 1.26 - https://phabricator.wikimedia.org/T373093#10091623 (rook) Doesn't appear to be fully deploying. Cluster deploys, but kube-system pods seem to have some issues. Main issue is maybe k8s-keystone-auth which is giving: ` Warning  Failed     11m (x4 over 13m)     kubelet...
[12:26:45] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10091656 (MBH) When I'm trying to build an image from my github repo, I got this strange issue:  `unable to access 'https://github.com/Saisengen/wikibots/': Could not resol...
[12:40:41] <wikibugs>	 Quarry: Update cluster to 1.26 - https://phabricator.wikimedia.org/T373093#10091722 (rook) Looks like the same kinds of things are happening in tf-infra-test ` NAME                                             READY   STATUS             RESTARTS   AGE pod/coredns-745687fb66-8jw96                     1/1     R...
[12:42:54] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-104 (T373243)
[12:42:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:42:59] <stashbot>	 T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243
[12:44:11] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-104 (T373243)
[12:44:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:44:45] <wikibugs>	 cloud-services-team: Replace or deprecate WMCS uses of report updater - https://phabricator.wikimedia.org/T357856#10091732 (lbowmaker) @Andrew - this was migrated from ReportUpdater to Airflow  https://phabricator.wikimedia.org/T357938  Let us know if there are any issues.
[12:50:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[12:53:14] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-104 (T373243)
[12:53:18] <stashbot>	 T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243
[12:53:19] <logmsgbot_cloud>	 !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-104 (T373243)
[12:53:32] <wikibugs>	 cloud-services-team, wikitech.wikimedia.org, Infrastructure-Foundations, serviceops: LdapAuthentication: Disable extension from Wikitech - https://phabricator.wikimedia.org/T371592#10091775 (akosiaris) >>! In T371592#10081906, @bd808 wrote: >>>! In T371592#10081520, @Andrew wrote: >> Will this be...
[13:00:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:05:06] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-4, tools-k8s-worker-nfs-15, tools-k8s-worker-nfs-18, tools-k8s-worker-nfs-25, tools-k8s-worker-nfs-51, tools-k8s-worker-nfs-52, tools-k8s-worker-104 (T373243)
[13:05:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:05:10] <stashbot>	 T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243
[13:12:41] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-4, tools-k8s-worker-nfs-15, tools-k8s-worker-nfs-18, tools-k8s-worker-nfs-25, tools-k8s-worker-nfs-51, tools-k8s-worker-nfs-52, tools-k8s-worker-104 (T373243)
[13:12:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:12:47] <stashbot>	 T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243
[13:14:12] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10091898 (dcaro) So going around with cumin, we found some workers that fail often: ` tools-k8s-worker-{nfs-{4,15,18,25,51,52},104} `  ` # running this many times to get al...
[13:17:21] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10091925 (dcaro) The reboot did not help xd, the VMs are all running on different cloudvirts: ` root@cloudcontrol1007:~# for node in tools-k8s-worker-{nfs-{4,15,18,25,51,52...
[13:33:36] <wikibugs>	 VPS-project-Codesearch, GitLab (Integrations): Figure out the future of codesearch in a GitLab world - https://phabricator.wikimedia.org/T268196#10091975 (hashar) In case someone is looking at having our Code Search to index repositories hosted in GitLab, see T371992 / https://gerrit.wikimedia.org/r/...
[13:33:50] <wmcs-alerts>	 FIRING: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:38:50] <wmcs-alerts>	 RESOLVED: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:46:07] <wikibugs>	 (update) sstefanova: calico: bump to 0.0.8-20240731084636-9937ff2a [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/462 (https://phabricator.wikimedia.org/T370046) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[13:51:24] <wmcs-alerts>	 FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[13:51:29] <wmcs-alerts>	 FIRING: TektonUpMetricUnknown: Tekton might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonUpMetricUnknown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonUpMetricUnknown
[13:56:24] <wmcs-alerts>	 RESOLVED: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[13:56:29] <wmcs-alerts>	 RESOLVED: TektonUpMetricUnknown: Tekton might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonUpMetricUnknown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonUpMetricUnknown
[13:57:28] <wmcs-alerts>	 RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:03:23] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-4 (T373243)
[14:03:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:03:28] <stashbot>	 T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243
[14:04:07] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-nfs-4 (T373243)
[14:04:09] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-15 (T373243)
[14:04:55] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-nfs-15 (T373243)
[14:04:56] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-18 (T373243)
[14:05:37] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-nfs-18 (T373243)
[14:05:39] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-25 (T373243)
[14:06:22] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-nfs-25 (T373243)
[14:06:24] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-51 (T373243)
[14:07:03] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-nfs-51 (T373243)
[14:07:05] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-52 (T373243)
[14:07:45] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-nfs-52 (T373243)
[14:07:46] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-104 (T373243)
[14:08:03] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-104 (T373243)
[14:17:34] <wikibugs>	 Striker, Patch-For-Review: toolsadmin.wikimedia.org is unavailable (2024-08-24) - https://phabricator.wikimedia.org/T373250#10092171 (dcaro) I got that patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1066784, but I'm not sure if that will be forwarded to cloud services (team=wmcs) or not, do y...
[14:20:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:29:25] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10092244 (dcaro) I have cordoned all the misbehaving workers, users should stop seeing issues right now, will try to debug in more detail and add new nodes if I can't find...
[14:30:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:40:02] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10092314 (ArthurPSmith) Just to confirm I've done a few dozen actions that would have triggered this problem a few days ago, and everything is working. Thanks!
[15:04:12] <jinxer-wm>	 FIRING: SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:04:42] <wikibugs>	 VPS-Projects: Quota leak for dbs? - https://phabricator.wikimedia.org/T373348 (rook) NEW
[15:10:03] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[15:10:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:11:13] <wikibugs>	 Cloud-VPS, Infrastructure-Foundations, SRE-tools: Update offboard-user script to use Keystone API - https://phabricator.wikimedia.org/T306788#10092464 (SLyngshede-WMF) a:SLyngshede-WMF
[15:12:55] <wmcs-alerts>	 FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[15:15:47] <wm-bot2>	 !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster
[15:15:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:20:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:31:58] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[15:32:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:33:51] <wm-bot2>	 !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster
[15:33:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:34:58] <wikibugs>	 PAWS: PAWS Share button moved/removed? - https://phabricator.wikimedia.org/T373357 (Akaibu1) NEW
[15:35:05] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[15:35:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:35:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:38:19] <wm-bot2>	 !log dcaro@urcuchillay tools END (ERROR) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=97) for a worker-nfs role in the tools cluster
[15:38:20] <wikibugs>	 PAWS: PAWS Share button moved/removed? - https://phabricator.wikimedia.org/T373357#10092723 (rook) →Duplicate dup:T358604
[15:38:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:38:34] <wikibugs>	 PAWS: PAWS Share button moved/removed? - https://phabricator.wikimedia.org/T373357#10092740 (rook) Duplicate→Open
[15:38:54] <wikibugs>	 PAWS: Update labpawspublic extension to jupyterlab 4 system - https://phabricator.wikimedia.org/T358604#10092725 (rook)
[15:39:08] <wikibugs>	 PAWS: Update labpawspublic extension to jupyterlab 4 system - https://phabricator.wikimedia.org/T358604#10092733 (rook) Sorry T373357 wasn't a duplicate of this.
[15:39:13] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[15:39:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:41:19] <wikibugs>	 PAWS: PAWS Share button moved/removed? - https://phabricator.wikimedia.org/T373357#10092759 (rook) This was removed as the project stopped getting supported and the feature stopped working https://github.com/jupyterlab-contrib/jupyterlab-link-share/commits/main/ I'll update the documentation, thank you for b...
[15:44:42] <wikibugs>	 PAWS: PAWS Share button moved/removed? - https://phabricator.wikimedia.org/T373357#10092783 (rook) Open→Resolved a:rook
[15:44:57] <wm-bot2>	 !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster
[15:45:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:45:22] <jinxer-wm>	 FIRING: HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[15:48:46] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[15:48:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:49:10] <wm-bot2>	 !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster
[15:49:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:49:12] <wikibugs>	 Cloud-Services: Prepare "What's new with Wikimedia Cloud Services" presentation for WikiConNA 2024 - https://phabricator.wikimedia.org/T373159#10092811 (bd808) The tech spike I have been working on in {T372498} could be a good hook for talking about Cloud VPS changes in the last couple of years. That spike i...
[15:50:10] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[15:50:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:50:22] <jinxer-wm>	 RESOLVED: HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[15:52:01] <wikibugs>	 VPS-Projects: magnum to 1.27 - https://phabricator.wikimedia.org/T373360 (rook) NEW
[16:02:09] <wm-bot2>	 !log dcaro@urcuchillay tools Added a new k8s worker-nfs tools-k8s-worker-nfs-57.tools.eqiad1.wikimedia.cloud to the cluster
[16:02:09] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[16:02:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:02:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:02:41] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[16:02:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:03:57] <jinxer-wm>	 FIRING: SystemdUnitDown: The service unit prometheus-openstack-exporter.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:06:00] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10092877 (dcaro) New nodes seem to not have the issue, so will continue adding new ones (added worker-nfs-57)
[16:08:29] <wmcs-alerts>	 FIRING: InstanceDown: Project tf-infra-test instance tf-infra-test is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:10:28] <wikibugs>	 Cloud-Services, Catalyst (PatchDemo GoLive): Moving proxies across wmcs projects for patchdemo.wmflabs.org - https://phabricator.wikimedia.org/T370080#10092893 (jnuche) Open→Resolved a:jnuche Done during the switchover on 2024-08-19
[16:12:55] <wmcs-alerts>	 RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[16:13:29] <wmcs-alerts>	 RESOLVED: InstanceDown: Project tf-infra-test instance tf-infra-test is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:14:05] <wm-bot2>	 !log dcaro@urcuchillay tools Added a new k8s worker-nfs tools-k8s-worker-nfs-58.tools.eqiad1.wikimedia.cloud to the cluster
[16:14:05] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[16:14:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:14:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:20:40] <wikibugs>	 cloud-services-team, wikitech.wikimedia.org, Infrastructure-Foundations, serviceops: LdapAuthentication: Disable extension from Wikitech - https://phabricator.wikimedia.org/T371592#10092958 (bd808) >>! In T371592#10091775, @akosiaris wrote: >>>! In T371592#10081906, @bd808 wrote: >> This is block...
[16:26:07] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[16:26:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:30:46] <wm-bot2>	 !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster
[16:30:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:36:28] <wmcs-alerts>	 FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-k8s-worker-nfs-59 in project tools   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[16:37:13] <wikibugs>	 VPS-Projects: magnum to 1.27 - https://phabricator.wikimedia.org/T373360#10093032 (rook) https://github.com/toolforge/tf-infra-test/pull/18
[16:37:38] <wikibugs>	 VPS-Projects: magnum to 1.27 - https://phabricator.wikimedia.org/T373360#10093033 (rook) Open→Resolved a:rook
[16:42:18] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[16:42:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:42:28] <wmcs-alerts>	 FIRING: InstanceDown: Project tf-infra-test instance tf-infra-test is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:47:28] <wmcs-alerts>	 RESOLVED: InstanceDown: Project tf-infra-test instance tf-infra-test is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:50:42] <wikibugs>	 Quarry: Update cluster to 1.26 - https://phabricator.wikimedia.org/T373093#10093066 (rook) Looks like 1.26 isn't working anymore as some of the image tags that 1.26 wants have been removed. Upgrading to 1.27
[16:54:01] <wm-bot2>	 !log dcaro@urcuchillay tools Added a new k8s worker-nfs tools-k8s-worker-nfs-60.tools.eqiad1.wikimedia.cloud to the cluster
[16:54:01] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[16:54:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:54:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:54:15] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[16:54:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:57:58] <wmcs-alerts>	 RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-k8s-worker-nfs-59 in project tools   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[17:04:16] <wm-bot2>	 !log dcaro@urcuchillay tools Added a new k8s worker-nfs tools-k8s-worker-nfs-61.tools.eqiad1.wikimedia.cloud to the cluster
[17:04:16] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[17:04:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:04:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:08:46] <wikibugs>	 cloud-services-team, Cloud-VPS, Beta-Cluster-Infrastructure, Patch-For-Review: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10093181 (bd808) >>! In T373227#10089302, @Andrew wrote: > I didn't do a rabbit...
[17:08:57] <jinxer-wm>	 RESOLVED: SystemdUnitDown: The service unit prometheus-openstack-exporter.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:09:29] <wmcs-alerts>	 FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:10:03] <wikibugs>	 cloud-services-team, Cloud-VPS, Beta-Cluster-Infrastructure, Patch-For-Review: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10093183 (bd808) a:Andrew
[17:11:07] <wikibugs>	 PAWS: Upgrade to k8s 1.27 - https://phabricator.wikimedia.org/T373372 (rook) NEW
[17:14:50] <wikibugs>	 Quarry: Update cluster to 1.26 - https://phabricator.wikimedia.org/T373093#10093234 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/quarry/pull/67
[17:14:56] <notefromgithub>	 vivian-rook closed https://github.com/toolforge/quarry/pull/67
[17:15:35] <wikibugs>	 Quarry: Remove quarry-124 cluster - https://phabricator.wikimedia.org/T373375 (rook) NEW
[17:17:10] <wikibugs>	 Quarry: Remove quarry-124 cluster - https://phabricator.wikimedia.org/T373375#10093268 (rook)
[17:17:11] <wikibugs>	 Quarry: Update cluster to 1.26 - https://phabricator.wikimedia.org/T373093#10093269 (rook)
[17:17:24] <wikibugs>	 Quarry: Update cluster to 1.26 - https://phabricator.wikimedia.org/T373093#10093272 (rook) Open→Resolved
[17:18:18] <wikibugs>	 (PS3) Lokal Profil: Add LICENSE to repo [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1064471 (https://phabricator.wikimedia.org/T174633)
[17:18:59] <wikibugs>	 PAWS: Remove paws-127 cluster - https://phabricator.wikimedia.org/T373036#10093280 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/450
[17:19:11] <notefromgithub>	 vivian-rook opened https://github.com/toolforge/paws/pull/450
[17:19:29] <wmcs-alerts>	 FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:20:20] <wikibugs>	 (PS4) Lokal Profil: Add LICENSE to repo [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1064471 (https://phabricator.wikimedia.org/T174633)
[17:21:58] <wmcs-alerts>	 FIRING: Toolforge Kyverno no policy resources: Toolforge Kyverno has no policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_no_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+no+policy+resources
[17:24:29] <wmcs-alerts>	 FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:26:58] <wmcs-alerts>	 RESOLVED: Toolforge Kyverno no policy resources: Toolforge Kyverno has no policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_no_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+no+policy+resources
[17:29:29] <wmcs-alerts>	 FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:29:39] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[17:29:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:29:57] <notefromgithub>	 vivian-rook closed https://github.com/toolforge/paws/pull/450
[17:30:07] <wm-bot2>	 !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster
[17:30:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:30:15] <wikibugs>	 PAWS: Remove paws-127 cluster - https://phabricator.wikimedia.org/T373036#10093338 (rook) Open→Resolved a:rook
[17:33:14] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.quota_increase
[17:33:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:33:21] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0)
[17:33:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:33:33] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[17:33:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:33:58] <wm-bot2>	 !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster
[17:33:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:34:29] <wmcs-alerts>	 FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:38:19] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.quota_increase
[17:38:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:38:26] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0)
[17:38:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:38:31] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[17:38:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:48:50] <notefromgithub>	 vivian-rook opened https://github.com/toolforge/paws/pull/451
[17:49:26] <wm-bot2>	 !log dcaro@urcuchillay tools Added a new k8s worker-nfs tools-k8s-worker-nfs-62.tools.eqiad1.wikimedia.cloud to the cluster
[17:49:26] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[17:49:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:49:29] <wmcs-alerts>	 FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:49:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:53:54] <wikibugs>	 cloud-services-team, Cloud-VPS, Beta-Cluster-Infrastructure, Patch-For-Review: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10093444 (rook) Looks like some of the images for the 1.26 deploy of magnum k8s...
[18:15:19] <wikibugs>	 cloud-services-team, Cloud-VPS, Beta-Cluster-Infrastructure, Patch-For-Review: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10093508 (bd808)
[18:15:20] <wikibugs>	 cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Patch-For-Review: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044#10093509 (bd808)
[18:15:21] <wikibugs>	 VPS-Projects: magnum to 1.27 - https://phabricator.wikimedia.org/T373360#10093512 (bd808)
[18:15:26] <wikibugs>	 cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Patch-For-Review: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044#10093513 (bd808)
[18:16:15] <wikibugs>	 Quarry: Update cluster to 1.26 - https://phabricator.wikimedia.org/T373093#10093518 (bd808)
[18:16:19] <wikibugs>	 cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Patch-For-Review: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044#10093519 (bd808)
[18:31:56] <jinxer-wm>	 FIRING: SystemdUnitDown: The service unit neutron-openvswitch-agent.service is in failed status on host cloudvirt1062. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:33:11] <jinxer-wm>	 FIRING: SystemdUnitDown: The service unit backup_vms.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:34:38] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[18:34:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:35:06] <wm-bot2>	 !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster
[18:35:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:05:09] <wikibugs>	 Cloud-VPS (Project-requests): Request creation of udstest VPS project - https://phabricator.wikimedia.org/T373386#10093729 (sbassett)
[19:06:41] <jinxer-wm>	 FIRING: SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:07:20] <wikibugs>	 Cloud-VPS (Project-requests): Request creation of udstest VPS project - https://phabricator.wikimedia.org/T373386#10093744 (sbassett)
[19:07:28] <wikibugs>	 Cloud-VPS (Project-requests): Request creation of udstest VPS project - https://phabricator.wikimedia.org/T373386#10093745 (sbassett)
[19:08:38] <wikibugs>	 Cloud-VPS (Project-requests): Request creation of udstest VPS project - https://phabricator.wikimedia.org/T373386#10093743 (JJMC89) Do you mean `usdtest`?
[19:11:29] <wikibugs>	 Cloud-VPS (Project-requests): Request creation of usdtest VPS project - https://phabricator.wikimedia.org/T373386#10093752 (sbassett)
[19:11:31] <wikibugs>	 Cloud-VPS (Project-requests): Request creation of usdtest VPS project - https://phabricator.wikimedia.org/T373386#10093754 (sbassett) >>! In T373386#10093743, @JJMC89 wrote: > Do you mean `usdtest`?  I did, thanks for catching that.
[19:12:59] <wikibugs>	 Cloud-VPS (Project-requests): Request creation of usdtest VPS project - https://phabricator.wikimedia.org/T373386#10093757 (sbassett)
[19:17:53] <wikibugs>	 VPS-Projects: magnum clusters not deploying in eqiad1 - https://phabricator.wikimedia.org/T373207#10093800 (rook) →Duplicate dup:T373360
[19:17:55] <wikibugs>	 VPS-Projects: magnum to 1.27 - https://phabricator.wikimedia.org/T373360#10093802 (rook)
[19:43:53] <wikibugs>	 cloud-services-team, Cloud-VPS (Project-requests): Request creation of usdtest VPS project - https://phabricator.wikimedia.org/T373386#10093914 (bd808) +1
[19:44:29] <wmcs-alerts>	 FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:12:59] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.quota_increase
[20:13:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:13:09] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0)
[20:13:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:13:15] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[20:13:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:23:39] <wm-bot2>	 !log dcaro@urcuchillay tools Added a new k8s worker-nfs tools-k8s-worker-nfs-63.tools.eqiad1.wikimedia.cloud to the cluster
[20:23:39] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[20:23:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:23:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:26:41] <jinxer-wm>	 RESOLVED: SystemdUnitDown: The service unit backup_vms.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[20:26:41] <jinxer-wm>	 RESOLVED: SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[20:34:29] <wmcs-alerts>	 RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[21:03:40] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[21:03:42] <wm-bot2>	 !log dcaro@urcuchillay tools END (ERROR) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=97) for a worker-nfs role in the tools cluster
[21:03:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:03:44] <wm-bot2>	 !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[21:03:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:03:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:13:56] <wm-bot2>	 !log dcaro@urcuchillay tools Added a new k8s worker-nfs tools-k8s-worker-nfs-64.tools.eqiad1.wikimedia.cloud to the cluster
[21:13:56] <wm-bot2>	 !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[21:13:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:14:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:19:19] <wikibugs>	 Cloud-VPS (Debian Buster Deprecation), Humaniki: Cloud VPS "wikidumpparse" project Buster deprecation - https://phabricator.wikimedia.org/T367561#10094195 (Maximilianklein) update for 2024-08-19  [x] create cinder volume. [x] move project code [x] move mysql-db files [x] create a new debian bookworm inst...
[21:21:55] <wikibugs>	 Cloud-VPS (Debian Buster Deprecation), Humaniki: Cloud VPS "wikidumpparse" project Buster deprecation - https://phabricator.wikimedia.org/T367561#10094203 (Maximilianklein) Open→Resolved a:Maximilianklein done and deleted
[21:50:56] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[23:09:28] <wmcs-alerts>	 FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[23:14:28] <wmcs-alerts>	 RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[23:21:28] <wmcs-alerts>	 FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown