[00:06:56] FIRING: SystemdUnitDown: The service unit logrotate.service is in failed status on host cloudgw1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudgw1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:01:56] RESOLVED: SystemdUnitDown: The service unit logrotate.service is in failed status on host cloudgw1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudgw1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[02:10:56] FIRING: SystemdUnitDown: The service unit remove_dangling_cinder_snapshots.service is in failed status on host cloudbackup1001-dev. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1001-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[02:15:56] FIRING: [2x] SystemdUnitDown: The service unit remove_dangling_cinder_snapshots.service is in failed status on host cloudbackup1001-dev. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[02:25:04] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[02:26:07] FIRING: HarborDown: Harbor is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborDown
[02:31:08] RESOLVED: HarborDown: Harbor is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborDown
[02:45:05] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[02:55:07] FIRING: HarborDown: Harbor is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborDown
[02:55:08] FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[03:00:07] RESOLVED: HarborDown: Harbor is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborDown
[03:00:08] RESOLVED: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[03:02:18] FIRING: [2x] KernelErrors: Server cloudcephosd1041 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1041 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[03:02:28] cloud-services-team: KernelErrors Server cloudcephosd1041 logged kernel errors - https://phabricator.wikimedia.org/T400222 (phaultfinder) NEW
[03:34:14] cloud-services-team, Toolforge: Investigate daily disconnections of IRC bots hosted in Toolforge - https://phabricator.wikimedia.org/T400223 (Danilo) NEW
[03:39:36] (update) raymond-ndibe: Draft: [maintain-harbor] add tests and configurations for new maintain-harbor jobs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/881 (https://phabricator.wikimedia.org/T360509)
[03:41:01] (update) raymond-ndibe: [maintain-harbor.jobs] manage policies and robot accounts [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/47 (https://phabricator.wikimedia.org/T360509)
[03:56:09] cloud-services-team, Toolforge (Toolforge iteration 22), Goal: [harbor] Move harbor data to object storage service - https://phabricator.wikimedia.org/T350687#11026562 (Raymond_Ndibe) Stalled→In progress
[04:05:56] FIRING: SystemdUnitDown: The systemd unit remove_dangling_cinder_snapshots.service on node cloudbackup1001-dev has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1001-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:06:06] cloud-services-team: SystemdUnitDown The systemd unit remove_dangling_cinder_snapshots.service on node cloudbackup1001-dev has been failing for more than two hours. - https://phabricator.wikimedia.org/T400224 (phaultfinder) NEW
[04:10:56] FIRING: [2x] SystemdUnitDown: The systemd unit remove_dangling_cinder_snapshots.service on node cloudbackup1001-dev has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:11:09] cloud-services-team: SystemdUnitDown - https://phabricator.wikimedia.org/T400225 (phaultfinder) NEW
[04:25:56] FIRING: SystemdUnitDown: The service unit kiwix-mirror-update.service is in failed status on host clouddumps1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:32:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudcephmon2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[04:55:18] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on cloudcephmon2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:20:56] FIRING: SystemdUnitDown: The systemd unit kiwix-mirror-update.service on node clouddumps1001 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[06:21:07] cloud-services-team: SystemdUnitDown The systemd unit kiwix-mirror-update.service on node clouddumps1001 has been failing for more than two hours. - https://phabricator.wikimedia.org/T400227 (phaultfinder) NEW
[07:02:33] FIRING: [2x] KernelErrors: Server cloudcephosd1041 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1041 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[07:06:34] cloud-services-team, Toolforge, serviceops: Shield wmcloud.org and toolforge.org against crawler traffic - https://phabricator.wikimedia.org/T400212#11026650 (PerfektesChaos) >>! In T400212#11025942, @bd808 wrote: > Possible duplicate of {T226688}. Nope. I am requesting harsh action for both wmclou...
[07:15:29] cloud-services-team, Toolforge, serviceops: Shield wmcloud.org and toolforge.org against crawler traffic - https://phabricator.wikimedia.org/T400212#11026653 (PerfektesChaos) >>! In T400212#11025977, @bd808 wrote: > A robots.txt file is only advisory guidance for well-behaved web bots: The request...
[07:31:36] cloud-services-team, Toolforge, serviceops: Shield wmcloud.org and toolforge.org against crawler traffic - https://phabricator.wikimedia.org/T400212#11026723 (PerfektesChaos) The defense strategy does need a restart, since our tools, services and usage of BETA suffer from significant limitations in...
[08:15:23] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 22), Goal: [harbor] Move harbor data to object storage service - https://phabricator.wikimedia.org/T350687#11026764 (fnegri)
[08:19:39] cloud-services-team: KernelErrors Server cloudcephosd1041 logged kernel errors - https://phabricator.wikimedia.org/T400222#11026773 (fnegri) These ones are a bit concerning: ` fnegri@cloudcephosd1041:~$ sudo journalctl -k -perr --since today -- Journal begins at Fri 2025-06-27 18:34:30 UTC, ends at Wed 2025...
[08:27:05] cloud-services-team, Cloud-VPS: [openstack object storage] deleted files still occupying space - https://phabricator.wikimedia.org/T376673#11026783 (dcaro) Open→Resolved a:dcaro The space has been freed :), the radosgw-admin command work on the mons again, so this can be closed. {F65618547}
[08:30:32] cloud-services-team: KernelErrors Server cloudcephosd1041 logged kernel errors - https://phabricator.wikimedia.org/T400222#11026789 (fnegri) Full kernel logs including lower-priority messages: ` fnegri@cloudcephosd1041:~$ sudo journalctl -k --since today -- Journal begins at Fri 2025-06-27 18:34:30 UTC, end...
[08:36:55] cloud-services-team: KernelErrors Server cloudcephosd1041 logged kernel errors - https://phabricator.wikimedia.org/T400222#11026808 (fnegri) Open→Resolved a:fnegri I'm slightly confused by the fact that the `mce` messages were logged with `priority=emerg`, while the other messages were logged...
[08:52:30] cloud-services-team, Toolforge, serviceops: Shield wmcloud.org and toolforge.org against crawler traffic - https://phabricator.wikimedia.org/T400212#11026858 (taavi) >>! In T400212#11026650, @PerfektesChaos wrote: > T226688 is dealing with beta.wmcloud.org only. I do not see a single mention of bet...
[08:52:41] cloud-services-team, Toolforge, serviceops: Shield wmcloud.org and toolforge.org against crawler traffic - https://phabricator.wikimedia.org/T400212#11026860 (taavi) →Duplicate dup:T226688
[08:52:54] cloud-services-team, Cloud-VPS, Toolforge: Block web crawlers from accessing Cloud Services - https://phabricator.wikimedia.org/T226688#11026862 (taavi)
[08:53:10] cloud-services-team, Toolforge: Shield wmcloud.org and toolforge.org against crawler traffic - https://phabricator.wikimedia.org/T400212#11026864 (taavi)
[09:45:10] FIRING: [2x] ProjectProxyMainProxyCertificateExpiry: Certificate for proxy on proxy-5 is about to expire (4d 3h 38m 52s to expiration) - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyCertificateExpiry
[09:50:10] RESOLVED: [2x] ProjectProxyMainProxyCertificateExpiry: Certificate for proxy on proxy-5 is about to expire (4d 3h 37m 52s to expiration) - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyCertificateExpiry
[09:51:20] cloud-services-team, Toolforge, Patch-For-Review, Privacy: tools-static.wmflabs.org/cdnjs may return redirects to speedcf.cloudflareaccess.com, violating user privacy - https://phabricator.wikimedia.org/T399483#11027090 (taavi) > `lang=shell-session > $ curl -I 'https://tools-static.wmflabs.org/c...
[09:51:24] cloud-services-team, Cloud-VPS, Toolforge: Block web crawlers from accessing Cloud Services - https://phabricator.wikimedia.org/T226688#11027092 (PerfektesChaos) Retellling the story of T400212: The amount of disturbance has exceeded reasonable limits. BETA cannot be used by regulars since huge amoun...
[10:21:11] FIRING: SystemdUnitDown: The systemd unit kiwix-mirror-update.service on node clouddumps1001 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[10:31:28] cloud-services-team, Toolforge (Toolforge iteration 22), Patch-For-Review: [jobs-api] Jobs API should query logs from Loki - https://phabricator.wikimedia.org/T398645#11027265 (taavi) Open→In progress
[10:54:39] cloud-services-team, Cloud-VPS: Cloud VPS project creation cookbook times out really often - https://phabricator.wikimedia.org/T398712#11027295 (taavi) >>! In T398712#11018926, @Andrew wrote: > @Taavi, are you seeing delays of more than 2 or so minutes? Yes. > And do you happen to know which stage of th...
[10:59:50] cloud-services-team, Cloud-VPS, Ceph, Data-Platform-SRE: Proposed improvement: Manage CephX users via exported/collected Puppet resources - https://phabricator.wikimedia.org/T399594#11027305 (taavi)
[11:05:32] cloud-services-team, Cloud-VPS: [wmcs-cookbooks] cloudvirt.safe_reboot triggers NeutronAgentDown alert - https://phabricator.wikimedia.org/T399705#11027325 (taavi) p:Triage→Medium a:taavi T335943 decreased the Prometheus scrape interval to be short enough that we can drop the `min_over_time()...
[11:35:06] cloud-services-team, Toolforge, Patch-For-Review, Privacy: tools-static.wmflabs.org/cdnjs may return redirects to speedcf.cloudflareaccess.com, violating user privacy - https://phabricator.wikimedia.org/T399483#11027432 (LucasWerkmeister) Huh. Thanks! Theoretically that still seems undesirable t...
[12:02:15] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#11027518 (fnegri) I deleted the failed backup. @YochayCO waiting for your ok before changing the `archive_mode` setting and restarting the database.
[12:20:56] RESOLVED: SystemdUnitDown: The service unit kiwix-mirror-update.service is in failed status on host clouddumps1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:20:57] RESOLVED: SystemdUnitDown: The systemd unit kiwix-mirror-update.service on node clouddumps1001 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:25:05] cloud-services-team, Cloud-VPS, Patch-For-Review: [wmcs-cookbooks] cloudvirt.safe_reboot triggers NeutronAgentDown alert - https://phabricator.wikimedia.org/T399705#11027575 (taavi) Open→Resolved
[12:27:18] RESOLVED: KernelErrors: Server cloudcephosd1015 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1015 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors
[12:32:16] cloud-services-team, Toolforge: Build Trixie based Toolforge pre-built images - https://phabricator.wikimedia.org/T400255 (taavi) NEW p:Triage→Medium
[12:32:36] cloud-services-team, Toolforge: Build Trixie based Toolforge pre-built images - https://phabricator.wikimedia.org/T400255#11027618 (taavi)
[12:32:49] cloud-services-team, Toolforge: Build Trixie based Toolforge pre-built images - https://phabricator.wikimedia.org/T400255#11027625 (taavi) Open→Stalled
[12:33:20] cloud-services-team, Toolforge: Add support for Python 3.13 - https://phabricator.wikimedia.org/T381899#11027628 (taavi) Open→Stalled Marking as stalled on {T400255}.
[12:35:33] cloud-services-team, Toolforge: Add support for Python 3.13 - https://phabricator.wikimedia.org/T381899#11027637 (LucasWerkmeister) >>! In T381899#10394800, @LucasWerkmeister wrote: > It should also be possible to use the [Toolforge Build Service](https://wikitech.wikimedia.org/wiki/Help:Toolforge/Buildi...
[12:39:01] cloud-services-team, Toolforge: Update Toolforge Tcl image to a supported Debian release - https://phabricator.wikimedia.org/T400256 (taavi) NEW p:Triage→Low
[12:39:18] cloud-services-team, Toolforge: Build Trixie based Toolforge pre-built images - https://phabricator.wikimedia.org/T400255#11027659 (taavi)
[12:39:20] cloud-services-team, Toolforge: Update Toolforge Tcl image to a supported Debian release - https://phabricator.wikimedia.org/T400256#11027658 (taavi)
[12:39:27] cloud-services-team, Toolforge: Update Toolforge Tcl image to a supported Debian release - https://phabricator.wikimedia.org/T400256#11027661 (taavi) Blocking on T400255 as I'd like to go directly to Trixie here.
[12:44:00] cloud-services-team, Toolforge: Stop building Bullseye based Toolforge prebuilt images - https://phabricator.wikimedia.org/T400258 (taavi) NEW p:Triage→Low
[12:44:22] cloud-services-team, Toolforge: Stop building Bullseye based Toolforge prebuilt images - https://phabricator.wikimedia.org/T400258#11027692 (taavi)
[12:44:24] cloud-services-team, Toolforge: Update Toolforge Tcl image to a supported Debian release - https://phabricator.wikimedia.org/T400256#11027693 (taavi)
[12:44:25] cloud-services-team, Toolforge: Build Trixie based Toolforge pre-built images - https://phabricator.wikimedia.org/T400255#11027694 (taavi)
[12:57:11] cloud-services-team, Cloud-VPS, VPS-Projects: metricsinfra: send alerts for the catalyst project to catalyst@w.o email - https://phabricator.wikimedia.org/T386416#11027704 (taavi) Resolved→Open a:dcaro→None This doesn't seem to have ever worked; the notification emails are being bounc...
[13:03:57] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#11027714 (YochayCO) Sorry for the significant delay. Please go ahead 🙏
[13:08:52] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#11027733 (fnegri) No problem! DB restarted, please check everything is working.
[13:32:41] cloud-services-team, Cloud-VPS: [trove] Postgres uses up to 50% of disk space for wal_archive - https://phabricator.wikimedia.org/T400260 (fnegri) NEW
[13:32:52] cloud-services-team, Cloud-VPS: [trove] Postgres uses up to 50% of disk space for wal_archive - https://phabricator.wikimedia.org/T400260#11027809 (fnegri)
[13:32:53] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#11027810 (fnegri)
[13:33:15] cloud-services-team, Cloud-VPS: [trove] Postgres uses up to 50% of disk space for wal_archive - https://phabricator.wikimedia.org/T400260#11027811 (fnegri) p:Triage→Low
[13:33:57] cloud-services-team, Cloud-VPS: [trove] Postgres uses up to 50% of disk space for wal_archive - https://phabricator.wikimedia.org/T400260#11027815 (fnegri)
[13:34:00] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#11027816 (YochayCO) Looking healthy, I'll supervise it tomorrow as well just in case :) Thank you @fnegri
[13:34:05] cloud-services-team, Cloud-VPS: [trove] Postgres uses up to 50% of disk space for wal_archive - https://phabricator.wikimedia.org/T400260#11027817 (fnegri)
[13:43:06] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#11027857 (fnegri) Thank you @YochayCO ! For the record, I couldn't find any cleaner way to disable `archive_mode` than manually editing `/etc/postg...
[14:00:50] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#11027924 (fnegri) Hmm the resize failed: ` fnegri@cloudcontrol1007:~$ sudo OS_PROJECT_ID=glamwikidashboard wmcs-openstack database instance resize...
[14:04:50] cloud-services-team: SystemdUnitDown The systemd unit remove_dangling_cinder_snapshots.service on node cloudbackup1001-dev has been failing for more than two hours. - https://phabricator.wikimedia.org/T400224#11027955 (taavi) →Duplicate dup:T400225
[14:04:52] cloud-services-team: SystemdUnitDown - https://phabricator.wikimedia.org/T400225#11027957 (taavi)
[14:04:55] cloud-services-team: SystemdUnitDown The systemd unit kiwix-mirror-update.service on node clouddumps1001 has been failing for more than two hours. - https://phabricator.wikimedia.org/T400227#11027959 (fnegri) Open→Resolved a:fnegri One-off failure on the kiwis side, it's working again: ` Jul...
[14:05:59] cloud-services-team: SystemdUnitDown - https://phabricator.wikimedia.org/T400225#11027965 (taavi) `lang=irc 07:41 ceph in codfw1dev is in a bad way thanks to the mon nodes being in a weird chicken/egg situation. The cloudbackup100x-dev alerts are from that. 07:42 Feel free to ig...
[14:06:07] cloud-services-team: SystemdUnitDown - https://phabricator.wikimedia.org/T400225#11027967 (fnegri) p:Triage→Low
[14:08:07] cloud-services-team: KernelErrors Server cloudcephosd1013 logged kernel errors - https://phabricator.wikimedia.org/T399366#11027981 (fnegri) p:Triage→Medium
[14:10:56] cloud-services-team: SystemdUnitDown The systemd unit opentofu-infra-diff.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T399354#11027984 (fnegri) Open→Resolved a:fnegri Logs don't go back up to Jul 12, so we can no longer check...
[14:12:37] cloud-services-team, Toolforge: Investigate daily disconnections of IRC bots hosted in Toolforge - https://phabricator.wikimedia.org/T400223#11027998 (fnegri) p:Triage→Medium
[14:13:38] cloud-services-team, Cloud-VPS, Continuous-Integration-Infrastructure (Zuul upgrade): http://169.254.169.254/openstack/latest/user_data semi-regularly unavaliable during Magnum Kubernetes cluster builds - https://phabricator.wikimedia.org/T399596#11028001 (fnegri) p:Triage→Medium
[14:18:46] cloud-services-team, Tool-quickcategories, Toolforge, Patch-For-Review: Relax restrictions on toolforge envvar names - https://phabricator.wikimedia.org/T374780#11028043 (fnegri) p:Triage→Medium
[14:24:02] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#11028063 (fnegri) Ok a second attempt is now resizing the disk, the db should be back in a couple minutes.
[14:32:11] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#11028118 (fnegri) Resize completed. I was hoping it would be smoother, apologies for the downtime. Things should be back to normal now (with 1000GB...
[15:04:03] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#11028217 (YochayCO) Great. If I don't say anything tomorrow it means there are no problems 💪
[15:05:03] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [trove] Disk full for DBapp instance in glamwikidashboard project - https://phabricator.wikimedia.org/T396724#11028220 (fnegri) In progress→Resolved Finally marking as Resolved, please reopen if you see any issue!
[15:18:43] cloud-services-team, Cloud-VPS, Toolforge: Block web crawlers from accessing Cloud Services - https://phabricator.wikimedia.org/T226688#11028299 (Framawiki) anubis (https://anubis.techaro.lol/docs/) is a middleware tool that adds a challenge to suspicious requesters, that requires some computing time...
[16:06:09] Toolforge (Software install/update): Build Bookworm based Toolforge Kubernetes images - https://phabricator.wikimedia.org/T335507#11028544 (bd808)
[16:07:12] cloud-services-team (Kanban), Toolforge (Software install/update): Build Bullseye based Toolforge images - https://phabricator.wikimedia.org/T284590#11028547 (bd808)
[17:03:34] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [cinder] Clean up unused linkwatcher volumes in "trove" project - https://phabricator.wikimedia.org/T400285 (fnegri) NEW
[17:05:34] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [cinder] Clean up unused linkwatcher volumes in "trove" project - https://phabricator.wikimedia.org/T400285#11028769 (fnegri) p:Triage→Low
[17:13:34] cloud-services-team (FY2025/26-Q1), Cloud-VPS: [cinder] Clean up unused linkwatcher volumes in "trove" project - https://phabricator.wikimedia.org/T400285#11028789 (fnegri) Full terminal session for further debugging: https://phabricator.wikimedia.org/P79760
[17:36:57] (update) damian: [T400024] Allow protocol to be specified for ports [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/113
[17:37:41] (update) damian: [T400024] Allow protocol to be specified with port [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/113
[17:40:11] (update) damian: [T400024] Allow protocol to be specified with port [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/113
[17:40:39] (update) damian: [T400024] Allow protocol to be specified with port [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/113
[18:46:20] (update) raymond-ndibe: runtime: do the diff at the core.models.Job level [repos/cloud/toolforge/jobs-api] (fix_diff_bug) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/182 (owner: dcaro)
[18:46:21] (approved) raymond-ndibe: runtime: do the diff at the core.models.Job level [repos/cloud/toolforge/jobs-api] (fix_diff_bug) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/182 (owner: dcaro)
[18:51:17] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-harbor
[18:54:11] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-harbor
[18:56:22] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-harbor
[18:59:52] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-harbor
[19:00:12] (update) raymond-ndibe: maintain-harbor: bump to 0.0.57-20250721140622-cd1281e2 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/880 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[19:00:13] (approved) raymond-ndibe: maintain-harbor: bump to 0.0.57-20250721140622-cd1281e2 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/880 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[19:00:20] (merge) raymond-ndibe: maintain-harbor: bump to 0.0.57-20250721140622-cd1281e2 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/880 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[20:23:37] cloud-services-team, Cloud-VPS, VPS-Projects, Catalyst: metricsinfra: send alerts for the catalyst project to catalyst@w.o email - https://phabricator.wikimedia.org/T386416#11029429 (A_smart_kitten)
[20:51:57] (update) damian: [T400024] Allow protocol to be specified with port [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/113
[20:55:47] (update) damian: [T400024] Allow protocol to be specified with port [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/113
[20:58:47] (close) damian: [T400025] Add explicit support for TCP probes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/184
[21:05:56] FIRING: SystemdUnitDown: The systemd unit backup_cinder_volumes.service on node cloudbackup1001-dev has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1001-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[21:06:05] cloud-services-team: SystemdUnitDown The systemd unit backup_cinder_volumes.service on node cloudbackup1001-dev has been failing for more than two hours. - https://phabricator.wikimedia.org/T400298 (phaultfinder) NEW
[21:10:56] FIRING: [2x] SystemdUnitDown: The systemd unit backup_cinder_volumes.service on node cloudbackup1001-dev has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[21:11:06] cloud-services-team: SystemdUnitDown - https://phabricator.wikimedia.org/T400225#11029495 (phaultfinder)
[21:25:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance toolsbeta-harbor-2 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[21:28:33] cloud-services-team, Toolforge: Investigate daily disconnections of IRC bots hosted in Toolforge - https://phabricator.wikimedia.org/T400223#11029528 (Danilo)
[22:14:43] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11029581 (VRiley-WMF)
[23:00:43] (update) raymond-ndibe: api: allow protocol to be specified for ports [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/186 (owner: dcaro)
[23:00:59] (update) raymond-ndibe: api: allow protocol to be specified for ports [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/186 (owner: dcaro)
[23:39:28] FIRING: NfsAlmostFull: The NFS drive is over 85% capacity (currently 87.03%) at host paws-nfs-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull
[23:47:00] (update) raymond-ndibe: [maintain-harbor.jobs] manage policies and robot accounts [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/47 (https://phabricator.wikimedia.org/T360509)
[23:52:14] Cloud-VPS (Quota-requests), Continuous-Integration-Infrastructure (Zuul upgrade): Large quota increase for zuul Cloud VPS project - https://phabricator.wikimedia.org/T400305 (bd808) NEW
[23:52:30] Cloud-VPS (Quota-requests), Continuous-Integration-Infrastructure (Zuul upgrade): Large quota increase for zuul Cloud VPS project - https://phabricator.wikimedia.org/T400305#11029709 (bd808)