[04:09:40] Tool-toolwatch, GitLab (Project Migration): Migrate tool to gitlab - https://phabricator.wikimedia.org/T410850#11399265 (Gopavasanth) @Reputation22 thanks for creating this! I was lately thinking of migrating the repo, and this gave me an additional push.
[04:38:23] (update) jaredblumer: Add Server Object Rulesets and Corresponding Specs [toolforge-repos/wmf-openapi-linter] - https://gitlab.wikimedia.org/toolforge-repos/wmf-openapi-linter/-/merge_requests/3
[07:55:28] !log komla@cloudcumin1001 prove START - Cookbook wmcs.vps.create_project for project prove in eqiad1 (T408387)
[07:55:30] komla@cloudcumin1001: Unknown project "prove"
[07:55:30] T408387: CloudVPS instance for ProVe - https://phabricator.wikimedia.org/T408387
[07:56:07] (update) group_199_bot_f98be072172e323ae6d1441939d3e461: projects: added project prove [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/283 (https://phabricator.wikimedia.org/T408387)
[07:56:10] (open) group_199_bot_f98be072172e323ae6d1441939d3e461: projects: added project prove [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/283 (https://phabricator.wikimedia.org/T408387)
[07:56:28] !log komla@cloudcumin1001 prove END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project prove in eqiad1 (T408387)
[07:56:28] komla@cloudcumin1001: Unknown project "prove"
[07:58:48] cloud-services-team, PAWS: [bug] PAWS: User servers failing to spawn (timeout 30s) - https://phabricator.wikimedia.org/T410712#11399356 (fgiunchedi) I can't currently reproduce the issue -- navigating to https://hub-paws.wmcloud.org/ spawns a container for me and works as expected. Is this still a proble...
[08:24:23] Tool-toolwatch, GitLab (Project Migration): Migrate toolwatch tool to gitlab - https://phabricator.wikimedia.org/T410850#11399380 (Peachey88)
[09:11:58] (open) volans: infra-tracing: set the X-Scope-OrgId header [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1087 (https://phabricator.wikimedia.org/T399313)
[09:16:24] (approved) filippo: infra-tracing: set the X-Scope-OrgId header [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1087 (https://phabricator.wikimedia.org/T399313) (owner: volans)
[09:47:28] !log volans@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component infra-tracing
[10:01:02] cloud-services-team, PAWS: [bug] PAWS: User servers failing to spawn (timeout 30s) - https://phabricator.wikimedia.org/T410712#11399868 (Sadeiiw67) Yes I having it still @fgiunchedi
[10:03:07] !log volans@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component infra-tracing
[10:04:51] !log volans@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component infra-tracing
[10:05:26] cloud-services-team, PAWS: [bug] PAWS: User servers failing to spawn (timeout 30s) - https://phabricator.wikimedia.org/T410712#11399879 (Sadeiiw67) {F70567920}
[10:23:58] !log volans@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component infra-tracing
[10:52:43] cloud-services-team, Toolforge: [tofu-provisioning] Allow running manual "tofu state" commands - https://phabricator.wikimedia.org/T410720#11399999 (fnegri) Thanks @bd808 that's indeed an approach that might work for `toolforge/tofu-provisioning` too. I don't particularly like using the Gitlab CI setting...
[11:07:42] cloud-services-team, PAWS: [bug] PAWS: User servers failing to spawn (timeout 30s) - https://phabricator.wikimedia.org/T410712#11400037 (fnegri) I can reproduce the issue by going to https://hub-paws.wmcloud.org/hub/spawn-pending/Haywi578 and clicking "Relaunch server"
[11:14:23] cloud-services-team, PAWS: [bug] PAWS: User servers failing to spawn (timeout 30s) - https://phabricator.wikimedia.org/T410712#11400055 (fnegri) It's failing with a "permission denied" error: `lang=shell-session root@bastion:~# kubectl logs -n prod jupyter--48aywi578 [I 2025-11-24 11:12:57.570 ServerAp...
[11:26:36] cloud-services-team, PAWS: [bug] PAWS: User servers failing to spawn (timeout 30s) - https://phabricator.wikimedia.org/T410712#11400089 (fnegri) The user's home directory `/srv/paws/project/paws/userhomes/79386614` was manually locked on Nov, 19 because it contains a copy of the [xmrig](https://xmrig.com...
[11:29:04] cloud-services-team, Cloud-VPS: wmcs Trixie kernel reboots - https://phabricator.wikimedia.org/T410846#11400099 (MoritzMuehlenhoff)
[11:29:43] cloud-services-team, Cloud-VPS: wmcs Trixie kernel reboots - https://phabricator.wikimedia.org/T410846#11400102 (MoritzMuehlenhoff) There's a few more trixie cloud nodes, I took the liberty to expand the task accordingly.
[11:33:57] cloud-services-team, PAWS: [bug] PAWS: User servers failing to spawn (timeout 30s) - https://phabricator.wikimedia.org/T410712#11400107 (Sadeiiw67) It's true I was making a tutorial for my class related to mining with xmrig and telling them how it's work and after the class I have turned off the miner be...
[11:43:20] cloud-services-team, PAWS: [bug] PAWS: User servers failing to spawn (timeout 30s) - https://phabricator.wikimedia.org/T410712#11400152 (Sadeiiw67) Open→Resolved a:Sadeiiw67 As the user was trying something like mining, so its got locked, and any further action like mining will also be bl...
[13:27:02] (PS1) Volans: labs: add infra-tracing-nfs account [labs/private] - https://gerrit.wikimedia.org/r/1210591 (https://phabricator.wikimedia.org/T399313)
[14:14:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:15:11] cloud-services-team, Cloud-VPS: wmcs Trixie kernel reboots - https://phabricator.wikimedia.org/T410846#11400543 (Andrew) this is the price I pay for being an early adopter
[14:19:09] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:29:40] !log fnegri@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for main branch
[14:30:00] !log fnegri@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan for main branch
[14:31:05] cloud-services-team, Cloud-VPS: [tofu-infra] tofu failing to retrieve DNS zones on codfw - https://phabricator.wikimedia.org/T410265#11400598 (fnegri) Now it's failing with a different error, 503 accessing the statefile with S3: HeadObject ` Error refreshing state: operation error S3: HeadObject, https...
[14:53:02] !log fnegri@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for main branch
[14:53:21] !log fnegri@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan for main branch
[14:54:32] (PS2) Volans: labs: add infra-tracing-nfs account [labs/private] - https://gerrit.wikimedia.org/r/1210591 (https://phabricator.wikimedia.org/T399313)
[14:55:40] cloud-services-team, Cloud-VPS: [tofu-infra] tofu failing to retrieve DNS zones on codfw - https://phabricator.wikimedia.org/T410265#11400741 (fnegri) I checked in root@cloudcontrol2005-dev and the credentials in /etc/tofu.env are correct, I can use them with `awscli` but listing the file randomly fails...
[14:56:49] (CR) Ladsgroup: [C:+2] "We need to disable diffusion." [labs/codesearch] - https://gerrit.wikimedia.org/r/1208445 (https://phabricator.wikimedia.org/T410624) (owner: Ladsgroup)
[14:57:06] (CR) Volans: "Required by the related change:" [labs/private] - https://gerrit.wikimedia.org/r/1210591 (https://phabricator.wikimedia.org/T399313) (owner: Volans)
[14:58:31] (Merged) jenkins-bot: Switch off netbox-exports from diffusion [labs/codesearch] - https://gerrit.wikimedia.org/r/1208445 (https://phabricator.wikimedia.org/T410624) (owner: Ladsgroup)
[14:59:53] cloud-services-team, Cloud-VPS: [tofu-infra] "tofu plan" failing in codfw - https://phabricator.wikimedia.org/T410265#11400751 (fnegri)
[15:00:40] VPS-project-Codesearch, collaboration-services, Patch-For-Review: Stop pulling netbox-exported-dns repo from Phabricator Diffusion (which mirrors netbox-exports.wikimedia.org) - https://phabricator.wikimedia.org/T410624#11400752 (Ladsgroup) Open→Resolved a:Ladsgroup It'll be live in 2...
[15:01:05] !log fnegri@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for main branch
[15:01:25] !log fnegri@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.tofu (exit_code=99) running tofu plan for main branch
[15:01:51] cloud-services-team, Cloud-VPS: [tofu-infra] "tofu plan" failing in codfw - https://phabricator.wikimedia.org/T410265#11400757 (fnegri) Possibly related to {T394061}.
[15:45:16] (CR) Alexandros Kosiaris: [C:+1] labs: add infra-tracing-nfs account [labs/private] - https://gerrit.wikimedia.org/r/1210591 (https://phabricator.wikimedia.org/T399313) (owner: Volans)
[16:28:25] cloud-services-team (FY2025/26-Q1-Q2), Toolforge, Wiki-Loves-Monuments-Database, Sustainability (Incident Followup): [toolsdb] ibdata1 growing on primary - https://phabricator.wikimedia.org/T409716#11401281 (fnegri) > I check that email every day, for the week I didn't check, this happened. LOL!...
[16:37:24] VPS-project-Codesearch: codesearch-write-config cronjob failing since 15 Dec: "RuntimeError: Unsure how to handle URL: https://codeberg.org/chdorner/CheckRegistrationEmailDomains" - https://phabricator.wikimedia.org/T383192#11401352 (LSobanski)
[17:11:29] !log taavi@cloudcumin1001 project-proxy START - Cookbook wmcs.vps.instance.stop_start vm project-proxy-acme-chief-03 (cluster eqiad1)
[17:11:58] !log taavi@cloudcumin1001 project-proxy END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm project-proxy-acme-chief-03 (cluster eqiad1)
[17:23:43] cloud-services-team (FY2025/26-Q1-Q2), Toolforge, Patch-For-Review, Sustainability (Incident Followup): [toolsdb] crash recovery can fail because of insufficient innodb_log_file_size - https://phabricator.wikimedia.org/T409922#11401719 (fnegri) This is still happening: `lang=shell-session fnegri...
[17:33:28] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance maps-proxy-6 in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[18:01:10] (merge) volans: infra-tracing: set the X-Scope-OrgId header [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1087 (https://phabricator.wikimedia.org/T399313)
[18:24:45] (CR) Jforrester: "check experimental" [labs/tools/extjsonuploader] - https://gerrit.wikimedia.org/r/1173970 (https://phabricator.wikimedia.org/T315923) (owner: Brian Wolff)
[18:24:49] (CR) Jforrester: "check experimental" [labs/tools/force-rebase] - https://gerrit.wikimedia.org/r/881434 (owner: DannyS712)
[18:24:53] (CR) Jforrester: "check experimental" [labs/tools/intuition] - https://gerrit.wikimedia.org/r/1210567 (owner: L10n-bot)
[18:25:18] (CR) Jforrester: "check experimental" [labs/tools/guc] - https://gerrit.wikimedia.org/r/1203439 (owner: L10n-bot)
[18:25:39] (CR) Jforrester: "check experimental" [labs/tools/toolbase] - https://gerrit.wikimedia.org/r/1180641 (owner: Krinkle)
[18:26:27] (CR) Jforrester: "check experimental" [labs/tools/toolbase] - https://gerrit.wikimedia.org/r/1180641 (owner: Krinkle)
[18:31:16] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1041', 'cloudvirt1042', 'cloudvirt1043']
[18:32:24] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1041', 'cloudvirt1042', 'cloudvirt1043']
[18:34:04] cloud-services-team, Cloud-VPS: cloud VPS - cinder mounting broke - https://phabricator.wikimedia.org/T410936 (Dzahn) NEW
[18:35:19] cloud-services-team, Cloud-VPS: cloud VPS - puppet broken - cinderutils 'fstype must not be an empty string' - https://phabricator.wikimedia.org/T410936#11402243 (Dzahn)
[18:36:26] (PS1) Volans: labs: enable infra-tracing-nfs tracing [labs/private] - https://gerrit.wikimedia.org/r/1210664 (https://phabricator.wikimedia.org/T399313)
[18:42:07] (CR) Jforrester: [V:+2] "tox failure is (still) unrelated." [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/1185449 (owner: Libraryupgrader)
[18:42:20] (CR) Jforrester: [V:+2] "check experimental" [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/1185449 (owner: Libraryupgrader)
[18:52:39] cloud-services-team, Cloud-VPS: cloud VPS - puppet broken - cinderutils 'fstype must not be an empty string' - https://phabricator.wikimedia.org/T410936#11402380 (Dzahn) So.. the `fstype`
[18:55:57] cloud-services-team, Cloud-VPS: cloud VPS - puppet broken - cinderutils 'fstype must not be an empty string' - https://phabricator.wikimedia.org/T410936#11402429 (Dzahn) `facter block_devices` works normal and looks about the same on both wikistats-trixie and wikistats-bookworm instance but this whole pr...
[18:56:59] cloud-services-team, Cloud-VPS: cloud VPS - puppet broken - cinderutils 'fstype must not be an empty string' - https://phabricator.wikimedia.org/T410936#11402439 (Dzahn) ` dzahn@wikistats-trixie:~$ facter block_devices | grep fstype fstype => null fstype => "ext4" fstype => null fstype =>...
[19:02:12] cloud-services-team, Cloud-VPS: cloud VPS - puppet broken - cinderutils 'fstype must not be an empty string' - https://phabricator.wikimedia.org/T410936#11402510 (Dzahn) just `umount /srv/wikistats/backup` to unmount the backup dir fixed the puppet-brokeness (and daily nagging mails) for now.
[19:02:35] cloud-services-team, Cloud-VPS, VPS-project-Wikistats: cloud VPS - puppet broken - cinderutils 'fstype must not be an empty string' - https://phabricator.wikimedia.org/T410936#11402511 (Dzahn)
[19:10:54] cloud-services-team, PAWS: [bug] PAWS: User servers failing to spawn (timeout 30s) - https://phabricator.wikimedia.org/T410712#11402536 (Pppery) Resolved→Declined
[19:11:17] cloud-services-team, PAWS: [bug] PAWS: User servers failing to spawn (timeout 30s) - https://phabricator.wikimedia.org/T410712#11402538 (Pppery) a:Sadeiiw67→None
[19:13:21] cloud-services-team, PAWS: [bug] PAWS: User servers failing to spawn (timeout 30s) - https://phabricator.wikimedia.org/T410712#11402553 (Pppery) There is a real bug buried in the tide of noise here - PAWS should report the error to the user better. That is, if a user's homedir is locked it should say...
[19:30:03] Tools, PES1.3.3 WP25 Easter Eggs, Software-Licensing: wikipedia25-years-of-wikipedia tool loads and uses non-free JavaScript - https://phabricator.wikimedia.org/T410465#11402599 (BCornwall) a:ATitkov
[20:41:16] PAWS, tools-platform-team: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T410947#11402828 (LibUp-bot)
[20:41:17] Toolforge, tools-platform-team: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T410948#11402830 (LibUp-bot)
[22:32:42] cloud-services-team, Cloud-VPS: Keystone not cleaning up ldap groups on project delete - https://phabricator.wikimedia.org/T397648#11403355 (Andrew) Open→Resolved This is now taken care of by the project deletion cookbook.
[22:35:18] (CR) A smart kitten: Switch off netbox-exports from diffusion (1 comment) [labs/codesearch] - https://gerrit.wikimedia.org/r/1208445 (https://phabricator.wikimedia.org/T410624) (owner: Ladsgroup)
[22:43:20] cloud-services-team, Cloud-VPS: wmcs Trixie kernel reboots - https://phabricator.wikimedia.org/T410846#11403382 (Andrew)
[22:51:17] FIRING: JobUnavailable: Reduced availability for job openstack in cloud@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:56:17] RESOLVED: JobUnavailable: Reduced availability for job openstack in cloud@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable