[00:32:17] FIRING: JobUnavailable: Reduced availability for job openstack in cloud@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:56] FIRING: [2x] SystemdUnitDown: The service unit remove_dangling_cinder_snapshots.service is in failed status on host cloudbackup1001-dev. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[03:17:56] FIRING: [3x] SystemdUnitDown: The service unit remove_dangling_cinder_snapshots.service is in failed status on host cloudbackup1001-dev. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:02:56] FIRING: [2x] SystemdUnitDown: The systemd unit remove_dangling_cinder_snapshots.service on node cloudbackup1001-dev has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:32:56] RESOLVED: SystemdUnitDown: The service unit remove_dangling_cinder_snapshots.service is in failed status on host cloudbackup1002-dev. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:33:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment codfw1dev for all services
[04:33:49] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.restart_openstack (exit_code=99) on deployment codfw1dev for all services
[04:43:26] RESOLVED: [2x] SystemdUnitDown: The systemd unit remove_dangling_cinder_snapshots.service on node cloudbackup1001-dev has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:46:41] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment codfw1dev for all services
[04:47:51] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.restart_openstack (exit_code=99) on deployment codfw1dev for all services
[04:48:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment codfw1dev for all services
[04:48:54] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.restart_openstack (exit_code=99) on deployment codfw1dev for all services
[04:50:11] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment codfw1dev for all services
[04:53:47] RESOLVED: JobUnavailable: Reduced availability for job openstack in cloud@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:54:21] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment codfw1dev for all services
[05:10:56] FIRING: SystemdUnitDown: The systemd unit opentofu-infra-diff.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[05:40:53] Cloud-VPS (Project-requests): Request creation of Gutendex VPS project - https://phabricator.wikimedia.org/T411158 (Ijon) NEW
[06:10:44] FIRING: MaintainDBUsersManyErrors: Maintain-dbusers is having sustained errors - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainDBUsersManyErrors - https://grafana.wikimedia.org/d/ae240a06-c13e-49f3-b12c-58432c551e85/wmcs-maintain-dbusers - https://alerts.wikimedia.org/?q=alertname%3DMaintainDBUsersManyErrors
[06:15:44] RESOLVED: MaintainDBUsersManyErrors: Maintain-dbusers is having sustained errors - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainDBUsersManyErrors - https://grafana.wikimedia.org/d/ae240a06-c13e-49f3-b12c-58432c551e85/wmcs-maintain-dbusers - https://alerts.wikimedia.org/?q=alertname%3DMaintainDBUsersManyErrors
[06:26:39] cloud-services-team (FY2025/26-Q1-Q2), Data-Services: [wikireplicas] Create views for new wiki tokwiki - https://phabricator.wikimedia.org/T404570#11411843 (Marostegui) Stalled→Open This can happen now.
[06:41:45] Tool-wsindex, Wikisource Reader App: Books with title having virama characters are being joined in the API - https://phabricator.wikimedia.org/T411159 (Bodhisattwa) NEW
[07:51:07] (CR) Muehlenhoff: [V:+2 C:+2] Remove obsolete stub secrets [labs/private] - https://gerrit.wikimedia.org/r/1211684 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:53:48] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[09:11:11] FIRING: SystemdUnitDown: The systemd unit opentofu-infra-diff.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[09:15:28] VPS-project-Wikistats: Add tokwiki to wikistats - https://phabricator.wikimedia.org/T404572#11412122 (A_smart_kitten) (noting that `tokwiki` has now been created)
[09:21:34] cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS (Project-requests): CloudVPS instance for ProVe - https://phabricator.wikimedia.org/T408387#11412137 (Albert.meronyo) Thank you all so much for your help on this! @komla @fnegri could you also please add user Albertmeronyo / amp to the project? Thanks
[09:40:56] RESOLVED: SystemdUnitDown: The systemd unit opentofu-infra-diff.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[09:46:51] cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS (Project-requests): CloudVPS instance for ProVe - https://phabricator.wikimedia.org/T408387#11412200 (fnegri) > @komla @fnegri could you also please add user Albertmeronyo / amp to the project? Thanks Done! You can also self-manage the list of users...
[10:17:29] cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS (Project-requests): CloudVPS instance for ProVe - https://phabricator.wikimedia.org/T408387#11412277 (Albert.meronyo) Amazing, thank you so much! Much appreciated :)
[10:23:56] Cloud-VPS, tools-infrastructure-team, Patch-For-Review: Improve how virt networks are configured in cloudgw - https://phabricator.wikimedia.org/T411081#11412295 (fgiunchedi) Something I wanted to add: I'm not very familiar with that part of the puppet codebase though I was wondering if we can start r...
[10:57:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirtlocal1003 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[11:02:49] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirtlocal1003 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[11:03:45] cloud-services-team (FY2025/26-Q1-Q2), Toolforge, Wiki-Loves-Monuments-Database, Sustainability (Incident Followup): [toolsdb] ibdata1 growing on primary - https://phabricator.wikimedia.org/T409716#11412414 (fnegri) History Length is still growing almost linearly, although `ibdata1` is constant a...
[11:28:35] Tool-techcontribs: expand scope of "Show descriptions" element - https://phabricator.wikimedia.org/T411171 (Novem_Linguae) NEW
[11:43:05] cloud-services-team, Infrastructure-Foundations, SRE-tools, IPv6: Some WMCS clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271139#11412503 (ayounsi) Open→Resolved a:ayounsi All solved.
[11:54:03] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[12:28:45] (open) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/53
[12:55:40] (approved) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/53 (owner: l10n-bot)
[12:55:46] (merge) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/53 (owner: l10n-bot)
[13:39:24] (open) volans: shared: k8s security group for infra-metrics-loki [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/104 (https://phabricator.wikimedia.org/T399313)
[13:41:19] (approved) taavi: shared: k8s security group for infra-metrics-loki [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/104 (https://phabricator.wikimedia.org/T399313) (owner: volans)
[13:45:48] (update) volans: shared: k8s security group for infra-metrics-loki [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/104 (https://phabricator.wikimedia.org/T399313)
[13:48:25] (merge) volans: shared: k8s security group for infra-metrics-loki [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/104 (https://phabricator.wikimedia.org/T399313)
[14:31:54] cloud-services-team (FY2025/26-Q1-Q2), Data-Services: [wikireplicas] Create views for new wiki tokwiki - https://phabricator.wikimedia.org/T404570#11413195 (taavi) Open→Resolved
[14:34:22] cloud-services-team, Data-Services: wmcs-wikireplica-dns is horribly inefficient - https://phabricator.wikimedia.org/T411192 (taavi) NEW
[14:35:44] cloud-services-team: SystemdUnitDown and SystemdUnitDownForLong - https://phabricator.wikimedia.org/T411193 (fgiunchedi) NEW
[14:35:55] cloud-services-team (FY2025/26-Q1-Q2), Toolforge, Wiki-Loves-Monuments-Database, Sustainability (Incident Followup): [toolsdb] ibdata1 growing on primary - https://phabricator.wikimedia.org/T409716#11413225 (Magnus) @fnegri I have deactivated the query in code, and removed the option to start it.
[14:37:06] cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS, Patch-For-Review: [tofu-infra] [wmcs-cookbooks] Allow running "tofu apply" on a single cluster - https://phabricator.wikimedia.org/T411090#11413227 (fnegri) > We already have the opentofu-infra-diff systemd timer that will alert us if one of the two c...
[14:37:12] cloud-services-team: SystemdUnitDown and SystemdUnitDownForLong - https://phabricator.wikimedia.org/T411193#11413228 (fgiunchedi) Something else to note: the alerts are deployed in eqiad only, not codfw
[14:38:07] cloud-services-team, Data-Services: wmcs-wikireplica-dns is horribly inefficient - https://phabricator.wikimedia.org/T411192#11413230 (taavi)
[14:38:11] cloud-services-team: SystemdUnitDown and SystemdUnitDownForLong - https://phabricator.wikimedia.org/T411193#11413231 (fnegri)
[14:38:20] cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS, Patch-For-Review: [tofu-infra] [wmcs-cookbooks] Allow running "tofu apply" on a single cluster - https://phabricator.wikimedia.org/T411090#11413232 (fnegri)
[14:48:27] cloud-services-team (FY2025/26-Q1-Q2), Toolforge, Wiki-Loves-Monuments-Database, Sustainability (Incident Followup): [toolsdb] ibdata1 growing on primary - https://phabricator.wikimedia.org/T409716#11413302 (fnegri) @Magnus thanks! That was effective: {F70688074}
[14:57:27] cloud-services-team (FY2025/26-Q1-Q2), Toolforge, Sustainability (Incident Followup): [toolsdb] Add filesystem space alerts - https://phabricator.wikimedia.org/T409404#11413320 (fnegri)
[15:05:22] cloud-services-team (FY2025/26-Q1-Q2), Toolforge, Sustainability (Incident Followup): [toolsdb] Add filesystem space alerts - https://phabricator.wikimedia.org/T409404#11413341 (fnegri)
[15:05:24] cloud-services-team, Toolforge: Move all Toolforge alerts to the toolforge/alerts git repo - https://phabricator.wikimedia.org/T410505#11413342 (fnegri)
[15:06:19] cloud-services-team (FY2025/26-Q1-Q2), Toolforge: Move all Toolforge alerts to the toolforge/alerts git repo - https://phabricator.wikimedia.org/T410505#11413344 (fnegri) a:fnegri
[15:06:49] cloud-services-team (FY2025/26-Q1-Q2), Toolforge: Move all Toolforge alerts to the toolforge/alerts git repo - https://phabricator.wikimedia.org/T410505#11413346 (fnegri) p:Triage→High
[15:10:56] cloud-services-team (FY2025/26-Q1-Q2), Toolforge, Sustainability (Incident Followup): [toolsdb] crash recovery can fail because of insufficient innodb_log_file_size - https://phabricator.wikimedia.org/T409922#11413361 (fnegri) This is promising but I'll keep this task open for a few days before resol...
[15:54:03] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[16:07:26] cloud-services-team (FY2025/26-Q1-Q2), Data-Services: [wikireplicas] Create views for new wiki tokwiki - https://phabricator.wikimedia.org/T404570#11413636 (fnegri) a:fnegri→taavi
[16:18:42] Tools, SecTeam-Processed, Security, Vuln-XSS: Reasonator XSS vulnerability - https://phabricator.wikimedia.org/T327962#11413670 (sbassett) p:Triage→Medium
[16:19:05] Tools, SecTeam-Processed, Security, Vuln-XSS: filedupes XSS vulnerability - https://phabricator.wikimedia.org/T305766#11413676 (sbassett) p:Triage→Medium
[16:19:34] Tools, SecTeam-Processed, Security, Vuln-XSS: magnustools: trans-parent XSS vulnerability - https://phabricator.wikimedia.org/T310029#11413680 (sbassett) p:Triage→Medium
[16:19:56] Tools, SecTeam-Processed, Security, Vuln-XSS: get_distinct_authors XSS vulnerability - https://phabricator.wikimedia.org/T310027#11413687 (sbassett) p:Triage→Medium
[16:20:17] cloud-services-team, Cloud-VPS: VM metadata service slow response - https://phabricator.wikimedia.org/T410983#11413692 (fgiunchedi)
[16:20:23] Tools, SecTeam-Processed, Security, Vuln-XSS: most-wanted XSS vulnerability - https://phabricator.wikimedia.org/T310026#11413693 (sbassett) p:Triage→Medium
[16:21:23] Tools, SecTeam-Processed, Security, Vuln-XSS: missingtopics XSS vulnerability - https://phabricator.wikimedia.org/T310024#11413699 (sbassett) p:Triage→Medium
[17:10:18] cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS, Toolforge: If the inactive clouddumps host goes down, it causes a ripple effect on Cloud VPS and Toolforge - https://phabricator.wikimedia.org/T391369#11413844 (fnegri) a:fnegri→None Unassigning myself, this remains important but I am doing to...
[17:11:06] Tool-wdrecentchanges: Feature Request: Performance and data handling - https://phabricator.wikimedia.org/T411205 (Gnoeee) NEW
[17:15:06] Tool-wdrecentchanges: Feature Request: Performance and data handling - https://phabricator.wikimedia.org/T411205#11413861 (Gnoeee)
[17:15:28] Tool-wdrecentchanges: Feature Request: Improve the UX and core functionality - https://phabricator.wikimedia.org/T411206 (Gnoeee) NEW
[17:22:38] Cloud-VPS (Project-requests): Request creation of Gutendex VPS project - https://phabricator.wikimedia.org/T411158#11413896 (bd808) Did you consider the possibility of hosting this as a Toolforge tool @Ijon? The general benefit of Toolforge over a Cloud VPS project would be the hosted platform nature of Tool...
[17:47:34] cloud-services-team, Toolforge: [lima-kilo] error mounting docker cache - https://phabricator.wikimedia.org/T411208 (fnegri) NEW
[17:50:38] cloud-services-team, Toolforge: [lima-kilo] error mounting docker cache - https://phabricator.wikimedia.org/T411208#11413946 (fnegri) @Volans ran into this issue some time ago, and I ran into it today. The workaround is using `./start-devenv.sh --no-cache`.
[17:51:54] cloud-services-team, Toolforge: [lima-kilo] error mounting docker cache - https://phabricator.wikimedia.org/T411208#11413948 (fnegri)
[17:59:43] Cloud-VPS (Project-requests): Request creation of Gutendex VPS project - https://phabricator.wikimedia.org/T411158#11413965 (Ijon) Oh, then that sounds fine. The postgres requirement was the only reason I concluded I need a VPS project. The docs did not reveal the possibility of a Trove database for a Toolfo...
[19:54:03] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[21:07:09] cloud-services-team (FY2025/26-Q1-Q2), Toolforge, Wiki-Loves-Monuments-Database, Sustainability (Incident Followup): [toolsdb] ibdata1 growing on primary - https://phabricator.wikimedia.org/T409716#11414292 (Usernamekiran) Just to be sure, would it be helpful if I stop my operations for a few day...
[23:54:03] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange