[00:03:28] RESOLVED: [2x] PuppetAgentNoResources: No Puppet resources found on instance syslog-server-audit01 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[00:17:19] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1058 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[00:23:19] (PS1) Krinkle: Normalize IPv6 inputs before doing the database query [labs/tools/guc] - https://gerrit.wikimedia.org/r/1163909
[00:27:45] PROBLEM - toolschecker: Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/redis - 324 bytes in 60.017 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[00:33:19] RECOVERY - toolschecker: Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 22.740 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[00:33:56] Tool-Global-user-contributions: Support passing short/lowercase forms of IPv6 to GUC - https://phabricator.wikimedia.org/T397892 (Krinkle) NEW
[00:34:50] (PS2) Krinkle: Normalize IPv6 inputs before doing the database query [labs/tools/guc] - https://gerrit.wikimedia.org/r/1163909 (https://phabricator.wikimedia.org/T397892)
[00:35:09] Tool-Global-user-contributions, Patch-For-Review: Support passing short/lowercase forms of IPv6 to GUC - https://phabricator.wikimedia.org/T397892#10949103 (Krinkle) p:Triage→Medium a:Krinkle
[00:35:31] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance cvn-app10 in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:38:46] (CR) Krinkle: [C:+2] Normalize IPv6 inputs before doing the database query [labs/tools/guc] - https://gerrit.wikimedia.org/r/1163909 (https://phabricator.wikimedia.org/T397892) (owner: Krinkle)
[00:39:15] (Merged) jenkins-bot: Normalize IPv6 inputs before doing the database query [labs/tools/guc] - https://gerrit.wikimedia.org/r/1163909 (https://phabricator.wikimedia.org/T397892) (owner: Krinkle)
[00:41:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1060 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[00:49:11] Tool-Global-user-contributions, Patch-For-Review: Support passing short/lowercase forms of IPv6 to GUC - https://phabricator.wikimedia.org/T397892#10949108 (Krinkle) Open→Resolved
[01:01:50] FIRING: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1060 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[01:21:50] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1061 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[01:33:20] FIRING: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1061 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[01:53:20] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1063 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[02:02:20] FIRING: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1063 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[02:17:20] FIRING: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1064 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[02:37:20] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1065 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[02:41:19] FIRING: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1065 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[03:01:19] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1066 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[03:14:20] FIRING: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1066 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[03:29:19] FIRING: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1068 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[03:49:19] FIRING: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1069 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[04:00:19] andrew@cloudcumin1001 safe_reboot (PID 3795026) is awaiting input
[04:04:19] RESOLVED: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1069 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[05:48:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-61 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[06:50:04] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[06:58:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-61 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:00:16] cloud-services-team, Cloud-VPS: Neutron policy does not allow the admin role to modify security groups - https://phabricator.wikimedia.org/T348582#10949361 (taavi) Open→Resolved
[07:05:04] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce
[07:23:16] (update) dcaro: cancel: add new subcommand to cancel a deployment [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/45
[07:27:27] (update) dcaro: cancel: add endpoint to cancel an ongoing deployment [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/99 (https://phabricator.wikimedia.org/T395039)
[07:31:14] (update) dcaro: cancel: add new subcommand to cancel a deployment [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/45
[07:31:36] (update) dcaro: cancel: add endpoint to cancel an ongoing deployment [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/99 (https://phabricator.wikimedia.org/T395039)
[07:31:43] (update) dcaro: cancel: add endpoint to cancel an ongoing deployment [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/99 (https://phabricator.wikimedia.org/T395039)
[07:47:00] (open) dcaro: handle unparsable loglines [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/81 (https://phabricator.wikimedia.org/T362521)
[07:52:54] (update) dcaro: handle unparsable loglines [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/81 (https://phabricator.wikimedia.org/T362521)
[07:53:18] (update) dcaro: logs: handle case where date can't be parsed [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/81 (https://phabricator.wikimedia.org/T362521)
[08:07:36] (update) dcaro: update_tool_config: Return a warning for each non-managed field [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/97 (https://phabricator.wikimedia.org/T395070)
[08:09:24] (update) dcaro: update_tool_config: Return a warning for each non-managed field [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/97 (https://phabricator.wikimedia.org/T395070)
[08:10:10] (update) dcaro: logs: handle case where date can't be parsed [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/81 (https://phabricator.wikimedia.org/T362521)
[08:13:30] (update) dcaro: update_tool_config: Return a warning for each non-managed field [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/97 (https://phabricator.wikimedia.org/T395070)
[08:16:58] cloud-services-team, Toolforge, Patch-For-Review: [jobs-api] logs internal datetime error - https://phabricator.wikimedia.org/T362521#10949445 (dcaro) This patch continues handling the logs (uses the current date), and shows a warning too to avoid them getting lost, though I have been unable to repro...
[08:50:29] (open) taavi: project: Fix default security group filtering [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/251 (https://phabricator.wikimedia.org/T397901)
[08:51:10] (approved) fnegri: project: Fix default security group filtering [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/251 (https://phabricator.wikimedia.org/T397901) (owner: taavi)
[08:51:15] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/251
[08:51:53] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/251
[08:52:24] (merge) taavi: project: Fix default security group filtering [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/251 (https://phabricator.wikimedia.org/T397901)
[08:52:25] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[08:53:58] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch
[09:14:13] (update) dcaro: cancel: add endpoint to cancel an ongoing deployment [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/99 (https://phabricator.wikimedia.org/T395039)
[09:25:18] (update) taavi: logging: loki: Set nameOverride [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/825 (https://phabricator.wikimedia.org/T386480)
[09:25:18] (update) taavi: logging: alloy: Fix loki write service name [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/826 (https://phabricator.wikimedia.org/T386480)
[09:25:18] (update) taavi: logging: loki: Add network policy rule for object storage access [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/827 (https://phabricator.wikimedia.org/T386480)
[09:25:19] (update) taavi: logging: loki: Add second Loki instance for infrastructure logs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/834 (https://phabricator.wikimedia.org/T386480 https://phabricator.wikimedia.org/T97861)
[09:25:20] (update) taavi: logging: alloy: Add routing for infrastructure logs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/835 (https://phabricator.wikimedia.org/T386480 https://phabricator.wikimedia.org/T97861)
[09:25:20] (update) taavi: logging: alloy: Allow running on the entire cluster [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/836 (https://phabricator.wikimedia.org/T97861)
[09:25:30] (update) taavi: logging: alloy: Add routing for infrastructure logs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/835 (https://phabricator.wikimedia.org/T386480 https://phabricator.wikimedia.org/T97861)
[09:25:35] (update) taavi: logging: loki: Set nameOverride [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/825 (https://phabricator.wikimedia.org/T386480)
[09:25:37] (update) taavi: logging: loki: Add network policy rule for object storage access [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/827 (https://phabricator.wikimedia.org/T386480)
[09:25:38] (update) taavi: logging: alloy: Allow running on the entire cluster [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/836 (https://phabricator.wikimedia.org/T97861)
[09:25:40] (update) taavi: logging: alloy: Fix loki write service name [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/826 (https://phabricator.wikimedia.org/T386480)
[09:25:44] (update) taavi: logging: loki: Add second Loki instance for infrastructure logs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/834 (https://phabricator.wikimedia.org/T386480 https://phabricator.wikimedia.org/T97861)
[09:38:35] (update) taavi: logging: loki: Set nameOverride [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/825 (https://phabricator.wikimedia.org/T386480)
[09:38:39] (merge) taavi: logging: loki: Set nameOverride [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/825 (https://phabricator.wikimedia.org/T386480)
[09:38:41] (update) taavi: logging: alloy: Fix loki write service name [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/826 (https://phabricator.wikimedia.org/T386480)
[09:38:50] (merge) taavi: logging: alloy: Fix loki write service name [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/826 (https://phabricator.wikimedia.org/T386480)
[09:38:51] (update) taavi: logging: loki: Add network policy rule for object storage access [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/827 (https://phabricator.wikimedia.org/T386480)
[10:01:02] (update) dcaro: build: fail if ref failed to resolve [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/96
[10:24:23] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070)
[10:27:40] (open) dcaro: use logging error [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/100
[10:28:04] (open) dcaro: global: use logging.error instead of exception out of handlers [repos/cloud/toolforge/components-api] (fail_if_ref_failed_to_resolve) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/101
[10:28:48] (close) dcaro: use logging error [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/100
[10:29:00] (update) dcaro: global: use logging.error instead of exception out of handlers [repos/cloud/toolforge/components-api] (fail_if_ref_failed_to_resolve) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/101
[10:30:10] (update) dcaro: global: use logging.error instead of exception out of handlers [repos/cloud/toolforge/components-api] (fail_if_ref_failed_to_resolve) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/101
[10:39:30] (approved) fnegri: global: use logging.error instead of exception out of handlers [repos/cloud/toolforge/components-api] (fail_if_ref_failed_to_resolve) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/101 (owner: dcaro)
[12:00:47] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070)
[12:04:50] (update) dcaro: build: fail if ref failed to resolve [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/96
[12:12:53] (update) dcaro: builds: handle long_status [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/39
[12:17:25] (approved) fnegri: builds: handle long_status [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/39 (owner: dcaro)
[12:21:04] (open) dcaro: NewJob: set the default mount accordingly if not passed [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/175
[12:21:58] (update) dcaro: NewJob: set the default mount accordingly if not passed [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/175
[12:25:21] (merge) dcaro: builds: handle long_status [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/39
[12:27:11] (update) dcaro: NewJob: set the default mount accordingly if not passed [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/175
[12:27:50] (approved) fnegri: build: fail if ref failed to resolve [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/96 (owner: dcaro)
[12:30:34] (merge) dcaro: build: fail if ref failed to resolve [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/96
[12:30:37] (update) dcaro: global: use logging.error instead of exception out of handlers [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/101
[12:32:56] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.125-20250626123043-1d4d364e [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/839
[12:34:09] (merge) dcaro: global: use logging.error instead of exception out of handlers [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/101
[12:34:46] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070)
[12:35:00] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-api
[12:36:37] cloud-services-team, Toolforge, Patch-For-Review: [toolforge,infra] Centralized logging for Toolforge infrastructure logs - https://phabricator.wikimedia.org/T97861#10950163 (dcaro) @taavi hey, can you update this task with your plans on using loki for this? And how does it fit in the overall picture...
[12:36:41] (update) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.126-20250626123417-995eb248 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/839
[12:36:45] (update) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.126-20250626123417-995eb248 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/839
[12:37:36] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070)
[12:37:46] (update) dcaro: NewJob: set the default mount accordingly if not passed [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/175
[12:38:48] (update) dcaro: NewJob: set the default mount accordingly if not passed [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/175
[12:38:50] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api
[12:39:08] (update) dcaro: update_tool_config: Return a warning for each non-managed field [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/97 (https://phabricator.wikimedia.org/T395070)
[12:40:28] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component components-api
[12:42:16] (open) dcaro: d/changelog: bump to 0.0.11 [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/47 (https://phabricator.wikimedia.org/T395077)
[12:44:27] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api
[13:34:42] cloud-services-team, Toolforge, Patch-For-Review: [toolforge,infra] Centralized logging for Toolforge infrastructure logs - https://phabricator.wikimedia.org/T97861#10950390 (taavi) Yes. My plan is to feed logs of everything running inside Kubernetes cluster itself to a Loki instance hosted in there....
[13:38:12] (approved) taavi: NewJob: set the default mount accordingly if not passed [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/175 (owner: dcaro)
[13:45:50] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1072 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[13:45:58] (open) taavi: Migrate Toolforge deployment to components [toolforge-repos/ircservserv-config] - https://gitlab.wikimedia.org/toolforge-repos/ircservserv-config/-/merge_requests/28 (https://phabricator.wikimedia.org/T397929)
[13:46:01] (update) taavi: Migrate Toolforge deployment to components [toolforge-repos/ircservserv-config] - https://gitlab.wikimedia.org/toolforge-repos/ircservserv-config/-/merge_requests/28 (https://phabricator.wikimedia.org/T397929)
[13:46:08] (update) taavi: Migrate Toolforge deployment to components [toolforge-repos/ircservserv-config] - https://gitlab.wikimedia.org/toolforge-repos/ircservserv-config/-/merge_requests/28 (https://phabricator.wikimedia.org/T397929)
[13:51:00] (update) fnegri: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070) (owner: dcaro)
[13:51:38] (approved) fnegri: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070) (owner: dcaro)
[13:51:50] (update) fnegri: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070) (owner: dcaro)
[13:52:06] (update) dcaro: NewJob: set the default mount accordingly if not passed [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/175
[13:52:11] (update) dcaro: NewJob: set the default mount accordingly if not passed [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/175
[13:52:25] (update) dcaro: NewJob: set the default mount accordingly if not passed [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/175
[13:53:42] (approved) dcaro: components-api: bump to 0.0.126-20250626123417-995eb248 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/839 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:53:46] (merge) dcaro: components-api: bump to 0.0.126-20250626123417-995eb248 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/839 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:54:12] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-cli
[13:56:24] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-cli
[13:58:47] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component components-cli
[13:59:23] cloud-services-team, Toolforge: Disable tools.maintain-harbor - https://phabricator.wikimedia.org/T397933 (taavi) NEW
[14:00:33] cloud-services-team, Toolforge: Disable tools.maintain-harbor - https://phabricator.wikimedia.org/T397933#10950562 (taavi) a:Raymond_Ndibe @Raymond_Ndibe says he wants to check if the tool still has something that needs to be stored before disabling it.
[14:01:27] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-cli
[14:01:29] cloud-services-team, Toolforge: Disable tools.maintain-harbor - https://phabricator.wikimedia.org/T397933#10950571 (taavi)
[14:02:19] (approved) dcaro: d/changelog: bump to 0.0.11 [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/47 (https://phabricator.wikimedia.org/T395077)
[14:02:22] (merge) dcaro: d/changelog: bump to 0.0.11 [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/47 (https://phabricator.wikimedia.org/T395077)
[14:04:17] (merge) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070)
[14:04:19] (update) dcaro: scheduled: add scheduled component support [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/94 (https://phabricator.wikimedia.org/T395071)
[14:05:50] FIRING: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1072 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[14:07:05] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.127-20250626140427-fb508bbe [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/840 (https://phabricator.wikimedia.org/T395070)
[14:08:06] (update) dcaro: update_tool_config: Return a warning for each non-managed field [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/97 (https://phabricator.wikimedia.org/T395070)
[14:08:59] Toolforge (Toolforge iteration 21), Patch-For-Review: [components-api] Add endpoint to get what would be the "current" config - https://phabricator.wikimedia.org/T394753#10950628 (dcaro) In progress→Resolved
[14:12:03] cloud-services-team, Cloud-VPS: Neutron metadata service failing for all VMs - https://phabricator.wikimedia.org/T395742#10950644 (Andrew)
[14:13:33] cloud-services-team, Cloud-VPS: Neutron metadata service failing for all VMs - https://phabricator.wikimedia.org/T395742#10950651 (taavi) Hmm, isn't that about {T395255}? Or are these the same bug?
[14:20:50] FIRING: [3x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1072 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [14:26:01] (open) chuckonwumelu: Demo: Create volume for demo [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/56 [14:30:01] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-api [14:34:11] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api [14:35:50] FIRING: [3x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1073 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [14:37:48] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component components-api [14:41:54] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api [14:45:13] (update) dcaro: update_tool_config: Return a warning for each non-managed field [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/97 (https://phabricator.wikimedia.org/T395070) [14:45:43] Toolforge (Toolforge iteration 21), Patch-For-Review: [components-api] add all the missing options for continuous components - https://phabricator.wikimedia.org/T395070#10950893 (dcaro) In progress→Resolved [14:46:00] (merge) dcaro: NewJob: set the default mount accordingly if not passed [repos/cloud/toolforge/jobs-api] -
https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/175 [14:46:43] (merge) dcaro: components-api: bump to 0.0.127-20250626140427-fb508bbe [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/840 (https://phabricator.wikimedia.org/T395070) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [14:47:51] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'P{O:wmcs::openstack::eqiad1::virt_ceph}' [14:49:01] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: jobs-api: bump to 0.0.382-20250626144611-d4e720ec [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/841 [14:50:50] FIRING: [3x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1074 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [14:51:13] cloud-services-team, Toolforge: [jobs-api] logs internal datetime error - https://phabricator.wikimedia.org/T362521#10950917 (derenrich) I think using the python library tqdm causes it.
[14:55:26] PROBLEM - Host cloudbackup2004 is DOWN: PING CRITICAL - Packet loss = 100% [14:55:49] FIRING: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1075 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [14:58:22] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirtlocal1001.eqiad.wmnet}' [14:58:58] RECOVERY - Host cloudbackup2004 is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms [15:03:22] PROBLEM - Host cloudvirtlocal1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:04:32] RECOVERY - Host cloudvirtlocal1001 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [15:04:46] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirtlocal1001.eqiad.wmnet}' [15:05:50] FIRING: [3x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1075 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:10:49] RESOLVED: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1075 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:12:13] (update) chuckonwumelu: Demo: Create volume for demo [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/56 [15:18:59] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by
'D{cloudvirtlocal1002.eqiad.wmnet}' [15:22:41] PROBLEM - Host cloudvirtlocal1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:24:11] RECOVERY - Host cloudvirtlocal1002 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [15:24:17] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirtlocal1002.eqiad.wmnet}' [15:25:50] FIRING: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1076 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:29:32] (close) chuckonwumelu: Demo: Create volume for demo [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/56 [15:34:49] Toolforge (Toolforge iteration 21): [docs] enable docs linter in one of the repos - https://phabricator.wikimedia.org/T397949 (dcaro) NEW [15:35:50] FIRING: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirtlocal1001 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:41:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirtlocal1003.eqiad.wmnet}' [15:45:09] PROBLEM - Host cloudvirtlocal1003 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:24] FIRING: [5x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: toolsbeta-test-k8s-control-10.toolsbeta.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 -
https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown [15:45:50] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirtlocal1002 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:46:37] RECOVERY - Host cloudvirtlocal1003 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [15:46:39] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirtlocal1003.eqiad.wmnet}' [15:50:24] RESOLVED: [5x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: toolsbeta-test-k8s-control-10.toolsbeta.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown [15:51:21] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [15:55:44] (approved) fnegri: update_tool_config: Return a warning for each non-managed field [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/97 (https://phabricator.wikimedia.org/T395070) (owner: dcaro) [15:59:59] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) [16:01:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node [16:01:31] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) [16:02:09] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add [16:02:55] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook
wmcs.toolforge.component.deploy for component jobs-api [16:03:34] (update) dcaro: update_tool_config: Return a warning for each non-managed field [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/97 (https://phabricator.wikimedia.org/T395070) [16:08:07] (merge) dcaro: update_tool_config: Return a warning for each non-managed field [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/97 (https://phabricator.wikimedia.org/T395070) [16:10:24] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api [16:10:55] andrew@cloudcumin1001 bootstrap_and_add (PID 371659) is awaiting input [16:11:19] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [16:11:40] (update) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.128-20250626160815-d07803d2 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/842 (https://phabricator.wikimedia.org/T395070) [16:11:46] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.128-20250626160815-d07803d2 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/842 (https://phabricator.wikimedia.org/T395070) [16:14:15] Tool-paulina: Restore scholary works to author works query results - https://phabricator.wikimedia.org/T397963 (marfossatti) NEW [16:14:36] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) [16:19:12] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api [16:20:37] (approved) dcaro: jobs-api: bump to
0.0.382-20250626144611-d4e720ec [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/841 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [16:20:39] (merge) dcaro: jobs-api: bump to 0.0.382-20250626144611-d4e720ec [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/841 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [16:20:44] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-api [16:24:23] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api [16:28:43] cloud-services-team, DC-Ops, decommission-hardware, ops-codfw, SRE: decommission cloudcontrol2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T396396#10951416 (Jhancock.wm) @Andrew this has been unracked and disks removed, but ran into an error running the offline script in netbox. lo...
[16:28:54] cloud-services-team, DC-Ops, decommission-hardware, ops-codfw, SRE: decommission cloudcontrol2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T396396#10951417 (Jhancock.wm) [16:29:23] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component components-api [16:31:37] (open) eliza189: Eliza views bugs [toolforge-repos/miss-search] (update-cycle-toolforge-testing) - https://gitlab.wikimedia.org/toolforge-repos/miss-search/-/merge_requests/9 [16:32:51] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api [16:34:09] (update) chuckonwumelu: bash-completion: Add file system recognition to autocomplete [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/46 (https://phabricator.wikimedia.org/T395077) [16:35:30] (update) dcaro: components-api: bump to 0.0.128-20250626160815-d07803d2 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/842 (https://phabricator.wikimedia.org/T395070) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [16:35:39] (approved) dcaro: components-api: bump to 0.0.128-20250626160815-d07803d2 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/842 (https://phabricator.wikimedia.org/T395070) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [16:36:11] (merge) dcaro: components-api: bump to 0.0.128-20250626160815-d07803d2 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/842 (https://phabricator.wikimedia.org/T395070) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [16:39:08] (update) chuckonwumelu: bash-completion: Add file system recognition to autocomplete
[repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/46 (https://phabricator.wikimedia.org/T395077) [16:49:16] cloud-services-team, Toolforge: [jobs-api] logs internal datetime error - https://phabricator.wikimedia.org/T362521#10951472 (dcaro) Did a quick investigation, and I found that with the `tqdm` library as @derenrich says, the logs come in lines that are not preceded with the timestamp (this is from a sill... [16:53:37] Toolforge (Toolforge iteration 21): [components-api] Add warning when keys of the tool config are not understood - https://phabricator.wikimedia.org/T397828#10951489 (dcaro) In progress→Resolved [16:55:06] cloud-services-team, Toolforge: [jobs-api] logs internal datetime error - https://phabricator.wikimedia.org/T362521#10951517 (dcaro) The kubectl logs with timestamp show that too, one line with the timestamp, many without: ` local.tf-test@toolslocal:~$ kubectl logs testlogs-654f6b5bf7-6cv6m --timestamps...
[16:59:11] cloud-services-team, decommission-hardware: decommission cloudcephosd200[123]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T397968 (Andrew) NEW [16:59:29] cloud-services-team, decommission-hardware: decommission cloudcephosd200[123]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T397968#10951539 (Andrew) [16:59:32] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE: Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10951538 (Andrew) [17:05:48] cloud-services-team, Cloud-VPS: OpenStack services should use system users to talk to Keystone - https://phabricator.wikimedia.org/T273150#10951558 (Andrew) [17:05:51] cloud-services-team, Cloud-VPS, Patch-For-Review: Modernize openstack rbac - https://phabricator.wikimedia.org/T330759#10951557 (Andrew) [17:06:21] cloud-services-team, Cloud-VPS: OpenStack services should use system users to talk to Keystone - https://phabricator.wikimedia.org/T273150#10951559 (Andrew) Shockingly, I have made some progress on this. Description updated to show progress [17:11:35] cloud-services-team, Cloud-VPS: OpenStack services should use system users to talk to Keystone - https://phabricator.wikimedia.org/T273150#10951574 (Andrew) [17:15:31] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [17:25:12] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) [17:26:52] (update) dcaro: cancel: add endpoint to cancel an ongoing deployment [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/99 (https://phabricator.wikimedia.org/T395039) [17:29:46] cloud-services-team, Toolforge (Toolforge iteration 21), Patch-For-Review: `toolforge jobs dump` fails for tools.stewardsbot - https://phabricator.wikimedia.org/T396210#10951617 (Raymond_Ndibe) >>!
In T396210#10903049, @dcaro wrote: > @Raymond_Ndibe Hmmm... that change should not have been backwards... [17:42:35] (update) ilanen1: Ilanmerge [toolforge-repos/miss-search] (update-cycle) - https://gitlab.wikimedia.org/toolforge-repos/miss-search/-/merge_requests/8 [17:48:45] Tool-translatetagger: Create Gadget to Simplify Workflow for Adding Translation Tag - https://phabricator.wikimedia.org/T393170#10951692 (Gopavasanth) Hi @TiagoLubiana, Here is the script: https://www.mediawiki.org/w/load.php?modules=ext.gadget.TranslateTagger You can also enable this as a gadget from: http... [17:49:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy [17:56:22] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) [18:05:50] cloud-services-team, Cloud-VPS: Neutron metadata service failing for all VMs - https://phabricator.wikimedia.org/T395742#10951765 (Andrew) I /think/ it is the same bug. But you're right, the bug as described more closely resembles T395255 [18:11:31] cloud-services-team, DC-Ops, decommission-hardware, ops-codfw, SRE: decommission cloudcontrol2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T396396#10951792 (Andrew) I think I have removed cloudcontrol2004-dev.private.codfw.wikimedia.cloud and the associated IP from netbox so hopefu...
[18:21:16] cloud-services-team, decommission-hardware: decommission cloudcephosd2003-dev.codfw.wmnet - https://phabricator.wikimedia.org/T397979 (Andrew) NEW [18:21:35] cloud-services-team, decommission-hardware, Patch-For-Review: decommission cloudcephosd200[12]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T397968#10951826 (Andrew) [18:22:34] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE: Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10951828 (Andrew) [18:22:35] cloud-services-team, decommission-hardware: decommission cloudcephosd2003-dev.codfw.wmnet - https://phabricator.wikimedia.org/T397979#10951829 (Andrew) [18:23:37] cloud-services-team, decommission-hardware, Patch-For-Review: decommission cloudcephosd200[12]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T397968#10951830 (Andrew) [18:26:38] cloud-services-team, DC-Ops, decommission-hardware, ops-codfw, Patch-For-Review: decommission cloudcephosd200[12]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T397968#10951842 (Andrew) a:Andrew→None Some puppet refs remain, they will soon be removed in a batch when cloudcephosd... [18:26:51] cloud-services-team, DC-Ops, decommission-hardware, ops-codfw, Patch-For-Review: decommission cloudcephosd200[12]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T397968#10951846 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1003 for hosts: `cloudcephosd200... [18:28:18] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE: Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10951857 (Andrew) [x] @Jhancock.wm will connect the second ports for cloudcephosd200[56]-dev [x] @Andrew will move the workload to the new nodes (part...
[19:59:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudvirt2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:06:12] VPS-project-Codesearch: Allow searching for plain text in CodeSearch - https://phabricator.wikimedia.org/T381325#10952070 (SomeRandomDeveloper) I've done a bit of research into this, and it seems that this feature is already present in Hound: https://github.com/hound-search/hound/commit/ca5c7c8c1dc6753b0bbe2... [20:06:23] (PS1) Andrew Bogott: Remove a couple of incorrect comments. The cinder service password is now used. [labs/private] - https://gerrit.wikimedia.org/r/1164299 (https://phabricator.wikimedia.org/T273150) [20:06:25] (PS1) Andrew Bogott: Add stand-in passwords for 'glance' service user. [labs/private] - https://gerrit.wikimedia.org/r/1164300 (https://phabricator.wikimedia.org/T273150) [20:06:27] (PS1) Andrew Bogott: Add dummy ldap passwords for designate service user [labs/private] - https://gerrit.wikimedia.org/r/1164301 (https://phabricator.wikimedia.org/T273150) [20:07:22] (CR) Andrew Bogott: [V:+2 C:+2] Remove a couple of incorrect comments. The cinder service password is now used. [labs/private] - https://gerrit.wikimedia.org/r/1164299 (https://phabricator.wikimedia.org/T273150) (owner: Andrew Bogott) [20:07:33] (CR) Andrew Bogott: [V:+2 C:+2] Add stand-in passwords for 'glance' service user.
[labs/private] - https://gerrit.wikimedia.org/r/1164300 (https://phabricator.wikimedia.org/T273150) (owner: Andrew Bogott) [20:07:46] (CR) Andrew Bogott: [V:+2 C:+2] Add dummy ldap passwords for designate service user [labs/private] - https://gerrit.wikimedia.org/r/1164301 (https://phabricator.wikimedia.org/T273150) (owner: Andrew Bogott) [20:15:37] (open) dhardy: Update about screen and donation link [toolforge-repos/wikirun-game] - https://gitlab.wikimedia.org/toolforge-repos/wikirun-game/-/merge_requests/2 [21:21:22] FIRING: HAProxyBackendUnavailable: HAProxy service glance-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [21:26:22] FIRING: [2x] HAProxyBackendUnavailable: HAProxy service glance-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [21:31:22] RESOLVED: [2x] HAProxyBackendUnavailable: HAProxy service glance-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [22:23:04] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-46 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [23:06:24] cloud-services-team, Cloud-VPS: [tofu-cloudvps] Manage project puppet classes and hiera - https://phabricator.wikimedia.org/T397994 (bd808) NEW [23:41:44] cloud-services-team,
Cloud-VPS: [tofu-cloudvps] Manage project puppet classes and hiera - https://phabricator.wikimedia.org/T397994#10952525 (bd808) My first use case for this was wanting to set a global `puppetmaster` hiera setting for the zuul project. [23:48:04] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-46 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses