[00:15:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1056.eqiad.wmnet'
[00:34:59] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1056.eqiad.wmnet'
[02:00:15] cloud-services-team, Cloud-VPS, Patch-For-Review: MTU setting in IPv6 VMs causes issues with Docker - https://phabricator.wikimedia.org/T408543#11381944 (xcollazo) >>! In T408543#11381252, @bd808 wrote: > @xcollazo rediscovered this problem in {T408019}. See T408019#11380117 Ah, I knew I could not b...
[02:38:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:05:39] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1057.eqiad.wmnet'
[04:18:41] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:24:20] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1057.eqiad.wmnet'
[04:25:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1058.eqiad.wmnet'
[04:41:21] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1058.eqiad.wmnet'
[05:13:04] cloud-services-team, Toolforge: Toolforge Tool Unexpectedly Returning 500s - https://phabricator.wikimedia.org/T410352 (derenrich) NEW
[05:14:49] cloud-services-team, Toolforge, Tools: "best-of" tool Unexpectedly Returning 500s - https://phabricator.wikimedia.org/T410352#11382092 (bd808)
[05:15:40] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1059.eqiad.wmnet'
[05:26:21] cloud-services-team, Toolforge, Tools: "best-of" tool Unexpectedly Returning 500s - https://phabricator.wikimedia.org/T410352#11382098 (bd808) I captured a 500 response at the haproxy outer edge of `best-of.toolforge.org`: `lang=syslog 2025-11-18T00:43:37.753944+00:00 tools-k8s-haproxy-8 haproxy[766]...
[05:31:13] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1059.eqiad.wmnet'
[05:55:23] cloud-services-team, Toolforge, Tools: "best-of" tool Unexpectedly Returning 500s - https://phabricator.wikimedia.org/T410352#11382130 (derenrich) i tried swapping from polka to express as my server and that had no effect. really unsure how this could be on my side but it seems unlikely it isn't give...
[08:18:22] cloud-services-team, Toolforge, Tools: "best-of" tool Unexpectedly Returning 500s - https://phabricator.wikimedia.org/T410352#11382370 (dcaro) I'm doing a quick check, putting a while loop requesting that url: ` In [6]: while response.status_code == 200: ...: response = requests.get("https://b...
[08:57:22] cloud-services-team, Toolforge, Tools: "best-of" tool Unexpectedly Returning 500s - https://phabricator.wikimedia.org/T410352#11382534 (dcaro) That did not take long: ` In [10]: print(response.text) Wikimedia Error
cloud-services-team, Toolforge, Tools: "best-of" tool Unexpectedly Returning 500s - https://phabricator.wikimedia.org/T410352#11382651 (dcaro) We were also having some crawlers hitting the haproxy: {F70272541} But they stopped, and the errors are still there, so nice to see the throttling there work...
[09:21:45] (open) volans: Use final namespace name for the tracing loki [repos/cloud/toolforge/ingress-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-admission/-/merge_requests/32 (https://phabricator.wikimedia.org/T399313)
[09:24:58] cloud-services-team, Toolforge, Tools: "best-of" tool Unexpectedly Returning 500s - https://phabricator.wikimedia.org/T410352#11382677 (taavi) The `PH--` [[ https://wikitech.wikimedia.org/wiki/HAProxy/session_states | session state ]] translates to: ` PH The proxy blocked the server's respons...
[09:33:32] (open) volans: kind: fix port 30004 comment with final name [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/298
[09:36:32] (update) volans: tracing: add tracing loki instance [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1040 (https://phabricator.wikimedia.org/T399313)
[10:03:22] Tool-ghanasupremecases1: About Us - https://phabricator.wikimedia.org/T409749#11382878 (Sunkanmi12) p:Triage→High a:FatawuY1
[10:04:00] Tool-ghanasupremecases1: About Us - https://phabricator.wikimedia.org/T409749#11382881 (Sunkanmi12) a:FatawuY1→None
[10:04:16] Tool-ghanasupremecases1: About Us - https://phabricator.wikimedia.org/T409749#11382883 (Sunkanmi12) p:High→Triage
[10:07:41] Tool-ghanasupremecases1: About Us - https://phabricator.wikimedia.org/T409749#11382903 (Sunkanmi12) p:Triage→Low
[10:11:36] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.instance.stop_start vm tools-legaci-redirector-3 (cluster eqiad1)
[10:11:38] !log taavi@cloudcumin1001 tools END (ERROR) - Cookbook wmcs.vps.instance.stop_start (exit_code=97) vm tools-legaci-redirector-3 (cluster eqiad1)
[10:11:41] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.instance.stop_start vm tools-legacy-redirector-3 (cluster eqiad1)
[10:12:19] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm tools-legacy-redirector-3 (cluster eqiad1)
[10:13:47] (update) volans: shared: add loki-tracing S3 buckets [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/92 (https://phabricator.wikimedia.org/T399313)
[10:16:37] cloud-services-team, SRE, Patch-For-Review: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11382952 (fgiunchedi) I'm giving debugging this issue one more go, as part of this we now have `pause-reboot.cfg` included for `cloudcontrol2010...
[10:17:57] (update) volans: shared: add loki-tracing S3 buckets [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/92 (https://phabricator.wikimedia.org/T399313)
[10:19:20] cloud-services-team, Toolforge, Tools: "best-of" tool Unexpectedly Returning 500s - https://phabricator.wikimedia.org/T410352#11382957 (dcaro) Aren't the `PH` part of the other set of flags? ` - On the first character, a code reporting the first event which caused the session to terminate : .....
[10:19:45] (update) volans: tracing: add tracing loki instance [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1040 (https://phabricator.wikimedia.org/T399313)
[10:24:27] cloud-services-team, Toolforge, Tools: "best-of" tool Unexpectedly Returning 500s - https://phabricator.wikimedia.org/T410352#11382969 (dcaro) Looks the same to me: ` 2025-11-18T10:22:37.611073+00:00 tools-k8s-haproxy-8 haproxy[766]: 213.55.247.35:24591 [18/Nov/2025:10:22:37.577] k8s-ingress-https~ k...
[10:28:14] cloud-services-team, Toolforge, Tools: "best-of" tool Unexpectedly Returning 500s - https://phabricator.wikimedia.org/T410352#11382980 (taavi) https://github.com/haproxy/haproxy/issues/1597 suggests increasing `tune.maxrewrite`. It also references a metric that's not yet in Prometheus (due to T343885...
[10:33:44] (approved) taavi: Use final namespace name for the tracing loki [repos/cloud/toolforge/ingress-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-admission/-/merge_requests/32 (https://phabricator.wikimedia.org/T399313) (owner: volans)
[10:41:14] cloud-services-team, Toolforge, Tools, Patch-For-Review: "best-of" tool Unexpectedly Returning 500s - https://phabricator.wikimedia.org/T410352#11383020 (taavi) Open→Resolved a:taavi I'm being optimistic and resolving, as we no longer seem to be serving any `PH--` 5xx responses wh...
[10:44:14] (merge) volans: Use final namespace name for the tracing loki [repos/cloud/toolforge/ingress-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-admission/-/merge_requests/32 (https://phabricator.wikimedia.org/T399313)
[10:46:57] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: ingress-admission: bump to 0.0.72-20251118104433-d892c480 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1082 (https://phabricator.wikimedia.org/T399313)
[10:47:36] (approved) dcaro: kind: fix port 30004 comment with final name [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/298 (owner: volans)
[10:48:35] (merge) volans: kind: fix port 30004 comment with final name [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/298
[10:49:04] Tool-paulina, Documentation: Create Comprehensive Documentation - https://phabricator.wikimedia.org/T410381 (System625) NEW
[10:52:06] !log volans@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component ingress-admission
[10:56:54] (open) system625: Add comprehensive documentation for contributors and developers [toolforge-repos/paulina] - https://gitlab.wikimedia.org/toolforge-repos/paulina/-/merge_requests/173 (https://phabricator.wikimedia.org/T410381)
[10:59:57] !log volans@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component ingress-admission
[11:02:38] cloud-services-team, Toolforge: Ensure ingress pods get scheduled on ingress nodes - https://phabricator.wikimedia.org/T410382 (taavi) NEW
[11:03:57] cloud-services-team, Toolforge: Ensure ingress pods get scheduled on ingress nodes - https://phabricator.wikimedia.org/T410382#11383113 (taavi) I have a feeling this is more of a side effect of ingress-nginx upgrades (where you momentarily have more than 3 ingress pods) and a lack of a descheduler doing...
[11:14:54] (open) taavi: kubernetes: Alert on misplaced ingress-nginx pods [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/49 (https://phabricator.wikimedia.org/T410382)
[11:14:57] (update) taavi: kubernetes: Alert on misplaced ingress-nginx pods [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/49 (https://phabricator.wikimedia.org/T410382)
[11:15:14] (update) taavi: kubernetes: Alert on misplaced ingress-nginx pods [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/49 (https://phabricator.wikimedia.org/T410382)
[11:18:45] cloud-services-team, Infrastructure-Foundations, SRE, vm-requests, Patch-For-Review: Site: codfw 1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev - https://phabricator.wikimedia.org/T410294#11383162 (MoritzMuehlenhoff) Specs look good
[11:19:06] !log volans@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component ingress-admission
[11:21:18] (open) dcaro: toolforge_get_versioins: use the env-specific values [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1083
[11:21:30] cloud-services-team, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Site: codfw 1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev - https://phabricator.wikimedia.org/T410294#11383168 (taavi)
[11:22:22] (PS1) Tiziano Fogli: metamonitoring/icinga/ext-mon: add dummy smtp auth info [labs/private] - https://gerrit.wikimedia.org/r/1206846 (https://phabricator.wikimedia.org/T393625)
[11:23:04] (CR) Tiziano Fogli: [C:+2] metamonitoring/icinga/ext-mon: add dummy smtp auth info [labs/private] - https://gerrit.wikimedia.org/r/1206846 (https://phabricator.wikimedia.org/T393625) (owner: Tiziano Fogli)
[11:23:07] (CR) Tiziano Fogli: [V:+2 C:+2] metamonitoring/icinga/ext-mon: add dummy smtp auth info [labs/private] - https://gerrit.wikimedia.org/r/1206846 (https://phabricator.wikimedia.org/T393625) (owner: Tiziano Fogli)
[11:23:25] (approved) volans: toolforge_get_versioins: use the env-specific values [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1083 (owner: dcaro)
[11:24:15] !log volans@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component ingress-admission
[11:25:02] (update) fnegri: images: resolve the image every time [repos/cloud/toolforge/jobs-api] (use_cache_for_harbor) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/251 (owner: dcaro)
[11:56:14] !log volans@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component ingress-admission
[12:05:19] !log volans@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component ingress-admission
[12:15:29] (approved) dcaro: ingress-admission: bump to 0.0.72-20251118104433-d892c480 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1082 (https://phabricator.wikimedia.org/T399313) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[12:19:11] (merge) volans: ingress-admission: bump to 0.0.72-20251118104433-d892c480 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1082 (https://phabricator.wikimedia.org/T399313) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[12:22:27] (update) dcaro: toolforge_get_versioins: use the env-specific values [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1083
[12:22:50] (merge) dcaro: toolforge_get_versioins: use the env-specific values [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1083
[12:26:15] (update) volans: tracing: add tracing loki instance [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1040 (https://phabricator.wikimedia.org/T399313)
[12:40:57] FIRING: ProbeDown: Service tools-k8s-haproxy-8:443 has failed probes (http_admin_toolforge_org_ip6) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:45:57] RESOLVED: [6x] ProbeDown: Service tools-k8s-haproxy-8:443 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:36:48] FIRING: PuppetFailure: Puppet has failed on cloudrabbit2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:48:13] Toolforge, tools-infrastructure-team, Infrastructure-Foundations, netops: Plan networking for Toolforge-on-Metal experiment - https://phabricator.wikimedia.org/T407140#11383672 (cmooney) Ok thanks @fgiunchedi for the info. I think that seems doable. As per the sub-task about a VRF I think that...
[14:21:55] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.instance.stop_start vm tools-prometheus-8 (cluster eqiad1)
[14:22:33] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm tools-prometheus-8 (cluster eqiad1)
[14:23:18] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-80
[14:23:43] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-nfs-80
[14:23:50] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.instance.stop_start vm tools-k8s-worker-nfs-80 (cluster eqiad1)
[14:24:28] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm tools-k8s-worker-nfs-80 (cluster eqiad1)
[14:25:03] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-81
[14:25:40] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-nfs-81
[14:26:36] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.instance.stop_start vm tools-k8s-worker-nfs-81 (cluster eqiad1)
[14:27:14] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm tools-k8s-worker-nfs-81 (cluster eqiad1)
[14:27:22] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-82
[14:28:00] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-nfs-82
[14:28:05] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.instance.stop_start vm tools-k8s-worker-nfs-82 (cluster eqiad1)
[14:28:43] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm tools-k8s-worker-nfs-82 (cluster eqiad1)
[14:28:59] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-112
[14:29:33] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-112
[14:30:36] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.instance.stop_start vm tools-k8s-worker-112 (cluster eqiad1)
[14:31:05] (approved) fnegri: images: cache images retrieved from harbor [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/250 (owner: dcaro)
[14:31:14] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm tools-k8s-worker-112 (cluster eqiad1)
[14:31:30] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-113
[14:31:56] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-113
[14:32:15] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.instance.stop_start vm tools-k8s-worker-113 (cluster eqiad1)
[14:32:52] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm tools-k8s-worker-113 (cluster eqiad1)
[14:33:18] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.instance.stop_start vm tools-prometheus-9 (cluster eqiad1)
[14:33:56] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm tools-prometheus-9 (cluster eqiad1)
[14:35:35] !log taavi@cloudcumin1001 tofu START - Cookbook wmcs.vps.instance.stop_start vm tf-registry-3 (cluster eqiad1)
[14:36:12] !log taavi@cloudcumin1001 tofu END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm tf-registry-3 (cluster eqiad1)
[14:36:57] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.instance.stop_start vm tools-harbor-2 (cluster eqiad1)
[14:38:17] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm tools-harbor-2 (cluster eqiad1)
[14:39:57] !log taavi@cloudcumin1001 metricsinfra START - Cookbook wmcs.vps.instance.stop_start vm metricsinfra-grafana-2 (cluster eqiad1)
[14:40:26] !log taavi@cloudcumin1001 metricsinfra END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm metricsinfra-grafana-2 (cluster eqiad1)
[14:41:11] !log taavi@cloudcumin1001 metricsinfra START - Cookbook wmcs.vps.instance.stop_start vm metricsinfra-thanos-fe-2 (cluster eqiad1)
[14:41:39] !log taavi@cloudcumin1001 metricsinfra END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm metricsinfra-thanos-fe-2 (cluster eqiad1)
[14:43:55] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.vps.instance.stop_start vm canary1068-1 (cluster eqiad1)
[14:44:18] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1060.eqiad.wmnet'
[14:44:21] !log taavi@cloudcumin1001 account-creation-assistence START - Cookbook wmcs.vps.remove_instance for instance abogott-test-instance
[14:44:22] taavi@cloudcumin1001: Unknown project "account-creation-assistence"
[14:44:23] !log taavi@cloudcumin1001 account-creation-assistence END (FAIL) - Cookbook wmcs.vps.remove_instance (exit_code=99) for instance abogott-test-instance
[14:44:23] taavi@cloudcumin1001: Unknown project "account-creation-assistence"
[14:44:23] !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm canary1068-1 (cluster eqiad1)
[14:44:25] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.vps.instance.stop_start vm canary1069-1 (cluster eqiad1)
[14:44:33] !log taavi@cloudcumin1001 account-creation-assistance START - Cookbook wmcs.vps.remove_instance for instance abogott-test-instance
[14:44:50] !log taavi@cloudcumin1001 account-creation-assistance END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance abogott-test-instance
[14:44:54] !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm canary1069-1 (cluster eqiad1)
[14:44:55] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.vps.instance.stop_start vm canary1070-1 (cluster eqiad1)
[14:45:04] !log taavi@cloudcumin1001 wikicommunityhealth START - Cookbook wmcs.vps.remove_instance for instance abogott-testvm
[14:45:19] !log taavi@cloudcumin1001 wikicommunityhealth END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance abogott-testvm
[14:45:23] !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm canary1070-1 (cluster eqiad1)
[14:45:24] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.vps.instance.stop_start vm canary1071-1 (cluster eqiad1)
[14:45:53] !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm canary1071-1 (cluster eqiad1)
[14:45:54] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.vps.instance.stop_start vm canary1072-1 (cluster eqiad1)
[14:45:59] !log taavi@cloudcumin1001 cloudinfra START - Cookbook wmcs.vps.instance.stop_start vm chartmuseum-2 (cluster eqiad1)
[14:46:27] !log taavi@cloudcumin1001 cloudinfra END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm chartmuseum-2 (cluster eqiad1)
[14:46:32] !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm canary1072-1 (cluster eqiad1)
[14:46:34] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.vps.instance.stop_start vm canary1073-1 (cluster eqiad1)
[14:46:34] !log taavi@cloudcumin1001 cloudinfra START - Cookbook wmcs.vps.instance.stop_start vm docker-registry-01 (cluster eqiad1)
[14:47:03] !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm canary1073-1 (cluster eqiad1)
[14:47:04] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.vps.instance.stop_start vm canary1074-1 (cluster eqiad1)
[14:47:11] !log taavi@cloudcumin1001 cloudinfra END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm docker-registry-01 (cluster eqiad1)
[14:47:22] !log taavi@cloudcumin1001 cloudinfra START - Cookbook wmcs.vps.instance.stop_start vm enc-3 (cluster eqiad1)
[14:47:34] !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm canary1074-1 (cluster eqiad1)
[14:47:35] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.vps.instance.stop_start vm canary1075-1 (cluster eqiad1)
[14:47:59] !log taavi@cloudcumin1001 cloudinfra END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm enc-3 (cluster eqiad1)
[14:48:11] !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm canary1075-1 (cluster eqiad1)
[14:48:13] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.vps.instance.stop_start vm canary1076-1 (cluster eqiad1)
[14:48:18] !log taavi@cloudcumin1001 cloudinfra START - Cookbook wmcs.vps.instance.stop_start vm enc-4 (cluster eqiad1)
[14:48:50] !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm canary1076-1 (cluster eqiad1)
[14:48:55] !log taavi@cloudcumin1001 cloudinfra END (PASS) - Cookbook wmcs.vps.instance.stop_start (exit_code=0) vm enc-4 (cluster eqiad1)
[15:06:17] FIRING: JobUnavailable: Reduced availability for job rabbitmq in cloud@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:11:17] RESOLVED: [2x] JobUnavailable: Reduced availability for job openstack in cloud@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:12:57] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment codfw1dev for all services
[15:13:00] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.restart_openstack (exit_code=99) on deployment codfw1dev for all services
[15:21:48] RESOLVED: PuppetFailure: Puppet has failed on cloudrabbit2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:26:28] cloud-services-team, SRE: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11384055 (fgiunchedi) As far as I understand the problem, lvm metadata size and alignment can be related to the underlying block device reported data, specifically th...
[15:29:10] (approved) taavi: shared: add loki-tracing S3 buckets [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/92 (https://phabricator.wikimedia.org/T399313) (owner: volans)
[15:32:46] FIRING: [3x] ProbeDown: Service tools-k8s-haproxy-8:443 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:33:45] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1060.eqiad.wmnet'
[15:35:10] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1061.eqiad.wmnet'
[15:37:46] FIRING: [6x] ProbeDown: Service tools-k8s-haproxy-8:443 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:40:39] cloud-services-team, SRE: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11384157 (fgiunchedi) And `lsblk -t` for comparison: ` root@cloudcontrol2010-dev:/# lsblk -t NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED...
[15:43:29] cloud-services-team, Cloud-VPS: Fix pacct rotation properly everywhere - https://phabricator.wikimedia.org/T410410 (taavi) NEW
[15:47:46] FIRING: [6x] ProbeDown: Service tools-k8s-haproxy-8:443 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:50:22] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1061.eqiad.wmnet'
[15:52:46] FIRING: [6x] ProbeDown: Service tools-k8s-haproxy-8:443 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:54:10] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1061.eqiad.wmnet'
[15:55:33] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1061.eqiad.wmnet'
[15:56:05] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1062.eqiad.wmnet'
[16:07:46] RESOLVED: [6x] ProbeDown: Service tools-k8s-haproxy-8:443 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:10:43] (update) miiswom: Proof of concept: add work form, with autocomplete [toolforge-repos/paulina] - https://gitlab.wikimedia.org/toolforge-repos/paulina/-/merge_requests/157
[16:11:13] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1062.eqiad.wmnet'
[16:22:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1063.eqiad.wmnet'
[16:23:22] cloud-services-team, Toolforge: toolforge: Investigate ingress-nginx replacements - https://phabricator.wikimedia.org/T392356#11384462 (bd808) https://www.kubernetes.dev/blog/2025/11/12/ingress-nginx-retirement/ > To prioritize the safety and security of the ecosystem, Kubernetes SIG Network and the Sec...
[16:35:03] (merge) volans: shared: add loki-tracing S3 buckets [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/92 (https://phabricator.wikimedia.org/T399313)
[16:36:26] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1063.eqiad.wmnet'
[16:37:48] cloud-services-team, Toolforge: Add paging alert if Toolforge HAProxy connection limit is reached - https://phabricator.wikimedia.org/T410421 (taavi) NEW
[16:41:56] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1064.eqiad.wmnet'
[16:42:43] (open) bd808: Draw the rest of the owl [toolforge-repos/containers-redirect] - https://gitlab.wikimedia.org/toolforge-repos/containers-redirect/-/merge_requests/1 (https://phabricator.wikimedia.org/T409474)
[16:44:42] (merge) bd808: Draw the rest of the owl [toolforge-repos/containers-redirect] - https://gitlab.wikimedia.org/toolforge-repos/containers-redirect/-/merge_requests/1 (https://phabricator.wikimedia.org/T409474)
[16:49:49] (update) dcaro: images: cache images retrieved from harbor [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/250
[16:50:01] cloud-services-team, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Site: codfw 1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev - https://phabricator.wikimedia.org/T410294#11384666 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cum...
[16:51:15] (update) miiswom: Proof of concept for adding new statements [toolforge-repos/paulina] - https://gitlab.wikimedia.org/toolforge-repos/paulina/-/merge_requests/153
[16:51:40] (update) miiswom: Proof of concept for adding new statements [toolforge-repos/paulina] - https://gitlab.wikimedia.org/toolforge-repos/paulina/-/merge_requests/153
[16:52:00] (update) miiswom: Proof of concept: add work form, with autocomplete [toolforge-repos/paulina] - https://gitlab.wikimedia.org/toolforge-repos/paulina/-/merge_requests/157
[16:52:12] (merge) dcaro: images: cache images retrieved from harbor [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/250
[16:52:15] (update) dcaro: images: resolve the image every time [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/251
[16:52:19] (update) miiswom: Improve grid view in smaller screen [toolforge-repos/paulina] - https://gitlab.wikimedia.org/toolforge-repos/paulina/-/merge_requests/164
[16:55:14] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: jobs-api: bump to 0.0.455-20251118165226-b67314b3 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1084
[16:55:21] (update) miiswom: Proof of concept: Add author form [toolforge-repos/paulina] - https://gitlab.wikimedia.org/toolforge-repos/paulina/-/merge_requests/131
[16:57:25] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[17:04:14] (update) miiswom: Proof of concept: Add author form [toolforge-repos/paulina] - https://gitlab.wikimedia.org/toolforge-repos/paulina/-/merge_requests/131
[17:09:32] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1064.eqiad.wmnet'
[17:10:18] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1065.eqiad.wmnet'
[17:20:57] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[17:25:44] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1065.eqiad.wmnet'
[17:25:51] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1044.eqiad.wmnet'
[17:32:09] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[17:36:35] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1044.eqiad.wmnet'
[17:38:57] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1044.eqiad.wmnet'
[17:40:55] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1044.eqiad.wmnet'
[17:41:31] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1044.eqiad.wmnet'
[17:43:00] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1044.eqiad.wmnet'
[17:54:03] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[17:56:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1066.eqiad.wmnet'
[18:07:01] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[18:08:50] (approved) dcaro: jobs-api: bump to 0.0.455-20251118165226-b67314b3 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1084 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[18:08:52] (merge) dcaro: jobs-api: bump to 0.0.455-20251118165226-b67314b3 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1084 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[18:09:42] (update) dcaro: images: resolve the image every time [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/251
[18:13:57] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1066.eqiad.wmnet'
[18:27:13] Toolforge (Toolforge iteration 25): [jobs-api] Create storage layer, and save business models in persistent storage - https://phabricator.wikimedia.org/T359650#11385241 (dcaro)
[18:30:23] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1067.eqiad.wmnet'
[18:38:50] Toolforge, tools-infrastructure-team: Reduce tool breakage over new ingress-nginx annotation validation rules - https://phabricator.wikimedia.org/T409474#11385341 (bd808) [x] tool-keystone-browser/keystone-browser `lang=shell-session bd808@tools-bastion-14.tools.eqiad1:~$ become keystone-browser tools.ke...
[18:45:15] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1067.eqiad.wmnet'
[18:45:31] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1068.eqiad.wmnet'
[18:58:37] (open) bd808: Add service.template to repo and README [toolforge-repos/containers-redirect] - https://gitlab.wikimedia.org/toolforge-repos/containers-redirect/-/merge_requests/2
[18:59:28] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1068.eqiad.wmnet'
[19:01:01] (update) bd808: Add service.template to repo and README [toolforge-repos/containers-redirect] - https://gitlab.wikimedia.org/toolforge-repos/containers-redirect/-/merge_requests/2
[19:07:59] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1066']
[19:08:30] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1066']
[19:09:58] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1069.eqiad.wmnet'
[19:22:46] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1069.eqiad.wmnet'
[19:25:25] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1070.eqiad.wmnet'
[19:28:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1068 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[19:37:55] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1070.eqiad.wmnet'
[19:39:05] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1070.eqiad.wmnet'
[19:40:18] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1070.eqiad.wmnet'
[19:44:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1071.eqiad.wmnet'
[19:51:19] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1071.eqiad.wmnet'
[19:53:21] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1072.eqiad.wmnet'
[20:04:19] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1068 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[20:13:38] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1072.eqiad.wmnet'
[20:53:21] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1073.eqiad.wmnet'
[21:02:56] (open) lucaswerkmeister: Remove inappropriate visually-hidden class from skiplink [toolforge-repos/tech-doc-metrics] - https://gitlab.wikimedia.org/toolforge-repos/tech-doc-metrics/-/merge_requests/6
[21:03:19] FIRING: [3x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: toolsbeta-test-k8s-ingress-9.toolsbeta.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[21:08:19] RESOLVED: [6x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: toolsbeta-test-k8s-ingress-9.toolsbeta.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[21:10:54] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1073.eqiad.wmnet'
[21:11:07] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1073.eqiad.wmnet'
[21:12:57] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1073.eqiad.wmnet'
[21:13:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1073.eqiad.wmnet'
[21:14:40] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1073.eqiad.wmnet'
[21:14:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1074.eqiad.wmnet'
[21:24:10] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1074.eqiad.wmnet'
[21:39:32] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1075.eqiad.wmnet'
[21:43:31] cloud-services-team, Toolforge: HAProxy frontend session limit hit (repeat outage) - https://phabricator.wikimedia.org/T410463 (DamianZaremba) NEW
[21:49:59] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1075.eqiad.wmnet'
[21:50:19] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1076.eqiad.wmnet'
[21:52:57] (open) bd808: README: Add missing link for [Build Service] [toolforge-repos/containers-redirect] - https://gitlab.wikimedia.org/toolforge-repos/containers-redirect/-/merge_requests/3
[21:53:14] (merge) bd808: Add service.template to repo and README [toolforge-repos/containers-redirect] - https://gitlab.wikimedia.org/toolforge-repos/containers-redirect/-/merge_requests/2
[21:53:26] (update) bd808: README: Add missing link for [Build Service] [toolforge-repos/containers-redirect] - https://gitlab.wikimedia.org/toolforge-repos/containers-redirect/-/merge_requests/3
[21:54:44] (merge) bd808: README: Add missing link for [Build Service] [toolforge-repos/containers-redirect] - https://gitlab.wikimedia.org/toolforge-repos/containers-redirect/-/merge_requests/3
[21:54:56] Tool-global-search: Global search doesn't remember that I've logged in - https://phabricator.wikimedia.org/T410464 (jhsoby) NEW
[21:56:13] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1073']
[21:56:40] !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99) on eqiad1, with recreate False, for hosts list: ['cloudvirt1073']
[21:57:20] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1073']
[21:57:42] Tools, PES1.3.3 WP25 Easter Eggs: wikipedia25-years-of-wikipedia tool loads and uses non-free JavaScript - https://phabricator.wikimedia.org/T410465#11386304 (taavi)
[21:57:42] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1073']
[22:03:30] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1076.eqiad.wmnet'
[22:03:55] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1076.eqiad.wmnet'
[22:06:42] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1076.eqiad.wmnet'
[22:10:24] (approved) wikigit: Add Terms of Service URL Rulesets and Corresponding Specs [toolforge-repos/wmf-openapi-linter] - https://gitlab.wikimedia.org/toolforge-repos/wmf-openapi-linter/-/merge_requests/2 (owner: jaredblumer)
[22:10:29] (merge) wikigit: Add Terms of Service URL Rulesets and Corresponding Specs [toolforge-repos/wmf-openapi-linter] - https://gitlab.wikimedia.org/toolforge-repos/wmf-openapi-linter/-/merge_requests/2 (owner: jaredblumer)
[22:31:03] Toolforge, tools-infrastructure-team: Reduce tool breakage over new ingress-nginx annotation validation rules - https://phabricator.wikimedia.org/T409474#11386420 (bd808) >>! In T409474#11353177, @taavi wrote: > * tool-apt/apt-domain: It looks like @taavi setup tool-containers/redirect for this one. > *...
[22:50:21] Toolforge, tools-infrastructure-team: Reduce tool breakage over new ingress-nginx annotation validation rules - https://phabricator.wikimedia.org/T409474#11386484 (bd808) The remaining jenkins-build-stats, toolforge-standards-committee, toolinfo-scraper, and xn--9s9h tools all need to be able to manipula...
[23:26:22] PROBLEM - Host cloudvirt1071 is DOWN: PING CRITICAL - Packet loss = 100%
[23:27:19] FIRING: [6x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: toolsbeta-test-k8s-ingress-9.toolsbeta.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[23:27:46] FIRING: [3x] ProbeDown: Service tools-k8s-haproxy-8:443 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[23:30:28] FIRING: [2x] InstanceDown: Project tools instance tools-k8s-worker-nfs-34 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[23:30:30] FIRING: InstanceDown: Project toolsbeta instance toolsbeta-test-k8s-ingress-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[23:30:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1071 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[23:30:50] RECOVERY - Host cloudvirt1071 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[23:31:10] FIRING: CloudVirtDown: Cloudvirt node cloudvirt1071 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CloudVirtDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1071 - https://alerts.wikimedia.org/?q=alertname%3DCloudVirtDown
[23:35:28] RESOLVED: [4x] InstanceDown: Project tools instance tools-k8s-haproxy-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[23:35:30] RESOLVED: InstanceDown: Project toolsbeta instance toolsbeta-test-k8s-ingress-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[23:35:49] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1071 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[23:36:09] RESOLVED: CloudVirtDown: Cloudvirt node cloudvirt1071 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CloudVirtDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1071 - https://alerts.wikimedia.org/?q=alertname%3DCloudVirtDown
[23:37:19] RESOLVED: [6x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: toolsbeta-test-k8s-ingress-9.toolsbeta.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[23:37:46] RESOLVED: [3x] ProbeDown: Service tools-k8s-haproxy-8:443 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[23:44:49] FIRING: ObjectStorageObjectQuotaFull: Object storage quota by 'objects' is 89.2% full for project tools - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/ObjectStorageObjectQuotaFull - https://grafana.wikimedia.org/d/7120b794-4638-49f5-bccd-9716efc60f24/wmcs-object-storage-quotas - https://alerts.wikimedia.org/?q=alertname%3DObjectStorageObjectQuotaFull
[23:49:00] cloud-services-team, Cloud-VPS: cloudvirt1071 crash - https://phabricator.wikimedia.org/T410470 (Andrew) NEW
[23:56:49] (open) bd808: Allow target path manipulation [toolforge-repos/containers-redirect] - https://gitlab.wikimedia.org/toolforge-repos/containers-redirect/-/merge_requests/4
[23:58:24] (update) bd808: Allow target path manipulation [toolforge-repos/containers-redirect] - https://gitlab.wikimedia.org/toolforge-repos/containers-redirect/-/merge_requests/4
[23:59:48] (update) bd808: Allow target path manipulation [toolforge-repos/containers-redirect] - https://gitlab.wikimedia.org/toolforge-repos/containers-redirect/-/merge_requests/4