[08:08:21] morning
[08:09:58] hello!
[08:10:53] there was a small window yesterday when toolforge stopped working, at ~20:00. it seems it was not the same as last time: there were no denials on the frontend side, but there were errors from the backend
[08:11:27] I see there were a lot of cloudvirt alerts overnight, but all seem very short blips
[08:12:01] on the ingress-nginx side all I can see is a peak of traffic followed by a valley, then normal again
[08:12:04] https://usercontent.irccloud-cdn.com/file/U5xFnVwR/image.png
[08:12:15] those are after?
[08:12:18] wait
[08:12:34] 20:00 utc is 22:00 my time, so might be related yep
[08:13:46] the cloudvirt alerts seem network related
[08:13:49] I think andrewbogot.t was rebooting cloudvirts: !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1063.eqiad.wmnet}'
[08:13:59] ah ok that explains it maybe
[08:16:02] there seems to be some delay between the neutron alerts triggering and the safe reboot finishing, so they are not silenced correctly
[12:36:54] I was rebooting cloudvirts and as per tradition Icinga got freaked out. As far as I know there was no actual service interruption, the alerts should've been only for empty hosts
[12:37:05] but I'm sorry about the noise, I never remember that I have to downtime icinga by hand
[12:55:34] yep, I'd also be surprised if it caused the toolforge errors that dcaro noticed
[12:58:28] to be fair, I'm not sure it did, it just kinda matches the timing, but https://www.tylervigen.com/spurious-correlations
[12:59:09] yeah maybe moving VMs caused some instability?
[12:59:30] it really shouldn't, unless there's a bug in our process somewhere
[12:59:55] A thing I test pretty often is ssh'ing into a VM and then migrating it and confirming that my ssh session isn't interrupted
[13:01:29] yeah, VM migration seems to be pretty reliable
[13:01:52] maybe ingress failed for a bit to elect a primary or something?
[13:02:03] obvious question: we migrate any VMs related to the toolforge web servicing path during that time?
[13:02:11] s/we migrate/did we migrate/?
[13:02:13] I wanted to check that this morning
[13:02:18] but got distracted xd
[13:03:09] for a given VM you can check the 'action log' in horizon, it will show a 'live-migration' action and a timestamp
[13:05:12] hrm, `openstack server migration list` doesn't support per-project filtering with the nova api version we use, nor does it show project or instance names (only IDs for instances) :/
[13:05:48] taavi: when you say 'web servicing path', which VMs does that include?
[13:05:58] what's the quickest way to check what role is applied to a random VM in cloud VPS?
[13:06:13] volans: https://openstack-browser.toolforge.org/
[13:06:19] ingress-9 was migrated
[13:06:21] andrewbogott: tools-proxy-N, tools-k8s-haproxy-N, tools-k8s-ingress-N
[13:06:48] haproxy-5 too
[13:07:25] volans: if you want a commandline way to check, this repo has the puppet config of all VMs, projects, and prefixes. It's not 100% obvious how to navigate though. https://gerrit.wikimedia.org/r/admin/repos/cloud/instance-puppet,general
[13:07:30] volans: instance page of what andrew linked, with the caveat that a cloud vps vm can have anywhere from zero to many puppet classes applied, instead of the exactly one role class that a wikiprod machine always has
[13:07:36] proxy-9 and proxy-10 too
[13:07:50] ack, thanks both
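For reference, a minimal sketch of doing the same check from the CLI instead of Horizon's action log: list the nova instance actions for the VMs named above and look for live-migration entries. This is not something anyone ran in the channel; it assumes a python-openstackclient new enough to ship `openstack server event list`, credentials scoped to the tools project, and uses only the VM names mentioned above.

```bash
#!/usr/bin/env bash
# Sketch: find recent live-migrations for the toolforge web-path VMs via the
# nova instance action log (the same data Horizon shows as the "action log").
# Assumes `openstack server event list` is available and credentials are
# already scoped to the tools project.
set -euo pipefail

for vm in tools-proxy-9 tools-proxy-10 tools-k8s-haproxy-5 tools-k8s-ingress-9; do
    echo "== ${vm} =="
    # Each row is one instance action (create, reboot, live-migration, ...)
    # with its start time, so a migration around ~20:00 UTC stands out.
    openstack server event list "${vm}" | grep -i 'live-migration' \
        || echo "no live-migration actions found"
done
```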
[13:08:07] dcaro: wait, are you saying both proxies are on the same hypervisor?
[13:08:14] no, different times
[13:08:17] ah
[13:08:40] haproxy-5 kinda matches the time of the bump (8:13)
[13:09:40] so, let's see... maybe VM migration maintains continuity for the primary nic but does some shenanigans with floating IPs that causes an interruption?
[13:10:42] the haproxies do not use floating ips (at the moment), but have a keepalived vip for failover
[13:11:37] ah, right. Well, nevertheless, maybe there's something bumpy about switching the networking over to the new HV
[13:12:38] ssh is pretty resilient btw. I can turn off my wifi, and turn it on, and my ssh session will still be up and running
[13:12:53] (if I don't wait too long to turn it on, of course)
[13:13:03] yeah, although I've done other tests, like with 'watch' and similar, and never seen a stutter
[13:13:19] Another (unlikely) possibility is that sometimes libvirt will underclock a VM in order to get RAM synchronized between hypervisors. So the host could be /slow/ during migration, which, if it's very busy, might cause it to drop things
[13:13:38] how long of an interruption are we talking about?
[13:31:01] ~4min https://grafana-rw.wmcloud.org/d/toolforge-k8s-haproxy/infra-k8s-haproxy?var-interval=30s&orgId=1&from=2025-09-30T19:57:33.795Z&to=2025-09-30T20:23:14.505Z&timezone=utc&var-host=tools-k8s-haproxy-5&var-backend=$__all&var-frontend=$__all&var-server=$__all&var-code=$__all&refresh=5m
[13:40:40] that seems too long to fit any of my guesses
[13:44:59] it's right around the time of the migration in the action log (8:13), as in it came back up right after the migration finished
[13:47:25] hm
[13:47:30] can we test with toolsbeta?
[13:49:18] nothing in the keepalived logs on either host
[13:54:30] toolsbeta has a more or less identical setup, although obviously with less traffic
[13:55:59] I'm tempted to force-migrate the same VM(s) during working hours to see if we can reproduce it
[13:56:18] https://configcat.com/blog/assets/images/3-when-i-do-ec81a96709bddf192d1e8510bddb1872.jpg
[13:56:32] :)
[14:41:16] dcaro: links added
[14:41:24] thanks!
[14:43:29] I have to log off a bit earlier, see you tomorrow!
[14:48:20] fyi, the toolforge haproxy graphs now allow you to select the toolsbeta cluster
[14:52:25] dcaro: ah, that explains why I did not add the cluster filter in https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/42/diffs#faf2ec56efb4db90ab6e091492b08c38d071aa07_16_16 when creating the alert itself :D
[14:52:40] great! I assume they don't show anything from the migrations?
[14:55:28] not really, no, just that nobody is using toolsbeta xd
[14:58:06] taavi: how does the paging go? we don't want it for toolsbeta, right?
[14:59:03] dcaro: there's a sneaky bit of prometheus config that rewrites everything with `severity: page` as `severity: critical` for toolsbeta, plus some of the alert rules are straight up not deployed there
[14:59:29] ack
[16:54:29] * dcaro off
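For the force-migrate-during-working-hours idea from 13:55, a rough sketch of what a toolsbeta reproduction could look like: live-migrate the haproxy VM while a probe loop hits the service, so a multi-minute gap like the ~20:00 one would show up in the probe log. The VM name and URL below are placeholders rather than values from the channel, and the `--live-migration` flag depends on the openstackclient version (older clients take `--live <host>` instead).

```bash
#!/usr/bin/env bash
# Sketch of a toolsbeta reproduction attempt: live-migrate the haproxy VM while
# probing the service, then count any failed probes. VM and URL are placeholder
# assumptions; adjust to the real toolsbeta haproxy VM and an endpoint behind
# its keepalived VIP.
set -euo pipefail

VM="toolsbeta-test-k8s-haproxy-1"      # assumption: toolsbeta's haproxy VM
URL="https://toolsbeta.wmflabs.org/"   # assumption: endpoint behind the VIP

# Poll once a second in the background, logging status code and latency, so a
# multi-minute outage would be obvious in the log afterwards.
( while true; do
      printf '%s %s\n' "$(date -u +%T)" \
          "$(curl -sk -o /dev/null -w '%{http_code} %{time_total}' --max-time 5 "$URL" || echo FAIL)"
      sleep 1
  done ) > /tmp/haproxy-probe.log &
PROBE=$!

# Trigger the live migration; newer openstackclient versions use
# --live-migration (and support --wait), older ones want --live <target-host>.
openstack server migrate --live-migration --wait "$VM"

kill "$PROBE"
grep -c FAIL /tmp/haproxy-probe.log || true
```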