[08:08:21] morning
[08:09:58] hello!
[08:10:53] there was a small window yesterday when toolforge stopped working, at ~20:00. it seems it was not the same as last time: there were no denials on the frontend side, but there were errors from the backend
[08:11:27] I see there were a lot of cloudvirt alerts overnight, but all seem very short blips
[08:12:01] on the ingress-nginx side all I can see is a peak of traffic followed by a valley, then normal again
[08:12:04] https://usercontent.irccloud-cdn.com/file/U5xFnVwR/image.png
[08:12:15] those are after?
[08:12:18] wait
[08:12:34] 20:00 utc is 22:00 my time, so might be related yep
[08:13:46] the cloudvirt alerts seem network related
[08:13:49] I think andrewbogot.t was rebooting cloudvirts: !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1063.eqiad.wmnet}'
[08:13:59] ah ok that explains it maybe
[08:16:02] there seems to be some delay between the neutron alerts triggering and the safe reboot finishing, so they are not silenced correctly
[12:36:54] I was rebooting cloudvirts and as per tradition Icinga got freaked out. As far as I know there was no actual service interruption, the alerts should've been only for empty hosts
[12:37:05] but I'm sorry about the noise, I never remember that I have to downtime icinga by hand
[12:55:34] yep, I'd also be surprised if it caused the toolforge errors that dcaro noticed
[12:58:28] to be fair, I'm not sure it did, it just kinda matches the timing, but https://www.tylervigen.com/spurious-correlations
[12:59:09] yeah maybe moving VMs caused some instability?
[12:59:30] it really shouldn't, unless there's a bug in our process somewhere
[12:59:55] A thing I test pretty often is ssh'ing into a VM and then migrating it and confirming that my ssh session isn't interrupted
[13:01:29] yeah, VM migration seems to be pretty reliable
[13:01:52] maybe ingress failed for a bit to elect a primary or something?
[13:02:03] obvious question: we migrate any VMs related to the toolforge web servicing path during that time?
[13:02:11] s/we migrate/did we migrate/?
[13:02:13] I wanted to check that this morning
[13:02:18] but got distracted xd
[13:03:09] for a given VM you can check the 'action log' in horizon, it will show a 'live-migration' action and a timestamp
[13:05:12] hrm, `openstack server migration list` doesn't support per-project filtering with the nova api version we use, nor does it show project or instance names (only IDs for instances) :/
[13:05:48] taavi: when you say 'web servicing path', which VMs does that include?
[13:05:58] what's the quickest way to check what role is applied to a random VM in cloud VPS?
[13:06:13] volans: https://openstack-browser.toolforge.org/
[13:06:19] ingress-9 was migrated
[13:06:21] andrewbogott: tools-proxy-N, tools-k8s-haproxy-N, tools-k8s-ingress-N
[13:06:48] haproxy-5 too
[13:07:25] volans: if you want a commandline way to check, this repo has the puppet config of all VMs, projects, and prefixes. It's not 100% obvious how to navigate though. https://gerrit.wikimedia.org/r/admin/repos/cloud/instance-puppet,general
[13:07:30] volans: instance page of what andrew linked, with the caveat that a cloud vps vm can have anywhere from zero to many puppet classes applied, instead of the exactly one role class that a wikiprod machine always has
[13:07:36] proxy-9 and proxy-10 too
[13:07:50] ack, thanks both
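For reference, a minimal sketch of doing the same check from the CLI instead of Horizon's action log: list the nova instance actions for the VMs named above and look for live-migration entries. This is not something anyone ran in the channel; it assumes a python-openstackclient new enough to ship `openstack server event list`, credentials scoped to the tools project, and uses only the VM names mentioned above.

```bash
#!/usr/bin/env bash
# Sketch: find recent live-migrations for the toolforge web-path VMs via the
# nova instance action log (the same data Horizon shows as the "action log").
# Assumes `openstack server event list` is available and credentials are
# already scoped to the tools project.
set -euo pipefail

for vm in tools-proxy-9 tools-proxy-10 tools-k8s-haproxy-5 tools-k8s-ingress-9; do
    echo "== ${vm} =="
    # Each row is one instance action (create, reboot, live-migration, ...)
    # with its start time, so a migration around ~20:00 UTC stands out.
    openstack server event list "${vm}" | grep -i 'live-migration' \
        || echo "no live-migration actions found"
done
```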
[13:08:07] dcaro: wait, are you saying both proxies are on the same hypervisor?
[13:08:14] no, different times
[13:08:17] ah
[13:08:40] haproxy-5 kinda matches the time of the bump (8:13)
[13:09:40] so, let's see... maybe VM migration maintains continuity for the primary nic but does some shenanigans with floating IPs that causes an interruption?
[13:10:42] the haproxies do not use floating ips (at the moment), but have a keepalived vip for failover
[13:11:37] ah, right. Well, nevertheless, maybe there's something bumpy about switching the networking over to the new HV
[13:12:38] ssh is pretty resilient btw. I can turn off my wifi, and turn it on, and my ssh session will still be up and running
[13:12:53] (if I don't wait too long to turn it on, of course)
[13:13:03] yeah, although I've done other tests, like with 'watch' and similar, and never seen a stutter
[13:13:19] Another (unlikely) possibility is that sometimes libvirt will underclock a VM in order to get RAM synchronized between hypervisors. So the host could be /slow/ during migration, which, if it's very busy, might cause it to drop things
[13:13:38] how long of an interruption are we talking about?
[13:31:01] ~4min https://grafana-rw.wmcloud.org/d/toolforge-k8s-haproxy/infra-k8s-haproxy?var-interval=30s&orgId=1&from=2025-09-30T19:57:33.795Z&to=2025-09-30T20:23:14.505Z&timezone=utc&var-host=tools-k8s-haproxy-5&var-backend=$__all&var-frontend=$__all&var-server=$__all&var-code=$__all&refresh=5m
[13:40:40] that seems too long to fit any of my guesses
[13:44:59] it's right around the time of the migration in the action log (8:13), as in it came back up right after the migration finished
[13:47:25] hm
[13:47:30] can we test with toolsbeta?
[13:49:18] nothing in the keepalived logs on either host
[13:54:30] toolsbeta has a more or less identical setup, although obviously with less traffic
[13:55:59] I'm tempted to force-migrate the same VM(s) during working hours to see if we can reproduce it
[13:56:18] https://configcat.com/blog/assets/images/3-when-i-do-ec81a96709bddf192d1e8510bddb1872.jpg
[13:56:32] :)
[14:41:16] dcaro: links added
[14:41:24] thanks!
[14:43:29] I have to log off a bit earlier, see you tomorrow!
[14:48:20] fyi, the toolforge haproxy graphs now allow you to select the toolsbeta cluster
[14:52:25] dcaro: ah, that explains why I did not add the cluster filter in https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/42/diffs#faf2ec56efb4db90ab6e091492b08c38d071aa07_16_16 when creating the alert itself :D
[14:52:40] great! I assume they don't show anything from the migrations?
[14:55:28] not really, no, just that nobody is using toolsbeta xd
[14:58:06] taavi: how does the paging go? we don't want it for toolsbeta, right?
[14:59:03] dcaro: there's a sneaky bit of prometheus config that rewrites everything with `severity: page` as `severity: critical` for toolsbeta, plus some of the alert rules are straight up not deployed there
[14:59:29] ack
[16:54:29] * dcaro off
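For the force-migrate-during-working-hours idea from 13:55, a rough sketch of what a toolsbeta reproduction could look like: live-migrate the haproxy VM while a probe loop hits the service, so a multi-minute gap like the ~20:00 one would show up in the probe log. The VM name and URL below are placeholders rather than values from the channel, and the `--live-migration` flag depends on the openstackclient version (older clients take `--live <host>` instead).

```bash
#!/usr/bin/env bash
# Sketch of a toolsbeta reproduction attempt: live-migrate the haproxy VM while
# probing the service, then count any failed probes. VM and URL are placeholder
# assumptions; adjust to the real toolsbeta haproxy VM and an endpoint behind
# its keepalived VIP.
set -euo pipefail

VM="toolsbeta-test-k8s-haproxy-1"      # assumption: toolsbeta's haproxy VM
URL="https://toolsbeta.wmflabs.org/"   # assumption: endpoint behind the VIP

# Poll once a second in the background, logging status code and latency, so a
# multi-minute outage would be obvious in the log afterwards.
( while true; do
      printf '%s %s\n' "$(date -u +%T)" \
          "$(curl -sk -o /dev/null -w '%{http_code} %{time_total}' --max-time 5 "$URL" || echo FAIL)"
      sleep 1
  done ) > /tmp/haproxy-probe.log &
PROBE=$!

# Trigger the live migration; newer openstackclient versions use
# --live-migration (and support --wait), older ones want --live <target-host>.
openstack server migrate --live-migration --wait "$VM"

kill "$PROBE"
grep -c FAIL /tmp/haproxy-probe.log || true
```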