[12:09:48] new openstack on k8s toy from mirantis
[12:09:50] https://github.com/Mirantis/rockoon
[12:15:12] Every six months a new k8s deployment tool is released and an old one abandoned :/
[12:15:36] that's slightly better than a new one being released and the old ones not being quite abandoned and dragged along
[12:16:29] well they usually aren't /officially/ abandoned
[13:14:28] there is a widespread puppet problem in our servers, I'm working on a fix
[13:15:56] thanks!
[13:20:05] arturo: can I safely reboot tools-k8s-control-X as long as I only do one at a time?
[13:20:15] andrewbogott: yes
[13:20:23] and wait for it to be green after the reboot
[13:20:35] ok
[13:24:09] same for tools-k8s-ingress-* ?
[13:25:02] andrewbogott: yes
[13:25:12] cool
[13:25:24] this one may be slower to be green after the reboot
[13:26:05] (many webservices .... )
[13:28:52] ok. I'm using the cookbook which seems to be handling it sensibly
[13:29:08] this leaves...
[13:29:48] there is a concerning alert
[13:30:01] https://usercontent.irccloud-cdn.com/file/EndDrYRd/image.png
[13:30:12] tools-k8s-haproxy-*, tools-harbor-1, tools-db-*, tools-package-builder-04, tools-proxy-*
[13:30:27] they're definitely back up...
[13:31:13] stashbot, how's it going?
[13:31:13] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[13:32:00] arturo: looks to me like those haproxy alerts are clearing quickly, do you see any actual breakage?
[13:32:13] let me check
[13:32:40] andrewbogott: all seems green to me
[13:32:49] yeah, ok, false alarm then
[13:33:08] I will let you fix puppet and then we can discuss the remaining nodes
[13:33:50] the puppet fix was merged
[13:34:20] oh nice
[13:36:17] * andrewbogott thinks alertmanager will be more useful after it clears some things
[13:37:30] arturo: happen to know what will happen if I reboot proxy-04.project-proxy? Will the other proxy handle the load?
[13:37:45] looks like the setup uses a VIP which is new since I last looked at it
[13:37:53] andrewbogott: let me check. I don't recall them having any auto failover
[13:38:41] neither has a floating IP assigned (which is what I was expecting I guess)
[13:40:30] they do
[13:40:59] just rebooting is ok, stop keepalived.service first if you want to be extra nice
[13:41:18] ok, then that's why, the floating IP should be assigned to the neutron port IP
[13:41:27] not to the ports of the VMs
[13:41:35] thx taavi, will do that
[13:41:35] andrewbogott: ^^^
[13:43:02] oh great, my intended puppet fix from earlier did not fix the problem
[13:43:57] * andrewbogott stops rebooting things until the alert dashboard is legible
[13:54:38] ok, I think I fixed the puppet typo for good now
[13:58:04] taavi: same thing for rebooting tools-proxy-[78]?
[13:58:43] and, I guess, tools-k8s-haproxy-[56]
[13:59:20] andrewbogott: iirc yes for tools-k8s-haproxy, i don't recall the normal proxies offhand but probably not
[13:59:37] (long-term i want to merge those to the same role)
[14:01:22] I find it a bit concerning that horizon doesn't show the internal IP a floating IP is attached to, if that port is not attached to a VM. Makes me think the floating IP is not attached to anything, and makes me think I need to click the 'release floating IP' button
[14:01:34] I don't know if that button would work in this case, but if it does, it could be a disaster
[14:02:14] we could fix horizon, or we could have IaC. Or both
[14:03:24] visual example of what I'm trying to say:
[14:03:27] https://usercontent.irccloud-cdn.com/file/NbhoYN7X/image.png
[14:04:24] what is the role of the second floating IP?
[14:05:15] * arturo food time
[14:18:18] arturo: I think this is the same problem I described in T381021
[14:18:18] T381021: [horizon] Floating IP pointing to Neutron VIP is not displayed - https://phabricator.wikimedia.org/T381021
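Since Horizon hides the fixed-IP side of a floating IP when the target port is a keepalived VIP rather than a VM (the T381021 issue above), the OpenStack CLI is a safer way to confirm what an address is attached to before anyone reaches for the 'release floating IP' button. A rough sketch; the address and port ID below are placeholders, not the actual project-proxy values.

```bash
# Rough sketch for double-checking a floating IP from the CLI; the address
# and port ID are placeholders, not the real project-proxy values.

# List the project's floating IPs together with their fixed (internal) addresses.
openstack floating ip list --long

# Inspect one floating IP: port_id and fixed_ip_address identify the Neutron
# port it maps to, even when that port is a keepalived VIP and not a VM port.
openstack floating ip show 185.15.56.x -f yaml

# Look at the port itself; a standalone VIP port has no Nova instance as its
# device_owner, which is roughly why Horizon's instance-centric view loses it.
openstack port show <port-id> -c fixed_ips -c device_owner -c allowed_address_pairs
```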
[14:40:00] I wonder if there's a 'ports' panel I can add back
[14:43:18] I rebooted the inactive proxy, tools-proxy-7. Not sure if it's better to move the floating IP over before rebooting -8 or just do it and accept the 15 second outage... (seems like either way there will be an outage)
[14:43:58] dhinus: I assume there's no good way to reboot tools-db-4 and tools-db-5?
[14:56:48] dhinus: ack
[14:58:29] andrewbogott: we could failover tools-db, but probably not worth the hassle, you have to update the DNS, and that causes downtime anyway
[14:59:05] makes sense for an upgrade, but probably not for a quick reboot
[14:59:12] dhinus: do you mind scheduling a window for that? We can do the proxy at the same time.
[14:59:33] yes reboot window seems the best way to me
[14:59:39] I will send an email to cloud-announce
[14:59:50] Since I already warned that I would reboot 'all the rest' on Thursday, let's do it then :)
[14:59:58] thank you!
[15:00:01] sounds good
[15:08:17] arturo: prometheus-node-kernel-messages is failing on many hosts, I'm not sure why as I did test it on a real host before you merged it
[15:17:52] I think I know why: "eval" is sometimes returning 1 and causing the script to exit
[15:17:52] can everyone but me log into labtesthorizon?
[15:18:00] Or, like, remain logged in?
[15:18:10] andrewbogott: remain for how long? :P
[15:18:34] If you see anything other than a page that says 'Submitting...' then I think I'm happy
[15:18:58] in theory if you create a new VM there's a 'networking' step now. Which I would like to see but my local network is weird and seems to be preventing me
[15:21:09] I see the dashboard
[15:21:24] do you want me to try creating a new VM?
[15:22:35] arturo: prometheus-node-kernel-messages is failing when zero messages are returned, and grep returns "1" for "no matches"
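To make that failure mode concrete, here is a minimal shell sketch, not the actual prometheus-node-kernel-messages script (its contents are assumed): under `set -e`, or when an eval's result becomes the script's exit status, grep exiting 1 for "no matches" takes the whole check down even though zero matches is the healthy case. The usual fix is to treat exit 1 as success while still propagating real errors.

```bash
#!/bin/bash
# Minimal sketch of the failure mode, not the real script: the file path is
# from the log; dmesg as the source and the metric name are assumptions.
set -e

ignore_file=/etc/prometheus/kernel-messages-ignore-regex.txt

# This version dies whenever nothing matches, because grep exits 1:
#   messages=$(dmesg | grep -v -f "$ignore_file")

# Treat grep's exit 1 ("no matches") as success, but still fail on exit >1
# (bad regex file, unreadable input, ...).
messages=$(dmesg | grep -v -f "$ignore_file") || [ $? -eq 1 ]

# Count the remaining lines; grep -c also exits 1 on zero, hence the || true.
count=$(printf '%s\n' "$messages" | grep -c . || true)
printf 'node_kernel_messages_total %s\n' "$count"
```

The same pattern applies whether the pipeline sits behind an eval or runs directly; the point is just that grep's exit code 1 has to be distinguished from a real failure.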
[17:00:56] "Could not set 'file' on ensure: No such file or directory - A directory component in /etc/prometheus/kernel-messages-ignore-regex.txt20250203-660386-hi66re.lock does not exist or is a dangling symbolic link" [17:01:00] I can look after the SRE meeting [17:02:49] andrewbogott: most likely they are missing something related to prometheus [17:03:00] the /etc/prometheus directory is _not_ deployed via puppet [17:03:26] hm, ok [17:03:39] andrewbogott: previously we've just accepted the short outage from moving the floating IP without any announcements. there's a cookbook to do it really fast [17:03:53] * arturo going offline now [17:04:36] sorry my notifications have been blowing up today due to shark shitposts so it's very difficult for me to follow any actual conversation atm [17:08:24] taavi andrewbogott I'll just mention toolsdb in the email, then. this also makes me think a similar floating IP for toolsdb might allow for fast failovers. [17:08:26] t.aavi's reason for notification overwhelm got me to actually laugh, so I guess thanks for sharing that. :) [17:09:33] * andrewbogott also lol'ed at 'shark shitposts' [17:09:45] bd808: what did you expect from me [17:20:46] taavi: only the best in blåhaj sockpuppetry ;) [19:09:35] in case anyone is still around... are labtesthorizon logins still slow and bouncy or do you get logged in directly now? [19:12:30] I don’t have access to labtesthorizon so can’t confirm, but I can log into horizon.wikimedia.org [19:13:20] andrewbogott: I just tried and it did ~10 redirects before completing [19:13:20] ok! I have a specific hack running on labtest that I want to test, but it's always hard to check an intermittent issue [19:13:35] bd808: labtesthorizon or regular horizon? [19:13:48] labtesthorizon [19:14:13] and an logout there now has me in a deep loop [19:14:14] ok. So it's not cache pooling then... [19:15:11] * bd808 does the logout dance again with the network monitor tools open for his browser [19:16:23] lol. 238 request from hitting https://labtesthorizon.wikimedia.org/auth/logout/ to being logged back in [19:18:45] andrewbogott: I saved a HAR of the 238 requests. I will look at it a bit later and upload to the bug if it seems reasonably safe to share. [19:19:23] * bd808 → lunch [19:21:33] bd808: I'm interested if it's any better now. But not at the expense of your lunch! [20:35:21] andrewbogott: massively better. 15 requests from /auth/logout/ to fully rendered post-login now. I was able to repeat that 3 times. [20:35:51] ok, I'm changing things again and will have you test shortly... [20:40:03] bd808: how about now? [20:42:43] Still 15 total requests (which is counting css, js, and images in the total). The IDP service had forgotten about me again even though I told it to remember me before I broke for lunch. [20:43:56] ok! I don't entirely understand this fix but I don't hate it. Making a patch... [21:16:29] bd808: if curious, here is a proposed fix https://gerrit.wikimedia.org/r/c/operations/puppet/+/1116868 [21:19:29] andrewbogott: oh! that totally makes sense as why sometimes it was normal and other times got trapped in redirect hell. [21:19:44] yeah, it just had to get lucky and bounce back to the same keystone server [21:20:05] but I'm grumpy that keystone isn't sharing that state across the servers... what can the cache pool be for if not that? 
[17:03:39] andrewbogott: previously we've just accepted the short outage from moving the floating IP without any announcements. there's a cookbook to do it really fast
[17:03:53] * arturo going offline now
[17:04:36] sorry my notifications have been blowing up today due to shark shitposts so it's very difficult for me to follow any actual conversation atm
[17:08:24] taavi andrewbogott I'll just mention toolsdb in the email, then. this also makes me think a similar floating IP for toolsdb might allow for fast failovers.
[17:08:26] t.aavi's reason for notification overwhelm got me to actually laugh, so I guess thanks for sharing that. :)
[17:09:33] * andrewbogott also lol'ed at 'shark shitposts'
[17:09:45] bd808: what did you expect from me
[17:20:46] taavi: only the best in blåhaj sockpuppetry ;)
[19:09:35] in case anyone is still around... are labtesthorizon logins still slow and bouncy or do you get logged in directly now?
[19:12:30] I don’t have access to labtesthorizon so can’t confirm, but I can log into horizon.wikimedia.org
[19:13:20] andrewbogott: I just tried and it did ~10 redirects before completing
[19:13:20] ok! I have a specific hack running on labtest that I want to test, but it's always hard to check an intermittent issue
[19:13:35] bd808: labtesthorizon or regular horizon?
[19:13:48] labtesthorizon
[19:14:13] and a logout there now has me in a deep loop
[19:14:14] ok. So it's not cache pooling then...
[19:15:11] * bd808 does the logout dance again with the network monitor tools open for his browser
[19:16:23] lol. 238 requests from hitting https://labtesthorizon.wikimedia.org/auth/logout/ to being logged back in
[19:18:45] andrewbogott: I saved a HAR of the 238 requests. I will look at it a bit later and upload to the bug if it seems reasonably safe to share.
[19:19:23] * bd808 → lunch
[19:21:33] bd808: I'm interested if it's any better now. But not at the expense of your lunch!
[20:35:21] andrewbogott: massively better. 15 requests from /auth/logout/ to fully rendered post-login now. I was able to repeat that 3 times.
[20:35:51] ok, I'm changing things again and will have you test shortly...
[20:40:03] bd808: how about now?
[20:42:43] Still 15 total requests (which is counting css, js, and images in the total). The IDP service had forgotten about me again even though I told it to remember me before I broke for lunch.
[20:43:56] ok! I don't entirely understand this fix but I don't hate it. Making a patch...
[21:16:29] bd808: if curious, here is a proposed fix https://gerrit.wikimedia.org/r/c/operations/puppet/+/1116868
[21:19:29] andrewbogott: oh! that totally makes sense as to why sometimes it was normal and other times got trapped in redirect hell.
[21:19:44] yeah, it just had to get lucky and bounce back to the same keystone server
[21:20:05] but I'm grumpy that keystone isn't sharing that state across the servers... what can the cache pool be for if not that?
[21:21:15] yeah, it is the sort of state that we would typically try to put in memcached/redis to allow a whole cluster of servers to access readily
[21:22:26] Did you ask anyone upstream about ideas for why it needs to hit the same keystone instance?
[21:23:09] I could see it being some easily overlooked cache setting given the general state of openstack config complexity
[21:23:27] nope, I only just figured out that rotation between keystone servers was an issue
[21:24:46] I need to read the source. Auth ought to be a single call returning a token that can be validated w/out state. I don't really understand why there would be anything resembling a session.
[21:26:18] but maybe I should merge in the meantime, since it's causing current pain...
[21:28:50] usually there is some "did I actually ask for this" tracking in the client side of an OAuth handshake
[21:29:14] s/client/consumer/
[21:29:41] oh, that would do it
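For future reference, the redirect-loop measurement above can be roughly reproduced from the command line. This is only an approximation of bd808's HAR numbers: curl follows plain HTTP redirects but doesn't run the JavaScript that drives part of the login dance, and unauthenticated it stops at the IDP login page rather than back inside Horizon.

```bash
# Rough, unauthenticated approximation of the before/after check: follow the
# logout redirect chain a few times and report how many hops each attempt took.
# Counts will be lower than a browser HAR since no JavaScript runs here.
jar=$(mktemp)
for attempt in 1 2 3; do
  printf 'attempt %s: ' "$attempt"
  curl -s -o /dev/null -L --max-redirs 100 -c "$jar" -b "$jar" \
    -w '%{num_redirects} redirects, final status %{http_code}\n' \
    https://labtesthorizon.wikimedia.org/auth/logout/
done
rm -f "$jar"
```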