[07:39:00] thanks andrewbogot.t!
[08:09:35] morning. PAWS is reporting down since 3 mins ago, looking
[08:16:57] node-0 in the PAWS cluster is "node-0 NotReady"
[08:17:13] but the other nodes look ok
[08:21:00] the openstack magnum cli shows "health_status | UNKNOWN"
[08:21:15] 'The cluster paws-127b is not accessible.'
[08:25:21] I'm attempting to resize the cluster from 5 to 4 nodes using horizon
[08:25:50] this should remove the "NotReady" node-0
[08:28:39] the update worked, there are now 4 "Ready" nodes, but https://hub-paws.wmcloud.org/ is replying with 404
[08:28:50] coming from proxy-5.project-proxy.eqiad1.wikimedia.cloud
[08:29:44] the web proxy is apparently hardcoded to point to the node that was not working and that I just removed
[08:30:57] not true, it's pointing to an IP address that I thought was the one of the old node, but it's not
[08:30:59] 172.16.6.241
[08:32:09] reverse lookup points to paws-nfs-1.paws.eqiad1.wikimedia.cloud
[08:35:27] ok I was off track, the web proxy is for nfs-paws.wmcloud.org, not for hub-paws.wmcloud.org
[08:35:33] so I'm not sure where hub-paws.wmcloud.org is configured
[08:46:51] hmm the DNS for hub-paws.wmcloud.org is not resolving at all
[08:53:01] I'm not finding docs on how this should be working, I'm randomly trying to create a new web proxy for hub-paws.wmcloud.org
[08:59:11] this seems to have worked, I had to find the right port with "kubectl get svc"
[08:59:31] so there is now a web proxy pointing to node-1 in the cluster at port 32611
[08:59:42] and I can use PAWS at https://hub-paws.wmcloud.org
[08:59:59] not sure if this is the right way to do this :)
[09:59:21] sorry, I was in a meeting, reading
[10:00:35] I think that the web proxy is managed by tofu, no?
[10:00:47] I had a quick look and did not see it in the tofu folder
[10:00:55] let me check again
[10:03:10] it's not there, I checked the tfstate as well. tofu is only creating the cluster and clustertemplate
[10:03:29] okok
[10:03:32] I think what happened is that the proxy was manually set to point to node-0
[10:03:42] it's in the docs
[10:03:43] yep
[10:03:47] and when I removed that node, the proxy record was deleted because it was no longer pointing to anything
[10:04:12] I will try running "tofu apply", I expect it will take the number of workers back to 5 (maybe)
[10:04:42] https://wikitech.wikimedia.org/wiki/PAWS/Admin#Blue_Green_Deployment,_creating_a_new_cluster
[10:04:58] ohhhh
[10:05:14] yep, maybe we should move the proxy to tofu now that we can
[10:05:17] exactly: node_count = 4 -> 5
[10:05:47] ah good find, I did not see that line:
[10:05:49] "Update Web Proxies in Horizon. Network > Web Proxies Point hub-paws and public-paws to the first node of the new cluster"
[10:06:21] or use the new load balancer from openstack
[10:06:33] (might be a good place to start)
[10:07:09] ha I was searching for "proxy" but that sentence contains "proxies" :P
[10:07:40] yes I agree we should add the proxy part to tofu, or try the load balancer
[10:07:45] hahahah, xd, maybe you can try to add 'proxy' somewhere there
[10:09:05] replaced one of the two 'proxies' with 'proxy' :P
[10:09:14] 👍🎉
[10:09:33] my remaining question is about the port: is there a way to find it without having to use kubectl? is it static?
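For reference, the port lookup described above as a minimal sketch; it assumes OpenStack credentials for the paws project and a working kubeconfig for the paws-127b cluster are already in place:

    # Magnum's view of the cluster (health_status / node_count)
    openstack coe cluster show paws-127b -c health_status -c node_count
    # spot the NotReady member from the Kubernetes side
    kubectl get nodes
    # find the NodePort behind the web proxy, e.g. "80:32611/TCP" -> 32611
    kubectl get svc -A | grep NodePort

As for whether it is static: a NodePort only changes if the Service is recreated without pinning the port, so as long as it is set in the chart it stays the same (which is what the answer below points at with paws/values.yaml).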
[10:09:42] I initially created the proxy pointing to 80, and it did not work
[10:09:55] I think it's always the same
[10:10:06] then I looked at "kubectl get svc" and found 80:32611/TCP
[10:10:27] I'm also confused by the fact that the wiki mentions you should have TWO proxies, I guess they should point to the same port?
[10:10:38] I only created one so far (hub-paws)
[10:11:07] it's in paws/values.yaml (in the paws repo)
[10:11:18] ack
[10:13:13] I vaguely remember there were more, let's check codfw
[10:15:28] hmpf... the project is pawsdev, but the openstack client does not seem to be authorized (from cloudcontrol2010-dev at least)
[10:16:00] oh... so the id and the name of the project are different
[10:17:47] how do you check the web proxies from the command line?
[10:17:52] I improved the docs a bit https://wikitech.wikimedia.org/wiki/PAWS/Admin#Update_Web_Proxy_in_Horizon
[10:18:00] never tried checking the proxies from the CLI
[10:18:39] can we use labtesthorizon?
[10:19:42] https://usercontent.irccloud-cdn.com/file/mFhjxGwf/Screenshot%202025-07-21%20at%2012.19.29.png
[10:19:44] yep, but I was not a member of the project, just added myself, it has a grafana proxy set
[10:19:48] yep
[10:20:46] on the docs, you might want to move the last parts about removing the old cluster out of the 'update proxies' section
[10:20:59] it looks good
[10:22:05] I added another sub-section, how does it look now?
[10:22:20] nice 👍
[10:23:27] I'm not sure about the grafana proxy in codfw, the IP does not seem to match any VM
[10:24:09] unless it's in another project, but I think it's not working
[10:25:08] is there a grafana deployed in eqiad paws?
[10:25:26] the codfw1dev deployment seems botched yep
[10:26:27] I'll go back to the meet thingie, ping me if you need my attention :)
[10:26:37] thanks, ttyl
[10:30:48] public-paws is now responding to requests, but I'm not sure it's working as expected
[10:33:32] both the example links from the admin page, and a random link I found in phab, are returning 404
[10:33:38] "Jupyter has lots of moons, but this is not one..."
[10:42:37] I tried understanding how public-paws is supposed to work but I'm not sure
[11:05:01] ok I think I found the issue: the proxy should point to ingress-nginx, on port 30001
[11:05:17] not sure why 32611 is also exposed as a NodePort
[11:13:28] updated the docs again: https://wikitech.wikimedia.org/wiki/PAWS/Admin#Update_Web_Proxy_in_Horizon
[11:33:59] I created https://github.com/toolforge/paws/pull/495 to remove the additional port
[11:37:02] dcaro: hello! would you have some time to help move the cloudcephosd1024, cloudcephosd1015 and cloudvirt1047 uplinks from cloudsw2-d5-eqiad to cloudsw1-d5-eqiad? Jclack can help move the cables, I can do the network config, we just need someone to depool the hosts
[11:53:31] XioNoX: hey, just saw the message, can we wait until ~3 hours from now? (I'm in a meeting, and have to travel back home right after)
[11:53:50] dcaro: yeah no pb, can be a different day if better
[11:53:55] or just one host per day
[11:53:56] etc
[11:56:19] on the host side, all that's needed is depool/repool? no config change/ip change/etc right?
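Going back to the proxy fix at 11:05, a hedged sketch of how one might confirm which NodePort actually fronts ingress-nginx before repointing the web proxy; the namespace/service name and the node hostname below are assumptions, not taken from the cluster:

    # expect something like 80:30001/TCP on the ingress controller Service
    kubectl -n ingress-nginx get svc ingress-nginx
    # hit the backend directly from a VM in the project, with the Host header the ingress routes on
    # (node name is hypothetical)
    curl -sI -H 'Host: hub-paws.wmcloud.org' http://paws-127b-node-1:30001/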
[11:57:30] if we can schedule a time to do it sync it might be faster (as we don't need to move data), otherwise I'll need some time to depool them yep
[11:58:56] dcaro: yeah that's correct
[11:59:26] awesome :), send me an invite if you can do it sync, otherwise I'll start depooling when I get home
[11:59:26] just depool, cable move, repool, so a few seconds downtime max if all smooth
[12:01:15] great, I can ping you when I'm around later if you prefer
[12:01:46] dcaro: tomorrow 13:30 good for you?
[12:02:04] 11:30 UTC
[12:03:24] okokok
[12:04:32] * dcaro traveling back, be back in ~1h
[13:49:08] dcaro, dhinus, Have y'all learned all you can from cloudcephosd1006 or would you like me to leave it on bookworm for a while longer?
[13:51:25] andrewbogott: just got back from Barcelona, I wanted to completely take it out and bootstrap it (on the ceph side), to see if that helps in any way (creating the osd drives from scratch)
[13:52:13] ok!
[13:52:17] Want me to do that?
[13:55:07] sure, thanks! :)
[13:55:59] you will want to make sure it's done rebalancing before y'all do the switch changes tomorrow.
[13:56:36] that'd be nice yes
[13:58:03] Think I should /also/ reimage the host after the nodes are destroyed? Just to be 100% sure it's a clean slate?
[13:59:14] It should not be needed, but sure, just in case; if it does not work even with a reimage, then it will most probably not work without one either
[14:02:47] hm, I deleted a bunch of PAWS files last night, got it down to almost 80% and now it's back to 85%
[14:05:10] PAWS crashed this morning, maybe unrelated (one worker was not responding)
[14:05:15] maybe there's some heavy usage going on?
[14:05:47] let's see if folks are building the android source again...
[14:05:54] LOL
[14:13:51] Are the switch moves tomorrow related to getting T395910 unblocked?
[14:13:52] T395910: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910
[14:20:35] those were going to F4/E4, no?
[16:35:28] I am not convinced that the keyholder model is a good one; it assumes that a reboot is a rare, major event that will always be noticed rather than a trivial bit of normal operation
[16:35:59] it should at least raise some kind of persistent systemd flag when activation is pending
[16:36:02] maybe that happens in prod
[16:36:55] bd808 andrewbogott do you think this toolforge request is compliant with our policies? https://toolsadmin.wikimedia.org/tools/membership/status/1974
[16:38:22] I think it is
[16:38:24] my understanding is that the code that runs on toolforge would be open source, but it would send data to chatgpt and gemini... which I think is allowed as long as the user is informed?
[16:38:27] can you tell what llm backend it's using?
[16:38:38] it says "Multiple AI backends: Support for OpenAI GPT-4 and Google Gemini models"
[16:38:45] Yeah, I don't love it but it's not forbidden by the tou
[16:38:51] thanks
[16:39:14] yeah. gross to me, but that's not a blocking criterion ;)
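On the "make sure it's done rebalancing" point at 13:55, a sketch of the checks one might run from a ceph admin node before tomorrow's cable moves; setting noout for the few seconds of downtime is an assumption here, and in practice the depool/repool is more likely done with the usual cookbooks than by hand:

    # cluster should be HEALTH_OK with no backfill/recovery still running
    sudo ceph -s
    # the rebuilt host's OSDs should all be back up/in
    sudo ceph osd tree | grep -A12 cloudcephosd1006
    # optional: keep CRUSH from rebalancing while the uplink is briefly down
    sudo ceph osd set noout
    # ...cable move / depool-repool...
    sudo ceph osd unset noout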
[16:39:30] btw cloudcephosd1006 is back up and pooled, in case you want to go back to watching it suspiciously
[16:39:39] great
[16:39:48] still rebalancing
[16:39:52] I will log off shortly, but I'll keep my fingers crossed :P
[16:40:38] "Background processing: Celery with PostgreSQL for AI model integration" -- that sounds like a thing that will break in Toolforge
[16:41:28] bd808: good catch, we might easily spin up a Trove db though
[16:42:29] I have #feelings that the shared Redis in Toolforge worked fine until a handful of projects decided to use it to power a celery queue.
[16:43:12] I have a similar feeling, but no evidence
[16:44:25] GitHub stalking led me to this website that the user behind that Toolforge request seems to run -- https://isearthstillwarming.com/ -- The cognitive dissonance hurts.
[16:47:53] from the domain name, I initially assumed it was a website of global warming deniers, which would have made more sense :)
[16:54:40] topranks: looks like jumbo frames are not enabled for cloudcephosd2004-dev; is that an easy thing to fix?
[16:55:16] (dcaro fixed the monitoring for that so I'm only noticing today)
[16:56:14] andrewbogott: dhinus reviews are welcome for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1171233 (to finish fixing it)
[16:56:35] andrewbogott: yeah it's just a host-side thing
[16:56:40] ip link set dev ens1f1np1 mtu 9000
[16:56:47] but I think puppet should be taking care of it tbh
[16:57:12] not sure why it's not working for that one
[16:57:21] If it's host-side then puppet should! I'll dig
[16:58:34] yeah, wrong nic names in hiera
[16:59:57] ah ok yeah
[17:01:13] dcaro: +1d, one nit :)
[17:02:50] thanks
[17:02:50] 1
[17:07:48] * dhinus off
[17:13:34] andrewbogott: the acme-chief from project-proxy seems to be failing to validate the challenges
[17:13:44] `failed to validate challenge Challenge type: ACMEChallengeType.DNS01`
[17:13:48] have you seen that before?
[17:14:13] it doesn't ring a bell
[17:14:15] just restarted the service in case
[17:14:26] that's what I would do :)
[17:15:36] nope, it seems to still fail :/
[17:15:36] I noticed it uses ipv6 for the dns servers
[17:15:36] `Jul 21 17:14:37 project-proxy-acme-chief-02 acme-chief-backend[1773844]: DNS server 2a02:ec80:a000:4000::2 (ACMEChallengeValidation.UNKNOWN) failed to validate challenge Challenge type: ACMEChallengeType.DNS01. _acme-challenge.o11y.wmcloud.org TXT -FfDc`
[17:16:14] that could be it, or it could be a change in the role it uses to access designate
[17:17:03] * andrewbogott re-arms keyholder just in case...
[17:17:05] did that help?
[17:18:58] it's retrying...
[17:19:07] I can dig with the name, but not with the ip6 though
[17:19:13] nope, still failing
[17:19:41] https://www.irccloud.com/pastebin/I5w3cdiv/
[17:19:49] probably that VM does not have ip6?
[17:20:31] we might want to either rebuild it with ip6 (and hope it works), or force acme-chief on ip4
[17:20:39] (if possible)
[17:20:39] it thinks it has an ipv6...
[17:20:50] fe80::f816:3eff:fe17:2fa3 ?
[17:20:59] is it connected to the right network and such?
[17:21:29] hmm, other VMs also fail to dig on ip6, I might be doing something wrong
[17:21:51] I can ping it from a different VM in the project using the v6 address
[17:21:54] or the DNS servers don't listen on ip6
[17:22:03] can you `dig -6 TXT @ns0.openstack.eqiad1.wikimediacloud.org. _acme-challenge.maps.wmflabs.org` ?
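On the jumbo frames thread at 16:54: once the NIC names in hiera are right and puppet has applied the MTU, a quick end-to-end check is a don't-fragment ping with a full-size frame (8972 = 9000 minus 20 bytes of IPv4 header and 8 bytes of ICMP header); the peer address below is a placeholder:

    # confirm the interface MTU
    ip link show ens1f1np1 | grep -o 'mtu [0-9]*'
    # fails if any hop on the path clamps the MTU below 9000
    ping -c3 -M do -s 8972 <cluster-network-peer-ip>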
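The reachability checks being discussed here, spelled out as a sketch to run from project-proxy-acme-chief-02; since acme-chief apparently checks every address the nameservers resolve to (see 17:37 below), a v6-only failure is enough to fail the whole challenge:

    # only a fe80:: link-local address and no "scope global" entry means there is no v6 route out
    ip -6 addr show scope global
    # the same query over v4 vs v6; without a global v6 address the second one dies with "network unreachable"
    dig -4 TXT @ns0.openstack.eqiad1.wikimediacloud.org _acme-challenge.o11y.wmcloud.org
    dig -6 TXT @ns0.openstack.eqiad1.wikimediacloud.org _acme-challenge.o11y.wmcloud.org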
[17:22:53] network unreachable
[17:23:02] I get the same :/
[17:23:25] so maybe it's some setup in the ip6 network missing
[17:23:29] (or the DNS servers)
[17:24:57] why can I do this then?
[17:24:57] dig -6 @ns0.openstack.eqiad1.wikimediacloud.org project-proxy-puppetserver-1.project-proxy.eqiad1.wikimedia.cloud
[17:25:17] I can't :/
[17:25:18] (from tools bastion)
[17:25:54] or project-proxy-acme-chief-02
[17:26:11] where are you doing it from?
[17:26:57] acme-chief is in VLAN/legacy
[17:27:08] I don't think any of the toolforge bastions have routable IPv6 addresses yet.
[17:27:46] yep, that would explain that
[17:27:50] you're right, neither can I
[17:28:15] ok, so we need to rebuild the acme-chief host
[17:29:34] I think so, though I'm not sure now whether the DNS servers support ip6 (they should, I guess?)
[17:30:34] cloudservices seem to on the internet side anyway
[17:30:51] https://www.irccloud.com/pastebin/T9qO7Gdx/
[17:32:07] dcaro: you probably need to go, I can work on the rebuild 'next'
[17:37:10] ack, just checked acme-chief, and it can't be configured otherwise: it uses by default both ip4 and ip6 (everything that the DNS resolves to), and if any fail it fails the check
[17:37:27] so yep, rebuild might be the easiest (and nicest future-wise)
[17:38:20] anyhow, cya tomorrow! ping/page me if anything goes awry as usual :), the ceph node is still not using a lot of memory, so I think it should last the night
[17:42:29] ugh, can't create new VMs because can't sign certs on the puppetserver because
[17:42:58] I will tackle this after lunch
[17:48:11] 🤦‍♂️
[17:48:11] on the bright side
[17:48:11] cloudcephosd1006 node disk usage looks "normal" now, not like before
[17:48:11] https://usercontent.irccloud-cdn.com/file/D4VRAvJ6/image.png
[17:49:34] that's interesting!
[19:08:33] dcaro, if you are still around... something bad seems to be happening with the new proxy certs
[19:17:01] reverted, we can sort this out tomorrow
[19:52:21] Just saw
[19:52:23] Okok
[20:05:54] cloudcephosd1006 started swapping, some of the osds there are using up to 10G ram, I'll take it out of the pool :/, it did not work
[20:27:49] andrewbogott: ^ fyi
[20:28:13] yeah, same as the incremental update
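A sketch of the sort of by-hand commands behind "take it out of the pool" at 20:05; in practice this is more likely done via a cookbook, and using "ceph osd ls-tree" to collect the host's OSD ids is an assumption about one possible manual approach:

    # which ceph-osd daemons are eating RAM / pushing the host into swap
    free -h && ps -o pid,rss,comm -C ceph-osd --sort=-rss
    # OSD ids under this host in the CRUSH tree
    sudo ceph osd ls-tree cloudcephosd1006
    # mark them out so placement groups drain off the host
    sudo ceph osd out $(sudo ceph osd ls-tree cloudcephosd1006)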