[07:18:08] greetings
[07:28:29] XioNoX topranks I have an update re: redundant cloudsw in T414835, tl;dr from the latest switch maint tests we've done in T417393 it seems ceph is fine with having a rack down without rebalancing
[07:28:30] T417393: Carry out controlled network switch down tests in cloud - https://phabricator.wikimedia.org/T417393
[07:28:54] godog: nice :)
[07:29:26] yeah! so standard single per-rack switch, having said that I know some of those switches defo need replacing
[07:30:21] i.e. we can use that money to procure switches anyways (?)
[07:30:37] unfortunately I can't make it to the network sync later today, I wanted to get the ball rolling
[07:31:41] yeah we have budgeted replacement for the C8/D5 switches, which are EOL in July 2027
[07:32:40] nice, I'm updating the meeting notes too
[07:33:22] and yes C8/D5 we can actually take down, the main limiting factor for redundancy now is cloudvirt
[07:33:28] i.e. T424658
[07:33:29] T424658: Ensure cloudvirt capacity is more evenly spread out among racks - https://phabricator.wikimedia.org/T424658
[07:34:53] heh also in light of T416394
[07:34:58] T416394: Q3:rack/setup/install cloudcephosd1053 - https://phabricator.wikimedia.org/T416394
[07:46:38] catching up on backlog, re: zookeeper, zookeeper::server must definitely fail if $myid ends up being undefined
[07:49:33] and relatedly it seems we're running zookeeper on the private network with private hostnames
[07:51:03] I'm reading https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network and basically all non-"management" traffic goes over the cloud-private vlan
[07:53:17] I don't know the full history behind cloud-private, though it seems to me we can and should be running zookeeper on the prod vlan
[07:53:44] and also define the zk cluster inside the zookeeper_clusters hash like the rest of production does
[09:41:16] has anyone looked at the ToolforgeWebHighErrorRate alert?
[09:41:42] it has a "page" tag but it didn't send a page to victorops, not sure why
[09:43:25] ah it's toolsbeta I think, we rewrite the tags but not the "summary"
[09:43:41] lolsob
[09:45:35] I did start to look at it, and was equally tripped up by the summary containing the hashtag
[09:45:48] I didn't go very far in understanding what's wrong though
[09:46:02] reproducible in the sense that https://admin.beta.toolforge.org/ does indeed 503
[09:48:49] * taavi wonders if it's just crawlers hitting that more than usual
[10:02:55] totally could be yeah
[10:06:39] though from https://grafana.wmcloud.org/goto/cfkjic4bpotfkd?orgId=1 I don't see an increase in requests
[10:17:45] * godog lunch
[11:14:14] I can see ten tools-* hosts in production puppetdb, but the hosts aren't reachable, was there some issue with the setup?
[11:14:37] if they're new, I'll ignore them, just saw some alerts during fleet-wide rollouts
[11:15:35] moritzm: do you mean tools-k8s-ctrl[1001-1002].eqiad.wmnet,tools-k8s-worker[1001-1008].eqiad.wmnet ?
[11:15:38] moritzm: they are being repurposed as k8s hosts, though I'm not sure in what state they are now, cfr T423719
[11:15:38] T423719: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13[73-82] - https://phabricator.wikimedia.org/T423719
[11:17:55] yeah, it's tools-k8s-ctrl[1001-1002].eqiad.wmnet,tools-k8s-worker[1001-1008].eqiad.wmnet
[11:18:12] I'll leave a note on the task to run the decom cookbook on them
[11:20:32] ack
[11:20:59] godog: do you think we should keep some of them for cloudvirts to get a better spread among racks, by any chance?
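[editor's note: a minimal Python sketch of the fail-fast behavior asked for at 07:46:38 — resolving a host's ZooKeeper myid from a cluster hash should blow up loudly instead of silently yielding nothing. The cluster name, hostnames, and hash shape here are hypothetical, loosely modeled on production's zookeeper_clusters hash; the real guard would live in the Puppet zookeeper::server class.]

    import socket

    # hypothetical cluster definition, mirroring the shape of zookeeper_clusters
    zookeeper_clusters = {
        "cloud-eqiad": {
            "hosts": {
                "zk1001.eqiad.wmnet": 1,
                "zk1002.eqiad.wmnet": 2,
            },
        },
    }

    def my_zookeeper_id(cluster: str, fqdn: str | None = None) -> int:
        """Return this host's myid, raising if the host isn't in the cluster hash."""
        fqdn = fqdn or socket.getfqdn()
        myid = zookeeper_clusters[cluster]["hosts"].get(fqdn)
        if myid is None:
            # the equivalent of zookeeper::server fail()-ing on an undef $myid,
            # rather than writing a broken myid file and failing later at runtime
            raise RuntimeError(f"{fqdn} has no myid in zookeeper cluster {cluster!r}")
        return myid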
[11:21:24] just to make sure we don't need them before gifting them to wikikube
[11:22:15] mmhh checking specs, or have you already?
[11:22:51] they are config C
[11:23:27] latest cloudvirts are config G, checking
[11:24:03] indeed, so 512GB RAM vs 128GB, I'd rather keep config G or greater
[11:24:48] re: cloudvirt rack spread that's T424658, I did some quick calculations and even with the cloudvirt refresh we should be fine
[11:24:48] T424658: Ensure cloudvirt capacity is more evenly spread out among racks - https://phabricator.wikimedia.org/T424658
[11:25:12] there's a bunch of cloudvirts to move around of course
[11:25:55] yep
[13:25:21] godog: regarding zookeeper... the minimal fix is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1278524
[13:25:44] If we want to have a whole new conversation about the private network, then... I'm sorry you won't be at the network sync :)
[13:27:03] regarding the toolsbeta alert, I don't think it's a traffic issue, the tool that produces most of that metric is crashing with a bunch of these:
[13:27:05] 2026-04-29T12:30:58Z [backend-5f65dfffdb-brwdq] [job] pydantic.errors.PydanticUserError: Field 'id' requires a type annotation
[13:53:45] I won't be there for our sync-up meeting, taking oncall compensation time
[14:58:39] This is a new failure, right? Or am I making a typo?
[14:58:46] andrew@cloudcumin1001:~$ sudo cumin "O{*}" "lsb_release -d | grep -i trixie"
[14:58:46] Caught Unauthorized exception: The request you have made requires authentication. (HTTP 401) (Request-ID: req-31c93704-ee56-45a8-9616-33d1d198c699)
[14:58:55] dhinus: ^ ?
[15:00:27] andrewbogott: it might be missing the patch? cc volans
[15:00:34] weird, checking
[15:00:42] oh right, I forgot that that gets overwritten
[15:01:00] but it was also reapplied when I upgraded cumin, so that shouldn't be the case
[15:01:34] ok, I'll let you investigate :)
[15:01:41] but there was a diff, reapplied and testing
[15:01:51] we should fix that once and for all at some point :/
[15:02:49] andrewbogott: I'm waiting for the query to complete but it's not failing fast
[15:02:56] so I guess it's ok now, checking 2001 too
[15:03:19] yep, looks good now, thank you!
[15:06:33] andrewbogott: did it work for you? I got a timeout
[15:07:03] still waiting
[15:10:51] yes, timed out for me too
[15:11:21] specific project queries seem fine, it's just * that I've seen fail so far
[15:13:13] me too, I can dig a bit more
[15:13:53] ty, I'm also happy to investigate if your workday is over
[15:14:17] it seems it's the connection to openstack.eqiad1.wikimediacloud.org:25000
[15:14:29] unable to get a token?
[15:14:48] as it's trying https://openstack.eqiad1.wikimediacloud.org:25000/v3/auth/tokens
[15:15:20] yep telnet hangs too
[15:15:28] anything changed in the firewall?
[15:16:29] I don't see 185.15.56.161 in sudo nft list ruleset
[15:16:41] where is that failing from?
[15:16:47] cloudcumin1001
[15:16:58] that will need to go through the proxy then
[15:17:11] I thought we fixed cumin to do that?
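[editor's note: a minimal sketch of the check being converged on here — verifying that the Keystone endpoint is reachable from cloudcumin1001 through the webproxy. It assumes the standard Keystone behavior of serving version metadata unauthenticated at /v3/, and uses only the hosts already named in the log.]

    import requests

    KEYSTONE = "https://openstack.eqiad1.wikimediacloud.org:25000/v3/"
    # the direct connection hangs from cloudcumin1001 (no firewall rule for
    # 185.15.56.161), so route the request through the webproxy instead
    PROXIES = {"https": "http://webproxy.eqiad.wmnet:8080"}

    resp = requests.get(KEYSTONE, proxies=PROXIES, timeout=10)
    resp.raise_for_status()
    print(resp.json()["version"]["id"])  # e.g. "v3.14" if the proxy path works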
[15:17:39] ah right, sorry, totally forgot, but yes the config was there and I tested it
[15:17:56] yes, proxy_url: http://webproxy.eqiad.wmnet:8080
[15:18:09] so maybe it's the query for * that triggers the proxy timeout
[15:18:15] let me check the proxy logs
[15:19:24] ahahahah forgive me, I'm stupid
[15:19:49] so the old patch ofc doesn't have the proxy support
[15:19:52] let me fix that
[15:31:33] andrewbogott: good to go, sorry for the trouble
[15:31:43] 812 hosts found
[15:32:12] you might want to use -x/--ignore-exit-codes for your command ;)
[15:32:35] * andrewbogott trying...
[15:33:56] yep, works now. thank you!
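[editor's note: the toolsbeta 503 thread above is left unresolved in this log, so for reference: the PydanticUserError quoted at 13:27:05 is what pydantic v2 raises when a model field is declared with Field() but no type annotation. A minimal reproduction and fix, using a hypothetical Record model:]

    from pydantic import BaseModel, Field

    # broken: assigning Field() without an annotation raises, at class
    # definition time:
    #   PydanticUserError: Field 'id' requires a type annotation
    # class Record(BaseModel):
    #     id = Field(default=None)

    # fixed: pydantic v2 requires an explicit type annotation on every field
    class Record(BaseModel):
        id: int | None = Field(default=None)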