[07:18:08] greetings
[07:28:29] XioNoX topranks I have an update re: redundant cloudsw in T414835, tl;dr from the latest switch maint tests we've done in T417393 it seems ceph is fine with having a rack down without rebalancing
[07:28:30] T417393: Carry out controlled network switch down tests in cloud - https://phabricator.wikimedia.org/T417393
[07:28:54] godog: nice :)
[07:29:26] yeah! so standard single per-rack switch, having said that I know some of those switches defo need replacing
[07:30:21] i.e. we can use that money to procure switches anyways (?)
[07:30:37] unfortunately I can't make it to the network sync later today, I wanted to get the ball rolling
[07:31:41] yeah we have budgeted replacement for the C8/D5 switches, which are EOL in July 2027
[07:32:40] nice, I'm updating the meeting notes too
[07:33:22] and yes C8/D5 we can actually take down, the main limiting factor for redundancy now is cloudvirt
[07:33:28] i.e. T424658
[07:33:29] T424658: Ensure cloudvirt capacity is more evenly spread out among racks - https://phabricator.wikimedia.org/T424658
[07:34:53] heh also in light of T416394
[07:34:58] T416394: Q3:rack/setup/install cloudcephosd1053 - https://phabricator.wikimedia.org/T416394
[07:46:38] catching up on backlog, re: zookeeper, zookeeper::server must definitely fail if $myid ends up being undefined
[07:49:33] and relatedly it seems we're running zookeeper on the private network with private hostnames
[07:51:03] I'm reading https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network and basically all non-"management" traffic goes over the cloud-private vlan
[07:53:17] I don't know the full history behind cloud-private, though it seems to me we can and should be running zookeeper on the prod vlan
[07:53:44] and also define the zk cluster inside the zookeeper_clusters hash like the rest of production does
[09:41:16] has anyone looked at the ToolforgeWebHighErrorRate alert?
[09:41:42] it has a "page" tag but it didn't send a page to victorops, not sure why
[09:43:25] ah it's toolsbeta I think, we rewrite the tags but not the "summary"
[09:43:41] lolsob
[09:45:35] I did start to look at it, and was equally tripped up by the summary containing the hashtag
[09:45:48] I didn't go very far in understanding what's wrong though
[09:46:02] reproducible in the sense that https://admin.beta.toolforge.org/ does indeed 503
[09:48:49] * taavi wonders if it's just crawlers hitting that more than usual
[10:02:55] totally could be yeah
[10:06:39] though from https://grafana.wmcloud.org/goto/cfkjic4bpotfkd?orgId=1 I don't see an increase in requests
[10:17:45] * godog lunch
[11:14:14] I can see ten tools-* hosts in production puppetdb, but the hosts aren't reachable, was there some issue with the setup?
[11:14:37] if they're new, I'll ignore them, just saw some alerts during fleet-wide rollouts
[11:15:35] moritzm: do you mean tools-k8s-ctrl[1001-1002].eqiad.wmnet,tools-k8s-worker[1001-1008].eqiad.wmnet ?
[11:15:38] moritzm: they are being repurposed as k8s hosts, though I'm not sure in what state they are now, cfr T423719
[11:15:38] T423719: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13[73-82] - https://phabricator.wikimedia.org/T423719
[11:17:55] yeah, it's tools-k8s-ctrl[1001-1002].eqiad.wmnet,tools-k8s-worker[1001-1008].eqiad.wmnet
[11:18:12] I'll leave a note on the task to run the decom cookbook on them
[11:20:32] ack
[11:20:59] godog: do you think we should keep some of them for cloudvirts to get a better spread among racks, by any chance?
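[editor's note: a minimal Python sketch of the fail-fast behavior asked for at 07:46:38 — resolving a host's ZooKeeper myid from a cluster hash should blow up loudly instead of silently yielding nothing. The cluster name, hostnames, and hash shape here are hypothetical, loosely modeled on production's zookeeper_clusters hash; the real guard would live in the Puppet zookeeper::server class.]

    import socket

    # hypothetical cluster definition, mirroring the shape of zookeeper_clusters
    zookeeper_clusters = {
        "cloud-eqiad": {
            "hosts": {
                "zk1001.eqiad.wmnet": 1,
                "zk1002.eqiad.wmnet": 2,
            },
        },
    }

    def my_zookeeper_id(cluster: str, fqdn: str | None = None) -> int:
        """Return this host's myid, raising if the host isn't in the cluster hash."""
        fqdn = fqdn or socket.getfqdn()
        myid = zookeeper_clusters[cluster]["hosts"].get(fqdn)
        if myid is None:
            # the equivalent of zookeeper::server fail()-ing on an undef $myid,
            # rather than writing a broken myid file and failing later at runtime
            raise RuntimeError(f"{fqdn} has no myid in zookeeper cluster {cluster!r}")
        return myid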
[11:21:24] just to make sure we don't need them before gifting them to wikikube
[11:22:15] mmhh checking specs, or have you already?
[11:22:51] they are config C
[11:23:27] latest cloudvirts are config G, checking
[11:24:03] indeed, so 512GB RAM vs 128GB, I'd rather keep config G or greater
[11:24:48] re: cloudvirt rack spread that's T424658, I did some quick calculations and even with the cloudvirt refresh we should be fine
[11:24:48] T424658: Ensure cloudvirt capacity is more evenly spread out among racks - https://phabricator.wikimedia.org/T424658
[11:25:12] there's a bunch of cloudvirts to move around of course
[11:25:55] yep
[13:25:21] godog: regarding zookeeper... the minimal fix is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1278524
[13:25:44] If we want to have a whole new conversation about the private network, then... I'm sorry you won't be at the network sync :)
[13:27:03] regarding the toolsbeta alert, I don't think it's a traffic issue, the tool that produces most of that metric is crashing with a bunch of these:
[13:27:05] 2026-04-29T12:30:58Z [backend-5f65dfffdb-brwdq] [job] pydantic.errors.PydanticUserError: Field 'id' requires a type annotation
[13:53:45] I won't be there for our sync-up meeting, taking oncall compensation time
[14:58:39] This is a new failure, right? Or am I making a typo?
[14:58:46] andrew@cloudcumin1001:~$ sudo cumin "O{*}" "lsb_release -d | grep -i trixie"
[14:58:46] Caught Unauthorized exception: The request you have made requires authentication. (HTTP 401) (Request-ID: req-31c93704-ee56-45a8-9616-33d1d198c699)
[14:58:55] dhinus: ^ ?
[15:00:27] andrewbogott: it might be missing the patch? cc volans
[15:00:34] weird, checking
[15:00:42] oh right, I forgot that that gets overwritten
[15:01:00] but it was also reapplied when I upgraded cumin, so that shouldn't be the case
[15:01:34] ok, I'll let you investigate :)
[15:01:41] but there was a diff, reapplied and testing
[15:01:51] we should fix that once and for all at some point :/
[15:02:49] andrewbogott: I'm waiting for the query to complete but it's not failing fast
[15:02:56] so I guess it's ok now, checking 2001 too
[15:03:19] yep, looks good now, thank you!
[15:06:33] andrewbogott: did it work for you? I got a timeout
[15:07:03] still waiting
[15:10:51] yes, timed out for me too
[15:11:21] specific project queries seem fine, it's just * that I've seen fail so far
[15:13:13] me too, I can dig a bit more
[15:13:53] ty, I'm also happy to investigate if your workday is over
[15:14:17] it seems it's the connection to openstack.eqiad1.wikimediacloud.org:25000
[15:14:29] unable to get a token?
[15:14:48] as it's trying https://openstack.eqiad1.wikimediacloud.org:25000/v3/auth/tokens
[15:15:20] yep telnet hangs too
[15:15:28] anything changed in the firewall?
[15:16:29] I don't see 185.15.56.161 in sudo nft list ruleset
[15:16:41] where is that failing from?
[15:16:47] cloudcumin1001
[15:16:58] that will need to go through the proxy then
[15:17:11] I thought we fixed cumin to do that?
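[editor's note: a minimal sketch of the check being converged on here — verifying that the Keystone endpoint is reachable from cloudcumin1001 through the webproxy. It assumes the standard Keystone behavior of serving version metadata unauthenticated at /v3/, and uses only the hosts already named in the log.]

    import requests

    KEYSTONE = "https://openstack.eqiad1.wikimediacloud.org:25000/v3/"
    # the direct connection hangs from cloudcumin1001 (no firewall rule for
    # 185.15.56.161), so route the request through the webproxy instead
    PROXIES = {"https": "http://webproxy.eqiad.wmnet:8080"}

    resp = requests.get(KEYSTONE, proxies=PROXIES, timeout=10)
    resp.raise_for_status()
    print(resp.json()["version"]["id"])  # e.g. "v3.14" if the proxy path works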
[15:17:39] ah right, sorry, totally forgot, but yes the config was there and I tested it
[15:17:56] yes, proxy_url: http://webproxy.eqiad.wmnet:8080
[15:18:09] so maybe it's the query for * that triggers the proxy timeout
[15:18:15] let me check the proxy logs
[15:19:24] ahahahah forgive me, I'm stupid
[15:19:49] so the old patch ofc doesn't have the proxy support
[15:19:52] let me fix that
[15:31:33] andrewbogott: good to go, sorry for the trouble
[15:31:43] 812 hosts found
[15:32:12] you might want to use -x/--ignore-exit-codes for your command ;)
[15:32:35] * andrewbogott trying...
[15:33:56] yep, works now. thank you!
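[editor's note: the toolsbeta 503 thread above is left unresolved in this log, so for reference: the PydanticUserError quoted at 13:27:05 is what pydantic v2 raises when a model field is declared with Field() but no type annotation. A minimal reproduction and fix, using a hypothetical Record model:]

    from pydantic import BaseModel, Field

    # broken: assigning Field() without an annotation raises, at class
    # definition time:
    #   PydanticUserError: Field 'id' requires a type annotation
    # class Record(BaseModel):
    #     id = Field(default=None)

    # fixed: pydantic v2 requires an explicit type annotation on every field
    class Record(BaseModel):
        id: int | None = Field(default=None)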