[00:01:06] bd808: yes I'm still running that. I am also aware that we currently have 10 nodes marked SchedulingDisabled
[00:01:18] It finally got a slot after waiting for 15 minutes
[00:01:42] we don't have enough capacity to have 10 nodes down at the same time
[00:01:55] though that number is actively being reduced, since I currently have two terminal windows open rebooting all of them in a loop
[00:02:30] yes. I am rebooting those so they'll all be ready soon
[00:04:11] Raymond_Ndibe: the !log problem from T374875 should be fixed now
[00:04:12] T374875: Shell accounts containing `-` not recognized when parsing "user@host" clause from a `!log` message - https://phabricator.wikimedia.org/T374875
[00:06:11] the problem happened because of the D state pods preventing an upgrade. I configured the upgrade loop to move on to the next node if an upgrade fails (because of the D state pods). When I noticed the number of SchedulingDisabled nodes and recognized the potential problem, I started running the reboot cookbook for all the nodes that couldn't be upgraded
[00:06:42] bd808: btw thanks for fixing the log issue
[00:07:43] Raymond_Ndibe: the D state problem makes sense. #someday maybe the NFS nodes will stop locking up like that
[00:08:36] I suppose I could wish for log storage that lets more things abandon NFS too, but that seems greedy ;)
[00:27:18] Down to 2 `SchedulingDisabled` nodes. They should be gone soon, after which I will attempt upgrading the failed ones again
[00:31:17] good luck :)
[10:10:34] * dcaro lunch
[12:43:56] hmm, got a gitlab runner error failing to connect to gitlab
[12:43:57] xd
[12:43:58] fatal: unable to access 'https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api.git/': Failed to connect to gitlab.wikimedia.org port 443 after 30 ms: Could not connect to server
[12:44:45] it seems reproducible
[12:48:26] hmm, worth mentioning in #-releng?
[12:48:59] I mentioned it in -gitlab
[12:50:12] oh, I'm not in that one :)
[12:51:44] too many channels xd
[14:18:27] it seems we have a bunch of pods stuck in Terminating status due to kyverno preventing them from being removed xd
[14:18:59] Raymond_Ndibe is looking into it, it seems that we should exclude the `DELETE` method from the kyverno hook
[14:19:45] that made the upgrade fail trying to drain the pods from the nodes that were running those (as even rebooting the node does not clear the pod, since it's not really running), and well, that made the upgrade a bit bumpy yesterday xd
[14:26:48] ouch
[14:31:41] is this the first k8s upgrade after we enabled kyverno?
[14:32:18] no, the second, though the pods were stuck since the first (actually, since we fixed kyverno after the previous upgrade)
[14:32:25] 4th Sept
[15:18:23] dhinus: I think this is ready for review https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/46
[15:19:25] Raymond_Ndibe: what happened on September 4th is the network blip when the switch started misbehaving (and kyverno stopped working for a second); that might have let the cluster create those malformed pods, which, once kyverno worked again, it did not allow to be removed
[15:19:26] :/
[15:22:28] a quick & dirty workaround could be to manually delete the kyverno pod policy from the affected namespaces, tell k8s again to drop the Terminating pods, then re-enable the pod policy
[15:23:33] ceph seems to be dealing ok
[15:24:26] arturo: cloudgw did not HA, I'm guessing?
[15:24:58] dcaro: I think the switch did not failover?
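(A minimal shell sketch of the quick & dirty workaround described at 15:22:28: drop the namespaced kyverno pod policy, force-delete the pods stuck in Terminating, then re-enable the policy. The policy name `toolforge-kyverno-pod-policy` is taken from the logs further down; the tool namespace and the exact flags are assumptions, not a tested procedure.)

```bash
#!/usr/bin/env bash
set -euo pipefail

NS="tool-example"                       # hypothetical affected tool namespace
POLICY="toolforge-kyverno-pod-policy"   # namespaced kyverno Policy name seen in the logs

# Keep a copy of the policy (minus server-side fields) so it can be re-created afterwards
kubectl -n "$NS" get policies.kyverno.io "$POLICY" -o json \
  | jq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, .metadata.generation, .status)' \
  > "/tmp/${NS}-${POLICY}.json"

# Temporarily remove the policy from the affected namespace
kubectl -n "$NS" delete policies.kyverno.io "$POLICY"

# Tell k8s again to drop the pods stuck in Terminating (they have a deletionTimestamp
# set but never go away); skip the grace period since the containers are not
# actually running anymore
kubectl -n "$NS" get pods -o json \
  | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name' \
  | xargs -r kubectl -n "$NS" delete pod --grace-period=0 --force

# Re-enable the pod policy
kubectl -n "$NS" apply -f "/tmp/${NS}-${POLICY}.json"
```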
[15:25:08] I see cloudgw1002 up and running
[15:25:38] when you say switch, you mean passing the primary from one to the other, right? (not the physical switch)
[15:26:01] no, I mean the cloudsw, which is the upstream edge router for cloudgw
[15:26:54] oh, topranks ^
[15:28:02] I can ping irb-1120.cloudsw1-d5-eqiad.eqiad1.wikimediacloud.org from cloudgw1002
[15:29:05] cloudsw1-d5 is active in vrrp for all those
[15:29:10] and I can ssh to login.toolforge.org
[15:29:20] dcaro: maybe the network problem is gone now?
[15:29:21] me too
[15:30:02] welcome back stashbot!
[15:30:10] Things I was looking at seem to be recovering
[15:30:30] at some point I got no route to host to login.toolforge.org from my laptop, and I'm pretty sure that was produced by either the core routers or cloudsw
[15:31:00] I
[15:31:05] I got unable to resolve name
[15:31:27] I was getting things like "92 bytes from instance-tools-bastion-12.tools.wmcloud.org (185.15.56.57): Destination Host Unreachable"
[15:31:30] the auth resolvers don't run via cloudgw
[15:32:20] dcaro: so DNS failing I think confirms the theory that the cloudsw failover was the problem here
[15:33:25] The DNS issue I got was because I used the wrong name:
[15:33:34] dcaro@urcuchillay$ ssh login
[15:33:34] ssh: Could not resolve hostname login: Name or service not known
[15:33:35]
[15:33:35] dcaro@urcuchillay$ wm-ssh login
[15:33:35] INFO:wm-ssh:Found full hostname login.toolforge.org
[15:33:35] ssh: connect to host login.toolforge.org port 22: No route to host
[15:33:35] ssh: connect to host login.toolforge.org port 22: No route to host
[15:33:39] (should have pasted that)
[15:33:48] oh, ok
[15:33:53] the right name got no route
[15:34:09] similar to bd808 and myself
[15:34:33] toolforge looking good now, alerts are clearing
[15:35:03] ack
[15:35:03] ceph is starting to have slow ops though
[15:35:04] Ceph cluster in eqiad has 107 slow ops
[15:36:46] usage on the links from D5 to E4/F4 looks ok, increased for sure but not saturating
[15:37:10] dcaro: we got a bunch of kyverno policies in Ready=False
[15:37:22] :-(
[15:37:24] yep, I think that the mon is getting a bit saturated by the reports from all the other osds reporting the C8 osds down
[15:39:20] some are passing though
[15:40:22] there are no more pods in Terminating state, I think that kyverno flipped again
[15:40:39] and allowed the stuck pods to clean up, but new pods were started without the right policy
[15:40:47] Is Komla around? The competing pages at https://wikitech.wikimedia.org/wiki/News/Migrating_Wikitech_Account_to_SUL and https://wikitech.wikimedia.org/wiki/Wikitech/SUL-migration could use someone to merge them into one uniform message.
[15:41:14] yes, I'm manually restarting some of the kyverno pods, they failed to reach the k8s API, and are apparently doing nothing
[15:43:06] just restarted cloudcephmon1002 to force one of the new mons to take over, as it was having slow ops itself (so far seems to have helped)
[15:43:44] kyverno policies are starting to get into Ready state, ~200 units (out of 3k) so far
[15:43:49] there's a bunch of alerts about cloudcontrol1005.private being down
[15:44:28] my ssh to cloudcontrol1005 is not working ATM
[15:44:33] (oh, that one is there in the rack)
[15:44:40] so expected, I'll ack
[15:44:40] ok
[15:45:49] action item: send an issue to kyverno upstream to clarify what happens if the kyverno admission controller pods lose connection to the api server. What do they do? Do they retry? Do they get killed?
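(For reference, a hedged sketch of the kind of checks behind the "~200 out of 3k" Ready count and the manual kyverno pod restarts mentioned above. It assumes recent kyverno versions expose a `Ready` condition on each namespaced Policy, and that the admission controller runs as a Deployment named `kyverno-admission-controller`, inferred from the pod names in the log; both may differ per install.)

```bash
# Count namespaced kyverno policies whose status reports Ready=True
kubectl get policies.kyverno.io -A -o json \
  | jq '[.items[] | select(any(.status.conditions[]?; .type == "Ready" and .status == "True"))] | length'

# Total number of policies, for the "X out of 3k" comparison
kubectl get policies.kyverno.io -A --no-headers | wc -l

# Instead of deleting individual pods by hand, a rolling restart of the
# admission controller deployment achieves the same thing
kubectl -n kyverno rollout restart deployment/kyverno-admission-controller
kubectl -n kyverno rollout status deployment/kyverno-admission-controller
```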
[15:46:51] kyverno policies 1.5k ready out of 3k
[15:48:27] which controller is the one flipping the policies?
[15:48:36] the admission controller
[15:49:15] the reports controller is almost at max cpu, logging `│ 2024-09-17T15:47:43Z INFO pattern/pattern.go:82 Expected type int {"type": "", "value": null}` from time to time
[15:49:28] the admission ones seem idle
[15:49:30] -ish
[15:49:47] │ 2024-09-17T15:43:06Z INFO webhooks.server logging/log.go:184 2024/09/17 15:43:06 http: TLS handshake error from 192.168.57.64:3504: EOF │
[15:49:55] from kyverno/kyverno-admission-controller-7cb7c68647-sn4mj:kyverno
[15:49:59] the admission controller should be the one in the hot path for the admission requests
[15:50:21] yep, though is it the one also changing the policy resources?
[15:50:26] (to mark them as ready)
[15:50:51] I think so, but I can double check
[15:50:59] `│ 2024-09-17T15:41:05Z INFO setup.policy logging/controller.go:45 resource added {"type": "Policy", "name": "tool-zygserv/toolforge-kyverno-pod-policy"}`
[15:51:06] it seems to be doing something about it
[15:51:54] this contains a description of what each component does
[15:51:54] https://kyverno.io/docs/high-availability/#controllers-in-kyverno
[15:52:07] I think this is stuck kyverno/kyverno-admission-controller-7cb7c68647-sn4mj:kyverno
[15:52:13] are you still restarting them?
[15:52:24] no, I stopped
[15:52:36] I lost my ssh session to tools-k8s-control-7
[15:53:13] oops, I'll restart that one then
[15:53:23] got my shell again
[15:53:57] all policies are Ready now
[15:54:17] they don't mention 'ready' at all in that help page xd
[15:54:49] I believe the bullet point `Processes validate, mutate, and verifyImages rules.` is what covers it
[16:00:28] slow ops are piling up on ceph :/
[16:00:42] because of the missing mon?
[16:00:52] they kinda jump from almost none to >300
[16:01:14] I think it might be the storm of "I see the osds X,Y,Z down" from all the up osds
[16:01:18] (daemons, not nodes)
[16:01:25] each daemon reports about each other daemon
[16:01:34] no slow ops now
[16:01:41] network is back up
[16:02:14] all my network tests are reporting only known problems because of the cloudsw downtime. The rest of the network bits on cloudvps / toolforge seem to be working mostly normally ATM
[16:04:03] I see servers in the affected rack coming back online
[16:04:27] yep, all the mons are back in the cluster
[16:04:27] I'll step out for a bit -- I'll be available later if required
[16:04:34] ack, thanks
[16:04:43] * arturo offline
[16:18:34] 00:00:10.714 fatal: unable to look up contint2002.wikimedia.org (port 9418) (Temporary failure in name resolution)
[16:18:42] I got that from a CI build occurring in WMCS
[16:18:56] so my guess is something is off with DNS
[16:19:22] hashar: we had a network issue yes, though interesting that the DNS failed for you
[16:19:32] that was running on a VM right?
[16:19:38] I think
[16:19:42] but maybe that is from prod
[16:20:42] ah no, that is the WMCS instance trying to do a git clone from git://contint2002.wikimedia.org/
[16:20:45] and failing with DNS
[16:21:05] so something briefly exploded transiently
[16:21:07] it is working now
[16:25:00] yep, let me know if it happens again
[17:31:44] * dcaro off
[17:32:05] I left ceph re-adding one of the hosts in the C8 rack, will take a bit
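(And a small sketch of standard ceph CLI checks for following the tail end of this, the slow ops, mon quorum, and the C8 host being re-added; run from any host with an admin keyring, e.g. one of the cloudcephmon nodes. These are generic ceph commands, not the exact ones used during the incident.)

```bash
# Overall cluster health, including any SLOW_OPS warnings
ceph health detail

# Confirm all mons rejoined the quorum after the rack came back
ceph quorum_status --format json-pretty | jq '.quorum_names'

# OSDs still marked down (e.g. daemons on the hosts in the affected C8 rack)
ceph osd tree down

# Watch recovery/backfill progress while the re-added host catches up
watch -n 10 ceph status
```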