[09:55:46] Not super-urgent, but I am experiencing a hang from the `sre.k8s.reboot-nodes` cookbook against dse-k8s-eqiad. There is a log in `/var/log/spicerack/sre/k8s/reboot-nodes.log` on cumin1002 is anyone would like to look. It just seems to wait forever and not do a reboot.
[09:56:55] correction: *if* anyone would like to take a look. I can get around it by cordoning and running the reboot-single cookbook for each host, but I thought you might like to know about it.
[10:05:47] Oh, just after I typed that, it finally kicked into life and started the reboot. Typical.
[10:13:54] Well, something weird has happened. The cookbook just cordoned off 8 nodes out of 11 total on the cluster.
[10:13:59] https://www.irccloud.com/pastebin/budXot1N/
[10:16:24] For the record, this was the command I executed: `cookbook sre.k8s.reboot-nodes -t T394897 -r "Reboot to pick up new backported kernel" -a dse-k8s-worker --exclude 'dse-k8s-worker[1001-1002].eqiad.wmnet' --k8s-cluster dse-eqiad`
[10:16:25] T394897: [Cephfs] Clients occasionally fail to release caps, resulting in blocked requests and Airflow service disruption - https://phabricator.wikimedia.org/T394897
[10:32:55] btullis: I can take a look in a bit. IIRC it cordons a percentage of nodes so as not to cause too much pod churn. Did you check the code?
[10:34:47] jayme: thanks, not urgent. I've briefly checked the code, but am mainly just working around it for now.
[12:54:57] btullis: so the reason it is taking so long is that 'The disruption budget postgresql-airflow-research-primary needs 1 healthy pods and has 0 currently', I guess. I would assume that spinning up a new postgres pod took some time (loading data/resyncing with the master node?)
[13:01:05] and the cordoning is like I said: it reboots the first node and then cordons all nodes that will also be rebooted. Then it goes on to reboot the second node. This (in theory) means that pods evicted from the second node land on the first one (which has already been rebooted), and it avoids pods being scheduled on a node that is about to be rebooted right away.
[13:02:32] I say "in theory" because the second node could be bigger/have more capacity than the first one, in which case this scheme would fail (when it's run for a whole cluster)
[13:04:31] Thanks. What I've seen from doing the cordoning manually is that it only has a pod disruption budget set for the primary postgresql server, but when asked to drain it promotes a replica to primary. This typically takes 10s or so.
[13:07:57] hm. you could probably check the k8s events of the cluster to get a better understanding of what happened or did not happen between 09:32:48 and 10:02:49
[13:08:03] And on the cordoning: I was trying to do a whole-cluster reboot, except for the first two nodes, which had already been done. So it rebooted 1 node and uncordoned it, then cordoned the remaining 8 nodes.
[13:09:37] Have I misunderstood how this cookbook is supposed to be used for a full-cluster rolling restart? Should I have been using a different cookbook? I can see how this mechanism would be efficient if I only wanted to restart a small subset of the cluster.
[13:16:07] It's actually supposed to be used that way, given that the k8s nodes are usually equally sized. Maybe that assumption no longer holds true.
[13:19:25] OK, cool. I'll come back to this another time, I think. Not my biggest issue at the moment :-)
[16:53:02] Oh, is it that they are cordoned, but not drained? So existing workload isn't moved away from these hosts, but new pods aren't scheduled to run on them. Is that right?
[17:06:03] yeah
[17:06:07] they get drained on reboot
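For illustration, the cordon-ahead pattern jayme describes above (cordon everything still due for a reboot, then drain/reboot/uncordon one node at a time) can be sketched with the official `kubernetes` Python client. This is not the actual spicerack cookbook code: the node names, the `reboot_node()` placeholder, the DaemonSet skip, and the lack of retry handling are all assumptions made for the example, and `client.V1Eviction` assumes a reasonably recent client version.

```python
"""Illustrative sketch (not the spicerack cookbook) of a cordon-ahead rolling reboot."""
from kubernetes import client, config


def cordon(core: client.CoreV1Api, node: str, unschedulable: bool = True) -> None:
    """Cordon (or uncordon) a node by toggling spec.unschedulable."""
    core.patch_node(node, {"spec": {"unschedulable": unschedulable}})


def drain(core: client.CoreV1Api, node: str) -> None:
    """Evict all non-DaemonSet pods from a node via the Eviction API.

    Evictions respect PodDisruptionBudgets; a blocked PDB makes the API
    return 429, which this sketch does not retry.
    """
    pods = core.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node}")
    for pod in pods.items:
        owners = pod.metadata.owner_references or []
        if any(owner.kind == "DaemonSet" for owner in owners):
            continue
        eviction = client.V1Eviction(
            metadata=client.V1ObjectMeta(
                name=pod.metadata.name, namespace=pod.metadata.namespace
            )
        )
        core.create_namespaced_pod_eviction(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
            body=eviction,
        )


def reboot_node(node: str) -> None:
    """Placeholder: the real cookbook does this via spicerack/cumin."""
    print(f"rebooting {node} ...")


def rolling_reboot(nodes_to_reboot: list[str]) -> None:
    config.load_kube_config()
    core = client.CoreV1Api()

    # Cordon every node still due for a reboot up front, so evicted pods
    # cannot land on a node that is about to be rebooted anyway.
    for node in nodes_to_reboot:
        cordon(core, node)

    for node in nodes_to_reboot:
        drain(core, node)          # waits on PDBs, e.g. a postgres primary failover
        reboot_node(node)
        cordon(core, node, False)  # uncordon: the rebooted node can take pods again


if __name__ == "__main__":
    # Hypothetical host list, roughly matching the cluster discussed above.
    rolling_reboot([f"dse-k8s-worker10{i:02d}.eqiad.wmnet" for i in range(3, 12)])
```

The key distinction raised at the end of the log is visible here: a cordon only flips `spec.unschedulable` and leaves running pods in place, while a drain actively evicts pods and therefore has to wait on PodDisruptionBudgets.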
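Along the same lines, here is a rough sketch of the kind of check jayme suggests for working out what blocked the drain between 09:32:48 and 10:02:49: list any PodDisruptionBudgets that currently allow zero disruptions (such as the postgresql-airflow-research-primary one quoted above) plus recent warning events. The warning-only filter and the event `limit` are arbitrary choices for the example, not anything the cookbook itself does.

```python
"""Rough diagnostic sketch: PDBs that would stall a drain, plus recent warning events."""
from kubernetes import client, config

config.load_kube_config()

# PDBs with disruptionsAllowed == 0 block evictions until enough replacement
# pods (e.g. a newly promoted postgresql primary) are healthy again.
policy = client.PolicyV1Api()
for pdb in policy.list_pod_disruption_budget_for_all_namespaces().items:
    status = pdb.status
    if status and status.disruptions_allowed == 0:
        print(
            f"{pdb.metadata.namespace}/{pdb.metadata.name}: "
            f"healthy={status.current_healthy} "
            f"desired={status.desired_healthy} disruptionsAllowed=0"
        )

# Recent warning events often show why an eviction or reschedule stalled
# (cf. the 'needs 1 healthy pods and has 0 currently' message quoted above).
core = client.CoreV1Api()
for event in core.list_event_for_all_namespaces(limit=200).items:
    if event.type == "Warning":
        print(
            event.last_timestamp,
            event.involved_object.kind,
            event.involved_object.name,
            event.reason,
            event.message,
        )
```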