[09:26:26] klausman: I'll be merging this https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1145982 as a test. It should be inconsequential, but keep it in mind in case something comes up
[09:45:04] actually scratch that, it's using the older version of the chart, it won't apply and there is no diff
[09:45:43] 0.2.10 that is
[09:48:48] hey folks, I have something that I can't totally explain when deploying kartotherian
[09:49:17] I am changing the health check probes for the pods, plus rolling out the new mesh envoy bucket list
[09:49:33] now a group of new pods comes up, but the rest doesn't
[09:49:51] I checked events for the usual suspects but I can't see anything resource-related
[09:50:06] and I don't think it's the kube-scheduler failing to find capacity
[09:50:19] any suggestions on where to look?
[09:50:45] (I am deploying now for wikikube-codfw)
[09:51:19] elukey: I would assume the pods don't become ready
[09:51:32] that should be mentioned in the namespace events, though
[09:51:40] or in the pod events 🤔
[09:52:55] do you mean describe pod?
[09:52:59] yeah
[09:53:09] the baffling thing is that some of them come up in a few seconds
[09:53:24] and in staging everything went fine
[09:53:37] maybe termination takes forever?
[09:54:10] if they're not terminating gracefully and the scheduler has to wait for terminationGracePeriod to end
[09:55:27] ah, it's already rolling back I guess
[09:56:08] yes sorry, it kicked off a few moments ago I think
[09:56:38] 40m Warning FailedCreate replicaset/kartotherian-main-5564488cb9 Error creating: pods "kartotherian-main-5564488cb9-ggk5t" is forbidden: exceeded quota: quota-compute-resources, requested: limits.cpu=7, used: limits.cpu=420, limited: limits.cpu=420
[09:56:59] I totally missed it
[09:57:04] in events?
[09:57:08] yes
[09:57:18] 40m ago though
[09:57:37] I checked get resourcequota and it didn't indicate issues
[09:57:57] what about the replicaset status for said pods?
[09:58:07] s/status/events
[09:58:11] ok I'll try to manually bump the limits, maybe to 460/480
[09:58:53] restarted
[09:59:25] seems way better now with 460
[09:59:44] sort of, stalling again now
[10:01:20] yep, because I didn't fix the resource quotas correctly
[10:01:28] anyway, I think this is the issue, thanks a lot folks!
[10:01:31] will file a patch
[10:02:55] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1148290
[10:04:45] +1ed
[10:06:26] <3
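For reference, the ResourceQuota object behind the FailedCreate event above would look roughly like the sketch below. The quota name and the 460-core figure come from the conversation; the namespace and the memory value are assumptions for illustration. Note that during a rolling update, terminating old pods and incoming new pods both count against limits.cpu at the same time, which is why a rollout can trip a quota that comfortably fits the steady state.

```yaml
# Hypothetical sketch of the namespace ResourceQuota after the bump.
# Name and CPU figure are from the chat; namespace and memory are assumed.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-compute-resources
  namespace: kartotherian     # assumed namespace
spec:
  hard:
    limits.cpu: "460"         # was 420; 50 pods x 7 cores left no rollout headroom
    limits.memory: "200Gi"    # illustrative value only
```

`kubectl describe resourcequota quota-compute-resources` shows the used-vs-hard numbers per resource, which is what the `get resourcequota` check above summarizes.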
[10:12:53] Hello 👋 Is ceph-based persistent storage in aux-k8s considered usable/generally available? I saw there is a storage class and a test persistentvolume from cdanis but I'm not sure if we are at a point where we can deploy stateful workloads
[10:14:22] jelto: o/ I think it is not enabled yet in codfw due to the ceph cluster being only in eqiad (IIRC)
[10:16:14] Ah good point. But theoretically it's usable already in eqiad? (Just for evaluation, I don't have anything to deploy yet).
[10:17:28] I think so yes, it should be
[10:17:57] re: kartotherian, even with the new resource quota I see the pods stuck
[10:18:02] Ack thanks :)
[10:18:27] the last forbidden event mentions 420, and resourcequotas are at 460 now
[10:18:44] (and it is from 20 mins ago)
[10:18:48] sigh
[10:23:36] I'll wait a bit, other deployments are in progress, it may be a bad moment resource-wise for the cluster
[12:50:53] brouberol, jayme - retried a codfw deployment for kartotherian just now and it completed in no time
[12:51:28] I am leaning towards the theory that scap deployments may have put the cluster under pressure (less capacity available) earlier on
[12:51:42] ack, that's good to know!
[12:51:46] without a change to the quota?
[12:52:20] with the quota change, but I had also applied it earlier on, before the last attempt
[12:53:06] ok. just wondering, as the quota message should clearly be independent of the cluster load
[12:53:59] I know that you wanted to make sure that everything was independent of Luca's PEBKACs
[12:54:54] ;p ... I'm just trying to understand. Not enough capacity should have raised events as well IMHO
[12:55:21] yes yes I am joking :)
[12:55:32] I am wondering if the kube-scheduler may leave some traces
[12:55:33] FWIW we have https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/3a97043c2cecd0351ccf384b768a37de5a910597/team-data-platform/kubernetes-resources.yaml#2 in dse-k8s
[12:55:52] so that we can alert on airflow instances gobbling up too many resources, but before they hit the wall
[12:56:03] really nice
[12:56:30] we could apply this to all clusters. I initially didn't because I wanted to smooth out any potential noise beforehand
[12:56:47] but it's been super quiet and predictable
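An alert of that kind ("warn before a namespace hits its quota") could be written as below. This is a hypothetical sketch using the standard kube-state-metrics `kube_resourcequota` metric, not the actual contents of the linked kubernetes-resources.yaml; the alert name, threshold, and labels are made up.

```yaml
# Hypothetical quota-headroom alert; NOT the linked rule, just a sketch.
groups:
  - name: kubernetes-resources
    rules:
      - alert: KubernetesQuotaAlmostExceeded
        # Ratio of used to hard limits.cpu per namespace quota.
        expr: |
          kube_resourcequota{type="used", resource="limits.cpu"}
            / ignoring(type)
          kube_resourcequota{type="hard", resource="limits.cpu"} > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }} is using over 90% of its limits.cpu quota"
```

Alerting on the used/hard ratio like this would have surfaced the kartotherian situation (420 of 420 cores used) before the rollout hit the wall.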
[12:56:56] update regarding the MTU+IPIP+Liberica thing: https://phabricator.wikimedia.org/T352956#10839001
[12:57:12] and yes, I also suffered my own PEBKAC for not noticing it earlier
[12:59:23] akosiaris: am I reading you right that you're suggesting to move to "run calico-node with full hostpath access to place binaries etc."?
[13:01:21] that's a good question. I was aiming first for just the configuration file
[13:01:45] I think CALICO_MANAGE_CNI is all or nothing
[13:02:42] Good point. I'll need to dig a bit more into what on earth is going on in the newer manifest stuff. I might have to backtrack from my Long Term "plan" in that task
[13:03:29] I mean the conclusion that it would make life easier is probably not totally wrong...but it's still a bit scary
[13:06:26] Yup. My quick reading of that manifest file is that they populate that ConfigMap and also set some variables that rely on it. e.g. there's the CNI config file that is being populated, but it references __CNI_MTU__, which is set by https://github.com/projectcalico/calico/blob/master/manifests/calico-typha.yaml#L7064, referencing the veth_mtu thing
[13:07:27] that whole file is then used to set CNI_NETWORK_CONFIG, which is probably indeed what the on/off toggle you mention acts on
[13:07:36] 😠
[13:08:04] I'll need to read the code a bit I guess
[13:19:37] elukey: fwiw I don't see a clue on why the deployment failed in the events either - apart from the quota issue
[13:22:36] me neither
[13:24:42] it is true that kartotherian is a special use case, there are 50 pods scheduled, each of them requesting 7 cores IIRC
[13:25:16] so the scheduler might struggle a bit to gather all that capacity when other deployments are running
[13:26:11] I have another thing to discuss that is totally unrelated
[13:26:22] namely a new ML base image: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1146891
[13:26:56] I have been working with Tobias and Kevin on that, it is along the lines of the pytorch one (so inference services on lift wing will use that base image etc., mostly LLMs IIUC)
[13:27:47] they have been experimenting with docker-pkg on ml-lab1002 (unpuppetized), since the memory/cpu requirements are high (the last build for the image peaks at 40G of used memory)
[13:28:39] we may think about allowing ml-lab1002 to push to the registry, possibly locking it down properly against misuse (only ml-admins can ssh, be in the docker group, etc.)
[13:29:02] thoughts?
[13:43:13] jelto: yes, what elukey said -- no one is actually using the instance in aux-k8s eqiad yet, but it's backed by the DSE ceph pool, which is in active use AIUI
[13:43:37] it is intended to be generally available (without SLO ;) and I had hoped to end the summit with something using it, but
[13:54:19] elukey: I think that could be feasible...given that image builds are no longer centralized anyways (builds from gitlab - don't know where CI images are built)
[13:55:19] thanks for the additional context :)
[13:55:42] np jelto! please do poke me if you get stuck or anything
[13:55:51] I'll do :)
[16:56:33] jayme: I've looked at the code and you are right. Writing the CNI configuration file is all gated behind the calico-cni install binary, which we don't even ship in our Debian packages right now.
[16:56:45] My Long Term plan just got torpedoed
[16:56:53] I'll update the task
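For context on the __CNI_MTU__ discussion above: in the upstream calico manifests, the CNI config template lives in a calico-config ConfigMap, and the install container on calico-node substitutes __CNI_MTU__ with the veth_mtu value before writing the file to the host; the template is wired into the pod as CNI_NETWORK_CONFIG. Below is an abridged sketch of that ConfigMap; the structure follows the upstream calico-typha.yaml but is trimmed and simplified, so treat the exact fields as approximate.

```yaml
# Abridged sketch of the upstream calico-config ConfigMap (not verbatim).
# The install container copies cni_network_config to the host CNI dir,
# replacing __CNI_MTU__ with the veth_mtu value below.
kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  veth_mtu: "0"   # 0 = auto-detect MTU
  cni_network_config: |-
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "calico",
          "datastore_type": "kubernetes",
          "mtu": __CNI_MTU__,
          "ipam": { "type": "calico-ipam" }
        }
      ]
    }
```

This is what makes CALICO_MANAGE_CNI effectively all-or-nothing: the same install path that writes this file also places the CNI binaries, so opting in only for the configuration file is not straightforward.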