[07:33:14] good morning
[07:35:12] morning!
[08:28:11] finally online!
[09:03:48] klausman: o/ Janis created https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Upgrade/1.31
[09:04:00] the new istio version is 1.24.x
[09:04:17] there is no rush for the upgrade, maybe we could do it in Q2 after holidays etc..
[09:04:32] but as a pre-requisite we should target a version of knative and kserve
[09:04:34] Yeah, that sounds good. Thanks for the heads-up
[09:06:19] also there is a way to bootstrap a cluster locally now, that could be handy to test knative
[09:06:30] it will probably be a big jump
[09:08:26] yeah, knative and kserve and all the deps
[09:59:47] it may also be a nice experience for the new SRE that we'll hire (if we make it for Q2)
[10:11:18] Machine-Learning-Team: Inputs for tone check model prediction - https://phabricator.wikimedia.org/T397013#10966843 (achou) I wanted to deploy and test this edit-check [[ https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1164271 | change ]] in experimental ns, but helmfile diff shows something u...
[10:13:30] -----^ hey klausman o/ do you have any idea on this?
[10:13:48] Looking
[10:13:48] Machine-Learning-Team, Add-Link, Growth-Team, Goal: FY2024-25 Q4 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10966851 (OKarakaya-WMF) ## single model for multiple languages experiment I've clustered (kmeans) languages by generating featu...
[10:14:26] thank uuu
[10:18:41] I think this is due to values.yaml having an entry for them, but not v-staging.
[10:19:19] So the absence of them in v-staging.yaml does not remove the stanzas inherited from v.yaml
[10:19:55] i.e. 1164271 is incomplete in that it should have also removed them from v.yaml.
[10:19:59] I'll make a patch
[10:21:31] good point!!
indeed, they are also in values.yaml
[10:22:08] I missed that
[10:22:36] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1165845
[10:24:28] Once it's merged, you should be able to deploy just the intended changes. I'll sync the two prod clusters
[10:25:21] perfect, thank you <3
[10:38:59] I am running the new docker report (inspecting k8s running containers and their images) for ml-staging, and look what I found
[10:39:06] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1165850
[10:39:07] :(
[10:39:32] Both yay and boo to that :)
[10:39:38] the queue proxy is the sidecar container running in all the kserve pods, so it may take a bit before we upgrade all
[10:40:07] +1'd
[10:43:45] btw, I am also pushing the admin_ng updates for external services, including the new entry for Thanos-Swift
[10:44:51] ack
[10:45:00] elukey: I presume the ClusterRole updates for debmonitor are ok to push along with that
[10:46:10] yes yes please go ahead
[10:48:04] done & done
[10:48:10] * elukey lunch
[10:57:33] ditto
[11:57:26] Machine-Learning-Team: AI/ML Infrastructure Request: Expand ORES-enabled RevertRisk filters deployment to all wikis, excluding Commons and Wikidata - https://phabricator.wikimedia.org/T398291#10967339 (SSalgaonkar-WMF) Hey @kostajh ! Thanks so much for submitting this! Totally agree with you about how the wo...
[12:08:37] Machine-Learning-Team: AI/ML Infrastructure Request: Expand ORES-enabled RevertRisk filters deployment to all wikis, excluding Commons and Wikidata - https://phabricator.wikimedia.org/T398291#10967372 (kostajh) >>! In T398291#10967339, @SSalgaonkar-WMF wrote: > Hey @kostajh ! Thanks so much for submitting th...
[12:31:59] I deployed the edit-check change, but it's now in CrashLoopBackOff state. The error has nothing to do with the new image because it happens in storage-initializer.
(I tested an old prod image, and it ran into the same error)
[12:32:15] The message shows Access Denied when copying model from s3://wmf-ml-models/edit-check/peacock/ to local.. full logs: https://phabricator.wikimedia.org/P78737
[12:32:51] that's weird.. are there any changes in s3 recently?
[12:38:28] I'll investigate more later
[12:47:52] weird, does it happen to all models, or only edit-check?
[12:48:45] trying to kill readability-predictor-default-00023-deployment-7b6495d9c6-snpxq on staging
[12:49:22] no bueno, it happens consistently
[12:49:40] I am wondering if it is something that went wrong when I tried to config the machinetranslation's perms
[12:51:06] it likely happens to all models
[12:52:35] trying a fix
[12:54:07] yes ok the error changed, now it is a 409
[12:54:10] maybe transient
[12:56:12] ok it is definitely my bad
[12:56:59] hopefully fixed now
[12:57:18] yep
[12:57:19] What happened there?
[12:57:44] it is a side effect of swift post wmf-ml-models+segments -r 'mlserve:ro' and swift post wmf-ml-models -r 'mlserve:ro'
[12:57:50] that I did for machinetranslation
[12:57:59] apparently it doesn't add a new one, it overrides the list
[12:58:07] oooh
[12:58:15] I checked with swift stat wmf-ml-models for example, there are ro perms ACL
[12:58:20] once restored, all good
[12:58:21] yeah, that makes sense now
[12:58:33] Luckily we didn't have a restart-storm in prod
[12:58:53] this is where something misconfigured came up
[12:59:05] afaics we are using the mlserve:prod account for prod
[12:59:08] not the mlserve:ro
[12:59:31] mmmh, yeah that seems wrong.
[13:00:05] no worries Luca :) things like this happen. thanks for fixing it!!
[13:00:39] klausman: yeah we need to update to use mlserve:ro
[13:00:57] need to go to the office, will be back in a sec
[13:01:05] aiko: your deployment should be ok now!
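(Editor's note on the ACL override described at 12:57-12:58: since `swift post <container> -r ...` was observed to replace the read ACL rather than append to it, one way to avoid dropping existing readers is to build the full list first and pass the merged string to `-r`. The sketch below is only string handling and makes no Swift API calls; the helper name `merge_read_acl` and the starting ACL value are hypothetical, not taken from the log.)

```python
def merge_read_acl(current_acl: str, new_entries: list[str]) -> str:
    """Append new_entries to a comma-separated ACL string, keeping existing ones."""
    entries = [item.strip() for item in current_acl.split(",") if item.strip()]
    for entry in new_entries:
        if entry not in entries:  # skip duplicates so reruns are idempotent
            entries.append(entry)
    return ",".join(entries)


# Hypothetical current value; in practice it would be read from the container first.
current = "mlserve:prod"
merged = merge_read_acl(current, ["mlserve:ro"])
print(merged)  # mlserve:prod,mlserve:ro
```

(The merged string is what would then be handed to `swift post wmf-ml-models -r '<merged>'`, and likewise for `wmf-ml-models+segments`.)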
[13:01:07] apologies again
[13:01:08] heading into meeting, so ttyl
[13:05:19] elukey: it's working <3
[13:11:02] mmmm I am wondering if this is why the machinetranslation account doesn't work
[13:13:34] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#10967688 (klausman) I've written up my thoughts, and some of the things we discussed outside of this ticket regarding making vLLM images available for use with LiftWing workloads: # Cu...
[14:09:27] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#10967994 (elukey) Thanks for the summary! I see two separate problems being listed: 1) Have a separate Docker Registry to be able to push images with compressed layers bigger than 4G....
[15:58:56] isaranto: o/ not super urgent but lemme know if next week anybody is willing to work on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1165850
[15:59:16] the container that will change is the queue-proxy one, used on all isvcs sadly
[15:59:35] so we should test it briefly on ml-staging to be sure
[15:59:42] and then move to prod (can happen incrementally)
[16:01:08] Ok! We'll take a look and let you know tomorrow!
[16:01:16] thanksss
[16:01:22] We will work on it
[16:01:49] we also have to migrate all storage initializers in prod to the mlserve:ro swift account, rather than the prod one, that could go alongside this change
[16:07:26] I'll still be around next week, so I can help
[16:31:08] super, in theory it should be a low effort thing, just a lot of deployments
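(Editor's note on the values.yaml discussion at 10:18-10:22: the point that removing a stanza from the staging override does not remove it from the deployed result can be illustrated with a simplified layered merge. This is a rough model of how override files are stacked on top of defaults, not the actual helmfile implementation; the function `merge_values` and all keys and model names below are hypothetical.)

```python
import copy


def merge_values(base: dict, override: dict) -> dict:
    """Recursively layer override on top of base; keys missing from override stay."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults (values.yaml) that still carry a stanza for model-b.
defaults = {"inference_services": {"model-a": {"replicas": 1}, "model-b": {"replicas": 1}}}
# Hypothetical staging override from which model-b was removed.
staging = {"inference_services": {"model-a": {"replicas": 2}}}

result = merge_values(defaults, staging)
print(sorted(result["inference_services"]))  # ['model-a', 'model-b']
```

(In this model, model-b survives because the override never mentions it, which is why the follow-up patch also had to drop the entries from values.yaml itself.)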