[06:51:08] <elukey>	 isaranto: re: boowork - yes correct!
[06:51:47] <elukey>	 the update of the docker images for knative is only the first no-op step, then we'll need to test the kserve-seccomp-feature deploy
[06:52:04] <elukey>	 once all the model servers are updated in codfw, then we'll enable the new knative security features
[06:52:08] <elukey>	 check that all works etc..
[06:52:20] <elukey>	 and finally, we'll be able to move to PSS and deprecate PSP
[06:52:33] <elukey>	 note: we are updating a MINOR version of k8s
[07:00:06] <elukey>	 new knative up and running
[07:01:10] <elukey>	 isaranto: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1140078 if you have a min
[07:01:34] <elukey>	 in theory this should not work
[07:08:13] <kart_>	 klausman: Sorry for pinging again, but any updates with https://phabricator.wikimedia.org/T391958? :)
[07:10:41] <isaranto>	 good morning!
[07:14:29] <ozge_>	 Good morning
[07:17:27] <wikibugs>	 06Machine-Learning-Team, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX): Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10779085 (10elukey) Alternative discussed on IRC:  `  <kart_> isaranto: anything for https://phabricator.wikimedia.org/T391958 ? :)  <isaranto> o/ kart_ sorry f...
[07:18:26] <elukey>	 kart_: updated the task with our IRC discussion
[07:18:54] <kart_>	 heh
[07:24:44] <jinxer-wm>	 FIRING: LiftWingServiceErrorRate: ...
[07:24:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=hewiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[07:28:00] <elukey>	 isaranto: since I am stupid, I need also https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1140086
[07:28:04] <elukey>	 this is why it failed the last time
[07:29:44] <jinxer-wm>	 RESOLVED: LiftWingServiceErrorRate: ...
[07:29:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=hewiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[07:37:04] <georgekyz>	 good morning
[07:43:21] <elukey>	 o/
[07:43:31] <elukey>	 niceee pods are finally coming up with seccomp!
[07:45:02] <elukey>	 elukey@deploy1003:~$ httpbb --hosts inference.svc.codfw.wmnet --https_port 30443 /srv/deployment/httpbb-tests/liftwing/production/test_revscoring-editquality-reverted.yaml 
[07:45:05] <elukey>	 Sending to inference.svc.codfw.wmnet...
[07:45:07] <elukey>	 PASS: 9 requests sent to inference.svc.codfw.wmnet. All assertions passed.
[07:45:11] <elukey>	 \o/
[07:45:17] <elukey>	 all right finally unblocked
[07:46:22] <wikibugs>	 06Machine-Learning-Team, 07Kubernetes, 13Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10779175 (10elukey) ` elukey@deploy1003:~$ httpbb --hosts inference.svc.codfw.wmnet --https_port 30443 /srv/deployment/httpbb-tests/liftwing...
[07:49:33] <klausman>	 kart_: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140118 sorry again for the delay
[07:58:04] <kart_>	 NP!
[08:02:52] <elukey>	 kart_: please read step 2 of https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Hosting_stages_for_a_model_server_on_Lift_Wing
[08:03:00] <elukey>	 for the initial push of the model
[08:03:06] <elukey>	 (will be done by the ml-team)
[08:08:05] <kart_>	 Thanks. Note that we have slightly different models directory!
[08:08:05] <kart_>	 Models we are using are in,
[08:08:05] <kart_>	 https://people.wikimedia.org/~santhosh/indictrans2/
[08:08:05] <kart_>	 https://people.wikimedia.org/~santhosh/madlad400-3b-ct2/
[08:08:05] <kart_>	 https://people.wikimedia.org/~santhosh/nllb/
[08:08:06] <kart_>	 https://people.wikimedia.org/~santhosh/softcatala/
[08:08:06] <kart_>	 https://people.wikimedia.org/~santhosh/opusmt/
[08:20:02] <klausman>	 I think as long as there aren't any top-level name clashes, that would be fine
[08:20:25] <klausman>	 You could use a subdir below s3://wmf-ml-models/ like s3://wmf-ml-models/mint/nllb/ 
[08:21:28] <elukey>	 kart_: yep yep the ml-team and your team will decide the naming, what I want to make sure is that the model is properly passed in a "secure" place and with a SHA512 set in stone on phabricator etc..
[08:21:51] <elukey>	 that is critical to avoid, as much as possible, the wrong binary to get to prod
[08:22:09] <elukey>	 and/or somebody in the future sneaking in while we upload the new binary
[08:22:34] <kart_>	 Noted!
[08:23:32] <klausman>	 as for subdir-for-mint or no, I have no strong opinion, so I'll let others chime in :)
[08:23:59] <kart_>	 We can probably create tar version of all models and untar them during startup (doing it for some models already) and it will be easy to generate shasum
[08:34:11] <wikibugs>	 06Machine-Learning-Team, 07Kubernetes, 13Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10779348 (10elukey) Next steps:  - enable seccomp default settings for all ml-serve-codfw isvsc (https://gerrit.wikimedia.org/r/1140120) - F...
[08:34:32] <elukey>	 klausman: I left a note about the ml-serve-codfw's PSS migration steps in https://phabricator.wikimedia.org/T369493#10779348, lemme know if you have doubts/suggestions/etc..
[08:41:49] <klausman>	 elukey: LGTM. 
[08:41:58] <elukey>	 ack super
[09:16:48] <elukey>	 slowly rolling out seccomp to ml-serve-codfw
[09:39:13] <elukey>	 all right all deployed afaics
[09:52:57] <elukey>	 elukey@deploy1003:~$ httpbb --hosts inference.svc.codfw.wmnet --https_port 30443 /srv/deployment/httpbb-tests/liftwing/production/test_*
[09:53:00] <elukey>	 Sending to inference.svc.codfw.wmnet...
[09:53:02] <elukey>	 PASS: 114 requests sent to inference.svc.codfw.wmnet. All assertions passed.
[09:58:24] <klausman>	 nice!
[10:14:11] <isaranto>	 \o/
[11:42:45] <kart_>	 klausman: seems we already had account for machinetranslation in thanos/swift? :/
[12:12:10] <klausman>	 Yeah, it seems so. Can you give that a try from your side? 
[12:15:16] <kart_>	 I think I haven't got access or details, but let me check.
[12:16:57] <klausman>	 elukey: what is the done thing to get e.g. S3 credentials to people who don't have puppet(server) access?
[12:54:29] <elukey>	 klausman: IIUC there will be two workflows:
[12:55:10] <elukey>	 1) the ml-team will upload the model binary to the bucket via the usual workflow (see point 2 of https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Hosting_stages_for_a_model_server_on_Lift_Wing) so we track what's being pushed etc..
[12:55:32] <elukey>	 2) mint will be configured for ro-access using the machinetranslation account
[12:55:43] <elukey>	 via puppet private etc..
[12:56:05] <klausman>	 Yes, exactly. As it turns out, there already is a Swift user for 2)
[12:56:48] <klausman>	 But my question is: how do the Mint folks know what the pw for that user is? AIUI, it's only in puppet-private, so I am not sure how k.art could try if it works for them
[12:57:24] <klausman>	 See Matt's comments on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140118 rearding the existing user.
[12:57:34] <elukey>	 they don't need to know, in theory it is sufficient for the pod to read two env variables
[12:58:13] <klausman>	 Ah, so we'd just make a usual service+storage initializer, but use the Mint credentials instead of the mlserve:ro one
[12:58:29] <klausman>	 I wasn't sure if there was off-LW need for RO access to the models.
[12:58:37] <elukey>	 sort-of, mint is on wikikube, so we'll have to use a different strategy
[12:59:08] <elukey>	 they don't have the storage initializer, not sure what they are planning to do, but it will be probably a bash/python script that downloads the binary
[12:59:12] <klausman>	 Ah, now I got a better picture
[12:59:46] <elukey>	 if so, the pod will need to have the S3 user/pass credentials published as env vars, and there is a way in deployment-chart to make it fetch a secret etc..
[13:00:00] <klausman>	 ack
[13:00:46] <elukey>	 klausman: if you grep for AWS_ACCESS_KEY_ID in puppet private you'll see how tegola does it (that runs on wikikube)
[13:01:43] <elukey>	 if kart_ wants to test it locally somewhere on statXXXX or similar you'll have to copy the pass to a file readable only for them on the host or similar
[13:01:52] <elukey>	 no better solution that I can think of
[13:03:32] <kart_>	 Also, we will need 'public' place to put copy of all models for anyone wants to setup their on MT system (and to test on cloud instance as well!)
[13:04:21] <kart_>	 current people.wikimedia.org/ -- is really not suitable place. elukey mentioned about using statxxxx with public URL AFAIK.
[13:05:04] <elukey>	 kart_: that will be taken care by the ml-team via their automation, we use https://analytics.wikimedia.org/published/wmf-ml-models/
[13:05:15] <elukey>	 (as part of the upload step)
[13:05:34] <elukey>	 the binaries will be published with their sha512 etc..
[13:05:48] <elukey>	 accessible by everybody
[13:05:55] <elukey>	 does it work for you?
[13:39:25] <kart_>	 That would be fine. Thanks.
[14:19:37] <aiko>	 georgekyz: about the peacock model's issue, I'm now trying to run gpu blubber.yaml locally (maybe it's not possible). I'm also thinking maybe we can try disable gpu in staging and see if that changes the score
[14:21:28] <georgekyz>	 aiko: Yes that is a nice idea. I am trying to see first of all if (and why) we are taking always True in the prediction. If you parse anything interesting please share it, I will share as well
[14:35:06] <aiko>	 ack! here is the patch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1140195
[14:56:55] <georgekyz>	 aiko: thnx for that one, I pasted in the slack channel something and I +1 your patch, feel free to merge it
[14:59:29] <georgekyz>	 instead of merging this we could edit the isvc directly on staging just to test it 
[14:59:37] <georgekyz>	 but is not always a good practice :p
[15:59:55] <wikibugs>	 10Lift-Wing, 06Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10780983 (10kevinbazira) Testing the `aya-expanse-8b` model with the `wmf-debian-vllm` image built in T385173#10771940 returns the following error: {P75714} I have faced this hardware exception be...
[16:01:17] <kevinbazira>	 o/ finally shared the update on how I got aya-expanse 8b and 32b models to work with the `wmf-debian-vllm` image: https://phabricator.wikimedia.org/T385173#10780983
[16:01:52] <isaranto>	 nice!
[16:03:15] <isaranto>	 georgekyz: aiko feel free to edit deployments directly on experimental in ml-staging-codfw that is what it is for (allow to quickly test things without going through ci/cd ). The edit check might be a special case, but we can make sure to leave things as we find them
[16:05:10] <isaranto>	 going afk folks, I'll be at the hackathon but I'll keep an eye -- ping me if you need anything. ciao!
[16:12:18] <aiko>	 I deployed a new model server that only uses cpu because I think maybe someone still want to test the gpu version
[16:13:14] <aiko>	 I tested the model and the outputs are the same as what I got locally, so the problem is definitely on gpu! 
[16:13:23] <elukey>	 hey folks! Please note that we have a new knative in *ml-serve-codfw*, and all the isvcs are running with seccomp's default profile (basic security setting to prep for the new K8s version)
[16:13:38] <elukey>	 it should be a no-op, but if you see anything weird lemme know
[16:13:53] <elukey>	 I'll be afk during the next couple of days, but I'll check every now and then just to be sure
[16:15:03] <aiko>	 ack! thank you Luca :)
[16:16:24] <elukey>	 I added a rollback in https://phabricator.wikimedia.org/T369493#10781067
[16:16:25] <wikibugs>	 06Machine-Learning-Team, 07Kubernetes, 13Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10781067 (10elukey) Rollback if needed for ml-serve-codfw:  - Revert https://gerrit.wikimedia.org/r/1140120 https://gerrit.wikimedia.org/r/1...
[16:17:11] <elukey>	 np! Have a good rest of the day
[16:18:33] <aiko>	 for people want to try, this is the endpoint: curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/edit-check-cpu:predict" -X POST -d@./input.json -i -H "Host: edit-check-cpu.experimental.wikimedia.org"
[16:22:07] <aiko>	 but I also see edit-check-cpu is not using batcher, that's weird.. we have configured it as well 
[16:30:16] <aiko>	 I'm going to disable inference batching for edit-check to see if that changes anything
[16:34:39] <aiko>	 no, it's the same.. with and without batcher
[17:15:59] <aiko>	 I disabled the gpu for edit-check for now
[20:48:26] <wikibugs>	 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: [onboarding] Improving language agnostic articlequality model + service - https://phabricator.wikimedia.org/T391679#10782049 (10Isaac) I'm late on my acknowledgement but thanks both for engaging here and being open to the feedback!