[00:02:29] Machine-Learning-Team, Wikilabels, Cloud-VPS (Debian Buster Deprecation): Cloud VPS "wikilabels" project Buster deprecation - https://phabricator.wikimedia.org/T367562#9993181 (Andrew) I'm shutting down the Buster VMs today since they appear abandoned. If anyone restarts them, please follow up on thi...
[06:27:56] o/ good morning
[08:52:35] I suggest we just delete the llm image https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/llm/
[08:53:20] since we are using the huggingface image I don't think we are going to invest any more time in that
[08:54:01] in the case where we want to focus on a specific llm in the future we can revamp this. wdyt?
[08:57:37] isaranto: o/
[08:57:42] I agree!
[08:58:27] one caveat though:
[08:58:31] just noticed the k8s deployment config for langid was added to the "llm" namespace:
[08:58:31] https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/ml-services/llm/values.yaml
[08:58:31] if we are to remove llm from the isvc repo, then we shall have to remove the "llm" namespace too and create a "langid" namespace.
[08:59:20] I was just talking about the llm image. the llm namespace will be used to deploy all other llms (huggingface image, langid etc)
[09:00:24] I get that there is some confusion due to the naming though
[09:00:57] so my suggestion is just to remove the code from isvc repo + CI pipelines in puppet
[09:07:13] yes, confusion arises because of the current mapping system (isvc-in-repo==namespace-in-k8s).
[09:07:13] when we have an isvc named "X" in the repository, we also have a matching namespace "X" in the k8s deployment.
[09:07:13] this is the usual flow for most isvcs. however, if we want to place all upcoming llms into the llms namespace, that's also ok with me.
[09:08:37] yes, this is my understanding so far, that we'll have a few llms in the same namespace
[09:10:34] okok, I'll work on removing the llm dir from the isvcs repo
[09:13:44] sounds good! there's no hurry though, let's wait a bit to get other folks' opinion in case there's sth we're missing
[09:15:34] no problem :)
[09:16:46] +1, it seems good! The namespace grouping in k8s should be considered a grouping of isvcs that have, more-or-less, the same requirements and usages (so we can better assign resources, allow calls to external services in helmfile, etc..)
[09:26:23] nicely put Luca. We could add this info about how models are assigned to namespaces in Wikitech (maybe even the isvc repo README)
[09:34:02] yep, the docs on Wikitech show the k8s namespace e.g: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Language_models
[09:49:31] I agree that the NS grouping is mostly functional/technical (as Luca pointed out). That doesn't stop us from using said grouping elsewhere, of course. Or we can do something else. As long as we're consistent about it :)
[09:55:22] Machine-Learning-Team, Content-Transform-Team, Research: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#9994052 (isarantopoulos) Thanks for the update Isaac! By looking at the above code + model iiuc the following changes need to be introduced in Lift Wing: - switc...
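The namespace convention from the discussion above (an isvc normally gets a k8s namespace of the same name, while several LLM services share the "llm" namespace) can be sketched as a small lookup. This is only an illustration: `SHARED_NAMESPACES` and `namespace_for` are hypothetical names, the table entries are examples taken from the chat, and nothing here reflects how deployment-charts actually resolves namespaces.

```python
# Hypothetical override table: isvcs that are grouped into a shared
# namespace instead of getting one named after themselves.
# The entries are illustrative, based only on what the chat mentions.
SHARED_NAMESPACES = {
    "langid": "llm",
    "huggingface": "llm",
}


def namespace_for(isvc_name: str) -> str:
    """Return the namespace an isvc would be deployed to.

    Default rule from the chat: isvc "X" goes to namespace "X".
    Grouped services fall back to the shared namespace in the table.
    """
    return SHARED_NAMESPACES.get(isvc_name, isvc_name)


if __name__ == "__main__":
    for name in ("langid", "huggingface", "X"):
        print(f"{name} -> {namespace_for(name)}")
```

Keeping the grouped services in one explicit table mirrors the point made in the chat: the grouping is functional (shared requirements and resources), so the exceptions to the same-name rule are worth documenting in one place.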
[10:17:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment outlink-topic-model-predictor-default-00019-deployment in articletopic-outlink at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:22:44] it seems that there is a CrashLoopBackOff for the outlink articletopic model
[10:22:49] FIRING: [4x] KubernetesDeploymentUnavailableReplicas: Deployment outlink-topic-model-predictor-default-00019-deployment in articletopic-outlink at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:24:50] iiuc the unavailable replicas refers to the crashloopbackoff
[10:25:28] kevinbazira: --6
[10:25:39] isaranto: https://phabricator.wikimedia.org/P66808
[10:25:41] * --^
[10:26:07] it's being caused by `ModuleNotFoundError: No module named 'python'`.
[10:26:07] ack! this is what I saw as well
[10:26:48] all ok since the previous deployment/revision is working, but please deploy on ml-staging before you go to prod
[10:27:30] sure sure
[10:28:11] lemme know if you need any help or just ping me for reviews
[10:30:05] klausman: o/
[10:30:05] running `docker buildx build --target production -f .pipeline/blubber.yaml .` on ml-testing no longer builds an image from the blubberfile as it did in ml-sandbox.
[10:30:39] what if you remove the buildx directive?
[10:32:03] that returns:
[10:32:03] ```
[10:32:03] kevinbazira@ml-testing:~/rec-api-modernization/recommendation-api$ docker build --target production -f .pipeline/blubber.yaml .
[10:32:03] Sending build context to Docker daemon 1.895MB
[10:32:03] Error response from daemon: dockerfile parse error line 2: unknown instruction: VERSION:
[10:32:04] ```
[10:35:30] what is the blubber.yaml you are building? is it the rec-api one? what is the docker version in ml-testing?
[10:36:16] Docker version 20.10.24+dfsg1, build 297e128
[10:37:18] the image that had built on ml-staging and now can't be rebuilt on ml-testing is the rec-api-modernization patch: https://gerrit.wikimedia.org/r/c/research/recommendation-api/+/1052445
[10:38:13] since the blubber directive is there it should work with docker.
[10:38:40] one thing I notice is that the docker version is really old and doesn't even have security support atm
[10:38:40] https://endoflife.date/docker-engine
[10:41:53] apart from that the error suggests that there is an issue with the version field
[10:42:56] are you able to build that locally? at least trigger the build to check that everything is ok without waiting for the actual image to build
[10:46:18] yes, I am able to trigger the build locally
[10:47:11] using Docker version 24.0.5, build 24.0.5-0ubuntu1~22.04.1
[10:48:10] the docker on ml-testing might have to be updated
[10:51:44] ok then, that narrows it down. it seems that it can't read the blubber syntax at all
[11:02:48] (PS1) Ilias Sarantopoulos: (WIP) articlequality: update to ordinal regression from statsmodels [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055177 (https://phabricator.wikimedia.org/T360455)
[11:03:55] * isaranto lunch!
[11:35:06] kevinbazira: sorry, I was out for lunch, taking a look
[11:39:51] klausman: no problem, it was just an fyi. thank you for checking :)
[11:40:19] Looks like one each of predictors and transformers are crashlooping
[11:40:34] (currently only looking at outlink in eqiad)
[11:41:28] https://phabricator.wikimedia.org/P66811
[11:41:59] Did we maybe break the Python subdir with the src/ move?
[11:42:22] Not sure though why it would only break on one of the replicas
[11:43:17] klausman: we have covered the failing deployment on outlink and it happened during the move indeed. the replica that is working is the old one
[11:43:29] I mean the previous revision/deployment
[11:43:30] ah, I see
[11:44:13] but! we do need your help with updating docker on ml-testing
[11:45:46] or shall I go and give it a try?
[11:45:59] gimme a sec to have a look-see
[11:46:24] How do you propose updating it?
[11:47:10] I would just try sudo apt-get upgrade docker-ce
[11:47:33] and anything else that is needed (don't remember atm)
[11:47:35] docker-ce is the old Bullseye name. The new (current) package is docker.io
[11:47:43] ack!
[11:47:59] v20.10.24+dfsg1-1+b3
[11:48:09] That's the newest Bookworm ships.
[11:49:36] And as far as I can tell, there isn't a newer version that only WMF ships
[11:50:28] FWIW, on my trixie (Debian testing) laptop, docker.io is 20.10.25+dfsg1
[11:51:57] you're right https://packages.debian.org/bookworm/docker.io
[11:52:13] I was misled by https://endoflife.date/docker-engine
[11:53:10] as I have docker 2.26 on my mac
[11:54:31] FIRING: [4x] KubernetesDeploymentUnavailableReplicas: Deployment outlink-topic-model-predictor-default-00019-deployment in articletopic-outlink at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[11:54:50] FIRING: [4x] KubernetesDeploymentUnavailableReplicas: Deployment outlink-topic-model-predictor-default-00019-deployment in articletopic-outlink at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[11:54:56] *26.1
[11:55:47] Machine-Learning-Team: Fix articletopic-outlink CrashLoopBackOff issue - https://phabricator.wikimedia.org/T370408 (kevinbazira) NEW
[11:57:25] So buildx is a plugin for docker, that used to be a separate package, but it seems it's gone from Bookworm
[11:59:04] I reverted the current deployment to get rid of the alerts until we fix the issue https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1055192
[11:59:09] ty!
[12:00:42] ah! the old install used the upstream DEB repo, I'll do that here as well
[12:02:14] I need a review and I'll deploy that to both codfw and eqiad. ty!
[12:06:27] oh thanks both!
[12:06:36] :D
[12:06:55] kevinbazira: can you try a build again?
[12:07:22] (PS1) Kevin Bazira: outlink: match python module usage with other isvcs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055195 (https://phabricator.wikimedia.org/T370408)
[12:08:12] isaranto: I've pushed a patch to fix the CrashLoopBackOff issue: https://gerrit.wikimedia.org/r/1055195
[12:09:26] klausman: it works like a charm. thanks! :)
[12:09:32] excellent
[12:10:11] (CR) CI reject: [V:-1] outlink: match python module usage with other isvcs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055195 (https://phabricator.wikimedia.org/T370408) (owner: Kevin Bazira)
[12:10:16] nice!
[12:11:02] I deployed the revert change so pods are up and running. this service hadn't been deployed for a while (since December!) so the changes for pyopencl were never tested
[12:16:34] (PS3) Nik Gkountas: Recommend articles to translate based on topic [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052950 (https://phabricator.wikimedia.org/T367873) (owner: Santhosh)
[12:17:58] (CR) CI reject: [V:-1] Recommend articles to translate based on topic [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052950 (https://phabricator.wikimedia.org/T367873) (owner: Santhosh)
[12:18:02] RESOLVED: [4x] KubernetesDeploymentUnavailableReplicas: Deployment outlink-topic-model-predictor-default-00019-deployment in articletopic-outlink at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[12:18:10] (CR) Ilias Sarantopoulos: "We'll need to change the aiohttp version in the requirements.txt of the model and the transformer to match that of the python requirements" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055195 (https://phabricator.wikimedia.org/T370408) (owner: Kevin Bazira)
[12:20:30] (PS2) Kevin Bazira: outlink: match python module usage with other isvcs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055195 (https://phabricator.wikimedia.org/T370408)
[12:21:36] (CR) CI reject: [V:-1] outlink: match python module usage with other isvcs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055195 (https://phabricator.wikimedia.org/T370408) (owner: Kevin Bazira)
[12:34:46] (PS3) Kevin Bazira: outlink: match python module usage with other isvcs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055195 (https://phabricator.wikimedia.org/T370408)
[12:41:07] (CR) Kevin Bazira: "thanks, conflicts have been fixed for both aiohttp and PyYAML." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055195 (https://phabricator.wikimedia.org/T370408) (owner: Kevin Bazira)
[12:47:40] ok! I'm reviewing --^
[13:30:06] (CR) Ilias Sarantopoulos: "One more change is needed. I tested this with the new change and it seems to work." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055195 (https://phabricator.wikimedia.org/T370408) (owner: Kevin Bazira)
[13:30:54] kevinbazira: there seem to be some changes which have never been deployed for this server. for example https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1012711
[13:31:32] so let's do some thorough testing in staging before going to prod
[13:32:07] locally I tested it but I had to manually run the python command from the transformer docker so not sure how that plays out in staging/prod
[13:32:16] otherwise the predictor_host argument couldn't be read
[13:32:44] isaranto: yes, I noticed the backlog of undeployed changes
[13:33:08] once we've merged this patch, I'll test on staging
[13:33:22] ok!
[13:33:31] thanks for taking care of this
[13:33:45] no problem. thanks for all the reviews :)
[13:33:47] (CR) Ilias Sarantopoulos: "Done" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055195 (https://phabricator.wikimedia.org/T370408) (owner: Kevin Bazira)
[13:34:56] (CR) Kevin Bazira: [C:+2] outlink: match python module usage with other isvcs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055195 (https://phabricator.wikimedia.org/T370408) (owner: Kevin Bazira)
[13:36:54] kevinbazira: there is still an unresolved issue that fails
[13:37:14] isaranto: just seen it. fixing now :)
[13:37:22] I didn't +1
[13:38:20] tbh I think that we should be doing -1 whenever we want a change and a 0 only if it is a comment, that is the intended way of gerrit code reviews
[13:39:15] (PS4) Kevin Bazira: outlink: match python module usage with other isvcs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055195 (https://phabricator.wikimedia.org/T370408)
[13:40:14] (CR) Kevin Bazira: outlink: match python module usage with other isvcs (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055195 (https://phabricator.wikimedia.org/T370408) (owner: Kevin Bazira)
[13:40:26] (CR) Ilias Sarantopoulos: [C:+1] "Let's give it a try!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055195 (https://phabricator.wikimedia.org/T370408) (owner: Kevin Bazira)
[13:41:13] (CR) Kevin Bazira: [C:+2] "Super! Thanks for the reviews :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055195 (https://phabricator.wikimedia.org/T370408) (owner: Kevin Bazira)
[13:45:16] (Merged) jenkins-bot: outlink: match python module usage with other isvcs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055195 (https://phabricator.wikimedia.org/T370408) (owner: Kevin Bazira)
[14:02:18] just saw that Research are having a focus week, so meeting is cancelled!
[14:02:35] ack!
[14:03:27] isaranto: I've pushed a patch that will test outlink on staging first: https://gerrit.wikimedia.org/r/1055233
[14:06:32] roger
[14:17:46] Well, Diego and I met :)
[14:18:05] isaranto: Diego has questions about the Gemma model, I suspect he'll poke you soon about it
[14:18:53] I just joined, I saw everybody declined the meeting
[14:18:58] :D
[14:19:15] I've talked with diego a bit on the topic, I'll ping him to rejoin if he feels chatty :D
[14:19:24] ack!
[14:19:51] Unrelated heads up: more network switch maintenance, so I'll drain&cordon ml-serve1008 in ~40m
[14:24:12] Machine-Learning-Team, Patch-For-Review: Fix articletopic-outlink CrashLoopBackOff issue - https://phabricator.wikimedia.org/T370408#9994904 (kevinbazira) A new articletopic-outlink image has been deployed in staging and the predictor is now running but the transformer is not: ` kevinbazira@deploy1002:~$...
[14:25:43] isaranto: --^
[14:26:05] the predictor is now running but the transformer isn't
[14:26:18] testing on staging
[14:26:22] ok this is what I experienced when running locally
[14:26:31] in a meeting, will be back with you afterwards
[14:26:43] okok
[14:35:29] is outlink the first/only model we have where the predictor and transformer are separate?
[14:42:46] yes, and our generic entry point:
[14:42:46] https://github.com/wikimedia/machinelearning-liftwing-inference-services/blob/main/model_server_entrypoint.sh#L12
[14:42:46] does not add a required `--predictor_host` flag when calling the transformer:
[14:42:46] https://github.com/wikimedia/machinelearning-liftwing-inference-services/blob/main/src/models/outlink_topic_model/transformer/transformer.py#L167
[14:43:03] So how did this work before?
[14:47:35] back!
[14:47:37] I am a bit confused as to what changed between the current running version and the one that needs said arg
[14:47:56] e.g. kubectl -n articletopic-outlink describe pod outlink-topic-model-transformer-default-00023-deployment-5t2mvh shows the flag being there
[14:48:15] in my case when running locally I run `python3 transformer/transformer.py --model_name outlink-topic-model --predictor_host 172.17.0.2:8080` inside the container
[14:48:45] following the instructions from here https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/KServe#Example_3_-_Testing_outlink-topic-model_%28two_containers%29
[14:49:46] Sure, but predictor_host is mentioned nowhere in the deployment-charts repo, yet is visible in prod as mentioned above.
[14:50:10] ack, just mentioning what I did
[14:50:14] lemme have a look
[15:07:44] * kevinbazira brb
[15:28:10] * kevinbazira back
[15:47:18] I still haven't found anything
[15:48:20] I've looked at the kserve chart and the configuration in deployment-charts
[16:13:57] Machine-Learning-Team: Fix articletopic-outlink CrashLoopBackOff issue - https://phabricator.wikimedia.org/T370408#9995559 (isarantopoulos) This service hasn't been deployed for quite a while ([[ https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/982043 | last deployed change on 1...
[16:15:46] I just added some content in the task above. I need to write down also exactly the places I've already looked and what I've done
[16:17:06] calling it a day on my side, we can continue working on this tomorrow, perhaps Aiko will have more knowledge on this
[16:17:15] ack.
[16:17:18] have a nice evening/rest of day folks o/
[16:17:25] I've spelunked through git history, but found nothing
[16:18:13] me2, also kserve charts and kserve git history, still nothing
[16:18:47] well since there is no problem with the prod service, a fresh look might do the trick 🤞
[16:19:00] Let's hope so. Enjoy your evening
[16:19:34] thanks, u2! the temp in Athens doesn't fall below 30 these days, not even at night :(
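For readers following the `--predictor_host` thread above: a minimal sketch of why a transformer would exit at startup when its entrypoint omits a required flag. It assumes the transformer parses its arguments with `argparse` and marks `--predictor_host` as required; `build_parser` and `try_parse` are hypothetical helpers written for this illustration, not code from the inference-services repo.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the transformer's argument parser."""
    parser = argparse.ArgumentParser(prog="transformer")
    parser.add_argument("--model_name", default="outlink-topic-model")
    # Assumption: the flag is declared as required, as described in the chat.
    parser.add_argument("--predictor_host", required=True)
    return parser


def try_parse(argv):
    """Parse argv; return the namespace, or the exit code if parsing fails."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit as exc:
        # argparse reports a missing required flag on stderr and exits,
        # which a container runtime would see as the process dying.
        return exc.code


if __name__ == "__main__":
    # Entrypoint that forgets the flag: the process exits instead of serving.
    print(try_parse(["--model_name", "outlink-topic-model"]))
    # Manual invocation from the chat, with the flag supplied.
    print(try_parse(["--model_name", "outlink-topic-model",
                     "--predictor_host", "172.17.0.2:8080"]))
```

If the real transformer behaves this way, a process that exits immediately on every start would be consistent with the transformer pod failing in staging while the same image works when the command is run by hand with the flag supplied.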