[14:12:41] Machine-Learning-Team, Analytics-Radar, SRE: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (fkaelin) I created a separate [[ https://docs.google.com/document/d/1Nffi3jUojC3BGNHkm2TyG7k5x30_7nzuPqgZ_tBeWNM/edit# | document ]] to discuss some of the bigger questions around orche...
[15:26:44] elukey: That seems like a good move, but probably unnecessary right now. Can we add it to the backlog as an option in case we run into ORES memory issues?
[15:27:30] I think, from a strategic view, I want to focus on Lift Wing now, but I do want to maintain a backlog of ORES updates we can do, because that system will continue to be around for 18+ months.
[15:37:36] Machine-Learning-Team, Analytics: Configure the Hadoop cluster to use the GPUs available on some workers - https://phabricator.wikimedia.org/T276791 (10fkaelin) I don't think splitting the GPU machines from the yarn cluster is a far-fetched idea, especially given the hurdles of making this work with yarn -...
[15:43:26] I talked to Seve about the API gateway; their plan is for it to be self-service. This actually opens up an interesting possibility: eventually configuring Lift Wing to automatically set up new models in the API gateway.
[15:43:51] I'm going to set up a meeting with Hugh and all of us to walk through the API gateway and see what we need.
[16:24:14] I'm neck-deep in annual planning for the next three weeks, so if you need anything I might be a little slow in replying.
[16:47:10] Machine-Learning-Team, Analytics: Configure the Hadoop cluster to use the GPUs available on some workers - https://phabricator.wikimedia.org/T276791 (elukey) @fkaelin I completely get your point, there is a bit of history behind the hadoop worker nodes with GPUs. They were bought when the ML team was not...
[17:03:08] Machine-Learning-Team, Analytics-Radar, SRE: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (elukey) >>! In T275551#6913820, @fkaelin wrote: > I created a separate [[ https://docs.google.com/document/d/1Nffi3jUojC3BGNHkm2TyG7k5x30_7nzuPqgZ_tBeWNM/edit# | document ]] to discuss...
[17:33:53] Machine-Learning-Team: Investigate separating k8s-level users between our k8s and the ServiceOps k8s - https://phabricator.wikimedia.org/T277492 (klausman) p: Triage → Medium
[17:35:07] elukey: ^^^ this is one for after I'm back from vacation. It's definitely not needed for the POC, but likely for the MVP.
[17:38:26] klausman: we can also sync about what I can do while you are away; I thought I'd have a quick test of Istio, but I can do other things :)
[17:39:30] Yeah, we should have a "desync" :) on Friday or so
[17:40:37] https://istio.io/latest/docs/setup/install/helm/ seems very promising, in theory we should be able to use Helm (3)
[17:41:28] and I guess that all Istio-related Docker images will have to be imported/vetted into our Docker registry beforehand
[17:43:18] https://istio.io/latest/about/supported-releases/ also shows that Istio 1.9 might not work on Kubernetes 1.16, while 1.8 should work on it
[17:43:30] elukey: yeah, I saw the Helm docs for Istio last week - definitely looks promising
[17:43:46] Nice. Definitely worth a closer look
[17:44:34] as for k8s versions, we already have some code that allows for normal/newer package selection. We could probably go further and have more subvariants, though obviously only if there is a compelling reason for a version zoo
[17:44:44] https://www.kubeflow.org/docs/components/serving/kfserving/#standalone-kfserving seems to suggest that Istio 1.3+ is required, so maybe targeting 1.8 is good?
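For context, the Helm-based install discussed above might look roughly like the following sketch, assuming an Istio 1.8 release tarball has already been mirrored locally and its images vetted into the local Docker registry; the version number and chart paths follow the upstream 1.8 Helm docs, but everything here is illustrative, not the team's actual rollout.

```shell
#!/bin/sh
# Hypothetical sketch: install the Istio 1.8 control plane with Helm 3.
# Assumes the istio-<version> release tarball (with its bundled charts)
# was already downloaded and unpacked; images vetted/mirrored beforehand.
ISTIO_VERSION="1.8.4"   # assumed target; 1.9 may not support k8s 1.16
CHART_DIR="istio-${ISTIO_VERSION}/manifests/charts"

# Base CRDs/cluster-wide resources first, then the istiod control plane.
# Guarded so the sketch is a no-op on machines without helm installed.
if command -v helm >/dev/null 2>&1; then
    kubectl create namespace istio-system
    helm install istio-base "${CHART_DIR}/base" -n istio-system
    helm install istiod "${CHART_DIR}/istio-control/istio-discovery" -n istio-system
fi

echo "would install Istio ${ISTIO_VERSION} charts from ${CHART_DIR}"
```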
[17:45:38] yeah, we'll have to experiment with the versions a bit; I remember there were some gotchas with Knative and Istio mismatches
[17:45:40] klausman: the main blocker for us will be, I think, https://hub.docker.com/r/rocm/k8s-device-plugin for Train Wing, since it supports only 1.18+
[17:45:55] accraze: I am pretty sure it will be like playing Jenga :D
[17:46:09] lol
[17:46:29] so IIUC the ServiceOps team will migrate next fiscal year (at some point) to 1.20
[17:46:38] we'll likely be the first testers
[17:47:01] of course I expect the ROCm plugin not to work with 1.20, so it will be even more fun to follow up with upstream :D
[17:49:07] sorry, the correct link for the plugin is https://github.com/RadeonOpenCompute/k8s-device-plugin
[17:49:29] * elukey bbiab
[18:59:23] Lift-Wing, Machine-Learning-Team: Load a fastText model in to KFServing - https://phabricator.wikimedia.org/T276862 (Isaac) FYI some context on fastText and why I use it: in my experience, fastText is way way faster to train than any other library I've tried (without needing GPUs) and perhaps more import...
[20:50:13] Lift-Wing, Machine-Learning-Team: Load a fastText model in to KFServing - https://phabricator.wikimedia.org/T276862 (ACraze) @Isaac, thank you for providing more context about fastText. I did some initial work on loading your model into KFServing last week and I am not anticipating any major issues so far...
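For context on the fastText/KFServing work above, a deployment of a custom predictor in KFServing v1beta1 might look roughly like this manifest sketch; the service name, container image, and registry are placeholders, not the actual artifacts from T276862.

```yaml
# Hypothetical sketch of a KFServing (v1beta1) InferenceService that runs
# a custom fastText predictor container. Image name and model name are
# placeholders; the real task may use different names and a model URI.
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: fasttext-example
spec:
  predictor:
    containers:
      - name: kfserving-container
        image: docker-registry.example/fasttext-kfserving:latest  # placeholder
        args:
          - --model_name=fasttext-example
```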