[06:20:13] hello!
[06:22:16] 10Machine-Learning-Team, 10ORES: Add approvals on Github for all the ORES-related repositories - https://phabricator.wikimedia.org/T281711 (10Legoktm) >>! In T281711#7068832, @elukey wrote: > I took some extra steps: > > * added only ML-team members to the list of users able to push to the master branches of...
[06:29:07] 10Machine-Learning-Team, 10ORES: Add approvals on Github for all the ORES-related repositories - https://phabricator.wikimedia.org/T281711 (10elukey) @Legoktm we just added a step for github repositories that end up in production to ensure that a member of the ML team reviews the patch, it is a compromise to...
[07:04:54] 10Lift-Wing, 10Machine-Learning-Team: Install Istio on ml-serve cluster - https://phabricator.wikimedia.org/T278192 (10elukey) ` FROM docker-registry.wikimedia.org/golang:1.13-3 as build ENV ISTIO_VERSION=1.6.2 ENV SOURCE_REPO=https://github.com/istio/istio.git ENV REPO_BASE=/go/github.com/istio/istio ENV BU...
[09:55:30] just filed https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/688211 as a proof of concept of my current understanding of how we should build docker images
[09:55:45] the Dockerfiles are still not complete; users/etc. will surely need some refinement
[09:56:13] so in theory, for istio, this is what I'd do:
[09:56:30] 1) build the images that we need now via the procedure/repo above
[09:56:56] 2) create a light deb with the istioctl binary, to be used on the deployment server and/or the kubemasters
[09:57:15] the same idea should be applicable to knative
[09:57:38] and eventually to kubeflow/kfserving, even if it might be more complicated
[09:58:23] (for istio the proxyv2 image is still missing, I am working on it now, the Dockerfile from upstream is more complicated)
[09:59:07] does it make sense?
[09:59:16] Let me have a quick peek at the PR
[10:00:06] Yeah, this looks/sounds good.
[10:00:33] GID 1337.
Classic :)
[10:00:49] I found it in the upstream Dockerfile, it's not from me :D
[10:01:50] Do you know its meaning?
[10:02:23] nope!
[10:03:45] if you are ok we could split istio/knative for the moment, to go in parallel, and then we can work together on kubeflow's images
[10:03:59] I hope that after a bit of practice the whole thing will become super easy
[10:04:00] https://en.wikipedia.org/wiki/Leet
[10:04:28] Yeah, we can split this stuff up. Do you keep notes on the Istio efforts so far somewhere?
[10:04:55] ahhh TIL
[10:05:22] yes yes I add everything to https://phabricator.wikimedia.org/T278192
[10:05:29] Excellent
[10:05:43] and I added some info about the docker images used by knative in the related task (at least, the ones popping up in my minikube tests)
[10:06:09] T278194 I presume
[10:06:10] T278194: Install Knative on ml-serve cluster - https://phabricator.wikimedia.org/T278194
[10:06:13] it took me a bit to digest istio's makefile but now I have a clearer picture
[10:06:17] yes exactly!
[10:07:33] Goody gumdrops.
[10:07:44] I'll have lunch and do some reading :)
[10:08:22] super
[10:42:55] * elukey lunch!
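Step 1) of the plan discussed above (build the istio images via the production-images repo, then ship istioctl as a light deb) could look roughly like the multi-stage Dockerfile below. This is a hypothetical sketch, not the actual reviewed patch: only the builder image tag and ISTIO_VERSION pin come from the snippet quoted earlier in the log; the clone path, build target, and runtime base image are assumptions.

```dockerfile
# Hypothetical sketch of an istioctl build image, loosely following the
# multi-stage pattern used in production-images. Paths and the runtime
# base image name are assumptions, not the content of the actual patch.
FROM docker-registry.wikimedia.org/golang:1.13-3 AS build
ENV ISTIO_VERSION=1.6.2
RUN git clone --branch ${ISTIO_VERSION} --depth 1 \
      https://github.com/istio/istio.git /go/src/istio.io/istio
WORKDIR /go/src/istio.io/istio
RUN go build -o /usr/local/bin/istioctl ./istioctl/cmd/istioctl

# Second stage: keep only the binary, e.g. as input for the light deb
# that would then be installed on the deployment server / kubemasters.
FROM docker-registry.wikimedia.org/buster
COPY --from=build /usr/local/bin/istioctl /usr/local/bin/istioctl
ENTRYPOINT ["istioctl"]
```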
[12:21:55] 10Machine-Learning-Team, 10Wikilabels: Translations updates are blocked - https://phabricator.wikimedia.org/T282449 (10Nikerabbit)
[16:25:20] 10Machine-Learning-Team, 10ORES, 10artificial-intelligence, 10Analytics, and 2 others: Generate dump of scored-revisions from 2018-2020 for Wikis except English Wikipedia - https://phabricator.wikimedia.org/T280107 (10Milimetric) p:05Triage→03Medium
[16:27:19] 10Machine-Learning-Team, 10ORES, 10artificial-intelligence, 10Analytics, and 3 others: Generate dump of scored-revisions from 2018-2020 for English Wikipedia - https://phabricator.wikimedia.org/T277609 (10Milimetric) 05Open→03Resolved p:05Triage→03High a:03Milimetric
[16:27:51] 10Machine-Learning-Team, 10Analytics: Configure the Hadoop cluster to use the GPUs available on some workers - https://phabricator.wikimedia.org/T276791 (10Milimetric) p:05Triage→03High
[16:31:57] 10Machine-Learning-Team, 10Analytics: Configure the Hadoop cluster to use the GPUs available on some workers - https://phabricator.wikimedia.org/T276791 (10elukey) 05Open→03Resolved a:03elukey This is done! With T277062 Aiko and Miriam were able to run tensorflow-rocm only on GPU nodes :)
[17:14:36] 10Machine-Learning-Team, 10artificial-intelligence, 10Wikilabels, 10articlequality-modeling: Build article quality model for Dutch Wikipedia - https://phabricator.wikimedia.org/T223782 (10Halfak) Sure! We can even use local templates. Would you be interested in creating templates with badges/colors you l...
[17:34:23] going to add a note about some work that Miriam and Aiko are doing with tensorflow on hadoop for image classification
[17:34:26] (we were out of time)
[17:34:34] https://phabricator.wikimedia.org/T276407
[17:35:13] the distributed training for the neural net works, but we hit a bottleneck in how the weights are periodically exchanged between some nodes
[17:35:25] (namely we saturated 10Gbps of tx bandwidth)
[17:35:47] Aiko and Miriam are working on it, but it is interesting since we might hit the same problem on trainwing
[17:36:00] (all the training above is on hadoop gpu worker nodes)
[17:36:26] elukey: is this using the rocm plugin you showed us?
[17:37:22] accraze: nono it is on hadoop using https://github.com/criteo/tf-yarn, with tensorflow-rocm (the one running on amd gpus)
[17:37:45] we have 6 nodes with GPUs that we can now target directly
[17:38:07] oh cool!
[17:38:30] if they get the training right we might have a test model to run on lift wing!
[17:38:42] it may be really interesting
[17:38:42] that's actually pretty exciting
[17:39:02] yeah if it works, that could open a ton of doors for us
[17:39:39] it seems that distributing the training of neural networks comes at a price; i am wondering if/how kubeflow handles it
[17:39:48] (namely network bw usage etc..)
[17:41:46] I know TensorFlow has distributed training capabilities (TFJobs) and most of the other big frameworks seem to have something similar.
[17:43:19] the real question is how it will integrate with the amd gpus, but so far things seem to look promising :)
[17:43:56] from the tensorflow perspective, just using tensorflow-rocm seems to have worked fine so far
[17:44:10] on kubernetes it will be a different game though
[17:47:05] unrelated: i requested a new gerrit repo for our inference services so kevinbazira and I won't need to send pastes/gists back and forth all the time.
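The 10Gbps tx saturation mentioned above can be sanity-checked with simple arithmetic: in synchronous data-parallel training, each periodic weight exchange moves the full float32 parameter set to every peer. The model size, peer count, and sync rate below are made-up illustration numbers, not measurements from the actual Hadoop job:

```python
def sync_bandwidth_gbps(n_params, n_peers, syncs_per_sec, bytes_per_param=4):
    """Rough tx bandwidth for one worker sending a full float32 weight
    copy to each of its peers, in gigabits per second."""
    bits_per_copy = n_params * bytes_per_param * 8
    return bits_per_copy * n_peers * syncs_per_sec / 1e9

# e.g. a 50M-parameter model pushed to 5 peers (6 GPU nodes total)
# twice per second: 1.6 Gbit per copy * 5 * 2 = 16 Gbps on the wire,
# comfortably above a 10 Gbps NIC.
print(sync_bandwidth_gbps(50_000_000, 5, 2))  # -> 16.0
```

This is why all-reduce style weight exchange (ring all-reduce, gradient compression, or less frequent syncs) is the usual mitigation; whether kubeflow's training operators help here is the open question raised above.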
[17:47:14] https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests
[17:47:56] machinelearning/liftwing/inference-services
[17:48:35] (going afk, ttl!)
[17:48:47] see ya elukey
[22:37:51] 10Lift-Wing, 10Machine-Learning-Team: Load a fastText model in to KFServing - https://phabricator.wikimedia.org/T276862 (10ACraze) Confirming that the Outlinks topic model can indeed be loaded as a custom KFServing inference service to be used by #lift-wing . I was able to package and deploy the model inside...
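The custom KFServing inference service confirmed in the last task comment follows KFServing's custom-model contract: a predictor class exposing load() and predict(request) -> response, wrapped by the KFServing model server. Below is a dependency-free sketch of that contract only; the class name, the toy lookup table, and the responses are invented for illustration (the real service would subclass kfserving.KFModel, load the serialized fastText model, and be started with kfserving.KFServer()):

```python
class OutlinksTopicModel:
    """Minimal stand-in mimicking a KFServing custom predictor.

    Hypothetical sketch: the real service subclasses kfserving.KFModel;
    here the load()/predict() contract is shown without the dependency.
    """

    def __init__(self, name):
        self.name = name
        self.ready = False
        self._model = None

    def load(self):
        # The real load() would deserialize the fastText model from
        # storage; a trivial lookup table stands in for it here.
        self._model = {"cat": "animals", "train": "transport"}
        self.ready = True

    def predict(self, request):
        # KFServing custom predictors receive {"instances": [...]}
        # and return {"predictions": [...]}.
        topics = [self._model.get(inst, "unknown") for inst in request["instances"]]
        return {"predictions": topics}


model = OutlinksTopicModel("outlinks-topic")
model.load()
print(model.predict({"instances": ["cat", "plane"]}))
# -> {'predictions': ['animals', 'unknown']}
```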