[06:20:13] hello!
[06:22:16] 10Machine-Learning-Team, 10ORES: Add approvals on Github for all the ORES-related repositories - https://phabricator.wikimedia.org/T281711 (10Legoktm) >>! In T281711#7068832, @elukey wrote: > I took some extra steps: > > * added only ML-team members to the list of users able to push to the master branches of...
[06:29:07] 10Machine-Learning-Team, 10ORES: Add approvals on Github for all the ORES-related repositories - https://phabricator.wikimedia.org/T281711 (10elukey) @Legoktm we just added a step for github repositories that end up in production to ensure that a member of the ML team reviews the patch, it is a compromise to...
[07:04:54] 10Lift-Wing, 10Machine-Learning-Team: Install Istio on ml-serve cluster - https://phabricator.wikimedia.org/T278192 (10elukey) ` FROM docker-registry.wikimedia.org/golang:1.13-3 as build ENV ISTIO_VERSION=1.6.2 ENV SOURCE_REPO=https://github.com/istio/istio.git ENV REPO_BASE=/go/github.com/istio/istio ENV BU...
[09:55:30] just filed https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/688211 as a proof of concept of my current understanding of how we should build docker images
[09:55:45] the Dockerfiles are still not complete; users/etc. will surely need some refinement
[09:56:13] so in theory, for istio, this is what I'd do:
[09:56:30] 1) build the images that we need now via the procedure/repo above
[09:56:56] 2) create a light deb with the istioctl binary, to be used on the deployment server and/or the kubemasters
[09:57:15] the same idea should be applicable to knative
[09:57:38] and eventually to kubeflow/kfserving, even if it might be more complicated
[09:58:23] (for istio the proxyv2 image is still missing, I am working on it now, the Dockerfile from upstream is more complicated)
[09:59:07] does it make sense?
[09:59:16] Let me have a quick peek at the PR
[10:00:06] Yeah, this looks/sounds good.
[10:00:33] GID 1337.
Classic :)
[10:00:49] I found it in the upstream Dockerfile, it's not from me :D
[10:01:50] Do you know its meaning?
[10:02:23] nope!
[10:03:45] if you are ok we could split istio/knative for the moment, to go in parallel, and then we can work together on kubeflow's images
[10:03:59] I hope that after a bit of practice the whole thing will become super easy
[10:04:00] https://en.wikipedia.org/wiki/Leet
[10:04:28] Yeah, we can split this stuff up. Do you keep notes on the Istio efforts so far somewhere?
[10:04:55] ahhh TIL
[10:05:22] yes yes I add everything to https://phabricator.wikimedia.org/T278192
[10:05:29] Excellent
[10:05:43] and I added some info about the docker images used by knative in the related task (at least, the ones popping up in my minikube tests)
[10:06:09] T278194 I presume
[10:06:10] T278194: Install Knative on ml-serve cluster - https://phabricator.wikimedia.org/T278194
[10:06:13] it took me a bit to digest istio's makefile but now I have a clearer picture
[10:06:17] yes exactly!
[10:07:33] Goody gumdrops.
[10:07:44] I'll have lunch and do some reading :)
[10:08:22] super
[10:42:55] * elukey lunch!
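Step 1) of the plan discussed above (build the istio images via the production-images repo, then ship istioctl as a light deb) could look roughly like the multi-stage Dockerfile below. This is a hypothetical sketch, not the actual reviewed patch: only the builder image tag and ISTIO_VERSION pin come from the snippet quoted earlier in the log; the clone path, build target, and runtime base image are assumptions.

```dockerfile
# Hypothetical sketch of an istioctl build image, loosely following the
# multi-stage pattern used in production-images. Paths and the runtime
# base image name are assumptions, not the content of the actual patch.
FROM docker-registry.wikimedia.org/golang:1.13-3 AS build
ENV ISTIO_VERSION=1.6.2
RUN git clone --branch ${ISTIO_VERSION} --depth 1 \
      https://github.com/istio/istio.git /go/src/istio.io/istio
WORKDIR /go/src/istio.io/istio
RUN go build -o /usr/local/bin/istioctl ./istioctl/cmd/istioctl

# Second stage: keep only the binary, e.g. as input for the light deb
# that would then be installed on the deployment server / kubemasters.
FROM docker-registry.wikimedia.org/buster
COPY --from=build /usr/local/bin/istioctl /usr/local/bin/istioctl
ENTRYPOINT ["istioctl"]
```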
[12:21:55] 10Machine-Learning-Team, 10Wikilabels: Translations updates are blocked - https://phabricator.wikimedia.org/T282449 (10Nikerabbit)
[16:25:20] 10Machine-Learning-Team, 10ORES, 10artificial-intelligence, 10Analytics, and 2 others: Generate dump of scored-revisions from 2018-2020 for Wikis except English Wikipedia - https://phabricator.wikimedia.org/T280107 (10Milimetric) p:05Triage→03Medium
[16:27:19] 10Machine-Learning-Team, 10ORES, 10artificial-intelligence, 10Analytics, and 3 others: Generate dump of scored-revisions from 2018-2020 for English Wikipedia - https://phabricator.wikimedia.org/T277609 (10Milimetric) 05Open→03Resolved p:05Triage→03High a:03Milimetric
[16:27:51] 10Machine-Learning-Team, 10Analytics: Configure the Hadoop cluster to use the GPUs available on some workers - https://phabricator.wikimedia.org/T276791 (10Milimetric) p:05Triage→03High
[16:31:57] 10Machine-Learning-Team, 10Analytics: Configure the Hadoop cluster to use the GPUs available on some workers - https://phabricator.wikimedia.org/T276791 (10elukey) 05Open→03Resolved a:03elukey This is done! With T277062 Aiko and Miriam were able to run tensorflow-rocm only on GPU nodes :)
[17:14:36] 10Machine-Learning-Team, 10artificial-intelligence, 10Wikilabels, 10articlequality-modeling: Build article quality model for Dutch Wikipedia - https://phabricator.wikimedia.org/T223782 (10Halfak) Sure! We can even use local templates. Would you be interested in creating templates with badges/colors you l...
[17:34:23] going to add a note about some work that Miriam and Aiko are doing with tensorflow on hadoop for image classification
[17:34:26] (we were out of time)
[17:34:34] https://phabricator.wikimedia.org/T276407
[17:35:13] the distributed training for the neural net works, but we hit a bottleneck in how the weights are periodically exchanged between some nodes
[17:35:25] (namely we saturated 10Gbps of tx bandwidth)
[17:35:47] Aiko and Miriam are working on it, but it is interesting since we might hit the same problem on trainwing
[17:36:00] (all the training above is on hadoop gpu worker nodes)
[17:36:26] elukey: is this using the rocm plugin you showed us?
[17:37:22] accraze: nono it is on hadoop using https://github.com/criteo/tf-yarn, with tensorflow-rocm (the one running on amd gpus)
[17:37:45] we have 6 nodes with GPUs that we can now target directly
[17:38:07] oh cool!
[17:38:30] if they get the training right we might have a test model to run on lift wing!
[17:38:42] it may be really interesting
[17:38:42] that's actually pretty exciting
[17:39:02] yeah if it works, that could open a ton of doors for us
[17:39:39] it seems that distributing the training of neural networks comes at a price; i am wondering if/how kubeflow handles it
[17:39:48] (namely network bw usage etc..)
[17:41:46] I know TensorFlow has distributed training capabilities (TFJobs) and most of the other big frameworks seem to have something similar.
[17:43:19] the real question is how it will integrate with the amd gpus, but so far things seem to look promising :)
[17:43:56] from the tensorflow perspective, just using tensorflow-rocm seems to have worked fine so far
[17:44:10] on kubernetes it will be a different game though
[17:47:05] unrelated: i requested a new gerrit repo for our inference services so kevinbazira and I won't need to send pastes/gists back and forth all the time.
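The 10Gbps tx saturation mentioned above can be sanity-checked with simple arithmetic: in synchronous data-parallel training, each periodic weight exchange moves the full float32 parameter set to every peer. The model size, peer count, and sync rate below are made-up illustration numbers, not measurements from the actual Hadoop job:

```python
def sync_bandwidth_gbps(n_params, n_peers, syncs_per_sec, bytes_per_param=4):
    """Rough tx bandwidth for one worker sending a full float32 weight
    copy to each of its peers, in gigabits per second."""
    bits_per_copy = n_params * bytes_per_param * 8
    return bits_per_copy * n_peers * syncs_per_sec / 1e9

# e.g. a 50M-parameter model pushed to 5 peers (6 GPU nodes total)
# twice per second: 1.6 Gbit per copy * 5 * 2 = 16 Gbps on the wire,
# comfortably above a 10 Gbps NIC.
print(sync_bandwidth_gbps(50_000_000, 5, 2))  # -> 16.0
```

This is why all-reduce style weight exchange (ring all-reduce, gradient compression, or less frequent syncs) is the usual mitigation; whether kubeflow's training operators help here is the open question raised above.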
[17:47:14] https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests
[17:47:56] machinelearning/liftwing/inference-services
[17:48:35] (going afk, ttl!)
[17:48:47] see ya elukey
[22:37:51] 10Lift-Wing, 10Machine-Learning-Team: Load a fastText model in to KFServing - https://phabricator.wikimedia.org/T276862 (10ACraze) Confirming that the Outlinks topic model can indeed be loaded as a custom KFServing inference service to be used by #lift-wing . I was able to package and deploy the model inside...
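The custom KFServing inference service confirmed in the last task comment follows KFServing's custom-model contract: a predictor class exposing load() and predict(request) -> response, wrapped by the KFServing model server. Below is a dependency-free sketch of that contract only; the class name, the toy lookup table, and the responses are invented for illustration (the real service would subclass kfserving.KFModel, load the serialized fastText model, and be started with kfserving.KFServer()):

```python
class OutlinksTopicModel:
    """Minimal stand-in mimicking a KFServing custom predictor.

    Hypothetical sketch: the real service subclasses kfserving.KFModel;
    here the load()/predict() contract is shown without the dependency.
    """

    def __init__(self, name):
        self.name = name
        self.ready = False
        self._model = None

    def load(self):
        # The real load() would deserialize the fastText model from
        # storage; a trivial lookup table stands in for it here.
        self._model = {"cat": "animals", "train": "transport"}
        self.ready = True

    def predict(self, request):
        # KFServing custom predictors receive {"instances": [...]}
        # and return {"predictions": [...]}.
        topics = [self._model.get(inst, "unknown") for inst in request["instances"]]
        return {"predictions": topics}


model = OutlinksTopicModel("outlinks-topic")
model.load()
print(model.predict({"instances": ["cat", "plane"]}))
# -> {'predictions': ['animals', 'unknown']}
```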