[09:45:21] hello people
[09:57:26] I have been watching the apply() conference talks and there was one from Etsy that caught my attention
[09:57:51] https://www.applyconf.com/agenda/towards-a-unified-real-time-ml-data-pipeline-from-training-to-serving/
[09:58:07] their approach is different, IIUC, from the online vs offline feature store distinction
[09:58:37] I think that they use Cassandra (or similar) as a common storage layer, plus a self-made registry + schema
[09:58:54] and they load feature datasets to Cassandra via Spark (from I guess something like Hadoop or similar)
[09:59:22] (and the load can happen in any way, like Airflow etc..)
[09:59:43] the general assumption for hw orders for us has been
[09:59:53] 1) HDFS will be used as "offline" storage (or maybe hive)
[10:00:15] 2) the online storage will be used only for serving and it will be Redis based (so no big needs of storage capabilities)
[10:00:52] if this is not the case, we should think about changing the feature store hosts to something that can hold more disks
[10:01:31] it is very difficult to get this right, so fuzzy and high level
[10:01:46] ideally I'd love to pull data from Hive during training
[10:02:14] but practically feast doesn't support it yet (we can contribute code) and kubernetes + kerberos + hadoop seems to be nasty
[10:02:22] (see https://engineering.linkedin.com/blog/2020/open-sourcing-kube2hadoop)
[10:03:05] so having a common storage layer where we push stuff, using some auth different from Kerberos might be time saving
[10:03:24] but we'd likely duplicate resources a lot
[10:05:11] (the kerberos mess might be solvable with an easier solution)
[10:06:33] anyway lemme know your thoughts :)
[10:20:06] Lift-Wing, Machine-Learning-Team: Install Istio on ml-serve cluster - https://phabricator.wikimedia.org/T278192 (elukey) Joe gave me a nice pointer in production-images, namely the loki multi-stage container example.
Basically the idea is to build go binaries in one container first, then use them for the...
[12:57:19] kevinbazira o/ how are we going to do it? Shared session on meet?
[12:57:26] (going to be ready in 5 mins)
[12:57:46] Yep ... elukey o/
[14:10:46] Victory! ORES deploy looks good thanks to elukey and klausman
[14:13:10] nice job!
[15:19:24] It's been an hour and the graphs still look okay. I also haven't seen any complaints from users.
[15:20:42] klausman: the scores errored in codfw are too high :(
[15:20:54] there is a problem ongoing
[15:20:55] Oh, I missed that, let me have a look.
[15:21:56] there is stuff like
[15:21:57] Task ores.scoring_systems.celery_queue._process_score_map[698f6bda-08d1-4ff2-875c-3ae89ab9527b] raised unexpected: AttributeError("'BinomialDeviance' object has no attribute 'get_init_raw_predictions'",)
[15:22:01] in logstash
[15:22:02] If you go to the 6h view, it doesn't look unusual
[15:22:04] kevinbazira: --^
[15:22:30] Oh, but if you only show codfw, it's very obvious
[15:22:36] exactly yes
[15:22:40] otherwise it seems good
[15:22:42] eqiad is fine
[15:22:52] so it may be some specific traffic being impacted
[15:23:31] The scores errored graph is also spiking: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&refresh=1m
[15:25:25] https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-2021.05.05?id=5qMhPXkBfVMx58vqE8EF
[15:25:30] the above is an example of the error
[15:25:58] ORES, artificial-intelligence, articlequality-modeling, drafttopic-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723 (kevinbazira) The ORES deployment has been completed. Thanks to @elukey and @klausman. In case there ar...
[15:27:56] ORES, artificial-intelligence, articlequality-modeling, drafttopic-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723 (elukey) There seems to be a regression in scores errored mostly in codfw (ORES is active/active), so so...
[15:27:59] kevinbazira: --^
[15:28:03] I added the stacktrace
[15:29:14] this might be related to the warning that we were seeing in celery
[15:29:51] Looking at the stacktrace
[15:30:51] ORES, artificial-intelligence, articlequality-modeling, drafttopic-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723 (elukey) ` elukey@ores2001:~$ sudo journalctl -u celery-ores-worker.service | grep Warning May 05 13:59...
[15:34:29] it seems some issue with sklearn
[15:34:45] but it is weird, we shouldn't have changed deps
[15:36:41] yeah the frozen requirements were not touched afaics
[15:42:37] and most of the errors that I see are for viwiki
[15:43:56] kevinbazira: did we modify viwiki?
[15:44:20] checking ...
[15:47:42] I think that we either rollback or we find a quick patch
[15:51:10] klausman: can you imagine that I wasn't aware of "git show" ?
[15:51:27] now I question my working life
[15:52:02] I suppose that after some years working with you I'll fill my horrible gaps :D
[15:53:10] elukey: I've checked and looks like viwiki wasn't modified
[15:53:13] https://github.com/wikimedia/articlequality/search?q=viwiki&type=commits
[15:53:13] https://github.com/wikimedia/editquality/search?q=viwiki&type=commits
[15:53:13] https://github.com/wikimedia/draftquality/search?q=viwiki&type=commits
[15:53:40] kevinbazira: it is my impression too, I was checking git commits and I don't see it
[15:53:45] ORES, artificial-intelligence, articlequality-modeling, drafttopic-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723 (Halfak) Can we figure out what request caused this error? It's likely that a model was accidentally...
[15:54:14] elukey: maybe I should have a "Linux command line thing you may not know about" weekly series :D
[15:54:58] ORES, artificial-intelligence, articlequality-modeling, drafttopic-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723 (elukey) @Halfak I see mostly `'model_names': ['reverted', 'articletopic']` for `viwiki` in codfw..
[15:55:06] ORES, artificial-intelligence, articlequality-modeling, drafttopic-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723 (Halfak) Aha! Looks like https://ores-beta.wmflabs.org/v3/scores/viwiki/123125/articletopic raises the...
[15:55:21] https://github.com/wikimedia/drafttopic/search?q=viwiki&type=commits
[15:56:08] klausman: ahahah how many hours do you have to spare??
[15:57:21] ORES, artificial-intelligence, articlequality-modeling, drafttopic-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723 (elukey) In my opinion we should rollback, work on a patch and re-rollout when we are ok, doing more tes...
[15:57:26] ORES, artificial-intelligence, articlequality-modeling, drafttopic-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723 (Halfak) I'll try to find some time this evening to rebuild the viwiki model with the right version of s...
[16:02:49] ORES, artificial-intelligence, articlequality-modeling, drafttopic-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723 (elukey) @Halfak what is the likelihood that other models have the same issues, but we haven't seen erro...
[16:05:41] ORES, artificial-intelligence, articlequality-modeling, drafttopic-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723 (Halfak) The pipelines are documented/automated in the relevant Makefiles. E.g. if you install the depe...
[16:07:05] ORES, artificial-intelligence, articlequality-modeling, drafttopic-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723 (elukey) >>! In T278723#7062438, @Halfak wrote: > The pipelines are documented/automated in the relevan...
[16:32:31] ORES, artificial-intelligence, articlequality-modeling, drafttopic-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723 (Halfak) Sorry I missed one of your other questions. >>! In T278723#7062423, @elukey wrote: > @Halfak...
[17:19:25] ORES, artificial-intelligence, articlequality-modeling, drafttopic-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723 (Halfak) Found a few minutes. Rebuild in progress.
[17:48:39] * elukey afk!
[17:48:41] o/
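
Note on the AttributeError seen in the log: the channel suspected an sklearn issue tied to the viwiki model, and the fix discussed was a model rebuild. One possible guard against this class of failure, sketched here under assumptions and not taken from the ORES codebase, is to store the library version used at training time next to the serialized model and refuse to load it when the runtime version differs. The names `dump_model`, `load_model` and `ModelVersionMismatch` are hypothetical, the version strings are placeholders, and a plain dict stands in for a real model so the sketch needs only the standard library.

```python
import pickle


class ModelVersionMismatch(RuntimeError):
    """Hypothetical error: model serialized under another library version."""


def dump_model(model, library_version):
    # Keep the training-time library version next to the model itself.
    return pickle.dumps({"library_version": library_version, "model": model})


def load_model(blob, installed_version):
    # Compare the recorded version with the runtime one before returning
    # the model, so a mismatch fails at load time rather than while scoring.
    payload = pickle.loads(blob)
    trained_with = payload["library_version"]
    if trained_with != installed_version:
        raise ModelVersionMismatch(
            f"model built with {trained_with}, runtime has {installed_version}"
        )
    return payload["model"]


if __name__ == "__main__":
    # Placeholder versions and a stand-in "model"; nothing here is real data.
    blob = dump_model({"weights": [0.1, 0.2]}, library_version="1.0")
    print(load_model(blob, installed_version="1.0"))
    try:
        load_model(blob, installed_version="2.0")
    except ModelVersionMismatch as err:
        print("refusing to load:", err)
```

In a real deployment the installed version could be read from `sklearn.__version__`, and a check like this could run for every model as a pre-deploy smoke test, so that a mismatch surfaces before traffic reaches the workers.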