[00:04:30] https://ihadanny.wordpress.com/2014/12/01/python-virtualenv-with-pig-streaming/
[00:23:39] 10Scoring-platform-team-Backlog: [Investigate] Hadoop integration for ORES training - https://phabricator.wikimedia.org/T170650#3438153 (10awight)
[00:23:42] 10Scoring-platform-team, 10MediaWiki-extensions-ORES, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017), 10Patch-For-Review: Summarize what it will take to separate product and platform for ORES Extension - https://phabricator.wikimedia.org/T167908#3438167 (10jmatazzoni)
[00:28:52] awight: yes i'm already doing it for search, but not with scikit-learn
[00:30:33] awight: you don't need to port anything, you can run standard python code on hadoop with pyspark as long as you can break up the task somehow (the individual hyperparameters of a model training being an obvious spot)
[00:31:05] it would look something like: sc.parallelize(list_of_parameters_to_run).map(function_taking_parameters_and_returning_something).collect()
[00:31:24] each item in the parallelized list will be provided to the function on an executor
[00:32:38] i suppose the mildly annoying part is you need to find a reasonable way to get your training data there, but it's doable
[00:34:55] if the data is small enough (10s of MB?) you can just broadcast it and call it a day
[00:36:01] ebernhardson: I like it. Yeah, training data with cached features is about 150MB, maybe fine to broadcast.
[00:36:53] We're currently splitting threads on cross-validation folds, and can extract features in parallel.
[00:37:03] Single models are being trained on 1 CPU, though.
[00:37:51] The biggest win AFAICT would be to be able to queue lots of jobs with differing numbers of workers, and keep our hardware fully employed.
[00:38:34] the problem is you can't just take the whole hadoop cluster :P there are scheduled jobs that run hourly or whatever that need hundreds of CPU cores and hundreds of GB of memory :P
[00:39:13] hehe, we'd be fine with a quota
[00:39:27] or even "nice 19"
[00:39:35] but something can be worked out. Generally what i'm doing in search is that one model for feature engineering (~150MB training data) is trained on 4 cores, 5 CV folds are trained in parallel, and then 20 hyperparameters are tested at a time
[00:40:07] hadoop doesn't really have a 'nice 19' kind of concept, instead when you run something you ask for X executors with M GB of memory and N CPU cores, and it gives them to you (or not, depending on how busy things are)
[00:40:14] those are then your memory and your CPU cores until you give them back
[00:40:36] This might not win us friends :)
[00:41:20] We have our own hardware, so in theory we could run hadoop there... I have no idea what the ops overhead would look like.
[00:41:39] generally how i've been going about it so far is my job asks for 100 executors with 4GB of memory and 4 CPU cores, and runs for ~20 minutes and shuts down. This is basically a full training run with many hundreds of models trained to choose a set of hyperparameters for the current feature set
[00:42:04] * awight warms hands by the data center's exterior wall
[00:42:08] :)
[00:44:04] if you have your own hardware a job queue can get you roughly the same thing, hadoop was convenient for me because it's an existing system with just shy of 1000 cores and 2.2TB of memory that i can just grab resources from and return
[00:44:36] It feels wrong that we would have an empire of celery for production, but then use hadoop for training...
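(The parallelize-and-broadcast pattern ebernhardson sketches above would look roughly like this in PySpark. This is a minimal sketch assuming a live SparkContext, scikit-learn installed on the executors, and a synthetic stand-in dataset; it is not mjolnir or ORES code.)

```python
# Minimal sketch: broadcast a smallish training set once, then fan
# hyperparameter sets out to executors. The dataset and parameter grid
# are illustrative stand-ins.
from pyspark import SparkContext
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

sc = SparkContext(appName="hyperparameter-search")

# ~150MB of training data is small enough to broadcast to every executor.
X, y = make_classification(n_samples=5000, weights=[0.97, 0.03],
                           random_state=0)
data = sc.broadcast((X, y))

param_grid = [{"learning_rate": lr, "max_depth": d}
              for lr in (0.01, 0.1, 0.5)
              for d in (3, 5, 7)]

def evaluate(params):
    # Runs on an executor; each item of the parallelized list lands here.
    X, y = data.value
    model = GradientBoostingClassifier(**params)
    model.fit(X[:4000], y[:4000])
    return params, model.score(X[4000:], y[4000:])

# One task per parameter set; scores come back to the driver.
results = sc.parallelize(param_grid).map(evaluate).collect()
best_params, best_score = max(results, key=lambda r: r[1])
```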
[00:44:57] yeah I was thinking something as simple as a shared python multiprocessing pool
[00:46:02] I guess we already have divergent tech stacks for training and production
[00:47:10] i'm somewhat surprised you're training 150MB sets on a single core though, my sets for feature engineering are about 150MB (prod models closer to 2.5GB) and it still takes some time to train the ~900 models for hyperparameter tuning with CV
[00:47:32] This is how we multiprocess CV: https://github.com/wiki-ai/revscoring/blob/master/revscoring/scoring/models/model.py#L220
[00:49:02] i mean, at a high level it all looks the same :P https://github.com/wikimedia/search-MjoLniR/blob/master/mjolnir/training/tuning.py#L231
[00:49:18] I'll have to learn more about our hyperparameter tuning, AIUI we're tuning manually.
[00:50:10] mine isn't that much different, basically just define a space to search, search it, move on: https://github.com/wikimedia/search-MjoLniR/blob/master/mjolnir/training/xgboost.py#L486-L504
[00:50:39] i think mine is slightly different in that it's somewhat common to tune all the parameters at once, while i'm tuning just a couple of related parameters at a time to keep down the exponential explosion of possibilities
[00:51:06] :) pprint ftw
[00:51:26] :)
[00:51:31] ty, I need to run but will be sure to bother you next week.
[00:51:57] sure
[04:36:28] 10Scoring-platform-team-Backlog: [Investigate] Hadoop integration for ORES training - https://phabricator.wikimedia.org/T170650#3438380 (10awight)
[14:29:18] * halfak reads about http://bkmla.org/
[16:35:00] O:
[16:35:04] O/*
[17:05:50] halfak: wanted to report about my epic fail :) log2 seems to work better than None. The ROC-AUC of GBC for ruwiki damaging edits was 0.934, which is lower than the benchmark 0.936 (I ran the log2 version on Tuesday).
[18:07:06] fajne, good to know! Would you write that up and we'll make a little report for it?
[18:07:16] sorry to be AFK. I had a lunch that went late.
[18:07:22] Hi halfak :)
[18:07:33] I'm just about to change locations again so I'll be AFK for ~30 mins
[18:07:35] o/ Zppix
[18:07:36] How's the logstash thing coming?
[18:07:54] Not quite sure. I haven't really been able to look at it since mutante's note.
[18:08:03] I'll be around to talk more about it soon :)
[18:08:03] o/
[18:08:05] Ok
[18:08:08] O/
[18:15:41] 10Scoring-platform-team, 10Edit-Review-Improvements-RC-Page, 10ORES, 10Collaboration-Team-Triage (Collab-Team-Q4-Apr-Jun-2017), and 2 others: Conform ORES sensitivity levels to the new ERI standards - https://phabricator.wikimedia.org/T160575#3439860 (10jmatazzoni) @Etonkovidova notes: > 2) Likely have pr...
[18:34:16] o/
[18:34:18] Zppix & fajne: just got to the coffee shop :)
[18:34:20] So I'm here for several hours
[18:34:47] ok
[18:35:37] can you point me to a similar report maybe?
[18:36:39] Hmm.. yeah. Let me find one.
[18:37:29] https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_draft_quality/Work_log/2016-12-01
[18:37:36] https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_draft_quality/Work_log/2016-12-03
[18:37:57] Totally casual, captures results and the thoughts you had along the way.
[18:37:59] Last night, I read up on GBC params tuning and then on the imbalanced classes problem. Hence, a quick question: did you consider balancing your training set (up to 50/50)?
[18:38:13] No way we're going to do that for vandalism
[18:38:18] It's too infrequent.
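(A shared python multiprocessing pool over CV folds, as awight suggests above and as the linked revscoring model.py does, would look roughly like this. A minimal sketch with a synthetic stand-in dataset; not the actual revscoring implementation.)

```python
# Minimal sketch of parallelizing cross-validation across a process pool,
# in the spirit of the revscoring model.py link above; the dataset and
# model settings are illustrative stand-ins.
from multiprocessing import Pool

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=2000, weights=[0.97, 0.03],
                           random_state=0)

def train_fold(fold):
    # Each fold gets its own process; the model itself stays single-core.
    train_idx, test_idx = fold
    model = GradientBoostingClassifier()
    model.fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
    return roc_auc_score(y[test_idx], scores)

if __name__ == "__main__":
    folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))
    with Pool(processes=5) as pool:  # one worker per CV fold
        aucs = pool.map(train_fold, folds)
    print("mean ROC-AUC:", sum(aucs) / len(aucs))
```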
[18:38:42] We do it a lot for article quality -- where the classes can be balanced more because we have enough observations of the top quality classes.
[18:39:14] another one: is switching to an anomaly detection algorithm a good idea?
[18:40:57] for what?
[18:41:08] And what do you mean by "anomaly detection algorithm"?
[18:41:11] Like, which one?
[18:41:48] 1 sec
[18:42:21] https://www.semanticscholar.org/paper/Efficient-Anomaly-Detection-by-Isolation-Using-Nea-Bandaragoda-Ting/764896ad374224e53d3cafe9575ca58487724802
[18:44:10] it's just a suggestion from one blog I found interesting. not sure it's worth exploring, so just checking with you
[18:44:26] since vandalism is so infrequent...
[18:45:07] I'm here to talk about logstash if you want
[18:45:07] Ok
[18:45:56] also, I had a feeling that balancing a training set is actually a solution for imbalanced classes problems. not sure I understand why it's a "no way" thing here..
[18:47:15] halfak: maybe i'll just put my stupid thoughts into this report and we discuss it on monday, if you have other (practical) things to do today))
[18:48:55] btw, what's the total number of features for GBC we have? I found 9 for eng and 11 for rus
[18:50:09] Zppix, homework for you is to figure out how logstash is picking up any of our logs. I think we found something in gerrit or git last time that was helpful
[18:50:56] fajne, the anomaly detection could be interesting. Maybe we can do that half-way by engineering some new features that should carry high signal for anomalies.
[18:51:10] E.g. measuring the dominant unicode char ranges for a wiki
[18:51:34] Adding chars that are way out of the range of the chars that are common in that wiki probably should get flagged for review.
[18:52:57] halfak: i'll try :/ but i have zero clue
[18:53:03] yep, modifying an existing algo to be more sensitive to a rare class is also one of the classical solutions I read about. No idea though what this feature would be)))
[18:53:34] halfak: depending on which thing is sending to logstash ... for example amir started sending ores logs to logstash with https://gerrit.wikimedia.org/r/#/c/321096/11/modules/service/manifests/uwsgi.pp
[18:53:43] basically it just stuffs python logging on a socket that logstash reads
[18:54:19] fajne, one thing we do is tell the GB algorithm to weight the lower frequency class highly.
[18:54:22] s/halfak/zppix/
[18:54:29] Thanks ebernhardson :D
[18:54:54] * ebernhardson sees pings when people mention logstash :P
[18:55:32] :D Glad to have you around to give us a hand!
[18:55:35] Homework is done (i did it all, no one helped xD jk). Now I'd prefer if someone else set it up; I would, but I'm uncomfortable
[18:55:55] Zppix, OK we can find something else for you to hack on.
[18:56:15] I can make sure it works and it's done right
[18:56:26] It's just the logstash code I'm not familiar with
[18:56:37] Zppix, will you add ebernhardson's notes to the task?
[18:56:45] Yes
[18:56:51] What's the task again?
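(halfak's unicode-range idea above could be prototyped along these lines. A hedged sketch: the block table, the 0.99 coverage threshold, and the toy corpus are all illustrative assumptions, not revscoring features.)

```python
# Sketch of a unicode-range anomaly feature: learn which Unicode blocks
# dominate a wiki's text, then score an edit by the share of its added
# characters that fall outside those blocks.
from collections import Counter

# Coarse, illustrative ranges; a real table would cover all Unicode blocks.
BLOCKS = {
    "basic_latin": (0x0000, 0x007F),
    "cyrillic": (0x0400, 0x04FF),
    "cjk": (0x4E00, 0x9FFF),
}

def block_of(char):
    code = ord(char)
    for name, (lo, hi) in BLOCKS.items():
        if lo <= code <= hi:
            return name
    return "other"

def dominant_blocks(corpus_text, coverage=0.99):
    """Smallest set of blocks covering `coverage` of the wiki's characters."""
    counts = Counter(block_of(c) for c in corpus_text if not c.isspace())
    total = sum(counts.values())
    covered, dominant = 0, set()
    for name, n in counts.most_common():
        dominant.add(name)
        covered += n
        if covered / total >= coverage:
            break
    return dominant

def out_of_range_proportion(added_text, dominant):
    chars = [c for c in added_text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(block_of(c) not in dominant for c in chars) / len(chars)

# Toy example: against a Cyrillic-dominant corpus, an edit adding mostly
# CJK characters scores high and could be flagged for review.
dominant = dominant_blocks("пример русского текста " * 100)
print(out_of_range_proportion("漢字漢字漢字", dominant))  # -> 1.0
```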
[18:57:11] 10Scoring-platform-team, 10WMF-Communications, 10Wikimedia-Blog-Content: Draft announcement for new team: "Scoring Platform" - https://phabricator.wikimedia.org/T169755#3439951 (10Halfak)
[18:57:23] 10Scoring-platform-team, 10WMF-Communications, 10Wikimedia-Blog-Content: Draft announcement for new team: "Scoring Platform" - https://phabricator.wikimedia.org/T169755#3407584 (10Halfak) 05Open>03Resolved
[18:57:42] 10Scoring-platform-team, 10ORES, 10User-Zppix: Extend icinga check to catch 500 errors like those of the 20170613 incident - https://phabricator.wikimedia.org/T167830#3439955 (10Halfak) Ok maybe a little rush ;)
[18:58:24] https://phabricator.wikimedia.org/T169586
[18:59:40] 10Scoring-platform-team-Backlog, 10ORES, 10Wikimedia-Logstash: Send celery logs and events to logstash - https://phabricator.wikimedia.org/T169586#3439962 (10Zppix) 1:53 PM depending on which thing is sending to logstash ... for example amir started sending ores logs to logstash with https://g...
[19:00:19] There, aaron ^
[19:10:42] Awesome.
[19:10:45] o/ Amir1
[19:11:05] hey halfak, I just got back from the conference
[19:11:08] how are you?
[19:12:23] Good good.
[19:12:33] Oh BTW, Amir1, I built the tawiki revert model
[19:12:36] Just about to submit a PR
[19:13:10] oh nice
[19:13:12] Thanks
[19:15:27] n/p
[19:15:34] Looking for another nice task to pick up?
[19:15:54] Maybe you could work with fajne on her report ^_^
[19:16:15] She tested out a new tuning param for GB and it didn't work, but I'd like to have a report about it for posterity.
[19:16:36] Also she's looking into anomaly detection and I suggested we could use unicode ranges to build some new features.
[19:17:18] halfak: please ping me :(((
[19:17:32] I got distracted and now feel guilty for not responding
[19:17:34] No worries. If you didn't respond in a second, I would have :)
[19:18:05] sure, I will give it a try
[19:18:33] Cool. I'm sure she'll appreciate the questions you have about it. It will be great to include in the report :D
[19:18:50] It == Answers to your questions
[19:19:16] yeah, sorry, I was AFK this week
[19:19:25] will work way more (and all of the weekend)
[19:21:04] Amir1: do you mind if I email you the results?
[19:21:12] fajne: not at all
[19:21:18] ok
[19:27:24] halfak: since you're around, what do you think of removing old graphite data of ores? https://phabricator.wikimedia.org/T169969
[19:27:45] I'm OK with it. haven't been able to really think about it yet though
[19:27:51] basically we'll lose data older than 30 days for grafana and stuff
[19:30:06] Oh. Can we have that go back 90 days instead?
[19:30:36] I'd like to see more than a full month. Also, if we're doing this, we need to take screenshots of grafana and post them in tasks.
[19:31:47] yeah
[19:32:38] Amir1, I wonder if we could set some metrics to keep a shorter history than others
[19:32:52] E.g. any metrics that show up in our dashboard, I'd like to keep practically indefinitely
[19:32:58] But we store a lot more metrics to help with debugging
[19:33:06] hmm
[19:33:07] We only need those for 30 days or so
[19:33:18] I'm trying to make the patch
[19:33:22] hope it's not too complex
[19:33:26] let me check
[19:34:04] kk
[19:42:30] halfak: it seems it's based on directory
[19:42:58] I don't know how directories are stored in graphite servers but I doubt it is based on metric
[19:43:57] Oh... are directories just the dot separation?
[19:44:10] E.g. ores..foo.bar
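(What Amir1 and halfak are circling here is carbon's per-path retention config. Graphite stores each metric as a whisper file under directories split on the dots of the metric name, and retention is set per path pattern in storage-schemas.conf, first match wins. A hypothetical sketch, not the actual WMF config; the metric naming is invented for illustration.)

```ini
# Hypothetical storage-schemas.conf sketch; patterns and retentions are
# illustrative, not WMF production values. First matching section wins.

# Dashboard metrics: 1-minute points for 90 days, 1-hour points for 5 years.
[ores_dashboard]
pattern = ^ores\.dashboard\..*
retentions = 1m:90d,1h:5y

# Everything else under ores.*: debugging metrics, kept 30 days only.
[ores_default]
pattern = ^ores\..*
retentions = 1m:30d
```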
[19:49:20] \o/
[19:50:09] 10Scoring-platform-team, 10ORES, 10Puppet: Add greek dict to ores puppet base - https://phabricator.wikimedia.org/T170709#3440098 (10Halfak)
[19:50:14] 10Scoring-platform-team, 10ORES, 10Puppet: Add greek dict to ores puppet base - https://phabricator.wikimedia.org/T170709#3440111 (10Halfak) https://gerrit.wikimedia.org/r/#/c/365289
[19:51:07] :D
[19:51:28] 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10artificial-intelligence: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485#3440116 (10Halfak)
[19:51:30] 10Scoring-platform-team, 10ORES, 10Puppet: Add greek dict to ores puppet base - https://phabricator.wikimedia.org/T170709#3440115 (10Halfak)
[19:52:54] 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10artificial-intelligence: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485#3432972 (10Halfak)
[19:52:56] 10Scoring-platform-team, 10editquality-modeling, 10Bengali-Sites, 10artificial-intelligence: Train reverted model for Bengali Wikipedia - https://phabricator.wikimedia.org/T170490#3433085 (10Halfak) 05Open>03Resolved a:03Halfak https://github.com/wiki-ai/editquality/pull/80
[19:53:16] 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10Tamil-Sites, and 2 others: Train/test reverted model for tawiki - https://phabricator.wikimedia.org/T166051#3283936 (10Halfak) 05Open>03Resolved a:05Ladsgroup>03Halfak https://github.com/wiki-ai/editquality/pull/80
[19:53:18] 10Scoring-platform-team-Backlog, 10ORES, 10editquality-modeling, 10Tamil-Sites, 10artificial-intelligence: Deploy reverted model for tawiki - https://phabricator.wikimedia.org/T166048#3440129 (10Halfak)
[19:53:33] 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10artificial-intelligence: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485#3440133 (10Halfak)
[19:53:35] 10Scoring-platform-team, 10editquality-modeling, 10artificial-intelligence: Train reverted model for Greek Wikipedia - https://phabricator.wikimedia.org/T170491#3433108 (10Halfak) 05Open>03Resolved a:03Halfak https://github.com/wiki-ai/editquality/pull/80
[20:01:31] brb
[20:05:49] I'm on my PC now halfak, fyi
[20:16:05] o/
[20:28:38] 10Scoring-platform-team, 10ORES: Update revscoring to 1.3.17 in wheels repo - https://phabricator.wikimedia.org/T170713#3440223 (10Halfak)
[20:41:48] 10Scoring-platform-team, 10ORES: Update revscoring to 1.3.17 in wheels repo - https://phabricator.wikimedia.org/T170713#3440296 (10Halfak) https://gerrit.wikimedia.org/r/365354
[20:43:56] 10Scoring-platform-team, 10editquality-modeling, 10artificial-intelligence: Train damaging/goodfaith model for English Wiktionary - https://phabricator.wikimedia.org/T170487#3440297 (10Halfak) a:03Halfak
[20:44:28] 10Scoring-platform-team, 10editquality-modeling, 10artificial-intelligence: Train/test damaging & goodfaith models for Albanian Wikipedia - https://phabricator.wikimedia.org/T163009#3440300 (10Halfak)
[20:44:28] 10Scoring-platform-team, 10Bad-Words-Detection-System, 10revscoring, 10artificial-intelligence: Add language support for Albanian - https://phabricator.wikimedia.org/T168369#3362451 (10Halfak) 05Open>03Resolved
[20:45:03] halfak, need me to merge anything?
[20:45:13] working on albanian and romanian models
[20:45:20] Zppix, no I don't think so.
[20:45:24] ok
[20:45:30] ores should have the github mirror added
[20:45:31] https://phabricator.wikimedia.org/diffusion/EORS/
[20:45:33] the puppet and wheels repo should have other reviewers.
[20:45:38] it will gain a github download button heh
[20:46:04] paladox that diffusion repo is for the ext
[20:46:10] yep
[20:46:15] we mirror to github
[20:46:16] github
[20:46:27] Oh yeah. Is that something we can do directly from diffusion?
[20:46:30] but we added support for a github button
[20:46:32] yes
[20:46:40] we added additional support
[20:46:42] check this out
[20:46:44] out
[20:46:50] https://phab-01.wmflabs.org/diffusion/TS/
[20:47:07] and https://phabricator.wikimedia.org/diffusion/SMTL/
[20:47:29] (note phab-01 is running the next update we will do on phabricator.wikimedia.org)
[20:47:34] paladox, our github repo you're thinking of (github.com/wiki-ai/ores) is not what phabricator.wikimedia.org/diffusion/eors/ is
[20:47:45] nope, wrong repo
[20:47:48] it's github
[20:47:52] github.com/wikimedia
[20:48:03] https://github.com/wikimedia/mediawiki-extensions-ORES
[20:48:05] * Zppix facepalm
[20:48:08] mirror to https://github.com/wikimedia/mediawiki-extensions-ORES
[20:48:14] gerrit does it
[20:48:24] but we are slowly migrating repos to get diffusion to do it
[20:48:28] iirc we aren't really developing the ext anymore, are we halfak?
[20:54:51] Zppix, i got the mirror added
[20:54:52] https://phabricator.wikimedia.org/diffusion/EORS/
[20:54:55] github buttons now show
[20:54:56] :)
[21:05:14] Zppix, the extension is still under dev
[21:05:26] We the scoring platformy people are moving away from the UI stuff.
[21:06:03] OH! Amir1, please have a look at https://phabricator.wikimedia.org/T167911
[21:06:05] ^ relevant
[21:06:35] oh
[21:06:37] thanks for setting up that mirror paladox
[21:06:49] Sagan did it :)
[21:07:12] oh. thanks sagan then :)
[21:26:25] Albanian models are building.
[21:26:30] now for English Wiktionary
[21:42:43] I think I might finish all these models before I need to leave :)))
[21:42:51] Maybe I'll get the deployment ready for Monday then
[21:46:52] 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10artificial-intelligence: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485#3440483 (10Halfak)
[21:58:47] OK, all the models are either built or building. I'm going to head out. I'll be around for the hack session tomorrow at 1500 UTC
[21:58:58] AKA office hours with halfak
[21:59:07] "halfak session"
[21:59:09] :D
[21:59:11] o/
[22:53:39] 10Scoring-platform-team-Backlog, 10ORES: Improve cleaning of article quality assessment datasets - https://phabricator.wikimedia.org/T170434#3440730 (10Nettrom) I've gathered revision timestamps for all the revisions in the [[ https://figshare.com/articles/English_Wikipedia_Quality_Asssessment_Dataset/1375406...