[00:04:30] <awight>	  https://ihadanny.wordpress.com/2014/12/01/python-virtualenv-with-pig-streaming/
[00:23:39] <wikibugs>	 10Scoring-platform-team-Backlog: [Investigate] Hadoop integration for ORES training - https://phabricator.wikimedia.org/T170650#3438153 (10awight)
[00:23:42] <wikibugs>	 10Scoring-platform-team, 10MediaWiki-extensions-ORES, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017), 10Patch-For-Review: Summarize what it will take to separate product and platform for ORES Extension - https://phabricator.wikimedia.org/T167908#3438167 (10jmatazzoni)
[00:28:52] <ebernhardson>	 awight: yes i'm already doing it for search, but not with scikitlearn
[00:30:33] <ebernhardson>	 awight: you don't need to port anything, you can run standard python code on hadoop with pyspark as long as you can break up the task somehow (the individual hyperparameters of a model training being an obvious spot)
[00:31:05] <ebernhardson>	 it would look something like: sc.parallelize(list_of_parameters_to_run).map(function_taking_parameters_and_returning_something).collect()
[00:31:24] <ebernhardson>	 each item in the parallelized list will be provided to the function on an executor
[00:32:38] <ebernhardson>	 i suppose the mildly annoying part is you nede to find a reasonable way to get your training data there, but its doable
[00:34:55] <ebernhardson>	 if the data is small enough (10's of MB?) you can just broadcast it and call it aday
[00:36:01] <awight>	 ebernhardson: I like it.  Yeah training data with cached features is about 150MB, maybe fine to broadcast.
[00:36:53] <awight>	 We're currently splitting threads on cross-validation folds, and can extract features in parallel.
[00:37:03] <awight>	 Single models are being trained on 1 cpu, though.
[00:37:51] <awight>	 The biggest win AFAICT would be to be able to queue lots of jobs with differing numbers of workers, and keep our hardware fully employed.
[00:38:34] <ebernhardson>	 the problem is you can't just take the whole hadoop cluster :P there are scheduled jobs that run hourly or whatever that need hundreds of cpu cores and hundreds of Gb of memory :P
[00:39:13] <awight>	 hehe, we'd be fine with a quota
[00:39:27] <awight>	 or even "nice 19"
[00:39:35] <ebernhardson>	 but something can be worked out. Generally what i'm doing in search is that one model for feature engineering (~150M training data) is trained on 4 cores, 5 CV folds are trained in parallel, and then 20 hyperparameters are tested at a time
[00:40:07] <ebernhardson>	 hadoop doesn't really have a 'nice 19' kind of concept, instead when you run something you ask for X executors with M gb of memory and N cpu cores, and it give sthem to you (or not depending on how busy things are)
[00:40:14] <ebernhardson>	 those are then your memory, and your cpu cores until you give them back
[00:40:36] <awight>	 This might not win us friends :)
[00:41:20] <awight>	 We have our own hardware, so in theory could run hadoop there... I have no idea what the ops overhead would look like.
[00:41:39] <ebernhardson>	 generally how i've been going about it so far is my job asks for 100 executors with 4GB of memory and 4 cpu cores, and runs for ~20 minutes and shuts down. This is basically a full training with many hundreds of models trained to choose a set of hyperparamters for the current feature set
[00:42:04] * awight warms hands by the data center's exterior wall
[00:42:08] <ebernhardson>	 :)
[00:44:04] <ebernhardson>	 if you have your own hardware a job queue can get you roughly the same thing, hadoop was convenient for me because it's an existing system with just shy of 1000 cores and 2.2TB of memory that i can just grab resources from and return
[00:44:36] <awight>	 It feels wrong that we would have an empire of celery for production, but then use hadoop for training...
[00:44:57] <awight>	 yeah I was thinking something as simple as a shared python multiprocessing pool
[00:46:02] <awight>	 I guess we already have divergent tech stacks for training and production
[00:47:10] <ebernhardson>	 i'm somewhat surprised your training 150MB sets on a single core though, my sets for feature engineering are about 150M (prod models closer to 2.5GB) and it still takes some time to train the ~900 models for hyperparameter tuning with CV
[00:47:32] <awight>	 This how we multiprocess CV: https://github.com/wiki-ai/revscoring/blob/master/revscoring/scoring/models/model.py#L220
[00:49:02] <ebernhardson>	 i mean, at a high level it all looks the same :P https://github.com/wikimedia/search-MjoLniR/blob/master/mjolnir/training/tuning.py#L231
[00:49:18] <awight>	 I'll have to learn more about our hyperparameter tuning, AIUI we're tuning manually.
[00:50:10] <ebernhardson>	 mine isn't that much different, basically just define a space to search, search it, move on: https://github.com/wikimedia/search-MjoLniR/blob/master/mjolnir/training/xgboost.py#L486-L504
[00:50:39] <ebernhardson>	 i think mine is slightly different in that it's somewhat common to tune all the parameters at once, while i'm tuning just a couple related parameters at a time to keep down on the exponential explosion of possibilities
[00:51:06] <awight>	 :) pprint ftw
[00:51:26] <ebernhardson>	 :)
[00:51:31] <awight>	 ty, I need to run but will be sure to bother you next week.
[00:51:57] <ebernhardson>	 sure
[04:36:28] <wikibugs>	 10Scoring-platform-team-Backlog: [Investigate] Hadoop integration for ORES training - https://phabricator.wikimedia.org/T170650#3438380 (10awight)
[14:29:18] * halfak reads about http://bkmla.org/
[16:35:00] <Zppix>	 O:
[16:35:04] <Zppix>	 O/*
[17:05:50] <fajne>	 halfak: wanted to report about my epic fail:) log2 seems to work better than None. The ROC-AUC of GBC for ruwiki damaging edits was 0.934 which is lower than the benchmark 0.936 (I ran log2 version on Tuesday).
[18:07:06] <halfak>	 fajne, good to know!  Would you write that up and we'll make a little report for it? 
[18:07:16] <halfak>	 sorry to be AFk.  I had a lunch that went late. 
[18:07:22] <Zppix>	 Hi halfak :)
[18:07:33] <halfak>	 I'm just about to change locations again so I'll be AFK for ~30 mins
[18:07:35] <halfak>	 o/ Zppix 
[18:07:36] <Zppix>	 Hows the logstash thing coming?
[18:07:54] <halfak>	 Not quite sure.  I haven't really been able to look at it since mutante's note. 
[18:08:03] <halfak>	 I'll be around to talk more about it soon :) 
[18:08:03] <halfak>	 o/
[18:08:05] <Zppix>	 Ok
[18:08:08] <Zppix>	 O/
[18:15:41] <wikibugs>	 10Scoring-platform-team, 10Edit-Review-Improvements-RC-Page, 10ORES, 10Collaboration-Team-Triage (Collab-Team-Q4-Apr-Jun-2017), and 2 others: Conform ORES sensitivity levels to the new ERI standards - https://phabricator.wikimedia.org/T160575#3439860 (10jmatazzoni) @Etonkovidova notes:  > 2) Likely have pr...
[18:34:16] <halfak>	 o/
[18:34:18] <halfak>	 Zppix & fajne: just got to coffee shop :) 
[18:34:20] <halfak>	 So I'm here for several hours 
[18:34:47] <fajne>	 ok
[18:35:37] <fajne>	 can you point me to a similar report maybe?
[18:36:39] <halfak>	 Hmm..  yeah.  Let me find one. 
[18:37:29] <halfak>	 https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_draft_quality/Work_log/2016-12-01
[18:37:36] <halfak>	 https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_draft_quality/Work_log/2016-12-03
[18:37:57] <halfak>	 Totally casual, captures results and the thoughts you had along the way. 
[18:37:59] <fajne>	 Last night, I read up on GBC params tuning and then on the imbalanced classes problem. Hence, a quick question: did you consider balancing your training set (up to 50/50)?
[18:38:13] <halfak>	 No way we're going to do that for vandalism
[18:38:18] <halfak>	 It's too infrequent. 
[18:38:42] <halfak>	 We do it a lot for article quality -- where the classes can be balanced more because we have enough observations of the top quality classes. 
[18:39:14] <fajne>	 another one: is switching to an anomaly detection algorithm a good idea?
[18:40:57] <halfak>	 for what?
[18:41:08] <halfak>	 And what do you mean by "anomaly detection algorithm"
[18:41:11] <halfak>	 Like, which one?
[18:41:48] <fajne>	 1 sec
[18:42:21] <fajne>	 https://www.semanticscholar.org/paper/Efficient-Anomaly-Detection-by-Isolation-Using-Nea-Bandaragoda-Ting/764896ad374224e53d3cafe9575ca58487724802
[18:44:10] <fajne>	 it's just a suggestion from one blog I found interesting. not sure it's worth exploring, so just checking with you
[18:44:26] <fajne>	 since vandalism is so infrequent...
[18:45:07] <Zppix>	 Im here to talk about logstash if you want
[18:45:07] <Zppix>	 Ok
[18:45:56] <fajne>	 also, I had a feeling that balancing a training set is actually a solution for imbalanced classes problems. not sure i understand why it's a no way thing here..
[18:47:15] <fajne>	 halfak: maybe i'll just put my stupid thoughts into this report and we discuss it on monday, if you have other (practical) things to do today))
[18:48:55] <fajne>	 btw, what's the total number of features for GBC we have? I found 9 for eng and 11 for rus
[18:50:09] <halfak>	 Zppix, homework for you is to figure out how logstash is picking up any of our logs.  I think we found something in gerrit or git last time that was helpful
[18:50:56] <halfak>	 fajne, the anomaly detection could be interesting.  Maybe we can do that half-way by engineering some new features that should carry high signal for anomalies. 
[18:51:10] <halfak>	 E.g. measuring the dominant unicode char ranges for a wiki
[18:51:34] <halfak>	 Adding chars that are way out of the range of the chars that are common in that wiki probably should get flagged for review. 
[18:52:57] <Zppix>	 halfak: ill try :/ but i have zero clue
[18:53:03] <fajne>	 yep, modifying existing an algo to be more sensitive to a rare class is also one of classical solutions I read about. No idea though what this feature would be))) 
[18:53:34] <ebernhardson>	 halfak: depending on which thing is sending to logstash ... for example amir started sending ores logs to logstash with https://gerrit.wikimedia.org/r/#/c/321096/11/modules/service/manifests/uwsgi.pp
[18:53:43] <ebernhardson>	 basically just stuffs python logging on a socket that logstash reads
[18:54:19] <halfak>	 fajne, one thing we do is tell the GB algorithm to weight the lower frequency class highly. 
[18:54:22] <ebernhardson>	 s/halfak/zppix/
[18:54:29] <halfak>	 Thanks ebernhardson :D 
[18:54:54] * ebernhardson sees pings when people mention logstash :P
[18:55:32] <halfak>	 :D  Glad to have you around to give us a hand! 
[18:55:35] <Zppix>	 Homework is done (i did it all no one helped xD jk) now i perfer if someone set it up i would however im uncomfortable
[18:55:55] <halfak>	 Zppix, OK we can find something else for you to hack on. 
[18:56:15] <Zppix>	 I can make sure it works and its done right
[18:56:26] <Zppix>	 Just the code for logstash im not familar
[18:56:37] <halfak>	 Zppix, will you add ebernhardson's notes to the task? 
[18:56:45] <Zppix>	 Yes
[18:56:51] <Zppix>	 Whats the task again?
[18:57:11] <wikibugs>	 10Scoring-platform-team, 10WMF-Communications, 10Wikimedia-Blog-Content: Draft announcement for new team: "Scoring Platform" - https://phabricator.wikimedia.org/T169755#3439951 (10Halfak)
[18:57:23] <wikibugs>	 10Scoring-platform-team, 10WMF-Communications, 10Wikimedia-Blog-Content: Draft announcement for new team: "Scoring Platform" - https://phabricator.wikimedia.org/T169755#3407584 (10Halfak) 05Open>03Resolved
[18:57:42] <wikibugs>	 10Scoring-platform-team, 10ORES, 10User-Zppix: Extend icinga check to catch 500 errors like those of the 20170613 incident - https://phabricator.wikimedia.org/T167830#3439955 (10Halfak) Ok maybe a little rush ;)
[18:58:24] <halfak>	 https://phabricator.wikimedia.org/T169586
[18:59:40] <wikibugs>	 10Scoring-platform-team-Backlog, 10ORES, 10Wikimedia-Logstash: Send celery logs and events to logstash - https://phabricator.wikimedia.org/T169586#3439962 (10Zppix) 1:53 PM <ebernhardson> depending on which thing is sending to logstash ... for example amir started sending ores logs to logstash with https://g...
[19:00:19] <Zppix>	 There aaron ^
[19:10:42] <halfak>	 Awesome. 
[19:10:45] <halfak>	 o/ Amir1 
[19:11:05] <Amir1>	 hey halfak, I just got back from the conference 
[19:11:08] <Amir1>	 have are you?
[19:12:23] <halfak>	 Good good.  
[19:12:33] <halfak>	 Oh BTW, Amir1, I built the tawiki revert model
[19:12:36] <halfak>	 Just about to submit a PR
[19:13:10] <Amir1>	 oh nice
[19:13:12] <Amir1>	 Thanks
[19:15:27] <halfak>	 n/p 
[19:15:34] <halfak>	 Looking for another nice task to pick up?
[19:15:54] <halfak>	 Maybe you could work with fajne on her report ^_^
[19:16:15] <halfak>	 She tested out a new tuning param for GB and it didn't work, but I'd like to have a report about it for posterity. 
[19:16:36] <halfak>	 Also she's looking into anomaly detection and I suggested we could use unicode ranges to build some new features. 
[19:17:18] <Amir1>	 halfak: please ping me :(((
[19:17:32] <Amir1>	 I got distracted now feel guilty for not responding 
[19:17:34] <halfak>	 No worries.  If you didn't respond in a second, I would have :) 
[19:18:05] <Amir1>	 sure, I will give it a try
[19:18:33] <halfak>	 Cool.  I'm sure she'll appreciate the questions you have about it.  It will be great to include in the report :D 
[19:18:50] <halfak>	 It == Answers to your questions
[19:19:16] <Amir1>	 yeah, sorry was afk this week
[19:19:25] <Amir1>	 will work way more (and all of the weekend)
[19:21:04] <fajne>	 Amir1: do you mind if email you the results?
[19:21:12] <Amir1>	 fajne: not at all
[19:21:18] <fajne>	 ok
[19:27:24] <Amir1>	 halfak: since you're around what do you think of removing old graphite data of ores https://phabricator.wikimedia.org/T169969
[19:27:45] <halfak>	 I'm Ok with it.  haven't been able to really think about it yet though
[19:27:51] <Amir1>	 basically we'll lose data older than 30 days for grafana and stuff
[19:30:06] <halfak>	 Oh.   Can we have that go back 90 days instead?
[19:30:36] <halfak>	 I'd like to see more than a full month.  Also, if we're doing this, we need to take screenshots of grafana we post in tasks. 
[19:31:47] <Amir1>	 yeah
[19:32:38] <halfak>	 Amir1, I wonder if we could set some metrics to keep a shorter history than others
[19:32:52] <halfak>	 E.g. any metrics that show up in our dashboard, I'd like to keep practically indefinitely
[19:32:58] <halfak>	 But we store a lot more metrics to help with debugging
[19:33:06] <Amir1>	 hmm
[19:33:07] <halfak>	 We only need those for 30 days or so 
[19:33:18] <Amir1>	 I'm trying to make the patch
[19:33:22] <Amir1>	 hope it's not too complex
[19:33:26] <Amir1>	 let me check
[19:34:04] <halfak>	 kk
[19:42:30] <Amir1>	 halfak: it seems it's based on directory 
[19:42:58] <Amir1>	 I don't know how directories are stored in graphite servers but I doubt it is based on metric
[19:43:57] <halfak>	 Oh... are directories just the dot separation?
[19:44:10] <halfak>	 E.g. ores.<hostname>.foo.bar
[19:49:20] <halfak>	 \o/
[19:50:09] <wikibugs>	 10Scoring-platform-team, 10ORES, 10Puppet: Add greek dict to ores puppet base - https://phabricator.wikimedia.org/T170709#3440098 (10Halfak)
[19:50:14] <wikibugs>	 10Scoring-platform-team, 10ORES, 10Puppet: Add greek dict to ores puppet base - https://phabricator.wikimedia.org/T170709#3440111 (10Halfak) https://gerrit.wikimedia.org/r/#/c/365289
[19:51:07] <Amir1>	 :D
[19:51:28] <wikibugs>	 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10artificial-intelligence: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485#3440116 (10Halfak)
[19:51:30] <wikibugs>	 10Scoring-platform-team, 10ORES, 10Puppet: Add greek dict to ores puppet base - https://phabricator.wikimedia.org/T170709#3440115 (10Halfak)
[19:52:54] <wikibugs>	 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10artificial-intelligence: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485#3432972 (10Halfak)
[19:52:56] <wikibugs>	 10Scoring-platform-team, 10editquality-modeling, 10Bengali-Sites, 10artificial-intelligence: Train reverted model for Bengali Wikipedia - https://phabricator.wikimedia.org/T170490#3433085 (10Halfak) 05Open>03Resolved a:03Halfak https://github.com/wiki-ai/editquality/pull/80
[19:53:16] <wikibugs>	 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10Tamil-Sites, and 2 others: Train/test reverted model for tawiki - https://phabricator.wikimedia.org/T166051#3283936 (10Halfak) 05Open>03Resolved a:05Ladsgroup>03Halfak https://github.com/wiki-ai/editquality/pull/80
[19:53:18] <wikibugs>	 10Scoring-platform-team-Backlog, 10ORES, 10editquality-modeling, 10Tamil-Sites, 10artificial-intelligence: Deploy reverted model for tawiki - https://phabricator.wikimedia.org/T166048#3440129 (10Halfak)
[19:53:33] <wikibugs>	 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10artificial-intelligence: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485#3440133 (10Halfak)
[19:53:35] <wikibugs>	 10Scoring-platform-team, 10editquality-modeling, 10artificial-intelligence: Train reverted model for Greek Wikipedia - https://phabricator.wikimedia.org/T170491#3433108 (10Halfak) 05Open>03Resolved a:03Halfak https://github.com/wiki-ai/editquality/pull/80
[20:01:31] <halfak>	 brb
[20:05:49] <Zppix>	 im on pc now halfak fyi
[20:16:05] <halfak>	 o/
[20:28:38] <wikibugs>	 10Scoring-platform-team, 10ORES: Update revscoring to 1.3.17 in wheels repo - https://phabricator.wikimedia.org/T170713#3440223 (10Halfak)
[20:41:48] <wikibugs>	 10Scoring-platform-team, 10ORES: Update revscoring to 1.3.17 in wheels repo - https://phabricator.wikimedia.org/T170713#3440296 (10Halfak) https://gerrit.wikimedia.org/r/365354
[20:43:56] <wikibugs>	 10Scoring-platform-team, 10editquality-modeling, 10artificial-intelligence: Train damaging/goodfaith model for English Wiktionary - https://phabricator.wikimedia.org/T170487#3440297 (10Halfak) a:03Halfak
[20:44:28] <wikibugs>	 10Scoring-platform-team, 10editquality-modeling, 10artificial-intelligence: Train/test damaging & goodfaith models for Albanian Wikipedia - https://phabricator.wikimedia.org/T163009#3440300 (10Halfak)
[20:44:28] <wikibugs>	 10Scoring-platform-team, 10Bad-Words-Detection-System, 10revscoring, 10artificial-intelligence: Add language support for Albanian - https://phabricator.wikimedia.org/T168369#3362451 (10Halfak) 05Open>03Resolved
[20:45:03] <Zppix>	 halfak need me to merge anything?
[20:45:13] <halfak>	 working on albanian and romanian models
[20:45:20] <halfak>	 Zppix, no I don't think so. 
[20:45:24] <Zppix>	 ok
[20:45:30] <paladox>	 ores should have the github mirror added
[20:45:31] <paladox>	 https://phabricator.wikimedia.org/diffusion/EORS/
[20:45:33] <halfak>	 the puppet and wheels repo should have other reviewers. 
[20:45:38] <paladox>	 it will gain a github download button heh
[20:46:04] <Zppix>	 paladox that diffusion repo is for the ext
[20:46:10] <paladox>	 yep
[20:46:15] <paladox>	 we mirror to gtihub
[20:46:16] <paladox>	 github
[20:46:27] <halfak>	 Oh yeah.  Is that something we can do directly from diffusion
[20:46:30] <paladox>	 but we added support for a github button
[20:46:32] <paladox>	 yes
[20:46:40] <paladox>	 we added additional support
[20:46:42] <paladox>	 check this out
[20:46:44] <paladox>	 out
[20:46:50] <paladox>	 https://phab-01.wmflabs.org/diffusion/TS/
[20:47:07] <paladox>	 and https://phabricator.wikimedia.org/diffusion/SMTL/
[20:47:29] <paladox>	 (note phab-01 is running the next update we will do on phabricator.wikimedia.org)
[20:47:34] <Zppix>	 paladox our github repo your thinking of (github.com/wiki-ai/ores) is not what phabricator.wikimedia.org/diffusion/eors/ is
[20:47:45] <paladox>	 nope wrong repo
[20:47:48] <paladox>	 it's github
[20:47:52] <paladox>	 github.com/wikimedia
[20:48:03] <paladox>	 https://github.com/wikimedia/mediawiki-extensions-ORES
[20:48:05] * Zppix facepalm
[20:48:08] <paladox>	 mirror to https://github.com/wikimedia/mediawiki-extensions-ORES
[20:48:14] <paladox>	 gerrit does it
[20:48:24] <paladox>	 but we are slowly migrating repos to get diffusion to do it
[20:48:28] <Zppix>	 iirc we arent really developing the ext any more are we halfak
[20:54:51] <paladox>	 Zppix i got the mirror added
[20:54:52] <paladox>	 https://phabricator.wikimedia.org/diffusion/EORS/
[20:54:55] <paladox>	 github buttons now show
[20:54:56] <paladox>	 :)
[21:05:14] <halfak>	 Zppix, the extension is still under dev
[21:05:26] <halfak>	 We the scoring platformy people are moving away from the UI stuff. 
[21:06:03] <halfak>	 OH!  Amir1, please have a look at https://phabricator.wikimedia.org/T167911
[21:06:05] <halfak>	 ^ relevant
[21:06:35] <Zppix>	 oh
[21:06:37] <halfak>	 thanks for setting up that mirror paladox 
[21:06:49] <paladox>	 Sagan did it :)
[21:07:12] <halfak>	 oh.  thanks sagan then :) 
[21:26:25] <halfak>	 Albanian models are building. 
[21:26:30] <halfak>	 now for English Wiktionary
[21:42:43] <halfak>	 I think I might finish all these models before I need to leave :))) 
[21:42:51] <halfak>	 Maybe I'll get the deployment ready for Monday then
[21:46:52] <wikibugs>	 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10artificial-intelligence: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485#3440483 (10Halfak)
[21:58:47] <halfak>	 OK all the models are either built or building.  I'm going to head out.  I'll be around for the hack session tomorrow at 1500 UTC
[21:58:58] <halfak>	 AKA office hours with halfak
[21:59:07] <halfak>	 "halfak session"
[21:59:09] <halfak>	 :D
[21:59:11] <halfak>	 o/
[22:53:39] <wikibugs>	 10Scoring-platform-team-Backlog, 10ORES: Improve cleaning of article quality assessment datasets - https://phabricator.wikimedia.org/T170434#3440730 (10Nettrom) I've gathered revision timestamps for all the revisions in the [[ https://figshare.com/articles/English_Wikipedia_Quality_Asssessment_Dataset/1375406...