[09:56:38] Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1868826 (IKhitron) I am happy, @Edgars2007, but: 1: Maybe I'm a troll. ;-) 2: Maybe somebody else was in your account, and he is a troll.
[16:18:56] o/
[17:08:38] * guillom notes that halfak filled out the agenda for the RG meeting, changes his response to the invite.
[17:10:12] yay!
[17:10:17] Sorry for the late entry.
[17:10:24] I think this discussion will be super interesting :)
[17:11:20] Yeah, I don't mind last-minute agendas. I like the RG meeting.
[17:11:36] (Also, I have very few meetings.)
[17:14:58] guillom, if you want a bit of background and have 5 minutes, see http://socio-technologist.blogspot.com/2015/12/disparate-impact-of-damage-detection-on.html
[17:15:09] HaeB, ^ if you are coming to RG meeting today
[17:15:40] halfak: I was looking for something to do in the next 10 minutes before the meeting. Perfect :D
[17:15:58] woot!
[17:18:21] halfak: Where's the edit button on your blog to fix typos? :p
[17:18:37] ("therefor")
[17:18:39] lol.
[17:18:41] Want one
[17:18:46] That would be awesome.
[17:18:55] A blog with flagged revisions so I can review when I have time.
[17:19:18] I'm building the next version of my site using a static site generator, with edit and history links pointing to GitHub :)
[17:19:47] I like the idea of having anons and newcomers as a protected class.
[17:22:08] Balancing model fitness and class protection... yes, I can see the dilemma.
[17:22:44] Yay! Fun discussions ahead then. :)
[17:23:38] * guillom makes his way to the meeting room.
[17:23:58] Not sure if I have much to discuss. I basically agree that protection should come first :)
[17:23:59] brb
[17:25:08] * halfak is hoping that Ellery will look at what I'm doing and say something like, "Oh! But you're doing it wrong. Just fix this one thing and you can be both more fit and protect this class."
[17:25:42] Also, goodmorning Ellery :D
[17:30:58] I need to get Toby out of the room. Will be a couple minutes late
[22:09:24] hey halfak, is there a category on metawiki for "WMF research" that I can add to my research study pages?
[22:10:23] J-Mo: btw, another thing I wanted to show you: http://jupyter-mw.wmflabs.org/wiki/Notebook:Evil is mediawiki rendering for the 'notebooks' I was talking about :) hopefully in the future research about wiki can happen in totally reproducible fashion (with notebooks) and be published on wiki itself :D
[22:10:52] ooh, fancy YuviPanda
[22:12:10] J-Mo, good Q. https://meta.wikimedia.org/wiki/Category:Wikimedia_hosted_projects
[22:12:10] That's sort-of one of them
[22:12:10] But not exclusive to WMF research.
[22:12:29] * halfak needs to do some gardening on mea
[22:12:33] *meta
[22:12:43] Any objection to me creating a "WMF Design Research" sub-category?
[22:13:10] I want to be able to easily link to all of it with one link from mediawiki.org
[22:13:57] J-Mo, should make a "WMF Research project" category and make "WMF Design Research project" subcat
[22:14:12] will do!
[22:14:19] * halfak doesn't want to make a "WMF Research-and-Data project" distinction
[22:20:22] done! also, you're right, halfak. I kept it to "Wikimedia Research". Down with artificial distinctions! https://meta.wikimedia.org/wiki/Category:Wikimedia_Research_project
[22:20:44] Yay!
[22:20:48] :D
[23:03:48] halfak: I'm at the spark thing, and I really want to make this happen for volunteers to use
[23:03:57] we might not have the resources though
[23:04:07] listening to him say 'a small 30GB container' makes me feel sad
[23:04:09] The databricks setup or Spark generally?
[23:04:12] latter
[23:04:21] 90% of what they have we can do with jupyter + spark
[23:05:43] If only we had Capacity for a small cluster in labs
[23:06:05] I'd be doing my model tuning on spark then :)
[23:06:14] yeah
[23:06:26] halfak: what do you mean by 'small' cluster btw
[23:06:49] Not sure. Need a spark/hadoop person to think about what size we would need in order to operate.
[23:06:56] * YuviPanda nods
[23:07:08] also if I want to 'open it up to whoever' it needs to be pretty large I gues
[23:07:10] *guess
[23:07:19] YuviPanda, or we need good processes around using it.
[23:07:24] that too
[23:07:38] but I generally prefer doing things where we just open it up to people and let creativity flow
[23:07:52] E.g. our stat boxes that we use in research are not all that big and powerful compared to a spark cluster and we share them alright with a mailing list and good behavior.
[23:07:59] right
[23:08:03] +1 for that.
[23:08:15] I think we can have good process *and* be non-restrictive.
[23:08:46] so spark also seems to have no multi-tenancy
[23:09:16] Interesting.
[23:09:22] That's kind of surprising.
[23:09:34] https://en.wikipedia.org/wiki/Multitenancy
[23:10:11] http://spark.apache.org/docs/latest/security.html
[23:10:14] just a shared secret
[23:10:21] so only authentication, no authorization
[23:10:28] and hence no namespacing
[23:10:36] Gotcha.
[23:11:29] halfak: during the event other people accidentally started attaching to the presenter's running cluster for example :)
[23:11:42] so all of this is predicated on 'if someone fucks up just fire them I guess? HR deals with those, right?'
[23:12:32] Well, it seems that the spark devs are following the "open first" principle in prioritizing features.
[23:13:11] yeah, but authorization is one of those things that are hard to bolt on
[23:13:30] esp. for arbitrary code executio
[23:13:32] n
[23:13:39] that's why we have no mongodb on tools
[23:15:10] * halfak still wonders why there isn't a simple relational protocol for querying JSON objects
[23:15:27] postgres JSON?
[23:15:31] Yeah.
[23:15:43] I like postgres JSON OK, but I want the whole row to be JSON :)
[23:15:52] Seriously. MongoDB's greatest failure was deciding that they were NoSQL.
[23:15:58] I've heard good things about RethinkDB
[23:16:07] live by the buzzword, die by the buzzword
[23:16:13] yup
[23:16:19] that, and making premature claims without understanding what claims they were making!
[23:16:23] it's goland I think?
[23:16:31] golang
[23:16:32] 'we are blazing fast!' 'um, because you are not actually writing to disk'
[23:16:39] lol
[23:16:50] I'm really fast at running races too.
[23:17:01] I just tell you I'm done as soon as the gun goes off!
[23:17:11] * YuviPanda won a 0meters race all the time
[23:17:14] I might or might not finish. That's not the point.
[23:17:17] ;)
[23:17:33] chasemp, golang is going in a lame direction?
[23:17:53] * halfak has been looking at Rust/Haskell longingly these days.
[23:18:05] oh I don't think so just it was the flavor of the week when rethinkdb was making the rounds
[23:18:09] so it got a bit of press out of it
[23:18:10] Gotcha.
[23:18:15] * YuviPanda actually quite likes Golang
[23:18:25] it's like C but less crappy and with nice async/parallel stuff!
[23:18:35] http://yager.io/programming/go.html
[23:18:42] YuviPanda, ^ curious about your thoughts
[23:18:56] I think I agree it's not a good language. It's a pretty useful one though.
[23:19:10] Yeah. Lots of those.
[23:19:15] I think PHP meets that definition ;)
[23:19:17] yeah
[23:19:21] all languages stink, some are useful :)
[23:19:25] Go is far better than PHP
[23:19:33] perl6 is really a good language, just not very useful atm
[23:19:40] Yeah. Unfair comparison. I'm just saying that PHP is useful
[23:19:46] and also not good
[23:19:57] 'good' languages require 'good' programmers to write 'good' code, IMO. that increases barrier to adoption
[23:20:10] useful languages allow useful programmers to write useful code
[23:20:26] lots of overlap there, not black and white etc
[23:20:34] but I think people talking about 'good' languages miss this
[23:20:48] that could be a good mad lib
[23:20:59] ____ languages allow ____ programmers to write _____ code
[23:21:00] ok go
[23:21:16] "smelly" languages allow "smelly" programmers to write "smelly" code
[23:21:21] lololol
[23:21:21] genius
[23:22:06] halfak: I wonder if I should setup like, a 4-5 machine spark cluster just for ores / ML stuff
[23:22:47] and just mess around with it
[23:22:49] YuviPanda, I wouldn't use it right away. I figured out how to hack parallelization on a single machine and that's working OK for now.
[23:22:50] to see how it goes
[23:22:53] right
[23:23:04] let me know when you can :D
[23:23:06] But maybe we could pull in an ML intern to pick up where I left off.
[23:23:09] oooo yes
[23:23:11] totally
[23:23:15] we should have more real interns IMO
[23:23:27] Would not be sad to have someone actually spend time looking at one aspect of this work for more than a week.
[23:23:39] * halfak is spread very thin.
[23:23:47] Lots of short periods of focused attention.
[23:24:17] yeah
[23:24:21] I know the feeling heh
[23:24:43] * YuviPanda wants to finish up the tests for the HTMLSanitizer so he can move on to other things
[23:25:19] * halfak refactors revscoring languages for the Nth time to deal with old assumptions.
[23:25:27] fun!
[23:25:37] Really, I'm just making it easy to implement the next set of features.
[23:25:43] Term frequency FTW
[23:25:46] ML talk going on
[23:25:48] https://gfycat.com/UnitedImpureAlaskajingle
[23:26:36] Hyperparameter optimization!
[23:26:41] Decision function!
[23:26:45] Platt Scaling!
[23:27:01] * YuviPanda reverses the polarity of the tachyon beam
[23:27:02] Ensemble boosting
[23:27:40] Stochastic selection of features
[23:28:01] halfak: I bought some beer yesterday called 'stochastic beer'
[23:28:08] lol!
[23:28:12] Was it any good?
[23:28:16] It might not be next time!
[23:28:33] hahaha
[23:28:38] no I have no idea haven't tasted it yet
[23:28:43] I hope it isn't a dark beer
[23:28:52] I've been going to bars and asking 'give me your sweetest beer'
[23:29:08] 'multilayer perceptron'
[23:29:11] augghhhhh
[23:31:42] Math jokes <3
[23:32:13] :D
[23:33:43] ooooh. We're almost done with all of the tuning reports!
[23:34:32] Looks like GradientBoost is a clear winner.
[23:34:41] RandomForest does pretty good too
[23:34:58] bah, spark streaming requires HDFS :'(
[23:35:04] Yup
[23:35:12] https://phabricator.wikimedia.org/T100082
[23:35:24] why is that even tagged RESTBase?!
[23:35:43] Hadoop has infected everything with its javamess
[23:35:55] YuviPanda, OH! could be that we generate the diffs and store them in restbase.
[23:36:02] There's another cache for diffs, right?
[23:36:40] unrelated though
[23:37:12] Yeah.
[23:37:19] Probably not totally related.
[23:37:33] Maybe the services team wants to track requests like that for potential services. I dunno.
[23:37:55] Fun story is that we can replicate the functionality we need in a few lines of code if it's OK that we have a few parallel requests to the API at a time.
[23:39:50] I think we should JDI and see what happens :)
[23:42:41] +1.
[23:42:47] Have a good user-agent, of course
[23:43:06] And I get to play with concurrent.futures again :)
[23:43:32] Man... across the board, GradientBoost is a winner.
[23:43:51] yeah
[23:45:02] * halfak googles Mixin patterns for python
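
Editor's note: the "few parallel requests to the API at a time" idea from the end of the log, together with the "good user-agent" and concurrent.futures remarks, might look roughly like the sketch below. It is not code from the conversation. The endpoint (the English Wikipedia MediaWiki API), the User-Agent text, and the names fetch_revision and fetch_many are all assumptions made for illustration; the log does not say which API or which fields were meant.

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions (not from the log): the endpoint, the User-Agent text,
# and every function name here are illustrative placeholders.
API_URL = "https://en.wikipedia.org/w/api.php"
USER_AGENT = "diff-fetch-sketch/0.1 (contact: someone@example.org)"


def fetch_revision(rev_id):
    """Fetch one revision's JSON from the MediaWiki API (does network I/O)."""
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "revids": rev_id,
        "rvprop": "ids|content",
        "format": "json",
    })
    request = urllib.request.Request(
        API_URL + "?" + params, headers={"User-Agent": USER_AGENT}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_many(rev_ids, fetch=fetch_revision, max_workers=4):
    """Run `fetch` over rev_ids with at most `max_workers` calls in flight.

    Returns a list of results in the same order as rev_ids.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order and re-raises worker exceptions.
        return list(executor.map(fetch, rev_ids))
```

Passing the fetch function as a parameter keeps the concurrency cap (max_workers) separate from the network call, so the pooling logic can be exercised with a stand-in function before pointing it at a live API.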