[15:03:41] SHIT
[15:03:46] it's CHI conditional acceptance day
[15:22:25] morning tnegrin :)
[15:22:44] hi Ironholds
[15:23:02] have not had coffee
[15:25:48] heh. Noted; will avoid nattering :)
[15:30:34] ahh, brother-in-law gave me some amazing beans. better now
[15:30:59] saw the session tweet - that’s pretty much it then. as users move to mobile they use WP less
[15:32:00] hum? Oh, no. Still digging into that
[15:32:20] in fact, the mobile sample has more sessions per user than the desktop one
[15:32:32] the tweet interested me because of the distribution: all the peaks and valleys are in the same place
[15:32:44] which implies that mobile and desktop users naturally bucket into the same classes, in terms of sessions-per-user.
[15:33:02] oic - what does the density in the plot mean?
[15:33:34] proportion of users at that value. It's log scaled, hence why the value is 1.5, 1.2, etc
[15:33:47] and we see the same peaks! Which is WEIRD.
[15:33:58] I've thrown it at Aaron in case I've committed some egregious stats violation
[15:34:54] (trying to work out a good way of checking for statistical significance for this sort of data. It's bimodal. Bleh.)
[15:35:58] I’d love to get this by abby and have her do some qual research
[15:36:15] just talk to people and see how they use mobile and desktop
[15:36:21] where are the users btw?
[15:36:27] worldwide or just US?
[15:36:28] geographically?
[15:36:31] y
[15:36:31] worldwide
[15:36:48] you can break down connections by type right? :)
[15:36:51] the dataset is prrrobably not big enough to let us do much per-country, but we should look into that when we have UUIDs
[15:37:04] actually, not for this data :(. Analytics Engineering did their privacy due diligence and hashed the IPs
[15:37:05] I seem to remember paying for something
[15:37:11] ok - good for us
[15:37:12] this is the January EventLogging set
[15:37:51] ok - have to go - enjoy the weekend
[15:38:00] take care!
[16:28:54] hey halfak!
[16:29:24] hey Ironholds
[16:29:30] how goes?
[16:29:34] Just responded to your linear regression mail.
[16:29:38] Good!
[16:29:45] * Ironholds hits F5
[16:29:47] grand!
[16:29:54] I've spent all day doing dip tests
[16:29:57] I took yesterday to hang out with Jenny and it was nice.
[16:30:00] I could do with not ever seeing a dip test ever again
[16:30:02] Today is revscores day.
[16:30:04] nice :)
[16:30:12] I spent yesterday writing up a dive into the mobile data
[16:30:19] the TL;DR is that the global numbers lie
[16:30:32] several projects and countries have either already hit the mark where 50% of traffic is from mobile, or are about to.
[16:30:45] That makes sense to me.
[16:31:05] totally. What's interesting is the countries and projects.
[16:31:23] Japan(ese), Indonesia(n), Korea(n), makes sense
[16:31:26] and then the UK and Italy
[16:31:29] and then Canada
[16:31:38] it's really cool
[16:32:54] :)
[16:33:05] * halfak scopes out your email about session counts.
[16:33:08] say, do you have any thoughts on how to do significance tests and visualisations on bimodal distributions? I spent some time doing a dip test on the bimodal sessions-per-user data, and it looks like it's definitely bimodal - but if I'm understanding the test (Hartigan's dip test statistic) right, not enough to be a tremendously big deal
[16:33:59] value is 0.07 or 0.7; I don't have it off hand. But it's 0.*7, where >0.1 indicates serious multimodality and <0.*5 indicates unimodality.
[16:34:30] so I've been cautiously treating it as a log-distribution for the purposes of testing, but not hitting "publish" until I've checked I ain't an idiot.
[16:36:25] Ironholds, I'm not sure what to do with that one. If you want to model it Right(TM), I am the wrong person to help. I tend to get in trouble with stats people for my shortcuts.
[16:36:38] I'd like to poll Ellery and Leila.
[16:36:51] hi halfak, I have just discovered you have already started the code to extract the revision features https://github.com/halfak/Revision-Scoring , I thought this part had not been done yet... I'm feeling a little lost: where are we at? Where can I help in the project?
[16:36:54] I suspect they might know standard practice here.
[16:37:23] halfak, cool! Will wait on them then; thanks :)
[16:37:35] danilo, good Q. I think the trick is to try to turn work items into cards and then discuss them during our sprint planning.
[16:37:53] Ironholds, sorry to not help more. I'm looking forward to the discussion though :)
[16:38:07] halfak, dude, it's fine! You help a tremendous amount :)
[16:38:12] unrelated, it's CHI conditional acceptance day
[16:38:18] danilo, if you give me a few more minutes to dig through the email I missed, I'll have a look at the system and trello board before coming back with proposals.
[16:38:29] Ironholds, woo!
[16:38:33] My paper with Shilad, Heather et al got in, and Shilad seems pretty certain it'll be an honourable mention, but I haven't heard anything from Scott and it's driving me CRAZY.
[16:38:50] but worst-case there is I actually get the time to expand it to a full article and resubmit, so eh.
[16:38:58] ok
[16:41:31] Ironholds, cool!
[16:41:45] Make sure you get the author version of the Shilad/Heather one online ASAP.
[16:41:49] And tweet about it.
[16:41:52] I wanna see.
[16:42:56] shall do!
[16:43:05] also, re your last email, I'm doing some power law experimentation now
[16:43:13] (i.e., what happens if we log-log the dataset)
[16:45:17] You can do fitness tests on the data too.
[16:45:43] http://stat.ethz.ch/R-manual/R-patched/library/MASS/html/fitdistr.html
[16:45:52] (for non-mixture models)
[16:45:56] neat!
[16:45:58] of course MASS has it
[16:46:03] :)
[16:47:33] oooh
[16:47:35] HAH
[16:47:42] halfak, the WP article on power laws genuinely has R code embedded in it.
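A minimal sketch of the "log-log the dataset" idea mentioned above, written in Python rather than R so it matches the revscores code. It counts how often each sessions-per-user value occurs and fits a straight line to log10(frequency) against log10(value). The function name and the sample data are hypothetical, and a least-squares fit on log-log counts is only a rough first look, not a proper goodness-of-fit test.

```python
import math
from collections import Counter


def loglog_slope(values):
    """Least-squares slope of log10(frequency) against log10(value).

    A slope near -a is what a power law with exponent a would produce.
    """
    counts = Counter(v for v in values if v > 0)
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive values")
    xs = [math.log10(v) for v in counts]
    ys = [math.log10(c) for c in counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical sessions-per-user sample whose frequencies fall off as value**-2.
sample = [1] * 1600 + [2] * 400 + [4] * 100 + [8] * 25
slope = loglog_slope(sample)
```

On real data the points will not sit on a line this cleanly; a visibly bent or two-humped log-log plot is itself a hint that a single power law is the wrong model.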
[16:47:47] ...this is a ridiculous world
[16:49:05] :) R is becoming a bit of a standard.
[16:49:23] well, you know
[16:49:38] it's the only programming language with C-optimised session reconstruction algorithms ;)
[16:49:44] also, Yuvi is volunteering to build a diff generator
[16:49:55] In R.
[16:50:37] * halfak withholds comment on the silliness
[16:50:39] oh, no. He wants to write C or C++, and he also wants to write a diff generation service
[16:50:42] (sort of)
[16:50:54] I suggested he might look at making diff-match-patch, which is Google's implementation of Myers' algorithm, platform-agnostic
[16:51:06] and then we can run it as a streaming service and also port it into various languages for ad-hoc shit.
[16:51:09] Oh!
[16:51:16] I dunno what he's actually going to do, but that was my pitch ;p
[16:51:20] Indeed. I'm working with milimetric to do that right now.
[16:51:28] no way!
[16:51:31] YuviPanda, read up when you're awake.
[16:51:35] That's what the Diffengine is.
[16:51:52] what's it written in?
[16:52:08] Python. It's trivial to keep up with Wikipedia in python
[16:52:20] No performance issues there that warrant C/C++.
[16:52:21] cool!
[16:52:29] yeah, I think he more meant for calculating back in time, as well.
[16:52:34] We're looking for a framework that would make abstraction of this stream processing work easy.
[16:52:37] * Ironholds nods
[16:52:42] Yeah. That's my work in hadoop.
[16:54:04] Or rather, that's why I am working in hadoop.
[16:54:15] To generate the old stuff -- and to do it fast.
[16:54:19] yeah
[16:54:31] sessions-by-user-versus-power-law graph and values coming atcha, btw
[16:54:35] It turns out that diff algorithm matters for far more important reasons than performance.
[16:54:37] I need to throw this data up somewhere already :/
[16:54:39] oh?
[16:54:51] See Flock and Alfaro's last few WWW papers.
[16:55:08] The stuff I presented about in the WikiCredit presentation.
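For a feel of what a token-level revision diff looks like in Python, here is a small sketch using the standard library's `difflib.SequenceMatcher`. Note that this is not Myers' algorithm and not the Diffengine or diff-match-patch code discussed above; `token_diff` is a hypothetical helper, and whitespace splitting is a deliberately naive tokenizer.

```python
import difflib


def token_diff(old_text, new_text):
    """Return the non-equal operations that turn old_text into new_text.

    Each operation is (tag, removed_tokens, added_tokens), where tag is one
    of "replace", "delete" or "insert" as reported by SequenceMatcher.
    """
    old_tokens = old_text.split()
    new_tokens = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    operations = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged run of tokens, nothing to record
        operations.append((tag, old_tokens[i1:i2], new_tokens[j1:j2]))
    return operations


ops = token_diff("the quick brown fox", "the slow brown fox jumps")
# ops == [("replace", ["quick"], ["slow"]), ("insert", [], ["jumps"])]
```

Different algorithms can describe the same edit differently (for example as a move versus a delete plus an insert), which is one reason the choice of algorithm matters beyond raw speed.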
[16:55:38] halfak: hey
[16:56:05] I didn’t quite volunteer to build the algorithm itself, since I suck at them and would probably fuck up anyway :)
[16:56:17] Ironholds, start here and page forward: https://commons.wikimedia.org/w/index.php?title=File:WikiCredit_(Wikimania%2714_presentation_slides).pdf&page=78
[16:56:25] cool!
[16:56:26] * Ironholds reads
[16:56:32] that doesn't exist
[16:56:35] YuviPanda, we should talk with milimetric about streaming event systems.
[16:56:48] oh, the link breaks. Damn you xchat.
[16:56:52] I have Storm on my to-do list.
[16:56:55] yeah
[16:57:04] wait, *Storm*? As in Apache Storm?
[16:57:09] Yeah.
[16:57:21] hmm,
[16:57:23] We were also looking at Celery -- which would work, but would not explicitly support the abstraction.
[16:57:26] for what? EL data or rcstream data?
[16:57:40] RCStream isn't a very good source.
[16:58:22] halfak: yey! revscores day!
[16:58:23] Since it can drop events and can't produce historic events.
[16:58:28] o/ Helder
[16:58:42] right, so you’d need to combine it with something else
[16:58:57] Yeah. I'd rather just poll an oracle honestly.
[16:59:02] E.g. api or DB
[16:59:09] Over RCStream.
[16:59:16] Lots of good guarantees then.
[16:59:24] Which is a shame.
[16:59:25] you REALLY love that science bucket image
[16:59:31] I do.
[16:59:51] * YuviPanda installs oracle dbs everywhere
[16:59:56] :P
[17:00:04] halfak: polling the API makes me cringe, however.
[17:00:16] Sure, but you can be positive you didn't miss an event.
[17:00:23] And it is easy to rewind and re-play.
[17:01:13] hmm
[17:01:16] querying the db itself might not be the worst of ideas
[17:01:27] definitely less terrible than hitting the API
[17:01:35] anything is less terrible than our API
[17:01:48] It would be great if we had something that transitioned naturally from polling to listening with good guarantees.
[17:02:11] YuviPanda, either way, API is the only source of content to diff.
[17:02:22] Gonna end up polling it anyway.
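The "poll an oracle" argument above (no missed events, easy to rewind and re-play) can be sketched with a cursor that only advances past events that were actually returned. Everything here is hypothetical: `InMemoryOracle` stands in for the API or DB, and the event shape (`{"id": ...}`) is an assumption, not the MediaWiki API's format.

```python
class InMemoryOracle:
    """Hypothetical stand-in for an API or DB that can list events by id."""

    def __init__(self, events):
        self._events = sorted(events, key=lambda event: event["id"])

    def events_after(self, last_id, limit=50):
        """Return up to `limit` events with id greater than `last_id`."""
        newer = [event for event in self._events if event["id"] > last_id]
        return newer[:limit]


def poll_once(oracle, cursor, limit=50):
    """Fetch one batch and return it with the advanced cursor."""
    batch = oracle.events_after(cursor, limit)
    new_cursor = batch[-1]["id"] if batch else cursor
    return batch, new_cursor


def replay(oracle, start_id=0, limit=50):
    """Yield every event after `start_id`, in order, batch by batch."""
    cursor = start_id
    while True:
        batch, cursor = poll_once(oracle, cursor, limit)
        if not batch:
            return
        yield from batch


oracle = InMemoryOracle([{"id": i, "type": "revision"} for i in range(1, 6)])
all_ids = [event["id"] for event in replay(oracle, start_id=0, limit=2)]
rewound_ids = [event["id"] for event in replay(oracle, start_id=3, limit=2)]
```

Because the cursor is just the last id seen, rewinding is a matter of starting `replay` from an older id, which a push-only stream that drops events cannot offer.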
[17:02:32] Even if it isn't the source of events.
[17:02:50] well, technically if you’re doing this inside the cluster you can just hit externalstorage and diff it yourself :)
[17:03:15] Someone needs to show me this magical thing.
[17:03:25] Because I'll totally do that instead of hitting the API.
[17:03:52] heh :)
[17:03:57] YuviPanda, regardless, I want to hide whatever magic we come up with behind this: https://github.com/halfak/MediaWiki-events
[17:04:01] halfak: it gives you full content, I believe.
[17:04:04] right
[17:04:43] See https://meta.wikimedia.org/wiki/Research:Ideas/MediaWiki_events:_a_generalized_public_event_datasource for a description of events I've been working on.
[17:05:10] Note that many event fields link to bugs.
[17:05:42] * halfak pulls changes to revscores.
[17:05:49] yeah, I remember reading it when you started working on it
[17:05:55] okay, reading Myers' paper has given me a headache. Bah.
[17:05:58] before you got swallowed by meetings
[17:06:01] * Ironholds steals the guts of difference-engine instead
[17:06:08] heh YuviPanda
[17:06:46] Ironholds, if you are looking for diff algorithms, see http://pythonhosted.org/deltas/
[17:07:02] neat!
[17:07:08] I think mostly I'm not mathy enough to Get It
[17:07:17] I tend to do most of my maths by basically human bootstrapping
[17:07:20] I spent a couple of days solidifying Magnus’ WikiDataQuery. that was nice
[17:07:27] pen, paper, and "okay, think of aaall the possible outcomes and inputs"
[17:07:37] YuviPanda, got it to be nicely distributable?
[17:08:04] halfak: kind of, got it to be stateless so I could put a LB in front
[17:08:18] LB?
[17:08:23] load balancer
[17:08:35] danilo, I think we should work together on the Scorer stuff.
[17:08:37] which also does health checks, so if a machine is down it just routes around it until it's back up
[17:08:43] That would build off our discussion from last week.
[17:09:07] YuviPanda, gotcha.
[17:09:13] How many querying machines are there?
[17:14:22] hmm?
[17:14:27] halfak: querying machines?
[17:14:57] halfak: oh, wdq? wdq.wmflabs.org is still the old one, I’m waiting for magnus to get back from travelling to make it point to the load balancer, which is at wdq-lb.wmflabs.org, so that will load balance between two machines
[17:15:19] Yeah. I assumed the architecture was like this --> [Load balancer] --> [Query machines 1,2,3,...]
[17:15:32] Gotcha
[17:15:56] halfak: my English is not good, what do you mean by Scorer stuff?
[17:17:04] danilo, the design decisions we were discussing last week regarding module files (pickled/serialization) and strategies for incorporating multiple different classification strategies.
[17:17:53] ah, ok
[17:19:08] danilo, https://trello.com/c/VluhMT1h/9-implement-scorer-design
[17:19:31] I think that it would be good if we started with writing code for the scorer interfaces.
[17:19:51] Then I can go get us some sample data while you look into implementing a scorer. Sound good?
[17:24:27] * halfak gets to work on fleshing out the interface classes.
[17:24:32] halfak: implement a scorer using https://github.com/halfak/Revision-Scoring ? or start separate code?
[17:24:49] danilo, yeah. Within that project.
[17:25:10] It would be nice if we could pair program for a bit. I'm looking for some good options.
[17:25:19] halfak: where will we implement the application, tool labs?
[17:25:44] Probably our own project within labs.
[17:25:56] http://pairjam.com/#ti59ms
[17:26:00] ^ danilo
[17:26:08] You can find me there. We can program together.
[17:28:32] halfak: this link is cool! :-)
[17:28:37] :)
[17:28:57] I've heard about pair programming before, but didn't get a chance to see it in action
[17:29:20] Heh. Usually we'll incorporate some audio. Let's see if I can get that going.
[17:29:54] I just turned on my audio, but I'm not sure if you guys can hear me.
[17:30:54] halfak: yes
[17:31:01] cool!
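One possible shape for the "scorer interfaces" mentioned above, sketched as a Python abstract base class so that different classification strategies can sit behind the same `score` call. None of these names come from the Revision-Scoring repository: `Scorer`, `ThresholdScorer`, `MeanScorer` and the `chars_added` feature are all hypothetical, and a real scorer would wrap a trained model rather than a fixed threshold.

```python
from abc import ABC, abstractmethod


class Scorer(ABC):
    """Hypothetical common interface: map extracted features to a score."""

    @abstractmethod
    def score(self, features):
        """Return a float score for a dict of feature name -> value."""


class ThresholdScorer(Scorer):
    """Toy strategy: score 1.0 when one feature reaches a threshold."""

    def __init__(self, feature_name, threshold):
        self.feature_name = feature_name
        self.threshold = threshold

    def score(self, features):
        return 1.0 if features[self.feature_name] >= self.threshold else 0.0


class MeanScorer(Scorer):
    """Combine several strategies by averaging their scores."""

    def __init__(self, scorers):
        if not scorers:
            raise ValueError("MeanScorer needs at least one scorer")
        self.scorers = list(scorers)

    def score(self, features):
        return sum(s.score(features) for s in self.scorers) / len(self.scorers)


features = {"chars_added": 250, "badwords_added": 0}
combined = MeanScorer([
    ThresholdScorer("chars_added", 100),
    ThresholdScorer("badwords_added", 1),
])
result = combined.score(features)
```

Keeping the interface this small means a pickled or otherwise serialized model can be swapped in later without callers changing.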
[17:31:02] I can hear you after clicking on a mic icon
[17:31:37] does "pair" programming scale to more than two people? o.O
[17:32:11] Yes
[17:32:12] :)
[17:34:13] it was me
[17:34:15] sorry
[17:34:38] https://github.com/halfak/Difference-Engine
[17:34:39] I was trying to find a way to hide the sidebar
[17:40:21] I'm here
[17:40:27] yeah
[17:40:28] :-)
[17:44:06] https://trello.com/c/lUls4wuc/16-test-simple-mlscorer
[17:49:16] https://en.wikipedia.org/wiki/Feature_extraction
[17:49:33] halfak: I think it is good to say that I'm not a professional programmer, I can't program fast, I still need to study your code before I can collaborate on it
[17:50:15] No worries. Even if we were on the same skill level, there's a lot of assumptions that I've made that we'll want to re-think through.
[17:50:35] I'm not sure if danilo is hearing too
[17:50:56] danilo, if you can join audio in pairjam, I'm narrating my thoughts at Helder :)
[17:51:16] That might help us work through my assumptions.
[17:51:19] :)
[17:51:39] where do I activate the sound?
[17:51:44] in the bottom
[17:51:49] there is a video icon
[17:51:58] and then a menu with a microphone icon
[17:52:52] danilo let me know when you can hear me
[17:53:40] I can't find where to turn the sound on
[17:54:18] Bummer.
[17:54:21] danilo: on the right side of the "halfak" item in the menu which appears when you click on the video icon
[17:54:34] (but it is a little confusing, yeah)
[17:56:08] I have already done this, but it does not work, maybe it is some bug with my browser...
[17:56:54] What browser?
[17:56:58] I'm in chrome and it is working for me.
[17:57:23] firefox 34
[17:57:50] for what it's worth, I'm not hearing anything right now (halfak is quiet)
[17:58:14] Can you hear me right now?
[17:59:02] ah,
[17:59:03] ok
[17:59:11] the icon needs to be green :-)
[17:59:31] I think I clicked on it and turned it gray by accident
[17:59:35] the mic icon
[18:00:36] yes
[18:00:43] I can hear you now
[18:01:59] it's green for me, and I'm using chrome now, but I can't hear
[18:02:37] Can you hear other things?
[18:02:38] halfak: I activated the audio here, and an icon appears at the top of firefox
[18:04:09] halfak, unrelated to this thread, a twitter feed you will enjoy: https://twitter.com/LegoAcademics/status/507898512659714048
[18:04:21] yes, I tested by watching a video on youtube, I can hear the youtube video, but not the pairjam
[18:04:25] oh, you already follow it! Hah
[18:06:21] :)
[18:09:39] Still no luck danilo?
[18:09:52] We could switch to a google hangout/skype or whatever
[18:09:59] halfak: now I can hear you
[18:10:05] cool :-)
[18:11:09] https://en.wikipedia.org/wiki/Dependency_injection
[18:26:57] dammit
[18:27:08] when someone writes a replacement for R, can they remove factors as a class please?
[18:27:22] they are STUPID and I spend an annoying amount of time working around them
[18:31:53] halfak: I lost your audio =/
[18:31:59] I closed the tab by accident
[18:32:06] Boo. I still hear something.
[18:32:07] and it is mute now, after reopening it
[18:32:24] Try again
[18:33:19] =/
[18:33:42] I can hear you
[18:33:44] Helder
[18:34:04] Can you see me?
[18:41:09] :-)
[18:52:44] halfak: disappeared
[18:52:49] maybe pressed F5?
[18:52:53] Try clicking mic again
[18:53:09] nothing =/
[18:53:10] I can hear you still
[18:53:14] Can you see my video
[18:53:20] nope
[18:57:23] I'm hearing Helder
[18:57:26] Are you hearing me?
[18:57:31] it is muted again =/
[18:57:36] * halfak shares fist at pairjam
[18:57:38] disable/enable it
[18:57:40] *shakes
[18:58:05] weird... it is green but with no sound
[19:05:03] Helder: https://github.com/halfak/Revision-Scoring/pull/10
[20:03:27] Helder, just got back
[20:03:31] danilo, still around?
[20:03:41] hey!
[20:03:46] yes
[20:03:51] hexchat notifications work :-)
[20:03:52] I'm in pairjam.
[20:13:13] http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
[20:20:31] goddammit hive!
[20:20:34] * Ironholds weeps
[20:20:39] I've written up the pageviews hive query
[20:20:41] and it won't run.
[20:39:12] halfak, what's our approach to interviews?
[20:39:18] Andrew Lih wants to talk to me about the mobile shift.
[20:40:52] Contact comms. Let them know that there's an open interview request.
[20:41:03] kk
[21:19:11] halfak: how can I install revscores to test? I did "pip install nltk nose deltas pytz mediawiki-utilities" but "import mw" returns "ImportError: No module named 'mw'"
[21:19:59] danilo, which version of python?
[21:20:23] 3.3.0
[21:20:58] pip install mediawiki-utilities should get you "import mw"
[21:21:42] Hmmm. Did pip throw an error?
[21:23:24] pip installed in /usr/lib/python2.7/site-packages , how can I tell pip to install it for python 3?
[21:28:40] I'm installing python3-pip package
[21:33:29] I installed using "pip-3.3 install nltk nose deltas pytz mediawiki-utilities" and now it works :)
[21:34:17] danilo: I think I used
[21:34:18] https://docs.python.org/3/library/venv.html
[21:34:26] for working with python 3
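The mix-up above (a `pip` that installs into Python 2.7 while the code runs on Python 3) can be avoided by invoking pip through the interpreter that will do the importing. A small sketch, where `pip_install_command` is a hypothetical helper that only builds the command; running it is left to the caller.

```python
import sys


def pip_install_command(packages):
    """Build a pip command bound to the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", *packages]


command = pip_install_command(
    ["nltk", "nose", "deltas", "pytz", "mediawiki-utilities"]
)
# To actually install, pass the list to subprocess.run(command, check=True).
```

Inside a virtual environment created with the `venv` module linked above, `sys.executable` points at that environment's interpreter, so the same command installs into the environment rather than the system site-packages.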