[00:05:48] ow actually, you say 1 hour is the length of the session, not that 1 hour of inactivity means the session has ended, Ironholds? [00:06:02] the latter [00:06:14] ow okay. sorry. missed that [00:06:17] no problem! [00:06:25] so, hour inactivity threshold for mobile web/desktop [00:06:26] 29 minutes for apps [00:06:33] got it [00:11:18] hey halfak :) [00:11:24] Hey dude. [00:11:32] so, other than eyeballing, can you think of a good way to identify lognormal distributions? [00:11:33] Holy meetings. [00:11:44] I'm writing up a blog post on the impact of fingerprinting versus UUIDs [00:11:53] figured I might make it scientific and do actual testing to see if variation was significant. [00:11:55] Yes. There are a few different strategies for goodness of fit of a log-normal. [00:12:05] cool! [00:12:11] Oh... Hmm. One sec. [00:15:49] Ironholds, do you want to fit a mixture model for the time between events? [00:16:14] Ah, no. I mean, I could, but that's definitely a log-normal distribution [00:16:23] A single or a mixture of two? [00:16:39] mostly I want to end up with pages-per-session, sessions-per-user and session-length datasets for calculated UUIDs and EL UUIDs [00:17:04] and then be able to say "for these metrics, there's a statistically significant variation in the outcome, and fingerprints are ergo less accurate to the point where it will screw with your results" [00:17:15] which requires knowing what test to apply, and therefore what probability distribution each metric follows [00:17:30] this is frankly overkill for some things (session length is absolutely different, you can just eyeball it) but for others.. [00:20:12] t.test [00:20:38] Oh wait. You want to know if it is log-normal or not? [00:20:48] qq.plot is a good way to visualize. [00:20:52] I want to know if I should be applying t.test to the normal values, or the log-normal values, or... [00:21:08] normal values? [00:21:13] non-logged, rather [00:21:13] Raw data vs. logged? [00:21:16] Gotcha [00:21:16] yep [00:21:35] So, drop the logged data into qqnorm() [00:21:49] Look for substantial deviations from a diagonal line. [00:22:00] * halfak looks up goodness of fit tests [00:24:17] aha [00:24:25] clever! [00:28:32] OK. So you have a lot of options. I'm not sure what to suggest. [00:31:43] I'm playing around with the ks.test(), but it seems to be really unstable. [00:31:50] Will qqnorm work for you? [00:34:42] halfak, possibly? I'm not really sure [00:34:48] I can tell you that sessions per user is REALLY weird [00:34:58] almost-straight horizontal line that then goes totally diagonal [00:35:35] I'm skeptical. [00:36:08] Do you think it would be valuable if I picked up the MS dataset and performed some comparisons to replicate your work? [00:38:27] sure! [00:38:32] Also I can just send you the graphic ;p [00:39:24] YGM [00:41:12] OH! [00:41:17] The QQ plot is like that. [00:41:41] huh? [00:41:54] It looks like you have a bunch of zero/1 values before it goes generally normal. [00:42:01] oh, doy. Makes sense. [00:42:05] Can you send me a histogram? [00:42:23] sure! [00:42:26] yep, ton of zero-values [00:42:42] How do you get a zero session count? [00:43:10] you don't [00:43:11] > min(sess_per_user) [00:43:11] [1] 1 [00:43:11] > log10(1) [00:43:11] [1] 0 [00:43:30] thar's the problem [00:45:27] Ahh! Of course. [00:45:38] so, log-normal indeed? [00:45:52] So, looks like that isn't log-normal. At best it is a mixture that involves a log-normal. [00:46:03] It looks bimodal to me [00:46:34] cool! 
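A minimal sketch of the two ideas above: sessionizing events with an inactivity threshold, and eyeballing log-normality of the inter-event times via a QQ plot of logged values. Python with numpy/scipy stands in for the R calls (qqnorm, log10) used in the log; all function and variable names are illustrative, not from the actual SessionDelta code.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

INACTIVITY_CUTOFF = 60 * 60  # one hour in seconds for desktop/mobile web; 29 * 60 for apps

def sessionize(timestamps, cutoff=INACTIVITY_CUTOFF):
    """Group one user's sorted UNIX timestamps into sessions; a gap > cutoff starts a new session."""
    sessions, current = [], [timestamps[0]]
    for prev, ts in zip(timestamps, timestamps[1:]):
        if ts - prev > cutoff:
            sessions.append(current)
            current = []
        current.append(ts)
    sessions.append(current)
    return sessions

events = [0, 30, 95, 4000, 4100, 9000]  # toy timestamps for one user
print(sessionize(events))               # [[0, 30, 95], [4000, 4100], [9000]]

# QQ plot of logged inter-event times against a normal distribution: a roughly
# straight diagonal suggests the raw values are log-normal. Note the log10(1) == 0
# pitfall from the discussion: a pile-up at the minimum value shows up as a flat
# horizontal run at the left of the plot.
gaps = np.diff(events)
stats.probplot(np.log10(gaps), dist="norm", plot=plt)
plt.show()
```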
[00:47:07] that seems like a pain to run a t-test over [00:48:14] Yeah. I don't think a t-test will work the way we hope. It also wouldn't be terrible. [00:49:17] I think sessions-per-user may actually have been the one where the difference was most dramatic [00:49:36] https://github.com/Ironholds/SessionDelta/blob/master/Output/desktop_session_count.png [00:49:39] halfak, ^ [00:49:49] I think we probably don't need a t-test to say "yeah, something is fucked there" [00:50:09] and https://github.com/Ironholds/SessionDelta/blob/master/Output/mobile_session_count.png [00:50:15] Yeah. I think that's the one that makes the most sense to be off as well. [00:50:23] mind you, since it's bimodal those distributions may be totally visually useless. bah! [00:50:30] How's the total session count? [00:51:28] just the sum? [00:51:33] leila: kaldari started the test2 docs on Meta [00:51:39] good point. Don't have that to hand, but I will grab it :) [00:51:49] DarTar, yes, I'll build on them tonight and tomorrow. [00:52:12] b.t.w., they pushed the test further down the line, now the plan is to start around 11am tomorrow and decide by 4pm [00:53:21] It's a bit tight for debugging if things go wrong, but we also don't have to push to production at 4pm [00:53:35] leila: ok cool [00:53:41] I'm blocking the whole of tomorrow for this, given that it's not clear what happens when exactly [00:54:47] thanks! [05:51:50] Ironholds: are you aware of anything WMF/researchy about https://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/StanfordLinkBot ? https://meta.wikimedia.org/wiki/Research:Improving_link_coverage doesn't really say much and there are some questions on the BRFA. [05:52:27] err [05:52:28] Improving Link Coverage [05:52:28] Main contact [05:52:28] Jure Leskovec [05:52:28] Stanford University [05:52:28] Robert West [05:52:30] Stanford University [05:52:32] Leila Zia [05:52:34] Wikimedia Foundation [05:52:46] I guess, poke lzia/leila when she's online? [06:19:18] sure :P [14:51:06] YuviPanda, Hey dude. Saw your email re. quarry. [15:31:20] hey all, I need to restart some services on analytics1027 quickly (mysql is filling up the root partition). this means hive will be down for a few minutes. I don't see anyone running queries there, so this should be fine [15:31:23] it should be quick. [15:32:46] ottomata, I'm running some streaming jobs. They are tests. No worries if they go down. Should I expect them to? [15:33:59] nope [15:34:03] streaming is fine [15:34:06] kk [15:39:51] ottomata, relatedly, I'd like to grab you for a little bit today to hack on some hadoop jobs with me. [15:40:06] I'm struggling to iterate quickly on streaming jobs and I figure you'll have some pro-tips. [15:44:52] halfak: [15:44:56] i am yours all day long! [15:45:05] i have set aside today for coding hadoop stuff, especially avro/wikihadoop [15:45:07] and whatever you need. [15:45:11] just fixing up this thing [15:45:11] Wut! WOoo! [15:45:13] then gonna start on that [15:45:26] i'm first going to study your schema and try to use that to do wikihadoop -> avro [15:45:38] OK. I've got a couple bits in flight right now. I'll get them squared away in the next 30 minutes. [15:45:50] cool [15:46:04] Oooh. We should update the schema to include diffs. [15:46:06] Yes? [15:46:28] hehe, ha, HMM OK! if we must :p [15:46:35] it is so much simpler to not have them, but i can see how they would be useful :) [15:46:50] if we do, we need to talk about how the diff is represented, yes? [15:46:51] Na. Let's not have them if you think it's simpler in the short term.
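The log never says which test was eventually applied to the bimodal sessions-per-user metric discussed at the top of this hour, but the usual fallback when a t-test's normality assumption fails is a nonparametric two-sample test. A hedged sketch with made-up data (the distributions and counts are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Bimodal toy samples standing in for EL-UUID vs. fingerprint session counts.
uuid_sessions = np.concatenate([rng.poisson(1, 500) + 1, rng.poisson(8, 200) + 1])
fingerprint_sessions = np.concatenate([rng.poisson(1, 650) + 1, rng.poisson(8, 50) + 1])

# Mann-Whitney U compares the two samples without assuming normality;
# the two-sample KS test asks whether they come from the same distribution.
print(stats.mannwhitneyu(uuid_sessions, fingerprint_sessions))
print(stats.ks_2samp(uuid_sessions, fingerprint_sessions))
```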
[15:46:56] it is for sure [15:47:00] mainly because of that issue [15:47:03] Indeed. I have a format I'm already working with. [15:47:12] and, because i'd be generating the diffs in jav [15:47:13] java [15:47:33] But I'd rather get XML stuff into hadoop now than worry about diff formats. [15:47:43] aye, cool [15:47:52] btw, hive should be back up. [15:47:53] thank you! [15:48:30] OK I'll get back to my other bits for now, but I'll ping in a bit re. pair programming. [15:49:02] woot, k [15:49:22] \o/ [15:49:34] * halfak just solved one of the last remaining hadoop streaming issues. [15:49:46] Turns out that loading a virtualenv in clever ways isn't that clever after all. [15:50:09] :) [15:50:33] Bah. Sneaky meeting snuck up on me. Might be more than 30 mins before I ping again. [16:05:17] hey halfak [16:05:48] looking at your schema, i'd kinda prefer if we could avoid nested objects if possible. [16:05:59] at least for now [16:06:01] can we flatten it? [16:06:02] Hmmm.. Why's that [16:06:13] page_id [16:06:13] page_namespace [16:06:13] page_title [16:06:13] etc. [16:06:21] I think we could, but it would mean substantially longer key names. [16:06:26] ja so? :) [16:06:39] it just makes things simpler in the short term. I *think* that nested is ok for most things, but i think we might run into some issues [16:06:46] like, if we tried to map a hive table on top, we might have to get fancier [16:08:04] Yeah. I guess that flattening isn't a huge issue, but it does make me sad. [16:08:33] Really, if we are flattening, we might just switch to TSVs. [16:08:38] why? just slightly less elegant? [16:08:59] Yeah. We're cramming one data format into the constraints of another. [16:08:59] haha, naw we are switching to avro! :) [16:09:01] binary format! [16:09:04] hm [16:09:10] because the xml is nested, ja [16:09:19] and so is json [16:09:19] but, in that case, the reason it is nested is because the data is not duplicated [16:09:32] in your case, each revision record contains the page data [16:09:32] See [16:09:47] yeah [16:09:53] Not like [16:09:55] yeah [16:10:36] but still, no big deal, really, right? it'd be different if there was a collection of sub-objects somewhere in this schema [16:11:06] ok, hm. tell you what [16:11:11] Indeed. Not a big deal. [16:11:18] let's go ahead and try nested, might as well experiment now? [16:11:24] :) [16:11:27] if we run into annoyances, we can deal with it and revert [16:11:32] better now than later? :) [16:12:21] +1 [16:12:35] There are some nicer flattening schemes too. E.g. "page.id" [16:12:43] That's a style I use all of the time. [16:12:47] Needs rules though. [16:12:52] Like no "." in field names. [16:13:04] An easy one to enforce. [16:13:41] "If you put a period in a field name, we will convert it to a multibyte utf-8 char when flattening to discourage you." [16:13:51] * halfak feels a little evil [16:16:15] haha [16:19:50] halfak, redirect! has to be nested? geeeez! [16:20:23] my apartment has heating [16:20:24] I know. It sucks. It's that way in the XML. [16:20:24] * Ironholds dances [16:20:25] :( [16:20:39] ottomata, for no good reason as far as I can tell. [16:20:53] well, i mean, we are doing a transformation here reallllYYYY, we can change whatever we want [16:21:00] everything is pretty explicit anyway [16:21:16] Agreed, but I want people who think in XML to have their thoughts convert nicely. [16:21:24] I don't feel that strongly.
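A sketch of the "page.id" flattening style halfak describes, including the no-dots-in-field-names rule (minus the multibyte-character punishment). The helper is hypothetical, not code from either project:

```python
def flatten(doc, prefix=""):
    """Collapse a nested record into a flat dict with dot-joined keys."""
    flat = {}
    for key, value in doc.items():
        if "." in key:
            raise ValueError("field names may not contain '.': %r" % key)
        path = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

revision = {"id": 1234, "page": {"id": 42, "namespace": 0, "title": "Example"}}
print(flatten(revision))
# {'id': 1234, 'page.id': 42, 'page.namespace': 0, 'page.title': 'Example'}
```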
[16:21:34] * halfak checks the XML schema for any other bits in <redirect> [16:22:55] Nope. [16:22:57] Just title. [16:23:01] So. [16:23:06] I'm down for flattening that. [16:23:31] redirect_title? [16:25:08] +1 [16:27:30] your json schema maps pretty well onto an avro schema [16:27:37] slightly different structure but pretty much the same :) [16:27:48] e.g. 'int' instead of 'number' for type, etc. :) [16:28:06] arrays of fields with name properties, rather than names as object hash keys :) [16:30:41] halfak: you don't have the rev parent id in your schema? [16:31:10] Ahh! How did that happen? [16:31:28] Oh. Yes I do. [16:31:32] "parent_id" [16:31:56] oh [16:31:57] uhhh [16:32:00] i can't read apparently [16:32:01] :) [16:32:52] halfak: what about contributor ip? [16:33:13] Ahh yes. This was one of the places where I decided that the XML schema can die in a fire. [16:33:20] That'll end up in user_text. [16:33:41] haha [16:33:53] is that either one of username and ip are set, but not both? [16:33:58] in the xml? [16:34:00] It turns out that <ip> often does not contain an IP address. And the DB they are pulling from calls it user_text regardless of registered/ip status [16:34:25] well, when parsing the xml, how do you set this filed? [16:34:27] field* [16:35:27] * halfak gets source [16:35:42] i'm looking too [16:35:46] this yes? [16:35:46] https://github.com/halfak/MediaWiki-Streaming/blob/master/mwstreaming/dump2json.py [16:36:26] the ContributorType in the xsd actually has 3 fields [16:36:27] http://www.mediawiki.org/xml/export-0.10.xsd [16:36:30] username, id, ip [16:36:38] https://github.com/halfak/Mediawiki-Utilities/blob/master/mw/xml_dump/iteration/contributor.py [16:36:56] Indeed. [16:37:17] so, user_text is first username, else IP [16:37:18] Specifically, line 36 [16:37:23] Yes [16:37:30] looks like it is possible in the xml schema to have both set though, right? [16:37:36] would it be better to keep both fields? [16:38:14] Nope. [16:38:22] It's shameful that we've arrived at this question. [16:38:28] haha, yeah? [16:38:32] Since there are only two fields in MediaWiki's DB [16:38:41] id and user_text? [16:38:43] There's some silly logic in the middle. [16:38:45] Yup [16:38:47] so, never in the xml will both be set? [16:38:53] rev_user & rev_user_text [16:39:12] ottomata, that's right. Not enforced in the schema AFAICT though. [16:39:18] ok. will do what you do in the import logic then :) [16:40:25] :) [16:40:45] o/ milimetric [16:49:14] brb coffee [16:54:47] o/ danilo [16:55:01] I just responded to your thoughts re. storing model info on wiki. [16:55:40] upon re-reading, it looks like I come off as dismissive and just rejecting your ideas. [16:56:17] (1) thanks, (2) I think we should try IRC to iterate on the ideas faster :) [17:05:06] hi halfak [17:08:45] ok... I was using wiki because my English is not fluent and I am slow to understand and write in English, but yes, we can talk by IRC [17:09:11] halfak: it is so tempting to fix the schema! [17:09:12] like [17:09:16] why not 'namespace_id' [17:09:19] this is an id! [17:09:20] not a string! [17:09:21] GRRR [17:09:23] :) [17:10:03] you are already switching from 'ns' to 'namespace' [17:15:21] I am? That's bad. [17:15:25] I know what you mean though. [17:15:28] Should be ID.
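Paraphrasing the user_text logic being settled here (the real implementation is in the mw/xml_dump/iteration/contributor.py file linked above; this is the idea, not that code): the XML's username/ip split collapses back into a single rev_user_text-style field, preferring the username when one is present.

```python
def resolve_user_text(username, ip):
    # Mirrors the rule from the discussion: user_text is the username if set, else the IP.
    return username if username is not None else ip

assert resolve_user_text("Halfak", None) == "Halfak"
assert resolve_user_text(None, "192.0.2.1") == "192.0.2.1"
```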
[17:21:58] yeah, it is 'ns' in the xml [17:28:04] sorry was just getting my car towed halfak [17:28:07] it broke :( [17:28:12] danilo, brb just went into meeting [17:28:15] milimetric, :( [17:28:16] i've gotta grab lunch and I'll ping you after [17:28:42] kk [17:31:24] wheee [17:31:27] * Ironholds pops knuckles [17:31:32] I got IPv6 geolocation working [17:31:37] and it's bloody gorgeous, people. GORGEOUS. [17:32:06] alright, back in a tick. Got to pop out to the bar I left my credit card at. [17:34:00] danilo, just got out. [17:34:45] re. model storage. Will you view the links I posted in reply on the talk page and tell me what you think about "model files"? [17:36:56] halfak: yes, I agree, pickle allows more flexible data [17:37:23] Cool. :) I'm looking forward to sharing model files too. [17:38:05] E.g. someone builds a bot like ClueBot and wants to perform their own classification rather than relying on the service -- they could just install the python library and download the model file. [17:40:23] yes [17:44:50] Great! Thanks danilo. :) [17:46:24] ottomata, you're right about "ns". I should not have renamed that to "namespace". That was a mistake. [17:48:15] haha [17:48:17] awwww [17:48:20] so you want it at ns? [17:48:25] namespace_id is so much better! [17:48:26] I think that's better. [17:48:32] growl, ok. [17:48:48] In the database, it is called "page_namespace" [17:49:09] ottomata, we can still discuss this. [17:49:12] oh ok [17:49:23] namespace_id! page_namespace_id! [17:49:23] :) [17:49:48] So, for your wiki analyst, you need to know that: [17:49:54] In the database, it is page.page_namespace [17:50:00] In the XML, it is <ns> [17:50:08] And in the avro, it is page_namespace_id. [17:50:13] All of that is :(((( [17:50:29] But I don't see how we can solve it ourselves. [17:50:38] I hate to add a new inconsistency. [17:50:44] well, in your schema [17:50:46] it is already nested in page [17:50:50] at page.namespace [17:50:52] Indeed. [17:50:55] page.namespace. [17:50:59] are you going to change that? [17:51:02] now that you see 'ns' [17:51:03] ? [17:51:07] or just leave it? [17:51:27] I was going to change it, but now I feel like thinking out loud about it. :) [17:51:31] haha [17:51:50] So, the MySQL is where names come from. [17:51:52] gwicke: did you change field names for restbase xml dump import? [17:51:56] in cassandra (or whatever?) [17:52:00] Oooh [17:52:11] Wait, there's a restbase XML dump import? [17:52:20] haha yup [17:52:22] via node.js [17:53:08] oh man i have 40 mins before SoS, gotta make lunch. think about it, i will do whatever you want. I'm at the point of having a working avro schema with some simple tests. :) [17:53:22] if that gets kinda settled, i will see if I can get a job launched to convert [17:54:23] Cool. Thanks ottomata. [18:10:05] ottomata: I haven't touched it in a while [18:10:34] IIRC there were some changes in the XML format recently, so those wouldn't be in there yet [18:10:57] * gwicke also needs to run [18:11:11] gwicke, you on analytics mailing list? [18:12:27] I'll CC you on an email that summarizes the issue that is annoying ottomata and me. [18:12:51] gwicke: the recent xml format change was just ordering of fields [18:12:56] wasn't any actual content change [18:13:03] made something with xml stream parsing easier [18:13:09] at least, the one that i know about [18:13:38] gwicke: oo, i think i phrased my question wrong [18:13:55] did you change any of the field names from what the xml had to whatever you are saving them in? [18:13:59] e.g.
ns -> namespace_id [18:14:05] or anything like that [18:16:11] halfak: I usually use only python 2, in python3 how do I load that file "enwiki.rf_text.model"? pickle.load(f) is returning a unicode error [18:16:54] danilo, good question. I'll see if I can reproduce that. [18:24:29] ottomata, with great pain, I think that we should go with "page_namespace". [18:24:39] danilo, I'm installing some dependencies. [18:25:23] danilo, While I'm waiting for numpy to compile, can you look at https://github.com/halfak/Wiki-Class/blob/master/examples/classify_text.py [18:25:39] Are you loading the model the way I do on line 7? [18:25:50] OooooK! [18:26:14] ottomata, see the bikesheddy discussion I raised on the analytics list. [18:29:59] * halfak installs sklearn [18:30:08] Yikes. It's been a while since I built a model in python! [18:44:04] Ironholds: thanks for writing that note. Imagine my reaction when I look up Chris Sizemore on Google [18:45:01] hah [18:45:13] he's, IIRC, the Executive Editor for their Knowledge project [18:45:29] The chain of people goes me -> Owen Blacker -> Mo (everyone knows Mo) -> Chris [18:45:31] yes, just found him on the BBC site, very interesting [18:45:34] contacts. Contacts are important. [18:45:40] you’re da hub [18:49:09] DarTar, re uploading MOU: I'm doing it as "This file is not my work", as source I put "the WMF Legal and Research-and-Data joint work", as Authors "Manprit, you, and myself". Now why I have the right to publish it? should I go with "The copyright holder published this work with the right Creative Commons license" ? [18:49:12] What note is this? [18:49:49] leila, it is your work! You wrote it ;p [18:50:04] well, Manprit started the draft. [18:50:11] Ironholds, ^ [18:50:24] hi Ironholds. [18:51:57] leila: make it your own work [18:52:21] okay, DarTar. [18:52:42] halfak: a note to BBC (the #1 news source of citations in enwp) about inbound traffic from WP [18:52:57] we have half a million citations to BBC articles in enwiki [18:52:57] "note" as in musical? [18:53:12] Ironholds has a very musical prose [18:53:22] Where is this prose? :P [18:53:45] it’s an email he sent to some of his powerful contacts across the Atlantic [18:53:59] you call that musical? [18:54:04] Oh. A private note. [18:54:04] context: https://meta.wikimedia.org/wiki/User:DarTar/Wikimedia_dark_traffic [18:54:22] All of my good idioms and prose stylings are stolen from someone else. [18:54:29] mostly halfak, K4 and James F. [18:54:42] shakespeare is in the public domain [18:55:03] ;p. Thank you. [18:55:24] now if only my musical prose would let me work out why I can't instantiate objects [18:56:20] “Oh say can I instantiate thee” ♫ ♬ [18:56:45] "My dear, can I instantiate thee/and assign to you a static key" [18:56:57] "Else I could not call upon you - ditto were you listed 'private'ly" [18:57:20] I think I may have been dealing with classes for too long, if I can make that joke. [18:57:27] ha ha [18:57:36] Seriously, though, do we have any C++ nerds in here who can explain what the /hells/ is going on with one of my calls. [18:58:01] * DarTar looks around [19:00:06] hey, halfak used to teach C++ ;p [19:00:42] heh. You want to talk about static pointers, pointers to static values and the silliness of static pointers to static values, I'm your man. [19:01:37] halfak: sorry for my intermittency, I found my error, I missed the 'rb' in open() :S [19:01:59] halfak, actually, two days ago I had insomnia and so I read the C++ standard documentation. [19:02:01] all of it. [19:02:17] Pointers are fucking GLORIOUS; my one objection is to the context-based overloading of *. [19:02:37] (okay, problem fixed. Yay! [19:02:41] danilo, gotcha. We might want to take a filename and let the model decide how to open the file. [19:03:02] If we assume pickling, then asserting 'rb' might be fine. [19:03:26] Ironholds, yes. the "*" and how it works is a big troll. [19:03:50] * halfak remembers a question on his programming languages final about weird "*" parsing. [19:04:01] *sort of remembers [19:05:05] halfak: a pickled object is dependent on a module, wouldn't it be better to pickle only built-in objects for compatibility with other python applications? [19:05:17] yep [19:05:30] but: I get it, now. I love it. [19:05:31] It's great. [19:05:41] I've still yet to find a situation where I actually need it, but that's not the point. [19:06:43] danilo, It doesn't look like there's a module dependency to me. [19:08:26] e.g. I can load the pickled file without importing the module. [19:08:38] halfak: the pickle.load(f) returns an error "ImportError: No module named wikiclass.models.model" [19:08:57] hmm [19:09:59] hmm... Well, one of the critical things to be able to pickle is the RandomForestClassifier from sklearn [19:10:20] I don't think we are going to be able to get around that. [19:12:13] halfak: in revscores how about pickling a list of dicts, one dict for each revision? [19:12:56] That'd be a lot of data and the classifier would need to be re-trained every time you loaded it. [19:15:09] I don't think you'd ever want to load the model without the ModelClass anyway. [19:15:39] ok, but what about the raw data that can be used by any application, how will we save that? [19:16:16] when you say "raw data", are you referring to a feature set for a revision? [19:17:22] halfak: sorry, I didn't explain, I was using wikiclass only as an example, I'm talking about revscores [19:18:15] Indeed. Both use feature sets though. I'm still not sure what you mean by "raw data". [19:19:49] the params of the scored revision, e.g. whether the user is an IP, the article age, etc [19:22:30] or we can only save the rev_id and the score (damaging and good faith) [19:23:28] I'm thinking about how other developers will access the revscores data [19:29:07] I see. For the training set, I imagine that we'll publish <rev_id>, <label> pairs. [19:29:26] It would be up to other developers to decide what and how they want to extract features to predict [19:29:37] danilo, ^ [19:31:57] if those devs want to use our feature_extractor, I want that to be as EASY as possible [19:32:35] publish using pickle? or tsv? [19:33:00] tsv, I think. [19:33:28] Pickle is only good for Python programmers AFAICT [19:33:59] ok, I agree [19:34:33] :) [19:54:23] leila, DarTar: lunch? [19:54:34] sure [19:56:08] sure, halfak [20:01:25] halfak: can you find me an example of a <restrictions> element? [20:10:44] hmm, found one [20:10:53] http://en.wikipedia.org/wiki/Help:Export#Example [20:11:01] not sure how that becomes an array in your schema [20:31:10] halfak, you got some time to chat in about 20m? [20:40:45] I think halfak is eatin lunchies [20:40:47] i want to talk to him too!
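The two pickle pitfalls danilo runs into above, condensed. Plain pickle is shown for illustration (the wikiclass example script may wrap this differently); the model filename is the one quoted in the log:

```python
import pickle

# Pitfall 1: in Python 3 the file must be opened in binary mode ('rb');
# opening it in text mode produces the unicode decode error described above.
# Pitfall 2: unpickling an instance of a wikiclass model class re-imports its
# defining module, so the wikiclass package must be installed -- otherwise
# loading fails with "ImportError: No module named wikiclass.models.model".
with open("enwiki.rf_text.model", "rb") as f:
    model = pickle.load(f)
```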
[19:02:17] Pointers are fucking GLORIOUS; my one objection is to the context-based overloading of *. [19:02:37] (okay, problem fixed. Yay! [19:02:41] danilo, gotcha. We might want to take a filename and let the model decide how to open the file. [19:03:02] If we assume pickling, then asserting 'rb' might be fine. [19:03:26] Ironholds, yes. the "*" and how it works is a big troll. [19:03:50] * halfak remembers a question on his programming languages final about weird "*" parsing. [19:04:01] *sort of remembers [19:05:05] halfak: pickled object is dependent from a module, shoud't it be better pickle only built-in objects for compatibility with other python applications? [19:05:17] yep [19:05:30] but: I get it, now. I love it. [19:05:31] It's great. [19:05:41] I've still yet to find a situation where I actually need it, but that's not the point. [19:06:43] danilo, It doesn't look like there's a module dependency to me. [19:08:26] e.g. I can load the pickled file without importing the module. [19:08:38] halfak: the pickle.load(f) return a error "ImportError: No module named wikiclass.models.model" [19:08:57] hmm [19:09:59] hmm... Well, one of the critical things to be able to pickle is the RandomForrestClassifier from sklearn [19:10:20] I don't think we are going to be able to get around that. [19:12:13] halfak: in revscores how about pickle a list of dicts, one dict for each revision? [19:12:56] That'd be a lot of data and the classifier would need to be re-trained every time you loaded it. [19:15:09] I don't think you'd ever want to load the model without the ModelClass anyway. [19:15:39] ok, but and the raw data that can be used by any application, how will we save them? [19:16:16] when you say "raw data", are you referring to a feature set for a revision? [19:17:22] halfak: sorry, I didn't explaing, I was using the wikiclass only as a exemple, I'm talking about revscores [19:18:15] Indeed. Both use feature sets though. I'm still not sure what you mean by "raw data". [19:19:49] the params of the scored revision, e.g. the user is IP, the article age, etc [19:22:30] or we can only save the rev_id and the socore (damaging and good faith) [19:23:28] I'm thinking how other developer will can access the revscores data [19:29:07] I see. For the training set, I imagine that we'll publish , . [19:29:26] It would be up to other developers to decide what and how they want to extract features to predict [19:29:37] danilo, ^ [19:31:57] if those devs want to use our feature_extractor, I want that to be as EASY as possible [19:32:35] publish using pickle? or tsv? [19:33:00] tsv, I think. [19:33:28] Pickle is only good for Python programmers AFACIT [19:33:59] ok, I agree [19:34:33] :) [19:54:23] leila, DarTar: lunch? [19:54:34] sure [19:56:08] sure, halfak [20:01:25] halfak: can you find me an example of a element? [20:10:44] hmm, found one [20:10:53] http://en.wikipedia.org/wiki/Help:Export#Example [20:11:01] not sure how that becomes an array in your schema [20:31:10] halfak, you got some time to chat in about 20m? [20:40:45] I think halfak is eatin lunchies [20:40:47] i want to talk to him too! 
[20:41:06] ottomata: I just caught up on your conversation with him [20:42:19] what I'm trying to reason through is how the system would look when it's up and running [20:43:29] this initial diff generation and analysis would happen, then there would be a stream of revisions that would come in and update the analysis [20:43:49] and once in a while, we'd have a rewrite of the algorithm or logic, and we'd have to run the whole thing again. [20:44:30] so I'm looking for solutions to two problems [20:44:52] 1. resilience - what happens when some part of the computation fails? how do we retry [20:45:31] 2. organization and ordering of the data coming in - we could miss pieces of the stream, how do we get them, what do we do when we can't order the revisions we get in a chain, etc. [20:48:45] uhhhhhhh [20:48:57] i'm not fully sure what you are talking about :) [20:49:06] i'm just trying to get the data into a better format for more efficient parsing [20:49:10] milimetric: ^ [20:49:59] :) sorry [20:50:43] so - ultimate goal: Editor knows how much value they added to Wiki*, as abstracted into their WikiCredit measure [20:51:33] to make this happen, we look at our editor's work as a stream of revisions and we process that stream. Aaron has written a lot of that code, but it's time to productionize it and make it maintainable [20:51:44] so that's what I'm thinking through [20:57:30] aye [21:06:51] Hey guys. Just got back. [21:07:00] ottomata, milimetric ^ [21:07:07] * halfak reads scrollback [21:08:01] halfak: my question for you is about how restrictions get turned into an array [21:08:06] since it is just a string in the xml [21:08:12] Or, hm, is it? [21:08:18] maybe it is multiple elemtns? [21:08:28] That might be. Let me dig. [21:09:14] Yup. It looks like you can have multiple restrictions. [21:09:21] When i see multiple tags, I append to the list. [21:09:30] https://github.com/halfak/Mediawiki-Utilities/blob/master/mw/xml_dump/iteration/page.py#L101 [21:10:47] "" from the schema. [21:11:09] Unlike which has a maxOccurs=1 [21:11:13] This one doesn't have a max. [21:11:22] So I assumed array. [21:11:57] * halfak goes back to reading milimetric's stuff [21:13:15] halfak / ottomata that field is gross. From the page table's docs: Comma-separated set of permission keys indicating who can move or edit the page. Edit and move sections are separated by a colon (e.g., "edit=autoconfirmed,sysop:move=sysop"). [21:14:21] i'm not sure how the dumps handle that (if they read from the restrictions table for the new versions of mw and from the page_restrictions field for the old versions). [21:14:26] Gotta run to a meeting. I'll be back in two hours. [21:14:36] but in any case - there can be more than one per page [21:14:39] milimetric, I'll send you some performance stats via email in a few minutes. [21:14:58] halfak: thanks [21:24:44] halfak: btw, celery might be a bad idea from an ops point of view (too many things to maintain) [21:35:28] halfak, when you're back, question for ya [22:18:08] ottomata: did you want to talk about my thoughts on this system? Otherwise I'll continue to get immersed in Aaron's code [22:19:11] DarTar, can I convince you to sign up for a 8:30-9am Hangout tomorrow morning? [22:24:14] ewulczyn, hive question for ya [22:24:33] is it possible to outer join views? 
;) [22:24:50] halfak, when you're back, I'd like to talk the pageviews implementation when you have some free time [22:29:48] milimetric: uhh, i want to keep avro hacking [22:29:56] good [22:30:01] later then [22:30:35] i think figuring out what avro can do is very important [22:40:25] +Ironholds: it should be, logically a view behaves like a table [22:41:21] halfak: around? [22:49:45] ewulczyn, cool! [23:00:26] hey halfak, lemme know when you are back too! [23:12:06] milimetric: [23:12:29] do you know how aaron is running this stuff in hadoop streaming right now? [23:26:04] laters all! [23:26:52] halfak: for your perusal: [23:26:52] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/xmldumps [23:26:57] if you play iwth this more tonight [23:27:15] see if you can use hadoop streaming with the avro file at /user/otto/dump_medium/avro/1418243706-snappy [23:27:15] in hdfs [23:34:23] mako, sorry, the meetins continue. [23:34:30] halfak: i'm in a meeting now :) [23:34:32] Will be back in 20 min [23:34:38] ha! [23:34:38] halfak: cool, talk soon! [23:34:48] ottomata, thanks. will do [23:34:59] Ironholds, ^ (20 min) [23:35:10] Well, I'm gone in 20 minutes [23:35:12] ;p [23:35:12] halfak: i want to talk through a few prospective patches to mediawiki-utilities :)