[00:00:13] I did some research on the evolution of referenced articles in the Portuguese Wikipedia: https://commons.wikimedia.org/wiki/File:Ptwiki_references_in_articles.png I showed it at the ptwiki village pump, and a user there suggested I publicize it more. Where is the best place to share research?
[00:00:55] maybe https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ?
[00:01:06] danilo: ^
[00:03:26] danilo, if you want to do a writeup, see the "Add your project" form at http://meta.wikimedia.org/wiki/Research:Projects
[00:03:50] meta is multilingual, so your writeup can be in whatever language you like.
[00:04:06] We might need to fix some templates so that they work for you. I'd be happy to look at that with you.
[00:09:33] Helder_, halfak: ok, I will try to publicize it in both places. If I have any doubts I'll ask here, thanks
[00:11:18] :) cool graph!
[00:16:57] danilo, by any chance did you use the XML dump processor in mediawiki-utilities?
[00:20:08] halfak: y'know, we never emailed any mailing list about Quarry :)
[00:20:33] YuviPanda, you're right. Do you still have the link to the etherpad with a draft email?
[00:20:46] halfak: sadly I do not...
[00:25:52] * halfak types guesses into etherpad
[00:26:19] halfak: I tried it, but I had another script for searching the current-revisions dumps that I adapted to search the history dumps. That was a little simpler for me, so I used my script.
[00:27:12] I understand. That makes sense. I'd appreciate any feedback you have about how to make it easier to use.
[00:27:25] :)
[00:28:18] /Wild MEETING appeared/
[00:41:18] halfak: this is the script I used: https://gist.github.com/danilomac/3226bccb156c07c8f4c6
[00:44:00] I bet it's a little faster than the xml_dump parser I built. :) What happens if a tag exists at one of the 10MB boundaries?
[00:49:25] sorry, I didn't understand the question
[00:54:53] halfak: what do you mean by "tag exists at one of the 10MB boundaries"? (my English is bad)
[00:57:42] ah, I understood
[00:58:07] It looks like you read the file 10MB at a time and then search for XML tags using a regular expression.
[00:58:23] * halfak was reading code and typing rather than looking at IRC :S
[01:01:00] halfak: the line "del buf[0:tag.end()]" deletes the previous buffer only up to the end of the last match, before extending it with the next 10MB
[01:01:18] Ha!
[01:01:56] What if you have no tag in the buffer? There's a revision in English Wikipedia that is > 32MB. O.O
[01:02:16] Oh! I suppose it will just load in another buffer's worth.
[01:07:41] yes, I suppose so too, but I didn't test it
[01:19:32] this script takes 6 hours to read the whole history dump of ptwiki
[01:27:33] what date was the dump?
[01:28:18] Fun story. Wikihadoop is an order of magnitude slower than raw JSON.
[01:29:16] wow
[01:30:53] So, converting to JSON before hadoop is important.
[01:32:42] ...and I have diffs for simplewiki :)
[01:32:49] too bad tomorrow is meeting day.
[01:32:57] the science will need to wait.
[01:35:05] halfak: danilo used the dump "ptwiki-20141122-pages-meta-history", per https://commons.wikimedia.org/w/index.php?diff=142212582
[01:35:14] halfak: the script always reads the latest dump available in Tool Labs; the graph was made with data from the 20141122 dump
[01:35:31] Gotcha.
[01:36:09] I want to run a performance test now. I'm curious how much is lost due to using a full XML parser :)
[01:44:18] halfak, ya just need an island grammar!
[01:44:29] https://en.wikipedia.org/wiki/Island_grammar
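A minimal sketch of the chunked-scan approach described above ([00:58:07], [01:01:00]): read the dump 10MB at a time into a buffer, pull out complete tags with a regex, and delete only the consumed prefix so a tag straddling a chunk boundary survives into the next iteration. This is not danilo's gist; the `<page>` pattern and the function name are illustrative stand-ins.

```python
import re

CHUNK = 10 * 1024 * 1024  # read 10MB at a time, as in the script discussed above

# Hypothetical pattern: complete <page>...</page> blocks. The real script
# uses its own patterns; this only illustrates the buffering technique.
TAG_RE = re.compile(rb"<page>.*?</page>", re.DOTALL)

def scan_dump(path):
    """Yield each complete tag found in the file, scanning chunk by chunk."""
    buf = bytearray()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            buf.extend(chunk)
            last_end = 0
            for tag in TAG_RE.finditer(buf):
                yield bytes(tag.group(0))
                last_end = tag.end()
            # Keep only the unconsumed tail (the "del buf[0:tag.end()]" trick):
            # a tag cut off at the chunk boundary stays in the buffer and
            # completes on a later read. As noted above, a >32MB revision just
            # means the buffer keeps growing until the closing tag arrives.
            del buf[0:last_end]
```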
[02:25:32] another bit of research I did; I don't know if it's useful for anything, but it's interesting: the category loops of Wikipedia: https://en.wikipedia.org/wiki/User:Danilo.mac/Category_loops
[03:10:48] If formWizard is related to a research project, please see https://meta.wikimedia.org/wiki/User_talk:Jmorgan_(WMF)#Disabled_formWizard
[14:36:28] halfak, baby!
[14:36:58] So, I bit the bullet and just ran permutation tests, which I understand are appropriate for bimodal distributions. Is that statement stupid?
[14:37:13] ottomata, can you see any reason for me not to take a stab at implementing some of the "tagging" filters as UDFs?
[14:37:16] o/
[14:37:22] should I wait on the sensible testing infrastructure?
[14:37:35] naw, continue. I'm sure we can adapt whatever you come up with
[14:37:40] okie!
[14:37:55] Ironholds, why are SQL filters insufficient?
[14:38:08] like, big CASE WHENs?
[14:38:16] I'm worried that we're writing mapreduce code rather than Hive SQL.
[14:38:25] CASE WHENs gotta be somewhere
[14:38:25] huh?
[14:38:34] it's not really mapreduce code, it's just... let me grab you an example.
[14:38:52] https://gerrit.wikimedia.org/r/#/c/180023/5/refinery-core/src/main/java/org/wikimedia/analytics/refinery/Webrequest.java,unified
[14:38:57] OK. Just map code. :P
[14:39:01] that's the (existing) prototype. It's just a UDF
[14:39:30] instead of "WHERE mime type is this type and url fits this format and and and and and" you just go "WHERE is_pageview(x,y,z,foo)"
[14:39:46] same logic, but it's much more reusable
[14:39:58] I agree.
[14:39:59] Yeah. If it were a query, the definition would be contained. As it stands, this will be hard to apply when we want to process page views any other way.
[14:40:14] example?
[14:40:16] Which should probably not be your first concern.
[14:40:22] What do you mean "example"?
[14:40:30] sorry, of the "any other way"
[14:40:34] halfak: I disagree there. It will be easier to do so, as you can now use the definition with technologies other than Hive
[14:40:36] Outside of Hive
[14:40:49] ottomata, perl?
[14:41:05] (inside of hadoop)
[14:41:11] I don't get how an HQL script would be more applicable to perl, except in the sense that you need to look in fewer places to extract the logic and turn it into perl
[14:41:23] Ironholds, definition would be contained.
[14:41:28] * Ironholds nods
[14:41:35] the hive script should not be the definition though
[14:41:41] Oh?
[14:41:43] nor should the java code
[14:41:47] like, the definition lives on meta.
[14:41:50] the java is just an implementation.
[14:41:53] well, I think people disagree a little bit, but ja
[14:41:54] ja
[14:41:55] exactly
[14:41:57] Oh... Well that definition can't be executed and tested.
[14:42:00] So it isn't as good
[14:42:19] so, the argument for/against UDFs, ime, is:
[14:43:14] for: we get to chunk things elegantly. We can distinguish a generalised filter from a tagging setup really easily, and this has implications for future work (example: I do a project where I don't actually care about whether the hit is to zero or not, I just want to know if it's a "pageview". Rather than digging through an HQL query, I just call one UDF instead of two)
[14:43:36] We also get unit testing, which is a big draw, ime, because it makes it so much easier to find out at compile time whether I accidentally fucked up the definition
[14:44:12] Because it's chunked, it's more maintainable: you don't have to dig through so much to make a tweak, when tweaks are needed (and they will be needed).
[14:44:32] cons: it's chunked, which means someone looking to reconstruct it and apply it to the sampled logs, say, has to go to multiple places to extract the logic.
[14:44:36] Look guys. I don't have time to argue the point. If you have already made the decision, that's cool.
[14:44:42] * halfak runs off to strategy stuff.
[14:44:56] oookay
[14:45:13] haha
[15:12:10] * halfak breaks free.
[15:12:46] Ironholds, after thinking about it, I think that your strategy makes sense. *but* I'm wondering if you have a test for the entire filter.
[15:13:02] If so, that test can be used for any other implementation.
[15:13:09] the entire generalised filter, or filter + tagging + etc etc?
[15:13:19] I guess generalized for now, but yes to all.
[15:14:04] we have unit tests for the entire generalised filter, although I'm not very happy with their extent so far or how they're implemented - they're not really easily reproducible. Christian and Otto are working on changing how tests are implemented so that they can be.
[15:14:25] I have a gist at https://gist.github.com/Ironholds/30c09e6b4402bfc0e967 you might like, which was my immediate thoughts on how we might go about structuring things to make them more generalised. It trades off readability for portability.
[15:15:07] They seem to be talking about a flat test file (in JSON, maybe?) containing cases, which can be passed through one by one
[15:15:14] also a good idea, imo
[15:15:18] that was my idea, although qchris might not like it, not sure.
[15:15:22] * Ironholds nods
[15:15:32] +1 for flat test file.
[15:15:39] we don't have tests for the other bits yet, because they haven't been built, although I'm hoping to write the "access method" UDF today.
[15:15:50] but I'm sort of debating whether to push off further UDFs until we have a testing answer
[15:16:09] because I want this to be as stable as possible, and having to refactor Inf tests is a timesink :D
[15:16:30] Ironholds: we could do the test as you say, if we made another evaluate method that took String arguments (just for testing)
[15:17:26] * Ironholds nods
[15:17:30] I have no particular preference
[15:17:46] flat files mean the tests are more rerunnable if a per-language framework is put in place
[15:17:58] this format means the tests are more rerunnable if someone goes through with a find-and-replace
[15:18:26] it depends on whether we expect UDF equivalents in multiple languages; if we do, those implementations should contain frameworks for testing and the flat file is the answer
[15:18:40] if we just expect one-off, ad-hoc things, I imagine find-and-replace is less of a PITA than building a tester
[15:18:59] Personally, I'd prefer that we went for the flat files, thinking about it
[15:19:08] I'm not sure what you mean by "find-and-replace"
[15:19:34] so, the gist above; the test call is structured so that you can take the test file and literally find-and-replace assertTrue, and that's all you really need to do for most languages
[15:20:14] Are you suggesting someone uses a regex to rewrite a bit of code?
[15:20:27] Or is this more conceptual?
[15:20:38] see "trades off readability for portability" and "prefer the flat files" :D
[15:21:14] See where? In the chat log?
[15:22:56] yup
[15:24:24] Sorry. Re-read and I'm still not sure I understand.
[15:24:42] Was the "find-and-replace" strategy described?
[15:25:10] in the gist, yes.
[15:25:21] I think we're going for flat files anyway, absent objections from Christian, so it's somewhat moot.
[15:26:01] Oh. You mean that someone would run the same test file rather than compute on the same test input data.
[15:26:09] So the test data would be contained within the test.
[15:26:52] yup
[15:28:29] Gotcha. Yeah. That sounds frightening to me. The nice thing about test data is that you could implement the filter however you saw fit and have the test check the output to see if the right rows were filtered/tagged.
[15:28:51] Can you execute the hive bits on a flat file?
[15:29:12] Maybe it wouldn't matter since you can effectively load a flat file into Hive.
[15:34:12] da
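A minimal sketch of the flat-test-file idea being discussed: cases live in a JSON file and are fed one by one to whatever implementation of the filter is under test, so the same file could serve a Java UDF harness or any other port. The file name, case fields, and the `is_pageview` stub are all hypothetical; the stub is a stand-in, not the actual pageview definition.

```python
import json

def is_pageview(uri, mime_type, status):
    """Stand-in for the real filter (e.g. the Java UDF linked above).
    Any implementation with this signature can run the same cases."""
    return (status == 200
            and mime_type == "text/html"
            and not uri.startswith("/w/api.php"))

def run_cases(path):
    # Each case is a dict like:
    # {"uri": "/wiki/R", "mime_type": "text/html", "status": 200, "expected": true}
    with open(path) as f:
        cases = json.load(f)
    for i, case in enumerate(cases):
        got = is_pageview(case["uri"], case["mime_type"], case["status"])
        assert got == case["expected"], f"case {i} failed: {case}"

# run_cases("pageview_cases.json")  # hypothetical test-case file
```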
[16:21:03] leila, do you have any opinions on the use of permutation testing to validate/invalidate a difference between two sets of results?
[16:21:24] are you talking about counting uniques for the app?
[16:21:39] cuz I'm answering a similar question that Nuria has brought up. ;-)
[16:22:07] naw, the sessions stuff again
[16:22:23] I'm looking at permutation testing since it tends not to care what the underlying distribution is
[16:22:29] (and whether it's unimodal, multimodal, etc.)
[16:23:37] I need to block some time for that, Ironholds, to go over it in more detail.
[16:24:40] sure!
[16:24:44] right now is not good since I have to do some WikiGrok analysis for this afternoon and uniques for Nuria. :-\
[16:26:30] understandable! let me know if I can help with WG anyhow :)
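A minimal sketch of the two-sample permutation test mentioned above ([14:36:58], [16:21:03]): the null distribution is built by shuffling the pooled observations, so nothing is assumed about the shape of the underlying distribution (unimodal, bimodal, or otherwise). The variable names are illustrative, not taken from any script discussed here.

```python
import random

def permutation_test(a, b, n_permutations=10000, seed=0):
    """Approximate two-sided p-value for a difference in means.

    Counts the share of random label shuffles that produce a difference
    at least as extreme as the observed one. Because the reference
    distribution comes from the data itself, no distributional
    assumptions are needed, which is why it suits bimodal session data.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)            # randomly reassign group labels
        perm_a = pooled[:len(a)]
        perm_b = pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            hits += 1
    return hits / n_permutations

# e.g. permutation_test([1.2, 8.9, 1.1, 9.3], [1.0, 1.3, 0.9, 8.8])
```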
[18:58:35] Hey DarTar, we're postponing the showcase, right?
[18:58:46] halfak: yes
[18:58:53] it's tomorrow at 3pm PT
[18:59:00] OK. Shall I send out an email to meeting participants?
[18:59:07] I got Reid's confirmation this morning
[18:59:16] Great. Can you move the calendar event?
[18:59:19] I want to update the showcase page first, let me do this
[18:59:24] OK.
[18:59:28] I will wait.
[18:59:39] ...or should I just send out the email now?
[18:59:50] Meh... I'll start drafting so you can review
[19:12:51] halfak, I have the changes up on mediawiki.org
[19:12:58] and blocked collab 6
[19:13:12] do you want to send out the usual email?
[19:13:22] (please include the abstracts if you do so)
[19:14:04] Great. I got side-tracked. Just started the draft
[19:14:52] DarTar, do you want to call it "special" like we did with Yan's?
[19:15:16] no, it's the regular edition, pushed 24 hours :)
[19:15:31] thank you
[19:15:36] I'll tweet
[19:16:14] http://etherpad.wikimedia.org/p/research_showcase_dec14
[19:25:34] DarTar: wiki-research-l, staff ... any other lists we usually hit?
[19:27:14] one email sent internally (with details on the room), another one cross-posted to analytics-l and wiki-research-l
[19:27:18] thanks dude
[19:27:25] oh, just one thing
[19:27:56] maybe I can get the hangout link before we send this out
[19:34:45] hey halfak, let me handle this. I'll set up the hangout on air (I need to host it), add the link and share the announcement. Thanks for getting this started
[19:35:13] Oh, I was just about to click send.
[19:35:14] :P
[19:35:19] I'll just send you the mail.
[19:35:27] cool, thanks
[19:36:22] (gmail does better with formatting)
[19:36:25] OK. Sent.
[19:41:13] halfak: hangouts on air is driving me nuts, getting help from Office IT
[19:42:02] kk
[19:42:17] Seriously, a *nearly* awesome product
[19:43:50] DarTar: do you want to chat?
[19:44:08] leila: yes, I need 10 minutes to finish the setup of the hangout
[19:44:20] sure. take your time, DarTar
[19:48:20] what's happening with hangouts?
[19:51:52] Ironholds, where's the current x-analytics format documented?
[19:52:59] it's documented? ;p
[19:53:07] heh
[19:53:10] I can give you an example entry, but I am not aware of decent documentation
[19:53:24] halfak, I couldn't find documentation either.
[19:53:29] That would be helpful, thanks. :)
[19:53:38] the general rule is param=value with ; between param=value entries and EOL as a terminator
[19:53:50] I'll grab an example line
[19:55:12] leila, one more reason to switch away from the format! :) I want to suggest that we switch to either x-www-form-urlencoded
[19:55:15] or json
[19:55:47] json is problematic
[19:55:54] why, Ironholds?
[19:56:12] someone throws in 3 more characters than the field can hold. Congratulations: your JSON is now broken and cannot be automatically fixed :D
[19:56:31] K4 had someone store JSON in a MySQL db once and encountered precisely this problem. It ended badly.
[19:57:04] I love JSON as a format, but not for storing in a fixed-length field, because it needs the final characters to be included to be parsable. If those get dropped, bad things happen.
[19:57:17] Ironholds, that makes sense. x-www-form-urlencoded wouldn't have that problem.
[19:57:26] You can just read all of the fields up to the truncation.
[19:57:32] You might have the last field error out.
[19:57:33] * Ironholds will look up that standard
[19:58:33] oh, it's just that?
[19:58:38] not only is that fine, I already built a parser for it.
[19:58:59] Good point.
[19:59:06] It's shameful that R didn't already know how to do that.
[19:59:44] this is a language where the URL decoder breaks on out-of-range characters and is not vectorised
[19:59:46] what did you expect ;)
[20:00:25] OK. Here are all the notes I've got: http://etherpad.wikimedia.org/p/x-analytics
[20:00:29] halfak, example of the current format: zero=404-01;proxy=Opera;php=zend
[20:01:44] Ironholds, it can handle non-ASCII
[20:01:51] It just needs to encode them.
[20:01:58] Same problem with JSON.
[20:02:27] URL encoding replaces non-ASCII characters with a "%" followed by hexadecimal digits.
[20:03:13] yeah, "not great" == "not efficiently"
[20:03:14] will replace
[20:03:23] Gotcha.
[20:04:46] Hmm... Did the whole etherpad just get deleted?
[20:05:03] No. Nevermind
[20:05:04] derp
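A minimal sketch of a parser for the current x-analytics format as described above ("param=value with ; between entries"), illustrating why key=value formats degrade gracefully under truncation while JSON does not. The function name is hypothetical, and this is not the parser Ironholds mentions having built.

```python
def parse_x_analytics(field):
    """Parse an x-analytics header like 'zero=404-01;proxy=Opera;php=zend'.

    Entries are param=value pairs separated by ';'. If the field was
    truncated mid-entry, every complete pair before the cut still parses;
    the final pair may be cut short or dropped. This is the graceful
    degradation discussed above that a truncated JSON blob lacks.
    """
    result = {}
    for entry in field.split(";"):
        if "=" in entry:
            param, _, value = entry.partition("=")
            result[param] = value
        # else: an empty or truncated trailing entry; skip it
    return result

# Complete field:
parse_x_analytics("zero=404-01;proxy=Opera;php=zend")
# -> {'zero': '404-01', 'proxy': 'Opera', 'php': 'zend'}

# Truncated mid-entry: the complete pairs survive.
parse_x_analytics("zero=404-01;proxy=Opera;ph")
# -> {'zero': '404-01', 'proxy': 'Opera'}
```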
[20:12:07] leila: I'm on the original hangout
[20:12:18] jumping there, DarTar.
[20:12:24] k
[21:00:14] ....BAAHHAHAHAHA
[21:00:24] oh, this is going to be beautiful
[21:00:55] quiddity, you remember my comment in today's virtual working group about some spammer adding links to his R website to the R article?
[21:02:17] he's come back with "If you are well versed in R programming, I will consider the debate on relevancy of the site"
[21:03:12] ...you poor sod.
[21:04:54] I overwhelm people I /want/ to like me with R. Precisely how long do you think someone I already don't particularly care for is going to last?
[21:06:20] Ironholds, is this a guy arguing to keep a link to his site on the wiki?
[21:06:26] SEO much?
[21:06:33] yup
[21:07:05] the best bit was I just said "hey, don't add links you have a COI with" and his initial response opened with "I have no intention of improving search engine rankings."
[21:07:16] I didn't mention you wanted to tweak your SEO, dude, but hey, now that you choose to voluntarily bring it up...
[21:07:25] my Java will have to wait while I school this guy.
[21:07:53] * halfak puts the popcorn in the microwave.
[21:07:55] link?
[21:08:06] https://en.wikipedia.org/wiki/User_talk:Ironholds#December_2014:_Added_an_additional_educational_reference_for_learning_R_data_mining_techniques.
[21:08:14] will poke you when my reply is completed and posted.
[21:14:08] Ironholds, is it a conflict to cite your own research?
[21:14:13] I feel conflicted about that.
[21:14:24] I've deliberately avoided doing so for exactly that reason, but I dunno.
[21:14:28] There certainly may be a conflict, but...
[21:14:32] I haven't done it either.
[21:20:11] hey halfak, do you have any time today or tomorrow for me to pick your brain about some WikiProject stuff? A hangout would be useful. I can keep it to half an hour.
[21:20:32] J-Mo, I've always got time for you :)
[21:20:33] you can say no if you're slammed
[21:20:36] * halfak looks at calendar.
[21:20:36] awwwwwwwww
[21:20:40] thanks :)
[21:21:38] Invite sent :)
[21:21:50] Bah. I moved it. Invite updated
[21:22:04] thank you, good sir
[21:34:03] Ahh, dropkicking an SEO person always puts me in a good mood
[21:34:10] Smells like victory.
[22:44:16] fhocutt, I see the bug. TY :)
[22:44:34] np!
[22:45:23] were there any others? Want a general one for considering a general-purpose API and/or documentation?
[22:54:28] ^ halfak
[22:54:44] Any other what?
[22:54:58] Oh! A bug for API docs? Sure!
[23:16:04] Quarry: Switch Quarry to use Material Design Bootstrap theme - https://phabricator.wikimedia.org/T76140#854487 (yuvipanda) Open>declined a: yuvipanda But has a somewhat fucked up license, so... no. https://github.com/FezVrasta/bootstrap-material-design/blob/master/LICENSE.md
[23:17:04] mwstreaming successfully ref-yak-tored.
[23:55:06] halfak, haha
[23:55:15] say, halfak. Update on the x_analytics.
[23:55:48] So, we're not going to make any changes to the format right now - the scope was just to discuss the AppInstallID. But it has bumped up the importance of the task of writing a parameter extractor. So that's nice!
[23:56:05] K. Makes sense.
[23:56:17] Do we have a card for the devs re. adding stuff to x-analytics?
[23:56:48] I... don't think so, currently?
[23:56:57] Hmm... We should make that.