[15:48:46] halfak: Hey; I hope the traveling and conferencing is going well. I'm having trouble generating the references dataset for enwiki, although it seemed to go fine with the other wikis. Here's the backtrace: https://phabricator.wikimedia.org/P667 , if you have any idea what I might be doing wrong.
[15:50:21] Yikes! That should only happen if there is an error in the XML. It seems like it would be a good idea if the utility included the name of the failed path when reporting the error.
[15:50:48] guillom, how long does it run before erroring?
[15:50:58] halfak: a few seconds
[15:51:13] Good. I bet it is right at the beginning of one of the files.
[15:51:16] * halfak looks
[15:52:21] It may also have happened on another wiki; the dataset for eswiki was suspiciously small, but I lost the screen so I may re-generate it once I'm done with plwiki.
[15:53:50] Oh! I fail.
[15:53:53] This is my fault.
[15:54:01] I gave you a bad string for matching file names.
[15:54:08] heh :)
[15:54:21] That glob will match these two wrong files:
[15:54:26] /mnt/data/xmldatadumps/public/enwiki/20150403/enwiki-20150403-pages-articles-multistream.xml.bz2
[15:54:39] /mnt/data/xmldatadumps/public/enwiki/20150403/enwiki-20150403-pages-articles.xml.bz2
[15:55:48] oh, I see. The issue is the *
[15:55:49] It looks like you want enwiki-20150403-pages-articles*.xml*.bz2
[15:57:12] halfak: While I have the citation TSV sitting on my computer... do you know if anyone is embarking on a "The Wikipedia Bibliography"-type project? Is that something the Citoid team is working on?
[15:57:34] What kind of project would that be?
[15:57:54] Something that produces a bibliography of the different sources that Wikipedia uses.
[15:58:22] Like if you use 50 different works in your book, you prepare a bibliography of those 50 books. Though with Wikipedia's bibliography, you wouldn't dare try to load the entire list at once.
[15:58:22] harej, there's some past research on that. I think that guillom is doing something like that for the Citoid team right now.
[15:58:58] harej, right now, I think most analyses have focused on extracting domain names from <ref> tags.
[15:59:10] It seems that de-duplication of citations is an open problem.
[15:59:14] harej: Have you seen https://meta.wikimedia.org/wiki/Research:Citoid_support_for_Wikimedia_references and https://phabricator.wikimedia.org/T96927#1296851 ?
[15:59:36] * halfak runs back to conference stuff.
[15:59:38] o/
[15:59:39] halfak: Thanks for the pointer about the *. Sorry I missed that :)
[15:59:46] Godspeed guillom :)
[16:02:01] guillom: That's for the domain names. Is there anything doing an analysis of the different {{cite}} templates used?
[16:02:59] harej: At the moment, I don't think so. In the past there was http://firstmonday.org/issues/issue12_8/nielsen/index.html and https://finnaarupnielsen.wordpress.com/2010/08/25/top-news-cites-referenced-from-wikipedia/
[16:22:10] is there a research meeting today? agenda appears rather empty...
[16:23:21] Nettrom: I and many others are not attending due to that empty agenda.
[16:23:31] I'll go for lunch, then
[16:23:40] I think it's safe to skip it :)
[16:23:43] thanks~!
[16:23:48] Bon appétit!
[16:24:05] merci!
[16:24:06] :)
[16:54:56] guillom, so you will attend if I fill the agenda no matter what I fill it with?
[16:55:30] I suppose it is empty due to the hackathon
[17:01:47] White_Cat_mobil_: "!A causes !B" isn't equivalent to "A causes B" :)
[19:25:30] DarTar, you know the recommender work?
[19:25:53] how goes getting the computational intensity down to where our servers can handle it? Because I've got an offer of time on the MIT supercomputing grid, looks like
[19:26:12] hey – in a meeting, touch base later?
[19:26:17] sure!
[19:26:36] but I think for now we're in good hands between the internal cluster and our external Altiscale subscription
[19:27:20] awesome!
[19:27:25] Ideally more of the latter than the former
[19:30:37] Recommender eh???
[19:31:07] harej, yep, ehhh
[19:44:09] Ironholds: do you mind sending a note to research-internal + analytics-internal instead? Most people are traveling or just about to leave (including myself), email is probably the best channel
[19:47:58] DarTar, yeah, will do
[19:48:09] thanks
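
A quick aside on the glob exchange at [15:54:21]-[15:55:49]: it is cheap to list what a pattern actually matches before launching the extraction utility. Below is a minimal sketch, assuming Python and the dump directory layout shown in the log; the helper name and the rule for excluding the combined and multistream dumps are illustrative assumptions, not the actual tool's behavior.

    import glob
    import os

    # Dump directory from the log; adjust per wiki and dump date.
    DUMP_DIR = "/mnt/data/xmldatadumps/public/enwiki/20150403"

    def split_dump_paths(pattern):
        """Return dump files matching `pattern`, minus the combined and
        multistream dumps that should not be fed to the extractor."""
        matches = sorted(glob.glob(os.path.join(DUMP_DIR, pattern)))
        return [
            path for path in matches
            if "multistream" not in os.path.basename(path)
            and not path.endswith("pages-articles.xml.bz2")
        ]

    if __name__ == "__main__":
        # The pattern suggested at [15:55:49]; printing the matches makes it
        # obvious when a glob still picks up the two wrong files listed at
        # [15:54:26] and [15:54:39].
        for path in split_dump_paths("enwiki-20150403-pages-articles*.xml*.bz2"):
            print(path)

Running a check like this before kicking off the enwiki job would have surfaced the combined and multistream dumps in the match list right away.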