[01:04:59] I'm signing off. have a good weekend everyone.
[15:23:41] I was looking for text table in enwiki_p accessible from login.tools but I couldn't find it in the replica database.
[15:24:00] Any clue as to why all other tables are there but this one isn't?
[16:39:47] ashwinpp, the text table is not in the MySQL database.
[16:40:04] It's been separated for performance reasons.
[16:40:21] There are two ways to get text, XML dumps and API.
[16:41:03] API: https://en.wikipedia.org/w/api.php?action=query&prop=revisions&revids=5467890&rvprop=content
[16:41:26] API is good for small/focused analyses
[16:41:42] XML dump is good if you want to perform an analysis across the entire history of the wiki.
[18:02:42] okay
[18:02:43] thanks
[18:04:35] ashwinpp, let me know if you want a hand with the XML dumps.
[18:04:44] I have some utilities that make working with them a lot easier
[18:05:08] See https://pythonhosted.org/mediawiki-utilities/core/xml_dump.html#mw-xml-dump
[18:05:32] I actually wanted to analyze only a small part of it based on a date range, rather than the article range
[18:05:59] This utility seems useful, let me take a look at it
[18:07:18] hi halfak. how are you?
[18:07:24] Do you have some rough stats, like how long would it take to scan through the dump once?
[18:07:25] Hi lzia :)
[18:07:36] Good! Good morning :)
[18:07:51] halfak: re the showcase: I'm waiting for ashwinpp and Bob to get back to me, but basically, we most probably can't do the 11th and can do 25th
[18:08:09] I know Dario can't attend the 25th, but the 11th is too soon for us.
[18:08:10] ashwinpp, if you are doing enwiki and you are just running a regex on page content, we can do that in 24 hours on a 12 core machine.
[18:08:30] Ah..I see, that's good enough
[18:08:31] mornin' ashwinpp
[18:08:40] morning lzia
[18:08:40] I added you to the Hangout.
[18:08:44] Boo.
[18:08:50] you mean google hangout?
[18:08:53] yup
[18:08:59] leila, what about other days of the week?
[18:09:15] which week, halfak :-)
[18:09:29] Good Q, you know yours and Dario's availability better than I do.
[18:09:53] E.g. what about March 20th?
[18:09:54] I'm waiting for bob to respond for the showcase
[18:10:01] I most probably won't fly back immediately after CSCW
[18:10:14] so for me, March 25 works well
[18:10:23] Just so long as we have someone to do setup work at the WMF.
[18:10:29] And a good internet connection.
[18:10:31] np, ashwinpp.
[18:10:39] ashwinpp: did you get my invite?
[18:10:46] We could run the showcase without DarTar too.
[18:10:50] yeah, I can do that halfak.
[18:10:57] I'm going to be back home for the spring break, so if it's online I can contribute, but otherwise I won't be there
[18:11:04] I couldn't find the invite
[18:11:05] The 25th would work for me
[18:11:24] got it, ashwinpp. I'm thinking Bob can do that, but I'll wait for him to get back to us.
[18:11:42] okay, halfak, let's say tentative the 25th, I'll confirm in the next 2 days?
[18:11:57] Sounds good to me :)
[18:12:01] ashwinpp: the Google calendar has a Hangout link. can you try that?
[18:12:05] great, halfak.
[18:12:29] ok wait
[18:32:44] ashwinpp: /a/wikistats_git/dumps
[18:34:04] halfak: do you know if the dumps are somewhere in /a/wikistats_git/dumps ?
[18:34:20] Which machine?
[18:34:29] good Q. stat1002
[18:34:56] /mnt/data/xmldatadumps/public/
[18:35:03] oh! thanks!
[18:35:18] :) It turns out that it is the same between machines -- I forgot before I looked.
[18:35:34] aha! thanks!
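
Note: a rough sketch of the two routes discussed in this log (the API for a single revision, and a date-range scan over an XML dump), assuming Python and only the standard library. It does not use the mediawiki-utilities package linked above, which has its own dump reader. The function names `revision_content_url` and `revisions_in_range` are hypothetical, and the element names (`page`, `title`, `revision`, `id`, `timestamp`, `text`) assume the usual MediaWiki export layout.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from urllib.parse import urlencode

# Endpoint taken from the API link in the log.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def revision_content_url(rev_id):
    """Build the query URL for one revision's content (hypothetical helper)."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": rev_id,
        "rvprop": "content",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def _local(tag):
    # Drop the "{namespace}" prefix that ElementTree puts on tag names.
    return tag.rsplit("}", 1)[-1]


def _parse_ts(text):
    # Assumes dump timestamps of the form 2015-03-02T10:00:00Z (UTC).
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def revisions_in_range(source, start, end):
    """Yield (title, rev_id, timestamp, text) for revisions with
    start <= timestamp < end.

    `source` is a filename or a binary file object. The dump is streamed
    with iterparse so the whole file is never held in memory at once.
    """
    title = None
    for _, elem in ET.iterparse(source, events=("end",)):
        name = _local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            rev_id = ts = text = None
            # Only direct children: the contributor's <id> is nested deeper.
            for child in elem:
                child_name = _local(child.tag)
                if child_name == "id":
                    rev_id = int(child.text)
                elif child_name == "timestamp":
                    ts = _parse_ts(child.text)
                elif child_name == "text":
                    text = child.text or ""
            if ts is not None and start <= ts < end:
                yield title, rev_id, ts, text
            elem.clear()  # release the revision text once it has been handled
        elif name == "page":
            elem.clear()
```

Since the dumps are compressed, one option is to pass an object from `bz2.open(path, "rb")` as `source`, where `path` points at a file under the dump directory mentioned in the log. Note that a date filter still has to read every revision in the file; it only limits what is yielded, not how long the scan takes.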