[15:04:01] Morning halfak_
[15:04:30] Hey aetilley
[15:05:25] Working R install successful. I'm just browsing the manual now.
[15:05:26] How did installing R go?
[15:05:30] Cool!
[15:06:02] So, first thing we need to do is install "sigclust"
[15:06:18] To do that, run R.
[15:06:27] You'll get the R prompt
[15:06:35] then run 'install.packages("sigclust")'
[15:07:19] It's asking if I'd like to use a personal library instead
[15:07:28] not sure what that means.
[15:07:53] Oh, maybe it wants to know if I have the package ready to go or want to fetch it.
[15:08:10] I should probably just say "no" right?
[15:08:52] ok well I said no and it just returned me to the prompt.
[15:09:12] oh wait that's just the tip of the error
[15:10:22] Personal is fine
[15:10:33] It's going to install it in your account on the machine
[15:10:35] ok, I'm selecting a CRAN mirror
[15:10:44] Yeah. Anything closeish is fine.
[15:11:28] ah, unable to access, let me try another mirror
[15:14:03] http://pastebin.com/uS6xEMCV
[15:14:08] halfak: ^
[15:14:48] not sure if that's a problem with the mirror or what. Let me check the sigclust data-sheet
[15:17:02] Yeah... It could be that sigclust hasn't been updated for this version of R.
[15:17:06] I'm looking into that.
[15:17:17] ok
[15:17:42] It's not prompting me for a mirror anymore when I make that call, so I don't know how to test if the mirror is at fault.
[15:18:42] I don't think it's that.
[15:18:51] Can you double-check the spelling of the package?
[15:18:55] "sigclust"
[15:19:10] I was just able to install on R 3.0.2
[15:20:05] According to the docs, sigclust blindly supports anything after 2.4.x
[15:22:54] Still getting :
[15:23:08] halfak: o/
[15:23:10] "Warning: unable to access index for repository "
[15:23:24] Hey Amir1
[15:23:31] aetilley, weird.
[15:23:35] hey :)
[15:23:37] "Warning message: package 'sigclust' is not available (for R version 3.1.1)
[15:23:38] "
[15:23:47] aetilley: Let's try something that I know should be installed.
[15:23:49] I finished first part of four dumps to read
[15:23:54] the second one is working
[15:23:55] install.packages("data.table")
[15:23:58] aetilley, ^
[15:24:12] I add the third and fourth pretty soon
[15:24:16] Amir1, with that script I wrote, you can process all 4 dumps in parallel
[15:24:19] halfak: same error message
[15:24:43] Yeah, I'm working parallel
[15:24:52] aetilley, ah ha! So it's something to do with R not being able to get anything useful. Let's try changing the mirror.
[15:25:13] Amir1, gotcha. Are you using mwxml or mw.xml_dump to do the parallelization?
[15:25:27] aetilley, I don't know how to force change the mirror. I'll start googling
[15:29:40] run "chooseCRANmirror()"
[15:29:54] I opened a new shell and restarted R and now I'm getting an even stranger error:
[15:29:58] > install.packages("sigclust")
[15:29:58] Installing package into ‘/home/vagrant/R/x86_64-pc-linux-gnu-library/3.1’
[15:29:58] (as ‘lib’ is unspecified)
[15:29:58] --- Please select a CRAN mirror for use in this session ---
[15:30:01] Error in m[, 1L] : incorrect number of dimensions
[15:30:05] >
[15:30:12] oops sorry
[15:30:26] halfak: no
[15:30:39] I tried mwxml and I failed
[15:30:48] it seems you need to unzip the dump file
[15:30:51] Nooo. Damn.
[15:30:52] Nope
[15:30:57] It'll run on the zipped files.
[15:31:13] the mean doesn't really matter for me
[15:31:13] But if they are 7z, then it'll try to use the commandline utility for you.
[15:31:32] Yeah. I just wanna make that lib trivially easy for people to use.
[15:31:50] I use pywikibot because I'm fast at writing scripts related to dumps using pywikibot
[15:31:55] ok, that worked.
[15:31:55] Let me try something not in CA this time.
[15:31:56] > install.packages("sigclust") \\ Installing package into ‘/home/vagrant/R/x86_64-pc-linux-gnu-library/3.1’ \\(as ‘lib’ is unspecified) \\ Error: Line starting ' >
[15:32:10] I wrote more than 20 scripts using pywikibot
[15:32:16] I know methods by heart
[15:32:21] Gotcha.
Makes sense.
[15:33:19] I think it's vagrant screwing things up.
[15:33:26] Can you try it on your host machine?
[15:33:30] R should work great on a mac.
[15:33:31] ya
[15:33:51] But I'll have to install it.
[15:34:00] This should take a minute.
[15:34:07] * halfak crosses fingers for quick install
[15:43:23] That worked.
[15:43:34] Was able to get sigclust package
[15:43:37] "/var/folders/49/8z7qknwn5wn20fl0t0_xjqwm0000gn/T//Rtmp1KGMuC/downloaded_packages"
[15:43:52] oops.
[15:43:56] anyway, yeah.
[15:44:13] Woot!
[15:44:20] OK. So now to get the data loaded.
[15:44:31] We're going to use read.table()
[15:44:39] I'll get you the incantation
[15:46:09] df = read.table("datasets/enwiki.features_damaging.20k_2015.tsv")
[15:46:26] looks like the default args work great
[15:46:39] a df is a "table"
[15:46:50] it has named columns.
[15:49:43] done
[15:50:23] Cool. So, if you run summary(df), it should make sense.
[15:50:41] You'll see that the columns are "named" V# where # is the column order number.
[15:51:02] So V1 is the rev_id and V48 is true/false for "is_damaging"
[15:55:00] ah yes
[15:55:00] although I only have 47 columns
[15:55:04] oh, ha
[15:55:10] I have 48
[15:55:16] :D
[15:55:21] I was gonna be very confused.
[15:55:33] oh wait
[15:55:45] no I was assuming they started with V0 but they don
[15:55:46] Cool. So, you should be able to pass the middle 45 columns to sigclust. Now to figure out how we work with sigclust
[15:55:46] t
[15:55:49] Yeah.
[15:55:50] so I have 47 columns
[15:56:22] Ok, I have a feeling I can take it from here.
[15:57:24] Although I'm a bit confused about why you have 48 columns.
[15:58:04] oh you started with the second data set!
[15:58:07] ok
[15:58:14] Remember they both had the same name
[15:58:23] although the second one had an additional column.
[15:58:35] (I have renamed mine to data1.tsv and data2.tsv
[15:58:37] )
[15:58:49] All good.
[15:59:07] +1
[15:59:18] Cool. I'll be around for another hour. Ping me if you need a hand.
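(Editor's note: the column layout discussed above, with the first column as rev_id, the last column as the true/false "is_damaging" label, and everything in between as the features to cluster, can be sketched roughly as follows. This is only an illustration in Python, not the R session from the log; the function name split_columns, the sample rows, and the exact spelling of the true/false values are assumptions, since the log does not show the file contents.)

```python
import csv
import io


def split_columns(tsv_text):
    """Split a headerless TSV into (rev_ids, features, labels).

    Assumed layout (taken from the chat, not verified against the file):
    first column = rev_id, last column = is_damaging, middle = features.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    rev_ids = [row[0] for row in rows]
    # Feature values are left as raw strings; their types are not
    # stated in the log, so no conversion is attempted here.
    features = [row[1:-1] for row in rows]
    # Assumption: the label is spelled "True"/"False" in some casing.
    labels = [row[-1].strip().lower() == "true" for row in rows]
    return rev_ids, features, labels


# Hypothetical two-row sample with two feature columns.
sample = "1001\t0.5\t3\tFalse\n1002\t1.5\t0\tTrue\n"
rev_ids, features, labels = split_columns(sample)
```

(Slicing with row[1:-1] works for either file mentioned above, whether it has 47 or 48 columns, because it never hard-codes the column count.)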
[15:59:37] will do
[15:59:42] Amir1, can you link me to docs on how pywikibot handles XML?
[16:03:17] halfak: sure
[16:03:19] one sec
[16:04:34] https://doc.wikimedia.org/pywikibot/api_ref/pywikibot.html#module-pywikibot.xmlreader
[16:07:44] Interesting. Looks like this does the same thing as my XML reader.
[16:07:53] Parallel development?
[16:08:41] I'm not sure
[16:11:15] Looks like -- whoever first put this together -- didn't copy any of my code. I'm biased, but this looks like a mess.
[16:11:32] XmlEntry == Revision
[16:11:48] A lot of fields are missing.
[16:11:55] But it's the same basic idea.
[16:13:47] Oh! It uses threading.Thread. That doesn't distribute CPU
[16:13:56] mwxml uses Multiprocessing and distributes CPU
[16:14:43] So I will learn mwxml ASAP
[16:17:44] Not required. Also, I'm super biased.
[16:18:03] I want to take the pywikibot module and mwxml and make an example script for both.
[16:18:49] I think we could even replicate the CPU distribution by making mwxml a dependency of pywikibot and replacing some internals.
[16:19:08] A multiprocessing.Process gives you the same basic interface and guarantees as a threading.Thread.
[16:20:08] It would be nice to have the side-by-side first
[16:34:57] We have partial chinese translation
[16:34:59] only UI remains
[16:35:05] we even have chinese wikilabels landing page
[16:37:01] Woot! Do we have groups for zh?
[16:37:03] ToAruShiroiNeko, ^
[16:37:21] not yet
[16:37:33] I wasn't expecting either translation
[16:37:38] it happened on its own
[16:37:56] so it is a very good thing when things happen without my annoying requests
[16:38:12] we are getting polish too
[16:38:26] and spanish
[16:38:37] https://meta.wikimedia.org/w/index.php?title=Wiki_labels%2FInterface_translation%2FEdit_quality&type=revision&diff=14273057&oldid=14261234
[16:38:55] germans seem to be not very interested
[16:39:00] * halfak reads the pywikibot docs and sees "There surely are more elegant ways to do this."
Mwahahaha
[16:39:06] I will shove revision scoring down their communities throat :p
[16:39:32] Amir1 btw this is addictive: https://www.youtube.com/watch?v=kuORoSEGWYo
[16:39:38] Na. Just let 'em be. Wait until they complain, "Why didn't you enable revscoring for us"
[16:39:46] hehe
[16:39:52] "Because ya'll were jerks! Wanna stop being jerks now?"
[16:40:03] all I am trying to do is get translations for UI and Forms atm
[16:40:10] and I am not even trying remotely hard
[16:41:22] Amir1, can you show me how you use pywikibot.xmlreader.XmlDump in parallel?
[16:41:42] It seems like the XmlParserThread doesn't work nicely with XmlDump.
[16:41:43] ToAruShiroiNeko: yeah
[16:41:45] :)
[16:41:48] I might be missing something.
[16:41:58] halfak: There are four dumps for history
[16:42:01] for wikidata
[16:42:12] ToAruShiroiNeko, we could deploy with English as the best fallback to get them mad :)
[16:42:17] That would be good incentive.
[16:42:32] no, we should do chinese as their fallback
[16:42:45] I can troll if cornered :p
[16:42:54] Amir1, indeed. Just trying to figure out how one would actually *use* XmlParserThread.
[16:43:08] ToAruShiroiNeko, we don't choose the fallback path regretfully.
[16:43:26] seriously though, I actually am thinking of launching the labels campaign despite their ignorance
[16:43:30] But we *could* copy paste zh in *as* de >:)
[16:43:36] I want to think people will work on it despite the loudmouths
[16:43:37] +1
[16:43:42] I think we should just launch
[16:43:48] I see no reason to punish the community as a whole because of a few dicks
[16:44:21] dicks will incidentally be more upset if it works out DESPITE their efforts
[16:44:23] Story of Wikimedia. If we can get around the "few loud people" problem around some contentious issues, I think we'll be able to make much faster progress.
[16:44:27] that would be satisfying
[16:44:38] Meh. Rather have them as allies.
[16:46:37] true
[16:46:55] first they will be angry and then the aftertaste of niceness will kick in
[16:47:01] and they will probably reason
[16:48:08] * halfak crosses fingers but does not hold breath
[16:49:58] spanish is ready
[16:50:16] Woot!
[16:51:39] Why can't I run pywikibot without a goddamn user_config.py?!?
[16:51:44] * halfak calms down
[17:04:05] * halfak runs a test with pywikibot and mwxml to demonstrate (or refute) that mwxml is substantially faster
[17:05:13] OK. And with that I'm off. Have a good Saturday folks.
[17:33:32] updated spanish trusted users
[17:33:46] good saturdays to you halfak
[17:33:53] but its almost no more saturday here :p
[19:20:12] halfak: rename user-config.py.sample to user-config.py
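(Editor's note: on the earlier point that a multiprocessing.Process offers the same basic interface as a threading.Thread, here is a minimal sketch of that idea. It is not code from pywikibot or mwxml; the helper names count_words and run_workers and the sample chunks are made up for illustration. Both classes accept target= and args= and expose start() and join(), so the same driver can run either one. Only the Process variant can spread CPU-bound work across cores.)

```python
import multiprocessing
import threading


def count_words(index, text, results):
    """Worker: count whitespace-separated words and report (index, count)."""
    results.put((index, len(text.split())))


def run_workers(worker_cls, chunks):
    """Run one worker per chunk with threading.Thread or multiprocessing.Process.

    Both classes take target=/args= and provide start()/join(),
    which is the shared interface this sketch relies on.
    """
    results = multiprocessing.Queue()  # usable from threads and processes
    workers = [
        worker_cls(target=count_words, args=(i, chunk, results))
        for i, chunk in enumerate(chunks)
    ]
    for worker in workers:
        worker.start()
    # Drain the queue before joining so no worker blocks on a full pipe.
    counts = dict(results.get(timeout=30) for _ in workers)
    for worker in workers:
        worker.join()
    return [counts[i] for i in range(len(chunks))]


if __name__ == "__main__":
    chunks = ["a b c", "d e", "f"]  # hypothetical stand-ins for dump pieces
    print(run_workers(threading.Thread, chunks))
    print(run_workers(multiprocessing.Process, chunks))
```

(The __main__ guard matters for the Process variant on platforms that start child processes by re-importing the main module.)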