[02:26:24] evening lzia :)
[14:40:52] good morning :)
[14:47:22] morrnin
[14:53:46] hey ottomata!
[14:54:35] thanks for jumping on that thread about dupes so promptly :). So I should be using wmf.webrequest, da?
[14:59:49] ja!
[14:59:51] duhhh!
[14:59:54] okay, signs today is going to be a good day:
[14:59:55] where you been?!
[15:00:00] 1) I was happy pre-caffeine
[15:00:04] 2) I just successfully rebased
[15:00:16] * Ironholds nods firmly, goes to armwrestle god
[15:00:21] ottomata, Boston! On holiday!
[15:00:54] Ironholds: wmf.webrequest is a columnar store in binary format, and should be a lot faster than wmf_raw.webrequest for many queries
[15:01:03] ooh
[15:01:09] it also has an is_pageview field
[15:01:29] based on your definition
[15:02:13] yiip
[15:02:18] was totally using wmf_raw.webrequest
[15:02:38] Morning guys :)
[15:03:02] gooood morning aaron! And what a lovely morning it is
[15:03:22] :)
[15:04:22] how is your day going?
[15:05:06] Not bad -- just started :)
[15:05:37] I found out that hadoop's distcp will totally not try to re-copy over a failed partial copy of the file you explicitly told it to move
[15:05:50] Or s3 does magic recompression when you pull files out.
[15:06:02] Magical recompression that breaks bz2
[15:06:38] is that good or bad? it sounds bad
[15:07:00] It's lost me a few days of work, but at least I know what's going on.
[15:07:36] hmn. Loading hive has got weirdly slow.
[15:14:53] dammit, hive
[15:14:56] "webrequest_source=misc/year=2015/month=1/day=27/hour=7"
[15:15:34] ja
[15:18:39] sweet!
[15:18:47] we still haven't cleaned out the January files yet
[15:18:50] * Ironholds pops knuckles, reruns queries
[15:37:27] ottomata, you know what the heck "Operation category READ is not supported in state standby" means? ;p
[15:37:33] partitions without underlying files?
[15:37:53] uh oh
[15:38:02] that means a namenode is not happy, checking
[15:38:15] hm
[15:38:16] no
[15:38:17] its fine.
[15:38:19] ...
[15:38:22] what are you doing?
[15:38:55] Ironholds: yeah, that is because my script won't do it yet, and Christian is using them
[15:39:04] but, we will probably keep 60 days of refined data for now anyway
[15:39:05] ottomata, querying January's data
[15:39:21] starting with, well, 1 January, text and mobile partitions
[15:39:34] you are seeing that message in your hive output?
[15:39:41] yup!
[15:39:52] stat1002 /home/ironholds/legacy_UDF.hql
[15:40:11] try it and find out if I'm crazy (I might be crazy. But I have a note from my doctor saying I'm not, so that's good, right?)
[15:40:53] What kinda doctor is this?
[15:40:59] does it take a while to give you that error?
[15:41:17] 'cause I could write you a note too. :)
[15:43:17] ottomata, ooh, lemme check
[15:43:32] i'm hesitant to run that since it is a big query
[15:43:37] i just ran it on a single hour and it was fine
[15:44:05] 30 seconds?
[15:44:19] it just freezes after estimating reducers, and then spits out that error
[15:44:21] k
[15:44:35] so there's no actual processing being done (I assume!)
[15:50:04] any ideas?
[15:50:36] Ironholds: it looks like it is running for me
[15:50:37] after setting
[15:50:40] huh
[15:50:41] export HADOOP_HEAPSIZE=1024;
[15:50:49] oh, a heapsize problem? huh
[15:50:57] okay! Let's try that..
[15:51:12] i never saw your standby state thing
[15:51:30] Ironholds: I replied to your quarry question!
[15:51:35] * Ironholds tests
[15:51:42] YuviPanda, yep, I saw! Thanks :D
[15:51:46] yw
[15:54:21] works like a charm :)
[15:54:57] Ironholds: wheee, cool
[15:56:10] ottomata, it works! yaay! Thanks :D
[15:56:15] morning Nettrom :)
[15:56:20] morning!
[15:56:56] cool
[16:13:45] nuria, did I get them? Sorry about that; I knew that rebase was too good to be true :(
[16:33:38] Ironholds: ya, i think you need to merge by hand now
[16:34:41] so I did or didn't get them all? :D
[16:59:27] morning leila :)
[17:03:53] morning Ironholds. :-)
[17:04:40] Who has two thumbs and is working on a data release?
[19:02:21] DarTar,
[19:02:27] hey
[19:02:28] I can see you but you can't hear me
[19:02:47] hmm, I can’t see you in the hangout at all
[21:24:50] hmmm, looks like I have a working Random Forest classifier now... cool!
[21:24:59] just need another 30 mins to test it
[21:31:24] Nettrom, cool!
[21:31:51] I was inspired by your pickling issues to finally get wikiclass adapted to revscoring's feature garden.
[21:32:07] It was a good exercise because I discovered some limitations in how language is specified.
[21:32:18] Infonoise needs to be a feature of the text.
[21:33:00] I'm just finishing a substantial refactor of revscoring and I have a WIP for wikiclass.
[21:33:11] I don't think this will make a difference for your work.
[21:33:28] cool
[21:33:56] I did find a fix for the pickling issue
[21:34:27] also fixed a bug in languages, it fails 100% of the time due to not referencing the stemmer right
[21:35:02] changed the definition of infonoise to match my R code, but that can be discussed
[21:35:18] and also made it predict classes as strings
[21:35:23] yes please. I found a weirdness.
[21:35:40] (previously it didn't translate the model's integers back to strings)
[21:36:00] The implementation of infonoise that I was working with did not take advantage of stemming.
[21:36:05] It didn't seem to make sense.
[21:36:28] I want to do len(set(non_stopword_stems))/len(words), right?
[21:36:42] I'm not 100% sure
[21:36:48] The old version just did len(non_stopword_stems)/len(words)
[21:37:06] I couldn't figure out the value in generating stems and then not dropping them into a set.
[21:37:33] I went back and looked at Stvilia, and it seems the denominator is len(document)
[21:37:49] so I've always used byte length of the wikitext there
[21:38:10] to match that I do a " ".join(stemmed_nonstops) in the numerator
[21:38:21] Oh... So stemming will make the words a little shorter?
[21:38:29] yep
[21:38:38] removes suffixes, standardizes some terms, etc
[21:38:58] Seems weird, but I can see it working.
[21:39:34] "presumably" becomes "presum"
[21:39:59] for instance
[21:40:41] but the description in Stvilia isn't 100% clear, as the numerator is "(The size of
[21:40:44] the term/token vector after stemming and
[21:40:46] stopping)"
[21:41:31] which kinda sounds like counting number of words
[21:41:38] but then the denominator doesn't match
[21:42:38] so I went back to the one I used earlier, since that was used to train the R model, and that performs well
[21:42:52] Yeah. Whatever performs the best :)
[22:12:35] lzia: can you block a room for the 4pm?
[23:17:53] Quarry: Allow published query titles to be searched or filtered by tag - https://phabricator.wikimedia.org/T90509#1060909 (DarTar) NEW
[23:26:24] * halfak runs off to (maybe) go pick up a puppy
[23:26:29] have a good evening folks
[23:26:29] o/
[23:29:17] PUPPY????
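
Annotation on the infonoise exchange (21:36 to 21:42): the three ratios mentioned there can be put side by side in a rough Python sketch. This is not the revscoring or wikiclass code. The stopword list, the `toy_stem` suffix stripper, and all function names are hypothetical stand-ins so the example runs on its own; a real pipeline would plug in a proper stemmer and stopword list. Whether the final feature is one of these ratios or some transformation of it is not settled by the discussion, so only the ratios are computed.

```python
import re

# Hypothetical stand-ins (assumptions, not from revscoring/wikiclass):
# a tiny stopword list and a naive suffix stripper instead of a real stemmer.
STOPWORDS = {"the", "a", "an", "is", "and", "of", "to", "in"}
SUFFIXES = ("ably", "ing", "ed", "ly", "s")  # checked in this order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def words_and_stems(text):
    """Return (all lowercase words, stems of the non-stopword words)."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = [toy_stem(w) for w in words if w not in STOPWORDS]
    return words, stems


def ratio_token_count(text):
    """'Old version' from the chat: len(non_stopword_stems) / len(words)."""
    words, stems = words_and_stems(text)
    return len(stems) / len(words) if words else 0.0


def ratio_unique_stems(text):
    """Set-based variant: len(set(non_stopword_stems)) / len(words)."""
    words, stems = words_and_stems(text)
    return len(set(stems)) / len(words) if words else 0.0


def ratio_byte_length(text):
    """Length-based variant: bytes of " ".join(stems) / bytes of the text."""
    _, stems = words_and_stems(text)
    doc_bytes = len(text.encode("utf-8"))
    joined_bytes = len(" ".join(stems).encode("utf-8"))
    return joined_bytes / doc_bytes if doc_bytes else 0.0


sample = "The cats and the cat is presumably running"
print(ratio_token_count(sample))   # stems vs. words, duplicates counted
print(ratio_unique_stems(sample))  # duplicates ("cats"/"cat") collapse
print(ratio_byte_length(sample))   # shrinks further because stems are shorter
```

The set-based variant only differs from the old one when stemming maps several words to the same stem; the length-based variant is the only one where stemming shortens the numerator directly, which is the effect asked about at 21:38:21.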