[17:10:26] hey Ironholds
[17:10:47] :) How's the RfC stuff going?
[17:12:11] pretty good! Just arguing with Google
[17:12:15] they don't like automated searching
[17:12:21] which is, you know. Humorous.
[17:12:25] how goes you?
[17:14:02] halfak, ^
[17:14:31] Not bad. I'm still hanging out with family, so I might suddenly disappear.
[17:14:51] But I'm picking the revscoring project so that I can run some tests.
[17:15:09] cool!
[17:15:14] I've been working on some UDFs
[17:15:32] ran into a weird issue with getting maven to recognise JUnit, though, which has blocked me :(
[17:15:52] JUnit == unit testing framework
[17:17:21] yup
[17:20:49] Gotcha.
[17:21:05] * halfak picks up IPython notebook for the 6th time.
[17:21:14] Let's see what's new.
[17:21:23] hehe
[17:33:07] hi halfak, I ran a test using your code http://pastebin.com/MmyKTAa5 , but it keeps running like it was in an infinite loop and doesn't respond (I don't know the word in English for that)
[17:33:37] hey danilo_, not sure what's up. I'll take a look in a few minutes.
[17:33:58] ok
[17:41:10] danilo_, I think I know what's up.
[17:41:22] I have read a guide about support vector classification http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf and it says in item '2.2 Scaling': "We recommend linearly scaling each attribute to the range [-1, 1] or [0, 1]", and we have features like user_age_in_seconds with huge numbers; I think maybe that is the problem
[17:42:26] So, I added probability=True to the constructor of SVC -- that allows us to ask for predict_proba()
[17:42:40] But the docs say that the classifier will take substantially longer to train.
[17:42:49] Regarding the scaling, I think that you are right.
[17:42:52] We have two problems.
[17:43:30] I tried without probability=True and I had the same problem
[17:43:41] One is that a lot of these variables are long-tailed. It would probably be better if we log-scaled them.
[17:45:35] danilo_, not sure then. As you can tell, the examples work. Could it be that you are training with substantially more observations than before?
[17:45:40] How long did you let it run?
[17:46:21] 2 hours
[17:48:11] danilo_, I'm working on running my own tests. I'll let you know if I end up having the same issue.
[17:49:50] ok, I'll try to normalize all features to the range [-1, 1] and test again
[17:57:45] oh, halfak: about an hour after our discussion yesterday, my Christmas presents from my Secret Santa arrived
[17:57:50] they include a book on computing and automata ;p
[17:59:16] Cool! Which books?
[18:09:04] * Ironholds looks
[18:09:13] Kozen's "Automata and Computability"
[18:15:37] halfak: Hi!
[18:15:38] halfak: Will it be possible to override the "hardcoded" params 'kernel="linear", probability=True' on
[18:15:38] https://github.com/halfak/Revision-Scoring/pull/19/files#diff-f66d67fa60b7e7f300f42d45343c331fR16
[18:15:38] ? I assume yes. In that case, will it be possible *because* of the kwargs which is *after* these parameters? (i.e., it will allow these parameters to be redefined because the last one specified wins)
[18:16:04] If you duplicate an arg, it will error out.
[18:16:42] But I am a fan of working with different kernels.
[18:16:57] It's just that this scorer is called "Linear" SVC, so that should probably be hardcoded.
[18:17:10] indeed :-)
[18:17:14] We might want to make a simple "SVC" scorer that does not hard code such things.
[18:17:52] indeed² and then use it for the linear one?
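[A minimal sketch of the split being floated here: a generic SVC scorer that passes keyword arguments straight through to sklearn's SVC, plus a "linear" subclass that pins kernel="linear" and probability=True. The names (SVCModel, LinearSVCModel, train, score) are hypothetical, not the actual revscoring code.]

    from sklearn.svm import SVC

    class SVCModel:
        """Generic scorer: keyword arguments go straight through to sklearn's SVC."""

        def __init__(self, features, **kwargs):
            self.features = features
            self.svc = SVC(**kwargs)

        def train(self, values, labels):
            self.svc.fit(values, labels)

        def score(self, values):
            # predict_proba() is only available if the SVC was built with
            # probability=True; enabling it also makes training noticeably slower.
            return self.svc.predict_proba(values)

    class LinearSVCModel(SVCModel):
        def __init__(self, features, **kwargs):
            # Hard-code what makes this scorer "linear"; repeating kernel= or
            # probability= in kwargs raises a duplicate-argument TypeError,
            # which is the "it will error out" behaviour mentioned above.
            super().__init__(features, kernel="linear", probability=True, **kwargs)

[With this split, LinearSVCModel(features, C=10) could still tune other SVC parameters, while kernel experiments would go through the generic scorer directly.]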
[18:18:15] (use it = subclass/instantiate)
[18:20:06] halfak: even if the "linear" should be hardcoded, what about the probability=True? Shouldn't we be allowed to toggle it?
[18:20:36] Well, that's essential for returning probabilistic scores.
[18:27:28] Helder, I think it is certainly worth running some tests with different args, but we can do that by using the SVC model directly.
[18:29:47] makes sense
[18:30:48] halfak: should I wait until this is fixed before merging? https://github.com/halfak/Revision-Scoring/pull/19/files#diff-478fa498e644949992e7ed83f5836dc0R37
[18:30:54] "Doesn't work yet"
[18:32:10] Nope. It doesn't work yet.
[18:32:16] It will work in the other pull request.
[18:32:25] I split them up because the work was unrelated.
[18:33:07] ah, ok
[18:33:19] * Helder will finish looking into the first patch
[18:36:18] :) Thanks for taking a look. :)
[18:44:28] halfak: patch 1 is now merged
[18:44:33] Woo!
[18:44:52] * Helder will look at the second one
[18:44:57] I coordinated things so that the second pull request *should* merge without a rebase.
[18:46:37] halfak: do you know if it is possible to compare only the two latest commits of https://github.com/halfak/Revision-Scoring/pull/20
[18:46:38] ?
[18:47:14] i.e. the diff between "Adds a demo for scorer" and "Adds pickle demo to scorer"
[18:50:31] * Helder found it: https://github.com/halfak/Revision-Scoring/compare/master...955923fc5f83c9037965f99e7b2f2bc75fe5a37e
[19:20:59] halfak: https://github.com/halfak/Revision-Scoring/compare/master...955923fc5f83c9037965f99e7b2f2bc75fe5a37e
[19:21:11] is that supposed to be test_feature_type?
[19:24:51] Helder, that link doesn't bring me to a line in particular.
[19:25:02] I don't know what you are referring to.
[19:25:17] try this:
[19:25:17] https://github.com/halfak/Revision-Scoring/pull/20/files#diff-0f08f18a0b5c2b30fe9507b7d086885eR21
[19:26:04] teat_feature_type()
[19:26:05] Oh. yes. typo
[19:26:46] halfak: in the other pull request I noticed some capitalized "Where"s which could be fixed in the same commit
[19:26:54] * Helder tries to find them
[19:27:40] halfak: search for "labeled data Where is"
[19:27:48] on https://github.com/halfak/Revision-Scoring/pull/19/files
[19:27:54] Looks like that was part of the last pull request.
[19:28:14] yes :-)
[19:28:17] It's fine if you want me to fix them before merging, but it seems unrelated.
[19:28:19] Helder, halfak: I managed to use the linear kernel by reducing the range and setting probability=False: http://pastebin.com/k5TQ8rhd
[19:28:33] (it was of minor importance, so I let it go...)
[19:31:25] halfak: does this date have any special meaning? https://github.com/halfak/Revision-Scoring/compare/master...955923fc5f83c9037965f99e7b2f2bc75fe5a37e#diff-7f54060a935348f181d937a40763b1feR10
[19:31:36] Timestamp("20050101000000")
[19:32:01] That's when user registration dates began being recorded in MediaWiki
[19:32:15] So, it is a conservative lower bound.
[19:32:30] Should be documented probably.
[19:33:12] got it
[19:36:32] post office trip made!
[19:36:33] danilo_: I think these "n / " values should probably be computed by taking the maximum of the values in the dataset for that feature (instead of hardcoding 10, 100, 50000000, etc.)
[19:36:42] halfak, did you know you can get both Harvey Milk and Boy Scouts of America stamps?
[19:36:55] I bought 20 of each and intend to find excuses to send letters featuring the combination, so I can laugh.
[19:37:05] Gotta get back to holiday stuff. Helder, if you give me a list of bits, I'll fix 'em in the pull request next chance I get.
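[A sketch of the scaling discussed above, assuming a plain NumPy feature matrix; the column meanings and numbers are made up. log1p compresses long-tailed features such as user_age_in_seconds, and MinMaxScaler learns each feature's range from the dataset itself (per the 19:36:33 suggestion) rather than using hardcoded divisors, then maps it into [-1, 1].]

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC

    # Toy feature matrix: columns could be e.g. edit_count and user_age_in_seconds.
    X = np.array([[12.0, 3000000.0],
                  [400.0, 86400.0],
                  [3.0, 120.0],
                  [250.0, 50000000.0],
                  [1.0, 60.0],
                  [75.0, 900000.0]])
    y = np.array([0, 1, 0, 1, 0, 1])

    X_logged = np.log1p(X)                        # tame the long tails first
    scaler = MinMaxScaler(feature_range=(-1, 1))  # range learned from the data
    X_scaled = scaler.fit_transform(X_logged)

    model = SVC(kernel="linear", probability=True)
    model.fit(X_scaled, y)

    new = np.log1p(np.array([[50.0, 7200.0]]))
    print(model.predict_proba(scaler.transform(new)))

[The fitted scaler would need to be kept, and pickled, alongside the model, since new revisions have to be transformed the same way before scoring.]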
[19:37:14] enjoy yerself :)
[19:37:27] Ironholds, this juxtaposition is appreciated :)
[19:37:31] Have a good one folks.
[19:37:32] o/
[19:37:38] halfak: you just need to fix the test typo for that pull request
[19:37:48] halfak: you too!
[19:37:52] OK. I'll do that quick and ping.
[19:41:46] Helder, got that typo in the test.
[19:41:53] * halfak runs away
[19:41:54] o/
[19:41:57] bye :-)
[22:32:37] Hey Ironholds
[22:32:42] Got some more time to hack.
[22:32:48] cool!
[22:32:49] I saw your email about Hadoop.
[22:32:58] I don't know what you are talking about with my job.
[22:33:05] I am using less memory and fewer containers than you are right now.
[22:33:12] * Ironholds checks Hadoop again
[22:33:40] yay! This explains why the query ran
[22:34:05] yesterday evening (see the email timestamp) there were two halfak-triggered jobs with ~1TB of memory reserved betwixt them
[22:34:14] I guess one of them must have ended? Or both, but replaced by one new one?
[22:34:43] I dunno dude. I've been monitoring these guys. Also, I haven't *ever* used more than 50 containers for these jobs since I upper-bounded it at that.
[22:35:05] The one that is running right now has been running for almost 7 days.
[22:36:47] yeah, the others were on ~40-50
[22:37:03] But I swear there were two distinct queries yesterday, with approximately 400GB of memory apiece
[22:37:54] But the Hive queries use about the same.
[23:26:31] stupid Google
[23:26:41] I am very, very, very slowly automatically crawling their system
[23:26:52] and by automatically I mean "restarting every 150 queries or so"
[23:32:46] argh.
[23:32:54] identifying duplicated SHAs should not be so hard.
[23:39:06] I feel like this is one of those fundamental CS things with a Knuth algorithm out there somewhere that nobody told me about
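[On the duplicated-SHA question, one standard answer is a single pass with a hash table. A sketch, assuming an iterable of (rev_id, sha1) pairs from whatever revision data is being crawled; the function name and sample rows are made up.]

    from collections import defaultdict

    def duplicated_shas(rows):
        """Group revision IDs by SHA1 and keep only hashes seen more than once."""
        by_sha = defaultdict(list)
        for rev_id, sha1 in rows:
            by_sha[sha1].append(rev_id)
        return {sha1: rev_ids for sha1, rev_ids in by_sha.items() if len(rev_ids) > 1}

    rows = [(1, "aaa"), (2, "bbb"), (3, "aaa"), (4, "ccc"), (5, "bbb")]
    print(duplicated_shas(rows))  # {'aaa': [1, 3], 'bbb': [2, 5]}

[On the Hive side, the same idea is a GROUP BY sha1 HAVING COUNT(*) > 1.]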