[11:13:59] ToAruShiroiNeko:
[11:14:09] https://www.irccloud.com/pastebin/q8MvB7tG/
[15:59:49] ToAruShiroiNeko: around?
[16:04:21] yes
[16:07:12] White_Cat: to the skype
[16:07:33] yes
[16:07:40] halfak is traveling it seems
[16:08:17] yes
[16:08:22] we wait for him
[16:41:09] Amir1 so I suggest trying Naive Bayes as a control to see how well your ANN is doing
[16:41:31] Yes I know cross validation
[16:41:41] I meant how well it's doing compared to naive bayes
[16:44:12] Using words sounds good to people
[16:44:21] but technically it's far harder
[16:45:50] White_Cat: ^
[16:48:50] okay
[16:49:09] so you want to train tf-idf on category words?
[16:49:17] as in words in category titles
[16:49:20] or words in articles?
[16:49:26] words in articles
[16:49:34] cool
[16:49:40] that's an exciting application
[16:49:49] what are you trying to achieve?
[16:51:00] Classifying articles in a very general way: "is it about a human or not?" "the article is about a male person or female person?" "the subject is gay or not?"
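(Note: the Naive Bayes control suggested above could look roughly like the sketch below. It is a minimal multinomial Naive Bayes over the words of an article, written with only the standard library. The training texts, the labels "human" / "not_human", and the function names are all made up for illustration; nothing here is the code discussed in the channel. The idea is only that the ANN should beat a simple baseline like this on the same data.)

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Collect per-class document counts and word counts (hypothetical helper)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-probability under the model."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothing so unseen class/word pairs are not zero
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, invented training data standing in for article text
train_docs = [
    "she was born in 1950 and became a politician",
    "he was born in 1920 and worked as a painter",
    "the river flows through the valley to the sea",
    "the village lies in the valley near the river",
]
train_labels = ["human", "human", "not_human", "not_human"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "a painter born in 1890"))
print(predict(model, "a river near the sea"))
```

(In practice both models would be scored on the same held-out articles, so the baseline number and the ANN number are directly comparable.)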
[16:51:06] things like that
[16:51:32] gay or european :p
[16:51:40] :D
[16:52:09] okay that sounds awesome
[16:52:21] it's a first step towards semantic categories
[16:52:50] that is something super exciting for me
[16:53:07] but implementing it is technically really hard; classifying based on categories is already done (I will commit it to github tomorrow)
[16:53:42] I finished this for Scots Wikipedia as a test
[16:53:49] and AUC of 99.96% came from that
[16:53:52] yes but using words will bring semantics into the mix
[16:53:56] whoa
[16:54:00] they will be very confused :)
[16:54:19] "Good morning, you get AI before German wikipedia"
[16:54:29] Hahaha :)))))
[16:54:46] The feature engineering is a little bit complex but totally worth it
[16:55:00] basically I give two labels to each category
[16:55:36] the first is based on the number of articles that already have such a statement (like P31:Q5, which means the article is about a human)
[16:55:51] sorry not number, proportion
[16:56:37] the second is based on the proportion of articles that have this property but not that statement, which means they are not human for certain.
[16:57:05] with two features you may get an overfitting situation
[16:57:07] and the features of every article are the number of categories it has for each class
[16:57:13] no not
[16:57:19] I get 20 features
[16:57:21] not 2
[16:57:25] oh ok
[16:57:46] I have to run to the next meeting but it's very exciting
[16:57:52] for example let's say Category:1992 Births is a class A category
[16:58:12] (means more than 90% already identified as human)
[16:58:33] so Abraham Lincoln would have several class A categories
[16:58:37] things like that
[16:58:42] I will explain more later
[16:58:46] I have to go too
[16:58:47] bye
[21:20:18] I'll probably not be available tomorrow
[22:07:04] halfak, I'll probably not be available tomorrow
[22:07:12] we exchanged some ideas with amir
[22:07:22] including some fine tuning for tf-idf stop words
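(Note: the category-based features described at 16:55 to 16:58 could be sketched roughly as follows. Each category gets two proportions: the share of its articles that already have P31:Q5, and the share that have P31 with some other value. Each proportion is bucketed into a class, and an article's features are the counts of its categories per class. The log only says there are 20 features and that class A means more than 90%; the split into 10 equal-width bins per proportion is an assumption made here, as are all names and the toy data. The real code was to be committed to GitHub and may differ.)

```python
from collections import defaultdict

N_BINS = 10  # assumption: 10 bins per proportion, giving 2 * 10 = 20 features


def category_bins(article_categories, article_status):
    """Assign each category a (human_bin, not_human_bin) pair.

    article_status maps an article to "human" (has P31:Q5),
    "not_human" (has P31 with another value), or is missing (no P31 yet).
    """
    members = defaultdict(list)
    for article, categories in article_categories.items():
        for category in categories:
            members[category].append(article)

    bins = {}
    for category, articles in members.items():
        total = len(articles)
        human = sum(article_status.get(a) == "human" for a in articles)
        not_human = sum(article_status.get(a) == "not_human" for a in articles)
        # Integer arithmetic avoids float rounding; the top bin covers 90% and up
        bins[category] = (
            min(human * N_BINS // total, N_BINS - 1),
            min(not_human * N_BINS // total, N_BINS - 1),
        )
    return bins


def article_features(categories, bins):
    """Count how many of an article's categories fall into each class."""
    features = [0] * (2 * N_BINS)
    for category in categories:
        if category not in bins:
            continue  # category never seen when the bins were built
        human_bin, not_human_bin = bins[category]
        features[human_bin] += 1
        features[N_BINS + not_human_bin] += 1
    return features


# Toy, invented data: article -> categories, and known P31 status
article_categories = {
    "Person A": ["1992 births", "Living people"],
    "Person B": ["1992 births", "Living people"],
    "Person C": ["1992 births"],
    "River X": ["Rivers of Scotland"],
    "River Y": ["Rivers of Scotland"],
}
article_status = {
    "Person A": "human",
    "Person B": "human",
    "River X": "not_human",
}  # Person C and River Y have no P31 statement yet

bins = category_bins(article_categories, article_status)
features_c = article_features(article_categories["Person C"], bins)
print(bins)
print(features_c)
```

(The resulting 20-element vectors would then be fed to whatever classifier is being trained, with the already-labelled articles as training data.)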