[00:40:47] ewulczyn: the table is staging.referer_data
[00:56:39] I made another mock for the research index. https://meta.wikimedia.org/wiki/Research:Index/Sandbox_splash
[13:06:26] o/ Ironholds
[14:13:55] here: https://github.com/halfak/Mediawiki-Utilities/tree/master/examples
[14:14:24] great! thanks!
[14:14:29] No problemo.
[14:14:52] tnegrin, perfect. What is our deliverable for PVs by the end of Q2? I can't see it called out in the slides explicitly.
[14:15:17] Say, if you end up using the DB stuff on LabsDB, I think you'll run into issues related to "revision_userindex".
[14:15:37] If you have some bad performance issues, I'll be happy to work with you to get them settled.
[14:15:43] Ironholds: we need to figure that out -- for R/D it's the definition that the devs can implement
[14:16:05] * Ironholds nods
[14:16:06] ta
[14:16:12] but I don't want to wait until the end of Q2
[14:16:17] we'll discuss at staff
[14:19:15] kk
[14:25:29] Hey Ironholds, can you look at a mock for the new R:Index home that I've worked out?
[14:25:31] See https://meta.wikimedia.org/wiki/Research:Index/Sandbox_splash
[14:25:44] Proposed to replace this: https://meta.wikimedia.org/wiki/Research:Index
[14:26:36] halfak, sure! After the meeting/etc? Dealing with PV stuff at the mo
[14:26:43] No worries.
[14:26:45] Thanks
[14:38:27] halfak: are these examples also supposed to be in some of my folders once mediawiki-utilities is installed? Or only on github?
[14:38:46] Only in github.
[14:38:55] ah, ok
[14:39:01] Sorry. It would be nice if I could provide a way to run the examples in the package.
[14:42:39] halfak: is there a word missing on this? "ClueBot NG provides an IRC feed of its scores which, but not a querying interface."¹
[14:42:40] ¹https://meta.wikimedia.org/w/index.php?title=Research:Revision_scoring_as_a_service#Availability_of_scores
[14:42:46] "which" what?
[14:43:12] You're right. s/which//
[14:44:56] what is the meaning of "svm" in "text_svm"?
[14:45:02] halfak: https://meta.wikimedia.org/w/index.php?title=Research:Revision_scoring_as_a_service#Query_scoring_as_a_service
[14:45:33] "support vector machine"
[14:45:39] brb meeting
[14:45:47] ok
[15:28:48] anyone know who runs http://tools.wmflabs.org/geohack/ ?
[15:30:26] tools front page says Magnus Manske and Kolossos
[15:30:32] ta
[15:30:35] * Ironholds will scold them
[15:48:18] Hey Helder. Just got done.
[15:51:34] Hey Nettrom, just saw your email. I'll do the bits you don't want to do.
[15:51:59] I think the two parallel next steps are to (1) play with the datasets I built or (2) gather edit stats for Q3.
[15:58:30] hmmm
[15:58:34] what's our time schedule?
[15:59:20] ASAP
[15:59:26] Where possible == reasonable
[15:59:58] I'm hoping to have the analysis done by the end of the week.
[16:00:10] But that depends on what we find.
[16:00:23] All relevant decisions for the Growth team have already been made.
[16:00:31] We're informing future work.
[16:00:48] And, potentially, your work.
[16:00:53] mhm
[16:00:58] * Nettrom looks at his calendar
[16:06:25] We don't have to *finish* the analysis so much as make a good pass. And by "we" I mean "I". If you wanted to delay digging deeper into this data, that'd be fine.
[16:06:56] no, just trying to figure out how to be productive and efficient
[16:07:32] I think digging further into the data can be delayed until we've answered the first few questions
[16:07:37] I'm sure more will pop up then
[16:08:18] how about I tackle the datasets for now?
I'd prefer to focus on that for now, then tackle the SQL the next time around
[16:08:19] agreed
[16:11:47] if that's cool I'll schedule some time for it today and tomorrow morning
[16:12:45] Sounds good :)
[16:12:49] Nettrom, ^
[16:12:57] halfak: awesome, on my calendar
[17:00:01] halfak: once I clone https://github.com/halfak/Revision-Scoring, what is the recommended way to avoid "ImportError: No module named 'revscores'" when I try to execute "from revscores import APIExtractor"?
[17:00:38] If you run python from the project directory, it should find "revscores"
[17:00:52] * halfak goes from one meeting to another.
[17:01:45] * Helder will try that
[17:02:20] it works
[17:40:29] Woo Helder. Glad it worked as expected
[17:40:43] * halfak is super stoked you are using his tools.
[17:40:53] The Revision-Scoring stuff is still pretty WIP.
[17:54:07] halfak: how will the system deal with vandals who deliberately introduce typos in the badwords to attempt to confuse the system?
[17:54:33] Good Q. Clever vandals were always a weak spot for machine learning strategies.
[17:54:39] Luckily most vandals aren't clever.
[17:54:46] for example, say "badword" is changed to "bapword"
[17:54:57] We could do edit distance, but I'm worried that will dramatically increase false positives.
[17:55:29] halfak: for training the system, are these variations desired in the set of badwords or only *true words*?
[17:55:59] I'm not sure I understand the question.
[17:56:20] halfak: in lists like this:
[17:56:21] https://github.com/halfak/Revision-Scoring/blob/master/revscores/language/english.py
[17:56:48] should we add two variations of the same expression? (if their stem is different)
[17:57:28] It would be good to test. We could try training and testing classifiers on two different badwords lists -- one that accepts variants and one that does not.
[17:57:39] Then see which one has a better AUC or whatever.
[17:58:17] I bet the one with variants will work insubstantially better.
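[Editor's note] The edit-distance idea ("bapword" is one edit away from "badword") and the AUC comparison halfak proposes can be sketched in plain Python. Both functions below are illustrative helpers, not part of revscores:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def auc(positive_scores, negative_scores):
    """Area under the ROC curve, computed as the probability that a
    randomly chosen positive example outscores a randomly chosen
    negative one (the Mann-Whitney formulation)."""
    wins = ties = 0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(positive_scores) * len(negative_scores))

# The obfuscated variant is a single substitution away from the listed badword.
assert levenshtein("badword", "bapword") == 1
```

Matching against a badwords list with `levenshtein(word, badword) <= 1` would catch "bapword", but, as halfak notes, it would also catch legitimate near-neighbors, which is exactly the false-positive risk the AUC comparison of the two trained classifiers would quantify.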
[17:58:26] We could write some code to generate likely variants.
[17:58:34] halfak: AUC = ?
[17:58:46] Area Under the Curve (for an ROC plot)
[17:59:24] e.g. variants(badword) = ["badw0rd"]
[17:59:48] Here is an example I was thinking about: STEMMER.stem("kkkkkkk") != STEMMER.stem("kkk")
[18:00:19] a finite sequence of "k" is usually used to express a laugh in Portuguese
[18:00:40] how many of them would we need in the list?
[18:00:54] "kkk", "kkkk", "kkkkk", ...
[18:01:08] Oh, I don't think we need to get all of them.
[18:01:48] In this example, we'd have the longest_repeated_char feature catch the "word".
[18:01:56] Ironholds, halfak, just a heads up about analytics showcase, in case you'd like to join
[18:02:09] Really, I think we need to focus on signal and let the classifier make sense of the intersections.
[18:02:18] also, for detecting variations, maybe the approach from AbuseFilter helps... it has a ccnorm function which converts similar characters to a common character and also removes duplicates...
[18:02:20] Thanks leila
[18:02:37] np
[18:02:37] Helder +1 for that.
[18:02:41] Makes sense.
[18:02:51] leila, I can't make it, I'm afraid, but thanks for the notice
[18:02:55] So normalize("badw0rd") --> "badword"
[18:03:04] np, Ironholds.
[18:03:48] something like that...
[18:03:55] See "ccnorm" here: https://www.mediawiki.org/w/index.php?title=Extension:AbuseFilter/Rules_format#Functions
[18:04:30] norm( "!!ω..ɨ..ƙ..ɩ..ᑭᑭ..Ɛ.Ɖ@@l%%α!!" ) === W1K1PED1A
[20:22:20] ewulczyn, lemme know when you want to talk last-read-page-for-donors, btw.
[20:24:30] I'm going to punch MySQL in the face.
[20:24:59] As soon as I find out where its face is, I'm going to punch it.
[20:25:25] "I know I have an index, but I'll just scan the whole revision table. What's 500 million rows?"
[20:25:32] Stop it, I only need 50k of them.
[20:25:48] "Nope. Gonna scan the whole thing. U MAD BRO?"
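[Editor's note] The two ideas in this thread, a `longest_repeated_char`-style feature that catches "kkkkkkk" without enumerating every length, and a ccnorm-style normalizer that maps "badw0rd" back to "badword", can be sketched as follows. The confusable-character map here is a tiny illustrative subset, not AbuseFilter's actual table, and real ccnorm normalizes differently (e.g. to uppercase):

```python
import itertools
import re

def longest_repeated_char(text):
    """Length of the longest run of one repeated character in text."""
    return max((sum(1 for _ in run) for _, run in itertools.groupby(text)),
               default=0)

# Illustrative subset of visually-confusable characters (the real
# AbuseFilter ccnorm table is far larger).
CONFUSABLES = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                             "5": "s", "7": "t", "@": "a", "$": "s"})

def ccnorm(text):
    """Map similar characters to a common one and collapse runs of
    repeated characters, loosely imitating AbuseFilter's ccnorm()."""
    normalized = text.lower().translate(CONFUSABLES)
    return re.sub(r"(.)\1+", r"\1", normalized)

# The feature catches arbitrarily long laughs without listing them all...
assert longest_repeated_char("kkkkkkk") == 7
# ...and normalization maps the leetspeak variant onto the listed badword.
assert ccnorm("badw0rd") == "badword"
```

Note that `ccnorm("kkkkkkk") == ccnorm("kkk") == "k"`, so normalization alone would also sidestep the stemmer problem Helder raises.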
[20:25:48] SELECT beatdown FROM halfak INNER JOIN mysql ON halfak.fist_coords = mysql.face_coords?
[20:26:14] approved
[20:26:44] halfak: does it help if you explicitly suggest it use the index?
[20:26:50] Regretfully, no.
[20:26:58] * Nettrom facepalms
[20:27:05] I'll GIST quick. Maybe you guys will see something I'm missing.
[20:27:32] https://gist.github.com/halfak/38b1917d4a34d902d083
[20:27:54] tr_experimental_user contains ~40k rows and most of the users didn't even make an edit.
[20:29:17] I just updated it to include the "FORCE INDEX". The query plan doesn't change.
[20:30:29] I wish I could just write my own query plan. (1) Scan tr_experimental_user (2) btree lookup on revision (3) PRIMARY lookup on page (4) PRIMARY lookup on revision for parent
[20:30:32] DONE
[20:30:41] It would take about 15 seconds to run.
[20:30:45] hum
[20:30:53] tried index hinting?
[20:30:57] See gist
[20:31:04] bah. I'm an eejit.
[20:31:06] "FORCE INDEX (user_timestamp)"
[20:31:34] halfak: what happens if you move the rev_timestamp check to the WHERE-clause instead of the join?
[20:32:13] Same query plan
[20:32:18] bollocks
[20:32:22] Agreed
[20:32:33] or should I not swear here?
[20:32:44] I think you're OK.
[20:33:05] * halfak wonders if bollocks is considered a swear
[20:33:08] Nettrom, you can swear, but they have to be imaginative swears.
[20:33:32] I will accept the sentence "MySQL is being an utter shitcanoe" for example.
[20:33:46] +1
[20:33:53] for shitcanoe
[20:34:04] I discovered it via John Scalzi a few days ago.
[20:34:09] and...it made me very happy.
[20:34:18] "Finally, I have something for those occasions where shitheel just isn't enough"
[20:35:02] halfak: is there a type mismatch between user_id and rev_user?
[20:35:30] making it think it can't just use the index
[20:35:44] It still thinks the index is relevant.
I think the problem is the cardinality of the rev_user index
[20:36:01] Some editors have lots of edits, therefore, we should not even try using that index.
[20:36:16] It's INT vs. UNSIGNED INT
[20:40:04] I just converted to UNSIGNED INT in the experimental_user table and that didn't do it.
[20:40:24] What's the point of "FORCE INDEX" if it doesn't force it to use the index?
[20:41:08] This is why people use postgres
[20:41:41] well, postgres is fond of table scans too..
[20:42:10] Indeed, but it does what I tell it.
[20:42:11] this SO entry suggests it depends on what the query optimiser thinks it should do: http://dba.stackexchange.com/questions/23289/mysql-query-not-using-an-index-when-table-contains-many-records
[20:42:47] Yeah. It thinks there's a ton of rows behind every rev_user value.
[20:42:55] I wish I could tell it to stop thinking that.
[20:43:41] move the join to a sub-select?
[20:43:49] sounds terrible, but might work
[20:44:05] * Nettrom needs to get back to his PhD review
[20:44:21] Tried that
[20:44:28] No worries. Thanks for looking at it with me ;)
[20:44:30] o/
[20:47:53] I got it! I made a temp table with no indexes of the complete experimental_user table and ran against that.
[20:48:04] * halfak feels no shame
[20:48:06] none at all
[20:51:16] 1 minute and 27 seconds
[20:51:18] Damn right.
[20:51:57] halfak, http://whatever.scalzi.com/2014/09/30/hey-kids-lets-define-a-word/ speak of the devil
[20:53:02] Yay! I voted for the winningest option
[20:53:13] I like second place too
[20:53:41] The more you know... (rainbow)
[20:53:49] heh
[20:53:54] as did I!
[20:53:58] I mean, it just parses better.
[20:54:07] but it amused me that I evidently got the word /straight from the source/.
[21:29:40] Ironholds: Would 4 pm be too late to talk about last-read-page-for-donors?
[21:29:56] you mean, 7pm? yes ;p
[22:33:18] Hey Nettrom, I just finished up the analysis of Q3.
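[Editor's note] A sketch of the workaround halfak describes, materializing the experiment table without indexes so the optimizer has no choice but to drive the join from the small table and do per-user index lookups on revision. Column names are guesses based on the discussion, not the actual gist:

```sql
-- Copy the ~40k experimental users into an index-free temporary table.
CREATE TEMPORARY TABLE tmp_experimental_user AS
SELECT user_id
FROM tr_experimental_user;

-- With no index (and no misleading cardinality statistics) on the temp
-- table, the plan starts there and probes revision's user/timestamp index.
SELECT r.rev_id, r.rev_user, r.rev_timestamp
FROM tmp_experimental_user AS u
INNER JOIN revision AS r
    ON r.rev_user = u.user_id;
```

This mirrors the hand-written plan halfak wished for: scan the small table first, then b-tree lookups on revision, rather than letting the optimizer's per-user row estimates talk it into a full scan.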
[22:33:20] https://meta.wikimedia.org/wiki/Research_talk:Newcomer_task_suggestions/Work_log/2014-09-30
[22:33:24] TL;DR: No effect
[22:49:01] oh darn, you're efficient... I'll have to get moving on this...
[22:49:06] interesting to see that there's no effect, though
[22:49:26] will work on it later today and tomorrow morning
[22:50:08] halfak: I'm heading home now, ping me on mail or IM if there's anything
[22:50:17] Will do, have a good one!
[22:50:20] Lame
[23:16:47] DarTar: wtf, [[Less (Unix)]] is really highly viewed?
[23:17:13] yeah, we’re aware of that one ;)
[23:17:26] reason #1 not to get too attached to raw PVs
[23:18:59] wat
[23:19:43] halfak: ErikM’s explanation was that the planet is finally switching to Unix
[23:19:53] ha!
[23:19:55] Woo
[23:19:59] I <3 less
[23:20:08] the reality is that we have a crawler very interested in Less (Unix)
[23:20:16] Haha
[23:20:17] from a single IP address ;)
[23:20:23] not a joke
[23:20:34] Incredible.
[23:20:45] huh. I bet it's an old joke that's no longer funny.
[23:20:51] we had another case where ErikZ suggested it could be a single user with an F5 key stuck on his keyboard
[23:21:06] heh
[23:22:52] Is it all from the same IP?