[00:00:00] template extraction and handling, and "turn a mediawiki table into a data.frame"
[00:01:22] * YuviPanda looks into Ironholds’ future
[00:01:26] guess what I am finding?
[00:01:29] GUESS
[00:02:01] pain
[00:02:45] CORRECT! :)
[00:02:58] see "limited"
[00:04:16] Ironholds: indeed, and hence there is only pain, rather than something worse.
[00:05:03] see https://github.com/Ironholds/mwutils
[00:05:10] it also does revert detection and timestamp conversion so far
[00:05:37] oh, and pageview count retrieval from stats.grok.se
[00:06:11] Ironholds: ah, nice :)
[00:06:28] Ironholds: is our new hadoop based infrastructure producing public pageview counts anywhere?
[00:07:06] actually the high-level http://stats.wikimedia.org/ counts are coming from hadoop now
[00:07:15] using the old definition, though. so take it with enough salt to kill a small town.
[00:07:31] right, so nothing with the new defs yet that’s public
[00:07:42] (just curious since someone asked me about that today https://meta.wikimedia.org/wiki/User_talk:Yuvipanda#Quarry_page_counter_values.3F)
[00:09:36] aha
[00:09:52] whelp, think my regex...thing. breaks R.
[00:10:01] I suspect infinite recursion somehow
[00:10:04] * YuviPanda pats Ironholds
[00:10:13] I shall go to sleep and enjoy onset of my cold.
[00:11:06] enjoy!
[00:11:29] Ironholds: thanks for dropping in!
[00:11:31] * YuviPanda goes
[10:18:52] halfak around?
[16:33:36] halfak around?
[17:31:08] Hey ToAruShiroiNeko
[19:20:23] hey halfak :)
[22:34:08] hey Ironholds.
[22:37:00] hey leila!
[22:37:28] working for the next 4-5 hours, thought to let you know in case you're around working and you want to ping (or to be pinged. :D)
[22:39:39] I am indeed working and do not currently have anything to work on :(
[22:46:00] mm, Ironholds, maybe you shouldn't work then. It's a Sunday and you don't have something to work on. so just sign off?
[22:48:23] I don't have anywhere to be! I'll find something
[22:48:39] :-\ okay.
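[Editor's note: the "turn a mediawiki table into a data.frame" step mentioned at [00:00:00] could look roughly like the sketch below. mwutils itself is an R package whose API is not shown here; this is a simplified, hypothetical Python illustration of the transformation (wikitable markup to tabular data), handling only the most basic subset of the syntax.]

```python
def parse_wikitable(markup):
    """Parse a basic {| ... |} wikitable into (header, rows).

    Hypothetical sketch, NOT mwutils' API. Handles only the simple
    subset: '!'-prefixed header cells, '|-' row separators, and
    '|'-prefixed data cells. Real wikitables (cell attributes,
    templates, nested tables) are far messier -- hence the "pain"
    in the conversation above.
    """
    header, rows, current = [], [], []
    for line in markup.strip().splitlines():
        line = line.strip()
        if line.startswith("{|") or line.startswith("|}"):
            continue                      # table open/close markers
        elif line.startswith("|-"):       # row separator: flush current row
            if current:
                rows.append(current)
                current = []
        elif line.startswith("!"):        # header cell(s), '!!'-separated
            header.extend(c.strip() for c in line[1:].split("!!"))
        elif line.startswith("|"):        # data cell(s), '||'-separated
            current.extend(c.strip() for c in line[1:].split("||"))
    if current:                           # flush the final row
        rows.append(current)
    return header, rows

table = """
{| class="wikitable"
! Article !! Views
|-
| Foo || 120
|-
| Bar || 340
|}
"""
print(parse_wikitable(table))
# (['Article', 'Views'], [['Foo', '120'], ['Bar', '340']])
```

In R the header and rows would then be bound into a data.frame; the parsing step above is the part that carries the pain.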
[23:34:05] I really want to build a classifier, though.
[23:34:20] I think I might have to use /Python/ :/
[23:48:09] leila, so I'm building tools and tests to detect if traffic to a page is primarily automatic
[23:49:03] is this for pageview work Ironholds?
[23:49:17] well, mostly it's because I think toby is wrong but we're both stubborn old fools
[23:49:31] in what sense?
[23:49:46] he thinks non-crawler automated traffic is NBD
[23:50:30] aah! I see. he's probably speaking from experience here. You think it's not the case?
[23:51:50] I think that of the top 50 enwiki articles last week, 9 of them are driven almost entirely by bots
[23:52:00] (using very, very basic heuristics - but heuristics I have tested robustly)
[23:52:18] I see.
[23:52:48] how do you want to use this later? I mean, how did this conversation start?
[23:53:30] well, basically it's from my work with the signpost. Every time they run a traffic report they throw some articles at me which shouldn't, in any way, have the amount of traffic they do, and ask what the heck is going on
[23:53:57] ah! I see.
[23:54:01] so I've been exposed to a lot of automata - but unfortunately the basic test that tells us to dig into things is human-only, right? Short of us checking for the variance between days/weeks/months
[23:54:08] and that's vulnerable to breaking news stories
[23:54:15] a bot doesn't know what articles "should" have a lot of traffic
[23:54:45] but a bot can tell what the concentration of requests is like, over user agents, and a bot can tell what proportion of requests are looping, and a bot can tell what the inter-time values are, and a bot can tell...so on.
[23:54:56] so I'm developing some heuristics for identifying this on a per-page and also a per-request basis.
[23:55:58] I understand. That same bot can take into account trending events; for example, it can look at trending hashtags or Google tags across languages.
[23:56:32] it could, but that's beyond me :(
[23:56:54] it makes it more complicated, I agree.
[23:59:41] in the meantime, though, I can absolutely test some really basic heuristics over a big dataset
[23:59:50] so I've grabbed all the enwiki pageviews in an hour and I'm going to play around with those
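[Editor's note: two of the per-page signals Ironholds describes at [23:54:45] — concentration of requests over user agents and regularity of inter-arrival times — could be sketched roughly as below. The field names and thresholds are illustrative assumptions, not the actual tests he built.]

```python
from collections import Counter
from statistics import pstdev, mean

def looks_automated(requests, ua_share_threshold=0.8, cv_threshold=0.1):
    """requests: list of (timestamp_seconds, user_agent) for one page.

    Hypothetical heuristic: flag the page when a single user agent
    dominates, or when requests arrive with near-constant spacing
    (low coefficient of variation of inter-arrival times) -- both
    classic signatures of scripted traffic. Thresholds are made up
    for illustration.
    """
    if len(requests) < 3:
        return False                      # too little data to judge
    times = sorted(t for t, _ in requests)
    gaps = [b - a for a, b in zip(times, times[1:])]
    # Concentration: share held by the single most common user agent.
    top_ua_share = Counter(ua for _, ua in requests).most_common(1)[0][1] / len(requests)
    # Regularity: coefficient of variation of inter-arrival times.
    cv = pstdev(gaps) / mean(gaps) if mean(gaps) > 0 else 0.0
    return top_ua_share >= ua_share_threshold or cv <= cv_threshold

# A page hit exactly every 60s by one agent looks scripted:
bot_like = [(i * 60, "ExampleBot/1.0") for i in range(10)]
print(looks_automated(bot_like))  # True under these assumptions
```

A per-request version would score individual hits instead of whole pages, but the same signals (agent concentration, looping, inter-arrival spacing) apply.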