[00:03:26] so Ironholds, if this is something you want to do for a large dataset, I can think of building a prediction model that can tell you if a user_agent needs to be removed or not
[00:03:58] leila, ooh. How?
[00:04:25] If you're doing it for a small dataset, it's not worth the effort to build prediction models; you're better off looking at the distribution of pageviews by user_agents per page, and finding cut-off points.
[00:06:44] for the first method, you need to define features that you think can help, for example, for each (user_agent, page_id, views) tuple, 1) difference between mean page_views and the specific user_agent's pageview counts, 2) number of user_agents who have viewed the same page_id, etc.
[00:07:15] then you need to hand-code, for a subset, those entries that need to be removed.
[00:07:21] and then build a model.
[00:08:20] If you know what the distribution of pageview counts for each page is, you may be able to use that information to get away with simpler heuristics.
[00:09:06] For example, if the distribution is normal, you can remove data points outside of mu +/- 3*sigma
[00:12:35] gotcha
[00:12:59] leila, so in this case, what I did was calculate the Herfindahl measure for each page, and after removing pages with <1000 observations to avoid sample-size problems, plotted the distribution
[00:13:25] it showed the vast majority of pages have very low concentrations and a small subset have very /high/, so I set an arbitrary threshold at 0.5 and began investigating those above it
[00:13:39] all of the ones above ~0.8 definitely contain a big chunk of automated traffic; I hand-coded the lot
[00:13:54] not sure if this helps with the modelling, but. I would love to dig into this! I think it could be a really big deal.
[00:19:25] Ironholds: I'm curious. Have you looked at the box-plot of a sample of pages to see if it identifies the outliers properly?
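The Herfindahl-style concentration check Ironholds describes at [00:12:59] can be sketched as follows. This is a toy illustration with made-up (page_id, user_agent, views) tuples, not the actual pageview logs; the 0.5 cut-off is the arbitrary threshold mentioned above.

```python
from collections import defaultdict

def herfindahl(view_counts):
    """Herfindahl concentration: sum of squared shares of views per user agent.
    Near 1/N when traffic is evenly spread, near 1.0 when one agent dominates."""
    total = sum(view_counts)
    return sum((v / total) ** 2 for v in view_counts)

# Hypothetical (page_id, user_agent, views) tuples, standing in for the real data
rows = [
    (1, "agent_a", 10), (1, "agent_b", 12), (1, "agent_c", 11),   # evenly spread
    (2, "agent_a", 980), (2, "agent_b", 10), (2, "agent_c", 10),  # one dominant agent
]

views_by_page = defaultdict(list)
for page_id, _agent, views in rows:
    views_by_page[page_id].append(views)

# Flag pages above the arbitrary 0.5 threshold for hand-inspection
flagged = [p for p, v in views_by_page.items() if herfindahl(v) > 0.5]
print(flagged)  # only the page with one dominant agent
```

Pages flagged this way would then be hand-coded, as described above, before feeding anything into a model.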
[00:21:55] actually, no; I drew up some histograms and density plots
[00:22:10] oh, you mean the outliers within each page, on a per-agent basis?
[00:22:15] yeah
[00:22:22] interesting; I'll check it out. The distribution appeared unfortunately bimodal but I'll see what I can do :)
[00:22:38] that's what you want, right? for each page, you want to say which user_agent views should be excluded
[00:23:13] yep!
[00:34:54] Ironholds: two more comments before I leave this to you.
[00:35:23] 1) If you can find the underlying distribution of the data, you can use simulation to identify outliers.
[00:36:02] you let the simulation run for 10K iterations and see how many times it produces a result like the anomaly you've observed in the data.
[00:36:19] ooh
[00:36:21] sensible!
[00:36:48] 2) If you're interested in this work, reading more on outlier analysis can help you.
[00:37:06] I did a quick search and people seem to be recommending this one: http://www.amazon.com/Outlier-Analysis-Charu-C-Aggarwal/dp/1461463955/
[00:37:17] (not sure if it's the best one. feel free to search with the keyword)
[00:42:22] thanks!
[00:47:36] halfak, you might wanna check out https://github.com/twitter/pycascading btw
[17:51:02] mornin' halfak, Ironholds, J-Mo. :-)
[17:51:15] hokay, good afternoon, Ironholds. ;-)
[17:51:18] good morning, leila!
[17:51:34] oops. too many pleasantries. we offended Toby.
[17:53:34] he's not voiced. ;-)
[18:02:35] morning
[18:36:00] haha, morning Ironholds
[18:51:05] so halfak, re traffic per page_id (discussion from @Analytics), is there a way to do this in hive, or should I use sqoop to do it?
[18:51:40] :) leila, if you wait maybe 1 month, it will be very easy
[18:51:55] project is underway to get page_id in X-Analytics
[18:52:15] love it, ottomata. :-) but now I need to get a sense of traffic for the page impressions of WikiGrok that will go live next week.
[18:52:28] could you do with page_title for now?
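Leila's simulation idea ([00:35:23]–[00:36:02]) can be sketched like this: assume a distribution for per-agent pageviews, draw from it many times, and count how often the simulation produces something as extreme as the observed anomaly. The Normal model and all numbers here are hypothetical placeholders, not fitted to the real data.

```python
import random

random.seed(0)  # reproducible draws

def simulated_tail_prob(observed, mu, sigma, n_draws=10_000):
    """Estimate, under an assumed Normal(mu, sigma) model, the fraction of
    simulated draws at least as far from the mean as the observed value."""
    threshold = abs(observed - mu)
    hits = sum(1 for _ in range(n_draws)
               if abs(random.gauss(mu, sigma) - mu) >= threshold)
    return hits / n_draws

# Toy numbers: suppose a page typically gets ~100 views per agent (sd 15),
# and one agent shows 10,000 views.
p = simulated_tail_prob(observed=10_000, mu=100, sigma=15)
print(p)  # ~0: the observation is essentially impossible under the assumed model
```

A near-zero estimate says the observed count is implausible under the assumed distribution, which is the cue to exclude that agent.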
[18:54:08] so I export the number of hits to all page_titles accessed through m.en.wikipedia and then join it with enwiki.page, and then enwiki.wikigrok_questions?
[18:54:56] ja for now, if you want to associate a request with a page, you need to do it through the uri_path
[18:54:58] which is not ideal
[18:55:00] but what we have atm
[18:55:13] that's fine. I'll do that, ottomata. thanks!
[18:55:17] here's an example in an old gist i found
[18:55:17] https://gist.github.com/ottomata/d29dba9d04ec786b2d41
[18:55:33] that was joining with the sqooped plwiki page table
[19:23:52] Ironholds, yt?
[19:23:59] DarTar, I deny everything
[19:24:03] too late
[19:24:12] actually I am here making a philosophy joke on enwiki
[19:24:13] quick question
[19:24:30] ah great, you ended up editing the Perdurantism article? :)
[19:25:06] so, I sat down with Deskana to talk about the instrumentation / analytics reqs of ShareAFact
[19:26:04] if you haven’t heard of that, it’s an app feature that allows you to highlight text, generate a snippet and share it via a number of channels
[19:26:44] yeah, I heard of it
[19:26:51] and no, it's to illustrate the KnowledgeGraph article
[19:27:01] I'm using Edmund Gettier's KG entry and gonna see how many people notice why that's funny
[19:27:01] I told Deskana we should do 2 things:
[19:27:19] - add the oldid parameter to the backlink
[19:27:46] - add a parameter that we can parse from the requested URL to identify inbound traffic driven by this feature
[19:28:05] so for example: https://en.wikipedia.org/wiki/December_2014_North_American_storm_complex?oldid=640061525&src=app
[19:28:23] question is, do you have any recommendation about the parameter name to use?
[19:28:37] I’m already using src with @wpstubs
[19:28:49] err
[19:28:59] ...why should we do that? Why don't we use EventLogging?
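Associating a request with a page through uri_path, as in ottomata's gist at [18:55:17], boils down to pulling the page title out of the request path before joining with the page table. The gist itself is Hive; this is a minimal Python sketch of the same extraction step, assuming the standard /wiki/<Title> article path:

```python
from urllib.parse import unquote

def title_from_uri_path(uri_path):
    """Extract a percent-decoded page title from a /wiki/<Title> request path.
    Returns None for paths that aren't article views (API calls, etc.)."""
    prefix = "/wiki/"
    if not uri_path.startswith(prefix):
        return None
    return unquote(uri_path[len(prefix):])

print(title_from_uri_path("/wiki/Edmund_Gettier"))  # Edmund_Gettier
print(title_from_uri_path("/w/api.php"))            # None
```

The decoded titles can then be joined against page_title in the sqooped page table, which is the not-ideal-but-workable approach described above.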
[19:29:03] so unless this is used for something else that I am not familiar with, or it creates parser nightmares, we’ll go for it
[19:29:10] no, to be clear:
[19:29:33] we do have EL instrumentation for the generating event
[19:29:43] draft is here: https://meta.wikimedia.org/wiki/Schema:MobileWikiAppShareAFact
[19:30:07] but I want to measure how much returning traffic the feature is generating
[19:30:21] and that requires parsing HTTP requests
[19:30:41] DarTar: I'm just writing a patch to add the ?src=app onto the end
[19:30:47] DarTar, aha
[19:31:08] so, is there any reason *not* to use a src=foo parameter you can think of?
[19:31:21] we might be using that for something else?
[19:31:25] I'll grab some sampled logs and see
[19:31:45] cool, “c” and “campaigns” are also taken
[19:31:51] I could easily change it to "sharesrc=app" or something
[19:32:07] Whatever you guys want. :)
[19:34:28] Deskana: are you guys limiting the ability of using the feature right after an edit by the same person?
[19:34:57] DarTar, src seems fine, but go for the more complex name to avoid future clashes
[19:35:13] DarTar: No.
[19:35:38] Ironholds: cool, I can also request that we change the parameter used by wpstubs
[19:35:39] Okay, I'll go with "?sharesrc=app"
[19:35:56] DarTar: Sound good?
[19:35:57] Deskana: how about “source”?
[19:36:07] Sure, whatever you like. :)
[19:36:07] so we can use it for other use cases
[19:36:13] cool
[19:36:51] Deskana: I’m a bit worried about the lack of a filter for sharing vandalized articles, I talked to moiz a while ago and I was hoping this could make it to the MVP
[19:37:23] The thing is though, for every case we cover there are a ton of other cases that we wouldn't manage to cover.
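Measuring inbound traffic driven by the feature then comes down to parsing the agreed parameter out of requested URLs. A sketch using the "?sharesrc=app" name settled on above (the exact name was still being discussed, so treat it as a placeholder):

```python
from urllib.parse import urlparse, parse_qs

def share_source(url):
    """Return the value of the 'sharesrc' query parameter, or None if absent."""
    params = parse_qs(urlparse(url).query)
    return params.get("sharesrc", [None])[0]

url = ("https://en.wikipedia.org/wiki/December_2014_North_American_storm_complex"
       "?oldid=640061525&sharesrc=app")
print(share_source(url))  # app
print(share_source("https://en.wikipedia.org/wiki/Perdurantism"))  # None
```

Counting requests where this returns "app" gives the returning-traffic number DarTar wants, without any EventLogging involvement.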
[19:37:27] “look, mum: Wikipedia says I am the President of the US” (and you can verify it by linking back to the article)
[19:37:56] Actually, since we don't link to the specific revision, right now that won't work :)
[19:38:02] *cough
[19:38:12] that’s a bug not a feature :p
[19:38:13] If it's vandalism it will get removed and not be there when the user clicks the link
[19:38:35] blatant vandalism, yes – the more subtle kind, no
[19:39:03] All that we've done is make it a bit easier to do this, but the fact is that there's nothing to stop people from doing it right now except that it's a bit harder.
[19:39:37] a simple check for the author of the last edit would do the job
[19:39:59] this is actually where the revid comes in handy
[19:40:21] we’ll be able to see how many of the shared snippets come from reverted revisions
[19:42:44] DarTar: Wouldn't a malicious user just log out immediately after making the change?
[19:43:46] Deskana: I don’t think so, the point wouldn’t be to permanently vandalize Wikipedia but to brag to your friends that you did it
[19:44:53] I receive news alerts for Google News articles about Wikipedia and an ugly lot of these articles are about “Article X was changed to say [your favorite vandalism]”
[19:45:05] that says a lot about what the press responds to
[19:45:24] still, it worries me that we might be creating a mechanism further amplifying this behavior
[19:46:20] DarTar: I'd prefer to worry about that if/when it becomes a problem, rather than trying to catch every edge case with our first pass.
[19:47:03] true, I’m just thinking of our past record of candid releases of features (AFT) that didn’t take into account obvious ways of hacking/gaming them
[19:47:22] if anything, we’ll be able to tell right after the launch if this is an issue
[20:09:40] Oh no! Leila, I have a dataset :)
[20:09:47] With view rates.
[20:09:49] By page ID
[20:09:57] that I wrangled from the sampled logs.
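Checking how many shared snippets come from reverted revisions ([19:40:21]) is often approximated with identity-revert detection: a revision counts as reverted if a later edit restores an earlier revision's content checksum. A simplified sketch with hypothetical (rev_id, sha1) history data, not the actual MediaWiki API:

```python
def identity_reverted(history):
    """history: list of (rev_id, sha1) pairs in chronological order.
    Returns the set of rev_ids undone by a later edit that restored an
    earlier checksum (a simplification of full revert detection)."""
    first_seen = {}   # sha1 -> index of the first revision with that checksum
    reverted = set()
    for i, (rev_id, sha1) in enumerate(history):
        if sha1 in first_seen:
            # everything between the restored state and this edit was reverted
            for rev, _ in history[first_seen[sha1] + 1:i]:
                reverted.add(rev)
        else:
            first_seen[sha1] = i
    return reverted

# rev 3 restores rev 1's content, so rev 2 (the vandalism) was reverted
history = [(1, "aaa"), (2, "bbb"), (3, "aaa")]
print(identity_reverted(history))  # {2}
```

A shared snippet's oldid could then be checked against this set to estimate how often the feature amplified vandalized content.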
[20:10:27] stat1003:/home/halfak/projects/importance/datasets/article_stats.tsv
[20:10:43] "resolved_views" accounts for views to redirects.
[20:11:01] Oh... but wait. We might need to limit this to just mobile web.
[20:12:06] lemme check it halfak.
[20:16:25] yeah, halfak. that's a good dataset. It won't work for mobile counts though. :-(
[20:17:01] We should be able to update it reasonably with the sampled logs. What's your timescale for getting this done?
[20:22:34] leila, ^
[20:23:36] don't worry about it for now, halfak. I need it to get a sense of the traffic to pages to get back to the discussion on the analytics list for wikigrok. I'm working on some code to get the data from the logs. If you had it already, I would happily use it though. ;-)
[20:23:53] kk godspeed
[20:23:55] :)
[20:37:50] R&D peeps, http://cran.r-project.org/web/packages/openssl/index.html boom!
[20:37:54] get yo cryptographic hashing here
[20:39:33] Cool! Nice work.
[20:56:28] * halfak tests out Ironholds' crypto lib.
[20:56:46] check out the speeeed
[20:57:48] What's a good way to time some R?
[21:06:49] halfak, check out the microbenchmark package
[21:11:49] > start = Sys.time(); v = sha256(as.character(1:100000)); Sys.time() - start
[21:11:50] Time difference of 0.7537768 secs
[21:12:14] The comparable run in Python was 27 seconds
[21:30:19] halfak, hmn?
[22:11:20] phabricator makes me want to murder things sometimes
[22:28:17] ottomata, if we switch the webrequest storage format over, is that going to include voiding the existing data store?
[22:38:27] <^d> Ironholds: What gives on T71804 with the [Restricted Project]?
[22:38:33] <^d> I can't even see the project.
[22:38:36] I don't know, I haven't done shit to it
[22:39:01] <^d> What'd you do on Dec. 5th?
[22:39:27] <^d> Obv. project might have changed later, I'm just trying to figure out wtf is going on with that project.
[22:39:43] I assume it's the research-and-data project?
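The R benchmark at [21:11:49] has a rough Python analogue: time a wall-clock delta around a SHA-256 loop over the strings "1" through "100000". Timings vary by machine, so the 27-second figure quoted above is not reproduced here, just the measurement pattern.

```python
import hashlib
import time

# Python counterpart of the R snippet: SHA-256 over the strings "1".."100000",
# timed with a simple wall-clock delta (like the Sys.time() difference in R)
start = time.perf_counter()
digests = [hashlib.sha256(str(i).encode()).hexdigest() for i in range(1, 100_001)]
elapsed = time.perf_counter() - start
print(len(digests), f"{elapsed:.3f}s")
```

For repeatable comparisons, a benchmarking harness (like the microbenchmark package recommended above for R, or timeit in Python) beats a single wall-clock delta, since it repeats the measurement and reports a distribution.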
[22:39:45] ask halfak
[22:40:34] <^d> I see T770 asking for it to be set up with default policies.
[22:40:42] <^d> Wonder why those policies changed.
[22:41:20] Hit F5
[22:41:31] * ^d hits Ironholds instead
[22:41:47] okay, well, you can do that, or we can actually test and resolve this so I can get on with my job.
[22:42:03] <^d> Hehe, I see what you did there ;-)
[22:42:07] <^d> Thank you :)
[22:43:48] Ironholds: i don't understand the question
[22:44:04] ottomata, Toby told me you'd been experimenting with different storage formats from JSON
[22:44:09] ah yes
[22:44:10] and found one that was an OOM faster to query against
[22:44:11] no.
[22:44:13] gotcha
[22:44:25] the raw one will remain as is
[22:44:37] this other format is the first draft of the 'ETL' unicorn
[22:44:46] there should be a couple of hours in it already
[22:44:52] it'll pick up more as time goes on
[22:44:55] wmf.webrequest
[22:45:00] try it out!
[22:45:08] Ironholds: it is also tagged with an is_pageview field
[22:45:11] using your UDF
[22:45:17] gotcha
[22:45:18] precomputed with your UDF
[22:45:36] Cool!
[22:45:40] I'll take a look when I have time
[22:46:00] k, just wait a few days, use it for adhoc hive stuff then, instead of wmf_raw.webrequest
[22:46:13] especially if you are just developing queries or something, ya know?
[22:46:16] should be much faster for most hive queries
[22:46:40] Ironholds: also, TABLESAMPLE will work on it
[22:46:45] to make the queries faster too
[22:47:19] cool!
[22:47:31] ottomata, you might wanna point nuria to it too
[22:47:38] if TABLESAMPLE will work on it and it only contains app pageviews...
[22:47:45] nuria knows all about it :)
[22:47:49] cool
[23:36:30] Ironholds: excellent, even better as the center can distribute them to many different parties