[00:24:48] halfak: I suppose the trouble with such stuff is keeping it running without much human intervention
[00:27:58] gwicke: 'such stuff'?
[00:29:20] YuviPanda: SQL servers susceptible to abuse
[00:29:26] ah, yeah
[00:29:39] quarry's doin ok so far, but it's only got about 300 users
[00:31:16] it's always possible to use a bunch of disposable vms that are restarted at regular intervals
[00:31:53] kind of the big hammer method
[00:39:59] gwicke, I'm not sure what we are talking about here, but it seems that quarry's susceptibility to abuse hasn't taken it down yet.
[00:40:34] Also, if services become widely used, that means they are valuable and potentially worth building up.
[00:43:06] Now the part about human intervention... we have humans. We might not have enough humans to support an extra couple of DB machines.
[00:43:41] But the query rate from the lua module use case would be much lower than the usual load.
[00:44:11] I'm not trying to advocate this idea so much as explore it.
[00:52:38] if we can automate it without too much work, then IMHO we should do so
[00:53:02] gwicke: well, limits, caching etc is already fairly automated
[00:53:06] and code is fully puppetized
[00:53:08] quarry.wmflabs.org
[00:53:19] kk
[00:55:33] looks spiffy
[14:13:52] * Ironholds yawns
[14:13:53] morning
[15:22:56] o/ Ironholds
[15:23:05] hey halfak!
[15:23:21] I'm spending my morning writing C++. Nobody told me it would be this fun (okay, Katie did, but other than her)
[15:23:48] I just cut computation time from 40 seconds (R) to 1.6 seconds (C++) to process an 8.5m-element vector.
[15:23:59] * Ironholds claps
[15:24:06] how's your day going?
[15:25:51] Barely got out of bed yet.
[15:25:59] :) Felt good to sleep in
[15:26:55] I can understand why!
[15:32:03] say, crap.
[15:32:11] I just thought of a possible confound in our session datasets.
[15:32:15] ...whoops.
[15:35:00] the good news is, I can definitely fix the ones we're going to release!
[15:35:49] What would this confound be?
[15:36:50] * halfak fills stat3 up with nice
[15:37:02] nicest place evar
[15:38:47] hahah
[15:38:58] so, I'm not sure if it's more than a hypothetical, but
[15:39:26] we have some people proxying in, in the datasets. Easy-peasy; where data$x_forwarded_for does not equal "-", data$ip <- data$x_forwarded_for, right?
[15:39:47] except sometimes x_forwarded_for isn't "their real IP", it's "an entire chain of IPs they were proxying through"
[15:39:57] they change one proxy in the middle of a session? Bing! Different hash.
[15:40:20] I can pretty easily assess the difference, though, since I just wrote a function that very very efficiently extracts the last IP in the chain.
[15:40:34] It may not make a big difference, but it'd be nice to be sure.
[15:40:55] I wonder how often that happens
[15:41:08] Could it explain the weirdness we had in the desktop dataset?
[15:41:21] I'd actually expect to see proxying happening more for mobile (opera, for example)
[15:41:28] but I can load up the datasets and see.
[15:41:44] I've got 8.5m randomly-selected XFFs in a little RData file so I'll start there and try to see frequency
[15:42:31] Oh say, if you have time, I was hoping to talk to you about strategies for getting viewrates per state.
[15:42:38] 0.5% of the time, apparently
[15:42:46] (can wait until next week too)
[15:43:06] .5% of the time is pretty uncommon, but not as seldom as I thought.
[15:43:06] worth correcting for in the to-be-released datasets, imo, and I can check there to see how much of a difference it made with those specific datasets, but not heart-stopping.
[15:43:23] yeah, I'll see what the actual datasets look like. So glad I didn't scrub the raw files yet.
[15:43:43] and sure! I have all weekend. It's gonna be geolocation and converting as many things to C++ as is humanly possible. Go nuts.
[15:44:22] Great. So, the thing that I'm struggling with is geolocating a very large dataset.
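[editor's note: the last-IP substitution Ironholds describes could be sketched in Python roughly like this. It's a hypothetical equivalent of the function mentioned above, not the actual C++/R code; the "-" sentinel for "no XFF header" comes from the conversation itself.]

```python
def last_xff_ip(x_forwarded_for):
    """Return the final IP in an X-Forwarded-For chain, or None.

    Per the discussion above, the sampled logs use "-" when no XFF
    header was present, so there is nothing to substitute in that case.
    """
    if x_forwarded_for == "-":
        return None
    # The chain is a comma-separated list of hops; take the last entry,
    # which stays stable even if an earlier proxy changes mid-session.
    return x_forwarded_for.split(",")[-1].strip()
```

Hashing sessions on `last_xff_ip(...)` rather than the raw chain avoids the "change one proxy, get a different hash" confound described above.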
[15:44:32] I figured that streaming hadoop might work for us.
[15:44:45] But then again, I bet you have a different strategy to get at it.
[15:47:13] hmn. how large is large?
[15:47:36] Roughly all of the requests/pageviews in the last year.
[15:47:44] I could probably deal with a large sample.
[15:47:54] * Ironholds thinks
[15:48:11] yeah, we're gonna want to stream that. Do you know if the analytics machines have the MaxMind binaries on them?
[15:48:27] Negative.
[15:48:29] I'm pretty sure they must, since we have the experimental country-level geolocation UDF, buuut.
[15:48:30] hmn.
[15:48:31] Never used it myself.
[15:48:47] so, what I'd recommend, for the most efficient way of doing it, would be...
[15:49:21] wait, shit, the last YEAR.
[15:49:29] So we're talking the sampled logs, which aren't in Hadoop.
[15:50:41] honestly over that range of data, probably the only way to do it is to crack open each day's logs, stream them through extracting "pageviews", geolocate, and save to file whichever ones match a US state
[15:51:27] Yeah. I'm thinking streaming is the only way to do non-sampled and expect to be done ever.
[15:51:33] I can point you towards (1) the current R code implementing the newest PV def, and (2) the pygeoip library and where to go for the binaries, if those'd help? I haven't tested state-level extraction myself (I know you can get out "region", but does "region" mean "state in the U.S."?)
[15:52:01] that is, hadoop streaming or streaming in the sense of "how python operates" - i.e., not relying on big flat files in their entirety, R-style?
[15:52:13] we need more words. Get on this problem, Merriam-Webster.
[15:52:19] hadoop streaming.
[15:52:23] aha
[15:52:32] But also python's awesome streaming.
[15:52:34] then, that is your bailiwick :D. I do not know much about it or particularly understand it.
[15:52:35] :)
[15:52:39] (re hadoop streaming)
[15:53:10] So, now my question is, "How do I stream a hive query to a map-reduce job?"
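[editor's note: the crack-open / stream / geolocate / keep-US-state pipeline described above might look roughly like this in Python. `geolocate()` is a stub standing in for a MaxMind lookup (e.g. pygeoip against a region database); the IP field position and the (country, region) return shape are assumptions for illustration, not the real sampled-log layout.]

```python
def geolocate(ip):
    """Stub MaxMind lookup: returns (country_code, region_code).

    In practice this would call into pygeoip or similar; the table
    below is example data only, used to make the sketch runnable.
    """
    lookup = {"192.0.2.1": ("US", "MN")}
    return lookup.get(ip, (None, None))


def us_state_requests(lines, ip_field=4):
    """Yield (state, line) for lines whose client IP is in a US state.

    ip_field is an assumed column position in the unquoted TSV logs.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= ip_field:
            continue  # short/broken line; skip rather than crash
        country, region = geolocate(fields[ip_field])
        if country == "US" and region:
            yield region, line
```

Because it consumes any iterable of lines, the same function works over `sys.stdin` in a shell pipeline or as the mapper in a Hadoop streaming job.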
[15:53:17] Or even just the whole hive table.
[15:53:46] "ask Otto" I think :(. Hive, Hive I know. MapReduce is...unknown territory to me, in a lot of areas.
[15:53:57] Yeah. Typing up an email.
[15:54:16] Oh yeah. I'd want the sampled logs too because I want data for the last year. Is that right?
[15:54:21] yup
[15:54:29] so you'd have to unbundle those and fire em up into the cluster.
[15:54:39] that'll give you May 2013-the present
[15:54:57] although there are a few gotchas (like: the timestamp format changes in November 2013, because screw everyone)
[15:55:16] Do you know why we only have them going back to May 2013?
[15:55:37] Oh no. Are they not parsed logs -- like the JSON in hive?
[15:55:43] ....bahahaha
[15:55:44] oh no
[15:55:46] oh nonono
[15:56:04] these are unquoted, headerless TSVs with less successful escaping than Alcatraz.
[15:56:23] ^ lol
[15:56:25] in tar.gzs, for storage space.
[15:56:30] they are heinous.
[15:56:44] tar?
[15:56:46] TAR
[15:56:52] WHY!?
[15:56:52] protip: check if you can validly parse the timestamp or not. That's a good way of seeing if the log line is fucked, since only one field should look like a timestamp (and it's the first one)
[15:56:59] actually, just straight .gz, looks like
[15:57:04] my bad
[15:57:12] * halfak calms down
[15:57:38] OK. This is reaching into the realm of too much trouble.
[15:57:50] I've found the trick is, in sequence: read it in ignoring quotes, accept there is no header, see if you can validly translate the timestamps using https://github.com/Ironholds/WMUtils/blob/master/R/log_strptime.R
[15:58:02] if you can, the line is fine. Apply the PV def (https://github.com/Ironholds/WMUtils/blob/master/R/log_sieve.R)
[15:58:19] if the timestamp cannot be validly converted, it probably means some asshat stuck a newline in his user agent and everything is ruined forever.
[15:58:26] so ignore that line and move to the next
[15:59:20] as a general thing, though, we should really have a python "handle the sampled logs" thing. If only because it's probably faster than the R equivalent. I keep meaning to do that and then forgetting.
[16:00:52] Indeed. Just a simple little program to read lognonsense from standard in and produce logawesomeness for standard out.
[16:01:59] * halfak wonders if we could try repairing lines too
[16:02:25] probably! it depends where they went goopy
[16:02:32] Is sampled 1 in 1000?
[16:02:33] I think some of it may be udp2log stupid, not storage format stupid
[16:02:38] yep
[16:02:44] kk
[16:02:51] Thanks dude. :)
[16:02:53] np!
[16:03:09] and now I write C++ and check the session datasets :). Wanna find some time for us to hack on the blog post and documentation this weekend?
[16:03:23] yes.
[16:03:33] yay!
[16:03:40] Oh say. I need to ask the movielens and cyclopath people about releasing data.
[16:04:20] * halfak will send email
[16:04:25] and CC Ironholds
[16:07:07] cool!
[16:07:36] Did we decide to go with
[16:07:44] or
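[editor's note: the sequence described above (read ignoring quotes, accept no header, validate the timestamp, skip bad lines) could be sketched as the Python "handle the sampled logs" filter the participants wish existed. The timestamp's position (first field, per the chat) and the two format strings are assumptions; the November 2013 format change mentioned above is why more than one format is tried.]

```python
from datetime import datetime

# Assumed formats: one pre- and one post-November-2013 variant.
TS_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S")
TS_FIELD = 0  # assumed: the timestamp is the first field


def timestamp_ok(fields):
    """True if the assumed timestamp field parses in any known format."""
    try:
        raw = fields[TS_FIELD]
    except IndexError:
        return False
    for fmt in TS_FORMATS:
        try:
            datetime.strptime(raw, fmt)
            return True
        except ValueError:
            pass
    return False


def clean_lines(lines):
    """Yield only lines whose timestamp validates; skip mangled ones.

    A line whose timestamp won't parse is probably the tail of a
    user agent containing a raw newline, so we drop it and move on.
    """
    for line in lines:
        if timestamp_ok(line.rstrip("\n").split("\t")):
            yield line
```

Wired to a shell pipeline this would be roughly `zcat day.log.gz | python clean_logs.py > clean.tsv`, with the script writing `clean_lines(sys.stdin)` to standard out (filenames hypothetical).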