[04:23:42] hey ori-l; possibly a more in-person question, but what I was wondering about is how to hook into what you're using -- fundraising is going to move a portion of CentralNotice to varnish -- and we need to log from this new plugin
[04:26:32] well, using our setup is contingent on getting an OK from analytics and ops, so let me answer in two parts: what you'd need to do to log client-side state using varnish, and what you'd need to do to use our data pipeline.
[04:27:43] ok -- terry made it sound like we could pipe the output from varnish pretty much anywheres -- so in theory we wouldn't be touching the analytics backend -- just using your frontend in varnish
[04:28:10] and ops has given tentative approval for this module -- we still have to prove it works
[04:28:12] ah, i see. yeah. so there's a url pattern declared in varnish's config, /event.gif
[04:28:38] requests hitting that URL (followed by an arbitrary query string payload) get logged
[04:29:06] and varnish responds with an HTTP 204 ("No Content") -- an empty successful response
[04:29:41] now, when i say that varnish logs the request, i don't mean that varnish does any kind of special processing. varnish logs *all* requests, and this is simply not an exception
[04:30:28] the only thing that is unique about it is that varnish can generate an empty response and move on without looking it up in the cache or trying to resolve it against one of the back-ends
[04:30:48] hurm; do you happen to know where in the request process things get logged?
[04:31:05] we're looking at a URL mangling scheme, and ideally I'd love to log after the mangle
[04:31:07] varnish's approach to logging is unusual and awesome
[04:31:15] oh, before the mangle
[04:32:29] there may be a log entry for the processed URI, let me check
[04:33:54] yes: there's RxURL, TxURL, and ObjURL
[04:34:36] ah hah! can you point me to where you found that?
[04:34:56] https://www.varnish-cache.org/docs/2.1/reference/varnishlog.html#tags
[04:35:06] the documentation is sparse on this point
[04:35:20] TxURL is the URL requested from the back end
[04:35:25] i don't remember what ObjURL is -- i had to look at the source. sec.
[04:36:37] ok, so the source in varnish is the best place to start
[04:41:36] yeah
[04:42:03] well, ok: RxURL is the raw requested URL, TxURL is the URL requested from the back-end, if any, and ObjURL is the URL looked up in the cache
[04:43:01] if some of your requests will be served from the cache, you want ObjURL, because there won't be a TxURL for those
[04:43:30] ^ mwalker
[04:44:00] yep yep; we will have things being served from cache -- so I'll start there; looks like there's a method to retrieve custom headers from the obj as well, which is something else that'll come in hugely handy
[04:44:27] -- we want to serve metadata from the backend request and have that logged :)
[04:44:31] because... we're crazy :D
[04:44:55] that's not crazy at all
[04:46:29] because this is very sparsely documented, what you probably want to do is produce a minimal example that takes a URL, transforms it, looks it up in the cache, transforms it again, and requests it from a back-end, and then watch varnishlog and confirm
[04:47:03] absolutely
[04:48:00] as to the test setup, did you deploy the WMF version of varnish, or one out of another repo? -- specifically, if you deployed from WMF, anything special about it?
[04:50:22] log2udp won't batch input. each log line is emitted as a udp packet, which is massively inefficient. as a result, ops decided to patch varnishncsa to batch and emit UDP directly
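(The batching idea just described is the point of the varnishncsa patch. The real patch is C code inside varnishncsa; the sketch below, with a made-up destination host, port, and size limits, is only meant to illustrate packing several log lines into each UDP datagram instead of sending one packet per line.)

    import socket

    # Hypothetical destination; in the setup described above the logs go to
    # vanadium (or to oxygen, which relays them).
    DEST = ('collector.example.org', 8420)
    BATCH_LINES = 50    # max log lines per datagram
    MAX_BYTES = 1400    # stay under a typical MTU so datagrams aren't fragmented

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send_batched(lines):
        """Send many log lines in a few UDP datagrams rather than one packet per line."""
        buf, size = [], 0
        for line in lines:
            encoded = line.encode('utf-8')
            # flush the current batch before it gets too long or too large
            if buf and (len(buf) >= BATCH_LINES or size + len(encoded) + 1 > MAX_BYTES):
                sock.sendto(b'\n'.join(buf) + b'\n', DEST)
                buf, size = [], 0
            buf.append(encoded)
            size += len(encoded) + 1
        if buf:
            sock.sendto(b'\n'.join(buf) + b'\n', DEST)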
[04:51:01] arhar -- hence the requirement for a varnish config change
[04:51:26] that's not the reason for the VCL change, no
[04:51:49] we wanted to be able to log arbitrary client-side events, which don't always correspond to discrete requests
[04:51:58] so we needed an endpoint we could push arbitrary data to
[04:52:49] the clicktracking extension exposed such an endpoint via the mediawiki api, but that meant every request bypassed the cache and hit the mediawikis directly, which could take down the cluster (don't ask me how i know that.)
[04:53:20] that's why we configured a custom URL that could be handled within varnish and produces empty responses
[04:53:49] makes sense
[04:54:17] but I've not yet found the varnish config in the operations repo, so I'm still hazy on what the VCL is doing
[04:54:24] * mwalker continues looking
[04:54:28] i think katie mentioned that you guys needed to track things like how often people expand a collapsed banner
[04:54:41] it's something we're going to be looking into
[04:54:43] a special endpoint is useful for things like that (unless expanded the banner already results in an ajax request)
[04:54:48] *expanding
[04:55:01] nope; we load the whole thing at once
[04:55:22] we're still scheming about how to abuse what you already have for that though
[04:55:22] i documented all this here, btw: http://wikitech.wikimedia.org/view/Event_logging
[04:55:42] ah; fantastic
[04:55:49] the zeromq stuff got axed, sadly, but the rest is relevant
[04:58:53] ok; I get it now; you have VCL in place to serve the 'image'; and then you collect logs via varnishncsa (with filters), which get placed directly onto the network
[04:59:08] yeah
[04:59:29] is the UDP multicast, or is it directed somehow?
[05:00:37] directed; multicast support was patched into udplog but it isn't being used because it was problematic. i've pressed for specifics but nobody can recall details. the eqiad bits direct the udp at vanadium, whereas the esams bits send it to oxygen, which has been configured to route it to vanadium
[05:00:55] the reason for the oxygen kludge is that vanadium has no public interface and isn't otherwise reachable from esams
[05:01:25] kk; do you happen to know the bandwidth of that pipe?
[05:01:56] i think leslie gave me some useful details, sec
[05:02:02] let me dig thru my logs
[05:03:05] while i do that: if you're considering an alternative transport, you might as well know why zmq pub/sub got axed: it requires that vanadium connect to the bits hosts, and the bits hosts aren't running iptables and have no private interface, so any port you open on them is open to the world
[05:04:17] aaah; interesting
[05:09:17] i can't find the precise specs from leslie, but she said:
[05:09:26] "We have a lot of uplink capacity on the bits caches, I am unconcerned about our eqiad network - it is very robust and very easily upgradeable."
[05:10:02] the figure cited in that thread was "<100 million/day"
[05:10:59] kk; but we have 400 million page views a day; do you downsample somehow at the varnish end? or is that something fundraising is going to have to implement?
[05:11:51] the firehose udp request log is sampled at 1/1000, but it's important to realize what the bottleneck is
[05:12:07] the network link isn't saturated; it's that emery / oxygen / locke can't write it to disk fast enough
[05:12:48] that's the firehose udp request log from the squids, which log all requests via udp to those three machines. the bits machines don't send their logs atm.
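(The 1/1000 sampling mentioned above is applied on the receiving udp2log hosts. The sketch below is not the actual udp2log filter configuration -- just a stdlib-only illustration of keeping every thousandth line out of batched datagrams; the listening port and output file name are made up.)

    import socket

    SAMPLE_RATE = 1000                  # keep one line in every thousand
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('0.0.0.0', 8420))        # hypothetical listening port

    seen = 0
    with open('sampled.log', 'a') as out:
        while True:
            datagram, _ = sock.recvfrom(65535)
            # each datagram may carry several batched log lines
            for line in datagram.decode('utf-8', 'replace').splitlines():
                seen += 1
                if seen % SAMPLE_RATE == 0:
                    out.write(line + '\n')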
[05:12:48] aha; I get it; that makes perfect sense. I was under the impression that varnish's logging was independent of the method we already use to get squid logs
[05:13:35] it depends what you mean by "independent": it's on a different port and directed at a different machine
[05:13:45] but the transport is the same, yeah
[05:14:59] also, because it's new, it doesn't have any clients other than us at the moment; the same isn't true with the squid logs, which are used to generate all manner of stats
[05:15:19] true; I'm just trying to avoid killing your box
[05:15:50] so; if I can set up a transport stream that goes to emery/oxygen/locke just like squid, then it fits into the pre-existing workflow
[05:16:26] but; that should just be a new instance of varnishncsa running on the varnish boxes, just pointed somewhere else?
[05:16:40] if it goes to emery/oxygen/locke it'll get sampled. asher also explicitly wanted to avoid "mixing the streams" (heh)
[05:17:01] we're totally fine with it getting sampled -- no way in heck we can deal with processing 400 million hits
[05:17:24] but; I can work something out with asher then, if he has qualms about mixing the methods
[05:17:54] what do you want to log?
[05:18:09] (if you don't mind me asking)
[05:18:16] what this module I'm writing does is serve out banners -- so we're logging banner impressions
[05:18:45] a banner in this system will effectively get served to everyone for every non-special page they view
[05:19:29] and specifically the data we're logging is (country, language, project, logged in, banner served, campaign served)
[05:19:52] well, varnishncsa would be sieving through the firehose varnish log segment for the ObjURL entries matching some banner regexp, right? so you don't need to filter requests on the receiver, just log them; and you don't need to log *each* request, just increment a counter
[05:20:08] and fsync periodically
[05:20:21] so you could probably handle 400 million hits, it's not crazy
[05:20:38] or increment multiple counters
[05:21:26] that's lots of counters :) 250,000 or thereabouts
[05:21:46] wait, 2,500
[05:22:29] that's languages + countries + projects + campaigns + banners + 2 (auth) ?
[05:22:53] ~1000 language/project * 250 countries * 2 logged in * n campaigns * m banners
[05:23:19] where n and m are < 10 in the usage we've seen
[05:23:33] you possibly don't need a separate counter for every possibly permutation if you can track them separately
[05:23:41] every possible
[05:23:46] but i'm just being lurid because i like these types of problems. i'm sure you understand the requirements :)
[05:24:46] it's useful for us to have the raw counts -- we like to poke and prod it as much as possible later on -- it all goes into an aggregated (by count / 5 minute) sql table
[05:25:09] so we do count it; but right now it's offline
[05:25:42] give me a sec to try something out
[05:52:45] [~] $ time python counters.py 1000000
[05:52:45] real 2m8.973s
[05:52:45] user 0m55.207s
[05:52:47] sys 0m20.177s
[05:53:27] where counters.py is this: http://dpaste.de/y291X/
[05:53:43] * mwalker looks
[05:55:40] having a primary keyspace and using hashes is a hack i got from here: http://instagram-engineering.tumblr.com/post/12202313862/storing-hundreds-of-millions-of-simple-key-value-pairs
[05:55:56] so... it keeps up -- 400M impressions comes to 4.6k / sec; yours comes out at 5.1k/sec
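(The dpaste link above has since expired, so the exact counters.py can't be recovered. The sketch below is only a guess at its general shape: it assumes a local Redis instance, the redis-py client, and the hash-bucket trick from the Instagram post, i.e. grouping many small counters into a relatively small number of Redis hashes; the counter key format is invented for illustration. It would be invoked the same way as in the transcript, e.g. python counters.py 1000000.)

    import sys
    import time
    import zlib

    import redis    # assumes a local Redis server and the redis-py client

    def bucket(key):
        # Group counters into ~1000 Redis hashes so millions of tiny keys
        # don't each pay per-key overhead (the Instagram hack).
        return 'counters:%d' % (zlib.crc32(key.encode('utf-8')) % 1000)

    def main(n):
        r = redis.Redis()
        pipe = r.pipeline(transaction=False)
        start = time.time()
        for i in range(n):
            # invented country:language:project:auth:banner:campaign key
            key = 'US:en:wikipedia:0:B%d:C%d' % (i % 50, i % 5)
            pipe.hincrby(bucket(key), key, 1)
            if i % 1000 == 0:
                pipe.execute()          # flush a batch of increments
        pipe.execute()
        elapsed = time.time() - start
        print('%d increments in %.1fs (%.0f/sec)' % (n, elapsed, n / elapsed))

    if __name__ == '__main__':
        main(int(sys.argv[1]))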
[05:56:30] that's on my laptop, not vanadium
[05:56:51] but traffic isn't flat
[05:57:05] true
[05:57:27] anyway, the connection between half-assed benchmarks like these and the performance characteristics of production systems is always tenuous, not saying it's trivial, just a fun experiment
[05:58:34] if incoming data has good distribution and you are looking at it in aggregate, sampling solves your problem
[05:59:18] now i'll mind my own business and stop poking in other people's stacks, it's a bad habit but a well-intentioned one :)
[05:59:46] yep; no, it's good stuff and appreciated: you've been here and done this -- this is new to us
[06:01:03] you might like this: http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
[06:01:21] esp. slides 32-41
[06:01:36] off to walk the dog, bbl
[10:07:26] * dschoon also thinks the rainbird presentation was great.
[11:45:21] good morning whoever's here. Goodnight to dschoon if he's actually here. It's 4am, you're not allowed to be up unless you're 80.
[12:51:32] hey drdee
[12:53:42] hey
[12:54:53] average_drifter ^^
[12:57:40] drdee: ok.. this review is not going to be very easy
[12:57:53] keep your changes small :)
[12:58:01] drdee: I know I know, but there are a lot of changes... due to recovering from Erik
[12:58:13] 1 asana task -> 1 commit
[12:58:13] from Erik's change
[12:58:28] still don't understand what caused the merge conflict
[12:58:47] many causes, let me list some:
[12:58:50] 1) paths changed
[12:58:55] unless erik used a different version than the one in subversion
[12:58:57] 2) unix/m$ line endings
[12:59:10] 2)?
[12:59:10] git should take care of that
[12:59:30] 3) some kind of thing happened that caused the diff tools to report that the entire contents of files had changed although they were visibly just small changes..
[12:59:39] k
[12:59:47] do you have the review ready?
[12:59:53] drdee: may I push to the "development" branch?
[12:59:53] sure
[13:00:02] drdee: git review or push?
[13:00:15] git review
[13:00:24] ok
[13:08:38] drdee: https://gerrit.wikimedia.org/r/28506
[13:09:01] drdee: ok please don't panic, I know it doesn't look very good, but I'm here and we'll sort it out
[13:10:07] I don't know what happened to the diff tool.. it totally didn't pick up changes
[13:10:17] I mean regular diff.. :
[13:10:19] :(
[13:16:55] so basically the ipad_extract_data parses iPad user agents and it's supposed to replace parts of the code in SquidCountArchiveProcessLogRecord.pm
[13:27:04] * drdee reading now
[13:27:16] and i rarely panic ;)
[13:28:45] yeah basically there is no point in reviewing this
[13:29:26] ok, I should redo it, just picking parts of my commits
[13:29:30] so the diffs don't get messed up
[13:30:00] SquidCountryScan.pl still has a merge conflict
[13:30:11] leave it as-is
[13:30:24] but fix SquidCountryScan.pl
[13:30:44] and then just push, hopefully the new changes will be normal
[13:31:25] i like it that you break up the browser detection into separate packages
[13:31:42] do you need to commit file.tmp ?
[13:31:48] no, I can delete that one
[13:31:53] I'm not even sure if it was mine
[13:32:00] don't think it was mine
[13:32:07] and maybe don't commit full_run_spetrea either
[13:32:36] ok
[13:32:50] is folder 't' for tests?
[13:33:35] morning ottomata
[13:35:31] drdee: t/ is for tests, yes. it is the canonical name for test directories in Perl
[13:35:44] morning
[13:35:53] oh okay didn't know that
[13:36:24] so average_drifter, fix those minor things and also ask erik to look at it. i will add him as reviewer
[13:38:57] and also don't commit SquidReportArchiveConfig-spetrea.pm and SquidCountArchiveConfig-spetrea.pm
[13:39:10] basically never commit local dev files
[13:39:13] yes
[13:59:34] ok so it seems that if you get ^M line endings (which are the ones from windows) that confuses diff a lot and makes it think files are completely different (which accounts for all the red) although they're not
[14:01:27] mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
[14:02:12] when i helped erik configure git there were some options related to this
[14:02:22] wondering if they were set incorrectly
[14:02:36] yes, usually this is done through a pre-commit hook.. something like that
[14:02:51] I've seen it done in a few places before
[14:03:05] i don't think you need a commit hook
[14:03:11] somewhere in git config on erik's machine is sufficient
[14:03:50] git config --global core.autocrlf input ?
[14:04:05] ( took that from here https://help.github.com/articles/dealing-with-line-endings )
[14:04:14] reading that as well :D
[14:04:31] http://stackoverflow.com/questions/170961/whats-the-best-crlf-handling-strategy-with-git
[14:23:37] average_drifter, can you also fix the first line of your commit summary?
[14:23:42] it's not very descriptive ATM
[14:28:05] yes
[14:30:00] average_drifter: check http://stackoverflow.com/questions/170961/whats-the-best-crlf-handling-strategy-with-git
[14:30:19] maybe we should add a .gitattributes to the repo
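(As an aside on the ^M problem discussed above: before, or alongside, a .gitattributes rule or core.autocrlf, it can help to see which tracked source files actually contain CRLF endings. A minimal stdlib-only sketch, run from the repo root; the file extensions are just an example based on the wikistats Perl sources.)

    import os

    EXTENSIONS = ('.pm', '.pl')         # example: the wikistats Perl sources

    for root, dirs, files in os.walk('.'):
        if '.git' in dirs:
            dirs.remove('.git')         # skip git's own object store
        for name in files:
            if name.endswith(EXTENSIONS):
                path = os.path.join(root, name)
                with open(path, 'rb') as f:
                    if b'\r\n' in f.read():
                        print(path)     # file contains Windows (CRLF) line endings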
[15:32:06] new review
[15:33:47] :)
[15:33:54] so what about using .gitattributes
[15:34:04] to fix the line-ending issue, because this is ridiculous
[15:37:41] yes, coming right up
[16:42:03] morning
[16:45:21] morning
[16:45:22] i pushed everything but zoom
[16:45:40] (I'm moving that to the d3.behavior way)
[16:46:07] but in the meantime, you wanna show me what we have to do to get this into /graphs/:graphid ?
[16:46:18] dschoon ^
[16:46:23] howdy
[16:46:32] well, I'll be at the dentist after the standup
[16:46:32] sure
[16:46:36] no standup today
[16:46:39] we have the platform meeting at 10
[16:47:15] oh, well I'll be at the dentist from 10:45 your time to like 12:00
[16:47:15] so maybe after your lunch
[16:47:44] wait, is the platform meeting important? I can cancel the dentist
[16:50:52] no no, it's fine.
[16:51:10] we can relay anything you miss.
[16:51:15] it's mostly a sync-up
[17:04:45] did I miss the hangout link?
[17:05:34] no, we're all in the platform meeting
[17:05:41] no scrummy today
[17:05:41] i seeeee
[17:05:48] cool
[18:36:51] getting coffee brb
[19:25:02] back
[20:21:37] brb lunch
[20:39:18] dschoon I'm going for an early dinner but I'm looking at the code now, I'll ping when I'm back
[20:43:18] or not looking at the code 'cause github is dead :)
[20:51:09] drdee whatcha doin?
[20:51:11] QueryResult.jar?
[20:51:23] sqoop :)
[20:51:28] cool!
[20:51:41] btw, i puppetized mysql, which changed the datadir
[20:51:45] not sure if you used it for anything
[20:51:51] i'm puppetizing oozie now
[20:52:00] cooliu
[20:55:25] the oozie stuff has a lot of extra steps......
[20:55:38] Download the ExtJS version 2.2 library from http://extjs.com/deploy/ext-2.2.zip and place it in a convenient location
[20:55:53] Copy or symlink the Oracle JDBC driver JAR in the /var/lib/oozie/ directory
[20:55:53] You must manually download the Oracle JDBC driver JAR file
[20:56:09] ahhh wait, that is just for oracle
[20:56:10] phew
[20:56:21] i put all those things in the asana task
[20:56:21] ah no, i have to do it for mysql too
[20:56:23] yes you have to
[20:56:27] libmysql-java?
[20:56:29] is that that?
[20:56:35] https://app.asana.com/0/828917834272/2189400798699
[20:57:11] milimetric: wait, how are you doing that?
[20:57:14] looking at it
[20:57:15] it *just now* pushed
[20:57:29] ah, wait. i didn't read far enough :)
[20:57:38] yes, libmysql-java is also required but that's just a deb
[20:57:46] the ext-2.2 library is required to make the web console work
[20:58:11] anyway, milimetric, it's up there now.
[20:58:16] lmk when you'd like to chat
[20:58:24] i think i'm going to migrate upstairs.
[21:03:25] lol drdee
[21:03:35] what?
[21:03:36] i love how you added a section "Big Unknown Question"
[21:03:40] which contains:
[21:03:46] :D
[21:03:46] 1)
[21:03:46] 2)
[21:03:46] 3)
[21:03:54] they are hidden
[21:04:00] which makes it, literally, a question that is unknown :P
[21:04:15] brb heading upstairs
[21:11:05] back
[21:23:15] ottomata, you got the hadoop consumer working, right?
[21:24:29] where did the data end up?
[21:24:40] i do not think it's http://analytics1001.wikimedia.org:8888/filebrowser/view/user/otto/kafka/request_logs2/part-1349974547743_0144-m-00001?file_filter=any
[21:24:40] it works, yeah, but I didn't leave it running
[21:24:40] i only tested it
[21:24:43] yes
[21:24:49] just curious to see what the output looked like
[21:24:49] that is from a 10-line-per-broker input
[21:24:54] so 10 lines per file
[21:25:23] oh! so that *is* it
[21:25:32] you just sent in arbitrary logs, not the results of the dummy prototype
[21:25:33] yeah, appended the kafka seq number it looks like
[21:25:38] interesting.
[21:25:39] ohhh yeah, byte offset
[21:25:40] okay, cool.
[21:25:42] in the topic
[21:25:43] is prepended
[21:25:45] to the line
[21:25:45] interesting.
[21:25:46] that's it
[21:25:52] always?
[21:25:57] well, at least in the thing I tested yesterday
[21:25:59] does it need that or something?
[21:26:06] weird
[21:26:06] iunno
[21:26:07] maybe the consumer is adding it?
[21:26:22] probably, i don't think that shows up on the cli consumer
[21:26:25] k
[21:27:11] yeah, so I passed --limit 10 to my java hadoop consumer
[21:27:21] that spawned a mapper for each broker
[21:27:35] and each mapper consumed 10 lines
[21:27:56] how did it find out about the brokers? zk?
[21:28:25] yup
[21:28:40] awesome.
[22:11:20] uhhnngggg oooooziiieeeeee
[22:11:21] ok gotta go byyyeeeee
[22:12:20] kk
[22:12:22] taa
[22:13:39] laterz
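(The consumer discussed above is a Java Hadoop job that discovers brokers via ZooKeeper; the sketch below is not that code. It only illustrates the observed output format -- the message's offset in the topic prepended, tab-separated, to each log line -- and the --limit idea, using the kafka-python client with made-up topic, broker, and file names.)

    from kafka import KafkaConsumer    # assumes the kafka-python client

    LIMIT = 10    # mirror the --limit 10 test run: stop after a handful of messages

    consumer = KafkaConsumer(
        'request_logs',                          # hypothetical topic
        bootstrap_servers=['localhost:9092'],    # hypothetical broker
        auto_offset_reset='earliest',
        consumer_timeout_ms=5000,                # stop iterating if no more messages
    )

    with open('part-00000', 'w') as out:
        for i, msg in enumerate(consumer):
            if i >= LIMIT:
                break
            # offset in the topic, prepended (tab-separated) to the raw log line
            out.write('%d\t%s\n' % (msg.offset, msg.value.decode('utf-8', 'replace')))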