[02:48:27] Analytics-Kanban, Analytics-Wikimetrics: Utf-8 names on json reports appear as unicode code points: "\u0623\u0645\u064a\u0646" - https://phabricator.wikimedia.org/T93023#1127355 (kevinator) Ah, I see the bug now. I had to download the file and open it in a text editor. My JSON viewer plugin in chrome dis... [14:02:03] ottomata: standup [14:02:08] sry :) [14:04:26] (CR) Milimetric: [C: 2] Add config to run funnel_failure_rates_by_type [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/197318 (https://phabricator.wikimedia.org/T89251) (owner: Mforns) [14:04:33] (CR) Milimetric: [V: 2] Add config to run funnel_failure_rates_by_type [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/197318 (https://phabricator.wikimedia.org/T89251) (owner: Mforns) [14:10:08] nuria: https://github.com/wikimedia/operations-debs-logster/blob/master/logster/parsers/SampleLogster.py [14:10:13] SampleLogster is a bad name, i guess [14:11:43] ottomata: got that, command runs fine like: /usr/bin/logster -o statsd --statsd-host=labmon1001.eqiad.wmnet:8125 --metric-prefix='wikimetrics' LineCountLogster /var/log/apache2/access.wikimetrics.log [14:11:52] ottomata: on wikimetrics staging [14:12:01] ottomata: but where are metrics going? [14:12:06] aye, just wondering if you'd want to report based on status code too, e.g. you probably only want 200s? [14:12:28] nuria: if you run that with -o stdout too, do you see the metrics? [14:13:19] ottomata: yes , [14:13:32] https://www.irccloud.com/pastebin/Wvr0STPM [14:14:20] ok, i guess check to see that the packets are going out then, or run witih the --debug flag and see what happens [14:14:22] you could do [14:14:29] sudo tcpdump port 8125 [14:29:49] Analytics-Dashiki, Analytics-Kanban, Patch-For-Review: Pageviews not loading in Vital Signs - https://phabricator.wikimedia.org/T90742#1128630 (mforns) To update private repo: sudo GIT_SSH=/var/lib/git/ssh git pull --rebase [14:31:45] Analytics, MediaWiki-extensions-ConfirmEdit-(CAPTCHA-extension): Provide a log of actions which trigger the CAPTCHA - https://phabricator.wikimedia.org/T43522#1128634 (He7d3r) [14:45:02] ottomata: ok, see stuff (with delay) on tcpdump... now, how do isee those metrics on graphite? [14:45:03] https://graphite.wmflabs.org/ [14:45:31] ottomata: do you know? cause they do not *seem* to appear [14:50:02] Analytics-Cluster, Analytics-Kanban, Performance: Cluster report that looks at x-Analytics header and extracts the date to calculate uniques. - https://phabricator.wikimedia.org/T92977#1128699 (kevinator) Is this task the same as {T88814}? [14:50:33] nuria: i dunno how it works in labs [14:50:48] but, afaik, statsd collects and aggregates stats over a minute and sends things to graphite [14:50:55] so, i guess, check if they are in statsd? [14:51:05] or, you could try sending them to graphite directly with the -o graphite flag [14:51:07] instead of statsd? [14:54:23] Analytics-Cluster, Analytics-Kanban, Performance: Cluster report that looks at x-Analytics header and extracts the date to calculate uniques. - https://phabricator.wikimedia.org/T92977#1128715 (Nuria) Let's see: Task T888814 has two parts: Part #1 VCL changes, code & deploy (https://phabricator.wikime... [14:55:47] ottomata: ok let me ask yuvi cause I assumed statsd is sending every one metric to graphite [14:56:05] Yuvi|reallyFood: lemme know when you rae back ... 
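A quick aside on the debugging above: what should show up in that tcpdump capture is the plain-text statsd line protocol, one metric per UDP datagram, in the form "name:value|type". Below is a hypothetical Java smoke test (not part of logster) that fires a single hand-rolled counter at the statsd host mentioned above so the network path can be confirmed independently of logster; the metric name is made up.

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;

    // Hypothetical smoke test: send one statsd counter by hand and watch it
    // leave the box with "sudo tcpdump -A port 8125".
    public class StatsdSmokeTest {
        public static void main(String[] args) throws Exception {
            // Made-up metric name; logster builds its own names from --metric-prefix.
            String payload = "wikimetrics.access_log.line_count:1|c";
            byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
            InetAddress statsd = InetAddress.getByName("labmon1001.eqiad.wmnet");
            DatagramSocket socket = new DatagramSocket();
            try {
                socket.send(new DatagramPacket(bytes, bytes.length, statsd, 8125));
                System.out.println("sent: " + payload);
            } finally {
                socket.close();
            }
        }
    }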
[14:57:35] i don't have access to the statsd or graphite instances in labs :/ [14:57:38] not sure how to get to them [14:57:44] so i'm not sure how to debug further [14:59:03] Analytics-Cluster, Analytics-Kanban, Performance: Cluster report that looks at x-Analytics header and extracts the date to calculate uniques. - https://phabricator.wikimedia.org/T92977#1128737 (kevinator) This will change a little bit once the UA map is in the refined tables. For now, you can use the... [15:03:10] ottomata: no worries , i will ask YuviPanda [15:05:05] (PS6) Milimetric: Add Sunburst Visualizer [analytics/dashiki] - https://gerrit.wikimedia.org/r/197234 [15:05:08] (PS2) Milimetric: [Review but DO NOT MERGE] Begin funnel layout [analytics/dashiki] - https://gerrit.wikimedia.org/r/196489 [15:05:09] (PS1) Milimetric: Add rickshaw timeseries graph [analytics/dashiki] - https://gerrit.wikimedia.org/r/197590 [15:07:09] qchris: are we sure hdfs is the proper user to use for guard? [15:10:57] ottomata: It's the only user that we know that exists at that point. [15:11:04] Which one to use instead? [15:11:07] milimetric, the merge you did, did not work, because the gerrit changeset had a "false" non-merged dependency [15:11:21] checking [15:11:23] milimetric, I'm going to remove it and I ping you [15:11:44] I first had the stats user, but that does not ... oh ... now with the added dependency on statistics ... it does exist. [15:11:49] Would you prefer the stats user? [15:11:55] (PS3) Milimetric: Query and visualization for failure vs user analysis [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/195436 (https://phabricator.wikimedia.org/T91123) (owner: Mforns) [15:12:08] mforns: no don't [15:12:10] i'll just rebase [15:12:14] But the "refinery::" role doing things as "stats" ... that looks wrong too. [15:12:20] but mforns is the stacked bar chart thing all set? [15:12:21] milimetric, ok [15:12:28] milimetric, no [15:12:39] it is still blocked by the wrong data [15:12:40] right? [15:12:55] refinery doing things as hdfs also seems wrong [15:12:56] mforns: that's fine, let's merge it. Ultimately all the scripts are bad, because all the data is bad [15:12:57] milimetric, so, that's why I said that I would remove the dependency, [15:13:04] milimetric, ok [15:13:09] k, i'll do that [15:13:11] hm [15:13:11] but [15:13:12] hm [15:13:20] so, role::analytics::refinery [15:13:23] milimetric, thx [15:13:24] does have an explicit dependency on client [15:13:25] Class['role::analytics::hadoop::client'] -> Class['role::analytics::refinery'] [15:13:32] hadoop client [15:13:42] (CR) Milimetric: [C: 2 V: 2] "Merging even though we know the script is wrong. At this point, due to the problems we found with the data, all the scripts need re-worki" [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/195436 (https://phabricator.wikimedia.org/T91123) (owner: Mforns) [15:13:48] meh whatever, hdfs is fine [15:13:50] (PS3) Milimetric: Add config to run funnel_failure_rates_by_type [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/197318 (https://phabricator.wikimedia.org/T89251) (owner: Mforns) [15:13:55] (CR) Milimetric: [V: 2] Add config to run funnel_failure_rates_by_type [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/197318 (https://phabricator.wikimedia.org/T89251) (owner: Mforns) [15:14:14] I was hesitant with 'hdfs' too, so I guess we both do not like it :-/ [15:14:20] But other things run as hdfs too. 
[15:14:37] well, the other things that run as hdfs do stuff with hadoop [15:14:43] like, camus, dropping partitions, etc. [15:14:46] milimetric, in an hour I'll have a look at stat1003 generated data, to see if reportupdater worked as expected [15:14:49] so, they work with files in hdfs [15:15:05] mforns: thanks [15:15:15] refinery-source and refinery guard don't really need hdfs at all [15:15:27] True. [15:15:28] i don't think you even need that require role::analytics::refinery, do you? [15:15:39] It's for the existence of the hdfs user. [15:15:44] right. [15:15:53] If we switch the user, we can drop it. [15:16:04] hm, [15:16:10] Argh. Right I forgot about the comment about the dependency. Thanks. [15:16:21] well, qchris [15:16:24] if we switch to stats user [15:16:31] than there is a real dependency [15:16:35] then* [15:16:55] which, i think i'm ok with [15:17:00] working path and user? sure. [15:17:02] what do you think? [15:17:10] then you can run it as $::statistics::user::username [15:17:41] Ok. Then I'll translate the whole thing to the stats user. [15:17:44] ok [15:17:44] milimetric, oh, I forgot [15:17:59] so, ja, if you do that, you don't need a comment about that statistics module dependency :) [15:18:03] milimetric, there's also this change, which should have been a dependency: [15:18:09] https://gerrit.wikimedia.org/r/#/c/197319/1 [15:18:24] ottomata: Is it ok if I'll keep the classes in refinery.pp, as it's more refinery than statistics? [15:18:44] yes [15:18:46] k [15:18:49] Thanks. [15:19:34] (CR) Milimetric: [C: 2] Make row assignable [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/197319 (https://phabricator.wikimedia.org/T89251) (owner: Mforns) [15:19:44] thanks milimetric :] [15:36:49] (PS2) Milimetric: Add rickshaw timeseries graph [analytics/dashiki] - https://gerrit.wikimedia.org/r/197590 [15:38:08] Analytics-EventLogging, Analytics-Kanban, operations: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1128881 (Nuria) We are going to enable our event-ingesting pipeline to use varnishkafka rather than varnishncsa.... [15:40:35] YuviPanda: hola, have time for labs question about graphite? [15:40:54] nuria: I am back :) [15:40:55] and yes [15:41:18] YuviPanda: ok, so i am trying to send metrics to labs graphite via logster and statsd, like: [15:42:00] YuviPanda: /usr/bin/logster --debug -o statsd --statsd-host=labmon1001.eqiad.wmnet:8125 --metric-prefix='wikimetrics' LineCountLogster /var/log/apache2/access.wikimetrics.log [15:42:29] YuviPanda: If i run that command with stdout i see the execution and metric: [15:42:46] https://www.irccloud.com/pastebin/vNHFSvUs [15:43:02] but i do not see anything on graphite on labs... [15:43:29] YuviPanda: That is: https://graphite.wmflabs.org/... am i missing something about how to send metrics there? [15:44:02] nuria: so graphite.wmflabs.org seems down atm, coren is investigating. txstatsd should still be running, however. [15:45:01] YuviPanda: so (when graphite is up), should i send metrics there using statsd? [15:45:12] nuria: you are sending it to the correct place, yeah. [15:45:21] nuria: but I can’t help debug until graphite is back up... [15:45:32] nuria: what’s the ‘-o stdout’ mean? [15:46:04] YuviPanda: that is just debugging so it prints to screen [15:46:19] nuria: ah, is it printing to the screen exactly what it is sending to statsd?
[15:46:21] -o statsd is the "real" command [15:46:27] YuviPanda: I think so [15:46:36] hmm, that doesn’t look like proper statsd format [15:46:38] * YuviPanda looks for the dosc [15:47:04] YuviPanda: plus execution times i think (last number) [15:47:39] nuria: right, so https://github.com/etsy/statsd/blob/master/docs/metric_types.md is the spec. I assume you would want ‘gauges’. and the output doesn’t look like valid statsd format at all [15:47:51] YuviPanda: since logster is used in prod i assume formats are right, even if printout looks odd [15:48:14] nuria: right. so I guess the way to debug this is to look at tcpdump and see what exactly is being sent... [15:48:26] and make sure it matches what txstatsd expects [15:48:37] nuria: are you sure it’s being used in prod? afaik ottomata said it was written but never used... [15:49:15] mmm.. ottomata : does varnishkafka use logster in prod? cc YuviPanda [15:49:43] yes [15:49:44] it does [15:50:09] ottomata: sending data to statsd? cc YuviPanda [15:50:16] hmm, ok. [15:50:18] YuviPanda: logster sends from all varnishkafka instances to a local txstatsd on each cache node, and those forward to central statsds [15:50:27] ottomata: these are txstatsd as well? [15:50:32] afaik [15:50:34] yes [15:50:51] nuria: hmm, so I’m not sure, but I can help debug once graphite is back up :) [15:51:13] YuviPanda: https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/cache.pp#L417 [15:51:25] YuviPanda: ok, thank you. [15:51:40] Ironholds, hi [15:51:42] and alos [15:51:43] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/cache.pp#L543 [15:52:06] nuria: cool :) [16:01:54] hey mforns [16:02:02] hey Ironholds :] [16:02:22] I'm starting: https://phabricator.wikimedia.org/T86535 [16:02:29] and have 2 questions [16:04:00] Ironholds, when you say "If there is only one event in a session, it should not be reported." I understand one-pv sessions should not count in calculating the means, min, max, quantiles, right? [16:04:53] I'm in a meeting [16:04:59] let's talk about this when I'm back [16:05:03] *? [16:05:03] oh, ok, np [16:05:18] sure Ironholds, ping me when you have time [16:07:39] qchris: i am running guard! [16:07:49] wohoo \o/ [16:15:09] mforns, okay, I'm around. WHat's up? [16:15:15] hey Ironholds [16:15:27] so I'm looking at: https://phabricator.wikimedia.org/T86535 [16:15:47] when you say "If there is only one event in a session, it should not be reported." I understand one-pv sessions should not count in calculating the means, min, max, quantiles, right? [16:15:57] yep [16:16:03] ok, right [16:16:18] but they should count as a session in session counts, right? [16:16:54] and also count in the computation of events per session? [16:19:18] err [16:19:22] * Ironholds goes to check [16:20:42] mforns, yes and yes [16:20:46] so, what I did, as a heuristic? [16:21:04] well, not heuristic. identifier. [16:21:25] I had the session length calculation output -1 in the case of sessions with only one event [16:21:40] which means it's trivial to filter them out but you still retain the data - meaning you can calculate, e.g., bounce rate, trivially. [16:21:42] Ironholds, aha [16:22:09] ok, makes sense [16:22:14] mforns, I've actually implemented the entire set in highly speedy C++, so if you'd find it useful I'm happy to point you to the source code [16:22:27] that would be great :] [16:22:31] of "this is how we separate streams of timestamps into sessions, session length calculation uses N value" [16:22:32] cool! 
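To make the heuristic Ironholds describes above concrete: a rough Java sketch of splitting one user's sorted timestamps into sessions on an inactivity timeout, with the -1 sentinel for one-event sessions so bounce rate stays recoverable downstream. This is not a port of the reconstructr C++; class and method names are made up, and the timeout is left to the caller.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the sessioniser described above: timestamps (in seconds,
    // already sorted for one user) are split wherever the gap exceeds an
    // inactivity timeout; one-event sessions report a length of -1.
    public class Sessionizer {

        /** Split one user's sorted timestamps into sessions. */
        public static List<List<Long>> sessionize(List<Long> timestamps, long timeoutSeconds) {
            List<List<Long>> sessions = new ArrayList<>();
            List<Long> current = new ArrayList<>();
            for (long ts : timestamps) {
                if (!current.isEmpty() && ts - current.get(current.size() - 1) > timeoutSeconds) {
                    sessions.add(current);
                    current = new ArrayList<>();
                }
                current.add(ts);
            }
            if (!current.isEmpty()) {
                sessions.add(current);
            }
            return sessions;
        }

        /** Session length in seconds, or -1 for a one-event session. */
        public static long sessionLength(List<Long> session) {
            if (session.size() < 2) {
                return -1L;
            }
            return session.get(session.size() - 1) - session.get(0);
        }
    }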
[16:23:05] https://github.com/Ironholds/reconstructr/tree/master/src you want session_metrics and sessionise, I think [16:23:22] ok, thanks! [16:23:27] (I apologise for it being a product of its API and thus using lists and vectors for damn near everything ;p) [16:24:42] np at all [16:25:08] the other question was about quantiles. which ones do we need? 0.25, 0.5 and 0.75? [16:28:33] Analytics-EventLogging, Analytics-Kanban, operations, Patch-For-Review: Eventlogging JS client should warn users when serialized event is more than "N" chars long and not sent the event [8 pts] - https://phabricator.wikimedia.org/T91918#1129163 (mforns) Open>Resolved [16:28:34] Analytics-EventLogging, Analytics-Kanban, operations: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1129164 (mforns) [16:29:48] (CR) Nuria: "mmm... isn't Utilities class missing from patch?" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/197296 (owner: OliverKeyes) [16:31:47] mforns, I'd check with Deskana|Away ; Howie wanted 0.1:1.0, plus 0.99 [16:32:34] nuria, bahahahahaha [16:32:41] ...I am an idiot who forgot what -a did [16:32:44] Ironholds, what do you mean with 0.1:1.0 ? [16:32:48] one moment while I submit a valid patch to you [16:32:56] mforns, 0.1, 0.2, 0.3, 0.4,0.5, 0.6... ;p [16:32:58] Ironholds: that a*ahem* has NEVER happen to me ... [16:33:14] well of course not! You are a professional! I am a professional! We are all professionals! [16:33:15] Ironholds, ok [16:33:25] who said impostor syndrome? I HEARD YOU, VOICE IN MY HEAD [16:33:27] Ironholds: no, it's more like every SINGLE time man. [16:33:38] hehehe [16:33:46] nuria, thank you for again demonstrating why I love working with y'all [16:33:47] Ironholds: it's one of those mistakes that you are like .. ah no, did i do that again? [16:33:57] I can think of WMF employees whose response would've been "ugh, you forgot to do X, god." [16:34:08] "I forget to do X all the time" makes all the difference when you're a noob :) [16:35:57] Analytics-EventLogging: Adapt eventlogging intake to use Kafka - https://phabricator.wikimedia.org/T93096#1129191 (Ottomata) NEW a:Ottomata [16:38:26] (PS2) OliverKeyes: De-static-everything [analytics/refinery/source] - https://gerrit.wikimedia.org/r/197296 [16:53:59] Ironholds: ok, looked at patch. Then: if we want to not have static classes but those are such that we only want 1 instance of the object, they should be "application scoped" singletons, meaning that they are instantiated once for the life of the app, which i guess in this case is the java that runs your hive query. [16:54:17] https://www.irccloud.com/pastebin/iwP1LIoG [16:54:46] aha! [16:54:52] so, throw that into the UDF defs? [16:54:56] nuria, ^ [16:55:19] Ironholds: no, into the classes that are singletons themselves [16:55:31] Ironholds: let me give you more precise docs [16:55:59] nuria: i'm going to work on some other stuff for a bit after SoS, would you look over this deployment plan? 
[16:56:00] http://etherpad.wikimedia.org/p/Analytics-Nuria [16:56:06] Ironholds: http://www.javaworld.com/article/2074979/java-concurrency/double-checked-locking--clever--but-broken.html?page=2 [16:56:07] i will do that to betalabs first [16:56:32] ottomata: will do, after 'second breakfast' [16:57:04] nuria, awesome; thanks :) [16:57:28] Ironholds: some useful info also here: https://en.wikipedia.org/wiki/Singleton_pattern [16:57:39] nuria, so just throw that in as the constructor, at which point when you try to instantiate a second time, the constructor will reference the initial instantiation? [16:57:45] if I'm understanding correctly [16:58:11] Ironholds: and this book is like the best reference evah: http://www.amazon.com/Java-Concurrency-Practice-Brian-Goetz/dp/0321349601 [16:58:32] perfect! [16:58:41] Ironholds: right, constructor is private now [16:59:02] Ironholds: do some tryouts outside the hive environment and you shall see it perhaps more clearly [16:59:12] nuria, gotcha :). It makes sense! [17:01:18] Ironholds: ok, caller classes would do: Singleton.getInstance() [17:01:30] Ironholds: let me know if this doesn't make sense [17:03:23] nuria, it mostly does! I'll read the docs more thoroughly and put together a dummy example just to be sure :) [17:10:12] nuria, okay, so it looks like what I want for full optimisation is (1) the singleton approach as you're suggesting [17:10:34] but (2) to lazily rather than eagerly create it (so we avoid unnecessary clutter, since the entire point of this patch is avoiding unnecessary clutter) [17:11:22] does that sound right? [17:16:14] nuria, is it not possible to do the getInstance() inside of the Resource (in your paste) instead of in MySingleton [17:16:30] and have Resource have a protected (private?) static member [17:16:39] that gets initialized when getInstance is called? [17:16:47] that way you avoid more classes? [17:17:20] ottomata, it looks like that's the case from my googling [17:17:30] but I'm going for the simplest approach, which is test it and see if it explodes :D [17:17:57] lazy instantiated singletons in the existing classes: test coming right up [17:18:07] class Resource { [17:18:07] protected static Resource resourceInstance; [17:18:07] public static Resource getInstance() { [17:18:07] resourceInstance = new Resource(); [17:18:07] return resourceInstance; [17:18:07] } [17:18:08] } [17:18:08] something like that? [17:18:11] idunno [17:18:24] would that avoid the public static getInstance()* [17:19:15] ottomata, yeah, although I'm going for: [17:19:32] well, that but resourceInstance = null [17:19:42] and then if(resourceInstance == null){ [17:20:00] resourceInstance = new Resource(); [17:20:04] return resourceInstance; [17:20:14] or, something like that [17:20:21] aye [17:20:24] righto [17:20:25] the point is to create lazily rather than eagerly so you don't create it whether or not you need it [17:20:38] * Ironholds is skim-reading all the things and still disconcerted I can have this conversation [17:20:53] okay, implementation test comin' up [17:21:07] Ironholds: "This is just a convenience method that also makes sure that arguments are not null.", i cannot explain this well, but this is something scala is really good at [17:21:13] starting with Pageviews because that's a one-public-method class [17:21:20] it auto guards against NPEs with some fancy type wrappers [17:21:24] ottomata, nice! [17:21:29] Scala is on my list, don't worry!
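Spelling out the pattern sketched above: a minimal lazily initialized singleton (with the "=" vs "==" slip in the chat sketch fixed), next to the eager variant nuria ends up recommending later in the log because it needs no synchronization to be thread safe. "Resource" is the placeholder name from the chat, not a real refinery class; the actual patch applies this inside the existing classes such as Pageview.

    // "Resource" is the placeholder name from the chat above, not a real class.
    public class Resource {

        // Eager variant: built once at class-load time, thread safe with no
        // synchronization (the construct nuria recommends later in the log).
        private static final Resource instance = new Resource();

        // Private constructor: callers can only go through getInstance().
        private Resource() {
        }

        public static Resource getInstance() {
            return instance;
        }
    }

    // Lazy variant from the chat sketch, with the "=" vs "==" slip fixed.
    // Unsynchronized, so only safe if a single thread calls it; otherwise
    // prefer the eager field above.
    class LazyResource {
        private static LazyResource instance = null;

        private LazyResource() {
        }

        static LazyResource getInstance() {
            if (instance == null) {
                instance = new LazyResource();
            }
            return instance;
        }
    }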
[17:21:35] heheh, still on mine too :) [17:21:39] When I am proficient in Java and C++ and Python I will dig into scala :D [17:21:52] well, "proficient" in C++ == "can make most things work with this one version of gcc" [17:22:01] i read this part of the scala book [17:22:05] and didn't buy it [17:22:13] it seemed like it wasn't really gaining anything, why not just check for nulls? [17:22:22] but, then I used it in that spark streaming thing, and I got it. [17:22:39] i still don't fully understand, but it was really nice [17:22:51] yay! [17:24:04] http://alvinalexander.com/scala/using-scala-option-some-none-idiom-function-java-null [17:30:36] ottomata, am I allowed to stick a very subtle reference to The Lonely Island's "Jack Sparrow" in the documentation if I can make it relevant. [17:32:55] ottomata: yes, there should not be any additional classes [17:33:49] wheee it works! [17:34:02] nuria, I think I got it to work, lazily instantiating so it doesn't create it from the get-go [17:34:28] I'm going to finish making it work for the pageviews class specifically, then would you like me to throw it up to the gerrit patch so you can check I haven't fubared it before I do it for every other class? [17:34:42] also, can I just say: singletons are REALLY COOL. [17:34:47] I need to find out how to do this in C++ [17:35:09] ...oh god I just said I liked Java. End times. [17:35:26] ottomata, Ironholds : singletons are far from ideal and they have their own sort of troubles, but at least: 1) they allow for mocking when testing 2) they can implement an interface. So for an application that will have several classes w/o any deps management I think is a preferable alternative to all being static and thus not OO. [17:35:37] yup [17:35:45] and also preferrable to "a new instance with every row" [17:36:13] Ironholds: ya, ya, hive udfs give us this method that is executed once per udf [17:36:15] Ironholds: singleton objects (aka companion objects) are first level citizens in scala :p [17:36:18] http://tutorials.jenkov.com/scala/singleton-and-companion-objects.html [17:36:32] Ironholds: that is where our instantiations should go always [17:36:45] *thumbs up* [17:37:00] ottomata: looking at deployment plan [17:37:01] okay, I've got it working for one class. You want to check now or should I implement everywhere and --amend then? [17:39:31] Ironholds, quick question about APP metrics job, why do we need to have a sample of uuids? [17:39:48] mforns, as opposed to grabbing all events from all UUIDs? [17:39:52] yes [17:40:04] we don't, it was just easier [17:40:24] remember the requirements come from a point where the processing and metric generation happened outside hadoop [17:40:29] which means no MR and no distributed computing [17:40:33] aha [17:40:34] so we took a sample and said: that'll have to do. [17:41:12] ok, so now we really do not need these uuid sample, right? we'll run the job over all webrequest logs [17:41:16] if you can do it efficiently with all UUIDs (which I don't doubt! After all, you already have to open all the files, and once the first-stage reducers have sorted the requests by {uuid, timestamp} it's just a streaming problem to tokenise them) do it [17:41:39] my only useful thing there would be: tokenise first, calculate metrics later. Which sounds very duh, but. 
[17:41:56] life is easier when you can provide a list of vectors, each vector representing a session (or, the java equivalent to vectors) [17:42:05] as opposed to, for each metric, having to sessionise [17:42:26] Ironholds, aha [17:42:37] ok, makes sense [17:42:42] *thumbs up* [17:42:52] thanx! [17:42:53] mforns, oh, and you've worked out the genius of mapping for this, right? [17:43:16] what? :] [17:43:21] maps produce {key, value} tuples. All we need as input data for session reconstruction is the UUID, which is a key, and the timestamp, which is the value [17:43:23] *jazz hands* [17:43:47] I'm super excited to see what you come out with because this system is...like, it would be hard to generate a system more optimised for session reconstruction, than hadoop is :) [17:43:56] so best of luck and let me know if you have further problems [17:44:01] ...questions, even [17:44:07] sorry, I'm doing 30 things at once; p [17:44:36] Ironholds, yes I have some experience with MR, but all your comments will help a lot and are welcome! [17:44:51] oh, totally. You know more about it than I do! [17:45:03] I'm just doing my "I love this thing! This thing is so awesome. Let's squee at how perfect it is" thing ;p [17:45:04] ok, I'll ping you no doubt if I have more questions [17:45:28] mforns: any desire to try to do this in spark instead of hive? you can use python. :p [17:45:29] no, for sure you know more on this specific case than I do! [17:45:37] it *might* be more efficient...it might be less... [17:45:42] actually, i take that back. [17:45:53] hive and parquet work great together right now, stick with it :p [17:46:00] hehe, ok [17:46:31] ottomata: ok, looked at plan, sounds good and i can help testing on vanadium wheever [17:47:11] mforns: but if you sample the job will be faster right? [17:47:21] nuria, yes [17:47:35] oh, hive? :/ [17:47:40] hive is going to be horrifyingly inefficient [17:47:58] unless you want to build out an entire class of UDFs just for this, I guess [17:48:05] mforns: Then there is a good reason as refined tables already have hive partitioning that allows for random sampling [17:48:15] Analytics-Cluster: Make spark work well with webrequest Parquet data - https://phabricator.wikimedia.org/T93105#1129465 (Ottomata) NEW a:Ottomata [17:48:18] Ironholds: ah, so map & reduce code you mean [17:48:26] Ironholds: ya, that might be the case [17:48:35] nuria, yeah, a straight MR job [17:48:42] oh iunno mforns, Ironholds, dunno much about this task [17:48:44] do wahtever you need :) [17:48:52] because (1) the data format (uuid, timestamp) is perfectly optimised for MR [17:48:53] oh if you do aMR job, i think you wil have the same problems that hive will. [17:49:05] Ironholds: thos come from x analytics header, right? [17:49:17] and (2) if we sort before plugging it into the sessioniser we can take advantage of streaming [17:49:20] Ironholds: i think it might come to that, we can probably try hive first and see how things look [17:49:23] sorry, same problems that spark will * [17:49:26] which means that we get to avoid clogging the rest of the system with...all the things. [17:49:40] ottomata, yeah, I think so? I think technically you have to check both that and URL at the moment, because legacy app versions. [17:49:49] why is count + group by bad? isn't that all you are doing? 
[17:50:01] nope [17:50:11] select uuid, timestamp, ORDER BY uuid, timestamp DESC [17:50:17] then convert timestamp into seconds [17:50:29] then stream each uuid's timestamps into a tokeniser [17:50:32] (coudl do that in the select) [17:50:38] tokenisserrrrrr [17:50:49] ? [17:50:53] well, "sessioniser" makes Aaron cry [17:50:58] ottomata, okay, I've got 20 timestamps from userX [17:51:18] I want to work out how many sessions they had, and also how long these sessions were, and also how many pages were in each session, right? [17:51:33] all of these metrics are dependent on reconstructing sessions from stream_of_timestamps [17:52:07] so instead of including that logic in each metric's calculator, you have a dedicated sessioniser that accepts a pile of timestamps and produces a list with each entry consisting of one "session" and the timestamps within that session [17:52:43] then you plug the list into how_many_sessions. list.length() done. How many pages? map(list,length). How long these sessions were? well, more complicated, but for the sake of this example, last_timestamp - first_timestamp [17:53:48] i see ok [17:53:49] so if Hive is the method we're looking at; how well does hive handle throwing multiple values in? Because we'll need to sort, and then divide up by UUID value, and then sessionise, and then calculate metrics for each of those users' sessions, and then calculate (mean, median, quantiles, whatever) for each metric [17:54:07] mforns: you should use spark :p [17:54:09] haha [17:54:12] like, Hive would work well for data retrieval (it used to be slow but from what nuria is saying it's probably a lot more reliable now, because the partitioning is optimised for this) [17:54:22] but it's not a great model for the processing, as I understand it [17:54:24] aye [17:54:28] ottomata, aa [17:54:29] aha [17:54:38] i jsut made this ticket: [17:54:38] https://phabricator.wikimedia.org/T93105 [17:54:43] ottomata, wait, did I just get a thing right? Truly, today is a weird day ;p [17:54:55] Ironholds: i am not 100% sure, but it does sound complicated to do with hive [17:55:12] but, either way [17:55:15] if it is MR or if it is Spark [17:55:23] either'd work! [17:55:28] i think there will have to be some work around making Parquet work with a column projection of some kind [17:55:35] hive does this for us [17:55:35] MR is easiest just because, well, if we're gonna have to maintain a thing, maintaining it in Java would seem to make sense. [17:55:40] in my mind [17:55:48] disagree there :/ [17:55:58] really? hrm. What would you suggest? [17:56:06] like, what's the rationale? [17:56:09] i would suggest the best tool for the job that we also want to support [17:56:31] java would be fine. but spark code is much easier to iterate on, and to debug and to read. and potential is faster (but mabye not!) [17:56:50] easier to iterate on than MR code* [17:56:59] *nods* [17:57:00] makes sense! [17:57:06] Java MR is very verbose and has a lot of parts. implementing mappers, reducers, etc. [17:57:13] i'm not opposed to that at all [17:57:16] if mforns wants to do that [17:57:18] no objection here. [17:57:19] yeah, that's a good point. I reviewed Bob West's streaming code and just that made my eyes bug [17:57:36] but you raise an excellent point of "mforns gets to decide", so on that note I'm gonna get back to singleton generation :D [17:57:41] haha, k :) [17:58:00] ottomata, but there's work to do still to get spark working right? 
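Stepping back to the "tokenise first, calculate metrics later" pipeline Ironholds walks through a few messages up: once each user's timestamps have been split into sessions, every metric is a short pass over that list. A hypothetical Java sketch building on the Sessionizer sketch earlier; class and method names are illustrative only.

    import java.util.ArrayList;
    import java.util.List;

    // "Tokenise first, calculate metrics later": given one user's sessions,
    // each metric is a trivial pass over the session list.
    public class SessionMetrics {

        /** How many sessions the user had. */
        public static int sessionCount(List<List<Long>> sessions) {
            return sessions.size();
        }

        /** Pages (events) per session. */
        public static List<Integer> pagesPerSession(List<List<Long>> sessions) {
            List<Integer> pages = new ArrayList<>();
            for (List<Long> session : sessions) {
                pages.add(session.size());
            }
            return pages;
        }

        /**
         * Session lengths in seconds, skipping one-event sessions (the -1
         * sentinel), so means and quantiles are not distorted by bounces.
         */
        public static List<Long> sessionLengths(List<List<Long>> sessions) {
            List<Long> lengths = new ArrayList<>();
            for (List<Long> session : sessions) {
                if (session.size() > 1) {
                    lengths.add(session.get(session.size() - 1) - session.get(0));
                }
            }
            return lengths;
        }
    }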
[17:58:06] ottomata,Ironholds , mforns : seems that there are two things here 1) processing of data so complex metrics can be calculated 2) complex metric calculations [17:58:07] naw, spark works now [17:58:18] ottomata, and the task you created with parquet? [17:58:22] nuria, yep. The second is trivial. Like, if you want, I am happy to implement the calculators. [17:58:25] the part that doesn't work: is parquet schema projection [17:58:27] so [17:58:33] parquet is a columnar based store, right? [17:58:37] the hard bit is retrieving and sorting and sessionising the data, imo [17:58:38] aha [17:58:41] so, you should only have to read the columns you want from disk. [17:58:45] but that might just be because my expertise is metrics calculation not MR :D [17:58:45] hive has this built in [17:58:58] i think we need some custom work to make that happen, if we are not going to use Hive [17:59:00] but! [17:59:08] ottomata: right, because of the sql-> java mapping [17:59:09] you can still use Parquet with either [17:59:11] spark or MR [17:59:18] rigiht now [17:59:25] you just wont' get the performance benifits of parquet [17:59:30] aha [17:59:31] because by default you will read all of the columns [17:59:48] I see [17:59:48] i have done some work on this already, so I know approximately what nees to happen [18:00:09] i know more of how to make this work in java, but I'm pretty sure that the work to do so will apply for spark as well...although possibly not spark-python [18:00:13] you might have to use scala :/ [18:00:14] ottomata: actually we can do that now w/o modifications if we create a partition that "only" has the columns we need (not-so-great-workarround) [18:00:15] or java! [18:00:18] spark has a java API too [18:00:54] ok... [18:02:01] nuria: ? [18:02:03] well lots of decisions to make, in which I have not much experience [18:02:22] mforns: your decision is: [18:02:33] xD [18:02:35] your choices are: [18:02:58] - java MR [18:02:58] - java spark [18:02:58] - scala spark [18:02:58] - hive [18:02:58] - something else? probably not at this time :) [18:03:05] oh, and MAYBE [18:03:07] - python spark [18:03:12] you can start with python spark if you want [18:03:30] it is very easy to translate from python spark to scala spark if we get there, up to you. [18:03:36] whatever you choose, we will make this work. [18:03:42] ok [18:03:42] this is something i've been meaning to do for a while [18:03:55] and, the work to make it work applies to all of those options (except for maybe python spark and for hive) [18:04:14] and, in the meantime, your code will be the same eithe rway [18:04:15] ottomata: i do not think we should add python spark to the mix [18:04:18] haha [18:04:25] well, people here like python! [18:04:26] :) [18:04:35] i would encourage scala as well, but that is very new and I don't want to force it [18:05:12] well, spark is also very new to me [18:05:16] i think it would not be hard folks to use python at first, and if we run into limitations, convert to scala. the logic of code is very similar [18:05:21] mforns: you will not have a problem, i am sure [18:05:27] especially if you start with the python stuff [18:05:29] it is very easy [18:05:34] ok [18:05:38] ottomata: i disagree, we have a bunch of utility functions in java already, let's plis [18:05:44] not duplicate those in python [18:05:51] mforns: https://gist.github.com/ottomata/adcb200b99ac1c9d5941 [18:05:54] java spark then? 
[18:05:54] ottomata: I think we should support reserachers using python [18:05:59] hm, that is true, python won't let you use the java stuff [18:06:06] scala spark I mean [18:06:11] ottomata: but they mostly do 1-offs, not recurrent jobs [18:06:11] mforns: you can use the java classes from java or scala [18:06:14] aye [18:06:16] makes sense actually. [18:06:35] mforns: if you want to try spark, i would recommend playing with the python api first, just to get a feel for it [18:06:37] ottomata: so for dev team let's keep it to java or scala and make our utility code be java [18:06:44] maybe you will not want to implement this in python though [18:06:51] as nuria says, she is probably rigiht [18:07:18] (PS3) OliverKeyes: De-static-everything [analytics/refinery/source] - https://gerrit.wikimedia.org/r/197296 [18:07:18] ok [18:07:37] If I have problems starting right away with scala, I'll try python first [18:07:43] nuria, thrown in an example of how I'm handling singletons in a lazy way ^ - let me know if I'm doing it right when you have the time/spoons? [18:07:44] sure. [18:07:57] mforns: it is really fun to play with spark on the repl CLI [18:07:58] but it seems it will be scala-spark [18:08:04] do: [18:08:09] spark-shell [18:08:21] Ironholds: in your gerrit patch? [18:08:23] that will give you a local spark instance that can also access hdfs [18:08:27] spark shell instance* [18:08:34] nuria, yep! It's the "Pageview" class and associated tests/UDFs [18:08:35] then you can put scala code in and do fun stuff [18:08:37] like ipython! [18:08:38] its fun! [18:08:44] ottomata, ok [18:08:49] (I should probably have specified that. "It's somewhere in these half-dozen jars, have fun!" :D) [18:08:50] or, pyspark [18:08:51] if you want to do that [18:08:58] i should write up a spark tutorial... [18:08:59] :p [18:09:03] adding a task! :) [18:09:09] ottomata, our local spark dealer [18:09:12] haha [18:09:12] the first map job is free [18:09:15] after that, you have to pay [18:09:20] xD [18:09:25] all the new techs go like this [18:09:35] i've been a dealer in things people don't want to try for 3 years at wmf now :p [18:10:39] ottomata, the code for the wikipedia top-visited tag cloud, is scala-spark right? [18:10:40] Analytics-Cluster: Write wikitech spark tutorial - https://phabricator.wikimedia.org/T93111#1129598 (Ottomata) NEW a:Ottomata [18:11:20] ottomata: mmm... not to be teh party pooper but how will scala spark integrate with oozie? [18:11:27] ottomata: is that possible? [18:14:57] nuria: I've got a weird thing with event logging, mind taking a look with me? [18:16:29] mforns: ja [18:16:54] nuria: shoudl work fine, no? oozie is flexible [18:16:55] milimetric: sure, give me a sec [18:16:58] ottomata, oozie does not trigger spark jobs? [18:16:58] its just a hadoop job [18:17:05] oh [18:17:49] ottomata: so it is executed as an external command to oozie? [18:17:57] nuria: also [18:17:57] https://github.com/apache/oozie/blob/master/client/src/main/resources/spark-action-0.1.xsd [18:18:19] ottomata: ah, ok, so it knows about spark [18:18:27] i see at least some version does [18:18:30] ottomata: good, cause we need the scheduler [18:18:36] or we could just shell out to submit a job, i'm sure it will work somehow [18:19:30] ottomata: ya, i was trying to avoid the shell approach [18:20:15] ottomata: if oozie initializes it likely, just like it does for java, it can load all our utility code (udfs.. 
etc) [18:20:33] ottomata: or rather, run on top of the jvm where all that is alredy initialized [18:21:22] if spark action is not avialable, i'm sure there is a way to just launch it as a hadoop action or something? it is just a jar that takes a main class name in the end [18:21:30] spark-submit somewhere along the lines probbly does hadoop jar... [18:21:33] i would think anyway [18:22:31] i dunno, i'm sure it is possible :) [18:24:39] ottomata: ok, sounds like it is. [18:25:29] ottomata, nuria: I feel like spark + scala + bending oozie to call spark is quite a lot of new things for me. I will need some assistance with that. [18:26:26] mforns: I would 1st: try to do what needs to be done in hive, see problems with it, try it (on the dry) on spark [18:26:33] mforns: no oozie yet [18:26:43] mforns: in a subset of mobile data say an hour [18:26:48] aha [18:27:23] ok [18:27:45] mforns: once you have an idea of issues/complexities it is likely that spark is a better choice. if so great, me, like ottomata loves scala [18:27:49] nuria, ottomata: do you have a more or less similar example of a scala spark job? [18:28:14] mforns: not me [18:28:38] the top 10 pageviews was done in scala, mforns [18:28:39] ottomata, the top wikipedia articles tag cloud was scala spark? [18:28:47] yes, thanks milimetric [18:28:49] it was done in spark streaming, so it is a little different [18:28:55] I see [18:29:01] but, i'll do one like th gist i sent you [18:29:03] in scala [18:29:04] it's not exactly similar, but the scala code should be a lot simpler for what you need [18:29:06] ottomata: but this one should be simpler, no real time component [18:29:06] gimme a few mins [18:29:20] milimetric: batcave? EL? [18:29:26] batcave, yes [18:29:37] ok, thanks guys [18:30:15] nuria: I be in da cave [18:30:27] milimetric:wait, sounds like chrome re-start [18:49:23] ooo, mforns, i think spark scala does work with these parquet files real nicely. i'm still making an example, stay tuned.. [18:49:37] ottomata, ok :] [19:27:59] ok, mforns, batcave? [19:28:01] https://gist.github.com/ottomata/a045770ce75f065268ea [19:28:07] ottomata, sure [19:28:07] want a walk through? [19:28:22] actually, brb.... [19:32:15] (CR) Nuria: De-static-everything (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/197296 (owner: OliverKeyes) [19:32:40] nuria, thanks for the CR! [19:32:59] I'll fix and implement for the other classes, yeah? [19:34:21] IH|postoffice: sounds good, let's go with the simpler eager initialization construct as it does not require synchronization so is per-se thread safe [19:50:24] nuria, makes sense! [19:51:03] Ironholds: k [20:07:46] milimetric: something else to talk to VE team about will be the rate of events, now is ~70 per sec, which is huge, soon mysql tables will be too large to get any data [20:09:56] nuria: yeah, but because they're doing sessions it's impossible to sample [20:10:12] milimetric: you can sample within the session just fine [20:10:33] as a session either sends events or not (every single one of them) [20:10:53] I'm not sure what you mean [20:11:00] milimetric: so you do not sample events but rather sample sessions [20:11:18] can't do that though - how do I know if I'm sampling equally for all types of events? 
[20:11:19] milimetric: so every other session sends events, that would be 50% sampling [20:11:36] yeah, but you could introduce a bias pretty easily if you're not careful with that [20:11:55] milimetric: no, if session ids are random [20:12:06] milimetric: and they are as random as they can be [20:12:08] mi [20:12:40] milimetric: you will be skewed toward users that use the tool more frequently, but for any overall stats that is happening already [20:13:03] not really, for example, the user type analysis [20:13:17] I think we just need to get a better way to analyze the data [20:13:38] using mysql for analysis is like using a car engine as a fishing rod [20:15:04] milimetric: even that type of data you could sample (provided bucket sizes are within a magnitude). More data doesn't necessarily mean more precise results for the level of precision we are reporting. [20:16:00] nuria: graphite is back up and should be stable now [20:16:22] YuviPanda: ok, thank you! testing now [20:16:26] I agree we could sample, but in light of all the other work we have, it seems like an unnecessary battle. Worst case I'll just sqoop all their data into hdfs and do the analysis there [20:16:51] milimetric: agree, it is concern #2 after others [20:29:01] * Ironholds checks email [20:29:05] * Ironholds reaches for his good gin [20:29:54] Ironholds, do you have time for 1 more question? [20:30:43] is it an unceasing stream of "what the hell nooo" at wmfall? [20:30:48] (sure, hit me) [20:32:39] Ironholds, where is the apps uuid data in the webrequest table? [20:32:42] x_analytics? [20:32:54] ah. ahahah. ahahahahahahahahahahsahfkhgthkmg*chokes* [20:33:00] xD [20:33:13] so, it used to be in the URL, but they realised that this meant caching was impossible because every user had their own set of URLs [20:33:18] so now it's in x_analytics [20:33:22] buuut some people don't upgrade! [20:33:56] ok, and how do I know a pageview comes from apps? [20:34:26] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Pageview.java#L122-L139 [20:34:38] if you're planning on using hive and ergo the UDFs I can make that public as part of my current patch if you want [20:35:07] ha why shouldn't that be public, eh?! [20:35:21] because we've never needed to use it on its own! :P [20:35:35] I'll make it public (will have to write a set of unit tests. Eh, fine. Is tomorrow okay?) [20:35:43] so, on the UUIDs, you'll need to use url_parse(concat(uri_host,uri_path)...) to grab the UUID and, if it's NULL, I wrote xAnalyticsExtract, where you pass it a parameter name and it passes the value back out (or NULL) [20:35:52] this is assuming UDF usage [20:36:45] also, fyi, there is a task that will soon parse x_analytics into a Map and store it as such in the refined table [20:36:49] so you will be able to do [20:36:50] yay! [20:36:54] select x_analytics['uuid'] [20:36:56] or whatever [20:36:56] x_analytics['uuid'] [20:36:57] snap [20:37:19] aha [20:37:51] Ironholds, ottomata: can I use UDFs from spark?
[20:38:07] nooo idea ;p [20:38:25] I mean, all my UDFs have the logic abstracted out to a pure java class [20:38:30] so, they're not UDF dependant [20:38:32] Analytics-Kanban, Analytics-Wikimetrics: Utf-8 names on json reports appear as unicode code points: "\u0623\u0645\u064a\u0646" - https://phabricator.wikimedia.org/T93023#1130173 (Fhocutt) a:Fhocutt [20:38:38] so even if you can't you could grab the underlying java class and use that [20:39:20] all right [20:39:27] mforns: not UDFs, but java ja [20:39:32] (PS1) Milimetric: Update for March (Yes, it's late) [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/197736 [20:39:37] mforns: i did in my streaming job for isPageview [20:39:46] since the json data i was consuming doesn't have that as a refined field [20:39:47] (CR) Milimetric: [C: 2 V: 2] Update for March (Yes, it's late) [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/197736 (owner: Milimetric) [20:39:48] so ja, like this: [20:40:10] yes, I see, didn't know that UDFs could also be called from java [20:40:11] import org.wikimedia.analytics.refinery.core.Pageview [20:40:11] Pageview.isPageview(...) [20:40:24] ok ok [20:40:27] well, UDF is the wrong term there really [20:40:30] UDF is for Hive [20:40:46] the UDFs we code for Hive are very thin wrappers [20:40:51] for things usually in refinery.core [20:41:06] Analytics-Kanban, Analytics-Wikimetrics: Utf-8 names on json reports appear as unicode code points: "\u0623\u0645\u064a\u0646" - https://phabricator.wikimedia.org/T93023#1130189 (Fhocutt) I'll look into this--looks like I didn't handle encoding properly if this is happening. [20:41:08] so, if you are using other languages, you can call the refinery.core class methods [20:41:46] ottomata, aha [20:42:16] the more I listen to Moonlight Sonata the more I think it's about exponential progress and impending doom at the hands of artificial intelligent god machines [20:43:02] ottomata, when I import org.wikimedia.analytics.refinery.core.Pageview from spark-shell it does not work: error: object wikimedia is not a member of package org [20:43:43] aye you need to have the jars loaded in your classpath [20:43:47] uMmmMmMM [20:43:56] this is easier from inside of a compiled context, lemme see... [20:45:05] ah easy [20:45:06] mforns: [20:45:06] spark-shell --jars /srv/deployment/analytics/refinery/artifacts/refinery-core.jar [20:45:12] then [20:45:18] import org.wikimedia.analytics.refinery.core.Pageview [20:46:05] mforns: i want to say: just because spark is cool does not mean we ahve to use it for this [20:46:18] if hive actually does work, and isn't too roundabout, then you might want to stick with it [20:46:57] spark and scala would be very new for us, and introducing this for unique counting might not be the best decision. That doesn't mean that it isn't though! I leave that up to y'all :) [20:46:59] well I'll try, I agree it is better to use spark :] [20:47:11] ok [20:47:19] thanks for the help :] [21:17:30] nuria, the apps uuid comes partly from the x_analytics header [21:19:28] evening halfak [21:19:51] o/ Ironholds [21:19:56] how goes? [21:20:11] Not bad. Just finishing up the last paper session at CSCW [21:20:18] Looking at meme diffusion patterns. [21:20:28] neat! [21:20:35] I'm uh. Writing java. [21:21:44] UDFs or something else? [21:23:22] architectural changes! 
[21:23:38] reorganising our code so it's non-static and instead uses eagerly-instantiated singletons [21:24:05] (I would go for lazily-instantiated but nuria tells me there are thread-safety problems with that, which is why we pay her software engineer money and me...whateverIamnow money ;p) [21:29:28] Analytics-Kanban, Analytics-Wikimetrics: Utf-8 names on json reports appear as unicode code points: "\u0623\u0645\u064a\u0646" - https://phabricator.wikimedia.org/T93023#1130300 (kevinator) @fhocutt, @nuria is also looking into this presently. We have run into may encoding issues in the past and she thin... [21:31:11] Analytics-Kanban, Analytics-Wikimetrics: Utf-8 names on json reports appear as unicode code points: "\u0623\u0645\u064a\u0646" - https://phabricator.wikimedia.org/T93023#1130302 (Fhocutt) Ok, great! If she fixes it I'll see how she did it and know for next time. [21:31:23] Analytics-Kanban, Analytics-Wikimetrics: Utf-8 names on json reports appear as unicode code points: "\u0623\u0645\u064a\u0646" - https://phabricator.wikimedia.org/T93023#1130303 (Fhocutt) a:Fhocutt>None [21:55:27] Ironholds, how do singletons make the code non-static? [21:55:31] Analytics, MediaWiki-General-or-Unknown, Services, Wikidata, and 4 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1130400 (GWicke) See also: @aaron is working on a cache update service at https://github.com/AaronSchulz/python-m emcached-relay [21:55:38] Analytics-Cluster, Analytics-Kanban, Easy: Mobile Apps PM has monthly report from oozie about apps uniques [8 pts] - https://phabricator.wikimedia.org/T88308#1130402 (kevinator) @deskana was asking: is this report run monthly on a month of data? [21:55:53] halfak, rationale goes; if we make it static, everything is everywhere. messy. if we make it dynamic, we have to explicitly instantiate it [21:56:04] buuut if we have to explicitly instantiate it in the UDF, we're instantiating a new instance every row [21:56:14] What I'm missing here is your meaning of "static" and "dynamic" [21:56:29] so we use singletons to ensure we get to avoid cluttering the namespaces until we need the class, but only have to instantiate it once when we do need it [21:56:42] halfak, literally "public static string extractThingFromOtherThing" [21:56:54] as opposed to "public string extractThingFromOtherThing" [21:58:40] Break time! [21:58:40] o/ [22:54:22] Analytics, Analytics-Kanban: Turn off Zero - Limn dashboards & put up a "moved sign" - https://phabricator.wikimedia.org/T92920#1130547 (kevinator) Spoke to Dan and this isn't as simple as dropping in an index.html where the current dashboards are. Easiest solution might be to configure Apache to redirec... [23:00:13] Analytics, Analytics-Kanban: Turn off Zero - Limn dashboards & put up a "moved sign" - https://phabricator.wikimedia.org/T92920#1130579 (kevinator) [23:02:32] Analytics, Analytics-Kanban: Turn off WP Zero's Limn-Dashboards & put up a "moved sign" - https://phabricator.wikimedia.org/T92920#1130583 (kevinator) [23:03:58] (PS4) OliverKeyes: De-static-everything [analytics/refinery/source] - https://gerrit.wikimedia.org/r/197296 [23:04:49] (CR) OliverKeyes: "Nurieta, a singleton for you!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/197296 (owner: OliverKeyes) [23:09:05] ori: yt?
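Circling back to the sampling exchange between nuria and milimetric earlier in the log: sampling whole sessions rather than individual events can be done deterministically, so that every event carrying the same session id gets the same keep/drop decision. A hypothetical Java sketch, assuming a random session id is present on every event; the names, the hash-bucket approach, and the 50% rate are illustrative, not what EventLogging actually does.

    // Sample sessions, not events: hash the session id into 100 buckets and
    // keep only the sessions that fall below the sampling percentage.
    public class SessionSampler {

        /**
         * @param sessionId       random id shared by all events in a session (assumed)
         * @param samplingPercent e.g. 50 keeps roughly half of all sessions
         */
        public static boolean isSampled(String sessionId, int samplingPercent) {
            // Math.floorMod keeps the bucket non-negative even when hashCode() is negative.
            int bucket = Math.floorMod(sessionId.hashCode(), 100);
            return bucket < samplingPercent;
        }

        public static void main(String[] args) {
            // Made-up session id, 50% sampling: same id always gives the same answer.
            System.out.println(isSampled("0e8e2b54-4a1b-4c5d-9e1f-2a3b4c5d6e7f", 50));
        }
    }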