[00:14:57] (PS8) Nuria: Add UAParserUDF from kraken [analytics/refinery/source] - https://gerrit.wikimedia.org/r/166142 (owner: Ottomata) [00:15:40] (CR) Nuria: Add UAParserUDF from kraken (4 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/166142 (owner: Ottomata) [05:47:40] Analytics / Quarry: Quarry vulnerable to XSS exploit - https://bugzilla.wikimedia.org/72414 (PiRSquared17) NEW p:Unprio s:critic a:None Example: http://quarry.wmflabs.org/query/808 [06:00:39] (CR) Springle: [WIP] Add schema for edit fact table (2 comments) [analytics/data-warehouse] - https://gerrit.wikimedia.org/r/167839 (owner: QChris) [08:25:46] (PS2) QChris: [webstatscollector] Add Makefile [analytics/metrics] - https://gerrit.wikimedia.org/r/99077 [08:27:15] (CR) QChris: [V: 2] [webstatscollector] Add Makefile [analytics/metrics] - https://gerrit.wikimedia.org/r/99077 (owner: QChris) [11:57:13] This seems cool: https://github.com/fastos/fastsocket [11:57:48] Do we have any problems with throughput it could solve? [12:17:17] Not sure if we have problems it solves ... bit nice nonetheless :-) [12:33:43] (CR) QChris: Add UAParserUDF from kraken (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/166142 (owner: Ottomata) [13:16:08] (PS1) Yuvipanda: Fix XSS vulnerability [analytics/quarry/web] - https://gerrit.wikimedia.org/r/168278 [13:21:53] Analytics / Quarry: Quarry vulnerable to XSS exploit - https://bugzilla.wikimedia.org/72414#c1 (Yuvi Panda) NEW>RESO/FIX Fixed now. Thanks for reporting! [13:36:19] qchris: hiyayaaa [13:36:31] we good to move on with next ack=2 merge? [13:36:35] Heya Sir ottomata [13:37:07] Is it ok to wait until after scrum? [13:37:19] I want to check out the depooling of amssq42. [13:37:46] Not sure how that affects us ... maybe it's spare us 1 on 2 restarts ... not sure yet. [13:38:26] sure [14:12:40] ottomata: The ACK thing is good to go from my point of view: https://gerrit.wikimedia.org/r/#/c/167552/ [14:24:41] qchris, merged! [14:24:49] Awesome. [14:25:25] Too bad tomorrow is Friday and the final one has to wait longer. [14:26:07] Analytics / EventLogging: EventLogging needs process nanny alarm on hafnium - https://bugzilla.wikimedia.org/67309 (nuria) PATC>RESO/FIX [14:29:29] (PS11) Mforns: Add ability to global query a user's wikis [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl) [14:29:37] (CR) jenkins-bot: [V: -1] Add ability to global query a user's wikis [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl) [14:37:15] :) [14:37:18] moving locations, back in aibt [14:53:25] Analytics / Refinery: Raw webrequest text partition for 2014-10-22T15/1H not marked successful - https://bugzilla.wikimedia.org/72427 (christian) NEW p:Unprio s:normal a:None The text webrequest partition [1] for 2014-10-22T15/1H has not been marked successful. What happened? [1] _______... [14:53:38] Analytics / Refinery: Raw webrequest partitions that were not marked successful due to configuration updates - https://bugzilla.wikimedia.org/72300 (christian) [14:53:41] Analytics / Refinery: Raw webrequest text partition for 2014-10-22T15/1H not marked successful - https://bugzilla.wikimedia.org/72427#c1 (christian) NEW>RESO/FIX Commit e1c35ceb080d00d590e120dc7745dac34428de53 got merged, which updated the varnishkafka configuration for the text caches. This caus... 
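The UAParserUDF change under review at the top of this log is Java, but the underlying ua-parser project also ships a Python package, which gives a quick feel for the browser / OS / device fields that later feed the "Device" dimension. A sketch assuming the `ua_parser` package is installed; the sample user agent and the values in the comments are only indicative:

```python
from ua_parser import user_agent_parser

ua = ("Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.4 "
      "(KHTML, like Gecko) Version/8.0 Mobile/12A365 Safari/600.1.4")

parsed = user_agent_parser.Parse(ua)
print(parsed["user_agent"]["family"])  # e.g. "Mobile Safari"
print(parsed["os"]["family"])          # e.g. "iOS"
print(parsed["device"]["family"])      # e.g. "iPhone"
```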
[14:54:10] Analytics / Refinery: Raw webrequest bits partition for 2014-10-22T23/1H not marked successful - https://bugzilla.wikimedia.org/72428 (christian) NEW p:Unprio s:normal a:None The bits webrequest partition [1] for 2014-10-22T23/1H has not been marked successful. What happened? [1] _______... [14:55:39] Analytics / Refinery: Raw webrequest bits partition for 2014-10-22T23/1H not marked successful - https://bugzilla.wikimedia.org/72428#c1 (christian) NEW>RESO/FIX Only amssq42 is affected. There was no real loss, but the sequence numbers got reset. Since amssq42 got depooled [1] for trusty testing... [14:55:40] Analytics / Refinery: Raw webrequest partitions that were not marked successful - https://bugzilla.wikimedia.org/70085 (christian) [14:58:00] !log Marked raw text webrequest partitions for 2014-10-22T15/1H ok (See {{bug|72427}}) [14:58:11] !log Marked raw text webrequest partitions for 2014-10-22T23/1H ok (See {{bug|72428}}) [16:09:04] ottomata, qchris: how do we prevent puppet from running on wikimetrics staging? it's running and restarting the queue every time [16:09:53] nuria__: puppet agent --disable REASON_FOR_DISABLE [16:10:11] Where REASON_FOR_DISABLE is the reason why you want it disabled [16:10:22] puppet agent --disable "Human readable reason for disable" :) [16:10:45] Hahahaha. [16:40:22] qchris: my next task is to help get the pageviews definition implemented and generating a per-project daily file [16:40:24] milimetric, nuria__: I'm going to rebase to the new master now [16:40:47] mforns: k, i lied about reviewing your code actually because I forgot to eat again [16:40:50] i'll review after lunch [16:41:06] per-project daily file? [16:41:14] That's a file with three readings? [16:41:18] sorry [16:41:29] so the end goal is pageviews in dashiki [16:41:37] for that we need a wikimetrics-format file somewhere [16:41:54] and wikimetrics-format means json files that have daily counts per project [16:42:09] like enwiki.json has day 1: X, day 2: Y, day 3: Z [16:42:29] Oh. The thing we started to discuss yesterday. [16:42:36] ??? [16:42:41] milimetric: the REAL pageviews definiton!? [16:42:45] yes [16:42:46] :) [16:42:49] you have that?! [16:42:51] it exists!? [16:42:52] no [16:42:54] :) [16:42:55] oh. [16:43:01] i'm tasked with helping make it exist [16:43:08] ok! [16:43:16] yep, i'm all "yay" [16:43:25] btw, I would love that to be an implementation agnostic specification of some kind...eventually [16:43:26] You're tasked to help making it exist ... https://meta.wikimedia.org/wiki/Research:Page_view [16:43:30] because i've had the pageviews bug assigned to me since freaking April [16:43:44] qchris: I have read that page [16:43:56] I know there will be various comments, so feel free to be "brutal" [16:43:57] and I know that's where we have to start, so how can I help? [16:44:11] mforns: brutal is my second middle name. Dan Florin Brutal Andreescu [16:44:24] We first need to the use-cases straightened out. [16:44:34] Because there is sooooo much cruft in there. [16:44:39] https://meta.wikimedia.org/wiki/Research_talk:Page_view :) [16:44:42] And so many implicit assumptions. [16:45:29] yep, i read that and though the formulations of the questions in that section sound good, it didn't seem very actionable and "develop"able [16:45:38] so qchris did you have an ideal "use case format" [16:45:41] ? [16:46:07] milimetric: ok! [16:46:24] No. Just making sure that we're on the same page about the problem we're gonna solve. [16:46:50] But! I am not the one to drive this. 
So I guess it's better to ask Ironholds. [16:47:25] * Ironholds rises from r'lyeh [16:47:26] well, so for the purpose of vital signs, we need mobile, desktop, and total pageviews by project [16:47:34] by day [16:47:40] (PS12) Mforns: Add ability to global query a user's wikis [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl) [16:47:50] (CR) jenkins-bot: [V: -1] Add ability to global query a user's wikis [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl) [16:47:50] so could we start with such a dataset as the first step Ironholds? [16:47:53] where mobile groups zero/apps/web? [16:48:17] Ironholds: that's a great question for kevinator and tnegrin that I've been trying to get clarified as well [16:48:18] milimetric, that is fairly doable. The only unstable thing, I think, for that, would be getting a handle on what we're doing with spiders and other non-human traffic. [16:48:49] sounds like a horror movie: spiders and non-human traffic [16:49:07] qchris, I would agree there is a lot of cruft; we should trim it down :). I've taken a pass over some of your questions and will take a pass over the rest when I get out of this meeting. [16:49:29] I am struggling to work out if we want to list all the use cases, or just the main ones. Or even list elements/degrees of granularity and have precisely 1 use case justifying each. [16:49:31] Ironholds: ok, want to have a three-way then? [16:49:40] milimetric: read that again. [16:49:48] Ironholds: All use-cases! :-D [16:49:50] oh, intended [16:49:54] *snorts* [16:50:20] I'm totally down to have a joint meeting; I prefer email threads/talk pages but google hangouts are certainly faster. [16:50:23] ok, so after your meeting i'll be full of food too so that'll work - see you both then [16:50:35] Ironholds / qchris: i think let's do IRC for this [16:50:35] this aligns nicely with tnegrin telling me we could steal your expertise at making use cases sensible :D [16:50:39] okay! [16:50:39] that way we have some log [16:50:43] yay! [16:50:54] but i think we need a bit of real-timeyness for my purposes [16:51:07] What ... meeting ... when? [16:51:54] I do not like that we're again switching medium. :-((((((((((((((((((((((((((((((((((((((((((((( [16:52:40] milimetric, mforns ; these are couple benchmarks of backfilling of pages created on staging now: http://www.mediawiki.org/wiki/Analytics/Editor_Engagement_Vital_Signs/Backfilling#Pages_created [16:53:32] that's good [16:54:51] qchris: just to catch me up on what's going on and for you guys to get my perspective. The current medium isn't oriented towards my goals at all [16:55:34] Neither is it to mine or anyone else's ;-) [16:55:41] and I'm happy to just talk to Oliver, and I promise to track anything useful on wiki [16:56:06] Then ... let's do it on the talk page right away. [16:56:18] milimetric, mforns: will have breakfast #2 and review marcel's code [16:56:28] Google Hangouts are not viewable by the public. [16:56:38] qchris: I don't want a hangout [16:56:44] i suggested IRC [16:56:50] my bad. sorry. [16:56:51] talk page isn't real-time [16:57:54] nuria: ok! 
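The "wikimetrics-format" files milimetric describes above (one JSON file per project, holding daily counts) might look roughly like the sketch below; the exact keys dashiki expects are not spelled out in this log, so the shape, file name, and numbers are purely illustrative placeholders:

```python
import json

# Illustrative only: daily pageview totals for one project, keyed by date.
# The real field layout is whatever dashiki's wikimetrics converter expects.
daily_counts = {
    "2014-10-21": 100,  # placeholder numbers
    "2014-10-22": 110,
    "2014-10-23": 120,
}

with open("enwiki.json", "w") as f:
    json.dump({"enwiki": daily_counts}, f, indent=2, sort_keys=True)
```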
[16:58:59] (PS13) Mforns: Add ability to global query a user's wikis [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl) [17:06:56] Analytics / Wikimetrics: Can not delete tagged cohorts - https://bugzilla.wikimedia.org/72434 (Marcel Ruiz Forns) NEW p:Unprio s:normal a:None When a cohort is tagged and I try to delete it, the following flash message appears: "Error! Wikimetrics is experiencing problems. Visit the Suppor... [17:11:47] qchris: just learned of this (it is included in cdh 5.2???): http://kitesdk.org/docs/current/guide/Kite-Data-Module-Overview/ [17:11:59] just interesting to be aware of. not suggesting we use it [17:13:05] * qchris reads [17:15:01] Nice. [17:23:16] (PS1) Yurik: clone script, overwrite error, mem error [analytics/zero-sms] - https://gerrit.wikimedia.org/r/168333 [17:24:13] (CR) Yurik: [C: 2 V: 2] clone script, overwrite error, mem error [analytics/zero-sms] - https://gerrit.wikimedia.org/r/168333 (owner: Yurik) [17:27:31] mforns, milimetric : going to coffee shop, will be back in a bit [17:32:29] kevinator: for the purpose of vital signs, does a "mobile pageview" mean "mobile site" or "mobile + zero + app"? [17:32:51] I would say mobile only [17:33:05] k, thx [17:33:34] right now, zero and app are so small they can be lumped as ‘other’ and people can investigate them in a cube [17:34:58] hokay [17:35:07] so: I'm back. For an hour (I have to go to lunch after that. bah!) [17:36:08] milimetric, qchris, lets talk what we do with use cases! [17:36:14] k [17:36:20] k [17:36:27] milimetric, I have heard tnegrin waxing lyrical about your ability to make things understandable by Engineers and !Engineers. I nominate you to lead ;p [17:36:36] lol [17:36:45] ok, i'll give my perspective on the definition page [17:36:59] to me, I see the same breakdown you start with [17:37:18] but I think "tags" are dimensions and use cases can be implemented by new tags beyond the ones you have [17:37:38] so - step 1 - filter out everything that's *not* a pageview, same as you [17:37:45] * Ironholds nods [17:37:52] aha [17:37:55] then, figure out what use cases we care about right away [17:38:07] so you'd implement the filtering, pipe the result into wmf_raw.pageviews or something [17:38:14] and then individual solutions for different use cases? [17:38:16] then, step 2 - tag the resulting set to make the first version of the pageview definition [17:38:21] yep [17:38:26] I think that makes a lot of sense [17:38:40] it's also presumably what we'd have to do anyway even if we just had high-level metric + webstatscollector replacement. [17:38:47] since one requires a high level of granularity, one doesn't. [17:39:06] right, but we can from the start acknowledge that new dimensions (types of tags) are coming soon [17:39:08] "soon" [17:39:14] yup [17:39:33] so, the thing then is what is "the use case we care about right away" [17:39:34] I agree that tagging as a set of dimensions makes sense, too; it is possible to have a request that is both zero and an app request, for example. [17:39:38] * Ironholds nods [17:39:58] if that's the approach we're taking, I would make one change to the general filtering [17:40:08] which is, I'd include geolocation and ua-parsing at that stage too. [17:40:50] okay. So, in that case, I would suggest step 1: implement generalised filters. [17:41:17] Step 2, the high-priority use cases for me, at least, are (1) mobile web/app/zero/desktop breakdown per hour, ignoring country or spiders [17:41:43] and (2) maintenance scripts. 
"grab me a random selection of unique user agents that aren't identified as spiders OR as a recognised browser so we can train the spider/automation detector" [17:42:15] ok, interesting, mine's a little different too [17:42:17] "grab me a random selection of {ip,user_agent,count(*)} tuples so we can investigate automated detection of unregistered spammers/DoS attempts" [17:42:24] oh? [17:42:29] and, qchris, any thoughts so far? [17:42:32] it would be - pageviews to "desktop" or "mobile" per day per project [17:42:40] where apparently, "mobile" does not include zero or app [17:42:56] "does not include" == "other", or "does not include" == "DELETE WHERE"? [17:42:59] Ironholds: I am reading along ... but it's too early for me to talk implementation. We need use cases :-/ [17:43:07] no deleting at all [17:43:12] * Ironholds nods [17:43:15] we're not talking impl though [17:43:21] qchris, so you'd advocate for: we need the use cases for the generalised filtering? [17:43:25] we just want to figure out what use cases we should do first [17:43:26] filters and stuff ... that's implementation. [17:43:28] even if tagging is a different step? [17:44:03] use-cases will shape all of that. [17:44:14] Agreed. See Nemo surfacing the need to include 404s, for example. [17:45:11] I guess, the only two use cases we have listed which would make a difference to the filtering are things that pertain to spider-handling and what we do with multimedia files. [17:45:27] (If you want to play devil's advocate and can think of mutations to the generalised filtering that would result from use case changes, speak!) [17:45:50] but Ironholds are the use cases you listed tied to things we have to get done and make public this quarter? [17:46:07] kevin is putting the use cases together -- I pinged him to join this thread [17:46:45] he is? [17:46:50] bah ;p [17:46:58] well, we're all putting the use cases together :) [17:47:03] milimetric, not to my knowledge. Honestly, though, I'm not sure /what/ immediate use cases we have. [17:47:13] My deliverable is the definition. Circular, I know ;p [17:47:21] the one i mentioned is immediate, i'll get fired if we don't get it done in the next month or so [17:47:52] no firing dan! [17:48:06] right - the definition so far is really great imo [17:48:07] milimetric: Last time we talked through the VS pageviews with tn egrin, the outcome was to move forward with the current webstatscollector pageview definition [17:48:10] did that changes? [17:48:23] apparently, yes [17:48:37] (did you put a space in his nick on purpose qchris?) :) [17:48:39] I don't think it changing is a problem. I mean, it being a high priority use case is useful to focus the definition [17:48:40] * qchris is pzzled. [17:48:57] at the moment a lot of the priorities around different use cases are, in my mind at least, fairly fuzzy. [17:48:59] milimetric: yes. on purpose. I suspected a different answer. [17:49:08] I know that we want at some undefined point in some undefined order ALL OF THESE THINGS. [17:49:31] so i was saying - definition looks great but it's more of a definition of the space of available dimensions than of what we should do first and how to move from that initial definition to implementing the next few use cases [17:49:47] right, exactly [17:50:01] and the current wiki page does a great job of not backing down from that [17:50:06] heh [17:50:47] Okay, so then: the first use case to solve for is a per-hour(?) breakdown of pageviews, by site, by desktop/mobile web/{mobile app/zero}? 
[17:51:16] that would require, after generalised filters: remove 404s, apply mobile web/mobile app/zero filters, tag remainder as desktop, group by. [17:51:28] if we want to leave the spiders question aside for this one. [17:52:46] well - things like "remove 404s" to me, so we don't make this single-purpose, just mean, have a dimension that has "response type" grouped in "found" and "not found" and potentially more later [17:53:25] mforns: did you submitted the puppet changes too for CR? [17:53:42] nuria: yes [17:53:57] do you want me to put you as a reviewer? [17:54:09] milimetric, I can't parse that [17:54:14] one sec :) [17:54:23] kevinator: do you have any guidance on the use cases for the pageview definition? [17:54:33] ok, Ironholds, so what I mean is: [17:54:48] the "definition" of the pageview could be split up into: [17:55:02] one definition for what to filter out - this would be stuff that nobody would ever want [17:55:24] yup [17:55:28] that's the generalised filter [17:55:33] then N definitions, one per use case - this would be how do we use the set of available dimensions to answer questions q1,q2,...,qn? [17:55:46] and so we wouldn't throw away 404s [17:55:50] mforns: yes please [17:56:12] we would just abstract 404, 4XX into a value for a dimension [17:56:30] that same dimension would have like 500s grouped somewhere, 403s somewhere else, 30X somewhere else, etc. [17:56:42] milimetric, yes! [17:56:46] nuria: should I add andrew, too? [17:56:48] although we only want 200s and 404s [17:56:59] I was saying; for your specific use case [17:57:05] well, we need whatever granularity makes sense for us to answer our q1 -> qn [17:57:14] nuria: btw, you are there [17:57:14] yes, for me, yes [17:57:15] suppose we apply generalised filters, the results go into wmf_raw.webrequests [17:57:20] mforns: yes, please, otomata reviews all our puppet stuff [17:57:20] bah. wmf_raw.pageviews [17:57:33] we need like {x | x <- response codes and kevinator cares about x} [17:57:35] to answer your question, we'd need to take wmf_raw.pageviews, excluded 404s, apply... [17:57:43] nuria: sorry for having forgotten that... [17:57:59] so yeah, we'd still keep 404s (we need those for other questions) I was just sort of mentally going through the delta between the generalised filters' output and your question. [17:58:55] We're getting knee-deep into implementation with this generalized filters stuff ... but is there a single point of information that no one would care about? [17:59:06] If so, let's drop it from the logs ... no generalized filter. [18:00:15] I can see value in each of the columns that we have. [18:00:32] I guess others would see value too. [18:00:46] So let's step back from implementation details as which filter to apply when. [18:01:05] There are use-cases to build and flesh out. [18:02:29] I can see value in all of the columns, from an operational or research POV. [18:02:42] but there are values within those columns (and columns) that I cannot see being useful for any definition of 'pageviews'. [18:02:45] I agree with the use cases point [18:03:05] we need to agree on what we are trying to do -- I've asked Kevin to work on this [18:03:12] good! [18:03:28] does this also mean he has the responsibility of provoking these conversations? :P [18:03:35] yes [18:03:38] grand [18:04:09] in that case, if qchris wants to wait on use cases, I say we spike this until such use cases are available. We can dig into the spiders/automata question and handle general commentary in the meantime. 
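Rather than dropping 404s in a generalised filter, the idea above is to keep them and expose a coarse "response type" dimension that individual use cases can group or filter on. A small sketch of such a mapping; the bucket names are illustrative, not an agreed taxonomy:

```python
def response_type(status):
    """Map a raw HTTP status code to a coarse 'response type' dimension value."""
    status = int(status)
    if status == 200:
        return "ok"
    if status == 404:
        return "not_found"
    if 300 <= status < 400:
        return "redirect"
    if status in (401, 403):
        return "denied"
    if 400 <= status < 500:
        return "other_client_error"
    if 500 <= status < 600:
        return "server_error"
    return "other"
```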
[18:04:18] and I can work on the unique clients/session analysis work [18:04:52] fair enough - and I guess we can use the current webstatscollector data for Vital Signs [18:05:03] I haven't been following this discussion closely so I can't say what we should be working on [18:05:22] well, it sounds at the least like we're blocked by a use case definition from kevinator [18:05:34] tnegrin: Some days ago, it was said that Vital Signs could use the current webstatscollector data to start with. [18:05:38] does that still hold true? [18:05:39] that was our agreement -- it should be on the list, but we should use the WSC data [18:05:54] Awesome. [18:06:03] my hope was that we could use that data to provide the project level data we use in VS [18:06:39] possibly by post processing the hourly files but that's an implementation detail [18:06:48] Totally. [18:06:48] yep, that's how we'd have to do it [18:07:11] So milimetric ... I guess that unblokcs you pageview-definition-wise. Does it? [18:07:13] make it happen :) [18:07:13] so that's my next question to qchris: if he sees anything stopping me from working on re-shaping those hourly files into wikimetrics json output [18:07:37] well, pageview-definition-wise no, because i have the weight of the world on my shoulders with the damn pageview bug :) [18:07:57] Just RESOLVE/WONTFIX :-) [18:08:04] noooooooo [18:08:07] ;) [18:08:09] We're going to Phabricator ... let's not migrate that bug. [18:08:16] :) [18:08:19] heh -- you've done this before [18:08:35] no, i love that bug [18:08:35] one day -- we're building that database and closing that bug [18:08:48] i'll go down fighting for that one [18:09:01] So milimetric about the massaging of webstatscollector data. [18:09:05] and it implies all the dimensions that we can only get out of the real definition [18:09:11] yes - sorry - real work [18:09:17] The thing is that the current webstatscollector Hive implementation will die. [18:09:25] uh oh [18:09:31] So whatever we build, needs to switch at some point. [18:09:39] Hive is harder to maintain than other things. [18:09:40] when / why will it die? [18:09:44] okay, so what I'm hearing is; wait on kevinator for pageviews but not for the VS implementation? [18:09:45] I’m back in front of my screen [18:09:48] Cool. [18:09:50] yes Ironholds [18:09:51] It will die when we have a good new pageview definition. [18:09:57] I've gotta go out for lunch. *waves* back in a bit! [18:10:05] qchris: oh! 
that's fine [18:10:09] * qchris waves at Ironholds [18:10:14] bon apetit good sir [18:10:39] 1 sec [18:10:49] I have a lot of reading to catch up on :-) [18:11:19] kevinator: good reading, but tl;dr; is you gotta drive the use cases for the pageview definition [18:11:30] with Ironholds [18:11:38] and we're here to help and brain bounce [18:12:07] and i was suggesting that the set of all use cases lend themselves well to dimensional DW type modeling [18:12:12] exactly [18:12:12] (data warehouse) [18:12:49] there are a lot of details around 404s and other responses that we haven't considered currently [18:13:31] right, but really all we care about is - how do we want to essentially "digitize" the high resolution data into values of dimensions we care about [18:13:46] agree [18:14:00] I think the dimensional model is a great way to think about the use cases [18:14:13] it's been driving kevin and my discussions [18:14:37] So Oliver already documented some use cases: https://meta.wikimedia.org/wiki/Research:Page_view#Primary_use_cases [18:15:33] I can add a some [18:16:21] kevinator: I think those are more general than we'd like use cases to be for our purpose [18:16:32] I agree -- they are not stories [18:16:51] I'd suggest something like "what are the hot articles in each project?" [18:17:04] kevinator: "Daily / Monthly by [18:17:05] - Total [18:17:06] - Project [18:17:08] - Country [18:17:09] - Type (Desktop / Mobile / Spider / App) [18:17:10] - Device (UA Parser) [18:17:11] - Page" [18:17:13] I think this was the format we discussed [18:17:13] or "what is the image view count per category?" [18:17:39] milimetric: is a matrix with priorities a good model? [18:17:41] tnegrin: did you cut and past that from one of my docs? I thought i hadn’t shared that [18:17:59] oops [18:18:08] it's in the spreadsheet we were using in the tasking meeting [18:18:30] i think prioritized use cases are great, but most importantly i think they have to be concrete [18:18:39] agree [18:18:55] and they don't have to get into details about how we would determine them from the data, just the question is enough in most cases I think [18:19:03] like, "Type" above might be a couple of dimensions [18:19:15] how so? [18:19:26] well, maybe zero requests can come from apps [18:19:30] I don't really know how to think about this [18:19:32] I think the primary use case is for Vital signs and to give us a sense if the use of a project is growing or shrinking. 
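A very rough sketch of the Desktop / Mobile / Zero / App split that the "Type" dimension above refers to. The host and user-agent conventions used here (a ".m." host segment for the mobile site, ".zero." for Wikipedia Zero, "WikipediaApp" in app user agents) are assumptions for illustration, and checking app before zero is just one arbitrary way to resolve the overlap the discussion worries about:

```python
def access_method(host, user_agent):
    """Classify one request into a coarse access-method dimension value."""
    if "WikipediaApp" in user_agent:
        # Checked first: one arbitrary resolution of the zero/app overlap.
        return "app"
    if ".zero." in host:
        return "zero"
    if ".m." in host:
        return "mobile web"
    return "desktop"
```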
[18:19:51] so top priorities are total and then by project [18:19:54] yeah -- but these use cases are for the final PV def -- we're just using what we have for VS [18:20:17] we already have enough in the page views file the qchris and ottomata made from hive [18:20:27] so can we meet those [18:20:39] but we need solid use cases to implement the final version [18:21:00] right, i think it's ok if the top use case for the final version is *also* vital signs though [18:21:08] fair [18:21:12] and also ok that the requirements for it are the same [18:21:17] The second use case is in a BI tool where we can cut up by country, target site, device [18:21:26] right [18:21:38] or a queryable database :) [18:21:50] a DW [18:22:36] but this isn't really a story right -- a story would deal more with the metrics/dimensions [18:23:02] "as a user, I want to see a breakdown of page views by type of request", et al [18:23:07] well - the use case is - "vital signs cares about daily pageview counts by project" [18:23:17] and "vital signs cares to see a breakdown of mobile and desktop pageviews" [18:23:21] yeah [18:23:33] with a little more definition about what "mobile" and "desktop" mean - as in, not API, not Zero, etc. [18:25:36] The next story would be something like: [18:26:22] as a product manager, I can look at pageviews by browser on mobile [18:27:14] for Zero, i can look at pageviews for the countries with Zero carriers [18:27:18] that works -- with some more detail about what that means [18:27:42] I think all the dimensions are specified here. [18:27:48] in the details I mean [18:28:10] Looking at pageviews per page is another story [18:28:11] where? [18:28:32] I need to write these stories down in a wiki [18:28:39] ok [18:31:05] kevinator: I suggest working with Ironholds to move the current use cases to some more "use case inspirations" type section and then hammering out brief and specific use cases there [18:31:05] that way there's a one-stop [18:31:53] Ok… I’m starting the doc now and will talk to Ironholds when he’s back from lunch [18:33:08] *one-stop shop for pageview-definition groking i mean, sorry got dc-ed [18:33:12] cool [18:45:14] (CR) Nuria: "I think there are grants missing for wikimetrics user on centralauth as the app returns" (7 comments) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl) [18:46:39] (CR) Nuria: "Also two tests were failing for me:" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl) [18:50:05] ok nuria__ & qchris & milimetric have a tie-break situation [18:50:27] goal: get daily pageviews by project in Vital Signs [18:50:36] with a mobile / desktop / total breakdown [18:50:51] milimetric qchris : aham [18:51:21] approach 1: wikimetrics gets a new PageViews metric that looks at the current projectcounts hourly files and computes the kind of json output that dashiki already understands [18:51:55] approach 2: Hive + Oozie job creates TSVs with the desired data, aggregated and mapped how dashiki expects it, but Dashiki implements a new formatter that takes in TSV data [18:52:19] mforns ^ [18:52:29] oi! [18:52:46] mforns: Do you have an opinion on ^? [18:53:52] mforns / nuria__ : take your time, no rush, qchris and I have been bikeshedding for a while already [18:54:01] ok [18:54:19] approach 1 assumes other files than the current ones, right? 
as you would need daily aggregation per project regardless of file format (so as to 1 make 1 request to show pageviews for eswiki for the year at a daily granularity) [18:54:23] milimetric: dashiki currently gets all its datasets from wikimetrics, right? [18:54:31] qchris: yes [18:54:33] yep [18:54:38] Ok. [18:54:47] qchris: but not from the db, from teh static mount [18:54:49] nuria__: no, approach 1 would aggregate those files by itself [18:54:49] *the [18:54:53] in wikimetrics [18:55:14] milimetric: then approach 1 is not feasible, showing 30 days of data implies 30 requests [18:55:31] milimetric: i see [18:55:56] milimetric, qchris: not so fond of turning wikimetrics in a swiss army knife of file agreggation [18:56:31] but wasn't the vision at some point that wikimetrics is a general queryable data source? [18:56:40] nuria__: that would be throw-away logic, yes, because we'd move to the DW to get pageview information within a few months anyway [18:57:00] but, on the other hand, the pageview metric itself would remain and might be useful in general [18:57:05] (implementation aside) [18:57:05] where do these project counts hourly files come from? [18:57:20] http://dumps.wikimedia.org/other/pagecounts-all-sites/2014/2014-10/ [18:57:23] They fall out of Hive. [18:57:24] if you scroll all the way down [18:57:36] milimetric: that newer files exist doesn't preclude from current ones being there too [18:58:04] yes, this is true [18:58:09] qchris: no, i do not see wikimetrics as a general queryable data source, that is looking more [18:58:16] qchris: like the DW [18:58:20] detecting when the dataset is "ready" would re-implement the trickiest part of Oozie in wikimetrics [18:58:32] qchris: the feeling is that wikimetrics will write its metrics to the DW [18:58:34] I think qchris meant "querying" not "queryable" [18:58:37] so I believe approach 1 has one middle step (wikimetrics), right? [18:58:58] maybe appr2 is simpler [18:59:22] i vote for appr2 as well, but that puts more work on qchris [18:59:37] so qchris: we support and help you in any way we can for that extra work [18:59:40] milimetric, qchris: we can all chime into that work [18:59:54] milimetric: that should not be the deciding factor [19:00:05] no, i know [19:00:10] but it sounds like 3 for appr.2 [19:00:24] To me the issue with approach 2 is that there is no clear place where to put the files. [19:00:27] And also [19:00:48] Hive does not know about the projectcounts output files. [19:00:48] qchris: can we put them on http://datasets.wikimedia.org/ [19:01:02] qchris: why not http://dumps.wikimedia.org/other/pagecounts-all-projects/2014/2014-10/ [19:01:22] cause one thing is "sites" another is "projects" [19:01:27] milimetric: like a fully fresh dataset that we have to maintain. That means more daily work on my shoulders. [19:01:44] nuria__: Because that directory has all the idiosyncrasies of webstatscollector. [19:02:09] yep, but that dataset is coming soon with the "real" pageviews anyway, and this one will be thrown away. So I don't think it's extra maintenance, just sooner than we would normally get it [19:02:29] to nuria's point though - how would we ever do "make sure one day of data is fully available" in wikimetrics? [19:02:44] whereas Oozie has your awesome OK flag [19:02:45] The cluster has daily hiccups that I need to document. 1 more dataset ... 1 more place where I need to update them. [19:02:56] qchris, milimetric: and the reality is that we need teh dataset [19:03:17] No done flag for the projectcounts files. 
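With no done-flag for the projectcounts files, "day X is ready" has to be inferred from the hourly files themselves — all 24 present — which is also what the shell check later in the discussion does with `wc -l`. A sketch, with the directory layout taken from the paths quoted in this channel and the file-name pattern assumed:

```python
import glob
import os.path

ARCHIVE = "/mnt/hdfs/wmf/data/archive/webstats"  # path as quoted in the channel

def day_is_ready(day):
    """True if an hourly projectcounts file exists for each of the 24 hours.

    `day` is a datetime.date.  The file naming (projectcounts-YYYYMMDD-HH...)
    is inferred from the wildcards quoted in this discussion.
    """
    directory = os.path.join(ARCHIVE, day.strftime("%Y"), day.strftime("%Y-%m"))
    return all(
        glob.glob(os.path.join(directory,
                               "projectcounts-%s-%02d*" % (day.strftime("%Y%m%d"), hour)))
        for hour in range(24)
    )
```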
[19:03:22] qchris: I feel that pain, and we should really help you with that [19:03:26] sure it is more work but it is because we have more functionality [19:03:41] but if we have N datasets now, the equation is N + 1 + 1 - 1 = N + 1 [19:03:55] qchris: now, it is valid point to say: we need to make the cluster more solid [19:04:05] qchris: in order to support a second dataset [19:04:21] agreed - or at the very least get you help in adding hiccup documentaion [19:04:27] oozie can not generate files in dashiki json format? [19:04:47] mforns: we even have to glue the tsvs together by hand. [19:04:52] qchris: i can get behind a better cluster before adding more work to it [19:04:52] oozie would just be the scheduler / workflow tool, but yes, we could schedule scripts that would do that [19:05:02] nuria__: +1 [19:05:12] ok [19:05:22] Besides ... Oozie/Hive feels like a massive overkill for a few KB files. [19:05:43] qchris: for pageview data per project? [19:05:48] but qchris, my point from above: how does wikimetrics know "day X is ready"? [19:06:06] milimetric: If all 24 hourly files exist ... its ready. [19:06:12] qchris, milimetric: what other dataset is there that management cares more about (besides edits) [19:06:56] qchris, milimetric: I do not think wikimetrics should be in the business of concatenating files [19:07:15] agree [19:07:18] Don't call it concatenating files ... it's generating a report. [19:07:26] That's what wikimetrics is used for. [19:07:27] https://github.com/wikimedia/analytics-wikimetrics/blob/master/wikimetrics/api/file_manager.py#L159 [19:07:28] :) [19:07:33] qchris, milimetric: i could see how if dashiki had a backend that backend would do it but not wikimetrics, which is cohort centric [19:07:45] not a "general purpose tool" [19:07:46] yeah - the cohort centric part bothered me too [19:07:55] Ok. Approcah 2 it is. [19:07:59] that's why i initially suggested approach 1 but moved to qchris's approach [19:08:12] :-( [19:08:17] xD [19:08:32] there is approach #3, and it's adding a backend to dashiki [19:08:34] qchris: another bad thing - having "pageviews" in a tool meant to track user activity might be a very bad thing [19:08:39] which would be SAD [19:08:51] like running up behind someone to give them back something that fell and trying to explain you're not trying to murder them [19:09:05] nuria__: :) [19:09:56] qchris, milimetric: the file_manager manipulates data that comes from a cohort+metric run, not generic data at all, so really not a fit [19:10:28] i know :) I was just being a smartass 'cause that function concatenates files [19:10:33] cohort = enwiki project (not enwiki editors). metric = pageviews. Perfect fit. [19:10:51] qchris: now i get your point that the hicups make it more work taht it seems [19:11:00] But let's move forward with Approach 2. [19:11:00] qchris: but we'd have to exclude "PageViews" from static cohort reports [19:11:08] which would be hard from a UX point of view [19:11:14] like - hey - how come this metric disappeared? [19:11:20] qchris: cohort=editors (user ids , not readers) [19:11:38] oh! [19:11:40] oh! [19:11:45] wikimetrics wouldn't work [19:11:46] duh [19:11:47] my bad [19:12:04] nuria__: wikimetrics has per project cohorts already. 
[19:12:05] because metrics always return individual results [19:12:20] the aggregation happens above the metric, in the AggregateReportNode [19:12:25] qchris: for readers, with user_ids [19:12:35] it wouldn't work, it's a moot point [19:12:37] I'm so sorry [19:12:39] :( [19:12:47] milimetric: why wouldn't it? [19:13:00] we'd have to either massively hack it to return aggregate results or completely rewrite aggregation [19:13:12] We're beating a dead horse. Let me say it a third time ... let's move forward with approach 2. [19:13:14] because the shape of a metric output is { user_id : metric } [19:13:18] ah ya, sure [19:13:36] qchris: okeissssss [19:13:36] no i know qchris, i'm just saying - mea culpa, and for future reference - this is useful to keep in mind [19:13:46] qchris: will not say one more word [19:14:24] qchris: i lied, one more word: please add me to those CRs if you do it so I know how does that work [19:14:26] but yea, we should at some point revisit the { user_id : metric } approach [19:14:46] qchris: feel free to assign tha work to me [19:14:52] it's definitely dev work [19:15:08] yes, add me too, just to follow, please [19:15:22] qchris: agreed, that might be a more even split of the work as you would only need to do initial support [19:15:27] guys you can add yourselves to CRs automatically - in the refinery project settings in gerrit [19:15:40] ok [19:16:02] ok let's leave poor christian alone - it's like really late there [19:16:20] Hahaha. I'll be around for a bit ;-) [19:16:28] Gute Nacht qchris [19:16:39] Gute Nacht nuria__ :-) [19:16:46] bye! [19:35:27] ottomata: Got some time to bikeshed around refinery / pagecounts_all_sites? [19:38:18] yes [19:38:24] Awesome! [19:38:28] So we have the pagecounts files partitioned by hour in Hive. [19:38:31] For each wiki, we need a file holding daily aggregates for that wiki per site. [19:38:38] Hive is pretty much out of the game since (~800 wikis) we're talking ~800 output files daily here ... so ~800 queries. [19:38:43] But! The going over the 24 hourly projectcounts (a few KB each) file each day with, say a Python script, [19:38:46] and updating the ~800 files with that data would be pretty straight forward. [19:38:49] It need not even use the cluster but could feed straight from stat1002's /mnt/hdfs and push to [19:38:53] $SOME_WELL_KNOWN_PLACE or a git repo. [19:38:56] Eventually, those ~800 files need to be accessible from some URL. [19:38:59] The files are fully public. [19:39:04] Is such a script something that would pass your Code-Review, or does that feel wrong from the start? [19:39:52] qchris, aggreate pagecount or projectcount? [19:40:17] aggregate projectcounts is sufficient. [19:40:29] k you first said pagecount so i was confused [19:40:31] But we could re-aggregate pagecounts too,. [19:40:46] projectcounts is just an aggregate of pagecounts. [19:40:56] yes, projectcount files are much smaller [19:41:04] Right. [19:41:47] so [19:42:04] you need one file per wiki, with daily pageviews to that wiki as rows [19:42:06] in that file? [19:42:15] yes. [19:42:28] Or one big json blob in Vital-Signs format. [19:42:35] (But that would be less accessible to the community) [19:42:39] ha, aye, k [19:42:54] um, yeah i think that's fine, especially if you are just parsing the last 24 projectcount files each time [19:42:58] and you really just want to append to the file, right? [19:43:02] Right. [19:43:40] Ok. Thanks. [19:43:57] ja sure, whatever, sounds good to me [19:43:59] Is stats user on stat1002 a good place to run it? 
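A minimal sketch of the aggregation script qchris outlines above: read the day's hourly projectcounts files from /mnt/hdfs, sum per project code, and append one row per project. The column positions (project code in field 1, view count in field 3) follow the awk one-liner that comes up shortly after; the output location is a placeholder, and the mapping from codes like en.m / en.d onto enwiki / enwiktionary is deliberately left out:

```python
import glob
import os.path
from collections import defaultdict

ARCHIVE = "/mnt/hdfs/wmf/data/archive/webstats"   # path quoted in the channel
OUTPUT = "/tmp/projectcounts-daily"               # placeholder destination

def aggregate_day(day):
    """Append one 'YYYY-MM-DD<TAB>count' row per project code for a given date.

    `day` is a datetime.date.
    """
    directory = os.path.join(ARCHIVE, day.strftime("%Y"), day.strftime("%Y-%m"))
    pattern = os.path.join(directory, "projectcounts-%s-*" % day.strftime("%Y%m%d"))
    totals = defaultdict(int)
    for path in sorted(glob.glob(pattern)):
        with open(path) as hourly:
            for line in hourly:
                fields = line.split()
                if len(fields) >= 3:
                    totals[fields[0]] += int(fields[2])
    for project, count in sorted(totals.items()):
        # One small file per project code; folding en.m / en.d into
        # enwiki / enwiktionary is intentionally not attempted here.
        with open(os.path.join(OUTPUT, project + ".tsv"), "a") as out:
            out.write("%s\t%d\n" % (day.isoformat(), count))
```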
[19:44:04] python script, read from file, append to file [19:44:08] yes [19:44:18] Perfect. [19:44:20] stat user crontab on stat1002 sounds right to me :) [19:44:30] Okey-Dokey. [19:44:48] Thanks. [19:44:54] milimetric: nuria__ mforns : i started writing down the use-cases / stories for pageviews. They are not perfect yet… [19:45:11] https://www.mediawiki.org/wiki/Analytics/Pageviews/Stories [19:45:19] qcrisPretendsZZZ: ^^ [19:47:29] Thanks kevinator_at_lun. [19:47:38] kevinator: ok [19:47:54] qcrisPretendsZZZ: 1. lol @ nickname, 2. wait, you're doing this in python on stat2? [19:47:55] why is it truncating my nick? [19:48:11] that's dev world qchris, I can do that! [19:48:26] i'll just add it to my personal cron and send it out to datasets. [19:49:47] milimetric: But I guess we want thaht puppetized and everything, wouldn't we? [19:50:02] qcrisPretendsZZZ: YES [19:50:17] qcrisPretendsZZZ: in case it was not clear . YES TO PUPPET [19:50:33] * milimetric walks slowly away from puppet work [19:50:37] :-D [19:50:42] let's please please not have scripts running under our own users that break when someone is on vacation [19:50:59] I mean ... if you prefer to have it run under your own user ... fine by me. [19:51:03] nono [19:51:06] puppet is right [19:51:18] the main meaning was - this feels a lot like we dumped dev work on christian [19:51:30] if it doesn't even involve oozie, we should do it [19:51:34] :-D Meh. [19:51:34] own user crons over my dead body [19:51:49] Meh rejected [19:51:54] Hahaha. [19:52:03] are you working on it already? [19:52:33] No. I've just thought about the problem. [19:52:43] And figured I should run it by otto mata. [19:53:05] I've got three other things to finish before I'd start. [19:53:21] ok qcrisPretendsZZZ let's talk tomorrow at standup [19:53:31] i'm going to do the prep work in dashiki to handle multiple file servers [19:53:36] Ok. [19:53:46] milimetric: yeah, do it! [19:53:53] i or qchris can deal with puppetization [19:53:57] k [19:54:05] you want it to live in refinery? [19:54:08] you just get a script going for it...uhhhhh, where is the right place to commit it? [19:54:08] I so hope that "or" short-circuits :-) [19:54:09] i guess so? [19:54:24] It's not bound to refinery. [19:54:30] What about a separate repo? [19:54:37] They're cheap. [19:54:37] mehhh [19:54:40] shitty-python-hacks? [19:54:44] or what about dashiki? [19:54:46] not sure which [19:54:48] probably not dashiki [19:54:51] i think refinery is fine [19:54:52] definitely not dashiki [19:55:35] k, i'll just start it tomorrow and we can figure out where in refinery to put it on the CR [19:55:42] But I guess/hope we'd have a data-repo somewhere ... so we can see changes over time, not loose data etc. [19:59:11] milimetric: qchris_away: [19:59:12] cat /mnt/hdfs/wmf/data/archive/webstats/2014/2014-10projectcounts-20141023-* | awk '{arr[$1]+=$3} END {for (i in arr) {print i,arr[i]}}' | sort [19:59:16] done. [19:59:17] :) [19:59:27] sorry [19:59:27] cat /mnt/hdfs/wmf/data/archive/webstats/2014/2014-10/projectcounts-20141023-* | awk '{arr[$1]+=$3} END {for (i in arr) {print i,arr[i]}}' | sort [19:59:33] qchris_away: yes please repo. [19:59:51] well... 
we have to map from en.m => enwiki (mobile submetric) and en.d => enwiktionary [19:59:53] for cron at least [19:59:55] so it's not quite that easy [19:59:58] haha, ok, fancier stuff needed [20:00:08] nuria__: qchris is suggesting storing the output in a repo [20:00:09] which...meh [20:00:19] just stick it on a public webserver somewhere [20:00:20] ottomata: and also we'd have to make sure all the files are there before computing a day [20:00:22] or copy it back into hdfs [20:00:23] good enough for me [20:00:30] and check that we computed all the days and if not compute the old days we haven't [20:00:45] ottomata, qchris: sorry , iam cheering for crons/setup to be in a repo not output files [20:01:02] [ $(ls /mnt/hdfs/wmf/data/archive/webstats/2014/2014-10/projectcounts-20141023-* | wc -l) -eq 24 ] && ... [20:01:02] :) [20:01:03] ottomata: output files need to be served somewhere public [20:01:43] ja, put 'em back into the hdfs archive/ for redundancy? and then have them rsync to dumps.w [20:01:45] just like the other ones [20:01:55] projectcounts-all-sites-daily [20:01:56] whatever [20:04:39] okay, who knows things about the NavigationTiming_10076863 schema? [20:04:40] halfak? [20:05:05] Nope. don't know much. [20:05:13] I could help you reverse engineer it though if you want [20:05:24] I bet Ori knows about it. [20:07:33] Ironholds: what about it? [20:07:43] I understand the browser part of it if that's what you're curious about [20:08:13] I would really like to know if it contains a unique ID for the client as well as the event [20:08:20] if it doesn't I'm gonna be a sad trombone :( [20:08:36] not just make a sad trombone noise. BE a sad trombone. All I will be capable of is making sad trombone noises. [20:12:19] Ironholds: I don't know how to say this [20:12:48] NOOO [20:12:53] like, what's the most pleasant trombone material that human skin can be morphed into... [20:12:53] * Ironholds sad trombones :( [20:12:58] * Ironholds thinks [20:12:58] i heard of plastic ones? [20:12:59] brass? [20:13:01] plastic is nice... [20:13:07] you really shouldn't make a serious instrument out of anything but brass [20:13:08] https://meta.wikimedia.org/wiki/Schema:NavigationTiming [20:13:13] WHY [20:13:19] there's nothing I see on there that has any unique thingy per person [20:13:26] i mean, you have useragents.... [20:13:29] no? [20:13:49] useragents useless? [20:13:52] totally [20:14:04] not for tracking over >1 day and even 1 day requires IP to be anywhere useful. Aww. [20:14:11] I wanted to build up a desktop/mobile web session benchmark! [20:14:16] I made my code all OO and fancy to enable it! [20:14:28] i see [20:14:51] but yeah, none of our EL schemas have per-user tracking at all - probably for the same reason we don't have Uniques solved yet [20:15:25] bollocks! NT used to. [20:15:32] and the volume worked fine [20:16:04] wait, I tell a lie, that was ModuleStorage [20:16:32] same principle! [20:20:47] :P Was going to say ModuleStorage [20:21:26] oh, cool, didn't know [20:21:33] which last triggered on 20141018162911 [20:21:39] I think that's probably a legacy trigger. Womp womp :( [20:21:46] hmhm... [20:22:01] (PS1) Milimetric: Fix test and empty selection cases [analytics/dashiki] - https://gerrit.wikimedia.org/r/168414 [20:22:09] what do I have to bribe people to put together an interim LUCIDs solution for mobile web/desktop? [20:22:51] and by that I don't mean "explain to me how there's FAR TOO MUCH DATA for the system to take, before we've discussed how much data there would be". 
I mean: if we continue not having a desktop/mobile web solution, Product will continue doing silly things with our apps data. [20:24:14] Ironholds: you mean like real unique people tracking right, not halfak's date clever thing? [20:24:31] I mean like reissuing the ModuleStorage tokens every month [20:24:37] like, as simple as that. [20:24:48] take pre-existing schema and code that does /exactly what we want/, apply to tiny sample of clients. [20:25:06] every month, turf out the previous tiny sample, do.call(rinse_wash_repeat()) [20:25:08] Ironholds, can we not use session tokens for this? [20:25:24] bribes don't apply then, because it's not a technical problem [20:25:25] Those are stripped before they get to the requestlogs, I thought? [20:25:37] unless they're saved somewhere else? [20:26:10] Ironholds, they totally are stripped, but they don't need to be. [20:27:14] yeah, but I'm talking a quick-and-dirty solution [20:27:22] (also I don't know how long the expiry is set to on those tokens(?)) [20:35:12] ironhols: modulestorage is only defined for ie8 and up [20:35:31] ironholds: mostly desktop and newer mobile [20:35:53] Ironholds: so not really a feature deployed to the majority of mobile users [20:36:10] nuria__, indeed, I wasn't proposing literally using the same terms, but the general principle of "we could have a schema that does this, and we know that because we have a schema that did it" [20:36:45] Ironholds: w/o client side storage you are only left with cookies [20:36:57] indeedy! [20:37:29] Ironholds: Modulestorage was testing localstorage thus it made sense there 100%, it was a desktop oriented feature [20:37:51] yeah, I wasn't disputing "we'd need to use cookies" [20:37:55] I was saying "yep, we'd need to use cookies" [20:39:54] Ironholds: as milimetric said problem is not technical [20:40:41] well, there's clearly a technical element [20:40:52] when I proposed it last time I got back "the amount of data would make things fall over" [20:40:55] BTW, Ironholds we baby sit the nav timing so we own taht schema more or less [20:41:23] if our communities merged into a polite gentleman with a white hat, and this dude came up to us and was like - "no prob bro, do whatever", then we could have it done in a week [20:41:28] Ironhods: I think I should be able to answer your questions [20:42:09] Analytics / Tech community metrics: List of Phabricator users - https://bugzilla.wikimedia.org/35508#c19 (Ben B) (In reply to Antoine "hashar" Musso (WMF) from comment #7) > From Sumanah, tip about how to get a list of user is at : > https://groups.google.com/group/repo-discuss/msg/c426b6a83400b58e >... [20:42:26] milimetric, but again, this wasn't the initial objection. The initial objection was technical. [20:43:37] objection from whom? [20:44:46] Ironholds: ^ [20:44:55] At the time, I think nuria__ [20:44:57] I'll check my emails [20:45:22] Ironghols: on user EL yes, but there are many other technical solutions [20:45:27] Ironholds: sorry [20:45:47] Ironholds: *on using EL, sure, plenty of technical objections [20:45:58] Ironholds: but there are other stech olutions [20:46:03] but none that wouldn't require us to build out a load of stuff [20:46:15] thus opening the "our time is better invested in a long-term solution" argument [20:46:34] and that's the point at which we return to the status quo, which is "apps just went ahead and did it and now they want tons of contextless data" [20:46:47] Ironholds: your time ? 
as dev time to enhance EL to support sessions will be ample [20:47:12] Ironholds: as it was never designed for that purpose [20:47:14] but you just told me EL was not an appropriate solution, and so we should go for another tech solution [20:47:22] wait [20:47:29] for sampled analysis - EL si fine [20:47:30] *is [20:47:31] which is the same argument made last time this came up, and was followed with "our time is better invested in a long-term solution" [20:47:34] milimetric, I agree! [20:47:39] like very sampled [20:47:42] and very limited [20:47:50] but still not fine privacy wise [20:47:51] Ironholds, milimetric : for sample analysis of "events" [20:48:03] yeah, like NavigationTiming for example [20:48:09] if we want to track one person's experience with the site [20:48:20] okay, what percentage of clients got ModuleStorage tokens? [20:48:22] we'd have a much better understanding of how our changes are affecting performance at an individual level [20:48:24] let's use that amount. [20:48:40] The system has shown it can tolerate that number of uniques logging every page request. [20:48:47] it did that when we did...precisely that thing ;p [20:48:47] like, obviously, uniquely tracking users added on to any of our current schemas has value [20:49:00] but the point is - we are on uncertain grounds as to whether we can do that [20:49:12] personally, I think we should have an opt-out of event logging before we do such things [20:49:27] Ironholds: that was "sampled", for desktops [20:49:47] Ironholds: by no means for all users (was never the intention) or mobile or apps [20:50:00] Ironholds: makes sense? [20:50:42] Ironholds: also nav timing is not linked to page data [20:51:29] what? yes it is. [20:51:47] yeah, it's even got rev_id in some cases [20:51:51] and pageID [20:52:00] nuria__, yes, I don't want it for a ll users. At no point did I want it for all users. [20:52:08] ah sorry on teh capsule, you are very right [20:52:08] but isn't this a dead horse? [20:52:21] it's got rev_id on the schema itself [20:52:24] the precise line in the document in question, which I know you read because you commented on this precise line, was "Issues a unique token to browsers that do not have a unique token already set, making the decision on a probabilistic basis - say, a 0.002% chance that non-identified browsers will have a token provided;" [20:52:26] Yes, we already talked about this at length. [20:52:47] yes, but evidently that discussion was informed by the impression that I wanted to do it to everyone. [20:53:15] that is not the case. I can't imagine there is no delta in technical/privacy concerns between "do it for everyone" and "do it for a subset of users whose event count the system has already demonstrated it can tolerate" [20:53:31] anyway, I have my 1:1 with Dario. I'll try and get around it with fingerprinting for the time being [20:53:48] (which I actually find more privacy-problematic because it necessitates us keeping that data around. ew. But.) [20:53:54] hey Ironholds - did you get my message on the 1:1? 
[20:54:03] DarTar, nope [20:54:09] I was hoping we could cancel it / reschedule it [20:54:22] unless there’s something urgent you want to talk about [20:54:40] I’ve been in interviews and stuff the whole day and I need to get shit done [20:55:05] well, I'd like to talk about wtf is going on with any of the consumption metrics, but that may be a tnegrin conversation [20:55:06] (I thought I added a note to the calendar invite early this morning) [20:55:09] in the meantime, update: [20:55:41] I've been trying to kick people into contributing to the consumption metrics docs. Plus points to EZ, Q-Chris and Halfak, grr to ellery ;p [20:55:42] and - I’m finishing a pre-interview chat right now [20:55:51] other than that, app sessions are sucking in all my time [20:55:53] write away, will respond in a moment [20:55:57] Ironholds: privacy wise the sample rate is not relevant in my opinion. Though I agree keeping around user agents is very bad too. Our goals include scrubbing that [20:56:04] and have renewed my deep hatred of object-orientation [20:56:09] Ironholds, I'm getting to it. Lots of side-tracking today. [20:56:20] halfak, yeah, I was saying go you! You're one of the more responsive peeps [20:56:41] milimetric, all I want is a schema of is_mobile, timestamp_of_page_load and uniqueID [20:56:56] Ironholds: let's talk later today [20:56:58] Everything else is gravy [20:57:00] I'll send you an invite [20:57:04] tnegrin, we have our 1:1 Monday, we're probably good. [20:57:08] but if you have the time, sure. [20:57:12] I have time [20:57:26] cool [20:59:21] Ironholds: this is just my opinion, and it doesn't really count, but I (shamefully) think I make a good point. All I'd like is opt-out. It's not like we don't want to implement it, we just don't because we can't get out of our own way to prioritize it. [20:59:43] and failure to prioritize shouldn't impact privacy. If we're evil, fine, but let's not accidentally impact privacy [21:00:36] milimetric, I totally agree! I would also like an opt-out. [21:00:54] so, technically, I think that's the only blocker [21:00:59] I'm just currently caught between product and a hard place [21:00:59] as to how we do it - there's no problem [21:01:09] i know, i realize you feel that pain more than us [21:01:39] Ironholds: I think that is something to escalate to toby [21:01:55] he's listening :) [21:02:46] Ironhols, milimetric : right, in the short term, until we prioritize opt-out there is little we can do [21:06:08] makes sense [21:06:30] sorry for getting grumpy at y'all; like you say, nuria, it's a problem to push with the people who can make these decisions for certain, not the people who feed into them. [21:06:45] ottomata, nuria__: Around the repos for the pageview counting ... actually I was hoping to use both. A repo for the code. and another repo for the data. Just like the setup that erosen used for the projects he ran. That proofed to be pretty nifty in many occasions. [21:08:42] But I am fine with the one doing the coding to decide. And it seems [21:08:48] code will go into refinery. [21:10:00] qchris: but you wanted to store the data into git? [21:10:21] milimetric: Yes. That's pretty useful. [21:10:29] But If you manage persistence otherwise ... fine by me. [21:10:54] (Like you can clone locally easily and inspect changes over time... [21:11:12] ... automatically see if things go wrong in diffs ... and many other things.) [21:11:44] :) ideally i'd love to store all these datasets on wiki [21:11:56] That's fine too. 
[21:11:58] which is why i asked to be included in some upcoming discussions with wikidata [21:12:07] but for now, i think normal flat static files are fine [21:12:11] this is throw-away anyway [21:12:18] Sure. [21:12:36] I am not trying to get in the way. Just responding to the ping I received while being away. [21:34:26] (PS1) Milimetric: Add Separated Values converter [analytics/dashiki] - https://gerrit.wikimedia.org/r/168488 [21:34:42] nuria__: I added a simple CSV / TSV parser ^ [21:34:57] it's just WIP [21:35:26] i have to still add some type of factory that would read the new settings and figure out what converter to use, then swap that out in wikimetrics-visualizer [21:36:28] but I also fixed up a failing test and added a simple "empty selection" case here: https://gerrit.wikimedia.org/r/#/c/168414/1 [21:37:47] mforns: I also added you to this, so you can keep an eye on dashiki if you like [21:44:07] ok [21:44:43] thanks! [22:22:57] ok milimetric, will look later on today, i want to add some tests to UA UDF [23:42:55] Hey, is there a standard set of buckets for users by number of total edits (to date)? [23:43:11] If not I'm happy to invent my own but would prefer to follow best practice if it exists. :-) [23:47:43] James_F: you mean to classify editors as active/not? [23:48:11] James_F: if so we call an active editor one with 5 edits in a 30 day period [23:48:55] nuria__: For edit events; I'm currently thinking 0|0>x>=10|10>x>=100|100>x>=1000|1000>x>=10000|x>10000 [23:49:14] nuria__: Yes, but for lifetime saved edits, rather than recent activity. [23:50:18] James_F: you can see halfak's work on this regard and he might have a classification for editors that go beyond 5 edits a month [23:50:38] nuria__: I looked on meta and didn't find anything but I might have missed it. [23:50:43] * halfak reads scrollback. [23:51:35] You're right James. We don't have standard classes for edit counts. [23:51:41] James_F: But such a work exists, we normally classify it by activity on a timeperiod: http://meta.wikimedia.org/wiki/Research:Refining_the_definition_of_monthly_active_editors [23:51:42] Edit counts are a weird metric. What are you trying to measure? [23:51:56] halfak: Editing experience. [23:52:06] Sounds like the right metrics. [23:52:49] halfak: Well, it'll miss cross-wiki editors, but for the vast majority it feels roughly right. Orders of magnitude seems sane? [23:52:50] I have one that will get you slightly better measurement accuracy (probably) at the cost of not being able to do it all in SQL: https://meta.wikimedia.org/wiki/Research:Activity_session [23:53:09] Thanks! [23:53:30] James_F: If you're planning to just take a look, I think orders of magnitude are reasonable. [23:53:37] * James_F nods. [23:53:50] James_F: also have in mind that counting edits might include counting "reverted edits" so you have to look in more than 1 place to do it [23:53:59] If you find something interesting, you can always do a sensitivity analysis to see if the cutoffs were causing some artifact. [23:54:00] The advantage of raw edit count is that we have it trivially available on the client. [23:54:11] https://en.wikipedia.org/wiki/Sensitivity_analysis [23:54:18] I'm just trying to avoid joining user to event logging. :-) [23:54:41] In a production system or for an analysis? [23:54:44] Yeah, I doubt it'll show significantly divergent behaviour from expectations, but then, that's why they call them expectations. :-) [23:54:49] Production. 
[23:54:52] James_F: as long as you are aware that you might be counting productive + non productive edits [23:55:23] So long as you bucket 1 and 10 differently, you should be OK. [23:55:52] * James_F nods. [23:55:56] James_F: but for total number of edits you should not need EL [23:56:12] Editors who don't usually save productive edits don't usually make it to 10 edits. [23:56:26] nuria__: No, this is for the editing data workflow EL, I'm just adding this in for subsequent analysis in case it's a useful cut. [23:56:35] * James_F nods. [23:58:00] Out of curiosity, what feature is going to make use of this? [23:58:44] Editors. WT and VE in mobile/desktop/app. [23:58:57] And LQT and Flow and whatever else tools come along. [23:59:01] James_F: WT is ? [23:59:07] wikitext? [23:59:08] nuria__: The wikitext editor. [23:59:10] Yeah. [23:59:19] Gotcha. That, I think I can do better with. So, we're talking about deployments based on editing experience brackets? [23:59:31] No. [23:59:41] This isn't anything to do with A/B testing or deployments.
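Reading James_F's brackets above as half-open order-of-magnitude ranges (0, 1–9, 10–99, 100–999, 1000–9999, 10000+ — which also keeps 1 and 10 in separate buckets, as halfak suggests), a minimal sketch; the labels are made up:

```python
def edit_count_bucket(edit_count):
    """Bucket a lifetime edit count into order-of-magnitude experience brackets."""
    if edit_count == 0:
        return "0 edits"
    for upper in (10, 100, 1000, 10000):
        if edit_count < upper:
            return "%d-%d edits" % (upper // 10, upper - 1)
    return "10000+ edits"
```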