[00:14:57] (PS8) Nuria: Add UAParserUDF from kraken [analytics/refinery/source] - https://gerrit.wikimedia.org/r/166142 (owner: Ottomata) [00:15:40] (CR) Nuria: Add UAParserUDF from kraken (4 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/166142 (owner: Ottomata) [05:47:40] Analytics / Quarry: Quarry vulnerable to XSS exploit - https://bugzilla.wikimedia.org/72414 (PiRSquared17) NEW p:Unprio s:critic a:None Example: http://quarry.wmflabs.org/query/808 [06:00:39] (CR) Springle: [WIP] Add schema for edit fact table (2 comments) [analytics/data-warehouse] - https://gerrit.wikimedia.org/r/167839 (owner: QChris) [08:25:46] (PS2) QChris: [webstatscollector] Add Makefile [analytics/metrics] - https://gerrit.wikimedia.org/r/99077 [08:27:15] (CR) QChris: [V: 2] [webstatscollector] Add Makefile [analytics/metrics] - https://gerrit.wikimedia.org/r/99077 (owner: QChris) [11:57:13] This seems cool: https://github.com/fastos/fastsocket [11:57:48] Do we have any problems with throughput it could solve? [12:17:17] Not sure if we have problems it solves ... bit nice nonetheless :-) [12:33:43] (CR) QChris: Add UAParserUDF from kraken (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/166142 (owner: Ottomata) [13:16:08] (PS1) Yuvipanda: Fix XSS vulnerability [analytics/quarry/web] - https://gerrit.wikimedia.org/r/168278 [13:21:53] Analytics / Quarry: Quarry vulnerable to XSS exploit - https://bugzilla.wikimedia.org/72414#c1 (Yuvi Panda) NEW>RESO/FIX Fixed now. Thanks for reporting! [13:36:19] qchris: hiyayaaa [13:36:31] we good to move on with next ack=2 merge? [13:36:35] Heya Sir ottomata [13:37:07] Is it ok to wait until after scrum? [13:37:19] I want to check out the depooling of amssq42. [13:37:46] Not sure how that affects us ... maybe it's spare us 1 on 2 restarts ... not sure yet. [13:38:26] sure [14:12:40] ottomata: The ACK thing is good to go from my point of view: https://gerrit.wikimedia.org/r/#/c/167552/ [14:24:41] qchris, merged! [14:24:49] Awesome. [14:25:25] Too bad tomorrow is Friday and the final one has to wait longer. [14:26:07] Analytics / EventLogging: EventLogging needs process nanny alarm on hafnium - https://bugzilla.wikimedia.org/67309 (nuria) PATC>RESO/FIX [14:29:29] (PS11) Mforns: Add ability to global query a user's wikis [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl) [14:29:37] (CR) jenkins-bot: [V: -1] Add ability to global query a user's wikis [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl) [14:37:15] :) [14:37:18] moving locations, back in aibt [14:53:25] Analytics / Refinery: Raw webrequest text partition for 2014-10-22T15/1H not marked successful - https://bugzilla.wikimedia.org/72427 (christian) NEW p:Unprio s:normal a:None The text webrequest partition [1] for 2014-10-22T15/1H has not been marked successful. What happened? [1] _______... [14:53:38] Analytics / Refinery: Raw webrequest partitions that were not marked successful due to configuration updates - https://bugzilla.wikimedia.org/72300 (christian) [14:53:41] Analytics / Refinery: Raw webrequest text partition for 2014-10-22T15/1H not marked successful - https://bugzilla.wikimedia.org/72427#c1 (christian) NEW>RESO/FIX Commit e1c35ceb080d00d590e120dc7745dac34428de53 got merged, which updated the varnishkafka configuration for the text caches. This caus... 
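The UAParserUDF change under review at the top of this log is Java, but the underlying ua-parser project also ships a Python package, which gives a quick feel for the browser / OS / device fields that later feed the "Device" dimension. A sketch assuming the `ua_parser` package is installed; the sample user agent and the values in the comments are only indicative:

```python
from ua_parser import user_agent_parser

ua = ("Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.4 "
      "(KHTML, like Gecko) Version/8.0 Mobile/12A365 Safari/600.1.4")

parsed = user_agent_parser.Parse(ua)
print(parsed["user_agent"]["family"])  # e.g. "Mobile Safari"
print(parsed["os"]["family"])          # e.g. "iOS"
print(parsed["device"]["family"])      # e.g. "iPhone"
```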
[14:54:10] Analytics / Refinery: Raw webrequest bits partition for 2014-10-22T23/1H not marked successful - https://bugzilla.wikimedia.org/72428 (christian) NEW p:Unprio s:normal a:None The bits webrequest partition [1] for 2014-10-22T23/1H has not been marked successful. What happened? [1] _______... [14:55:39] Analytics / Refinery: Raw webrequest bits partition for 2014-10-22T23/1H not marked successful - https://bugzilla.wikimedia.org/72428#c1 (christian) NEW>RESO/FIX Only amssq42 is affected. There was no real loss, but the sequence numbers got reset. Since amssq42 got depooled [1] for trusty testing... [14:55:40] Analytics / Refinery: Raw webrequest partitions that were not marked successful - https://bugzilla.wikimedia.org/70085 (christian) [14:58:00] !log Marked raw text webrequest partitions for 2014-10-22T15/1H ok (See {{bug|72427}}) [14:58:11] !log Marked raw text webrequest partitions for 2014-10-22T23/1H ok (See {{bug|72428}}) [16:09:04] ottomata, qchris: how do we prevent puppet from running on wikimetrics staging? it's running and restarting the queue every time [16:09:53] nuria__: puppet agent --disable REASON_FOR_DISABLE [16:10:11] Where REASON_FOR_DISABLE is the reason why you want it disabled [16:10:22] puppet agent --disable "Human readable reason for disable" :) [16:10:45] Hahahaha. [16:40:22] qchris: my next task is to help get the pageviews definition implemented and generating a per-project daily file [16:40:24] milimetric, nuria__: I'm going to rebase to the new master now [16:40:47] mforns: k, i lied about reviewing your code actually because I forgot to eat again [16:40:50] i'll review after lunch [16:41:06] per-project daily file? [16:41:14] That's a file with three readings? [16:41:18] sorry [16:41:29] so the end goal is pageviews in dashiki [16:41:37] for that we need a wikimetrics-format file somewhere [16:41:54] and wikimetrics-format means json files that have daily counts per project [16:42:09] like enwiki.json has day 1: X, day 2: Y, day 3: Z [16:42:29] Oh. The thing we started to discuss yesterday. [16:42:36] ??? [16:42:41] milimetric: the REAL pageviews definiton!? [16:42:45] yes [16:42:46] :) [16:42:49] you have that?! [16:42:51] it exists!? [16:42:52] no [16:42:54] :) [16:42:55] oh. [16:43:01] i'm tasked with helping make it exist [16:43:08] ok! [16:43:16] yep, i'm all "yay" [16:43:25] btw, I would love that to be an implementation agnostic specification of some kind...eventually [16:43:26] You're tasked to help making it exist ... https://meta.wikimedia.org/wiki/Research:Page_view [16:43:30] because i've had the pageviews bug assigned to me since freaking April [16:43:44] qchris: I have read that page [16:43:56] I know there will be various comments, so feel free to be "brutal" [16:43:57] and I know that's where we have to start, so how can I help? [16:44:11] mforns: brutal is my second middle name. Dan Florin Brutal Andreescu [16:44:24] We first need to the use-cases straightened out. [16:44:34] Because there is sooooo much cruft in there. [16:44:39] https://meta.wikimedia.org/wiki/Research_talk:Page_view :) [16:44:42] And so many implicit assumptions. [16:45:29] yep, i read that and though the formulations of the questions in that section sound good, it didn't seem very actionable and "develop"able [16:45:38] so qchris did you have an ideal "use case format" [16:45:41] ? [16:46:07] milimetric: ok! [16:46:24] No. Just making sure that we're on the same page about the problem we're gonna solve. [16:46:50] But! I am not the one to drive this. 
So I guess it's better to ask Ironholds. [16:47:25] * Ironholds rises from r'lyeh [16:47:26] well, so for the purpose of vital signs, we need mobile, desktop, and total pageviews by project [16:47:34] by day [16:47:40] (PS12) Mforns: Add ability to global query a user's wikis [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl) [16:47:50] (CR) jenkins-bot: [V: -1] Add ability to global query a user's wikis [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl) [16:47:50] so could we start with such a dataset as the first step Ironholds? [16:47:53] where mobile groups zero/apps/web? [16:48:17] Ironholds: that's a great question for kevinator and tnegrin that I've been trying to get clarified as well [16:48:18] milimetric, that is fairly doable. The only unstable thing, I think, for that, would be getting a handle on what we're doing with spiders and other non-human traffic. [16:48:49] sounds like a horror movie: spiders and non-human traffic [16:49:07] qchris, I would agree there is a lot of cruft; we should trim it down :). I've taken a pass over some of your questions and will take a pass over the rest when I get out of this meeting. [16:49:29] I am struggling to work out if we want to list all the use cases, or just the main ones. Or even list elements/degrees of granularity and have precisely 1 use case justifying each. [16:49:31] Ironholds: ok, want to have a three-way then? [16:49:40] milimetric: read that again. [16:49:48] Ironholds: All use-cases! :-D [16:49:50] oh, intended [16:49:54] *snorts* [16:50:20] I'm totally down to have a joint meeting; I prefer email threads/talk pages but google hangouts are certainly faster. [16:50:23] ok, so after your meeting i'll be full of food too so that'll work - see you both then [16:50:35] Ironholds / qchris: i think let's do IRC for this [16:50:35] this aligns nicely with tnegrin telling me we could steal your expertise at making use cases sensible :D [16:50:39] okay! [16:50:39] that way we have some log [16:50:43] yay! [16:50:54] but i think we need a bit of real-timeyness for my purposes [16:51:07] What ... meeting ... when? [16:51:54] I do not like that we're again switching medium. :-((((((((((((((((((((((((((((((((((((((((((((( [16:52:40] milimetric, mforns ; these are couple benchmarks of backfilling of pages created on staging now: http://www.mediawiki.org/wiki/Analytics/Editor_Engagement_Vital_Signs/Backfilling#Pages_created [16:53:32] that's good [16:54:51] qchris: just to catch me up on what's going on and for you guys to get my perspective. The current medium isn't oriented towards my goals at all [16:55:34] Neither is it to mine or anyone else's ;-) [16:55:41] and I'm happy to just talk to Oliver, and I promise to track anything useful on wiki [16:56:06] Then ... let's do it on the talk page right away. [16:56:18] milimetric, mforns: will have breakfast #2 and review marcel's code [16:56:28] Google Hangouts are not viewable by the public. [16:56:38] qchris: I don't want a hangout [16:56:44] i suggested IRC [16:56:50] my bad. sorry. [16:56:51] talk page isn't real-time [16:57:54] nuria: ok! 
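The "wikimetrics-format" files milimetric describes above (one JSON file per project, holding daily counts) might look roughly like the sketch below; the exact keys dashiki expects are not spelled out in this log, so the shape, file name, and numbers are purely illustrative placeholders:

```python
import json

# Illustrative only: daily pageview totals for one project, keyed by date.
# The real field layout is whatever dashiki's wikimetrics converter expects.
daily_counts = {
    "2014-10-21": 100,  # placeholder numbers
    "2014-10-22": 110,
    "2014-10-23": 120,
}

with open("enwiki.json", "w") as f:
    json.dump({"enwiki": daily_counts}, f, indent=2, sort_keys=True)
```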
[16:58:59] (PS13) Mforns: Add ability to global query a user's wikis [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl) [17:06:56] Analytics / Wikimetrics: Can not delete tagged cohorts - https://bugzilla.wikimedia.org/72434 (Marcel Ruiz Forns) NEW p:Unprio s:normal a:None When a cohort is tagged and I try to delete it, the following flash message appears: "Error! Wikimetrics is experiencing problems. Visit the Suppor... [17:11:47] qchris: just learned of this (it is included in cdh 5.2???): http://kitesdk.org/docs/current/guide/Kite-Data-Module-Overview/ [17:11:59] just interesting to be aware of. not suggesting we use it [17:13:05] * qchris reads [17:15:01] Nice. [17:23:16] (PS1) Yurik: clone script, overwrite error, mem error [analytics/zero-sms] - https://gerrit.wikimedia.org/r/168333 [17:24:13] (CR) Yurik: [C: 2 V: 2] clone script, overwrite error, mem error [analytics/zero-sms] - https://gerrit.wikimedia.org/r/168333 (owner: Yurik) [17:27:31] mforns, milimetric : going to coffee shop, will be back in a bit [17:32:29] kevinator: for the purpose of vital signs, does a "mobile pageview" mean "mobile site" or "mobile + zero + app"? [17:32:51] I would say mobile only [17:33:05] k, thx [17:33:34] right now, zero and app are so small they can be lumped as ‘other’ and people can investigate them in a cube [17:34:58] hokay [17:35:07] so: I'm back. For an hour (I have to go to lunch after that. bah!) [17:36:08] milimetric, qchris, lets talk what we do with use cases! [17:36:14] k [17:36:20] k [17:36:27] milimetric, I have heard tnegrin waxing lyrical about your ability to make things understandable by Engineers and !Engineers. I nominate you to lead ;p [17:36:36] lol [17:36:45] ok, i'll give my perspective on the definition page [17:36:59] to me, I see the same breakdown you start with [17:37:18] but I think "tags" are dimensions and use cases can be implemented by new tags beyond the ones you have [17:37:38] so - step 1 - filter out everything that's *not* a pageview, same as you [17:37:45] * Ironholds nods [17:37:52] aha [17:37:55] then, figure out what use cases we care about right away [17:38:07] so you'd implement the filtering, pipe the result into wmf_raw.pageviews or something [17:38:14] and then individual solutions for different use cases? [17:38:16] then, step 2 - tag the resulting set to make the first version of the pageview definition [17:38:21] yep [17:38:26] I think that makes a lot of sense [17:38:40] it's also presumably what we'd have to do anyway even if we just had high-level metric + webstatscollector replacement. [17:38:47] since one requires a high level of granularity, one doesn't. [17:39:06] right, but we can from the start acknowledge that new dimensions (types of tags) are coming soon [17:39:08] "soon" [17:39:14] yup [17:39:33] so, the thing then is what is "the use case we care about right away" [17:39:34] I agree that tagging as a set of dimensions makes sense, too; it is possible to have a request that is both zero and an app request, for example. [17:39:38] * Ironholds nods [17:39:58] if that's the approach we're taking, I would make one change to the general filtering [17:40:08] which is, I'd include geolocation and ua-parsing at that stage too. [17:40:50] okay. So, in that case, I would suggest step 1: implement generalised filters. [17:41:17] Step 2, the high-priority use cases for me, at least, are (1) mobile web/app/zero/desktop breakdown per hour, ignoring country or spiders [17:41:43] and (2) maintenance scripts. 
"grab me a random selection of unique user agents that aren't identified as spiders OR as a recognised browser so we can train the spider/automation detector" [17:42:15] ok, interesting, mine's a little different too [17:42:17] "grab me a random selection of {ip,user_agent,count(*)} tuples so we can investigate automated detection of unregistered spammers/DoS attempts" [17:42:24] oh? [17:42:29] and, qchris, any thoughts so far? [17:42:32] it would be - pageviews to "desktop" or "mobile" per day per project [17:42:40] where apparently, "mobile" does not include zero or app [17:42:56] "does not include" == "other", or "does not include" == "DELETE WHERE"? [17:42:59] Ironholds: I am reading along ... but it's too early for me to talk implementation. We need use cases :-/ [17:43:07] no deleting at all [17:43:12] * Ironholds nods [17:43:15] we're not talking impl though [17:43:21] qchris, so you'd advocate for: we need the use cases for the generalised filtering? [17:43:25] we just want to figure out what use cases we should do first [17:43:26] filters and stuff ... that's implementation. [17:43:28] even if tagging is a different step? [17:44:03] use-cases will shape all of that. [17:44:14] Agreed. See Nemo surfacing the need to include 404s, for example. [17:45:11] I guess, the only two use cases we have listed which would make a difference to the filtering are things that pertain to spider-handling and what we do with multimedia files. [17:45:27] (If you want to play devil's advocate and can think of mutations to the generalised filtering that would result from use case changes, speak!) [17:45:50] but Ironholds are the use cases you listed tied to things we have to get done and make public this quarter? [17:46:07] kevin is putting the use cases together -- I pinged him to join this thread [17:46:45] he is? [17:46:50] bah ;p [17:46:58] well, we're all putting the use cases together :) [17:47:03] milimetric, not to my knowledge. Honestly, though, I'm not sure /what/ immediate use cases we have. [17:47:13] My deliverable is the definition. Circular, I know ;p [17:47:21] the one i mentioned is immediate, i'll get fired if we don't get it done in the next month or so [17:47:52] no firing dan! [17:48:06] right - the definition so far is really great imo [17:48:07] milimetric: Last time we talked through the VS pageviews with tn egrin, the outcome was to move forward with the current webstatscollector pageview definition [17:48:10] did that changes? [17:48:23] apparently, yes [17:48:37] (did you put a space in his nick on purpose qchris?) :) [17:48:39] I don't think it changing is a problem. I mean, it being a high priority use case is useful to focus the definition [17:48:40] * qchris is pzzled. [17:48:57] at the moment a lot of the priorities around different use cases are, in my mind at least, fairly fuzzy. [17:48:59] milimetric: yes. on purpose. I suspected a different answer. [17:49:08] I know that we want at some undefined point in some undefined order ALL OF THESE THINGS. [17:49:31] so i was saying - definition looks great but it's more of a definition of the space of available dimensions than of what we should do first and how to move from that initial definition to implementing the next few use cases [17:49:47] right, exactly [17:50:01] and the current wiki page does a great job of not backing down from that [17:50:06] heh [17:50:47] Okay, so then: the first use case to solve for is a per-hour(?) breakdown of pageviews, by site, by desktop/mobile web/{mobile app/zero}? 
[17:51:16] that would require, after generalised filters: remove 404s, apply mobile web/mobile app/zero filters, tag remainder as desktop, group by. [17:51:28] if we want to leave the spiders question aside for this one. [17:52:46] well - things like "remove 404s" to me, so we don't make this single-purpose, just mean, have a dimension that has "response type" grouped in "found" and "not found" and potentially more later [17:53:25] mforns: did you submitted the puppet changes too for CR? [17:53:42] nuria: yes [17:53:57] do you want me to put you as a reviewer? [17:54:09] milimetric, I can't parse that [17:54:14] one sec :) [17:54:23] kevinator: do you have any guidance on the use cases for the pageview definition? [17:54:33] ok, Ironholds, so what I mean is: [17:54:48] the "definition" of the pageview could be split up into: [17:55:02] one definition for what to filter out - this would be stuff that nobody would ever want [17:55:24] yup [17:55:28] that's the generalised filter [17:55:33] then N definitions, one per use case - this would be how do we use the set of available dimensions to answer questions q1,q2,...,qn? [17:55:46] and so we wouldn't throw away 404s [17:55:50] mforns: yes please [17:56:12] we would just abstract 404, 4XX into a value for a dimension [17:56:30] that same dimension would have like 500s grouped somewhere, 403s somewhere else, 30X somewhere else, etc. [17:56:42] milimetric, yes! [17:56:46] nuria: should I add andrew, too? [17:56:48] although we only want 200s and 404s [17:56:59] I was saying; for your specific use case [17:57:05] well, we need whatever granularity makes sense for us to answer our q1 -> qn [17:57:14] nuria: btw, you are there [17:57:14] yes, for me, yes [17:57:15] suppose we apply generalised filters, the results go into wmf_raw.webrequests [17:57:20] mforns: yes, please, otomata reviews all our puppet stuff [17:57:20] bah. wmf_raw.pageviews [17:57:33] we need like {x | x <- response codes and kevinator cares about x} [17:57:35] to answer your question, we'd need to take wmf_raw.pageviews, excluded 404s, apply... [17:57:43] nuria: sorry for having forgotten that... [17:57:59] so yeah, we'd still keep 404s (we need those for other questions) I was just sort of mentally going through the delta between the generalised filters' output and your question. [17:58:55] We're getting knee-deep into implementation with this generalized filters stuff ... but is there a single point of information that no one would care about? [17:59:06] If so, let's drop it from the logs ... no generalized filter. [18:00:15] I can see value in each of the columns that we have. [18:00:32] I guess others would see value too. [18:00:46] So let's step back from implementation details as which filter to apply when. [18:01:05] There are use-cases to build and flesh out. [18:02:29] I can see value in all of the columns, from an operational or research POV. [18:02:42] but there are values within those columns (and columns) that I cannot see being useful for any definition of 'pageviews'. [18:02:45] I agree with the use cases point [18:03:05] we need to agree on what we are trying to do -- I've asked Kevin to work on this [18:03:12] good! [18:03:28] does this also mean he has the responsibility of provoking these conversations? :P [18:03:35] yes [18:03:38] grand [18:04:09] in that case, if qchris wants to wait on use cases, I say we spike this until such use cases are available. We can dig into the spiders/automata question and handle general commentary in the meantime. 
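Rather than dropping 404s in a generalised filter, the idea above is to keep them and expose a coarse "response type" dimension that individual use cases can group or filter on. A small sketch of such a mapping; the bucket names are illustrative, not an agreed taxonomy:

```python
def response_type(status):
    """Map a raw HTTP status code to a coarse 'response type' dimension value."""
    status = int(status)
    if status == 200:
        return "ok"
    if status == 404:
        return "not_found"
    if 300 <= status < 400:
        return "redirect"
    if status in (401, 403):
        return "denied"
    if 400 <= status < 500:
        return "other_client_error"
    if 500 <= status < 600:
        return "server_error"
    return "other"
```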
[18:04:18] and I can work on the unique clients/session analysis work [18:04:52] fair enough - and I guess we can use the current webstatscollector data for Vital Signs [18:05:03] I haven't been following this discussion closely so I can't say what we should be working on [18:05:22] well, it sounds at the least like we're blocked by a use case definition from kevinator [18:05:34] tnegrin: Some days ago, it was said that Vital Signs could use the current webstatscollector data to start with. [18:05:38] does that still hold true? [18:05:39] that was our agreement -- it should be on the list, but we should use the WSC data [18:05:54] Awesome. [18:06:03] my hope was that we could use that data to provide the project level data we use in VS [18:06:39] possibly by post processing the hourly files but that's an implementation detail [18:06:48] Totally. [18:06:48] yep, that's how we'd have to do it [18:07:11] So milimetric ... I guess that unblokcs you pageview-definition-wise. Does it? [18:07:13] make it happen :) [18:07:13] so that's my next question to qchris: if he sees anything stopping me from working on re-shaping those hourly files into wikimetrics json output [18:07:37] well, pageview-definition-wise no, because i have the weight of the world on my shoulders with the damn pageview bug :) [18:07:57] Just RESOLVE/WONTFIX :-) [18:08:04] noooooooo [18:08:07] ;) [18:08:09] We're going to Phabricator ... let's not migrate that bug. [18:08:16] :) [18:08:19] heh -- you've done this before [18:08:35] no, i love that bug [18:08:35] one day -- we're building that database and closing that bug [18:08:48] i'll go down fighting for that one [18:09:01] So milimetric about the massaging of webstatscollector data. [18:09:05] and it implies all the dimensions that we can only get out of the real definition [18:09:11] yes - sorry - real work [18:09:17] The thing is that the current webstatscollector Hive implementation will die. [18:09:25] uh oh [18:09:31] So whatever we build, needs to switch at some point. [18:09:39] Hive is harder to maintain than other things. [18:09:40] when / why will it die? [18:09:44] okay, so what I'm hearing is; wait on kevinator for pageviews but not for the VS implementation? [18:09:45] I’m back in front of my screen [18:09:48] Cool. [18:09:50] yes Ironholds [18:09:51] It will die when we have a good new pageview definition. [18:09:57] I've gotta go out for lunch. *waves* back in a bit! [18:10:05] qchris: oh! 
that's fine [18:10:09] * qchris waves at Ironholds [18:10:14] bon apetit good sir [18:10:39] 1 sec [18:10:49] I have a lot of reading to catch up on :-) [18:11:19] kevinator: good reading, but tl;dr; is you gotta drive the use cases for the pageview definition [18:11:30] with Ironholds [18:11:38] and we're here to help and brain bounce [18:12:07] and i was suggesting that the set of all use cases lend themselves well to dimensional DW type modeling [18:12:12] exactly [18:12:12] (data warehouse) [18:12:49] there are a lot of details around 404s and other responses that we haven't considered currently [18:13:31] right, but really all we care about is - how do we want to essentially "digitize" the high resolution data into values of dimensions we care about [18:13:46] agree [18:14:00] I think the dimensional model is a great way to think about the use cases [18:14:13] it's been driving kevin and my discussions [18:14:37] So Oliver already documented some use cases: https://meta.wikimedia.org/wiki/Research:Page_view#Primary_use_cases [18:15:33] I can add a some [18:16:21] kevinator: I think those are more general than we'd like use cases to be for our purpose [18:16:32] I agree -- they are not stories [18:16:51] I'd suggest something like "what are the hot articles in each project?" [18:17:04] kevinator: "Daily / Monthly by [18:17:05] - Total [18:17:06] - Project [18:17:08] - Country [18:17:09] - Type (Desktop / Mobile / Spider / App) [18:17:10] - Device (UA Parser) [18:17:11] - Page" [18:17:13] I think this was the format we discussed [18:17:13] or "what is the image view count per category?" [18:17:39] milimetric: is a matrix with priorities a good model? [18:17:41] tnegrin: did you cut and past that from one of my docs? I thought i hadn’t shared that [18:17:59] oops [18:18:08] it's in the spreadsheet we were using in the tasking meeting [18:18:30] i think prioritized use cases are great, but most importantly i think they have to be concrete [18:18:39] agree [18:18:55] and they don't have to get into details about how we would determine them from the data, just the question is enough in most cases I think [18:19:03] like, "Type" above might be a couple of dimensions [18:19:15] how so? [18:19:26] well, maybe zero requests can come from apps [18:19:30] I don't really know how to think about this [18:19:32] I think the primary use case is for Vital signs and to give us a sense if the use of a project is growing or shrinking. 
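A very rough sketch of the Desktop / Mobile / Zero / App split that the "Type" dimension above refers to. The host and user-agent conventions used here (a ".m." host segment for the mobile site, ".zero." for Wikipedia Zero, "WikipediaApp" in app user agents) are assumptions for illustration, and checking app before zero is just one arbitrary way to resolve the overlap the discussion worries about:

```python
def access_method(host, user_agent):
    """Classify one request into a coarse access-method dimension value."""
    if "WikipediaApp" in user_agent:
        # Checked first: one arbitrary resolution of the zero/app overlap.
        return "app"
    if ".zero." in host:
        return "zero"
    if ".m." in host:
        return "mobile web"
    return "desktop"
```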
[18:19:51] so top priorities are total and then by project [18:19:54] yeah -- but these use cases are for the final PV def -- we're just using what we have for VS [18:20:17] we already have enough in the page views file the qchris and ottomata made from hive [18:20:27] so can we meet those [18:20:39] but we need solid use cases to implement the final version [18:21:00] right, i think it's ok if the top use case for the final version is *also* vital signs though [18:21:08] fair [18:21:12] and also ok that the requirements for it are the same [18:21:17] The second use case is in a BI tool where we can cut up by country, target site, device [18:21:26] right [18:21:38] or a queryable database :) [18:21:50] a DW [18:22:36] but this isn't really a story right -- a story would deal more with the metrics/dimensions [18:23:02] "as a user, I want to see a breakdown of page views by type of request", et al [18:23:07] well - the use case is - "vital signs cares about daily pageview counts by project" [18:23:17] and "vital signs cares to see a breakdown of mobile and desktop pageviews" [18:23:21] yeah [18:23:33] with a little more definition about what "mobile" and "desktop" mean - as in, not API, not Zero, etc. [18:25:36] The next story would be something like: [18:26:22] as a product manager, I can look at pageviews by browser on mobile [18:27:14] for Zero, i can look at pageviews for the countries with Zero carriers [18:27:18] that works -- with some more detail about what that means [18:27:42] I think all the dimensions are specified here. [18:27:48] in the details I mean [18:28:10] Looking at pageviews per page is another story [18:28:11] where? [18:28:32] I need to write these stories down in a wiki [18:28:39] ok [18:31:05] kevinator: I suggest working with Ironholds to move the current use cases to some more "use case inspirations" type section and then hammering out brief and specific use cases there [18:31:05] that way there's a one-stop [18:31:53] Ok… I’m starting the doc now and will talk to Ironholds when he’s back from lunch [18:33:08] *one-stop shop for pageview-definition groking i mean, sorry got dc-ed [18:33:12] cool [18:45:14] (CR) Nuria: "I think there are grants missing for wikimetrics user on centralauth as the app returns" (7 comments) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl) [18:46:39] (CR) Nuria: "Also two tests were failing for me:" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl) [18:50:05] ok nuria__ & qchris & milimetric have a tie-break situation [18:50:27] goal: get daily pageviews by project in Vital Signs [18:50:36] with a mobile / desktop / total breakdown [18:50:51] milimetric qchris : aham [18:51:21] approach 1: wikimetrics gets a new PageViews metric that looks at the current projectcounts hourly files and computes the kind of json output that dashiki already understands [18:51:55] approach 2: Hive + Oozie job creates TSVs with the desired data, aggregated and mapped how dashiki expects it, but Dashiki implements a new formatter that takes in TSV data [18:52:19] mforns ^ [18:52:29] oi! [18:52:46] mforns: Do you have an opinion on ^? [18:53:52] mforns / nuria__ : take your time, no rush, qchris and I have been bikeshedding for a while already [18:54:01] ok [18:54:19] approach 1 assumes other files than the current ones, right? 
as you would need daily aggregation per project regardless of file format (so as to 1 make 1 request to show pageviews for eswiki for the year at a daily granularity) [18:54:23] milimetric: dashiki currently gets all its datasets from wikimetrics, right? [18:54:31] qchris: yes [18:54:33] yep [18:54:38] Ok. [18:54:47] qchris: but not from the db, from teh static mount [18:54:49] nuria__: no, approach 1 would aggregate those files by itself [18:54:49] *the [18:54:53] in wikimetrics [18:55:14] milimetric: then approach 1 is not feasible, showing 30 days of data implies 30 requests [18:55:31] milimetric: i see [18:55:56] milimetric, qchris: not so fond of turning wikimetrics in a swiss army knife of file agreggation [18:56:31] but wasn't the vision at some point that wikimetrics is a general queryable data source? [18:56:40] nuria__: that would be throw-away logic, yes, because we'd move to the DW to get pageview information within a few months anyway [18:57:00] but, on the other hand, the pageview metric itself would remain and might be useful in general [18:57:05] (implementation aside) [18:57:05] where do these project counts hourly files come from? [18:57:20] http://dumps.wikimedia.org/other/pagecounts-all-sites/2014/2014-10/ [18:57:23] They fall out of Hive. [18:57:24] if you scroll all the way down [18:57:36] milimetric: that newer files exist doesn't preclude from current ones being there too [18:58:04] yes, this is true [18:58:09] qchris: no, i do not see wikimetrics as a general queryable data source, that is looking more [18:58:16] qchris: like the DW [18:58:20] detecting when the dataset is "ready" would re-implement the trickiest part of Oozie in wikimetrics [18:58:32] qchris: the feeling is that wikimetrics will write its metrics to the DW [18:58:34] I think qchris meant "querying" not "queryable" [18:58:37] so I believe approach 1 has one middle step (wikimetrics), right? [18:58:58] maybe appr2 is simpler [18:59:22] i vote for appr2 as well, but that puts more work on qchris [18:59:37] so qchris: we support and help you in any way we can for that extra work [18:59:40] milimetric, qchris: we can all chime into that work [18:59:54] milimetric: that should not be the deciding factor [19:00:05] no, i know [19:00:10] but it sounds like 3 for appr.2 [19:00:24] To me the issue with approach 2 is that there is no clear place where to put the files. [19:00:27] And also [19:00:48] Hive does not know about the projectcounts output files. [19:00:48] qchris: can we put them on http://datasets.wikimedia.org/ [19:01:02] qchris: why not http://dumps.wikimedia.org/other/pagecounts-all-projects/2014/2014-10/ [19:01:22] cause one thing is "sites" another is "projects" [19:01:27] milimetric: like a fully fresh dataset that we have to maintain. That means more daily work on my shoulders. [19:01:44] nuria__: Because that directory has all the idiosyncrasies of webstatscollector. [19:02:09] yep, but that dataset is coming soon with the "real" pageviews anyway, and this one will be thrown away. So I don't think it's extra maintenance, just sooner than we would normally get it [19:02:29] to nuria's point though - how would we ever do "make sure one day of data is fully available" in wikimetrics? [19:02:44] whereas Oozie has your awesome OK flag [19:02:45] The cluster has daily hiccups that I need to document. 1 more dataset ... 1 more place where I need to update them. [19:02:56] qchris, milimetric: and the reality is that we need teh dataset [19:03:17] No done flag for the projectcounts files. 
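With no done-flag for the projectcounts files, "day X is ready" has to be inferred from the hourly files themselves — all 24 present — which is also what the shell check later in the discussion does with `wc -l`. A sketch, with the directory layout taken from the paths quoted in this channel and the file-name pattern assumed:

```python
import glob
import os.path

ARCHIVE = "/mnt/hdfs/wmf/data/archive/webstats"  # path as quoted in the channel

def day_is_ready(day):
    """True if an hourly projectcounts file exists for each of the 24 hours.

    `day` is a datetime.date.  The file naming (projectcounts-YYYYMMDD-HH...)
    is inferred from the wildcards quoted in this discussion.
    """
    directory = os.path.join(ARCHIVE, day.strftime("%Y"), day.strftime("%Y-%m"))
    return all(
        glob.glob(os.path.join(directory,
                               "projectcounts-%s-%02d*" % (day.strftime("%Y%m%d"), hour)))
        for hour in range(24)
    )
```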
[19:03:22] qchris: I feel that pain, and we should really help you with that [19:03:26] sure it is more work but it is because we have more functionality [19:03:41] but if we have N datasets now, the equation is N + 1 + 1 - 1 = N + 1 [19:03:55] qchris: now, it is valid point to say: we need to make the cluster more solid [19:04:05] qchris: in order to support a second dataset [19:04:21] agreed - or at the very least get you help in adding hiccup documentaion [19:04:27] oozie can not generate files in dashiki json format? [19:04:47] mforns: we even have to glue the tsvs together by hand. [19:04:52] qchris: i can get behind a better cluster before adding more work to it [19:04:52] oozie would just be the scheduler / workflow tool, but yes, we could schedule scripts that would do that [19:05:02] nuria__: +1 [19:05:12] ok [19:05:22] Besides ... Oozie/Hive feels like a massive overkill for a few KB files. [19:05:43] qchris: for pageview data per project? [19:05:48] but qchris, my point from above: how does wikimetrics know "day X is ready"? [19:06:06] milimetric: If all 24 hourly files exist ... its ready. [19:06:12] qchris, milimetric: what other dataset is there that management cares more about (besides edits) [19:06:56] qchris, milimetric: I do not think wikimetrics should be in the business of concatenating files [19:07:15] agree [19:07:18] Don't call it concatenating files ... it's generating a report. [19:07:26] That's what wikimetrics is used for. [19:07:27] https://github.com/wikimedia/analytics-wikimetrics/blob/master/wikimetrics/api/file_manager.py#L159 [19:07:28] :) [19:07:33] qchris, milimetric: i could see how if dashiki had a backend that backend would do it but not wikimetrics, which is cohort centric [19:07:45] not a "general purpose tool" [19:07:46] yeah - the cohort centric part bothered me too [19:07:55] Ok. Approcah 2 it is. [19:07:59] that's why i initially suggested approach 1 but moved to qchris's approach [19:08:12] :-( [19:08:17] xD [19:08:32] there is approach #3, and it's adding a backend to dashiki [19:08:34] qchris: another bad thing - having "pageviews" in a tool meant to track user activity might be a very bad thing [19:08:39] which would be SAD [19:08:51] like running up behind someone to give them back something that fell and trying to explain you're not trying to murder them [19:09:05] nuria__: :) [19:09:56] qchris, milimetric: the file_manager manipulates data that comes from a cohort+metric run, not generic data at all, so really not a fit [19:10:28] i know :) I was just being a smartass 'cause that function concatenates files [19:10:33] cohort = enwiki project (not enwiki editors). metric = pageviews. Perfect fit. [19:10:51] qchris: now i get your point that the hicups make it more work taht it seems [19:11:00] But let's move forward with Approach 2. [19:11:00] qchris: but we'd have to exclude "PageViews" from static cohort reports [19:11:08] which would be hard from a UX point of view [19:11:14] like - hey - how come this metric disappeared? [19:11:20] qchris: cohort=editors (user ids , not readers) [19:11:38] oh! [19:11:40] oh! [19:11:45] wikimetrics wouldn't work [19:11:46] duh [19:11:47] my bad [19:12:04] nuria__: wikimetrics has per project cohorts already. 
[19:12:05] because metrics always return individual results [19:12:20] the aggregation happens above the metric, in the AggregateReportNode [19:12:25] qchris: for readers, with user_ids [19:12:35] it wouldn't work, it's a moot point [19:12:37] I'm so sorry [19:12:39] :( [19:12:47] milimetric: why wouldn't it? [19:13:00] we'd have to either massively hack it to return aggregate results or completely rewrite aggregation [19:13:12] We're beating a dead horse. Let me say it a third time ... let's move forward with approach 2. [19:13:14] because the shape of a metric output is { user_id : metric } [19:13:18] ah ya, sure [19:13:36] qchris: okeissssss [19:13:36] no i know qchris, i'm just saying - mea culpa, and for future reference - this is useful to keep in mind [19:13:46] qchris: will not say one more word [19:14:24] qchris: i lied, one more word: please add me to those CRs if you do it so I know how does that work [19:14:26] but yea, we should at some point revisit the { user_id : metric } approach [19:14:46] qchris: feel free to assign tha work to me [19:14:52] it's definitely dev work [19:15:08] yes, add me too, just to follow, please [19:15:22] qchris: agreed, that might be a more even split of the work as you would only need to do initial support [19:15:27] guys you can add yourselves to CRs automatically - in the refinery project settings in gerrit [19:15:40] ok [19:16:02] ok let's leave poor christian alone - it's like really late there [19:16:20] Hahaha. I'll be around for a bit ;-) [19:16:28] Gute Nacht qchris [19:16:39] Gute Nacht nuria__ :-) [19:16:46] bye! [19:35:27] ottomata: Got some time to bikeshed around refinery / pagecounts_all_sites? [19:38:18] yes [19:38:24] Awesome! [19:38:28] So we have the pagecounts files partitioned by hour in Hive. [19:38:31] For each wiki, we need a file holding daily aggregates for that wiki per site. [19:38:38] Hive is pretty much out of the game since (~800 wikis) we're talking ~800 output files daily here ... so ~800 queries. [19:38:43] But! The going over the 24 hourly projectcounts (a few KB each) file each day with, say a Python script, [19:38:46] and updating the ~800 files with that data would be pretty straight forward. [19:38:49] It need not even use the cluster but could feed straight from stat1002's /mnt/hdfs and push to [19:38:53] $SOME_WELL_KNOWN_PLACE or a git repo. [19:38:56] Eventually, those ~800 files need to be accessible from some URL. [19:38:59] The files are fully public. [19:39:04] Is such a script something that would pass your Code-Review, or does that feel wrong from the start? [19:39:52] qchris, aggreate pagecount or projectcount? [19:40:17] aggregate projectcounts is sufficient. [19:40:29] k you first said pagecount so i was confused [19:40:31] But we could re-aggregate pagecounts too,. [19:40:46] projectcounts is just an aggregate of pagecounts. [19:40:56] yes, projectcount files are much smaller [19:41:04] Right. [19:41:47] so [19:42:04] you need one file per wiki, with daily pageviews to that wiki as rows [19:42:06] in that file? [19:42:15] yes. [19:42:28] Or one big json blob in Vital-Signs format. [19:42:35] (But that would be less accessible to the community) [19:42:39] ha, aye, k [19:42:54] um, yeah i think that's fine, especially if you are just parsing the last 24 projectcount files each time [19:42:58] and you really just want to append to the file, right? [19:43:02] Right. [19:43:40] Ok. Thanks. [19:43:57] ja sure, whatever, sounds good to me [19:43:59] Is stats user on stat1002 a good place to run it? 
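A minimal sketch of the aggregation script qchris outlines above: read the day's hourly projectcounts files from /mnt/hdfs, sum per project code, and append one row per project. The column positions (project code in field 1, view count in field 3) follow the awk one-liner that comes up shortly after; the output location is a placeholder, and the mapping from codes like en.m / en.d onto enwiki / enwiktionary is deliberately left out:

```python
import glob
import os.path
from collections import defaultdict

ARCHIVE = "/mnt/hdfs/wmf/data/archive/webstats"   # path quoted in the channel
OUTPUT = "/tmp/projectcounts-daily"               # placeholder destination

def aggregate_day(day):
    """Append one 'YYYY-MM-DD<TAB>count' row per project code for a given date.

    `day` is a datetime.date.
    """
    directory = os.path.join(ARCHIVE, day.strftime("%Y"), day.strftime("%Y-%m"))
    pattern = os.path.join(directory, "projectcounts-%s-*" % day.strftime("%Y%m%d"))
    totals = defaultdict(int)
    for path in sorted(glob.glob(pattern)):
        with open(path) as hourly:
            for line in hourly:
                fields = line.split()
                if len(fields) >= 3:
                    totals[fields[0]] += int(fields[2])
    for project, count in sorted(totals.items()):
        # One small file per project code; folding en.m / en.d into
        # enwiki / enwiktionary is intentionally not attempted here.
        with open(os.path.join(OUTPUT, project + ".tsv"), "a") as out:
            out.write("%s\t%d\n" % (day.isoformat(), count))
```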
[19:44:04] python script, read from file, append to file [19:44:08] yes [19:44:18] Perfect. [19:44:20] stat user crontab on stat1002 sounds right to me :) [19:44:30] Okey-Dokey. [19:44:48] Thanks. [19:44:54] milimetric: nuria__ mforns : i started writing down the use-cases / stories for pageviews. They are not perfect yet… [19:45:11] https://www.mediawiki.org/wiki/Analytics/Pageviews/Stories [19:45:19] qcrisPretendsZZZ: ^^ [19:47:29] Thanks kevinator_at_lun. [19:47:38] kevinator: ok [19:47:54] qcrisPretendsZZZ: 1. lol @ nickname, 2. wait, you're doing this in python on stat2? [19:47:55] why is it truncating my nick? [19:48:11] that's dev world qchris, I can do that! [19:48:26] i'll just add it to my personal cron and send it out to datasets. [19:49:47] milimetric: But I guess we want thaht puppetized and everything, wouldn't we? [19:50:02] qcrisPretendsZZZ: YES [19:50:17] qcrisPretendsZZZ: in case it was not clear . YES TO PUPPET [19:50:33] * milimetric walks slowly away from puppet work [19:50:37] :-D [19:50:42] let's please please not have scripts running under our own users that break when someone is on vacation [19:50:59] I mean ... if you prefer to have it run under your own user ... fine by me. [19:51:03] nono [19:51:06] puppet is right [19:51:18] the main meaning was - this feels a lot like we dumped dev work on christian [19:51:30] if it doesn't even involve oozie, we should do it [19:51:34] :-D Meh. [19:51:34] own user crons over my dead body [19:51:49] Meh rejected [19:51:54] Hahaha. [19:52:03] are you working on it already? [19:52:33] No. I've just thought about the problem. [19:52:43] And figured I should run it by otto mata. [19:53:05] I've got three other things to finish before I'd start. [19:53:21] ok qcrisPretendsZZZ let's talk tomorrow at standup [19:53:31] i'm going to do the prep work in dashiki to handle multiple file servers [19:53:36] Ok. [19:53:46] milimetric: yeah, do it! [19:53:53] i or qchris can deal with puppetization [19:53:57] k [19:54:05] you want it to live in refinery? [19:54:08] you just get a script going for it...uhhhhh, where is the right place to commit it? [19:54:08] I so hope that "or" short-circuits :-) [19:54:09] i guess so? [19:54:24] It's not bound to refinery. [19:54:30] What about a separate repo? [19:54:37] They're cheap. [19:54:37] mehhh [19:54:40] shitty-python-hacks? [19:54:44] or what about dashiki? [19:54:46] not sure which [19:54:48] probably not dashiki [19:54:51] i think refinery is fine [19:54:52] definitely not dashiki [19:55:35] k, i'll just start it tomorrow and we can figure out where in refinery to put it on the CR [19:55:42] But I guess/hope we'd have a data-repo somewhere ... so we can see changes over time, not loose data etc. [19:59:11] milimetric: qchris_away: [19:59:12] cat /mnt/hdfs/wmf/data/archive/webstats/2014/2014-10projectcounts-20141023-* | awk '{arr[$1]+=$3} END {for (i in arr) {print i,arr[i]}}' | sort [19:59:16] done. [19:59:17] :) [19:59:27] sorry [19:59:27] cat /mnt/hdfs/wmf/data/archive/webstats/2014/2014-10/projectcounts-20141023-* | awk '{arr[$1]+=$3} END {for (i in arr) {print i,arr[i]}}' | sort [19:59:33] qchris_away: yes please repo. [19:59:51] well... 
we have to map from en.m => enwiki (mobile submetric) and en.d => enwiktionary [19:59:53] for cron at least [19:59:55] so it's not quite that easy [19:59:58] haha, ok, fancier stuff needed [20:00:08] nuria__: qchris is suggesting storing the output in a repo [20:00:09] which...meh [20:00:19] just stick it on a public webserver somewhere [20:00:20] ottomata: and also we'd have to make sure all the files are there before computing a day [20:00:22] or copy it back into hdfs [20:00:23] good enough for me [20:00:30] and check that we computed all the days and if not compute the old days we haven't [20:00:45] ottomata, qchris: sorry , iam cheering for crons/setup to be in a repo not output files [20:01:02] [ $(ls /mnt/hdfs/wmf/data/archive/webstats/2014/2014-10/projectcounts-20141023-* | wc -l) -eq 24 ] && ... [20:01:02] :) [20:01:03] ottomata: output files need to be served somewhere public [20:01:43] ja, put 'em back into the hdfs archive/ for redundancy? and then have them rsync to dumps.w [20:01:45] just like the other ones [20:01:55] projectcounts-all-sites-daily [20:01:56] whatever [20:04:39] okay, who knows things about the NavigationTiming_10076863 schema? [20:04:40] halfak? [20:05:05] Nope. don't know much. [20:05:13] I could help you reverse engineer it though if you want [20:05:24] I bet Ori knows about it. [20:07:33] Ironholds: what about it? [20:07:43] I understand the browser part of it if that's what you're curious about [20:08:13] I would really like to know if it contains a unique ID for the client as well as the event [20:08:20] if it doesn't I'm gonna be a sad trombone :( [20:08:36] not just make a sad trombone noise. BE a sad trombone. All I will be capable of is making sad trombone noises. [20:12:19] Ironholds: I don't know how to say this [20:12:48] NOOO [20:12:53] like, what's the most pleasant trombone material that human skin can be morphed into... [20:12:53] * Ironholds sad trombones :( [20:12:58] * Ironholds thinks [20:12:58] i heard of plastic ones? [20:12:59] brass? [20:13:01] plastic is nice... [20:13:07] you really shouldn't make a serious instrument out of anything but brass [20:13:08] https://meta.wikimedia.org/wiki/Schema:NavigationTiming [20:13:13] WHY [20:13:19] there's nothing I see on there that has any unique thingy per person [20:13:26] i mean, you have useragents.... [20:13:29] no? [20:13:49] useragents useless? [20:13:52] totally [20:14:04] not for tracking over >1 day and even 1 day requires IP to be anywhere useful. Aww. [20:14:11] I wanted to build up a desktop/mobile web session benchmark! [20:14:16] I made my code all OO and fancy to enable it! [20:14:28] i see [20:14:51] but yeah, none of our EL schemas have per-user tracking at all - probably for the same reason we don't have Uniques solved yet [20:15:25] bollocks! NT used to. [20:15:32] and the volume worked fine [20:16:04] wait, I tell a lie, that was ModuleStorage [20:16:32] same principle! [20:20:47] :P Was going to say ModuleStorage [20:21:26] oh, cool, didn't know [20:21:33] which last triggered on 20141018162911 [20:21:39] I think that's probably a legacy trigger. Womp womp :( [20:21:46] hmhm... [20:22:01] (PS1) Milimetric: Fix test and empty selection cases [analytics/dashiki] - https://gerrit.wikimedia.org/r/168414 [20:22:09] what do I have to bribe people to put together an interim LUCIDs solution for mobile web/desktop? [20:22:51] and by that I don't mean "explain to me how there's FAR TOO MUCH DATA for the system to take, before we've discussed how much data there would be". 
I mean: if we continue not having a desktop/mobile web solution, Product will continue doing silly things with our apps data. [20:24:14] Ironholds: you mean like real unique people tracking right, not halfak's date clever thing? [20:24:31] I mean like reissuing the ModuleStorage tokens every month [20:24:37] like, as simple as that. [20:24:48] take pre-existing schema and code that does /exactly what we want/, apply to tiny sample of clients. [20:25:06] every month, turf out the previous tiny sample, do.call(rinse_wash_repeat()) [20:25:08] Ironholds, can we not use session tokens for this? [20:25:24] bribes don't apply then, because it's not a technical problem [20:25:25] Those are stripped before they get to the requestlogs, I thought? [20:25:37] unless they're saved somewhere else? [20:26:10] Ironholds, they totally are stripped, but they don't need to be. [20:27:14] yeah, but I'm talking a quick-and-dirty solution [20:27:22] (also I don't know how long the expiry is set to on those tokens(?)) [20:35:12] ironhols: modulestorage is only defined for ie8 and up [20:35:31] ironholds: mostly desktop and newer mobile [20:35:53] Ironholds: so not really a feature deployed to the majority of mobile users [20:36:10] nuria__, indeed, I wasn't proposing literally using the same terms, but the general principle of "we could have a schema that does this, and we know that because we have a schema that did it" [20:36:45] Ironholds: w/o client side storage you are only left with cookies [20:36:57] indeedy! [20:37:29] Ironholds: Modulestorage was testing localstorage thus it made sense there 100%, it was a desktop oriented feature [20:37:51] yeah, I wasn't disputing "we'd need to use cookies" [20:37:55] I was saying "yep, we'd need to use cookies" [20:39:54] Ironholds: as milimetric said problem is not technical [20:40:41] well, there's clearly a technical element [20:40:52] when I proposed it last time I got back "the amount of data would make things fall over" [20:40:55] BTW, Ironholds we baby sit the nav timing so we own taht schema more or less [20:41:23] if our communities merged into a polite gentleman with a white hat, and this dude came up to us and was like - "no prob bro, do whatever", then we could have it done in a week [20:41:28] Ironhods: I think I should be able to answer your questions [20:42:09] Analytics / Tech community metrics: List of Phabricator users - https://bugzilla.wikimedia.org/35508#c19 (Ben B) (In reply to Antoine "hashar" Musso (WMF) from comment #7) > From Sumanah, tip about how to get a list of user is at : > https://groups.google.com/group/repo-discuss/msg/c426b6a83400b58e >... [20:42:26] milimetric, but again, this wasn't the initial objection. The initial objection was technical. [20:43:37] objection from whom? [20:44:46] Ironholds: ^ [20:44:55] At the time, I think nuria__ [20:44:57] I'll check my emails [20:45:22] Ironghols: on user EL yes, but there are many other technical solutions [20:45:27] Ironholds: sorry [20:45:47] Ironholds: *on using EL, sure, plenty of technical objections [20:45:58] Ironholds: but there are other stech olutions [20:46:03] but none that wouldn't require us to build out a load of stuff [20:46:15] thus opening the "our time is better invested in a long-term solution" argument [20:46:34] and that's the point at which we return to the status quo, which is "apps just went ahead and did it and now they want tons of contextless data" [20:46:47] Ironholds: your time ? 
as dev time to enhance EL to support sessions will be ample [20:47:12] Ironholds: as it was never designed for that purpose [20:47:14] but you just told me EL was not an appropriate solution, and so we should go for another tech solution [20:47:22] wait [20:47:29] for sampled analysis - EL si fine [20:47:30] *is [20:47:31] which is the same argument made last time this came up, and was followed with "our time is better invested in a long-term solution" [20:47:34] milimetric, I agree! [20:47:39] like very sampled [20:47:42] and very limited [20:47:50] but still not fine privacy wise [20:47:51] Ironholds, milimetric : for sample analysis of "events" [20:48:03] yeah, like NavigationTiming for example [20:48:09] if we want to track one person's experience with the site [20:48:20] okay, what percentage of clients got ModuleStorage tokens? [20:48:22] we'd have a much better understanding of how our changes are affecting performance at an individual level [20:48:24] let's use that amount. [20:48:40] The system has shown it can tolerate that number of uniques logging every page request. [20:48:47] it did that when we did...precisely that thing ;p [20:48:47] like, obviously, uniquely tracking users added on to any of our current schemas has value [20:49:00] but the point is - we are on uncertain grounds as to whether we can do that [20:49:12] personally, I think we should have an opt-out of event logging before we do such things [20:49:27] Ironholds: that was "sampled", for desktops [20:49:47] Ironholds: by no means for all users (was never the intention) or mobile or apps [20:50:00] Ironholds: makes sense? [20:50:42] Ironholds: also nav timing is not linked to page data [20:51:29] what? yes it is. [20:51:47] yeah, it's even got rev_id in some cases [20:51:51] and pageID [20:52:00] nuria__, yes, I don't want it for a ll users. At no point did I want it for all users. [20:52:08] ah sorry on teh capsule, you are very right [20:52:08] but isn't this a dead horse? [20:52:21] it's got rev_id on the schema itself [20:52:24] the precise line in the document in question, which I know you read because you commented on this precise line, was "Issues a unique token to browsers that do not have a unique token already set, making the decision on a probabilistic basis - say, a 0.002% chance that non-identified browsers will have a token provided;" [20:52:26] Yes, we already talked about this at length. [20:52:47] yes, but evidently that discussion was informed by the impression that I wanted to do it to everyone. [20:53:15] that is not the case. I can't imagine there is no delta in technical/privacy concerns between "do it for everyone" and "do it for a subset of users whose event count the system has already demonstrated it can tolerate" [20:53:31] anyway, I have my 1:1 with Dario. I'll try and get around it with fingerprinting for the time being [20:53:48] (which I actually find more privacy-problematic because it necessitates us keeping that data around. ew. But.) [20:53:54] hey Ironholds - did you get my message on the 1:1? 
[20:54:03] DarTar, nope [20:54:09] I was hoping we could cancel it / reschedule it [20:54:22] unless there’s something urgent you want to talk about [20:54:40] I’ve been in interviews and stuff the whole day and I need to get shit done [20:55:05] well, I'd like to talk about wtf is going on with any of the consumption metrics, but that may be a tnegrin conversation [20:55:06] (I thought I added a note to the calendar invite early this morning) [20:55:09] in the meantime, update: [20:55:41] I've been trying to kick people into contributing to the consumption metrics docs. Plus points to EZ, Q-Chris and Halfak, grr to ellery ;p [20:55:42] and - I’m finishing a pre-interview chat right now [20:55:51] other than that, app sessions are sucking in all my time [20:55:53] write away, will respond in a moment [20:55:57] Ironholds: privacy wise the sample rate is not relevant in my opinion. Though I agree keeping around user agents is very bad too. Our goals include scrubbing that [20:56:04] and have renewed my deep hatred of object-orientation [20:56:09] Ironholds, I'm getting to it. Lots of side-tracking today. [20:56:20] halfak, yeah, I was saying go you! You're one of the more responsive peeps [20:56:41] milimetric, all I want is a schema of is_mobile, timestamp_of_page_load and uniqueID [20:56:56] Ironholds: let's talk later today [20:56:58] Everything else is gravy [20:57:00] I'll send you an invite [20:57:04] tnegrin, we have our 1:1 Monday, we're probably good. [20:57:08] but if you have the time, sure. [20:57:12] I have time [20:57:26] cool [20:59:21] Ironholds: this is just my opinion, and it doesn't really count, but I (shamefully) think I make a good point. All I'd like is opt-out. It's not like we don't want to implement it, we just don't because we can't get out of our own way to prioritize it. [20:59:43] and failure to prioritize shouldn't impact privacy. If we're evil, fine, but let's not accidentally impact privacy [21:00:36] milimetric, I totally agree! I would also like an opt-out. [21:00:54] so, technically, I think that's the only blocker [21:00:59] I'm just currently caught between product and a hard place [21:00:59] as to how we do it - there's no problem [21:01:09] i know, i realize you feel that pain more than us [21:01:39] Ironholds: I think that is something to escalate to toby [21:01:55] he's listening :) [21:02:46] Ironhols, milimetric : right, in the short term, until we prioritize opt-out there is little we can do [21:06:08] makes sense [21:06:30] sorry for getting grumpy at y'all; like you say, nuria, it's a problem to push with the people who can make these decisions for certain, not the people who feed into them. [21:06:45] ottomata, nuria__: Around the repos for the pageview counting ... actually I was hoping to use both. A repo for the code. and another repo for the data. Just like the setup that erosen used for the projects he ran. That proofed to be pretty nifty in many occasions. [21:08:42] But I am fine with the one doing the coding to decide. And it seems [21:08:48] code will go into refinery. [21:10:00] qchris: but you wanted to store the data into git? [21:10:21] milimetric: Yes. That's pretty useful. [21:10:29] But If you manage persistence otherwise ... fine by me. [21:10:54] (Like you can clone locally easily and inspect changes over time... [21:11:12] ... automatically see if things go wrong in diffs ... and many other things.) [21:11:44] :) ideally i'd love to store all these datasets on wiki [21:11:56] That's fine too. 
[21:11:58] which is why i asked to be included in some upcoming discussions with wikidata [21:12:07] but for now, i think normal flat static files are fine [21:12:11] this is throw-away anyway [21:12:18] Sure. [21:12:36] I am not trying to get in the way. Just responding to the ping I received while being away. [21:34:26] (PS1) Milimetric: Add Separated Values converter [analytics/dashiki] - https://gerrit.wikimedia.org/r/168488 [21:34:42] nuria__: I added a simple CSV / TSV parser ^ [21:34:57] it's just WIP [21:35:26] i have to still add some type of factory that would read the new settings and figure out what converter to use, then swap that out in wikimetrics-visualizer [21:36:28] but I also fixed up a failing test and added a simple "empty selection" case here: https://gerrit.wikimedia.org/r/#/c/168414/1 [21:37:47] mforns: I also added you to this, so you can keep an eye on dashiki if you like [21:44:07] ok [21:44:43] thanks! [22:22:57] ok milimetric, will look later on today, i want to add some tests to UA UDF [23:42:55] Hey, is there a standard set of buckets for users by number of total edits (to date)? [23:43:11] If not I'm happy to invent my own but would prefer to follow best practice if it exists. :-) [23:47:43] James_F: you mean to classify editors as active/not? [23:48:11] James_F: if so we call an active editor one with 5 edits in a 30 day period [23:48:55] nuria__: For edit events; I'm currently thinking 0|0>x>=10|10>x>=100|100>x>=1000|1000>x>=10000|x>10000 [23:49:14] nuria__: Yes, but for lifetime saved edits, rather than recent activity. [23:50:18] James_F: you can see halfak's work on this regard and he might have a classification for editors that go beyond 5 edits a month [23:50:38] nuria__: I looked on meta and didn't find anything but I might have missed it. [23:50:43] * halfak reads scrollback. [23:51:35] You're right James. We don't have standard classes for edit counts. [23:51:41] James_F: But such a work exists, we normally classify it by activity on a timeperiod: http://meta.wikimedia.org/wiki/Research:Refining_the_definition_of_monthly_active_editors [23:51:42] Edit counts are a weird metric. What are you trying to measure? [23:51:56] halfak: Editing experience. [23:52:06] Sounds like the right metrics. [23:52:49] halfak: Well, it'll miss cross-wiki editors, but for the vast majority it feels roughly right. Orders of magnitude seems sane? [23:52:50] I have one that will get you slightly better measurement accuracy (probably) at the cost of not being able to do it all in SQL: https://meta.wikimedia.org/wiki/Research:Activity_session [23:53:09] Thanks! [23:53:30] James_F: If you're planning to just take a look, I think orders of magnitude are reasonable. [23:53:37] * James_F nods. [23:53:50] James_F: also have in mind that counting edits might include counting "reverted edits" so you have to look in more than 1 place to do it [23:53:59] If you find something interesting, you can always do a sensitivity analysis to see if the cutoffs were causing some artifact. [23:54:00] The advantage of raw edit count is that we have it trivially available on the client. [23:54:11] https://en.wikipedia.org/wiki/Sensitivity_analysis [23:54:18] I'm just trying to avoid joining user to event logging. :-) [23:54:41] In a production system or for an analysis? [23:54:44] Yeah, I doubt it'll show significantly divergent behaviour from expectations, but then, that's why they call them expectations. :-) [23:54:49] Production. 
[23:54:52] James_F: as long as you are aware that you might be counting productive + non productive edits [23:55:23] So long as you bucket 1 and 10 differently, you should be OK. [23:55:52] * James_F nods. [23:55:56] James_F: but for total number of edits you should not need EL [23:56:12] Editors who don't usually save productive edits don't usually make it to 10 edits. [23:56:26] nuria__: No, this is for the editing data workflow EL, I'm just adding this in for subsequent analysis in case it's a useful cut. [23:56:35] * James_F nods. [23:58:00] Out of curiosity, what feature is going to make use of this? [23:58:44] Editors. WT and VE in mobile/desktop/app. [23:58:57] And LQT and Flow and whatever else tools come along. [23:59:01] James_F: WT is ? [23:59:07] wikitext? [23:59:08] nuria__: The wikitext editor. [23:59:10] Yeah. [23:59:19] Gotcha. That, I think I can do better with. So, we're talking about deployments based on editing experience brackets? [23:59:31] No. [23:59:41] This isn't anything to do with A/B testing or deployments.
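Reading James_F's brackets above as half-open order-of-magnitude ranges (0, 1–9, 10–99, 100–999, 1000–9999, 10000+ — which also keeps 1 and 10 in separate buckets, as halfak suggests), a minimal sketch; the labels are made up:

```python
def edit_count_bucket(edit_count):
    """Bucket a lifetime edit count into order-of-magnitude experience brackets."""
    if edit_count == 0:
        return "0 edits"
    for upper in (10, 100, 1000, 10000):
        if edit_count < upper:
            return "%d-%d edits" % (upper // 10, upper - 1)
    return "10000+ edits"
```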