[00:11:25] Analytics-EventLogging, MediaWiki-extensions-RelatedArticles, MobileFrontend: Upstream Schema.js from MobileFrontend to EventLogging - https://phabricator.wikimedia.org/T117140#1767524 (bmansurov) NEW [00:21:19] Analytics-EventLogging, MediaWiki-extensions-RelatedArticles, MobileFrontend: Upstream Schema.js from MobileFrontend to EventLogging - https://phabricator.wikimedia.org/T117140#1767592 (bmansurov) [00:22:09] Analytics-Kanban: {loon} - https://phabricator.wikimedia.org/T117141#1767595 (kevinator) NEW [00:22:52] Analytics-Kanban: {loon} Refactor Data Dumps - https://phabricator.wikimedia.org/T117141#1767603 (kevinator) [08:03:47] (CR) Matthias Mullie: Measure the user responsiveness to notifications over time (10 comments) [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/249394 (https://phabricator.wikimedia.org/T108208) (owner: Matthias Mullie) [08:03:58] (PS4) Matthias Mullie: Measure the user responsiveness to notifications over time [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/249394 (https://phabricator.wikimedia.org/T108208) [09:29:07] Analytics-Kanban: Druid testing on labs to asses whether is a suitable Cassandra replacement. 
{slug} - https://phabricator.wikimedia.org/T116409#1768238 (JAllemandou) [09:29:26] Analytics-Kanban: Improve record size on cassandra storage for pageview API data {slug} - https://phabricator.wikimedia.org/T116209#1768239 (JAllemandou) [09:29:50] Analytics-Kanban: Add page_id to pageview_hourly when present in webrequest x_analytics header - https://phabricator.wikimedia.org/T116023#1768247 (JAllemandou) [09:30:56] Analytics-Backlog, Analytics-Kanban: Load Wikimedia JSON data into Altiscale "Research Cluster" HIVE - https://phabricator.wikimedia.org/T114489#1768251 (JAllemandou) a:JAllemandou [09:31:21] Analytics-Backlog, Analytics-Kanban: JsonRevisionsSortedPerPage failed on enwiki-20150901-pages-meta-history - https://phabricator.wikimedia.org/T114359#1768254 (JAllemandou) [09:31:48] Analytics-Kanban: JsonRevisionsSortedPerPage failed on enwiki-20150901-pages-meta-history - https://phabricator.wikimedia.org/T114359#1768255 (JAllemandou) a:JAllemandou [09:32:06] Analytics-Kanban: Load Wikimedia JSON data into Altiscale "Research Cluster" HIVE - https://phabricator.wikimedia.org/T114489#1697404 (JAllemandou) [09:32:26] Analytics-Kanban: Add page_id to pageview_hourly when present in webrequest x_analytics header - https://phabricator.wikimedia.org/T116023#1768265 (JAllemandou) a:JAllemandou [13:00:57] o/ joal [13:01:05] Hey halfak :) [13:01:12] batcave? [13:01:23] OMW :) [13:32:07] Analytics-Tech-community-metrics, Outreachy-Round-11: Outreachy Proposal for Improving MediaWikiAnalysis - https://phabricator.wikimedia.org/T116733#1768647 (01tonythomas) We are approaching the Outreachy'11 application deadline, and if you want to have your proposal considered to be part of this round, d... [13:59:30] ottomata: did some setting changes in our hadoop/hive config, both beeline and hue throw this error - Error while processing statement: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. Java heap space [13:59:43] ha, aye. 
[13:59:45] hm [13:59:45] when I try to run a query [13:59:57] that's probably why i should do this [13:59:57] https://phabricator.wikimedia.org/T110090 [14:00:00] unless those are just client errors [14:00:15] ah [14:00:16] what query are you running? can you paste query and all output? [14:00:21] i'm not sure [14:00:26] i can run from hive cli [14:00:31] just not beeline and hue [14:00:48] https://www.irccloud.com/pastebin/Dm72i7Om/ [14:01:09] ottomata: ^ [14:03:41] madhuvishy: and error? [14:03:43] output? [14:03:47] do you get that in shell? [14:04:03] https://www.irccloud.com/pastebin/YguOdmz9/ [14:04:09] this in hue [14:04:11] and beeline [14:04:15] in hive I get output [14:04:29] https://www.irccloud.com/pastebin/yV3mlKFr/ [14:11:56] holaaa [14:16:38] Analytics, Analytics-Kanban: View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. - https://phabricator.wikimedia.org/T116609#1768820 (Nuria) [14:22:16] hoaallloaaaa [14:23:28] madhuvishy: when you use beeline, do you specify a username? [14:26:34] (CR) Mforns: "Awesome! thanks Matthias." (3 comments) [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/249394 (https://phabricator.wikimedia.org/T108208) (owner: Matthias Mullie) [14:27:26] ottomata: yeah [14:27:30] -n madhuvishy [14:27:33] aye [14:38:54] nuria: do we have a backlog grooming meeting in 20 minutes? [14:39:21] madhuvishy: yes backlog grooming/tasking/pointing the reactive tasks of this week [14:39:44] okay [14:56:30] madhuvishy: try your query now [14:57:51] ottomata: cool it's launching [14:58:25] and the heap size seems to be set to 1024 too [14:58:37] which is great cos i wasn't able to set it from beeline [14:58:54] i think it's a server problem [14:59:25] which server? [14:59:53] hive-server [15:00:04] want to come to your meeting, will be there in just a few [15:03:06] halfak: got the thing to work !
[15:03:14] halfak: in a meeting now, will show after [15:04:00] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Browser Report. Small bugfixes {lama} [5 pts] - https://phabricator.wikimedia.org/T116931#1768910 (Milimetric) [15:04:32] joal, \o/ [15:05:22] Analytics-Kanban: Improve record size on cassandra storage for pageview API data {slug} - https://phabricator.wikimedia.org/T116209#1768930 (Nuria) Flatening table, instead of 16 rows there is one row that includes a dictionary: https://github.com/wikimedia/restbase/pull/381 Testing and update of tests [15:06:19] Analytics-Kanban: Improve record size on cassandra storage for pageview API data (RESTBase changes) {slug} - https://phabricator.wikimedia.org/T116209#1768939 (Nuria) [15:07:57] Analytics-Kanban: Improve record size on cassandra storage for pageview API data (RESTBase changes) {slug} [8 pts] - https://phabricator.wikimedia.org/T116209#1768941 (Milimetric) [15:08:35] Analytics-Kanban: Document Cassandra SLAS and storage requirements for daily and hourly data {slug} - https://phabricator.wikimedia.org/T116407#1768943 (Milimetric) a:Milimetric>Nuria [15:09:08] Analytics-Kanban: Document Cassandra SLAS and storage requirements for daily and hourly data {slug} [5 pts] - https://phabricator.wikimedia.org/T116407#1768953 (Milimetric) [15:11:07] Analytics-Kanban: Druid testing on labs to asses whether is a suitable Cassandra replacement. 
{slug} [8 pts] - https://phabricator.wikimedia.org/T116409#1768960 (Milimetric) [15:15:13] Analytics-EventLogging, MobileFrontend, Technical-Debt: MobileFrontend's schema code should be upstreamed to EventLogging - https://phabricator.wikimedia.org/T109398#1768971 (Jdlrobson) [15:15:15] Analytics-EventLogging, MediaWiki-extensions-RelatedArticles, MobileFrontend: Upstream Schema.js from MobileFrontend to EventLogging - https://phabricator.wikimedia.org/T117140#1768972 (Jdlrobson) [15:15:29] Analytics-EventLogging, MediaWiki-extensions-RelatedArticles, MobileFrontend: Upstream Schema.js from MobileFrontend to EventLogging - https://phabricator.wikimedia.org/T117140#1767524 (Jdlrobson) [15:15:59] Analytics-EventLogging, MediaWiki-extensions-RelatedArticles, MobileFrontend: Upstream Schema.js from MobileFrontend to EventLogging - https://phabricator.wikimedia.org/T117140#1767524 (Jdlrobson) [15:16:31] Analytics-Kanban, Patch-For-Review: Add lag option to reportupdater {frog} [8 pts] - https://phabricator.wikimedia.org/T117091#1768981 (Milimetric) [15:16:43] Analytics-EventLogging, MediaWiki-extensions-RelatedArticles, MobileFrontend: Upstream Schema.js from MobileFrontend to EventLogging - https://phabricator.wikimedia.org/T117140#1767524 (Jdlrobson) [15:17:33] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1768986 (Milimetric) Oh yeah... But they're not fully redacted anyway... hm... 
[15:19:22] Analytics-Kanban, Echo, Editing-Analysis, Flow, and 2 others: Analytics support for this task {frog} - https://phabricator.wikimedia.org/T117220#1768995 (Milimetric) NEW a:matthiasmullie [15:20:06] Analytics-Kanban: Analytics support for this task {frog} - https://phabricator.wikimedia.org/T117220#1768995 (Milimetric) a:matthiasmullie>Milimetric [15:20:17] Analytics-Kanban: Load Wikimedia JSON data into Altiscale "Research Cluster" HIVE [5] - https://phabricator.wikimedia.org/T114489#1769006 (Nuria) [15:20:48] Analytics-Kanban: Analytics support for echo dashboard task {frog} - https://phabricator.wikimedia.org/T117220#1768995 (Milimetric) [15:20:56] Analytics-Kanban: Load Wikimedia JSON data into Altiscale "Research Cluster" HIVE [5 pts] - https://phabricator.wikimedia.org/T114489#1697404 (Nuria) [15:24:11] Analytics-Kanban: Analytics support for echo dashboard task {frog} [8 pts] - https://phabricator.wikimedia.org/T117220#1769017 (Milimetric) [15:33:35] Analytics-Backlog, Research consulting, Research-and-Data: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#1769033 (DarTar) NEW a:ezachte [15:34:29] Analytics, Analytics-Kanban: View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. [5 pts] - https://phabricator.wikimedia.org/T116609#1769042 (Nuria) [15:37:06] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1769061 (DarTar) Is there any sensitivity we need to be aware of when publishing reports for small countries from the unsampled l... [15:49:21] Analytics-Backlog: Reformat pageview API responses to allow for status reports and messages - https://phabricator.wikimedia.org/T117017#1769108 (Nuria) Action item: Document error codes appropiately. Where? Mediawiki or wikitech . 
Probably mediawiki [15:49:35] Analytics-Kanban: Reformat pageview API responses to allow for status reports and messages - https://phabricator.wikimedia.org/T117017#1769109 (Nuria) [15:53:27] Analytics-Backlog: Are the per-article or top-100 lists meant to be working in the pageviews API yet? - https://phabricator.wikimedia.org/T117018#1769113 (Milimetric) We'll try to make the 404s easier to understand, but this is just because there's sparse data in this top endpoint. It basically needs to wait... [15:53:47] Analytics-Kanban: Pageview API showcase App {slug} - https://phabricator.wikimedia.org/T117224#1769114 (Nuria) NEW [15:54:36] Analytics-Kanban: Pageview API Press release {slug} - https://phabricator.wikimedia.org/T117225#1769125 (Nuria) NEW [15:54:48] Analytics-Backlog: Fix the pageview API "top" spec and 404 reporting {slug} - https://phabricator.wikimedia.org/T117018#1769132 (Milimetric) a:Milimetric [15:55:37] Analytics-Kanban: Pageview API documentation for end users {slug} - https://phabricator.wikimedia.org/T117226#1769137 (Nuria) NEW [15:56:49] Analytics-Kanban: Reformat pageview API responses to allow for status reports and messages {slug} - https://phabricator.wikimedia.org/T117017#1769144 (Nuria) [15:59:24] (PS1) Christopher Johnson (WMDE): makes base_uri a relative path [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/250028 (https://phabricator.wikimedia.org/T116150) [16:00:21] Analytics-Backlog: Fix the pageview API "top" spec and 404 reporting {slug} - https://phabricator.wikimedia.org/T117018#1769172 (Ironholds) Gotcha! (Could you give a workable per-article example too? 
Missing data is fine but I wanna have this API client working for launch ;)) [16:00:33] (CR) Christopher Johnson (WMDE): [C: 2 V: 2] makes base_uri a relative path [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/250028 (https://phabricator.wikimedia.org/T116150) (owner: Christopher Johnson (WMDE)) [16:10:15] Analytics-EventLogging, Editing-Department, Improving access, QuickSurveys, and 5 others: QuickSurveys: Schema changes - https://phabricator.wikimedia.org/T114164#1769212 (jhobs) a:jhobs Patch above updates to new revision. Does not include anything around the discussion of logging IP addresses. [16:11:08] Analytics-Kanban: Pageview API showcase App {slug} - https://phabricator.wikimedia.org/T117224#1769220 (Nuria) Compare pageviews for several pages.Example: US election candidates? [16:27:00] (CR) Matthias Mullie: "Yeah, sorry, that was sloppy!" (3 comments) [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/249394 (https://phabricator.wikimedia.org/T108208) (owner: Matthias Mullie) [16:27:26] (PS5) Matthias Mullie: Measure the user responsiveness to notifications over time [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/249394 (https://phabricator.wikimedia.org/T108208) [16:28:49] (CR) Milimetric: [C: 2 V: 2] Measure the user responsiveness to notifications over time [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/249394 (https://phabricator.wikimedia.org/T108208) (owner: Matthias Mullie) [16:29:23] (CR) Milimetric: "(the puppet that runs this still has to be deployed, but it's submitted, so data should show up soon. The dashboard was updated and has 3" [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/249394 (https://phabricator.wikimedia.org/T108208) (owner: Matthias Mullie) [16:30:46] Analytics: Track overall traffic, without any filtering, broken down into major categories, for internal use. 
- https://phabricator.wikimedia.org/T117236#1769279 (ezachte) NEW a:Nuria [16:38:49] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1769299 (ezachte) +1 hm on redacted numbers. It seems the end result has fallen between the cracks. For webstatscollector 3.0 d... [16:41:26] Analytics-Kanban: Pageview API showcase App {slug} - https://phabricator.wikimedia.org/T117224#1769309 (Nuria) Let's make a pointed example and deploy it to labs. Example can be on github and small code wise. [16:46:23] mforns: lemme know if you wanna brain bounce about the prototype [16:46:34] milimetric, sure! [16:46:57] I wanted to write down my personal goals before my one on one now, but afterwards, yes please! [16:47:38] anytime [16:48:07] milimetric, where's the pageview API repo? I've got some free time today and would like to spend it helping with that (even if it's just deprecating the all-years option) :) [16:48:16] (assuming it would be helpful for me to be involved/not cause more work than it solves [16:48:38] Ironholds: hm... yeah, sure [16:49:08] so the logic lives in RESTBase: https://github.com/wikimedia/restbase/blob/master/mods/pageviews.js#L156 [16:49:12] Analytics-EventLogging, MediaWiki-extensions-RelatedArticles, MobileFrontend: Upstream Schema.js from MobileFrontend to EventLogging - https://phabricator.wikimedia.org/T117140#1769353 (bmansurov) a:bmansurov [16:49:23] Ironholds: that's the line that talks about the controversial 404 decision [16:49:28] *nod* [16:49:29] madhuvishy: want to talk about nocookie? 
[16:49:41] madhuvishy: I have time but i do not want to keep you late [16:49:50] joal, kevinator, take more time if you need for your 1x1, you started late, I can wait, np [16:49:55] I'll probably start with all-years because restructuring the response to allow for avoiding 404s/202s is more complex and finicky and I'm new to the project [16:50:00] Ironholds: the all-years option is defined here: https://github.com/wikimedia/restbase/blob/master/mods/pageviews.js#L231 in code [16:50:05] and one sec, one more place [16:50:46] Ironholds: https://github.com/wikimedia/restbase/blob/master/specs/analytics/v1/pageviews.yaml#L202 [16:50:51] joal: let me know when you have time to look at the work i have done on the pageview spike, i must be missing something [16:50:52] (that's the documentation about it) [16:51:01] cool! Thanks! [16:51:08] those are the only places it exists [16:51:20] no thank *you* [16:51:25] "throwIfNeeded" christ, JS. [16:51:30] so Ironholds: npm install when you get it [16:51:35] *thumbs up* [16:51:35] oh I was proud of that name [16:51:53] milimetric, oh, it's great! It's just such a weird concept to me. Code that ~~helps you~~?! :P [16:51:53] and "npm test" runs the tests which use sqlite and you don't need cassandra [16:51:57] awesome [16:52:01] if you want to run the project itself I think you need cassandra [16:52:07] gwicke: yt? [16:52:15] but you should make sure the tests pass anyway (the date validation changes slightly without the all-years) [16:52:40] nuria: yup [16:52:40] Ironholds: yeah, JS only punches you in the face when you're not looking [16:52:47] hah. A good way of putting it. [16:52:49] not like C which just punches you in the face when you say hello [16:52:58] Analytics, Analytics-Backlog: Track overall traffic, without any filtering, broken down into major categories, for internal use. - https://phabricator.wikimedia.org/T117236#1769358 (Nuria) [16:53:16] is ((rp.year === 'all-years') ? 
'2015' : rp.year) basically an ifelse(condition, do_if_true, do_if_false)? [16:53:36] milimetric, wow, you've got as far as saying hello? Advanced C coder right here. [16:53:41] He has to DO STUFF before it complains :D [16:53:47] milimetric: sqlite is fine for small installs.. it just doesn't scale and replicate as well [16:53:49] gwicke: take a look at our capacity testing for the storage of the pageview API: https://wikitech.wikimedia.org/wiki/Analytics/PageviewAPI/DataStore [16:54:19] Ironholds: yes, ternary operator [16:54:24] cool! [16:54:43] === means "really equal" [16:54:52] jk, it's "type and value are both the same" [16:55:02] as opposed to == which is just value [16:56:48] ah, so like PHP [16:57:21] nuria: thanks [16:57:29] has it, also like PHP, got !== as the inverse to === because they never got around to incrementing the negative operators? [16:57:45] if I remember correctly (which I may not) [16:58:30] nuria: you might be interested in https://labs.spotify.com/2014/12/18/date-tiered-compaction/, as your data is time series as well [16:59:17] see https://phabricator.wikimedia.org/T117115 [17:02:44] we are also moving to multiple instances per box, which avoids the 800G limit [17:03:45] our staging cluster is already multi-instance [17:06:45] gwicke: lol, yes. sqlite has scaling issues for sure compared to cassandra :) [17:07:31] Ironholds: yes, !== and != are the corresponding negations. Not sure what incrementing negative operators means but i'm scared. And I don't wanna say it's the same as PHP, that sounds scary [17:08:22] milimetric, that is, !== should logically be the opposite of ==, not === [17:08:51] a-team: I completely forgot, we have to work on the wikimetrics tasks too. I'll create those next week maybe. [17:08:56] by incrementing: = is used for assignment since, what, B or C? 
But nobody bothered to make 'not equals' (!=) literally mean 'not equals' (!==) [17:09:09] nuria: ready when you want :) [17:09:12] in JS, you basically always want === and !== [17:09:20] yes [17:09:25] halfak: I have updated the etherpad we were working on with correct code [17:09:56] joal: batcave? [17:10:17] I think jshint will fail the tests when == or != is used [17:10:52] nuria: OMW [17:11:44] gwicke: noted, still, i worry about loading; it doesn't seem that cassandra can deal with the job of loading daily hourly resolution, but that needs more research [17:11:54] nuria: false ! [17:12:09] nuria: It handled it but not at the pace of two at the same time :) [17:12:12] joal: hourly resolution [17:12:15] SORRY! [17:13:04] nuria: try switching the compaction method [17:13:51] it's likely that it'll make a significant difference for your use case [17:14:59] gwicke: for the moment I wait to follow cassandra pace on compaction :) [17:15:24] Analytics-Backlog: Inspect Pageview API queriess (after launch ) {slug} - https://phabricator.wikimedia.org/T117242#1769462 (Nuria) NEW [17:16:49] joal: yup, leveled compaction (which is what we used so far) is creating a lot of IO by trying hard to mix old & new data [17:18:29] the query to switch is in https://phabricator.wikimedia.org/T117115 [17:19:28] that should probably be made permanent for AQS (as in, add it to the code to do it) [17:19:28] thanks a lot gwicke, I'll check that [17:19:51] Analytics-Kanban: Understand the Perl code for this report {lama} - https://phabricator.wikimedia.org/T117243#1769490 (Milimetric) NEW [17:20:19] Analytics-Kanban: Understand the Perl code for this report {lama} - https://phabricator.wikimedia.org/T117244#1769498 (Milimetric) NEW [17:20:37] Analytics-Backlog: Understand the Perl code for this report {lama} - https://phabricator.wikimedia.org/T117245#1769504 (Milimetric) NEW [17:20:46] mobrovac: yeah, I have WIP code for modification characteristic hints in the schema spec [17:20:52] cool! 
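Note on the ternary and equality operators from the 16:53 to 17:10 exchange: a minimal sketch in plain JavaScript, reusing the expression quoted from pageviews.js. The function names resolveYear and resolveYearIfElse and the sample rp objects are made up for illustration and are not part of RESTBase.

```javascript
// Ternary form, built around the expression quoted in the chat.
// `rp` stands in for the request parameters; its shape here is an assumption.
function resolveYear(rp) {
    return (rp.year === 'all-years') ? '2015' : rp.year;
}

// The same logic written as an explicit if/else.
function resolveYearIfElse(rp) {
    if (rp.year === 'all-years') {
        return '2015';
    } else {
        return rp.year;
    }
}

// == and != convert operands to a common type before comparing;
// === and !== require the type and the value to match.
const looseEqual = ('2015' == 2015);      // true: the string is coerced to a number
const strictEqual = ('2015' === 2015);    // false: string vs. number
const looseNotEqual = ('2015' != 2015);   // false: negation of ==
const strictNotEqual = ('2015' !== 2015); // true: negation of ===

console.log(resolveYear({ year: 'all-years' }), resolveYearIfElse({ year: '2014' }));
console.log(looseEqual, strictEqual, looseNotEqual, strictNotEqual);
```

So != pairs with == and !== pairs with ===, which is why sticking to the three-character forms avoids the coercion surprises mentioned in the chat.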
[17:21:22] Analytics-Backlog: Understand the Perl code for this report {lama} - https://phabricator.wikimedia.org/T117246#1769510 (Milimetric) NEW [17:21:48] Analytics-Kanban: Understand the Perl code for this report {lama} - https://phabricator.wikimedia.org/T117247#1769518 (Milimetric) NEW [17:21:58] Analytics-Kanban: Understand the Perl code for this report {lama} - https://phabricator.wikimedia.org/T117246#1769510 (Milimetric) [17:22:07] Analytics-Kanban: Understand the Perl code for this report {lama} - https://phabricator.wikimedia.org/T117245#1769504 (Milimetric) [17:28:46] milimetric, so good news, I patched it and it builds [17:28:49] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1769563 (Milimetric) >>! In T114379#1769061, @DarTar wrote: > Is there any sensitivity we need to be aware of when publishing rep... [17:28:52] Analytics-Kanban: {loon} Refactor Data Dumps - https://phabricator.wikimedia.org/T117141#1769564 (kevinator) [17:28:54] bad news, pre-existing unit tests complain [17:29:46] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1769569 (Milimetric) >>! In T114379#1769299, @ezachte wrote: > +1 hm on redacted numbers. > > It seems the end result has falle... [17:31:36] Analytics-Backlog: Fix the pageview API "top" spec and 404 reporting {slug} - https://phabricator.wikimedia.org/T117018#1769583 (Milimetric) The per article one should be much easier to guess, except RESTBase is very finnicky about trailing slashes, that's usually what's happening if you see {items:[]} So do... [17:35:31] Analytics-Backlog: Fix the pageview API "top" spec and 404 reporting {slug} - https://phabricator.wikimedia.org/T117018#1769597 (Ironholds) Cool! 
On the actual bug, https://github.com/wikimedia/restbase/pull/391 hopefully fixes the all-years element. [17:39:44] gwicke: Will try date-tier compaction next monday (trying would be too hazardous for my weekend serenity :) [17:41:31] joal: +2 guard your weekend [17:41:40] halfak: have you seen my last message? [17:41:45] yup milimetric :) [17:41:53] Ironholds: it's probably because of the comment I left on the PR [17:42:03] yeah, we are waiting with the next switch until Monday as well [17:42:05] stick rp.year + on what you pass to validateTimestamp [17:42:41] Also thx for the tip gwicke, it might indeed help a lot [17:43:09] milimetric, aha [17:43:13] I'll set a 'thoroughly thought about' value for base_time_second :) [17:43:20] gwicke: --^ [17:44:10] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1769642 (ezachte) I also vote for doing away with .mw, it's redundant, and confusing indeed. For percentage bots I changed 'OK-i... [17:44:16] joal: you are welcome! [17:44:42] milimetric, good catch! [17:46:49] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1769678 (ezachte) >>! In T114379#1769563, @Milimetric wrote: >>>! In T114379#1769061, @DarTar wrote: >> Is there any sensitivity... [17:48:07] milimetric, hey do you want to brainstorm? [17:49:01] to the batcave! [17:49:05] :] [17:55:07] halfak: I'd need a few minutes of your time when you're available :) [17:56:42] joal, sorry. Meeting meeting meeting. I've got to travel to a coffeeshop. Will be back in on ~ an hour [17:56:58] hm, I'll be gone [17:57:04] It will wait monday :) [17:57:11] Have a good weekend halfak :) [18:58:36] have a nice weekend a-team :] [18:59:14] bye! 
[18:59:24] good weekend marcel [18:59:57] ciao [19:08:07] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1770131 (Milimetric) The 10x larger numbers in webrequest vs. pageview_hourly are probably due to is_pageview being false for 90%... [19:57:24] Analytics, Analytics-Kanban: View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. [5 pts] - https://phabricator.wikimedia.org/T116609#1770277 (Nuria) Looking at project counts for wikinews for the 10th of July I can see huge volume of "supposed" pagaeviews fo... [20:01:37] Krinkle: BTW, we updated browser report to format % better plus include up to 0.1% of requests, otherwsise long tail was too long. please see: /wmf/data/archive/browser/general/mobile_web-2015-10-18 [20:02:19] Krinkle: we will let it bake a bit and after we will announce it to engineering, did you file a ticket about visualization? [20:39:10] Analytics, Analytics-Kanban: View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. [5 pts] - https://phabricator.wikimedia.org/T116609#1770335 (ezachte) Thanks Nuria, so we're zooming in on what happened. I'm still wondering though, how can we have 56 Special:... [20:43:53] Analytics-EventLogging, MediaWiki-extensions-RelatedArticles, MobileFrontend, Patch-For-Review, Reading Web Sprint 59 - Amsterdam and the hamsters: Upstream Schema.js from MobileFrontend to EventLogging - https://phabricator.wikimedia.org/T117140#1770337 (Jdlrobson) [20:51:22] Analytics-EventLogging, Editing-Department, Improving access, QuickSurveys, and 5 others: QuickSurveys: Schema changes - https://phabricator.wikimedia.org/T114164#1770351 (Jdlrobson) Release needed so I can test... [21:15:26] ottomata: do you happen to know how one can raise the heap size limit when using beeline on hive? 
[21:15:50] like https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Queries#Out_of_Memory_Errors_on_Client for the old hive client [21:16:25] (i discussed that earlier here with others from the team, and googled a bit, but wasnt able to find out) [21:18:29] kevinator: Whom can I bug to look at some hive related things I've been working on? -- https://www.mediawiki.org/wiki/User:BDavis_%28WMF%29/Projects/Action_API_request_analytics [21:19:20] Analytics-EventLogging, Editing-Department, Improving access, QuickSurveys, and 5 others: QuickSurveys: Schema changes - https://phabricator.wikimedia.org/T114164#1770373 (Jdlrobson) @phuedx @bmansurov - https://gerrit.wikimedia.org/r/250142 [21:19:41] bd808: nuria is a good person to ask. [21:21:00] The team just finished doing a browser report... there might be some similarities [21:21:57] cool. I poked her on the associated phab task [21:35:31] Analytics-Backlog: Allow metrics to roll up results by user across projects - https://phabricator.wikimedia.org/T117287#1770440 (Milimetric) NEW [21:41:48] Analytics-Backlog: Create a set of celery tasks that can handle the global metric API input - https://phabricator.wikimedia.org/T117288#1770461 (Milimetric) NEW [21:42:45] Analytics-Backlog: Create a set of celery tasks that can handle the global metric API input - https://phabricator.wikimedia.org/T117288#1770461 (Milimetric) [21:45:09] Analytics-Backlog: Build a public form that can hit the new API - https://phabricator.wikimedia.org/T117289#1770482 (Milimetric) NEW [21:46:22] Analytics-Backlog: Create a tool that can read the elements of a wiki template and call the API - https://phabricator.wikimedia.org/T117290#1770494 (Milimetric) NEW [21:46:41] Analytics-Backlog: Implement a simple public API to calculate global metrics {kudu} - https://phabricator.wikimedia.org/T117285#1770501 (Milimetric) [21:46:54] Analytics-Backlog: Create a tool that can read the elements of a wiki template and call the API {kudu} - 
https://phabricator.wikimedia.org/T117290#1770494 (Milimetric) [21:47:04] Analytics-Backlog: Build a public form that can hit the new API {kudu} - https://phabricator.wikimedia.org/T117289#1770482 (Milimetric) [21:47:10] Analytics-Backlog: Create a set of celery tasks that can handle the global metric API input {kudu} - https://phabricator.wikimedia.org/T117288#1770509 (Milimetric) [21:47:17] Analytics-Backlog: Allow metrics to roll up results by user across projects {kudu} - https://phabricator.wikimedia.org/T117287#1770510 (Milimetric) [21:48:18] Analytics, Analytics-Cluster, Fundraising-Backlog, Unplanned-Sprint-Work, and 2 others: Impression log parsers should get sample rate from filenames - https://phabricator.wikimedia.org/T116800#1770514 (K4-713) [22:05:38] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1770554 (ezachte) >>! In T114379#1770131, @Milimetric wrote: > The 10x larger numbers in webrequest vs. pageview_hourly are proba... [22:09:56] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1770570 (Milimetric) Ok. I think backfilling should be just a config change and a bunch of CPU / IO for the cluster. But I'll h... 
[22:10:12] kevinator: that task needs your attention ^ regarding wikipedia zero data [22:10:49] I know this morning you said most likely it had to go, but maybe we can just double check with them, I feel like that assumption could have potentially changed and it would not only save us some work but also give the community more interesting data [22:21:45] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1770604 (Ottomata) Today's [[ https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-10-30-18.17.html | EventBus RFC ]] [[ http://bots.wmflab... [22:23:13] Analytics-Tech-community-metrics, DevRel-October-2015: Correct affiliation for code review contributors of the past 30 days - https://phabricator.wikimedia.org/T112527#1770608 (Aklapper) Second iteration of identity merges and affiliation corrections pushed in 7ba9d95f0039e6b6583d09aa0002ff444a4212a1 Onc... [22:24:37] bd808: still there? [22:24:47] o/ [22:25:22] bd808: you had some questions? [22:25:31] nuria: https://phabricator.wikimedia.org/T116065 is the thing I'd like to get some feedback on [22:25:54] A rough plan is at https://www.mediawiki.org/wiki/User:BDavis_%28WMF%29/Projects/Action_API_request_analytics [22:26:59] bd808: ok, give me a sec to read .. [22:27:15] *nod* [22:31:01] bd808: i see, all data you want minus request parameters for a post is already on hive [22:31:11] bd808: are you aware of that? [22:31:41] the post request params is the part we really care about [22:32:23] that tells us what the api.php request is really doing (think article title in a normal request) [22:32:26] bd808: but, for example, for your user agent reports [22:32:47] bd808: you wouldn't need that , correct? [22:33:27] yeah, UA is just about hits by distinct UAs [22:33:39] bd808: you can set up those reports to run now with current data and work on parallel on the structure to gather data we do not have, makes sense? 
[22:33:57] bd808: so i would 1) get familiar with oozie by writing the Ua reports 1st [22:34:48] 2) work on parallel to setup the api->monolog kafka pipeline [22:35:12] great idea. [22:35:26] one thing we do not want to do is duplicate work, and, for example [22:35:42] ua processing for all your ua agents is already happening [22:35:51] we do not want to re-compute that [22:35:52] we want raw UAs [22:36:01] bd808: ditto for ip and geo-location [22:36:29] for api requests the UA is *supposed* to be unique by api consumer [22:37:09] but I agree we don't want duplicate data when it can be avoided [22:37:23] I haven't actually poked around in the existing hive data yet [22:37:41] I will plan on spending some time with that next week [22:38:27] bd808: ok, take a look at : [22:39:08] https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest [22:39:20] the "raw" ua is on wmf.webrequest [22:39:39] table holds processed UA (in the case of a custom one is of no help) and raw one [22:39:52] so your reports can use that data already [22:40:05] excellent [22:40:40] bd808: also referer is there which in your case might be of help [22:41:19] I would expect referrer to be empty most of the time actually, but it would be worth checking [22:41:45] bd808: cause oozie is not the most friendly platform and the more you can use of what is already done the better (computation wise) and dev wise (you will move faster) [22:43:54] bd808: there are oozie docs on wikitech [22:44:33] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Oozie [22:45:22] the data in webrequest looks like it would give me what I need to populate action_ua_hourly, assuming that I can derive what I'm calling ip_class (internal, external, labs) from webrequest.ip [22:45:32] is that a raw ip address? [22:45:51] or is it encrypted/hashed somehow? [22:46:27] oh, client_ip is what I'd want [22:46:56] bd808: right, you want to avoid ip processing as it is also being done as part of our ingestion pipeline.
[22:47:00] * bd808 remembers reading this now [22:47:00] bd808: see : https://github.com/wikimedia/analytics-refinery/tree/master/oozie/browser/general [22:47:22] for our most recent browser report (still baking, we deployed it today) [22:47:47] bd808: you will likely not need your intermediate tables for your report [22:48:33] bd808: but take a look so you can see how things work [22:48:54] I will poke around more for sure [22:49:45] it seems like we would still need the rollup tables for month over month and year over year reporting but maybe I don't get how the tools work [22:50:14] bd808: i see, if you want to have something besides a file, yes, you are correct [22:51:08] ok, that makes more sense [22:52:20] I'm designing a small datamart here with some pretty limited dimensions. Time will tell if Quim can really do anything actionable with the data. [22:54:39] bd808: sounds good , i liked your line of "We might want to know" needs to go out the window and be replaced with "these are the exact reports we need". [22:54:47] :) [22:55:25] people always want all the data! but it cost time, money and privacy to hoard it [22:55:31] bd808: so i think you got the right idea, we want to avoid to have stuff nobody cares about specially if it has privacy implications [22:56:46] bd808: once we have your browser reports we will take a second look to see where we stand , given that user agent and IP are present on the webrequest pipeline i would not publish those from kafka, that way you do not have private data [22:57:05] or, at least, sone private data [22:58:40] Analytics, Analytics-Kanban: View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. [5 pts] - https://phabricator.wikimedia.org/T116609#1770683 (Nuria) > I'm still wondering though, how can we have 56 Special:HideBanners requests for every real page request? Ve... 
[22:58:54] *some private data [22:59:07] it would be great to get it down to just the get/post details and the general internal/external/labs classification [22:59:50] that won't leak much at all especially if we only put selected parameters into the front of the pipeline [23:00:37] I don't want to put in page_id, rev_id, title info for example. The envisioned reports only care about the type of operation not the thing acted upon [23:02:10] I'll try using the existing data to make a UA report next week and then I should have some more informed questions to ask [23:02:19] thanks for your time nuria [23:02:38] milimetric: looking into it. I'll go up to talk to W0 [23:02:58] bd808: ok, i would work on schema+ publishing to kafka on parallel with user agent reports. This project has three parts: 1) ua reports (actionable now) 2) kafka publishing from api and persistence into hadoop 3) translation to hive data and reports [23:03:50] I've been waiting on hearing that the pipeline is working for Discovery before pushing too hard into #2 [23:04:20] Last time I checked they still hadn't turned on their Monolog->kafka pipeline [23:04:43] bd808: good, that is the way to go, do not worry oozie will keep you busy next week, testing is not as fast as we would like. if you ever programmed on java 10 years ago you wil feel right at home [23:04:48] *will [23:05:08] lol. I was a java coder from '99 to '06 so :) [23:09:29] bd808: same here so you are going to understand what i mean then [23:21:41] milimetric: are you still there? [23:37:46] Analytics-Wikistats: Monthly page view stats for wikibooks, wikinews, wikiquote, wikisource, wikiversity for July 2015 are extremely anomalous - https://phabricator.wikimedia.org/T116531#1770774 (Nemo_bis) [23:37:47] Analytics, Analytics-Kanban: View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects.
[5 pts] - https://phabricator.wikimedia.org/T116609#1770775 (Nemo_bis) [23:46:03] Analytics, Analytics-Kanban: View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. [5 pts] - https://phabricator.wikimedia.org/T116609#1770791 (Nemo_bis) https://wikimediafoundation.org/wiki/Template:Hide_banners gets millions of views and generates a dozen re... [23:59:36] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1770802 (kevinator) @milimetric, I spoke to @DFoy and there aren't any issues writing Wikipedia Zero pageviews numbers per projec...