[03:44:06] Analytics-Kanban, Reading-Admin: Visualization of Browser data to substitute current reports on wikistats - https://phabricator.wikimedia.org/T118329#1894186 (Milimetric) a:Milimetric
[07:25:29] Analytics-Backlog, Wikimedia-Developer-Summit-2016: Developer summit session: Pageview API overview - https://phabricator.wikimedia.org/T112956#1894283 (Legoktm) >>! In T112956#1891198, @Milimetric wrote: >> Honestly, if you're looking for a "broader audience", the developer summit is exactly the wrong...
[11:20:54] Analytics-Tech-community-metrics: VisualEditor listed among slowest code review repos by mistake? - https://phabricator.wikimedia.org/T122046#1894514 (Qgil) NEW
[11:22:51] Analytics-Tech-community-metrics: VisualEditor listed among slowest code review repos by mistake? - https://phabricator.wikimedia.org/T122046#1894522 (Qgil) Actually, EventLogging appears as 6th but https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/extensions/EventLogging,n,z doesn't look like...
[11:24:21] (CR) Qgil: "What is the status of this patch. WIP, abandoned, or really seeking code review?" [analytics/data-warehouse] - https://gerrit.wikimedia.org/r/189532 (https://phabricator.wikimedia.org/T88583) (owner: Mforns)
[11:39:37] Analytics-Tech-community-metrics: gerrit_review_queue can have incorrect data about patchsets "waiting for review" - https://phabricator.wikimedia.org/T121495#1894537 (Aklapper) This seems to be more wide-spread than I thought hence I'm adding this to our Jan/Feb-2016 radar.
[11:39:55] Analytics-Tech-community-metrics, DevRel-January-2016: gerrit_review_queue can have incorrect data about patchsets "waiting for review" - https://phabricator.wikimedia.org/T121495#1894539 (Aklapper)
[11:41:55] Analytics-Tech-community-metrics: VisualEditor listed among slowest code review repos by mistake? - https://phabricator.wikimedia.org/T122046#1894543 (Aklapper) Likely a duplicate of T121495.
[11:45:57] (CR) Mforns: "This is abandoned."
[analytics/data-warehouse] - https://gerrit.wikimedia.org/r/189532 (https://phabricator.wikimedia.org/T88583) (owner: Mforns)
[11:46:22] (Abandoned) Mforns: Adapt loading and automatic verification scripts [analytics/data-warehouse] - https://gerrit.wikimedia.org/r/189532 (https://phabricator.wikimedia.org/T88583) (owner: Mforns)
[11:52:41] take some rest joal
[11:52:44] see you tomorrow
[16:44:23] (PS1) Milimetric: Update data through December [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/260389
[16:44:39] (CR) Milimetric: [C: 2 V: 2] Update data through December [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/260389 (owner: Milimetric)
[16:54:17] Analytics-Backlog: update comScore description on report card - https://phabricator.wikimedia.org/T122059#1895086 (kevinator) NEW
[16:58:41] a-team I won't be at standup... but you can ping me to join if you need me
[16:59:20] How quickly does pageview data find its way to Hive?
[16:59:35] ok kevinator
[17:00:40] nuria: hi! The IPv6 patch just went out to production, and Krinkle did add unit tests :)
[17:00:54] Hey guys, really sorry but I won't make it for the standup today :(
[17:01:03] np at all elukey
[17:01:05] :]
[17:02:08] Analytics-Backlog: update comScore description on report card - https://phabricator.wikimedia.org/T122059#1895134 (kevinator) this note was just added by @milimetric on http://reportcard.wmflabs.org/ > As of August, 2015, we no longer update these numbers.
[17:02:14] ...though maybe the tests could have been more complete (not sure)
[17:03:12] Analytics-EventLogging, Analytics-Kanban: Update Eventlogging jrm tests so they include userAgent into capsule {oryx} [3 pts] - https://phabricator.wikimedia.org/T118770#1895137 (Nuria) Open>Resolved
[17:03:30] Analytics-Cluster, Analytics-Kanban, Easy: Update client IP in webrequest table to use IP [5 pts] {hawk} - https://phabricator.wikimedia.org/T116772#1895138 (Nuria) Open>Resolved
[17:03:39] Analytics-Kanban, operations: Move misc/udp2log.pp to a module, and role/logging.pp somewhere better - https://phabricator.wikimedia.org/T122058#1895141 (Ottomata)
[17:04:17] Analytics-Kanban, DBA: 2 hour outage to update mysql on EL slaves {oryx} [3 pts] - https://phabricator.wikimedia.org/T121120#1895147 (Nuria)
[17:04:19] Analytics-Kanban, Patch-For-Review: Fix EL mysql consumer's deque push/pop {oryx} [3 pts] - https://phabricator.wikimedia.org/T120209#1895146 (Nuria) Open>Resolved
[17:12:06] mforns: milimetric: nuria: kevinator: sorry for the bother, do you know how fast webrequest and pageview data fills up on Hive?
[17:12:22] AndyRussG: on standup sorry
[17:12:29] Ah k sorry!
[17:40:35] Analytics-Kanban, operations, Patch-For-Review: Move misc/udp2log.pp to a module - https://phabricator.wikimedia.org/T122058#1895283 (Ottomata)
[17:40:41] Analytics-Kanban, operations, Patch-For-Review: Move misc/udp2log.pp to a module [3 pts] - https://phabricator.wikimedia.org/T122058#1895064 (Ottomata)
[17:48:54] Analytics, MobileFrontend, Reading-Web-Sprint-63-E____________: Make MobileWebUIClickTracking schema usable (too big) - https://phabricator.wikimedia.org/T108723#1895305 (Jdlrobson)
[17:50:13] Analytics, MobileFrontend, Reading-Web-Sprint-63-E____________: Make MobileWebUIClickTracking schema usable (too big) - https://phabricator.wikimedia.org/T108723#1528729 (Jdlrobson)
[17:51:00] Analytics, MobileFrontend, Reading-Web-Sprint-63-E____________: Make MobileWebUIClickTracking schema usable (too big) - https://phabricator.wikimedia.org/T108723#1895307 (JKatzWMF) @jdlrobson I added 10% sampling to this
[17:52:56] Analytics-Kanban: Remove queue of batches from EL code - https://phabricator.wikimedia.org/T121151#1895308 (Nuria)
[18:01:04] a-tea m, ops meeting over, should I join tasking?
[18:01:08] a-team*
[18:01:31] no worries ottomata we're just looking at mforns's anonymization code
[18:06:02] k
[18:17:37] AndyRussG: webrequest is updated every hour
[18:18:15] nuria: ah cool! so... for example when would 16-17 hrs UTC be there? (apparently not there yet) Sorry to be pressuring the bits
[18:18:28] AndyRussG: when the cluster has availability to compute it
[18:18:39] hmmm
[18:18:54] Is there a dash or jobs viewer for that?
[18:19:14] AndyRussG: you can see what the cluster is doing:
[18:19:26] AndyRussG: yarn.wikimedia.org
[18:20:12] AndyRussG: even better: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0113816-150922143436497-oozie-oozi-C/?bundle_job_id=0113812-150922143436497-oozie-oozi-B
[18:20:30] that's for text
[18:32:26] ottomata: hmmm 500 server error...
[18:32:36] server error?!
[18:32:38] (for that URL ^)
[18:33:04] hmm not working for me now too
[18:39:44] (PS1) Mforns: [WIP] Sanitize pageview_hourly table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/260408 (https://phabricator.wikimedia.org/T118838)
[18:40:01] nuria, milimetric, madhuvishy ^
[18:40:13] mforns: yessir
[18:40:17] sweet, I'll wait for you to try those things and see more info after that
[18:40:24] ok milimetric
[18:40:32] oh well.... just hoping to see when I'd get today 16hrs UTC
[18:41:18] (CR) Milimetric: [C: 2 V: 2] "Looks great, nice job." (2 comments) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/253750 (https://phabricator.wikimedia.org/T118308) (owner: Madhuvishy)
[18:41:24] nuria, ottomata, I think what AndyRussG is asking for is a good rule of thumb to use: pageviews occurring now will be queryable in pageview_hourly in X hours.
[18:43:03] (CR) jenkins-bot: [V: -1] [WIP] Sanitize pageview_hourly table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/260408 (https://phabricator.wikimedia.org/T118838) (owner: Mforns)
[18:46:11] kevinator: I am not sure (and ottomata might know best) but i think it varies as the cluster might have less or more work, but pageviews need to be processed into the webrequest and pageview hourly tables. I think it is safe to say that pageviews from 24hrs ago should be there now, but I am not sure if we can be more specific in general; using pageviews as a monitoring tool for real-time changes is not the best idea
[18:48:45] yes, we should never claim we're going to get closer and closer to anything real-time. However, I have had a few curious inquiries about how long it takes for data to migrate through the system.
[18:52:39] pageviews or webrequest?
[19:03:15] ottomata: both
[19:13:18] AndyRussG: Both webrequest and pageview_hourly have 16utc now
[19:13:46] mforns_gym: One little thing to note - https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
[19:13:53] madhuvishy: woohoo! thanks!
[19:16:56] AndyRussG: if you want to look for yourself for any other future hours
[19:16:59] https://www.irccloud.com/pastebin/lcL3NLq3/
[19:18:00] madhuvishy: cool! Didn't know Hive could get hours in the future toooo :)
[19:18:17] AndyRussG: this is the scheduler, oozie
[19:18:52] it'll show you the status of the recently finished jobs, running ones, and the upcoming jobs it's waiting for data for.
[19:19:43] cool! nice :)
[19:19:50] the coordinator ids will change if we relaunch the coordinators for some reason (which is not very often), but should be fairly stable for the next couple of days at least while you are figuring stuff out. We can always fetch you the latest ids if you want to keep track. (Ideally hue should help avoid this but it's flaky so)
[19:47:51] Analytics-Tech-community-metrics, DevRel-December-2015: Improve Key performance indicator: code contributors new / gone - https://phabricator.wikimedia.org/T63563#1895635 (Aklapper) Related code is in `mediawiki-dashboard/browser/js/mediawiki.js` and `GrimoireLib/vizgrimoire/analysis/contributors_new_gone...
[20:14:27] * milimetric lunch
[20:21:17] Analytics-Tech-community-metrics, DevRel-December-2015: Improve Key performance indicator: code contributors new / gone - https://phabricator.wikimedia.org/T63563#1895837 (Aklapper) >>! In T63563#678030, @Qgil wrote: > Number of contributors stopping contributing or decreasing continuously in the past 3 m...
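For reference, a hedged sketch of one way to check from the Hive CLI which hourly partitions of webrequest have landed. This is not necessarily what the pastebin above contained; the table name and partition columns are assumptions based on the `wmf.webrequest` layout discussed in the channel:

```sql
-- List which hourly partitions exist for a given webrequest day.
-- Table and partition names are assumed from the discussion above.
SHOW PARTITIONS wmf.webrequest
  PARTITION (webrequest_source='text', year=2015, month=12, day=21);
```

An hour is presumably safe to query once its partition appears here and the corresponding oozie coordinator run shows as succeeded in hue.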
[21:40:37] (CR) Nuria: [C: 2 V: 2] Update avro schemas to use event-schema repo as submodule [analytics/refinery/source] - https://gerrit.wikimedia.org/r/260030 (owner: EBernhardson)
[22:18:36] mforns: don't know if you got my previous ping, but I was saying we should change the groupByKey operation
[22:18:53] https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
[22:21:31] madhuvishy, thanks for that, it's really interesting
[22:21:54] madhuvishy, I'll try it
[22:22:59] there's a difference though: in the example on that page: a->1,a->1,b->1,b->1 are reduced to a->2,b->2
[22:24:17] in our case: a->1,a->1,b->1,b->1 would be reduced to: a->(1,1),b->(1,1)
[22:24:49] because the reduced value is actually the collection of all rows
[22:25:00] mforns: yeah! we can maybe combine the keyBy and groupByKey and just have one reduceByKey operation
[22:25:06] ah just seeing your recent pings
[22:25:29] mforns: that's fine, our reduction operation won't be plus
[22:25:37] aha
[22:26:23] but you're right in that we won't need to distribute the key for each row
[22:26:35] just the key for each bucket
[22:27:00] yeah
[22:27:11] i think the shuffle is probably causing us problems
[22:27:29] which makes sense because you said when you increased the workers it became worse
[22:28:00] it was probably doing even more data transfer on the network
[22:28:11] madhuvishy, I think you're right, especially because caching/checkpointing is not working, so for the 2nd iteration, we are executing groupByKey over a groupByKey
[22:28:23] yeah
[22:29:01] cool, I'll try that :]
[22:30:37] madhuvishy, but I think I'll use combineByKey, which is the same, except the types of input and output can be different
[22:30:44] yeah alright
[22:30:50] reduceByKey wouldn't work I think
[22:33:31] * milimetric loves scala / spark
[22:33:33] nice find Madhu
[22:34:13] mforns: milimetric this is a nice set of slides they shared at the training
the other day - http://training.databricks.com/visualapi.pdf
[22:34:26] visualizes all the core rdd operations
[22:34:36] ooh
[23:12:07] mforns: did it help?
[23:12:16] madhuvishy, executing right now
[23:12:19] cool!
[23:16:30] mforns: did it finish or die?
[23:16:46] madhuvishy, hehehe, it didn't end, it seems an infinite loop
[23:16:56] aah
[23:17:06] stopped it
[23:17:11] ya i was wondering too - which count was getting triggered so many times
[23:17:32] mmm
[23:20:08] madhuvishy, run it again with a limit of 10 iterations
[23:21:12] madhuvishy: what i do not get is what groupByKey is useful for then
[23:21:34] nuria: I think they are actively discouraging people from using it
[23:21:49] for huge amounts of data
[23:21:54] madhuvishy: right, seems like it should be deprecated
[23:22:12] aha
[23:22:17] well it does what it's supposed to do - i think there's a plan to make it more efficient
[23:24:02] mforns: why does getDimensionStats do a collect?
[23:24:26] is it running over all the rows?
[23:24:49] * mforns looks
[23:29:28] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1896565 (Nuria) @jcrespo: Let us know what table you would like to tackle on tokudb 1st.
[23:29:56] madhuvishy, the collect is applied to the query results, the selectExpr query is the one that groups by field, so the collect is applied to 19 elements only
[23:30:02] it is not the rdd of rows
[23:30:04] mforns: ah okay
[23:30:10] it is an rdd of the query results
[23:30:24] right, makes sense
[23:30:34] did your current job succeed?
[23:30:39] yes
[23:30:56] but it was sampled to 0.00001
[23:31:09] I saw another flaw
[23:31:48] hmmm?
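The groupByKey vs reduceByKey/combineByKey trade-off discussed above can be sketched in plain Python (this is a sketch of the semantics, not Spark code; all names here are illustrative): groupByKey shuffles every (key, value) pair across the network, while combineByKey pre-aggregates within each partition so only one combined value per key per partition is shuffled.

```python
def group_by_key(partitions):
    """groupByKey semantics: every (key, value) pair is shuffled individually."""
    shuffled = {}  # everything crosses the "network"
    for part in partitions:
        for key, value in part:
            shuffled.setdefault(key, []).append(value)
    return shuffled

def combine_by_key(partitions, create, merge_value, merge_combiners):
    """combineByKey semantics: pre-combine per partition, then merge combiners."""
    combined_parts = []
    for part in partitions:  # map side: local pre-aggregation
        local = {}
        for key, value in part:
            local[key] = merge_value(local[key], value) if key in local else create(value)
        combined_parts.append(local)
    result = {}  # shuffle: only one combiner per key per partition moves
    for local in combined_parts:
        for key, comb in local.items():
            result[key] = merge_combiners(result[key], comb) if key in result else comb
    return result

# Two "partitions" of (key, value) pairs, as in the databricks example.
parts = [[("a", 1), ("a", 1), ("b", 1)], [("b", 1), ("a", 1)]]
counts = combine_by_key(parts,
                        create=lambda v: v,
                        merge_value=lambda c, v: c + v,
                        merge_combiners=lambda c1, c2: c1 + c2)
# counts == {"a": 3, "b": 2}
```

Here five raw pairs would cross the shuffle with group_by_key, but only four pre-combined values with combine_by_key; the gap widens with data volume. And because combineByKey lets the combiner type differ from the value type, it fits mforns's case, where the reduction produces a collection of rows rather than a sum.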
[23:32:23] madhuvishy, for some very rare buckets it may take many iterations to get to the desired K
[23:32:35] ah
[23:32:47] and we are repeating the process for all buckets, even if they are already > K
[23:33:03] oh
[23:33:12] we should repeat the process for the buckets that are < K
[23:33:13] right
[23:33:14] only
[23:33:22] can you do a filter?
[23:33:33] otherwise the long tail is going to be reeeeeally long :]
[23:33:37] yeah!
[23:34:15] we need to manage to split the rdd no? into ok buckets and not-ok buckets
[23:34:39] hmmm
[23:34:43] * madhuvishy looks
[23:37:24] mforns: what should truly be recursive? i was wondering if anonymizeBucket can be recursive instead of anonymizeDataset
[23:37:35] maybe i am missing something though
[23:37:54] madhuvishy, mmmm I don't think so
[23:38:11] anonymizeDataset needs to be either iterative or recursive
[23:38:24] it is the one that checks the fixed point condition
[23:38:56] ah yes i think i understood it better now
[23:39:01] we need to work the whole dataset before deciding if we do another step
[23:39:15] at least the part of the dataset that we know is below K
[23:39:23] right
[23:40:07] so yeah, what you say should work - we'll have one more param to the recursive function, that'll behave as an accumulator for the already sanitized buckets?
[23:40:32] that's what I imagine, does it make sense?
[23:41:14] madhuvishy, maybe we don't need another parameter
[23:41:30] we just split into 2 rdds, the ok and the not-yet-ok
[23:41:37] mforns: it makes sense in my head yes
[23:41:45] and call recursively passing only the not-yet-ok
[23:42:00] and then merge the results with the ok
[23:42:03] and return
[23:42:17] mforns: oh you don't need the ok rdd?
[23:42:20] then ya
[23:42:50] I don't think so, no?
[23:42:52] you can just filter the rdd even.
before flatmap, filter all buckets with K>20
[23:42:59] aha
[23:43:23] well I do need the ok-rdd
[23:43:35] but I don't need to pass it to the recursive call
[23:43:46] oh how would you keep track of it then?
[23:43:55] mmm
[23:44:15] also, i assumed it would grow in every recursion
[23:45:25] mmm, batcave?
[23:46:02] mforns: i'm heading out for an errand :(
[23:46:08] oh, np
[23:46:20] difficult to talk recursive in IRC :]
[23:46:55] mforns: he he yes
[23:47:21] but I think the initial RDD is going to be split into 2 RDDs, the first one is going to be final, the second is going to be passed to the recursive call
[23:47:55] then, the recursive call is going to split this second part into ok and not-ok, and pass the not-ok to its rec-call
[23:47:57] and so on,
[23:48:16] every recursive call stores a part of the rdd, but there is no duplication I think
[23:49:02] when returning through the recursive stack, RDDs are going to be merged
[23:49:06] if that is possible...
[23:55:12] Ya I'm not able to picture how the OK rdds will get merged
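The split-and-recurse scheme mforns describes can be sketched in plain Python (dicts standing in for RDDs; `K`, `generalize`, and the data shape are illustrative assumptions, not the actual refinery code): filter out the ok buckets, recurse only on the not-yet-ok part, and merge on the way back up the stack, so the ok pieces never need to be passed down.

```python
K = 3  # minimum bucket size (illustrative threshold)

def generalize(key):
    """Coarsen a bucket key by dropping its last dimension (assumption:
    keys are tuples of dimension values)."""
    return key[:-1]

def anonymize(buckets):
    """buckets: dict mapping key tuple -> list of rows.
    Split into ok (>= K rows, or fully generalized) and not-yet-ok;
    recurse only on the not-yet-ok part, merging results on return."""
    ok = {k: rows for k, rows in buckets.items() if len(rows) >= K or not k}
    not_ok = {k: rows for k, rows in buckets.items() if len(rows) < K and k}
    if not not_ok:
        return ok  # fixed point reached: nothing left to coarsen
    coarser = {}  # re-bucket the small buckets under generalized keys
    for key, rows in not_ok.items():
        coarser.setdefault(generalize(key), []).extend(rows)
    result = dict(ok)  # the ok part stays here, never passed down
    result.update(anonymize(coarser))  # merge on the way back up
    return result

buckets = {
    ("en", "desktop"): ["r1", "r2", "r3"],  # already >= K: final
    ("es", "mobile"): ["r4"],               # < K: needs generalization
    ("es", "desktop"): ["r5"],
}
sanitized = anonymize(buckets)
# ("en", "desktop") survives intact; the two small "es" buckets end up
# merged under a coarser key.
```

Note the termination guard: a fully generalized (empty) key is kept even below K since it cannot be coarsened further; real code would presumably suppress such rows instead. Because each recursive call keeps its own ok part and only unions in what the deeper call returns, no bucket is duplicated, which answers the merging question at the end of the log.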