[11:14:33] Analytics-Tech-community-metrics, ECT-April-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1220384 (Aklapper)
[12:05:22] Analytics-Tech-community-metrics, Wikimedia-Git-or-Gerrit, ECT-April-2015: Active Gerrit users on a monthly basis - https://phabricator.wikimedia.org/T86152#1220423 (Aklapper) > A basic data point that we are missing at http://korma.wmflabs.org/browser/scr.html: > How many users perform any kind of ac...
[12:06:45] Analytics-Tech-community-metrics, ECT-April-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1220424 (Aklapper)
[12:08:15] Analytics-Tech-community-metrics, ECT-April-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1167361 (Aklapper) Edited the task description. Need help to make that list way shorter. I've marked items that don't convince me w...
[12:09:09] Analytics-Tech-community-metrics, ECT-April-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1220429 (Aklapper)
[12:26:17] Analytics-Tech-community-metrics, ECT-April-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1220440 (Qgil) What about mediawiki.org editors? * New editors (editors reaching their 10th edit on a given month). * Active editor...
[12:26:21] Analytics-Tech-community-metrics, ECT-April-2015: Maniphest backend for Metrics Grimoire - https://phabricator.wikimedia.org/T96238#1220441 (Aklapper) @Dicortazar: Could you elaborate a bit on the complexity of this task given the MetricsGrimoire architecture that I am not familiar with? I'm interested t...
[12:48:26] Analytics-Tech-community-metrics, Wikimedia-Git-or-Gerrit, ECT-April-2015: Active Gerrit users on a monthly basis - https://phabricator.wikimedia.org/T86152#1220453 (Dicortazar) @Aklapper, that panel is work in progress of this task. We're still missing two points in the panel to close this task: * R...
[12:57:52] Analytics-Tech-community-metrics, ECT-April-2015: Maniphest backend for Metrics Grimoire - https://phabricator.wikimedia.org/T96238#1220472 (Dicortazar) @Aklapper, there's an introduction to the Metrics Grimoire toolset in the Puppetization of the tools [1]. We may provide extra info about the architectur...
[13:43:44] Ironholds: Good morning sir
[13:44:05] Ironholds: Would you give some time today for a few questions about pageviews implementation again ?
[14:03:22] ottomata: Helloooo !
[14:03:37] ottomata: you there ?
[14:03:55] yello
[14:03:57] yuppers
[14:04:05] have 5 mins for me ?
[14:04:11] getting through emails, talking to mark
[14:04:12] suuuuure
[14:04:16] k
[14:04:44] Wondering if we shouldn't separate tables for mobile/text webrequests and bits/uploads
[14:05:03] Cause bits/uploads are the ones taking a long time now
[14:05:07] to refine
[14:06:06] hm,
[14:06:08] why would that help?
[14:06:39] Just not to have to refine them with as many fields
[14:06:59] ?
[14:07:02] For instance, is_pageview is not to be calculated on those pages
[14:07:39] like, they'd have a different schema?
[14:07:40] And I wonder if geocoded data etc would really be needed for those partitions
[14:07:41] Yeah, there would be different tables with different schemas
[14:07:49] ah, hm, but, somehow I doubt that that is the bottleneck though, right?
[14:07:57] To prevent too much refinery
[14:08:14] there'd still have to be refinement, and to do that the json data needs to be read from disk
[14:08:25] once it is in memory, i don't think it would really slow things down that much to add extra fields
[14:08:26] Currently, bits and uploads are the ones taking longer in the refinery
[14:08:34] that's because they are big!
[14:08:55] hm ...
[14:08:55] what I would prefer to do is investigate the possibility of making camus write parquet data
[14:08:59] rather than the raw json
[14:09:01] hmM
[14:09:01] or
[14:09:03] would that really help?
[14:09:17] maybe not actually, the refinement phase would still have to read all columns from disk
[14:09:26] Well, if you output refined data outside of camus, you win some step
[14:09:51] hm
[14:10:17] mouah, you're right we still need to refine those partitions
[14:10:22] nevermind
[14:10:23] :)
[14:11:31] yah, but i'm sure there are refinement optimizations we could do (not hive?)
[14:11:58] I think so as well
[14:27:49] Analytics-Tech-community-metrics, ECT-April-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1220642 (Ironholds) Probably - @halfak ?
[14:39:35] holasas
[14:39:38] mooornin
[14:40:50] hello
[14:55:01] ottomata, joaL; are you in teh middle of something or you have time for 1 question
[14:55:35] sorry, ottomata & joal: are you in the middle of something or you have time for 1 question
[14:55:43] got time, gimme 5 mins
[14:57:51] (PS2) Milimetric: [WIP] Build process work, refactor in progress [analytics/dashiki] - https://gerrit.wikimedia.org/r/204951
[15:03:29] hey nuria I have time now
[15:04:39] jaja hi nuria :)
[15:04:55] joal, ottomata ok, then for the output of the mobile job, i just realized
[15:05:11] that insert values () is not supported on hive 0.13, the one we run
[15:05:33] then ... joal, ottomata how do you think the output should be?
[15:05:42] Insert values is not yet supported in hive
[15:05:47] joal, ottomata : it is easy to create a table:
[15:06:01] joal: right, support starts on version 14
[15:06:09] You can normally insert without overwriting if schema doesn't change
[15:06:26] joal: not records though, right
[15:06:28] ?
[15:06:40] hmmm
[15:06:48] joal: also that doesn't work cause you want overwrite if you re-run the job
[15:06:53] syntax is definitely not insert value
[15:07:20] Both modes exist: append or overwrite
[15:08:05] joal: but overwrite is per table, not per record, correct?
[15:08:17] or per partition, right
[15:08:45] joal: but for an output table for this job do we want partitions?
[15:08:50] joal: not really right?
[15:09:16] nuria: could you register the RDD as a temp table and then select from it to insert into your final table?
[15:09:19] joal: job spits out 10 values per date (a month or day)
[15:09:32] registerRDDAsTable
[15:09:42] ottomata: still no override
[15:09:51] ottomata: right?
[15:09:55] What I'd do is the same trick I did for mobile uniques
[15:10:06] ottomata: so a re-run of the job would not override old values
[15:10:18] overwrite you mean?
[15:10:24] ottomata: current table is like this (but can be changed to anything)
[15:10:30] why not?
[15:10:32] Create a temporary table appending old values and newly computed ones, replace old table by new one
[15:10:42] https://www.irccloud.com/pastebin/zeYh3XPf
[15:10:49] INSERT INTO mobile_apps_uniques_table SELECT FROM temp_rdd_table WHERE ....
[15:10:53] INSERT OVERWRITE
[15:10:54] *
[15:10:56] i mean
[15:11:12] ottomata: but insert overwrite overwrites a partition, not a record
[15:11:14] hmm, nuria
[15:11:28] i don't think you need to create the output table in the spark job, do you?
[15:11:32] that we will probably just do in Hive?
[15:11:39] ottomata: agreed
[15:12:03] creation of a temporary table would be necessary though
[15:12:18] aye, or, nuria, maybe that hacky file output thing I did is ok?
[15:12:29] ottomata: ah ok, i thought that was your preference, the table that is,
[15:12:30] if we use an external hive table, you can just overwrite the partition's directory?
[15:12:37] i mean, i don't know what is best
[15:12:45] i've never done this before :)
[15:12:55] the hacky file output thing felt very hacky, but maybe it is ok?
[15:12:56] ottomata: puf me neither, but it *seems*
[15:13:28] ottomata: that spark job doesn't interact with hive/oozie at all, it is just a stand alone thing
[15:13:28] ottomata, nuria : hive reads files
[15:13:43] ottomata: so it'd be better if it cleans up and manages itself but again ....
[15:13:43] the temp table might work well, especially if it just lets you run a hiveish query in spark that selects from an RDD and inserts into an external hive table
[15:13:55] so if you change a file in a given partition folder (as long as schema works), everything works fine
[15:13:59] ottomata: but the output of this program is not an RDD
[15:14:07] ja i know
[15:14:58] hiveContext.registerRDDAsTable(resultRDD, "tempResultRDDTable")
[15:15:04] joal, ottomata : ok let's talk about the details after standup then?
[15:15:27] hiveContext.sql("INSERT OVERWRITE final_output_table SELECT FROM tempResultRDDTable ...")
[15:15:28] ?
[15:15:41] nuria you can tell spark to make an rdd out of your values, particularly: you can create an rdd out of the table you have, add a line to this rdd, and save with overwriting to the right table
[15:16:25] nuria: no idea if ^^ actually works though
[15:16:46] joal: but that seems really cumbersome, right? why would we want an RDD of a matrix with 20 elements?
[15:17:20] joal: not saying that is maybe not the only way
[15:17:36] it is just an idea indeed :)
[15:17:40] :)
[15:19:15] hive table --> RDD <-- computed values
[15:19:39] |-> save overwriting table
[15:20:49] joal: ok, let's talk about this after stand up so i can wrap this up today
[15:20:58] nuria, that is what happens with the hacky file write too, no?
[15:20:59] k
[15:21:05] .coalesce
[15:21:13] very small RDD with one partition
[15:21:32] ottomata: yes, but it sure seems a lack on spark
[15:21:39] spark's side
[15:21:41] yeah
[15:21:50] computing GREAT, reporting TERRIBLE
[15:22:13] nuria, maybe just
[15:22:15] saveAsTextFile()
[15:22:16] they forgot that part of the framework
[15:22:16] ?
[15:22:20] rdd.saveAsTextFile
[15:22:20] ?
[15:22:33] ottomata: ya, the way you had it pretty much
[15:23:46] ottomata: sounds fine too
[15:48:49] ottomata: wait, come back and let's talk about spark!
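An aside on the output discussion above (15:05-15:23): Hive only gains INSERT ... VALUES in 0.14, but the temp-table trick sketched at 15:14-15:15 is enough on 0.13. Below is a minimal Spark 1.3-era sketch of that pattern; the job, table, and column names are illustrative assumptions rather than the actual refinery code, and the target table is assumed to already exist in Hive, partitioned by (year, month).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object MobileReportOutputSketch {
  // Illustrative result schema: the job emits a handful of (metric, value) rows.
  case class MetricRow(metric: String, value: Long)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mobile-report-output"))
    val hc = new HiveContext(sc)
    import hc.implicits._

    // Register the ~10 computed values as a temporary table.
    val results = Seq(MetricRow("uniques", 12345L), MetricRow("pageviews", 67890L))
    sc.parallelize(results).toDF().registerTempTable("tmp_results")

    // INSERT OVERWRITE replaces only the named partition, so re-running the
    // job for the same month overwrites its own previous output and nothing else.
    hc.sql("""
      INSERT OVERWRITE TABLE mobile_apps_report PARTITION (year = 2015, month = 4)
      SELECT metric, value FROM tmp_results
    """)
  }
}
```

The alternative ottomata raises at 15:21-15:22, rdd.coalesce(1).saveAsTextFile(...) into the directory backing an external Hive table's partition, gets the same idempotent re-runs without going through HiveQL, at the cost of managing partition paths (and ADD PARTITION statements) by hand.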
[15:49:08] ottomata: holaaaaa
[15:49:46] haha ok
[16:08:07] (PS1) Joal: [WIP] Add spark utility to parse wikidumps [analytics/refinery/source] - https://gerrit.wikimedia.org/r/205277
[16:08:31] nuria, ottomata : You have some code in here --^
[16:09:01] Both maven variable (ottomata), scala test, and a hadoop file format to read XML
[16:09:06] Going to ERL
[16:09:10] EL sorry
[16:11:11] ottomata: for testing we want this one
[16:11:17] https://www.irccloud.com/pastebin/KuJ5DXgO
[16:12:26] milimetric, mforns : one thing we need to do is to reboot the wikimetrics instances too
[16:12:35] nuria, ok
[16:13:07] * milimetric runs to lunch
[16:13:13] but i'll be back in 30
[16:14:47] Analytics, Analytics-Kanban, WMF-Product-Strategy: Backfill pageview data for March 2015 from sampled logs before transition to UDF-based reports as of April - https://phabricator.wikimedia.org/T96169#1220905 (Ironholds) Note that when the task says "help with" it means "do all of".
[16:17:38] Analytics, Analytics-Kanban, WMF-Product-Strategy: Backfill pageview data for March 2015 from sampled logs before transition to UDF-based reports as of April - https://phabricator.wikimedia.org/T96169#1220924 (Ironholds) (Running, btw ;p)
[16:46:18] nuria, joal and I have a doubt about el backfilling
[16:46:26] mforns: yesssir
[16:46:43] nuria, if we insert an event that is already there, is it duplicated?
[16:47:09] mforns: there is a parameter on jrm.py that controls that
[16:47:34] nuria, replace
[16:47:35] ok
[16:47:50] mforns: let me add 1 thing to the wiki
[16:51:47] mforns: to backfill prior data (since i did it with events 1 by 1 as the dropping was happening at all times) i had to make a bunch of changes, one of them was changing from batch inserts to "insert sequential": https://gerrit.wikimedia.org/r/#/c/190139/7/server/eventlogging/jrm.py
[16:53:03] mforns: and since events are inserted 1 by 1 you can choose to ignore duplicates; if you insert a batch with INSERT IGNORE and there is 1 duplicate in the batch I think the batch will not get inserted
[16:57:10] Analytics-Engineering, Analytics-EventLogging, Analytics-Kanban: Tungsten deprovision, substitute host in Eventlogging setup - https://phabricator.wikimedia.org/T93920#1221064 (fgiunchedi) Open>Invalid a:fgiunchedi ah, I think this was resolved on irc, the host running those processes is hafni...
[17:14:31] ottomata: I can't log onto deployment-eventlogging02.eqiad.wmflabs :(
[17:16:01] ottomata: Is there anything you could do for me about that ?
[17:18:07] joal: wait I think i know
[17:18:48] joal: waht is your user on wikitech?
[17:18:52] *what
[17:19:01] Joal
[17:20:29] joal: labs is a mystery to me but i think i need to add you here: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep
[17:21:03] seems so !
[17:21:18] nuria: Woa, a prod in a lab :)
[17:21:24] So cool !
[17:22:13] joal: so you can ssh
[17:22:21] joal: correct?
[17:22:32] joal: the sudo part I am not sure about
[17:22:43] still no ssh :(
[17:22:53] joal: do try now
[17:23:04] nope
[17:23:20] login/logout wikitech ?
[17:23:46] is it "joal" or "Joal" ?
[17:24:15] Written Joal on the web, joal as user
[17:27:37] try now, please
[17:27:44] ^joal
[17:28:09] still no chance :(
[17:29:12] So it seems the username is Joal (as per the login)
[17:29:55] joal: ok, let me try one more thing, otherwise otto is the man
[17:29:56] nuria: meeting with Kevin
[17:30:05] ehhhh?
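A note on the backfilling exchange at 16:46-16:53: EventLogging's consumer (jrm.py) is Python, so the Scala/JDBC sketch below only illustrates the SQL semantics nuria describes; the connection string, table, and column names are invented for the example. With a plain multi-row INSERT, a single duplicate key aborts the whole statement, which is the risk she points out; inserting events one at a time with INSERT IGNORE makes each duplicate a silent no-op instead.

```scala
import java.sql.DriverManager

object BackfillSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative connection; needs the MySQL JDBC driver on the classpath.
    val conn = DriverManager.getConnection("jdbc:mysql://localhost/log", "user", "pass")
    try {
      // INSERT IGNORE skips any row that violates the table's unique key
      // (assumed here to be the event uuid), so re-inserting an event that
      // is already present does nothing rather than failing.
      val stmt = conn.prepareStatement(
        "INSERT IGNORE INTO SomeSchema_12345 (uuid, timestamp, event) VALUES (?, ?, ?)")
      val events = Seq(("uuid-0001", "20150421120000", """{"key": "value"}"""))
      for ((uuid, ts, json) <- events) {
        stmt.setString(1, uuid)
        stmt.setString(2, ts)
        stmt.setString(3, json)
        stmt.executeUpdate() // one statement per event: the "sequential insert" mode
      }
      stmt.close()
    } finally {
      conn.close()
    }
  }
}
```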
[17:30:07] We'll see that after
[17:30:47] joal: I'm in the bat-cave
[17:30:58] Thx a lot nuria :)
[17:31:14] joining
[17:31:29] Alternate batcave for me kevinator ...
[17:32:09] hmm, i'll try rejoining
[17:36:50] joal: ok, please try again
[17:37:07] works !
[17:37:11] Thx a lot nuria
[17:37:25] again :)
[17:38:33] joal: man, wikitech is the most confusing thing ... documented what i did: https://wikitech.wikimedia.org/wiki/EventLogging/Testing/BetaLabs#Give_people_access
[18:07:54] Analytics-Kanban, Analytics-Wikimetrics, Community-Wikimetrics, Patch-For-Review: Utf-8 names on json reports appear as unicode code points: "\u0623\u0645\u064a\u0646" - https://phabricator.wikimedia.org/T93023#1221405 (Capt_Swing) @Nuria saw that you reopened this task. Quick question: does it sti...
[18:12:59] Analytics-Kanban, Analytics-Wikimetrics, Community-Wikimetrics, Patch-For-Review: Utf-8 names on json reports appear as unicode code points: "\u0623\u0645\u064a\u0646" - https://phabricator.wikimedia.org/T93023#1221418 (Nuria) @Cap_swing: T74747 is deployed. There is a bug on the report when user...
[18:16:06] Analytics-Cluster, Analytics-Kanban: Expand people's ability to use Hive/Cluster {hawk} - https://phabricator.wikimedia.org/T94903#1221430 (Nuria) This has two parts: - better resource allocation (someone running a crazy query cannot take cluster down). Done on week 15th april - impala for easy querying,...
[18:19:07] Analytics-Cluster, Analytics-Kanban: Expand people's ability to use Hive/Cluster {hawk} - https://phabricator.wikimedia.org/T94903#1221435 (Ottomata) I don't know if we want to commit to Impala at this time. I'm probably going to install it so we can try it out, but I'm not certain that we will be able t...
[18:22:59] Analytics-Cluster, Analytics-Kanban: Expand people's ability to use Hive/Cluster {hawk} - https://phabricator.wikimedia.org/T94903#1221478 (Nuria) @ottomata: fair enough, just scoping it for usage should be plenty sufficient at this time.
[19:17:14] ottomata: yt?
[19:36:02] ja, but in meeting...
[19:36:05] k
[19:41:01] nuria, done, what's up?
[19:44:52] ottomata: for saving files...
[19:44:53] ottomata: do we want snappy?
[19:45:00] for these small text files, don't care
[19:45:10] ottomata: k
[19:45:12] probably not, because we want them more easily readable and copyable
[19:45:23] gz is fine, but they are small enough that uncompressed is cool
[19:45:25] right?
[19:45:30] ok
[19:45:49] i've seen that they are saved as snappy by default
[19:48:26] but i will shoot for plain files
[19:51:35] ja that is the default final output compression
[19:59:22] ottomata: so, ideally, we want to save the files as text files then?
[19:59:37] ottomata: but PLAIN text files (w/o snappy compression)
[20:00:38] yeah, think so
[20:00:42] right? how big are these?
[20:07:28] halfak: Hello !
[20:07:34] halfak: you there ?
[20:07:44] Hi joal
[20:08:12] You won't believe it halfak, got that xml-json extraction in spark running on altiscale !
[20:08:32] Took me a week though :(
[20:08:46] but now it seems to work well
[20:09:00] took 7.5 mins to convert simplewiki
[20:09:03] joal, both frustrating and awesome!
[20:09:08] indeed
[20:09:12] 7.5 mins is pretty darn good.
[20:09:20] Computer Scieeeeeeeence :)
[20:09:25] Want to try enwiki?
[20:09:40] Please, let's shoot at the big one :)
[20:09:55] Regretfully, I haven't made the transfer yet.
[20:09:59] I'd also love to try your reference count/search thing on simple
[20:10:03] And I need to do the s3 bucket thing.
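For reference on the compression exchange at 19:44-20:00: Spark's saveAsTextFile writes plain text unless you pass a Hadoop codec, so the Snappy output nuria saw presumably came from cluster-side output-compression defaults. A sketch of both variants, with illustrative paths:

```scala
import org.apache.hadoop.io.compress.SnappyCodec
import org.apache.spark.{SparkConf, SparkContext}

object PlainVsSnappyOutput {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("plain-vs-snappy"))
    val results = sc.parallelize(Seq("metric\tvalue", "uniques\t12345"))

    // A tiny result set: collapse to one partition and write plain text,
    // which stays easy to read, grep, and copy around.
    results.coalesce(1).saveAsTextFile("/tmp/report-plain")

    // The same data, Snappy-compressed via the optional codec argument.
    results.saveAsTextFile("/tmp/report-snappy", classOf[SnappyCodec])
  }
}
```

For ten-odd output lines the plain variant is the sensible trade; compression only starts to matter at the dump scale joal and halfak get to next.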
[20:10:22] Yeah I can imagine the web for terabytes is not the most efficient :)
[20:10:37] Let me get enwiki started.
[20:13:41] halfak: by the way, I double checked the number of lines from your export and mine, and it matches
[20:13:56] It's not a full guarantee, but it's a first win :)
[20:14:08] :) That's awesome for my code too :D
[20:14:15] huhuhu
[20:17:47] Ironholds: you there ?
[20:18:09] joal, what's up?
[20:18:13] Heya
[20:18:23] Still and again ... some questions on pv ...
[20:18:33] I promise, one day I'll ping you for a beer :)
[20:18:55] If you have 5 mins, I'd go to batcave, easier
[20:19:01] Ironholds: --^
[20:19:41] sure
[20:19:42] hang on
[20:30:48] kevinator: Heya !
[20:30:54] kevinator: got a minute ?
[20:36:37] halfak: new run with snappy compression output is slower: 12 mins
[20:37:13] What's the difference in file size?
[20:37:19] joal, ^
[20:38:17] 24.4 G /user/joal/simplewiki_json
[20:38:17] 7.6 G /user/joal/simplewiki_json_snappy
[20:38:31] halfak: --^
[20:38:44] Decent reduction :)
[20:39:15] mforns: Hey
[20:39:22] joal, hello
[20:39:23] How is the thing going on eventlogging ?
[20:39:31] Seems like the compression is worth the time.
[20:39:37] xfer to S3 started.
[20:39:44] I'll check on it again before I call it a day.
[20:39:52] joal, well I managed to create the 2 missing files
[20:40:06] joal, I had created the timestamps wrongly
[20:40:22] joal, so I fixed them and re-executed
[20:40:51] ok
[20:41:00] backfilling works in testing ?
[20:43:27] joal, I'm on that
[20:43:34] Nice :)
[20:43:36] still not executed
[20:43:39] ok
[20:44:05] I am going to bed
[20:44:11] joal, ok
[20:44:13] I'll catch up on that tomorrow morning
[20:44:22] Thx mforns for all the good work :)
[20:44:37] joal, thank you! nite!
[20:50:55] (PS3) Milimetric: [WIP] Build process work, refactor in progress [analytics/dashiki] - https://gerrit.wikimedia.org/r/204951
[20:55:30] exit
[20:59:58] o/ joal|night
[21:19:23] Analytics-Cluster, operations: Set up ops kafkatee instance as part of udp2log transition - https://phabricator.wikimedia.org/T96616#1222404 (Ottomata) NEW a:Ottomata
[22:01:52] Analytics-Cluster, Analytics-Kanban: Install Impala on cluster {wren} - https://phabricator.wikimedia.org/T96329#1222638 (Ottomata) a:Ottomata
[23:19:52] (PS4) Milimetric: [WIP] Build process work, refactor in progress [analytics/dashiki] - https://gerrit.wikimedia.org/r/204951
[23:53:15] (PS5) Milimetric: [WIP] Build process work, refactor in progress [analytics/dashiki] - https://gerrit.wikimedia.org/r/204951
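A footnote on the dump-conversion thread above (20:13 and 20:36-20:39): the line-count comparison joal used as a first sanity check is cheap to express in Spark. A sketch, with illustrative paths; matching counts prove nothing by themselves, but a mismatch is a sure sign of lost or duplicated records.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CompareExportsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("compare-exports"))

    // Count the JSON lines in each export and compare the totals.
    val mine   = sc.textFile("/user/joal/simplewiki_json").count()
    val theirs = sc.textFile("/user/halfak/simplewiki_json").count()
    println(s"mine=$mine theirs=$theirs match=${mine == theirs}")
  }
}
```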