[11:17:46] Analytics / Refinery: Duplicates/missing logs from esams bits for 2014-09-28T{18,19,20}:xx:xx - https://bugzilla.wikimedia.org/71435#c1 (christian) (In reply to christian from comment #0) > Looking at the Ganglia graphs, it seems we'll see the same issue also > for today (2014-09-29). Yes, we did. The...
[11:37:18] Analytics / Refinery: Raw webrequest partitions for 2014-08-29T20:xx:xx not marked successful - https://bugzilla.wikimedia.org/71463 (christian) NEW p:Unprio s:normal a:None For the hour 2014-08-29T20:xx:xx, none [1] of the four sources' buckets was marked successful. What happened? [1...
[11:37:19] Analytics / Refinery: Raw webrequest partitions that were not marked successful - https://bugzilla.wikimedia.org/70085 (christian)
[11:37:47] Analytics / Refinery: Raw webrequest partitions for 2014-08-29T20:xx:xx not marked successful - https://bugzilla.wikimedia.org/71463#c1 (christian) NEW>RESO/FIX The issue covered each and every cache. For each cache, at some point between 20:30:00 and 20:57:00, partition numbers reset, and the mi...
[11:45:36] !log Marking webrequest partitions for 2014-09-29T20:xx:xx good {{bug|71463}}
[14:14:26] Ironholds: ^
[14:14:29] like that?
[14:14:36] yep. ta!
[14:26:18] ottomata: Since there'll be a demo at today's showcase ...
[14:26:20] do you want to demo the hdfs fuse mount on stat1002?
[14:28:55] sure!
[14:29:33] Awesome!
[14:29:38] kevinator: ^
[14:34:10] got it, thanks!
[15:34:04] ottomata: Thanks for bringing up the "flashy presentation" thing! \o/
[15:34:30] +1
[15:34:30] haa
[15:34:41] it was an interesting discussion
[15:35:25] ja, qchris!
[15:35:31] been thinking about more sqoop
[15:35:39] hive is going to have support for what we need in a version or two
[15:35:50] and some cooler stuff too
[15:36:26] https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-BasicDesign
[15:36:49] will be able to update and delete
[15:36:50] also
[15:36:54] this is cool too
[15:36:54] https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest
[15:37:09] might replace supersede camus.
[15:37:12] oops
[15:37:18] Mhmm .-)
[15:37:33] it might do what camus does now, but streaming-like, and as part of hive
[15:37:34] I like camus.
[15:38:14] acid on hive ... whoa!
[15:50:52] hahaha, just figured out the dashiki bug
[15:51:11] on mouse down, the blur event gets queued and fires 100 milliseconds later
[15:51:35] if it takes you more than 100 milliseconds to lift your finger off the mouse, the click event never has a chance to fire
[15:51:42] because the blur event handler removes the click target
[16:03:27] this sounds like the kind of bug that makes me never want to do ui programming
[16:03:53] milimetric, I'm gonna save a log of this
[16:03:59] and when people go, why do you only write research code?
[16:04:09] print a copy off for them
[16:05:54] qchris, sadly, no release date set on hive 0.14 though.
[16:12:30] i wonder if we should just wait...i dunno. we'd also have to wait for cloudera to include it I guess
[16:15:55] Ironholds / ottomata: I love that we have different aesthetics. I find that kind of bug awesome, though I usually stay away from that kind of situation by not having time-based logic
[16:16:06] milimetric, oh no, it's awesome
[16:16:22] in the same way that R can handle [POSIX timestamp] < [string]
[16:16:26] you end up sort of marvelling
[16:16:30] ..but it's still a bug.
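Coming back to the Hive Transactions page linked above (15:36): what "will be able to update and delete" means in practice, per that design, is roughly the sketch below. This is a hedged illustration, not an actual Refinery table: the table and column names are made up, and Hive 0.14's ACID support also needs the transaction manager settings described on that page (bucketed, ORC-backed, transactional tables).

    -- Sketch of the Hive 0.14 transactional table support described on the
    -- wiki page above. Table/columns are hypothetical; per the design the
    -- table must be bucketed, stored as ORC, and flagged transactional.
    CREATE TABLE revision_sample (
      rev_id      BIGINT,
      rev_page    BIGINT,
      rev_len     INT,
      rev_deleted BOOLEAN
    )
    CLUSTERED BY (rev_id) INTO 16 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    -- What the feature buys for incremental MediaWiki imports:
    -- correcting rows in place instead of rewriting whole partitions.
    UPDATE revision_sample SET rev_len = 1234 WHERE rev_id = 42;
    DELETE FROM revision_sample WHERE rev_deleted = TRUE;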
[16:43:28] milimetric: I prepared the deck for the showcase - https://docs.google.com/a/wikimedia.org/presentation/d/1TeIGxuvsnV_dLYQl_Oaa61lnx2aXzdLpGyoqWLmxEPU/edit#slide=id.p
[16:43:57] Scrumbugs is not up to date tho
[16:44:18] bug 70887 refuses to be added to the sprint :-(
[16:44:43] and 3 of them need to be marked as fixed
[16:47:30] Analytics / Wikimetrics: Update 'existing' Edits Metric to include deleted pages - https://bugzilla.wikimedia.org/71008 (Dan Andreescu) NEW>RESO/FIX
[16:47:44] Analytics / Wikimetrics: Update 'existing' Pages Created to include delete pages - https://bugzilla.wikimedia.org/71009 (Dan Andreescu) NEW>RESO/FIX
[16:48:16] Analytics / Visualization: Story: EEVSUser loads static site in accordance to Pau's design - https://bugzilla.wikimedia.org/67806 (Dan Andreescu) NEW>RESO/FIX
[16:48:31] Analytics / Wikimetrics: Admin script is really slow now that there are lots of reports - https://bugzilla.wikimedia.org/70775 (Dan Andreescu) PATC>RESO/FIX
[16:48:45] Analytics / Wikimetrics: Story: EEVSUser loads dashboard from URI that specifies state / EEVSUser copies URI that recreates current dashboard state - https://bugzilla.wikimedia.org/70887 (Dan Andreescu)
[16:48:50] kevinator: I just updated all the bugs that needed it
[16:49:10] thanks!
[16:49:11] kevinator: 70887 didn't sync because each component has to be added to scrumbugz, i'm fixing that now
[16:49:20] (that one's in the new component - Dashiki)
[16:50:09] ahh
[16:50:45] Analytics / Dashiki: Story: EEVSUser loads dashboard from URI that specifies state / EEVSUser copies URI that recreates current dashboard state - https://bugzilla.wikimedia.org/70887 (Dan Andreescu)
[17:24:40] ottomata, how would I get a tsv from stat2 into hive?
[17:30:23] hdfs dfs -put ... /path/to/data
[17:30:23] create external table
[17:30:23] ...
[17:30:23] location '/path/to/data'
[17:30:24] something like that
[17:42:45] Analytics / Wikimetrics: Story: AnalyticsEng has productionized wikimetrics and limn servers - https://bugzilla.wikimedia.org/71455#c1 (Kevin Leduc) Collaborative tasking will happen on: http://etherpad.wikimedia.org/p/analytics-71455
[17:43:30] Analytics / Wikimetrics: Story: AnalyticsEng has productionized wikimetrics and limn servers - https://bugzilla.wikimedia.org/71455 (Kevin Leduc)
[17:47:04] hmn. cool!
[17:47:07] thanks :)
[17:49:22] that might just work, but there also might be some other steps to set
[17:50:05] Ironholds:
[17:50:06] https://github.com/wikimedia/kraken/blob/master/hive/tables/webrequest_all_sampled_1000.hive
[17:51:44] ottomata, ta
[17:52:05] I'm thinking of writing a "load into hive" script for my utilities library, later on.
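Spelling out ottomata's recipe above: copy the TSV into HDFS with hdfs dfs -put, then point an external table at that directory (the linked webrequest_all_sampled_1000.hive file is the real-world reference). The sketch below uses hypothetical column names and a hypothetical path; the delimiter and columns have to match whatever the TSV on stat1002 actually contains.

    -- Minimal sketch of the "create external table ... location" step.
    -- Columns and path are made up for illustration.
    CREATE EXTERNAL TABLE IF NOT EXISTS my_tsv_import (
      dt       STRING,
      uri_host STRING,
      uri_path STRING,
      hits     BIGINT
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION '/path/to/data';
    -- (the file itself goes in beforehand with: hdfs dfs -put mydata.tsv /path/to/data/)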
[18:11:40] try running https://gist.github.com/Ironholds/caf326d56288fb725ae1
[18:11:45] if you can solve this, I will buy you a cookie
[18:21:47] kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 3.97331795603e-103
[18:22:00] e-103 ?:)
[18:25:56] yes
[18:26:12] sorry mutante, it is known
[18:26:24] usually it happens about once a week, and then I think about it and tweak something, and then trigger an election
[18:26:31] this is the bug
[18:26:31] https://bugzilla.wikimedia.org/show_bug.cgi?id=69667
[18:26:59] ok, then i'm just going to link the ticket. thx
[18:27:30] hmm Ironholds, it seems to happen to all URLs?
[18:27:43] not just upload ones
[18:28:02] ottomata, grabbing the folder name? damn. hmn.
[18:29:14] I...should run out and do some errands. But, would you be interested in working on this when I get back? peer...poking, I guess.
[18:31:26] Ironholds: hm, it doesn't work even if I just do this:
[18:31:31] parse_url('http://bla.org/this/is/a/file.jpg', 'FILE')
[18:31:35] returns /this/is/a/file.jpg
[18:31:42] um, maybe that is what it is supposed to return?
[18:31:44] what the hell is the use of that?! silly sausages.
[18:31:50] an absolute filename? Possibly.
[18:32:05] testing PATH
[18:32:11] I guess we could use a regex? Or split on '/'.
[18:32:45] ha, PATH does the same thing
[18:33:33] wtf
[18:34:44] Ironholds:
[18:34:50] yeah i guess regex or split
[18:34:50] http://stackoverflow.com/questions/13832500/how-to-access-the-last-element-in-an-array
[18:35:26] oof, that double reverse is on the string
[18:35:28] kinda silly
[18:36:00] Analytics / Dashiki: Cannot add project to dashboard - https://bugzilla.wikimedia.org/71333 (Dan Andreescu)
[18:36:14] * Ironholds winces
[18:36:17] oh dear god.
[18:36:19] ah well
[18:36:37] we're going to end up with lower(reverse(split())[0])
[18:36:45] which is...too much nesting for one SELECT ;P
[18:36:46] ah well.
[18:37:34] (PS1) Milimetric: Fix nondeterministic project selector [analytics/dashiki] - https://gerrit.wikimedia.org/r/163892 (https://bugzilla.wikimedia.org/71333)
[18:43:40] Ironholds:
[18:43:41] SELECT regexp_extract(uri_path, '/([^/]+)$', 1) as uri_file
[18:50:06] stat1001/1002 - E: Unable to locate package python3-matplotlib
[18:51:25] HMmmmmm ah, crap, mutante, that's because they are not trusty
[18:51:36] ok, i put that in there for convenience, more than anything else, i'll just remove the python3 version
[19:12:40] ottomata, hmn. I'll check it out. Thanks :)
[19:14:00] it works! you are a genius.
[19:37:25] heading to cafe, back in a bit
[19:38:49] Hello #wikimedia-analytics! Question for you all. Is there an application out there that can help me parse a .csv file? I have a huge one with tons of data, but I don't have the know-how to get the output I'm looking for.
[19:52:50] cndiv: What kind of output do you want?
[19:54:09] marktraceur: So the file is an output from our keycard server. It logs every time that anyone uses a key to open a door. I want to tell it a last name, and to receive the days that name opened a door.
[19:54:27] marktraceur: I'm thinking this is a script of some kind, not an application, right?
[19:54:34] Yeah
[19:54:46] Or just a query.
[19:54:53] marktraceur: what's the difference?
[19:55:21] Well, you basically have a database. A query is when you ask the database for a result, and a script would probably also do the parsing
[19:55:29] I see.
[19:55:39] What are those scripts usually written in around here? Python?
[19:55:45] But you could just import the data into a database and issue a query...I can't imagine there not being a CSV->MySQL import tool.
[19:55:46] Ruby?
[19:55:50] Probably python
[19:56:07] CSV > MySQL import tool. Good idea.
[19:56:24] First DDG result: http://www.tech-recipes.com/rx/2345/import_csv_file_directly_into_mysql/
[19:56:50] boom. OK, I'll look into that.
[19:56:54] and see what's easier.
[19:57:31] marktraceur: Thanks for your help.
[19:57:52] Then you should be able to do like "select date, lastname from entries group by date;" and get a list. Add "where lastname = Holmquist" before the semicolon to see when I entered the building.
[19:58:03] Er no.
[19:58:12] group by date, lastname maybe.
[20:00:18] marktraceur: What's the most popular way of doing something like this? The script or database route?
[20:00:26] using a .csv as the source of data
[20:00:30] Popular, or good? :P
[20:00:35] hah
[20:00:37] popular, for now.
[20:00:47] Probably parsing the CSV on the fly with a script.
[20:01:01] But honestly if it's that big of a dataset I'm not sure why it's keeping the data in a CSV anyway
[20:01:19] * marktraceur hides the early versions of the orgchart tool, which used CSVs
[20:01:23] marktraceur: It's an ancient keycard server. It didn't even spit out a .csv, it spat out a .txt with commas.
[20:01:30] Well that's just silly.
[20:01:39] yep
[20:02:00] cndiv: Do you have daily exports available? If so you could easily run an import every day and not have to deal with the crappy output files.
[20:02:06] "easily"
[20:02:41] marktraceur: they are as often as you manually request them. This project doesn't have to be exact, just a general idea
[20:02:53] Sure
[20:03:05] cndiv: Mostly my question is, can you get data for a specific amount of time?
[20:03:38] Basically I'm looking to type in a last name IE "Deubner" and to get out "Chip Deubner opened doors on days X, Y, Z between dates A and B. Not including weekends, that's X% of days."
[20:03:49] marktraceur: no just one big .txt output
[20:03:53] Wow, fancy.
[20:03:58] Less fancy.
[20:04:00] Uhm
[20:04:22] marktraceur: That sounds more like a script, right?
[20:05:55] I mean, yes, but the backend could/should still be a database IMO
[20:06:38] marktraceur: for sustainability's sake, you mean
[20:06:49] Mostly yes
[20:07:04] And extensibility - what if you want a script later that gives you the entries for a time period? :)
[20:07:35] (CR) Milimetric: [C: 2 V: 2] Removing gulp-clean from build [analytics/dashiki] - https://gerrit.wikimedia.org/r/163259 (owner: Nuria)
[20:07:38] cndiv: Also, what's the card system called? Maybe the script you want exists...
[20:07:46] (PS2) Milimetric: Fix nondeterministic project selector [analytics/dashiki] - https://gerrit.wikimedia.org/r/163892 (https://bugzilla.wikimedia.org/71333)
[20:07:50] marktraceur: haha no idea
[20:07:57] Brilliant
[20:07:58] marktraceur: good idea though, I'll look that up
[20:08:07] marktraceur: we didn't install it, it came with the building
[20:08:19] and you know how modern this building is.
[20:08:48] marktraceur: OK, I'll gather my thoughts a bit better and go from there. Thanks for your help - I have to run.
[20:08:49] I often imagine Fred Flintstone pulling the elevator cables.
[20:08:55] marktraceur: That's about right.
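For the keycard-log question above, the database route marktraceur suggests would look roughly like the sketch below: load the comma-separated export into MySQL (as the linked tech-recipes page walks through), then query for the distinct days a given last name appears. The table and column names are guesses, since the actual export layout isn't shown, and note the WHERE clause has to come before any GROUP BY, not after it as in the off-the-cuff query at 19:57.

    -- Hypothetical table for the keycard export; columns are assumptions.
    CREATE TABLE entries (
      entry_date DATE,
      entry_time TIME,
      lastname   VARCHAR(64),
      door       VARCHAR(64)
    );

    -- MySQL's bulk CSV loader; path and column order are illustrative.
    LOAD DATA LOCAL INFILE '/path/to/keycard-export.csv'
    INTO TABLE entries
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    (entry_date, entry_time, lastname, door);

    -- Distinct days on which a given last name opened a door.
    SELECT DISTINCT entry_date
    FROM entries
    WHERE lastname = 'Deubner'
    ORDER BY entry_date;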
[20:21:55] (PS2) Ottomata: Use tsv format when outputting webrequest faulty hosts files [analytics/refinery] - https://gerrit.wikimedia.org/r/150963
[20:26:54] milimetric: I want to write something up to start a discussion of how to deal with incremental imports of mediawiki data
[20:27:10] should I send an email to analytics-internal, or to just a few folks...or use a wikitech (talk?) page?
[20:27:26] analytics-internal, sure
[20:27:40] hm k
[20:27:50] people can always filter out
[20:45:43] tnegrin: i didn't know hortonworks was yahoo!
[20:45:54] oh yeah -- I know all those guys
[20:46:09] that's why we use Cloudera ;)
[20:46:17] haha
[20:46:30] well, hortonworks is working on a hive feature we really want
[20:46:31] :)
[20:46:39] which one?
[20:46:41] http://hortonworks.com/blog/adding-acid-to-apache-hive/
[20:46:46] updates :)
[20:47:27] because if we need one thing, it's HDFS tripping
[20:47:30] uh, yeah -- that'd be great
[20:47:35] tripping?
[20:48:16] "acid"
[20:48:30] seriously though, this change looks like just what we need to make Hive useful for MW data
[20:49:44] I'm writing up some thoughts now, summarizing some options
[20:50:26] cool
[20:50:44] and I'm gonna..hmn.
[20:50:58] Pageviews is as far as it can go with just my input, han's script is running, can't do anything to push the turking...
[20:51:13] sod it, I've put in my 8 hours. Anyone wants me I'll be...doing other things.
[20:51:59] :)
[20:54:18] ...which may be volunteer data analysis
[20:54:21] y'all can't judge me.
[20:55:54] tnegrin: i'm currently impressed with all the stuff i'm reading on hortonworks blog right now, basically everything we want they are working on :)
[20:57:05] yeah -- I really like their approach, especially with faster queries on hive. the problem is that I don't trust their execution -- I had some friends who tried to use their stuff and it was a mess
[20:57:17] If you want to merge it yourself, that would work
[20:57:39] but I would prefer to wait until their work hits apache, then a cloudera distro
[20:57:42] well, its not just their work
[20:57:43] https://issues.apache.org/jira/browse/HIVE-5317
[20:57:47] true that
[20:58:06] i might suggest that as an option in this email i'm writing (and will have to finish tomorrow), but its not something i'd be exciiited about :)
[20:58:17] it's done?
[20:58:26] well, the JIRA is done, ja
[20:58:32] but it isn't in a stable release
[20:58:45] of hive, let alone cloudera
[21:00:53] so here's the deal -- we now need to build apps on top of the hadoop we have
[21:01:19] we need to think about the next iteration of tech -- real time, faster hive, druid, etc
[21:02:07] scalding
[21:02:12] so THINK
[21:02:25] tnegrin: the part i'm thinking about
[21:02:30] is for sqoop imports of mediawiki
[21:02:31] honestly, I'd rather have a geolocation UDF and a UA parsing UDF
[21:02:35] but...*shrugs*
[21:02:37] that's apps
[21:02:38] Ironholds: write one!
[21:02:46] I don't speak Java!
[21:02:49] the devs will manage that
[21:02:52] heheh
[21:03:04] tnegrin: yeah, i'm thinking about how to import mediawiki data
[21:03:13] and the incremental part without updates and deletes is very cumbersome
[21:03:32] it involves moving and rewriting data all the time
[21:03:51] HIVE-5317 makes all that automatic
[21:03:57] anyyyyway
[21:04:00] it is polo time!
[21:04:01] this is good too-- also a better way to handle the dumps
[21:04:16] "so you wanna play like that???"
[21:04:20] haha
[21:04:21] be safe
[21:04:25] been doing tons of reading today, will finish this email tomorrow
[21:04:36] splendid
[21:05:03] lattaas
[21:39:57] looking for ezachte but probably should mail
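For context on the "moving and rewriting data all the time" that ottomata describes above: without the UPDATE/DELETE support from HIVE-5317, an incremental Sqoop import of MediaWiki data typically gets merged by re-deriving the whole table from the previous snapshot plus the new increment and keeping the newest copy of each row. The sketch below illustrates that pattern only; the table and column names are hypothetical and this is not the plan from the email he mentions.

    -- Pre-ACID incremental merge: union the old snapshot with the freshly
    -- imported increment, keep the most recently loaded copy of each rev_id,
    -- and overwrite a merged table that then replaces the old snapshot.
    -- All tables and columns here are hypothetical.
    INSERT OVERWRITE TABLE revision_merged
    SELECT rev_id, rev_page, rev_timestamp, rev_len
    FROM (
      SELECT rev_id, rev_page, rev_timestamp, rev_len,
             row_number() OVER (PARTITION BY rev_id ORDER BY load_ts DESC) AS rn
      FROM (
        SELECT rev_id, rev_page, rev_timestamp, rev_len, load_ts FROM revision_snapshot
        UNION ALL
        SELECT rev_id, rev_page, rev_timestamp, rev_len, load_ts FROM revision_increment
      ) unioned
    ) ranked
    WHERE rn = 1;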