[08:54:39] hey ottomata
[08:54:55] way too early
[08:57:38] hey qchris
[08:57:41] good morning sir
[08:57:46] :)
[08:57:55] Good morning sir average
[08:57:57] yesterday fell asleep on the couch
[08:58:14] I've been trying again, with hive
[08:58:18] np :-) Did you manage to get the hostname set?
[08:59:27] I tried to set the hostname, got errors over and over, for about 3 hours, after which I decided to delete the VM
[08:59:30] I found a vagrant VM with hadoop and I installed that, now at least hive works.. but I still get zero rows when selecting on the table.
[09:00:05] :-)
[09:00:24] qchris: question, if you tell the hive table where the data is on hdfs, should I expect the data to actually be in the table already ?
[09:00:30] or do I still have to load it ?
[09:00:49] with stuff like LOAD DATA LOCAL INPATH [..]
[09:00:51] ?
[09:01:59] can I show you ?
[09:02:47] I'm in the batcave btw
[09:03:11] It has been some weeks :-) but IIRC once the table is created and the files added to the table once
[09:03:18] You do not need to load them each time.
[09:03:24] Once is sufficient.
[09:03:29] Booting google machine.
[09:06:06] yes, but I didn't use LOAD yet, I thought for example if I have them in hdfs and the LOCATION in the CREATE query of the table told hive where to get them from, it would go and get them
[09:06:11] but I might be wrong
[09:14:36] !card 1227
[09:14:36] https://mingle.corp.wikimedia.org/projects/analytics/cards/1227
[10:43:38] (PS1) QChris: [DO NOT SUBMIT] kraken-hive stub [analytics/kraken] - https://gerrit.wikimedia.org/r/96738
[10:45:07] (PS2) QChris: [DO NOT SUBMIT] kraken-hive stub [analytics/kraken] - https://gerrit.wikimedia.org/r/96738
[10:45:20] average: ^ compiles the udf for me
[10:51:33] average: Now I found the reason why you get pig errors ...
[10:51:57] average: I gave you the wrong test data ... :-(
[10:52:07] Grabbing good test data from backup ...
[10:54:24] average: I uploaded the good testdata to stat1002 again
[10:54:36] Let me know when you grabbed it, so I can remove it again
[10:54:43] md5sum testdata_*
[10:54:43] 2ffe78738528f33bf9a887fe1a8437d5 testdata_desktop.csv
[10:54:44] a17a4be67e20a4c4f91111988a5d8f02 testdata_mobile.csv
[10:54:44] db2a8d782424c1a5f096156964d10e74 testdata_session.csv
[10:56:03] (PS3) QChris: [DO NOT SUBMIT] kraken-hive stub [analytics/kraken] - https://gerrit.wikimedia.org/r/96738
[12:36:34] qchris_away: thanks :)
[12:36:37] I pulled now
[15:26:00] :) funny stat
[15:26:09] the first 10,000 users that ever registered on enwiki
[15:26:16] made almost 10,000 edits in the last 30 days
[15:26:18] :D
[15:26:38] mmmm :)
[15:26:45] i wonder what the distribution of that is like :)
[15:26:48] also hi milimetric :)
[15:26:55] and the pageviews project was canned? :(
[15:27:00] no!
[15:27:08] pageview project was never canned, just deprioritized
[15:27:09] however!
[15:27:16] I think I just heard it's getting re-prioritized
[15:27:29] and I'm pretty excited about it, I can't wait to kick that project's ass
[15:27:37] :D
[15:27:47] milimetric: do we capture referrer data anywhere?
[15:27:55] only in the logs as far as I know
[15:28:03] so we won't be able to analyze that until we get the firehose turned on
[15:28:11] however, we're very close to getting the mobile mini-firehose turned on
[15:28:31] and qchris is working on defining pageviews (I'm sure referrers will play some role)
[15:29:02] I thought this was already in hadoop?
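A side note on the 09:00-09:06 hive question above: for a Hive EXTERNAL table, pointing LOCATION at an HDFS directory that already holds the files is enough; no LOAD DATA step is needed. The sketch below is only illustrative, with made-up table, column, and path names (not the real kraken definitions).

    -- Minimal sketch, hypothetical names and path: an external table over existing CSV files.
    CREATE EXTERNAL TABLE IF NOT EXISTS testdata_mobile (
      ts      STRING,
      uri     STRING,
      referer STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/average/testdata_mobile/';  -- must be a directory, not a single file

    -- If files already sit in that directory, this returns rows without any LOAD DATA step.
    SELECT * FROM testdata_mobile LIMIT 10;

If a SELECT on such a table returns zero rows, the usual culprits are a LOCATION that points at the wrong directory (or at a file instead of a directory), or a partitioned table whose partitions have not been registered yet (MSCK REPAIR TABLE / ALTER TABLE ... ADD PARTITION).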
[15:29:20] I think many things too sometimes
[15:29:52] nothing's in hadoop yet
[15:30:08] but *very* close
[15:30:14] milimetric: the definition of the hive table I'm working with has a referer field
[15:30:41] milimetric: is that what YuviPanda is saying ?
[15:32:10] 17:26 < milimetric> the first 10,000 users that ever registered on enwiki
[15:32:14] 17:26 < milimetric> made almost 10,000 edits in the last 30 days
[15:32:18] ^^ wow, that's interesting :)
[15:32:29] :)
[15:32:47] average: tasking
[15:33:05] task task task task task
[15:33:12] * average is joining
[17:49:00] milimetric: yt?
[19:41:25] DarTar: How's your schedule today? Do you have some spare time to talk about pageviews?
[19:42:11] qchris: sure, I have a meeting at 3 but otherwise available any time
[19:43:01] DarTar: Great!
[19:43:08] DarTar: I am in the hangout right now.
[19:43:22] Join me whenever it suits you :-)
[21:39:00] (PS1) Milimetric: Speed up async validation [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/96889
[21:43:27] (PS2) Milimetric: Speed up async validation [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/96889
[21:44:51] (CR) Milimetric: [C: 2 V: 2] Speed up async validation [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/96889 (owner: Milimetric)
[21:50:56] Anybody around willing to do a quick database lookup? (he said, having no idea if it's easy or hard)... I'd like to know how many articles contain at least one <ref (no closing bracket, so that either unnamed or named refs can be found)
[21:53:48] hm... lexein I'm not familiar with that part of the db
[21:54:13] Me neither!
[21:54:16] and I think full text search like that would be hard to do. Let me know if you track down a way to do it though. That'd be interesting
[21:54:27] I'll ask around too
[21:54:50] I don't have a spare machine on which to try it. Other things to do...
[21:55:08] Thanks for asking
[21:55:45] Is this the wrong channel to ask?
[21:56:19] this channel will probably have people with the ability to answer that kind of question
[21:56:37] Thx
[21:56:40] but the answer may be 'that looks hard'
[21:56:41] :)
[21:59:14] yeah, this is the right channel lexein. And frankly, I should be able to answer that. So I'll try to track it down for you.
[22:00:13] (CR) Jdlrobson: [C: 1] Allow queries to be smarter about when they run [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/96315 (owner: Jdlrobson)
[22:01:08] I can't imagine it being a ''fast'' search, given that it will basically be a linear search. Wouldn't think there'd be an index anywhere of <ref> uses, since it's a MediaWiki builtin, rather than in template space.
[22:02:09] Ok, so *now* I have a reason to buy an SSD. I was holding out, waiting for an app that needed it.
[22:03:45] :)
[22:10:27] (CR) Milimetric: [C: 2 V: 2] Allow queries to be smarter about when they run [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/96315 (owner: Jdlrobson)
[22:44:35] lexein, proton: i don't think you can directly query a mysql db to do that, i think you would need to analyze the xml dump files
[22:46:43] drdee: does this not help? http://dev.mysql.com/doc/refman/5.0/en/fulltext-natural-language.html
[22:47:30] I have no problem downloading the XML and searching locally. Gotta dig up that USB drive... heh.
[22:49:25] in theory it does, but afaik the way the actual revision texts are stored in the db makes it not possible (for one they are compressed IIRC) to query them directly, that is more an artifact of how our dbs are set up. There are however plans to change this :)
[23:03:36] drdee - of course, you're right. Doh!
[23:03:47] XML it is!
[23:03:52] good luck!
[23:04:14] Good luck indeed - I have to find my external drive in a 16x19' storage unit
[23:04:55] Might as well just go buy one, I guess
[23:06:03] All: You don't happen to know any good search utilities designed to search inside compressed archives, do you?
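A side note on lexein's <ref question: against a local MediaWiki database built from the XML dump (where text.old_text holds plain, uncompressed wikitext), the count could in principle be expressed as the sketch below, using the stock page/revision/text schema of that era. As drdee says, this won't work on the production databases, where revision text is stored compressed/external, and even locally it is an unindexed full scan, so it will be slow.

    -- Sketch only; assumes a hypothetical local import of the dump into MySQL.
    -- Counts namespace-0 pages whose current revision contains "<ref" (named or unnamed).
    SELECT COUNT(*)
    FROM page p
    JOIN revision r ON r.rev_id = p.page_latest
    JOIN `text` t   ON t.old_id = r.rev_text_id
    WHERE p.page_namespace = 0
      AND t.old_text LIKE '%<ref%';

For the dump route drdee suggests, bzgrep (shipped with bzip2) can search inside the compressed .bz2 dump directly, though counting matching pages rather than matching lines takes a little extra scripting.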