[09:01:46] (CR) QChris: "> Qchris, +1? :)" (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/134377 (owner: Ottomata)
[10:27:08] (PS12) Nuria: Add cohort class hierarchy, refactor CohortService [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/134743 (owner: Milimetric)
[11:40:15] (PS7) Nuria: Enable project-level cohorts via WikiCohort [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/135181 (owner: Milimetric)
[12:15:15] anyone knows about analytics/wp-zero and analytics/geowiki repositories? I proposed a bunch of linting changes to both repositories
[12:44:07] qchris: hiya!
[12:44:14] Hi
[12:44:29] thanks for the review! re 'percent_different':
[12:44:39] i didn't call it 'count_..' because it isn't a count
[12:45:04] all the other fields that start with counts are counts of some value, like the number of nulls, or the number of duplicates, etc.
[12:45:31] but, i see what you are saying about 'different' in the name
[12:45:39] I do not fully understand ... are you referring to the comment in line 64?
[12:45:42] as 100.0 percent_different does sound bad :)
[12:45:48] yes
[12:45:49] Ok.
[12:45:53] OH
[12:45:57] COUNT(*)
[12:45:59] i'm sorry
[12:46:02] i didn't read very well
[12:46:04] gotcha
[12:46:04] can do
[12:46:28] thought you were asking to prefix the name with count_
[12:46:33] Oh.
[12:46:35] :-)
[12:46:57] ok, ja, but hm, what's a better name then...
[12:47:06] haha
[12:47:09] percent_same :p
[12:47:20] Hahaha.
[12:47:24] we could make it 0
[12:47:29] that's how it was in the pig script
[12:47:32] percent_loss
[12:47:45] but 'loss' is weird, because negative loss == duplicates
[12:47:59] Currently it's a bit misleading to have ..._different and then expect 100.
[12:48:00] i guess percent_different is good if it is 0
[12:48:02] yeah
[12:48:04] agree
[12:48:21] percent_different and defaulting around 0 sounds good to me as well.
[13:08:47] (PS6) Ottomata: Add hive script to help monitor webrequest loss and duplication [analytics/refinery] - https://gerrit.wikimedia.org/r/134377
[13:09:09] (CR) Ottomata: Add hive script to help monitor webrequest loss and duplication (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/134377 (owner: Ottomata)
[13:28:31] (CR) QChris: [C: 2 V: 2] Add hive script to help monitor webrequest loss and duplication [analytics/refinery] - https://gerrit.wikimedia.org/r/134377 (owner: Ottomata)
[13:29:11] :D
[13:29:16] next up, ooziefication!
[13:29:30] You mean 135128?
[13:29:43] I am having a hard time testing it.
[13:30:11] Is there some labs part one can toy with? (hadoop-master0 does not like hive too much)
[13:32:16] ah, no, i mean
[13:32:21] ooziefication for webrequest_stats
[13:32:26] sequence_stats*
[13:32:27] Ok :-)
[13:32:43] qchris, yeah, i was testing in labs, but every time i make a labs hadoop cluster
[13:32:46] it falls apart in a few weeks
[13:32:50] Hahaha.
[13:32:59] Ok.
[13:33:03] disk space in /var/log is very small, have to prune it manually, other things go wrong too i think
[13:33:47] oh, qchris, i wrote that mainly by testing with vagrant
[13:34:01] 135128
[13:34:10] can/do you run vagrant
[13:34:13] Oh ok. Then I'll try vagrant for that one too.
[13:34:19] cool
[13:34:21] Thanks!
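A rough illustration of the "percent_different should sit near 0" convention agreed above. This is not the refinery script under review in 134377, just a sketch of the idea; the table name, the hostname/sequence fields, and the partition columns are assumptions based on the discussion.

    -- Per varnish host, compare the rows that actually arrived against the
    -- count implied by the sequence-number range. With no loss and no
    -- duplication, percent_different comes out at roughly 0; loss pushes it
    -- negative, duplicates push it positive.
    SELECT
      hostname,
      COUNT(*)                          AS count_actual,
      MAX(sequence) - MIN(sequence) + 1 AS count_expected,
      ROUND(
        (COUNT(*) - (MAX(sequence) - MIN(sequence) + 1)) * 100.0
        / (MAX(sequence) - MIN(sequence) + 1),
        4
      )                                 AS percent_different
    FROM webrequest
    WHERE webrequest_source = 'text'
      AND year = 2014 AND month = 5 AND day = 28 AND hour = 14
    GROUP BY hostname;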
[13:35:26] (CR) Ottomata: Add code to auto-drop old hive partitions and remove partition directories (6 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/135128 (owner: Ottomata)
[13:36:07] I didn't even press submit for the whitespace things :-)
[13:40:23] :)
[13:40:36] those whitespaces are much easier to catch in gerrit than in my editor
[17:08:47] ottomata, something is wrong in the state of denmark
[17:10:24] by which I mean: you know that 80,000 mapper query?
[17:10:44] I just made a 91,000 mapper query.
[17:10:51] I have absolutely no idea what in the hell is going on.
[17:14:38] ha, it's because you turned off strict mode
[17:14:45] ANYTHING CAN HAPPEN
[17:14:46] :p
[17:19:00] ottomata, actually, no, not for this query
[17:19:28] cat /home/ironholds/anna.hql on stat2
[17:19:37] it's a SELECT COUNT(*) with two WHEREs.
[17:19:49] and it's only running on a month's worth of data.
[17:20:58] I'm going to go through the error log when I have a big screen and see if I can spot anything useful in explaining why on god's green earth the cluster has become unusable.
[17:30:17] ottomata, okay, amendment.
[17:30:20] 97,000 mappers.
[17:31:40] Ironholds: this is on may?
[17:34:00] yup
[17:34:49] so, crucial information is Ended Job = job_1387838787660_13919 with errors
[17:34:49] Error during job, obtaining debugging information...
[17:34:49] Job Tracking URL: http://analytics1010.eqiad.wmnet:8088/proxy/application_1387838787660_13919/
[17:35:04] I cannot see the analytics application report, because see previous issues with proxy server, but...
[17:35:09] hey tnegrin, I set a new record!
[17:35:16] A SELECT COUNT(*) that launched 97,000 mappers.
[17:35:19] so I saw
[17:35:21] and then consumed 60 days of CPU time.
[17:35:23] and then died.
[17:35:56] I do not think hive understands the new table properly
[17:36:22] I agree.
[17:36:29] I also suggest that the server-side heapsize may be a problem.
[17:36:53] What I can sort of quasi-deduce is: it's having to partition the table far more than the table is actually partitioned, hitting the max number of mappers, and then dying.
[17:37:20] Quite why Hadoop doesn't check whether number-of-reducers >= max-number-of-reducers at the beginning of the query beggars belief, but that's their horrible design decision to sweat over.
[17:45:05] DarTar: http://www.codecademy.com/blog/143-eventhub-open-sourced-funnel-analysis-cohort-analysis-and-a-b-testing-tool
[17:45:32] Ironholds: I'm trying to find a reason for your job's failure
[17:45:34] haven't yet
[17:45:49] but, there are 179360 files in may right now
[17:45:52] ottomata, nothing in the job report?
[17:45:55] hurm
[17:45:59] i can't get to it through the web ui
[17:46:04] i'm looking at the logs in the cli
[17:46:08] oh god. Not this again :(
[17:46:10] grepping for error|warning|fatal
[17:46:17] it's easier to look at logs in cli anyway
[17:46:40] yeah, but the job pages vanishing makes me sad
[17:46:52] they don't vanish, they get moved
[17:47:11] and my little sshuttle tunnel isn't working very well, now that there are so many jobs in the history
[17:47:29] huh
[17:49:33] here is one of yours
[17:49:33] http://analytics1010.eqiad.wmnet:19888/jobhistory/job/job_1387838787660_12774
[17:50:18] ori, interesting: these A/B test tools keep popping up everywhere, will take a look
[17:52:22] StevenW: you should look at that too (see link above)
[17:52:33] EventHub?
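A note on the "strict mode" exchange above: hive.mapred.mode=strict makes Hive refuse to run a query over a partitioned table unless the WHERE clause pins down at least one partition column. It would not have flagged this particular query (which already filters on the month), but it shows why partition predicates matter to the planner. A minimal sketch, with illustrative predicates rather than the actual contents of anna.hql:

    -- Re-enable Hive's guard against accidental full scans of partitioned tables.
    SET hive.mapred.mode=strict;

    -- Rejected in strict mode: no partition column in the WHERE clause, so every
    -- webrequest_source/year/month/day/hour partition would have to be read.
    SELECT COUNT(*)
    FROM webrequest
    WHERE uri_path = '/wiki/Education';

    -- Accepted: partition columns let Hive prune whole partitions before it
    -- decides how many mappers the job needs.
    SELECT COUNT(*)
    FROM webrequest
    WHERE webrequest_source = 'text'
      AND year = 2014 AND month = 5
      AND uri_path = '/wiki/Education';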
[17:52:33] ori: scanning the description, it looks mostly relevant for very large scale experiments (1M+ users in the sample)
[17:52:33] ottomata, cool. Can't see it, but cool ;).
[17:52:36] Yeah I heard about it.
[17:53:32] yeah i can barely see it too
[17:57:00] ori: can this be integrated with EL?
[17:57:36] tnegrin: i think so, they're mostly complementary in terms of the feature-set they provide
[17:57:46] if nothing else there are some good ideas to pillage
[17:58:01] we need better funnel visualization for sure
[17:59:37] there are many tools out there to do A/B test analysis on arbitrary data, the main question is the selection of the test, not so much the usability of the UI or even the scalability (given that most tests we run are small scale)
[17:59:54] and tnegrin: I believe funnel visualization is a separate problem
[18:02:02] before jumping on any single library, I’d like to hear about use cases: we’ve never heard back from teams about the functionality they would like to see if we were to build a more automated A/B test infrastructure
[19:41:02] Ironholds: what was the most recent count query that you were running?
[19:41:05] i can't tell from the job name
[19:42:01] ottomata, cat /home/ironholds/anna.hql (to avoid spamming the channel with hive)
[19:44:48] hm, one way you might be able to reduce the number of mappers
[19:44:53] is selecting smaller partitions,
[19:45:01] for example, you probably don't need to look at the bits partition, do you?
[19:47:06] Ironholds: ^
[19:50:38] Ironholds: another possibility: run smaller jobs in general
[19:50:41] and then aggregate them later
[19:50:49] can you do what you want to do in pieces?
[19:50:59] insert results into a hive table of your own
[19:51:03] and then query the aggregate?
[19:51:12] like, run the query across a day of data at a time
[19:51:58] ottomata, yes, but that'd be..heinously inelegant
[19:52:00] doable, though
[19:52:10] ottomata: are you sure the partitions are correct? the number of mappers seems extreme
[19:52:12] Not mentioned on the bug yet, but it will be because general knowledge pool
[19:52:20] I've relaunched the simple SELECT with bits excluded.
[19:52:26] no -- it can't work this way
[19:52:41] something's wrong here
[19:52:44] well, he def should exclude bits, there's no reason to query that, right?
[19:52:53] also, i think the number of mappers is not inappropriate for the data
[19:52:54] indeed
[19:52:58] but why should the # of mappers be so high?
[19:53:01] I think it was more the "just divide-and-conquer" bit.
[19:53:01] there are around 170K files in the month of may
[19:53:11] and each needs 5-8 mappers?
[19:53:15] wait, no, ignore me.
[19:53:17] mmm….compared to how many in the old partitions?
[19:53:17] * Ironholds slaps self.
[19:53:26] the files are exactly the same
[19:53:33] i really don't know what you mean by 'old partitions'
[19:53:33] do you mean the old tables?
[19:53:35] yeah
[19:53:47] ah, well, you couldn't query all of the files at the same time
[19:53:47] oh, so the partitions are the same, just what partitions the table refers to has changed?
[19:53:59] the table has all of the data in it
[19:54:01] AND an extra partition
[19:54:05] webrequest_source
[19:54:12] which corresponds to the old tables
[19:54:18] 'bits', 'mobile', 'text', 'upload'
[19:54:26] *nods*.
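The "run smaller jobs and aggregate them later" suggestion above, spelled out as a sketch. The scratch database and table, the predicates, and the column names are placeholders, not the contents of anna.hql; the idea is just to run one small, partition-pruned job per day and sum the pieces afterwards.

    -- One-off scratch table to collect per-day results.
    CREATE TABLE IF NOT EXISTS ironholds_scratch.education_counts (
      day_of_month INT,
      requests     BIGINT
    );

    -- Run once per day (day = 1, 2, ..., 31); each run is a much smaller job.
    INSERT INTO TABLE ironholds_scratch.education_counts
    SELECT 1 AS day_of_month, COUNT(*) AS requests
    FROM webrequest
    WHERE webrequest_source IN ('text', 'mobile')
      AND year = 2014 AND month = 5 AND day = 1
      AND uri_path = '/wiki/Education';

    -- Aggregate the pieces at the end.
    SELECT SUM(requests) FROM ironholds_scratch.education_counts;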
[19:54:27] it works just like a date partition
[19:54:39] sure -- but Ironholds query worked fine for the old tables
[19:54:45] and now it doesn't work
[19:54:47] if you include which webrequest_source(s) you want, it will not look at the other source(s) data
[19:54:57] it worked over webrequest_text and webrequest_mobile individually
[19:54:57] that means that he was only running on one of the partitions at once
[19:55:03] I think ottomata is saying..yeap.
[19:55:16] as a long-term thing, though, this is probably not tremendously tenable.
[19:55:20] I'm not understanding
[19:55:25] (fyi, i'm not disagreeing that something isn't wrong, i think this should work too)
[19:55:34] ok tnegrin
[19:55:35] previously
[19:55:40] there were 3 (intended 4) tables
[19:55:43] and each one of those
[19:55:53] pointed at a subset of webrequests
[19:55:54] so
[19:56:01] I mean, if excluding bits does it, excluding bits does it.
[19:56:02] webrequest_text worked ONLY with data from text varnishes
[19:56:11] Ironholds current query
[19:56:19] does not specify which webrequest_source partition on the new table to work with
[19:56:27] so, his query is running across ALL webrequests
[19:56:37] but the question is, why the multiplication in MR jobs.
[19:56:40] Like, it's not additive.
[19:56:41] not from one particular varnish
[19:56:49] I'm pretty sure I wasn't running 97k/4 map jobs last time.
[19:56:52] how many were launched on webrequest_text?
[19:57:00] yes -- this is what I don't understand -- that's a lot of jobs
[19:57:03] I do not have that data :(. But, less.
[19:57:04] bits is a lot of data
[19:57:13] about half of what we have on disk now
[19:57:14] yeah, but...okay, so one point here is, how are partitions decided?
[19:57:15] so, by including bits
[19:57:19] but that would mean the partitions are bigger
[19:57:20] I assumed it would be year/month/day/hour
[19:57:20] you are doubling the amount of data you were querying
[19:57:35] yes
[19:57:36] partitions are
[19:57:44] webrequest_source/year/month/day/hour
[19:57:48] so, bigger partitions, definitely, because we get a lot of bits requests.
[19:58:06] but at most we should have n*4 total partitions, compared to running against one source on its own, right?
[19:58:16] each unique year/month/day/hour permutation, for each unique source.
[19:58:31] yes, but each source is of different data size
[19:58:34] but yes, same number of partitions
[19:58:34] BUT
[19:58:41] 1 partition != one file
[19:58:48] 1 partition could have many files in it
[19:58:53] and mappers are...per-file, or per-[size of data], or..?
[19:58:55] i think usually about 20, each of different size
[19:59:04] they are per block
[19:59:04] I understood it to be per-partition but that appears wrong.
[19:59:04] i think our block size is 256M
[19:59:04] so
[19:59:06] AHHH.
[19:59:10] Okay, now this makes a lot more sense.
[19:59:15] also, per file
[19:59:21] if the file size < 256M
[19:59:22] I assumed there was a mapper/partition relationship, or mapper/file relationship.
[19:59:24] which for mobile, is true
[19:59:32] *nods*.
[19:59:41] can we force that?
[19:59:47] change camus?
[20:00:00] well, or we could just exclude bits ;).
[20:00:04] let's see if that works first.
[20:00:08] ja, we could run separate camus jobs to run mobile less frequently :)
[20:00:09] If it does, great. If it doesn't, we can has problem.
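To make the partition layout described above concrete: Hive can list the partitions it knows about, and the hourly directories behind them are what the mappers actually get carved out of (one mapper per 256 MB block, or per file when the file is smaller than a block). The exact partition-spec format and the day/hour values below are illustrative.

    -- Each entry corresponds to one hourly directory per source, e.g.
    --   webrequest_source=text/year=2014/month=5/day=28/hour=14
    SHOW PARTITIONS webrequest;

    -- Pinning down webrequest_source (plus year/month/day/hour) means Hive only
    -- plans mappers for the blocks under the matching directories, much as
    -- querying the old per-source tables did.
    SELECT COUNT(*)
    FROM webrequest
    WHERE webrequest_source = 'mobile'
      AND year = 2014 AND month = 5 AND day = 28 AND hour = 14;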
[20:00:10] or
[20:00:11] actually
[20:00:14] camus has a feature built in
[20:00:19] called the camus-sweeper
[20:00:23] that will concat hdfs files into larger files
[20:00:28] el futbol
[20:00:29] i haven't tried it at all yet
[20:00:44] hah
[20:00:44] this is probably a good thing to try
[20:00:50] yeah, it's on my todo list
[20:00:56] we could also limit the number of mappers hive can run
[20:00:57] especially since mobile files are smaller
[20:01:06] hmmmm, i don't think that would help...would it?
[20:01:11] like what, fail your job before submitting?
[20:01:26] actually, that'd help tremendously.
[20:01:29] you can limit the number of mappers that run at once, but if it has to work with that much data, it's going to need mappers
[20:01:37] at the moment it doesn't check if [number of mappers] > [max number of mappers] at the start.
[20:01:51] it just keeps launching them until it hits [max number of mappers] and then chokes and dies.
[20:02:08] wait, wha?
[20:02:08] are you sure?
[20:02:13] that's why it dies?
[20:02:14] want me to send you the full log?
[20:02:20] well, that's why it /claimed/ it was dying.
[20:02:28] yeah, you got an exception or something?
[20:02:29] I've got 2.5M of logs from the last query.
[20:02:30] on stat1002?
[20:02:31] i'm there now, can look
[20:02:40] /home/ironholds/anna.tsv
[20:02:41] that's from the hive stdout/stderr?
[20:02:43] you don't wanna cat that thing
[20:02:44] yep.
[20:02:48] k
[20:02:53] yeah so
[20:02:56] oh bollocks
[20:02:57] webrequest_text
[20:02:57] right now is
[20:02:59] 12.8 T /wmf/data/external/webrequest/webrequest_text/hourly/2014/05
[20:03:01] that's compressed
[20:03:01] it's overwritten it because I relaunched the task
[20:03:03] * Ironholds headdesks.
[20:03:20] so that's 50K mappers right there
[20:03:31] yeek
[20:03:50] what would be the practical effect of increasing block size?
[20:03:52] and
[20:03:56] each mapper has to:
[20:03:58] grab a 256M block
[20:04:00] uncompress
[20:04:00] parse json
[20:04:05] do hive query
[20:04:13] 256M is already a large block size
[20:04:17] darnit.
[20:04:18] the default is 64
[20:04:21] What happened to big data! :P
[20:04:26] Yahoo lied to me!
[20:04:36] well, big data is supposed to be handled by lots of little mappers!
[20:04:36] tnegrin, I blame you for this. Everything yahoo does is your fault.
[20:04:36] that's true.
[20:04:42] haha
[20:04:52] I'm now imagining mappers as adorable gnomes
[20:04:55] Ironholds: are you just trying to get an ad-hoc result?
[20:04:58] aaanyway. Okay.
[20:04:59] or is this for something regular?
[20:05:03] ottomata, just ad-hoc.
[20:05:07] hm
[20:05:07] aye
[20:05:14] So, it's rerunning now, which, downside, means that I have overwritten stderr. Oops.
[20:05:24] but upside, either it'll succeed, in which case excluding bits was the answer
[20:05:25] ok, well, we will see what happens :)
[20:05:28] or it will fail in the old way.
[20:05:35] or it will fail in an entirely new and fascinating way!
[20:05:44] the possibilities are endless (and mostly painful)
[20:06:22] yeah, the worst is that it takes so long to find out, eh?
[20:06:52] i think getting the refined table with cluster sampling will make this a lot easier
[20:06:59] yup.
[20:07:02] especially when you want to get a rough answer faster
[20:07:10] My analogy is: if I was asking the Mars Rovers for data, they'd respond faster.
[20:07:16] haha
[20:07:22] naw, you'd need 90K rovers
[20:07:27] working together
[20:07:44] totally!
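Two knobs related to the small-files / mapper-count discussion above, as a sketch. These are generic Hive/Hadoop settings rather than anything specific to this cluster, and whether they are available depends on the deployed Hive version; they only reduce mapper counts by combining small files into larger splits, they cannot make 12-plus terabytes of compressed data cheap to scan.

    -- Check how much data a month of one source actually is; these sizes drive
    -- the mapper count far more than the number of partitions does.
    dfs -du -h /wmf/data/external/webrequest/webrequest_text/hourly/2014/05;

    -- Combine small files (e.g. the mobile hours) into fewer, larger input
    -- splits so one mapper handles several files. Assumed to be supported by
    -- the Hive version in use; the camus-sweeper mentioned above would fix the
    -- same problem at write time instead.
    SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    SET mapred.max.split.size=256000000;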
[20:07:53] but, it would come back and be all "I can't do that" in what, 8 minutes?
[20:08:01] haha, yeah
[20:08:08] i can't do that, hal (oliver?)
[20:08:13] It's got better "I'm givin' her all she's got cap'n! I cannae do it!" rates is my point.
[20:08:22] maybe we should write a paper on this
[20:08:28] haha
[20:08:32] turning decommissioned NASA projects into a distributed computing system.
[20:08:41] haha, you should talk to the ICE people
[20:08:47] we could call it AUnalysis.
[20:09:23] heh, so! since you seemed to be curious before
[20:09:37] i didn't actually move any data when I created the new webrequest table
[20:09:41] all I did
[20:09:43] was create a new table
[20:09:47] and added partition maps to it
[20:10:10] for f in all hourly directories; alter table webrequest add partition ... location 'f'; done
[20:10:15] then deleted the old tables
[20:10:25] *nods*.
[20:10:27] the data in HDFS stayed in exactly the same place it was before (well, almost, I did move it one level down)
[20:10:37] so it didn't change anything in hadoop's storage
[20:10:40] just how hive sees the data.
[20:10:43] right
[20:11:02] so when you query with webrequest_source='mobile', you are querying the exact same data you did when you previously queried the webrequest_mobile table
[20:11:17] effectively you exchanged data type as a 'hard' partition (table) for data type as a 'soft' partition (i.e. one that can be queried across)
[20:11:19] gotcha.
[20:11:22] yup
[20:11:31] okay: let's see what the query does now :D
[20:11:59] I am headdesking for not spitting the potential bits thing, though.
[20:12:18] *spotting
[20:12:51] ottomata, okay, something is definitely wrong;
[20:13:04] just another FYI, bits is 12.0 T /wmf/data/external/webrequest/webrequest_bits/hourly/2014/05
[20:13:05] too
[20:13:08] I excluded bits. Now it's asking for 100k mappers.
[20:13:09] so also a lot of mappers
[20:13:12] WHA???
[20:13:12] hm
[20:13:23] Hadoop job information for Stage-1: number of mappers: 100477; number of reducers: 1
[20:13:36] hmm, kill it
[20:13:37] let's try with
[20:13:53] AND (webrequest_source = 'text' OR webrequest_source = 'mobile')
[20:13:58] maybe the != is weird
[20:14:16] (i'm looking at explains...)
[20:14:34] now to try and find the jobID!
[20:15:10] should be close to the top of your output
[20:15:27] so yeah, when I explain that query with != vs = OR =
[20:15:30] so, != bits has
[20:15:58] usr/lib/hadoop/bin/hadoop job -kill job_1387838787660_14130
[20:15:58] Error: JAVA_HOME is not set and could not be found.
[20:16:02] expr: ((uri_path = '/wiki/Education') and (uri_host) IN ('outreach.m.wikimedia.org', 'outreach.wikimedia.org'))
[20:16:03] and
[20:16:04] I've decided technology just hates me today.
[20:16:08] the OR = has
[20:16:08] expr: (((month = 5.0) and (webrequest_source = 'text')) or (((webrequest_source = 'mobile') and (uri_path = '/wiki/Education')) and (uri_host) IN ('outreach.m.wikimedia.org', 'outreach.wikimedia.org')))
[20:16:15] Ironholds
[20:16:24] yarn application -kill
[20:16:26] it will look like
[20:16:28] gotcha; ta.
[20:16:31] application_NNNNNNN_NNNNN
[20:17:49] okay, killed, modified to add the OR, restarted
[20:18:11] I am going to briefly depart for a smoke while it ever-so-slowly loads up, and will report back with cheers or tears depending ;p
[20:19:31] k
[20:19:38] i'm heading out pretty soon, maybe 30 mins
[20:31:03] ottomata, okie-dokes :)
[20:31:08] how's it lookin?
[20:31:11] how many mappers on that one?
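The "for f in all hourly directories; alter table webrequest add partition ... location 'f'; done" loop above, with one iteration written out in full. The exact directory layout and the day/hour values are assumptions based on the paths quoted earlier; the point is that ADD PARTITION only registers metadata in Hive, the files themselves never move.

    -- One iteration of the loop: map an existing hourly directory to a
    -- (webrequest_source, year, month, day, hour) partition of the new table.
    ALTER TABLE webrequest ADD IF NOT EXISTS
      PARTITION (webrequest_source = 'text', year = 2014, month = 5, day = 28, hour = 14)
      LOCATION '/wmf/data/external/webrequest/webrequest_text/hourly/2014/05/28/14';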
[20:31:18] 100669
[20:31:34] uhhh 100669
[20:31:34] I believe the appropriate phrase is "Jesus Wept"
[20:32:03] OH
[20:32:05] Ironholds:
[20:32:12] AND uri_path = '/wiki/Education';
[20:32:12] you got a premature semicolon
[20:33:04] ahh!
[20:33:31] your application ID is
[20:33:31] application_1387838787660_14139
[20:33:33] kill away!
[20:33:33] :)
[20:33:40] already did :)
[20:33:54] reset, rerunning in 120 seconds
[20:37:07] ANnnnnnd mappers?
[20:37:11] oh not running yet....
[20:39:03] 55433
[20:39:04] ok!
[20:39:12] that's fewer!
[20:39:15] and makes sense
[20:40:32] (12.1 T (text) + 1.9 T (mobile)) * 1024 * 1024 / 256M == 57344
[20:41:00] neat :)
[20:41:05] okay, let's see what happens to this query.
[20:41:15] ja
[20:41:30] when are we dragging you out to the office? I owe you a drink or seven, I think.
[20:42:01] dunno!
[20:42:07] probably this summer or fall sometime
[20:47:19] *nods*
[21:00:01] laters!
[22:03:15] (PS13) Milimetric: Add cohort class hierarchy, refactor CohortService [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/134743
[22:07:19] (PS8) Milimetric: Enable project-level cohorts via WikiCohort [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/135181
[22:57:10] (CR) QChris: [C: -1] "Since it seems that there are at least three issues that" (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/135128 (owner: Ottomata)