[06:45:38] (CR) Nuria: [C: -1] "-1 until joseph can CR, post k-day" [analytics/refinery] - https://gerrit.wikimedia.org/r/552943 (https://phabricator.wikimedia.org/T238360) (owner: Nuria)
[12:35:34] Analytics: Archive /home/ezachte data on stat1007 - https://phabricator.wikimedia.org/T238243 (Erik_Zachte) So I looked first into the cron processes that are still enabled on /home/ezachte. There are two. One is running fine (compressing page view counts into daily/monthly zips for 3rd parties). The other...
[19:13:18] all right I am finally at home :)
[19:28:56] so IIUC for some reason some refine yarn mappers consume more memory and get killed
[19:29:24] now I am wondering if raising the limit a bit (say mapred.child.java.opts) could help refine succeed
[19:31:07] but I am not sure how to test it
[19:34:20] there are two hours with mapper memory limit issues, and one breaching the data loss threshold
[19:38:34] joal: are you around by any chance?
[19:42:13] (dinner brb)
[20:16:12] just launched a coordinator to run webrequest_load with a modified oozie_launcher_memory, not sure if it will help but it is an easy test since it is a parameter
[20:20:31] hey elukey I'm here, can I help?
[20:20:47] reading emails
[20:21:35] hey mforns!
[20:22:23] so some webrequest_load hours are failing in refine, IIUC (to be verified) I think it is due to mappers being killed because they breach their max memory limits
[20:23:07] there might be some data that causes more memory to be consumed, not sure
[20:23:12] I see
[20:23:17] there is also one hour failing for data loss
[20:23:21] just to add more fun
[20:23:30] and the "Unable to initialize FileSignerSecretProvider" error is related to that?
[20:24:03] mforns: not sure where it was found, it might or might not be related
[20:25:03] elukey, nuria mentioned it in the first email, for id: job_1573208467349_185270
[20:25:31] mforns: yes I've read it, but didn't see it in the logs that I checked
[20:25:40] I mean the failed jobs
[20:25:43] aha
[20:27:20] but please consider what I am saying as "to be validated"
[20:30:02] mforns: I am currently running a separate coordinator only for one failed hour, to test an oozie parameter that increases memory for mappers, but it probably will not work
[20:30:29] I'm trying to find nuria's log
[20:30:38] precisely -Doozie_launcher_memory=8196
[20:30:48] yep yep I am keeping you updated about my ramblings
[20:30:55] so you can tell me if I am crazy or not
[20:31:28] yeah it doesn't work
[20:31:47] uff
[20:34:01] hmmm
[20:34:21] I saw some errors in webrequest-load-wf-text-2019-12-14-13
[20:34:31] it seems to be failing to set the success flag
[20:34:56] https://hue.wikimedia.org/oozie/list_oozie_workflow/0002917-191212123816836-oozie-oozi-W/?coordinator_job_id=0017451-191108122147035-oozie-oozi-C&bundle_job_id=0017450-191108122147035-oozie-oozi-B
[20:35:25] If I grep for ERROR, they all belong to either: mark_add_partition_done
[20:35:34] or mark_raw_dataset_done
[20:39:11] mforns: which logs are you grepping?
[20:39:25] ah in the log panel?
[20:39:31] yes, the URL I pasted
[20:40:12] mforns: but if you click on the job_etc.. link associated with the refine failure
[20:40:15] then you go to https://hue.wikimedia.org/jobbrowser/jobs/job_1573208467349_187412
[20:41:00] yes, then click on the "hamburger" button
[20:41:22] in the stderr panel, I can see
[20:41:23] ERROR : Ended Job = job_1573208467349_187413 with errors
[20:41:26] wait no...
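(Editor's note: a minimal sketch of pulling the same failure details from the command line instead of Hue, assuming shell access to an analytics client host with the yarn and mapred CLIs; the job/application ids are the ones quoted in this log and may already have aged out of log retention.)

  # Aggregated logs for the Hive child job that the stderr panel points at
  yarn logs -applicationId application_1573208467349_187413 | less

  # List only the failed map attempts of that job, matching the FAILED attempts
  # view that is linked just below in the conversation
  mapred job -list-attempt-ids job_1573208467349_187413 MAP failed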
[20:42:24] elukey, what I did was click on the Log tab within the workflow page
[20:43:13] mforns: if you go to https://hue.wikimedia.org/jobbrowser/jobs/job_1573208467349_187412, then "Logs" in the left panel
[20:43:24] then in stderr you'll find the log above
[20:43:37] that should be the one related to the hive query that fails
[20:43:53] namely https://hue.wikimedia.org/jobbrowser/jobs/job_1573208467349_187413
[20:45:20] that is http://yarn.wikimedia.org/proxy/application_1573208467349_187413
[20:45:42] there are some failed/killed tasks reported at the bottom
[20:46:08] the failed ones are https://yarn.wikimedia.org/jobhistory/attempts/job_1573208467349_187413/m/FAILED
[20:46:17] Container [pid=32073,containerID=container_e02_1573208467349_187413_01_000165] is running beyond physical memory limits. Current usage: 2.0 GB of 2 GB physical memory used; 3.6 GB of 4.2 GB virtual memory used. Killing container.
[20:46:40] aha
[20:48:03] why this happens is not clear to me, I guess some data in those hours causes more memory to be used?
[20:52:52] the volume of data in the raw webrequest seems ok
[20:53:13] the failed hours seem to have an acceptable size of data compared to other dates
[20:54:06] it could be some specific value that triggers this
[20:55:24] so we have parameters for oozie.launcher.mapreduce.map.memory.mb
[20:55:38] but this is to increase the heap size of the Application Master
[20:55:57] say for example if the Hive query was big and the client side was going OOM
[20:56:33] meanwhile we should temporarily set the mappers to mapreduce.map.memory.mb=8196
[20:56:50] but IIUC this would need new parameters in the refine workflow
[21:01:10] we could also check the failed hour with Spark and see if something is different, but it could take a while
[21:01:31] tomorrow does not seem a good day for kerberos, sigh
[21:02:24] mforns: it is very late, we can re-check tomorrow morning with Joseph.. Maybe there is a quick workaround, worst case we don't enable kerberos
[21:02:40] ://///////
[21:02:54] yeah I know, this is really bad luck
[21:06:07] all right
[21:06:29] thanks Marcel
[21:07:08] no problem, thank you
[21:07:22] xD
[21:07:24] :D
[21:07:25] <3
[21:07:39] see ya tomorrow
[21:08:33] o/
[21:10:11] elukey: super thanks for checking this out, I do not think this should hold up kerberos, it seems entirely unrelated, will do some selects today to see if I can figure out what is different in those two hours
[21:12:17] nuria: hola! Yes we can proceed anyway, but there is already a bit of backlog for those hours, and having an outstanding issue on top of potentially other ones when kerberos is on is a bit scary :(
[21:13:14] if oozie wasn't so cumbersome to configure we could test adding more memory to mappers and see if it is a one-off (it happened with camus too in the past)
[21:23:56] nuria: going to stop checking now, I'll restart with Joseph tomorrow morning and we'll make the call about kerberos, ok?
[21:24:08] thanks a lot for helping/checking :)
[21:24:26] * elukey afk!
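(Editor's note: a minimal sketch of the temporary workaround discussed above, i.e. re-running the failing refine Hive query by hand with larger map containers instead of adding new parameters to the oozie workflow. The script name refine_webrequest.hql and the partition variables are illustrative placeholders, not confirmed by the log, and the 8192/6553 MB figures are assumptions; mapreduce.map.memory.mb and mapreduce.map.java.opts are standard Hadoop per-job properties.)

  # Re-run the refine query once with bumped per-mapper memory; the java heap is
  # kept below the container size to leave headroom for off-heap usage.
  hive \
    --hiveconf mapreduce.map.memory.mb=8192 \
    --hiveconf "mapreduce.map.java.opts=-Xmx6553m" \
    -f refine_webrequest.hql \
    -d webrequest_source=text -d year=2019 -d month=12 -d day=14 -d hour=13

  # If the hand-run query succeeds with the larger containers, that would confirm
  # the memory theory before touching the oozie workflow definition.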