[06:45:38] (CR) Nuria: [C: -1] "-1 until joseph can CR, post k-day" [analytics/refinery] - https://gerrit.wikimedia.org/r/552943 (https://phabricator.wikimedia.org/T238360) (owner: Nuria)
[12:35:34] Analytics: Archive /home/ezachte data on stat1007 - https://phabricator.wikimedia.org/T238243 (Erik_Zachte) So I looked first into the cron processes that are still enabled on /home/ezachte. There are two. One is running fine (compressing page view counts into daily/monthly zips for 3rd parties). The other...
[19:13:18] all right I am finally at home :)
[19:28:56] so IIUC for some reason some refine yarn mappers consume more memory and get killed
[19:29:24] now I am wondering if raising the limit a bit (say mapred.child.java.opts) could help refine succeed
[19:31:07] but I am not sure how to test it
[19:34:20] there are two hours with mapper memory limit issues, and one breaching the data loss threshold
[19:38:34] joal: are you around by any chance?
[19:42:13] (dinner brb)
[20:16:12] just launched a coordinator to run webrequest_load with a modified oozie_launcher_memory, not sure if it will help but it is an easy test since it is a parameter
[20:20:31] hey elukey I'm here, can I help?
[20:20:47] reading emails
[20:21:35] hey mforns!
[20:22:23] so some webrequest_load hours are failing in refine, IIUC (to be verified) I think it is due to mappers being killed because they breach their max memory limits
[20:23:07] there might be some data that causes more memory to be consumed, not sure
[20:23:12] I see
[20:23:17] there is also one hour failing for data loss
[20:23:21] just to add more fun
[20:23:30] and the "Unable to initialize FileSignerSecretProvider" error is related to that?
[20:24:03] mforns: not sure where it was found, it might or might not be related
[20:25:03] elukey, nuria mentioned it in the first email, for id: job_1573208467349_185270
[20:25:31] mforns: yes I've read it, but didn't see it in the logs that I checked
[20:25:40] I mean the failed jobs
[20:25:43] aha
[20:27:20] but please consider what I am saying as "to be validated"
[20:30:02] mforns: I am currently running a separate coordinator only for one failed hour, to test an oozie parameter that increases memory for mappers, but it probably will not work
[20:30:29] I'm trying to find nuria's log
[20:30:38] precisely -Doozie_launcher_memory=8196
[20:30:48] yep yep I am keeping you updated about my ramblings
[20:30:55] so you can tell me if I am crazy or not
[20:31:28] yeah it doesn't work
[20:31:47] uff
[20:34:01] hmmm
[20:34:21] I saw some errors in webrequest-load-wf-text-2019-12-14-13
[20:34:31] it seems to be failing to set the success flag
[20:34:56] https://hue.wikimedia.org/oozie/list_oozie_workflow/0002917-191212123816836-oozie-oozi-W/?coordinator_job_id=0017451-191108122147035-oozie-oozi-C&bundle_job_id=0017450-191108122147035-oozie-oozi-B
[20:35:25] If I grep for ERROR, they all belong to either: mark_add_partition_done
[20:35:34] or mark_raw_dataset_done
[20:39:11] mforns: which logs are you grepping?
[20:39:25] ah in the log panel?
[20:39:31] yes, the URL I pasted
[20:40:12] mforns: but if you click on the job_etc.. link associated with the refine failure
[20:40:15] then you go to https://hue.wikimedia.org/jobbrowser/jobs/job_1573208467349_187412
[20:41:00] yes, then click on the "hamburger" button
[20:41:22] in the stderr panel, I can see
[20:41:23] ERROR : Ended Job = job_1573208467349_187413 with errors
[20:41:26] wait no...
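(Editor's note: a minimal sketch of pulling the same failure details from the command line instead of Hue, assuming shell access to an analytics client host with the yarn and mapred CLIs; the job/application ids are the ones quoted in this log and may already have aged out of log retention.)

  # Aggregated logs for the Hive child job that the stderr panel points at
  yarn logs -applicationId application_1573208467349_187413 | less

  # List only the failed map attempts of that job, matching the FAILED attempts
  # view that is linked just below in the conversation
  mapred job -list-attempt-ids job_1573208467349_187413 MAP failed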
[20:42:24] elukey, what I did was click on the Log tab within the workflow page
[20:43:13] mforns: if you go to https://hue.wikimedia.org/jobbrowser/jobs/job_1573208467349_187412, then "Logs" in the left panel
[20:43:24] then in stderr you'll find the log above
[20:43:37] that should be the one related to the hive query that fails
[20:43:53] namely https://hue.wikimedia.org/jobbrowser/jobs/job_1573208467349_187413
[20:45:20] that is http://yarn.wikimedia.org/proxy/application_1573208467349_187413
[20:45:42] there are some failed/killed tasks reported at the bottom
[20:46:08] the failed ones are https://yarn.wikimedia.org/jobhistory/attempts/job_1573208467349_187413/m/FAILED
[20:46:17] Container [pid=32073,containerID=container_e02_1573208467349_187413_01_000165] is running beyond physical memory limits. Current usage: 2.0 GB of 2 GB physical memory used; 3.6 GB of 4.2 GB virtual memory used. Killing container.
[20:46:40] aha
[20:48:03] why this happens is not clear to me, I guess some data in those hours causes more memory to be used?
[20:52:52] the volume of data in the raw webrequest seems ok
[20:53:13] the failed hours seem to have an acceptable size of data compared to other dates
[20:54:06] it could be some specific value that triggers this
[20:55:24] so we have parameters for oozie.launcher.mapreduce.map.memory.mb
[20:55:38] but this is to increase the heap size of the Application Master
[20:55:57] say for example if the Hive query was big and the client side was going OOM
[20:56:33] meanwhile we should temporarily set the mappers to mapreduce.map.memory.mb=8196
[20:56:50] but IIUC this would need new parameters in the refine workflow
[21:01:10] we could also check the failed hour with Spark and see if something is different, but it could take a while
[21:01:31] tomorrow does not seem a good day for kerberos, sigh
[21:02:24] mforns: it is very late, we can re-check tomorrow morning with Joseph.. Maybe there is a quick workaround, worst case we don't enable kerberos
[21:02:40] ://///////
[21:02:54] yeah I know, this is really bad luck
[21:06:07] all right
[21:06:29] thanks Marcel
[21:07:08] no problem, thank you
[21:07:22] xD
[21:07:24] :D
[21:07:25] <3
[21:07:39] see ya tomorrow
[21:08:33] o/
[21:10:11] elukey: super thanks for checking this out, I do not think this should hold up kerberos, it seems entirely unrelated, will do some selects today to see if I can figure out what is different in those two hours
[21:12:17] nuria: hola! Yes we can proceed anyway, but there is already a bit of backlog for those hours, and having an outstanding issue on top of potentially other ones when kerberos is on is a bit scary :(
[21:13:14] if oozie wasn't so cumbersome to configure we could test adding more memory to mappers and see if it is a one-off (it happened with camus too in the past)
[21:23:56] nuria: going to stop checking now, I'll restart with Joseph tomorrow morning and we'll make the call about kerberos, ok?
[21:24:08] thanks a lot for helping/checking :)
[21:24:26] * elukey afk!
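(Editor's note: a minimal sketch of the temporary workaround discussed above, i.e. re-running the failing refine Hive query by hand with larger map containers instead of adding new parameters to the oozie workflow. The script name refine_webrequest.hql and the partition variables are illustrative placeholders, not confirmed by the log, and the 8192/6553 MB figures are assumptions; mapreduce.map.memory.mb and mapreduce.map.java.opts are standard Hadoop per-job properties.)

  # Re-run the refine query once with bumped per-mapper memory; the java heap is
  # kept below the container size to leave headroom for off-heap usage.
  hive \
    --hiveconf mapreduce.map.memory.mb=8192 \
    --hiveconf "mapreduce.map.java.opts=-Xmx6553m" \
    -f refine_webrequest.hql \
    -d webrequest_source=text -d year=2019 -d month=12 -d day=14 -d hour=13

  # If the hand-run query succeeds with the larger containers, that would confirm
  # the memory theory before touching the oozie workflow definition.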