[07:48:58] inflatador: most of the time it's via spark from a notebook, sometimes spark3-sql, and for inspecting the logs I use the "yarn" client; I use the yarn web UI to list running jobs and access the Spark UI
[10:21:52] lunch
[13:02:57] o/
[13:17:36] dcausse thanks, I was troubleshooting an HDFS permissions issue in Slack but it looks like Balthazar got it
[13:37:43] \o
[13:58:16] .o/
[14:04:30] o/
[14:54:33] school run
[15:21:52] I suspect spark does not know much about the mem used by the python process? Not seeing anything very useful from the spark UI
[15:22:08] hmm, I think it has something now about limiting python... hmm
[15:22:36] and can't ask yarn to give me the actual mem usage of a container :/
[15:22:49] in theory spark.executor.pyspark.memory is supposed to limit the usage of the pyspark side
[15:23:06] oh, looking, thanks!
[15:24:03] it doesn't really say how it limits it, but maybe it helps. A long time ago I ended up pulling some code out of the yarn resource manager and loading it on the spark executor side so I could call the same functions it uses to log actual resource usage... but I doubt I have that around anymore
[15:25:21] seems what I need: "If not set, Spark will not limit Python's memory use and it is up to the application to avoid exceeding the overhead memory space shared with other non-JVM processes"
[15:56:31] sigh... MemoryError
[15:56:41] :(
[15:57:54] trying with spark.executorEnv.PYSPARK_EXECUTOR_MEMORY_MB; someone reported that spark.executor.pyspark.memory does not seem to be passed through properly on yarn (https://stackoverflow.com/questions/77876840/pyspark-spark-executor-pyspark-memory-introduced-errors)
[16:14:25] ok, seems to work with PYSPARK_EXECUTOR_MEMORY_MB, just hoping that it'll help python keep mem lower and not throw a mem error...
[16:15:00] yea, sometimes it just shifts to a more explicit error
[16:15:05] no clue how gc is working in python...
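The two knobs discussed above (the documented conf and the env-var workaround from the StackOverflow thread) can both be passed at launch time. A sketch only: the launcher name, job file, and all memory sizes are illustrative assumptions, not the actual values used in this conversation.

```shell
# Illustrative sketch: cap the pyspark worker's memory on yarn.
# Sizes, launcher name, and my_job.py are assumptions.
spark3-submit \
  --master yarn \
  --conf spark.executor.memory=8g \
  --conf spark.executor.memoryOverhead=4g \
  --conf spark.executor.pyspark.memory=4g \
  --conf spark.executorEnv.PYSPARK_EXECUTOR_MEMORY_MB=4096 \
  my_job.py
```

Note that `spark.executor.pyspark.memory` carves its limit out of the overhead space shared with other non-JVM processes, which is why both it and `memoryOverhead` appear together here; the `executorEnv` line is the reported workaround for the conf not being propagated on yarn.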
[16:15:44] python is mostly reference counted, IIUC; the gc mostly handles cycles
[16:16:15] but I dunno what pyspark looks like internally... maybe it's all cycles :P
[16:17:01] yes...
[16:17:32] if the python gc is a bit lazy like the java one, I'm hoping that an explicit limit might help
[16:17:51] it should at least encourage it to run; no clue how often python usually runs the gc
[16:18:31] I wonder what a yarn container looks like though: does the client app actually see what mem it has, or does it see the host's physical mem?
[16:21:23] dunno if it's right, but gemini claims setting the pyspark memory limit sets RLIMIT_AS or RLIMIT_RSS at the kernel level, which causes malloc to fail and triggers a MemoryError (which somehow does not trigger an automatic gc)
[16:22:22] sigh...
[16:22:52] it will fail earlier then, but might not help with my problem :(
[16:25:52] might explain why it failed earlier just calling mmap on the model weights...
[16:27:28] Looks like the readahead script might be causing issues for the k8s workers. I've got a patch up to make it run every 30m instead of 5m: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1267898 . It's a pain, but we should be able to fix the script next week
[16:27:44] I could be misremembering, but I vaguely remember having trouble with yarn before considering mmap'd things part of memory used
[16:29:00] yes, mmap'd files are always a bit difficult to control; IIRC flink had issues with rocksdb in that regard too
[16:29:40] well, will let it run like that for the weekend, but a bit pessimistic that it'll succeed :/
[16:29:56] heading out, have a nice weekend
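The rlimit behaviour described at 16:21:23 can be checked directly with Python's standard `resource` module. A minimal sketch (Linux-only; the helper name and sizes are illustrative, and whether pyspark actually uses `RLIMIT_AS` is the unverified claim from the chat, not something this snippet proves):

```python
import resource


def allocation_fails_under_limit(limit_bytes, alloc_bytes):
    """Cap the address space with RLIMIT_AS, then attempt one big allocation.

    Mimics what the chat (via gemini, unverified) says the pyspark limit
    does: the kernel refuses the mapping, malloc fails, and CPython raises
    MemoryError without running a gc collection first.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))
    try:
        buf = bytearray(alloc_bytes)  # one large allocation
        del buf
        return False
    except MemoryError:
        return True
    finally:
        # Restore the original soft limit.
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))


# A 2 GiB allocation under a 512 MiB address-space cap should be refused.
print(allocation_fails_under_limit(512 * 1024**2, 2 * 1024**3))
```

Because the failure is an exception rather than memory pressure, it never goes through the cyclic garbage collector, which is consistent with the observation that the limit "does not trigger an automatic gc"; the same mechanism would also reject a large `mmap` of model weights even though those pages are file-backed.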