[01:21:07] PROBLEM - Disk space on Hadoop worker on an-worker1098 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[01:32:27] PROBLEM - Disk space on Hadoop worker on an-worker1100 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[03:05:21] PROBLEM - Disk space on Hadoop worker on an-worker1096 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[04:16:23] PROBLEM - Disk space on Hadoop worker on an-worker1110 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[05:32:31] PROBLEM - Disk space on Hadoop worker on an-worker1105 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[06:48:31] RECOVERY - Disk space on Hadoop worker on an-worker1100 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:02:09] RECOVERY - Disk space on Hadoop worker on an-worker1098 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:05:22] Analytics, Operations, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Aklapper) Open→Stalled @Sbodington: The previous "Approved" comment here initially looked like drive-by vandalism to me. It was made by a [self-cre...
[07:23:49] RECOVERY - Disk space on Hadoop worker on an-worker1096 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:28:37] PROBLEM - Disk space on Hadoop worker on an-worker1105 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:43:27] PROBLEM - Disk space on Hadoop worker on an-worker1105 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:05:53] !log remove big stderr log file in /var/lib/hadoop/data/e/yarn/logs/application_1605880843685_1450 on an-worker1105
[08:05:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:10:19] !log remove big stderr log file in /var/lib/hadoop/data/d/yarn/logs/application_1605880843685_1450 on an-worker1110
[08:10:22] Hi elukey
[08:10:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:10:35] Can I help?
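As a side note to the cleanup above, here is a minimal sketch of how one might scan the datanode partitions for oversized YARN container log files before deciding what to remove. It is not the procedure actually used in the log; the directory letters and the 10 GiB threshold are assumptions based on the alerts.

```python
# Sketch: list container log files above a size threshold under the
# /var/lib/hadoop/data/*/yarn/logs directories seen in the alerts above.
# The set of partition letters and the threshold are illustrative assumptions.
import os

DATA_DIRS = [f"/var/lib/hadoop/data/{d}/yarn/logs" for d in "abcdefghijklm"]
THRESHOLD = 10 * 1024**3  # 10 GiB

for base in DATA_DIRS:
    if not os.path.isdir(base):
        continue
    for root, _dirs, files in os.walk(base):
        for name in files:
            path = os.path.join(root, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # containers may delete their logs while we scan
            if size >= THRESHOLD:
                print(f"{size / 1024**3:6.1f} GiB  {path}")
```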
[08:11:31] joal: bonjour bonjour
[08:11:51] Analytics: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (elukey)
[08:12:19] joal: nothing on fire, some workers have a couple of disks filled up due to --^, I am fixing
[08:12:48] it is also due to hdfs being used a lot, we store more than 2PB now
[08:13:33] hm
[08:13:59] still elukey - it's always the same job causing problems
[08:14:48] yes I opened a task so we can help, going to update the email thread as well
[08:15:26] We can help, but the job should be stopped from now on - I'll request that
[08:17:47] joal: I am not sure, it is true that the job spams a lot, but I also see it as a limitation of our infra, disks are almost maxed out
[08:17:57] ok
[08:18:05] RECOVERY - Disk space on Hadoop worker on an-worker1110 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:19:22] elukey: the fact that other jobs don't fail or generate that many problems is only a signal IMO
[08:20:15] elukey: I'm gonna make an fsimage analysis for size on monday
[08:21:21] yep :)
[08:21:36] I am wondering if there is a way to set a max file size for yarn logs
[08:28:09] RECOVERY - Disk space on Hadoop worker on an-worker1105 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:38:35] I am wondering if the /etc/spark2/conf/log4j.conf is picked up by containers
[08:38:44] because we set log4j.rootCategory=INFO, console
[08:40:24] I am reading in a lot of places that adding something like spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/etc/spark2/conf/log4j.conf might help
[08:43:30] in theory it should be picked up
[08:48:35] anyway, let's see on monday
[08:48:37] o/
[12:20:00] (PS1) GoranSMilovanovic: 20201121_batch [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/642645
[12:20:21] (CR) GoranSMilovanovic: [V: +2 C: +2] 20201121_batch [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/642645 (owner: GoranSMilovanovic)
[12:25:14] Analytics: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (GoranSMilovanovic) @elukey If it could help, here's the [[ https://github.com/wikimedia/analytics-wmde-WDCM/blob/master/_wdcmModules/w...
[13:03:34] Analytics: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (GoranSMilovanovic) @elukey Additional info: - I have stopped any attempts to run the job - its Pyspark code is referenced in T268376#...
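The thread above leaves open how a PySpark job could have silenced the DEBUG spam itself. Here is a minimal sketch of one way to do that from inside a session, assuming the noise comes from a JVM-side logger running at DEBUG; the logger name "org.apache.avro" and the app name are assumptions for illustration, not taken from the actual WDCM job.

```python
# Sketch: raise the log level for a chatty JVM-side logger from PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-logging-example").getOrCreate()
sc = spark.sparkContext

# Coarse fix: raise the root log level. This only affects the driver JVM.
sc.setLogLevel("WARN")

# Finer fix: target the suspected logger through the py4j gateway
# (again, driver-side only; executor containers still need a log4j config
# shipped to them, e.g. via spark.executor.extraJavaOptions).
log4j = sc._jvm.org.apache.log4j
log4j.LogManager.getLogger("org.apache.avro").setLevel(log4j.Level.WARN)
```

For the executor containers themselves, the usual route is to point both JVMs at a log4j config at submit time, for example --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/etc/spark2/conf/log4j.conf" plus the matching spark.executor.extraJavaOptions, provided that file exists on every worker; whether the cluster defaults already do this is exactly the open question in the log above.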
[13:06:35] Analytics: Fix purging pageview_actor data - https://phabricator.wikimedia.org/T268382 (JAllemandou)
[16:53:16] Analytics, Analytics-Features: Feature request: Keeping track of time spent in phases of edits for users - https://phabricator.wikimedia.org/T268385 (Jukeboksi) p:Triage→Low
[18:23:07] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:33:45] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:58:05] Analytics, Analytics-Wikistats, I18n: WikiReportsLocalizations.pm still fetches language names from SVN - https://phabricator.wikimedia.org/T64570 (Aklapper) #Analytics: How to get an answer / a decision? Is this still needed / wanted, or should this be declined in favor of Wikistats 2.0? Patch merge...