[01:21:07] PROBLEM - Disk space on Hadoop worker on an-worker1098 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[01:32:27] PROBLEM - Disk space on Hadoop worker on an-worker1100 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[03:05:21] PROBLEM - Disk space on Hadoop worker on an-worker1096 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[04:16:23] PROBLEM - Disk space on Hadoop worker on an-worker1110 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[05:32:31] PROBLEM - Disk space on Hadoop worker on an-worker1105 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[06:48:31] RECOVERY - Disk space on Hadoop worker on an-worker1100 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:02:09] RECOVERY - Disk space on Hadoop worker on an-worker1098 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:05:22] Analytics, Operations, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Aklapper) Open→Stalled @Sbodington: The previous "Approved" comment here initially looked like drive-by vandalism to me. It was made by a [self-cre...
[07:23:49] RECOVERY - Disk space on Hadoop worker on an-worker1096 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:28:37] PROBLEM - Disk space on Hadoop worker on an-worker1105 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:43:27] PROBLEM - Disk space on Hadoop worker on an-worker1105 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:05:53] !log remove big stderr log file in /var/lib/hadoop/data/e/yarn/logs/application_1605880843685_1450 on an-worker1105
[08:05:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:10:19] !log remove big stderr log file in /var/lib/hadoop/data/d/yarn/logs/application_1605880843685_1450 on an-worker1110
[08:10:22] Hi elukey
[08:10:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:10:35] Can I help?
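As a side note to the cleanup above, here is a minimal sketch of how one might scan the datanode partitions for oversized YARN container log files before deciding what to remove. It is not the procedure actually used in the log; the directory letters and the 10 GiB threshold are assumptions based on the alerts.

```python
# Sketch: list container log files above a size threshold under the
# /var/lib/hadoop/data/*/yarn/logs directories seen in the alerts above.
# The set of partition letters and the threshold are illustrative assumptions.
import os

DATA_DIRS = [f"/var/lib/hadoop/data/{d}/yarn/logs" for d in "abcdefghijklm"]
THRESHOLD = 10 * 1024**3  # 10 GiB

for base in DATA_DIRS:
    if not os.path.isdir(base):
        continue
    for root, _dirs, files in os.walk(base):
        for name in files:
            path = os.path.join(root, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # containers may delete their logs while we scan
            if size >= THRESHOLD:
                print(f"{size / 1024**3:6.1f} GiB  {path}")
```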
[08:11:31] joal: bonjour bonjour
[08:11:51] Analytics: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (elukey)
[08:12:19] joal: nothing on fire, some workers have a couple of disks filled up due to --^, I am fixing
[08:12:48] it is also due to hdfs being used a lot, we store more than 2PB now
[08:13:33] hm
[08:13:59] still elukey - it's always the same job causing problems
[08:14:48] yes I opened a task so we can help, going to update the email thread as well
[08:15:26] We can help, but the job should be stopped from now on - I'll request that
[08:17:47] joal: I am not sure, it is true that the job spams a lot, but I also see it as a limitation of our infra, disks are almost maxed out
[08:17:57] ok
[08:18:05] RECOVERY - Disk space on Hadoop worker on an-worker1110 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:19:22] elukey: the fact that other jobs don't fail or generate that many problems is only a signal IMO
[08:20:15] elukey: I'm gonna make an fsimage analysis for size on monday
[08:21:21] yep :)
[08:21:36] I am wondering if there is a way to set a max file size for yarn logs
[08:28:09] RECOVERY - Disk space on Hadoop worker on an-worker1105 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:38:35] I am wondering if the /etc/spark2/conf/log4j.conf is picked up by containers
[08:38:44] because we set log4j.rootCategory=INFO, console
[08:40:24] I am reading in a lot of places that adding something like spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/etc/spark2/conf/log4j.conf might help
[08:43:30] in theory it should be picked up
[08:48:35] anyway, let's see on monday
[08:48:37] o/
[12:20:00] (PS1) GoranSMilovanovic: 20201121_batch [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/642645
[12:20:21] (CR) GoranSMilovanovic: [V: +2 C: +2] 20201121_batch [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/642645 (owner: GoranSMilovanovic)
[12:25:14] Analytics: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (GoranSMilovanovic) @elukey If it could help, here's the [[ https://github.com/wikimedia/analytics-wmde-WDCM/blob/master/_wdcmModules/w...
[13:03:34] Analytics: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (GoranSMilovanovic) @elukey Additional info: - I have stopped any attempts to run the job - its Pyspark code is referenced in T268376#...
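The thread above leaves open how a PySpark job could have silenced the DEBUG spam itself. Here is a minimal sketch of one way to do that from inside a session, assuming the noise comes from a JVM-side logger running at DEBUG; the logger name "org.apache.avro" and the app name are assumptions for illustration, not taken from the actual WDCM job.

```python
# Sketch: raise the log level for a chatty JVM-side logger from PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-logging-example").getOrCreate()
sc = spark.sparkContext

# Coarse fix: raise the root log level. This only affects the driver JVM.
sc.setLogLevel("WARN")

# Finer fix: target the suspected logger through the py4j gateway
# (again, driver-side only; executor containers still need a log4j config
# shipped to them, e.g. via spark.executor.extraJavaOptions).
log4j = sc._jvm.org.apache.log4j
log4j.LogManager.getLogger("org.apache.avro").setLevel(log4j.Level.WARN)
```

For the executor containers themselves, the usual route is to point both JVMs at a log4j config at submit time, for example --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/etc/spark2/conf/log4j.conf" plus the matching spark.executor.extraJavaOptions, provided that file exists on every worker; whether the cluster defaults already do this is exactly the open question in the log above.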
[13:06:35] Analytics: Fix purging pageview_actor data - https://phabricator.wikimedia.org/T268382 (JAllemandou)
[16:53:16] Analytics, Analytics-Features: Feature request: Keeping track of time spent in phases of edits for users - https://phabricator.wikimedia.org/T268385 (Jukeboksi) p:Triage→Low
[18:23:07] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:33:45] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:58:05] Analytics, Analytics-Wikistats, I18n: WikiReportsLocalizations.pm still fetches language names from SVN - https://phabricator.wikimedia.org/T64570 (Aklapper) #Analytics: How to get an answer / a decision? Is this still needed / wanted, or should this be declined in favor of Wikistats 2.0? Patch merge...