[01:57:01] displays
[08:35:23] https://yarn.wikimedia.org/jobhistory/job/job_1605880843685_9832 - 111 hours!
[08:35:55] and I see a ton of
[08:35:56] SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 10.64.21.104:50526:null (DIGEST-MD5: IO error acquiring password)
[08:38:57] and the error msg is in hue
[08:38:58] Application application_1605880843685_18051 finished with failed status
[08:39:13] SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
[08:39:16] ERROR StatusLogger Reconfiguration failed: No configuration found for '5e8c92f4' at 'null' in 'null'
[08:39:25] 18:59:29.491 [SIGTERM handler] ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend - RECEIVED SIGNAL TERM in stdout
[08:40:24] ah ok 18051 is the appmaster
[08:42:56] I am inclined to re-run and see, weird
[09:55:38] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:06:22] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:50:59] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:01:42] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers