[02:02:12] valhallasw`cloud: More jobs disappearing from the grid :/
[02:30:08] Labs: Labs proxy 502 should indicate where to ask for support - https://phabricator.wikimedia.org/T109078#1542986 (scfc) Okay. I understood this task to be (just) your item #1.
[03:19:00] aaaa
[09:46:23] a930913: any details?
[14:42:49] valhallasw`cloud: BracketBot and DefconBot had continuous jobs.
[14:43:52] a930913: and which tool is that?
[14:44:17] Their respective names.
[14:44:38] oh, right. I tried BracketBot, but it's case sensitive
[14:45:05] Really? :o Didn't know myself.
[14:46:27] and there was nothing in run.err/run.out?
[14:53:51] Labs, Tool-Labs: Jobs Disappearing from SGE - https://phabricator.wikimedia.org/T99027#1543261 (A930913) Resolved>Open Happened again, 13th Aug. Both DefconBot and BracketBot lost continuous jobs. > 16:00 BracketBot: Loading script. > 16:00 BracketBot: Unloading script. > 16:01...
[14:55:23] a930913: I don't get what that irc log shows
[14:56:49] valhallasw`cloud: The time the continuous job was killed.
[14:56:55] And how it happened.
[14:57:03] Because it restarted a few times.
[14:57:14] But then started and disappeared.
[14:57:18] a930913: maybe it shows that to *you* but definitely not to *me*. I have no clue how to interpret what it says there.
[14:57:25] I have no clue where what wm-bot says comes from
[14:57:49] valhallasw`cloud: https://phabricator.wikimedia.org/T99027#1283992
[14:58:00] That comment explains the relevance.
[14:58:23] ah.
[15:01:03] valhallasw`cloud: Something happens that causes the job to die without running the finally block that would send unloading. It then starts up and exits, running the finally block, twice.
[15:01:19] valhallasw`cloud: It starts up for a final time, and dies without running the finally block.
[15:01:46] a930913: huh? the irc log shows load-unload-load-unload-load
[15:02:13] valhallasw`cloud: It was running before, and was not running afterwards.
[15:02:37] So what isn't in the irc log is the initial death, and the post death.
[15:02:46] unless there are two processes running
[15:03:33] Hmm?
[15:06:12] Labs, Tool-Labs: Jobs Disappearing from SGE - https://phabricator.wikimedia.org/T99027#1543266 (valhallasw) ``` hostname tools-exec-1214.eqiad.wmflabs group tools.bracketbot owner tools.bracketbot project NONE department defaultdepartment jobname run jobnumber 5543 taskid...
[15:06:23] a930913: can you qmod -rj the bot and see what that does?
[15:06:27] qmod -rj
[15:06:35] that should reschedule the task
[15:08:22] Labs, Tool-Labs: Jobs Disappearing from SGE - https://phabricator.wikimedia.org/T99027#1543267 (valhallasw) Although exit_status 143 corresponds to SIGTERM which would suggest it was qdel'ed. I've asked @a930913 to try to `qmod -rj` the task themselves, so that we can see if that is the cause of the issue.
[15:09:07] valhallasw`cloud: I have restarted both tasks.
[15:09:35] The currently running ones are not the problem ones.
[15:09:53] ?
[15:10:37] valhallasw`cloud: Wait, you want me to use the ID of the old jobs?
[15:10:43] no, of the new jobs
[15:10:55] I'm trying to see if rescheduling causes the issues you're seeing
[15:11:35] No, I've done it a number of times.
[15:11:49] Can do it again if you want.
[15:12:08] I'd be interested to see what qacct shows in any case
[15:12:29] basically, there are three options I can think of
[15:12:34] 1) it's caused by qmod -rj
[15:12:42] 2) it's caused by an accidental qdel
[15:12:53] 3) it's caused because the server was restarted even though the job wasn't correctly rescheduled
[15:13:47] 16:12 BracketBot: Loading script. │
[15:13:49] 16:12 BracketBot: Unloading script. │
[15:13:51] 16:12 BracketBot: Loading script.
[15:14:21] Interestingly it dies without saying unloading, and restarts twice.
[15:14:46] I think you're seeing the rescheduled job start before the old one is unloaded
[15:15:01] but it doesn't die in this case
[15:23:37] a930913: as for defconbot, I only see tasks being run every 30 minutes?
[15:24:19] oh, there's also the job 'defconbot'
[15:24:20] I see
[15:24:39] Yeah, the failover runs every 30 minutes.
[15:27:56] Labs, Tool-Labs: Jobs Disappearing from SGE - https://phabricator.wikimedia.org/T99027#1543296 (valhallasw) qmod -rj triggers ``` 16:12 BracketBot: Loading script. 16:12 BracketBot: Unloading script. 16:12 BracketBot: Loading script. ``` but the bot is correctly rescheduled...
[15:28:22] a930913: I'm going to guess option 3) is what happened; specifically, the jobs being restarted right on the same host, then being killed at the reboot
[15:30:22] Seems the most likely explanation.
[22:52:46] Coren: Sorry around the wild grep earlier... was trying to edit CBNG and couldn't find the bit I was looking for, found it on my local machine and forgot to stop the grep... derp... apologies :)
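
Note on the exit_status 143 mentioned at 15:08:22: the log reads it as SIGTERM. A minimal sketch of that decoding follows, assuming the usual shell-style convention that a status above 128 means "terminated by signal (status - 128)". The helper name describe_exit_status is hypothetical and not part of qacct or any grid tool; signal names are those of the platform the sketch runs on (Linux assumed here).

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe an exit status, assuming the 128 + N convention for signals."""
    if status > 128:
        signum = status - 128
        try:
            # signal.Signals maps a signal number to its platform name.
            name = signal.Signals(signum).name
        except ValueError:
            return f"exit status {status} (no known signal {signum})"
        return f"killed by {name} (signal {signum})"
    return f"exited with status {status}"


if __name__ == "__main__":
    # 143 - 128 = 15, which is SIGTERM on Linux.
    print(describe_exit_status(143))
```

This only names the signal. It cannot tell whether the SIGTERM came from a qdel, a reschedule, or a host reboot, which is the open question in the discussion above.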