[00:12:09] JJMC89: https://gitlab.wikimedia.org/toolforge-repos/ldap/-/commit/db71990c3b83c1437b6bdbb1869ae311ca5e8fae it's deploying right now
[00:13:15] thanks
[00:19:50] https://ldap.toolforge.org/user/jjmc89 :)
[00:20:30] 🙂
[03:01:17] bd808, taavi: wondering if either of you have any ideas of how I can run tests from GitLab CI that query LDAP, since the runners are now in DigitalOcean
[03:05:00] legoktm: you can add a tag to choose a runner inside Cloud VPS. https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu/-/blob/main/.gitlab-ci.yml?ref_type=heads#L80
[03:06:50] :o I forgot about that, let me give it a shot
[08:53:33] hey! any idea why a toolforge job would just freeze for hours/days? Is there a way to define a timeout for a job, stopping it when it hits that?
[08:55:11] related job: https://k8s-status.toolforge.org/namespaces/tool-urbanecmbot/pods/afd-announcer-28815980-p9g27/
[09:06:25] urbanecm: mmmm I don't remember if we support healthchecks for cronjobs
[09:07:05] we probably should, it's not the most convenient thing to have to notice it fails and restart it manually
[09:10:35] urbanecm: if you open a phab ticket requesting the feature, I'll make sure it gets attention from the team. I think the change is somewhat simple
[09:21:05] arturo: sure, sounds good. filed T377420, let me know if you want me to add anything else
[09:21:06] T377420: Introduce health checks for Toolforge Jobs Framework - https://phabricator.wikimedia.org/T377420
[09:22:18] urbanecm: thanks. Health checks are already supported, but only for continuous jobs; see here https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework#Configuring_health-checks_for_jobs
[09:23:32] arturo: aha, misunderstood it then. hmm, will that work (for continuous jobs) if the pod itself gets in trouble though?
[09:24:12] urbanecm: yes, for continuous jobs, it works as expected. You would need to have some logic in your code, though
[09:24:36] but it is true that cronjobs don't have this, and supporting it may be good
[09:26:43] Gotcha, thanks for the clarification
[09:27:27] arturo: fwiw, this job is supposed to be quick (take less than 10s, actually). Dunno if that matters tho
[09:29:08] I guess another option for implementing it, instead of a cronjob, is a daemon. So the control loop is... well, under your control, and you can easily check if the thing is happening
[09:29:34] because of how healthchecks work in kubernetes, they may not be granular enough to work on a 10s job
[09:29:51] that's why your own control loop may be better
[09:30:45] i see... i guess i could convert it to a continuous job, and sleep most of the time, rather than have a "once in five minutes" cronjob.
[09:30:57] yeah
[09:31:20] `while True: whatever ; sleep 120 ;` etc
[09:31:47] then, having a healthcheck to verify the continuous job is looping
[09:32:24] yep yep
[09:32:29] will consider that
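[Editor's note: a minimal sketch of the pattern arturo suggests above: a continuous job running its own control loop, plus a heartbeat file that a health check can verify the loop is still turning. The file path, 120-second interval, and `do_work()` stub are illustrative assumptions, not from the discussion; the actual health-check wiring is documented on the wikitech page linked above.]

```python
#!/usr/bin/env python3
"""Continuous-job sketch: own control loop + heartbeat for a health check.

Run as the job itself:     python3 announcer.py
Run as the health check:   python3 announcer.py --check
(paths, names, and interval are illustrative assumptions)
"""
import pathlib
import sys
import time

HEARTBEAT = pathlib.Path.home() / "announcer.heartbeat"  # hypothetical path
INTERVAL = 120                 # loop every 2 minutes, as in the chat above
MAX_SILENCE = 3 * INTERVAL     # health check fails after 3 missed beats


def do_work():
    """Placeholder for the actual bot logic (e.g. the AfD announcing)."""


def main_loop():
    while True:
        do_work()
        HEARTBEAT.touch()      # prove the loop is still running
        time.sleep(INTERVAL)


def health_check():
    """Exit non-zero if the heartbeat is stale, so the pod gets restarted."""
    try:
        age = time.time() - HEARTBEAT.stat().st_mtime
    except FileNotFoundError:
        sys.exit(1)
    sys.exit(0 if age < MAX_SILENCE else 1)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "--check":
        health_check()
    else:
        main_loop()
```

The job would then be started as a continuous job with a health-check script, something like `toolforge jobs run ... --continuous --health-check-script "python3 announcer.py --check"`; treat that invocation as an assumption and follow the help page above for the exact flags.]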
[13:27:54] !log melos@tools-bastion-13 tools.stewardbots Restarted StewardBot/SULWatcher because of a connection loss
[13:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[21:33:45] hi all. I've noticed that my job is stuck (been running for 1d 12h). two questions: 1) how can I check what it's doing there and 2) how can I prevent it from happening - e.g. get it autokilled after a certain period of time. tool name: chie-bot, job name: job-move-to-user-page
[21:45:48] Leloiandudu: for the latter question, T377420 was just filed earlier today.
[21:45:48] T377420: Introduce health checks for Toolforge Jobs Framework cronjobs - https://phabricator.wikimedia.org/T377420
[21:46:51] For the former I was going to say `toolforge jobs logs job-move-to-user-page`, but that tells me you turned on file logging, so check job-move-to-user-page.out and job-move-to-user-page.err
[21:49:41] Leloiandudu: One of the things I tend to do with Toolforge jobs is code them so that unexpected errors will kill the process entirely. It looks a bit like your job may have seen a network error and got stuck as a result. "Unhandled exception. System.Net.WebException: Resource temporarily unavailable (ru.wikipedia.org:443)"
[21:51:36] bd808: thanks! there's nothing in the file logs. my jobs always exit on any unexpected errors (and notify me), so there's something else going on here... it almost feels like the job got stuck before my code even started executing. should I restart it or is it useful to leave it in this state for any investigation?
[21:55:36] Leloiandudu: I don't see anything exciting about the Pod on the Kubernetes side of things. I'd say kill it this time.
[21:56:32] thanks! will do
[21:56:57] `kubectl events | grep job-move-to-user-page` says "4m54s (x453 over 37h) Normal JobAlreadyActive CronJob/job-move-to-user-page Not starting job because prior execution is running and concurrency policy is Forbid" which really just says that it has been stuck for 37 hours
[21:57:37] anything from the previous run?
[21:58:33] it looks like it ran to success in 37 seconds as soon as you cleared out the stuck version
[21:59:15] yeah. same happened to another one that got stuck at the same time (job-dykc-only-new) - I restarted it about an hour ago
[21:59:37] that's why I was thinking something got stuck that was out of my control
[22:01:53] I see similar "System.Net.Http.HttpRequestException: Resource temporarily unavailable (ru.wikipedia.org:443)" failures in the job-dykc-only-new log. Unfortunately those dumps to stderr don't have timestamps that would let us know when the network blipped
[22:03:12] that was from the last time (file modification date is 5 Sept)
[22:03:40] * bd808 goes to see what happened to wikibugs
[22:06:02] !log tools.wikibugs Bot dropped offline when it lost connection to libera.net but ZNC noticed and recovered without intervention
[22:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL
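[Editor's note: a hedged sketch of the approach bd808 describes at 21:49:41, unexpected errors killing the process, extended with a wall-clock watchdog so a hung connection cannot wedge the pod for 37 hours while concurrencyPolicy Forbid blocks later runs. The stuck tool itself looks to be .NET judging by the exceptions, but this follows the Python-style pseudocode used earlier in the log. `signal.alarm` is POSIX-only; the 10-minute deadline, the use of the `requests` library, and the request parameters are illustrative assumptions.]

```python
#!/usr/bin/env python3
"""Cron-style job sketch: hard wall-clock deadline + per-request timeouts,
so a hung socket kills the job instead of wedging the pod."""
import signal
import sys

import requests  # assumes the requests library is installed

DEADLINE = 10 * 60  # illustrative: the whole job must finish in 10 minutes


def on_timeout(signum, frame):
    # SIGALRM fires once the deadline passes; die loudly so the next
    # scheduled run is not blocked by the Forbid concurrency policy.
    sys.exit("watchdog: job exceeded %ds, aborting" % DEADLINE)


def main():
    signal.signal(signal.SIGALRM, on_timeout)
    signal.alarm(DEADLINE)  # POSIX-only wall-clock watchdog

    # Per-request timeout: never let a single connection wait forever.
    resp = requests.get(
        "https://ru.wikipedia.org/w/api.php",
        params={"action": "query", "meta": "siteinfo", "format": "json"},
        timeout=(10, 30),  # (connect, read) seconds, illustrative values
    )
    resp.raise_for_status()  # unexpected HTTP errors kill the process

    signal.alarm(0)  # work finished in time; disarm the watchdog


if __name__ == "__main__":
    main()
```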
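[Editor's note: bd808 points out at 22:01:53 that the stderr dumps carry no timestamps. A small sketch, assuming Python's standard logging module, that stamps unhandled exceptions before the process dies, so a later crash can be matched against the time of a network blip.]

```python
#!/usr/bin/env python3
"""Timestamp unhandled exceptions on stderr so a crash can later be
correlated with a network blip."""
import logging
import sys

logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",  # adds timestamps
)


def excepthook(exc_type, exc, tb):
    # Log the traceback with a timestamp, then exit non-zero as the
    # default hook effectively would.
    logging.critical("unhandled exception", exc_info=(exc_type, exc, tb))
    sys.exit(1)


sys.excepthook = excepthook

if __name__ == "__main__":
    raise RuntimeError("demo: this arrives on stderr with a timestamp")
```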