[14:04:09] brouberol: ok to merge yours?
[14:05:31] taavi yes please
[14:05:36] thanks, doing
[14:05:42] tx
[22:47:07] hey, can someone help me figure out the status of the `mw-script.eqiad.vlmj41vw` job? `kubectl get jobs` says 0/1 completions, `kubectl get pods` claims it OOM'ed for some reason.
[22:51:40] looking! yep so far I agree with your findings
[22:53:05] urbanecm: do you know offhand if the job you're running is unusually RAM-intensive?
[22:53:38] rzl: it shouldn't be, but it is a newly written one (i started it for the very first time in prod) and maybe i did something stupid in the code
[22:54:12] unfortunately the telemetry is pretty sparse because once it gets oomkilled, well, we don't get any more telemetry :) https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&from=now-1h&to=now&timezone=utc&var-datasource=000000017&var-namespace=mw-script&var-pod=mw-script.eqiad.vlmj41vw-vgcbs&var-container=$__all
[22:54:39] * urbanecm finds the outputs self-contradictory (either it is not completed => running, and thus not OOM'ed, or it OOM'ed, and then it should be completed)
[22:54:49] not completed in the way i wanted it to, but...
[22:55:15] ah no, "Completed" means it completed successfully
[22:55:34] if it fails, it's incomplete forever :)
[22:55:42] aha!
[22:55:59] the difference is important because it goes to kubernetes' restart behavior -- if configured to, it will restart a Failed job but not a Completed one
[22:56:15] (presently mw-script jobs are never restarted so that difference doesn't matter to you)
[22:56:37] makes sense
[22:56:48] okay, so then i need to figure out why it takes more memory than i expect it to
[22:57:06] rzl: afaics, those are the resources i have. is that correct? https://www.irccloud.com/pastebin/CM2Daff4/
[22:57:47] yep that looks right to me -- I just ran `kubectl describe pod mw-script.eqiad.vlmj41vw-vgcbs` to get the current state and that matches
[22:58:02] okay, sounds good. thanks for the quick help!
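(editor's note) The apparent contradiction discussed above, a job at 0/1 completions whose pod was OOM-killed, can be checked from the JSON that `kubectl get job <name> -o json` and `kubectl get pod <name> -o json` print. The following is a minimal sketch, not the tooling used in the channel. The helper names `job_outcome` and `oom_killed_containers`, the container name `mediawiki-main`, and both sample objects are hypothetical; only the field names (`status.conditions`, `status.containerStatuses[].state.terminated.reason`) come from the Kubernetes API.

```python
from typing import Any


def job_outcome(job: dict[str, Any]) -> str:
    """Return 'Complete', 'Failed', or 'Incomplete' for a Job object.

    A Job only counts as finished once it carries a condition of type
    'Complete' or 'Failed' whose status is the string 'True'.
    """
    for cond in job.get("status", {}).get("conditions", []):
        if cond.get("status") == "True" and cond.get("type") in ("Complete", "Failed"):
            return cond["type"]
    return "Incomplete"


def oom_killed_containers(pod: dict[str, Any]) -> list[str]:
    """Return names of containers whose current or last state is OOMKilled."""
    names = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                names.append(cs["name"])
                break  # count each container once
    return names


# Hypothetical sample objects, trimmed to the fields the helpers read.
sample_job = {
    "status": {
        "failed": 1,
        "conditions": [
            {"type": "Failed", "status": "True", "reason": "BackoffLimitExceeded"}
        ],
    }
}
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "mediawiki-main",  # made-up container name
                "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
                "lastState": {},
            }
        ]
    }
}

print(job_outcome(sample_job))
print(oom_killed_containers(sample_pod))
```

To run it against a real object, one would replace the sample dicts with `json.load(sys.stdin)` and pipe the `kubectl ... -o json` output in. A job that never succeeded reports either 'Failed' or 'Incomplete' here, never 'Complete', which matches the explanation given in the channel.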
[22:58:30] so if you investigate and find you really do need more than 1.2 gigs, we can configure it, but if that's a surprise, better to debug that first
[22:58:49] i don't expect the job to need more than that. i'll look into that.
[22:59:01] rad, good luck!
[22:59:06] is it possible to get profiling data from jobs (even completed ones)?
[22:59:21] i can run the job on a smaller set of data and profiling could help me understand what is going on
[23:00:23] hm, I don't know enough about PHP profiling to give you a solid answer, but I suspect it's doable with a little wiggling
[23:01:29] two broad approaches are to log whatever data you need to stdout on some container (in which case you can retrieve it after completion with `kubectl logs`) or to get the pod to stay running so that you can talk to it -- in which case maybe you launch a `shell.php` job and invoke your real work from inside it
[23:05:56] hmm, https://wikitech.wikimedia.org/wiki/WikimediaDebug#Plaintext_CLI_profile seems to have some guidance, but just running things with --profiler=text doesn't seem to do anything
[23:08:17] it _does_ work in beta though
[23:09:19] yeah I'm not sure -- the wrapper script should pass --profiler=text through to mwscript just fine, but my guess is whatever profiling stuff it needs isn't actually available in the cli image
[23:10:11] seems so
[23:10:55] ah, yeah that would need the debug image =/
[23:11:14] which alas, does not have the CLI scripts
[23:11:20] heh
[23:11:52] as long as i can run things there, i should just be able to invoke MWScript.php manually though?
[23:12:52] it _might_ be a challenge to do that, given the way we've structured the entrypoint on the mediawiki container
[23:14:17] wait, hmmm ... urbanecm: do you happen to know which profiler you need? I just realized that xhprof _is_ in the cli image :)
[23:16:03] yeah, i'm just reading https://github.com/wikimedia/operations-mediawiki-config/blob/master/src/Profiler.php#L65, which does have some logic for setting that up
[23:18:54] indeed, we recently switched it over from tideways
[23:23:19] urbanecm: K.rinkle would be a good person to poke for help in figuring out profiling. He's the wizard at most of that kind of stuff in prod.
[23:23:30] ^ this :)
[23:23:47] swfrench-wmf: it seems the missing bit is auto_prepend_file not being set to `/srv/mediawiki/wmf-config/PhpAutoPrepend.php`. no idea if that's intentional
[23:24:11] but if i require that file, profiling config does generate
[23:27:56] got it, interesting ... to be honest, I have no idea if it is intentional, and might defer to K.rinkle in this case as well
[23:46:31] okay, thanks all. filed T404804 and summarized the discussion here. let's see what K.rinkle will say there.
[23:46:32] T404804: GrowthExperiments:graduateEligibleMentees.php maintenance scripts OoMs for eswiki - https://phabricator.wikimedia.org/T404804