[05:03:07] cdanis: if you are around, could I get a check on https://gerrit.wikimedia.org/r/c/operations/puppet/+/548954/ ?
[05:03:22] cdanis: it looks like go.dog is out for the rest of the week
[13:18:06] <_joe_> mutante: that script is definitely executed by cron
[13:18:20] <_joe_> but it's not in the crontab where you'd expect it to be
[13:20:34] <_joe_> Nov 7 07:56:01 mw2225 CRON[16180]: (root) CMD (/usr/local/bin/hhvm-needs-restart
[13:20:42] <_joe_> so it does indeed get executed by cron
[13:21:16] <_joe_> but it's not in the root crontab where it would be, per puppet
[13:21:26] <_joe_> so it's executing even though it should not
[13:22:49] <_joe_> but crontab -l is just reading the crontab from disk
[13:23:29] <_joe_> ok, lunch time
[13:23:47] <_joe_> this is an interesting riddle, but it's probably saner to just restart cron
[13:23:58] * _joe_ lunch
[15:39:56] jbond42: watching puppet CI downvote your changes because you had the audacity to use a version of python released in the last ten years is depressing me
[15:41:04] cdanis: is that the `puppet-merge: Repository` change?
[15:41:10] galaxy brain take: we should do a clean increment and break CI for python 2
[15:41:19] jbond42: yes :)
[15:42:13] ahh, it hadn't clicked that it was because CI was testing with python2; i guess that's what prompted the other comment on https://gerrit.wikimedia.org/r/510613, thanks :)
[15:42:23] yes
[15:43:10] at the time you sent 510613 my reaction was basically "I don't know enough about what's in the Puppet repo, and I certainly don't know enough about how rake is written"
[15:43:17] fyi, boron is WARNING about disk space - https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=boron&service=Disk+space
[15:43:30] but we're so far away from the perfect being the enemy of the good here
[15:43:43] the silly status quo is also the enemy of the slightly better
[15:44:09] rlazarus: in the puppet repo that might be doable
[15:47:01] cdanis: i agree, feel free to rip 510613 apart or do something new and add me. i created that as a conversation piece more than anything else; it's definitely not pretty or perfect, but as long as there are no bugs i think it's probably better than what we have
[15:59:58] created this task for boron: https://phabricator.wikimedia.org/T237649 (dunno who is fluent in Reprepro that I can CC)
[16:04:06] What is going on?! only 3 alerts on https://icinga.wikimedia.org/alerts
[16:04:58] XioNoX: i'll take a look at boron tomorrow
[16:42:47] is there a way in phabricator to make a workboard a default view, or at least linked in the left panel menu?
[16:50:57] project names, when clicked, can be configured to load either the project details page or the workboard by default
[16:51:05] then you can just link the project in your left navbar
[16:51:09] that's what i do for #procurement
[16:52:28] i don't quite recall how you set the default project behavior though, it's been a while
[16:52:31] looking now
[16:52:53] or just link directly to the workboard via a custom link, i suppose
[16:53:11] it's the same project id, but https://phabricator.wikimedia.org/project/board/# rather than https://phabricator.wikimedia.org/project/profile/#
[17:07:34] cool, thanks
[17:07:45] there are like five UIs in phab, it gets a bit confusing.
[19:19:25] tried to use the cumin server in deployment-prep to find stuff out about cloud instances and was confused why stuff like "facter os.release" does not work. turns out getting "nested" facts just doesn't work in facter 2 but does in facter 3, and prod is newer.
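A minimal sketch of the workaround implied above, assuming facter is on the PATH: facter 3 resolves dotted paths like `os.release.major`, while facter 2 only answers to the flat legacy names, so a small lookup helper can try both. The fact names and the fallback mapping here are illustrative assumptions, not a documented facter API.

```python
#!/usr/bin/env python3
# Sketch only: query a structured fact with a flat-name fallback, so the
# same call works on facter 3 (prod) and facter 2 (older cloud instances).
import subprocess

def get_fact(dotted: str, legacy: str) -> str:
    """Try the facter 3 dotted lookup first, then the facter 2 flat name."""
    for name in (dotted, legacy):
        result = subprocess.run(['facter', name],
                                capture_output=True, text=True)
        value = result.stdout.strip()
        if result.returncode == 0 and value:  # facter prints nothing for an unknown fact
            return value
    raise LookupError(f'no value for {dotted!r} or {legacy!r}')

if __name__ == '__main__':
    # 'operatingsystemrelease' is the legacy flat name assumed here
    print(get_fact('os.release.major', 'operatingsystemrelease'))
```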
[20:02:20] apergos, I'm here
[20:02:31] o/
[20:02:39] hey hey
[20:02:42] hi
[20:02:47] discussion for 30 minutes: do we have enough common needs in a job scheduler that we might agree on a common solution. liw gehel ebernhardson marxarelli ottomata
[20:02:48] hoya
[20:03:34] my understanding from the discussions so far is that we probably have enough conflicting constraints that the effort to bridge all of them is not worth it
[20:03:37] just fyi i might have to duck out in 30 min for an appointment
[20:03:50] yeah
[20:03:54] from what I understand in T237361
[20:03:55] T237361: Discuss common needs in a job manager/scheduler - https://phabricator.wikimedia.org/T237361
[20:04:02] (limited understanding, feel free to contradict me!)
[20:04:09] might we go around and state the 3 biggest needs? e.g. for me: extensibility via an api accessible from python, distributed over multiple hosts, multiple queues with priority
[20:04:11] Argo is probably the best option for the CI use case
[20:04:17] because it is built with CI in mind
[20:04:19] and yes, 30 minutes hard cap
[20:04:23] right, there's a lot more to a functioning ci system than job scheduling
[20:04:33] Airflow or something else might work, but we'd have to do a lot of custom stuff to get it to work for CI
[20:04:43] so, ci's top three needs for their sw are:
[20:04:48] and airflow would necessitate a lot of home-grown integration parts
[20:05:26] Since releng is farther ahead on this than we are, and the other options we are considering (like airflow) aren't the best choices for them, i think they should probably just move forward with argo
[20:05:33] overall i would try to think of airflow not as a system that runs your code, but as a system that submits your job to some compute (hadoop/yarn, k8s, whatever). I don't think we want to embed that much smarts inside the workflows
[20:05:42] and, when analytics & sre are ready to work on our side of things
[20:05:45] er, we have a lot of requirements :) https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/CI_Futures_WG/Requirements#Hard_requirements
[20:05:50] we should do a similar evaluation to the one releng did
[20:05:54] and include argo in our options
[20:06:04] maybe argo will be great for our use cases... but maybe not!
[20:06:10] if we can all use argo that sounds awesome
[20:06:25] marxarelli, and the newci doc has more - url in ticket
[20:06:27] i suspect that Argo will be great for CI, and Airflow will be great for the other stuff we need.
[20:06:54] do we know that argo won't be great for our (non-ci) needs?
[20:07:05] should we look at it, the rest of us?
[20:07:10] i think we should look at it
[20:07:14] the only reason i've discounted argo is because i need to submit jobs to an existing non-k8s cluster
[20:07:34] is argo exclusively for use with k8s then?
[20:07:37] yes
[20:08:24] is there an expectation that the non-k8s cluster will remain so for some time? (a few years, let's say)
[20:08:25] yes. to work, it defines a number of k8s custom resources, most notably the Workflow resource, which is fulfilled by spawning a number of underlying pods for each workflow task
[20:08:35] very much "k-native"
[20:08:56] there's a concise rundown in https://docs.google.com/document/d/1b6sqmfdcH4XL8wayL5OOaJX9xOyDq-osVEEkzzIWWY4/edit#heading=h.raithztgi2cu
[20:09:06] (our proposal to SRE)
[20:09:17] that non-k8s cluster is hadoop and I don't think we have plans to move away from it
[20:10:09] does argo not permit pods to access data in hadoop then? that's just off the table?
[20:10:24] because I'm looking at hadoop on my end also as a logical place to have an object store
[20:10:38] a k8s cluster could certainly talk to hdfs
[20:10:59] hm, are we conflating 2 things?
[20:11:09] there are the workflow tasks, which in argo always run in k8s
[20:11:13] airflow can also have tasks run in k8s
[20:11:23] it would depend on the k8s cluster's network egress, i think. it's worth differentiating between argo and a particular argo setup/cluster in this case
[20:11:26] or it could run in a celery cluster
[20:11:44] but in either case, the tasks run where they are assigned (in a k8s pod, or a celery worker), and then do whatever
[20:11:51] in analytics' case, that whatever is submitting jobs to hadoop
[20:12:02] i suppose argo could do the same?
[20:12:19] argo is k8s-native, so you would have to spin up a pod that submits to the hadoop cluster
[20:12:23] ya
[20:12:35] but in theory, sure
[20:12:36] if we did an airflow setup, we might want to run our airflow cluster in k8s too
[20:12:44] when you say 'submits'... are you expecting 'submits to yarn' or something similar?
[20:12:47] yes
[20:12:55] or not, but likely yes
[20:13:13] so two job management mechanisms at the same time
[20:13:27] airflow and argo aren't just job management systems
[20:13:34] they are workflow systems
[20:13:36] the isolation there is basically that if you want to run something in the airflow context, airflow has to have your execution environment, dependencies, etc. Ideally we want whatever workflow system we pick to submit jobs to a compute cluster that deals with execution environments on a per-workflow basis
[20:13:39] DAGs of tasks
[20:13:50] currently we have oozie + hadoop
[20:13:51] i don't think we'd want them (airflow and argo) targeting the same k8s cluster for workloads
[20:14:12] oozie manages workflows, and some workflows submit jobs to hadoop
[20:14:20] it just happens that oozie tasks run in hadoop too
[20:14:26] but there is no real reason they have to
[20:14:38] other than oozie was built to use hadoop
[20:14:41] instead of e.g. k8s
[20:14:47] part of this is about getting off of oozie, as I understand it
[20:14:57] ya, but i'm saying
[20:15:07] airflow would replace oozie
[20:15:11] both submit jobs to hadoop
[20:15:11] right
[20:15:20] independent of where the scheduler runs
[20:15:51] marxarelli: ya, they could be different k8s clusters
[20:16:03] i think we are talking about this to limit the # of techs we use to do the same things
[20:16:11] can have different deployments for different purposes
[20:16:18] ebernhardson: can you elaborate?
[20:16:26] ottomata: on which?
[20:16:36] your last comment
[20:16:39] yes; if we have shared expertise in a common platform in use, that's better than oozie for some, argo for others, airflow for yet others
[20:16:40] still grokking, sorry
[20:16:59] 'in airflow context' means on an airflow worker?
[20:17:07] ottomata: sure. i'm just being protective of the cluster we're asking for. we need plenty of resources to run software testing/integration/image building/deployment pipeline tasks :)
[20:17:11] ottomata: i mean that i don't think airflow + celery is a good solution. If we start writing a bunch of workflows directly in python that run in the airflow context, then airflow has to have whatever dependencies you use (numpy, pandas, sqlalchemy, whatever)
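To make the "submit, don't run" pattern being described concrete, here is a hedged sketch of what such a thin Airflow DAG could look like, assuming a 2019-era Airflow 1.10 install; the DAG id, schedule, paths, and the spark job it submits are invented placeholders. The worker only shells out to the YARN client, so numpy/pandas-style dependencies stay with the submitted job rather than in Airflow's own environment.

```python
# Hypothetical Airflow DAG: the task only *submits* work to YARN via
# spark-submit; the heavy dependencies ship with the Spark job itself.
# Import paths are the Airflow 1.10 ones; Airflow 2.x moved BashOperator
# to airflow.operators.bash.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'analytics',
    'retries': 1,
    'retry_delay': timedelta(minutes=15),
}

with DAG(
    dag_id='example_submit_to_hadoop',   # placeholder name
    default_args=default_args,
    start_date=datetime(2019, 11, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    # All the smarts live in the submitted job, not in this file.
    submit = BashOperator(
        task_id='spark_submit_report',
        bash_command=(
            'spark-submit --master yarn --deploy-mode cluster '
            '/srv/jobs/daily_report.py --date {{ ds }}'  # hypothetical job
        ),
    )
```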
[20:17:25] haha apergos, oozie is better for nobody :p
[20:17:27] ottomata: any workflow system should be limited to submitting jobs to a compute cluster from its own context, not running a bunch of stuff directly
[20:17:53] ah yes ok
[20:17:54] agree
[20:18:03] for hadoop that'll be hadoop.
[20:18:33] but submitting tasks to k8s pods could be useful too
[20:18:42] yup
[20:18:48] and doing both in a workflow DAG would be great too
[20:18:50] like
[20:19:13] hadoop job -> k8s pod reads output, does some work -> kafka
[20:19:15] who knows
[20:19:16] anything really
[20:19:35] for dumps we'll be querying databases or running mw maintenance scripts, but some outputs might go to hadoop in the longer term and be recombined from there into larger output files; the code to run those already exists in python, though not in a 'wrappable by airflow' format
[20:20:05] it would be nice to let airflow handle some of the related dependencies
[20:20:13] does argo do dependencies also?
[20:20:32] apergos: by dependencies do you mean package deps or task dependencies?
[20:20:35] task
[20:20:42] ah, ya, my understanding is it does
[20:20:50] job b needs output from job a, etc
[20:20:51] argo is also DAG workflows
[20:20:56] it implements DAG execution
[20:21:12] 9 minutes to go
[20:21:27] and can pass simple input/output parameters around, as well as artifacts if you've configured an artifact store
[20:21:51] what sort of api bindings does it have; what do jobs look like?
[20:22:47] argo Workflow resources are k8s CRDs (custom resource definitions) and are fulfilled by a resident controller installed across the k8s cluster
[20:23:17] so, basically a pointer to a container and some parameters in yaml?
[20:23:19] some example resources can be found here: https://github.com/argoproj/argo/tree/master/examples
[20:23:30] so a param could be a wiki name, for example
[20:23:50] pretty much. container specs as well as parameter definitions
[20:24:19] you'd likely want common tasks to be implemented as discrete container images
[20:24:31] so for the dumps case, that could be something like a generic "mediawiki prod container" where it invokes mwscript to run the dump?
[20:24:45] that would function, yeah
[20:24:46] :)
[20:24:53] anything really
[20:25:06] five minutes left
[20:25:10] gehel: you have been pretty quiet, thoughts/questions from your side?
[20:25:25] just reading,
[20:25:38] hm yeah, i think the 'mediawiki prod container for job or mwmaint' is part of https://phabricator.wikimedia.org/T218812 too
[20:25:41] I'm just the guy saying that people smarter than me should talk together
[20:27:06] apergos: do you have active plans to work on this soon for dumps?
[20:27:09] what do folks think about doing a comparison of argo and airflow for our various needs (except CI, where the CI piece is simply missing from Airflow) and seeing what it looks like?
[20:27:16] compare notes after
[20:27:21] just want to offer: i've dived into argo quite a bit, for anyone that wants to talk beyond the meeting
[20:27:36] ottomata: I have zero resources and yet it has to happen
[20:27:39] haha
[20:27:41] ok
[20:27:43] even submitted a handful of PRs during our evaluation
[20:27:43] same over here.
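For a concrete picture of "a pointer to a container and some parameters in yaml", here is a hedged sketch of submitting an Argo Workflow resource with the kubernetes Python client. The image name, namespace, and the mwscript invocation are hypothetical; this only mirrors the shape of the upstream examples linked above, not any deployed setup.

```python
# Sketch only: build an Argo Workflow CRD as a plain dict and hand it to
# the cluster; the resident Argo controller then spawns the pod(s).
from kubernetes import client, config

workflow = {
    'apiVersion': 'argoproj.io/v1alpha1',
    'kind': 'Workflow',
    'metadata': {'generateName': 'dump-'},
    'spec': {
        'entrypoint': 'run-dump',
        # a param could be a wiki name, as floated above
        'arguments': {'parameters': [{'name': 'wiki', 'value': 'elwiki'}]},
        'templates': [{
            'name': 'run-dump',
            'inputs': {'parameters': [{'name': 'wiki'}]},
            'container': {
                # hypothetical generic "mediawiki prod" image
                'image': 'docker-registry.example.org/mediawiki-prod:latest',
                'command': ['mwscript'],
                'args': ['dumpBackup.php',
                         '--wiki={{inputs.parameters.wiki}}',
                         '--full'],
            },
        }],
    },
}

config.load_kube_config()  # or load_incluster_config() inside the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group='argoproj.io', version='v1alpha1', namespace='default',
    plural='workflows', body=workflow)
```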
[20:27:57] i'd expect analytics eng to work on replacing oozie... Q4 at the earliest
[20:27:59] probably next FY
[20:28:01] I think that on our (search) side we have a good enough solution for our use case atm
[20:28:12] we'll wait for analytics to provide us with a better one :)
[20:28:23] gehel: oozie/yarn currently, or...?
[20:28:44] apergos: airflow/yarn
[20:28:48] nope, airflow deployed in a not entirely reusable way
[20:28:50] ah ha
[20:29:01] there are puppet patches working through :P
[20:29:04] so you'll be a source of good info on that option too
[20:29:08] 1 minute
[20:29:19] marxarelli: what is RelEng's timeline for argo?
[20:29:28] well, ebernhardson will be a good source of info :)
[20:29:40] yea, certainly by the time analytics has time to look, it seems we should have some experience operating airflow and argo to inform things
[20:29:45] ottomata, asap
[20:29:45] (the royal "you" :-))
[20:29:47] ottomata: tbd, but our OKRs vary in optimism :)
[20:29:54] haha, ok, so imminent
[20:30:00] probably operating by Q4?
[20:30:03] hopefully, yeah
[20:30:07] yes!
[20:30:13] er, yes.
[20:30:15] :)
[20:30:24] ok, then apergos i'm cool with continuing to use T237361 as a place to discuss and gather use cases and thoughts
[20:30:24] we're at time, for anyone who needs to drop off
[20:30:24] T237361: Discuss common needs in a job manager/scheduler - https://phabricator.wikimedia.org/T237361
[20:30:46] I'll try to summarize our conversation there tomorrow my time, if someone doesn't get to it first
[20:30:48] sounds like we have part of a plan already :)
[20:30:49] but we might have more practical info by Q4, which is the earliest i think any of us would have time to work on this
[20:31:17] past my bedtime, so I'm dropping, but I leave you marxarelli
[20:31:18] apergos: thanks for organizing/running the meeting!
[20:31:25] i have to jet as well
[20:31:29] RelEng should move forward with Argo, and analytics and SRE will sync up about job workflow scheduler plans and arch when we get around to actually implementing something :)
[20:31:30] thanks all
[20:31:32] same for me, time to stop!
[20:31:36] thanks for showing up and contributing!
[20:31:41] \o
[20:31:42] apergos: thanks for making this conversation happen!
[20:31:43] except I'm not leaving marxarelli, oops
[20:31:57] :)
[20:31:59] yeah, thanks.
[20:32:00] heh
[20:32:01] and bye
[20:32:05] l8r!
[20:32:13] ok, laters all, good talk. | (• ◡•)|
[20:32:21] have fun!
[20:32:43] going to have *dinner* (at 10:30 at night, ah well, my own dang fault :-P)
[20:44:17] is it possible to delete a gerrit comment?
[20:51:05] not after pressing send
[20:51:27] you can discard a draft, but once it's sent it's carved in stone
[20:52:56] ok, thx! no big deal
[20:55:37] the pedantic answer is something like "only by rewriting git history on the gerrit server", which is functionally the same as "no" :)
[20:59:23] it's sort of like trying to delete a photo from "the internet" :)
[21:26:13] hey, I know the internet is just one small box with a light on it; all you do is remove it from there and you're done
[21:55:40] is it expected that https://puppet-compiler.wmflabs.org/ is gone?
[21:57:39] mutante: I think jbond42 was touching those machines earlier, but I don't know the details
[21:58:00] rlazarus: ok, thanks
[22:14:59] mutante: Some machines were apparently rebuilt
[22:15:10] Which meant a change of IP addresses
[22:15:23] So I'm guessing the glue hasn't been updated to point at the right places
[22:15:38] Reedy: the webproxy disappeared because it was associated with the instance name. jbond is looking on -cloud
[22:16:00] so yes, basically that. thanks
[22:16:14] heh
[22:16:18] lag and so many channels :P
[22:16:31] yea, too many channels