[13:51:42] <bblack>	 hasharLunch: ping - lmk when you have time to talk about authdns linting, I saw your https://gerrit.wikimedia.org/r/#/c/341602/ :) I think my question is: is there a sane way to get from that commit, to a place where the rspec-based thing becomes a lint job that deploys config files like a fake production node (to some staging path) and then runs a local command against the outputs to validate th
[13:51:48] <bblack>	 eir sanity?
[13:52:41] <bblack>	 or is even rspec really limited to testing the source files as opposed to deploying staged copies of templated output files?
[13:53:46] <bblack>	 hasharLunch: (or alternatively, as you mentioned before, perhaps such a test is best done as some kind of automated use of the puppet-compiler?)
[15:19:06] <hashar>	 bblack: sorry with the production explosion I havent quite processed the above ^ :(
[15:19:23] <hashar>	 the lame patch https://gerrit.wikimedia.org/r/#/c/341602/  basically compile a catalog all in memory using puppet
[15:19:36] <hashar>	 and then lets one does assertion as to how that catalog look. Eg some file has some content
[15:19:57] <hashar>	 so in theory we could use the same mecanism to generate the dns config files
[15:20:16] <hashar>	 boot a local DNS server out of those config files and run whatever test we want against
[15:20:43] <hashar>	 but I am not quite sure how to invoke puppet to generate the files withouth having to rely on rspec-puppet
[15:21:12] <hashar>	 (or we use rspec / ruby to write the tests since it has all the context already)
[15:22:59] <bblack>	 hashar: for context - we don't actually have to boot a DNS server to run tests.  We can just install the server software on all CI nodes (like we do today with authdns::lint), and the server software includes a linting-command
[15:23:57] <bblack>	 hashar: so basically, on a production node the authdns::config class would end up deploying a bunch of templated output files to /srv/authdns/staging-puppet/ .  What we want on lint is the ability to generate those same output files *somewhere* (doesn't matter where), and then have lint invoke a check-command that reads them and return exit code 0 if success
[15:24:39] <bblack>	 hashar: (the distinction is, we don't have to do a proper full production puppetization of an actual running server - just get the files outtputted)
[15:25:39] <bblack>	 also, proper generation of those output files for linting means using the applicable production hieradata, too.
[15:25:55] <bblack>	 (not mock/test simpler hieradata)
[15:26:27] <bblack>	 if an rspec-based thing can do that sort of thing, then we can go down that road, but I guess if not what we're really looking for is something like automating a puppet-compiler run
[15:27:14] <hashar>	 yeah it can 
[15:27:18] <hashar>	 pretty sure about it
[15:27:54] <hashar>	 rspec-puppet has some support for hiera, so we could point it to ./modules/puppetmaster/files/production.hiera.yaml
[15:28:10] <bblack>	 I guess, we can do a mock site.pp with one node named "testauthdns.wikimedia.org" which is defined as being in site:eqiad, production, and having "role authdns::server" or whatever
[15:28:14] <hashar>	 do some sed/string to fix the paths it contains and point to the build area of the jobo
[15:28:17] <bblack>	 and then it will pick up the right hiera
[15:28:48] <hashar>	 yup
[15:28:51] <bblack>	 have we done things like this before?
[15:28:56] <bblack>	 (some example I can follow?)
[15:29:10] <hashar>	 once rspec-puppet is configured to have hiera point to a hiera.yaml  and modules dir set,   the rspec definition is pretty much:
[15:29:21] <hashar>	 let(:node) { 'testauthdns.wikimedia.org' }
[15:29:36] <hashar>	 let(:pre_condition { 'role authdns::server' }
[15:29:49] <hashar>	 it {  # here you are in the catalog for that node with that role
[15:30:11] <hashar>	 I think I did some similar patch in a previous attempt last summer. Let me find it
[15:31:05] <hashar>	 https://gerrit.wikimedia.org/r/#/c/308889/3/tests/spec/spec_helper.rb  !!
[15:31:25] <hashar>	 it is a draft really
[15:31:48] <hashar>	 the idea is to eventually run rspec from the root of the repo, and have some kind of integration tests for hosts that matters
[15:34:56] <bblack>	 ultimately, we have the same check as this CI job post-merge too (as in, after puppet-merge, when we run authdns-update on a live production authdns, it does the same lint check locally before actual update of the running config)
[15:35:24] <bblack>	 so that's the fallback if we can't get this working initially I think.  It just means when someone merges a change with a non-obvious typo sort of issue, they have to merge it then see the failure then go revert it.
[15:36:46] <bblack>	 or if we had a beta-cluster authdns, we could cherrypick it and try it there first
[15:37:13] <hashar>	 will keep that in mind
[15:37:20] <hashar>	 that sounds like a good evening hacking project
[15:37:59] <hashar>	 I will see what I can do this week and craft a change based on rspec-puppet that generates the config files
[15:38:42] <bblack>	 I'd do it myself (I may yet try) but I have little idea what I'm doing with that rspec-puppet stuff, and it doesn't seem easy to experiment on it either
[15:38:52] <hashar>	 then I guess it is all having rspec to invoke authdns-lint 
[15:39:08] <hashar>	 yeah it is like learning vim :/
[15:40:03] <hashar>	 https://gerrit.wikimedia.org/r/#/c/308889/ has the bits to tweak the hiera.yaml production file to point to the local path
[15:40:11] <hashar>	 and configure rspec to use that hiera config
[15:40:36] <hashar>	 but the spec fails eventually. One needs to locally clone labs/private and there is later some dns error
[15:40:47] <bblack>	 at which point rspec-puppet can basically do whatever puppet would do to the mock/test site.pp machine, but in some chroot dir?
[15:41:07] <hashar>	 it doesnt apply the catalog
[15:41:13] <bblack>	 ok
[15:41:20] <hashar>	 it is just the catalog compilation then providing helpers to assert the catalog
[15:41:24] <hashar>	 which is really a huge dict/hash/list
[15:41:51] <hashar>	 so the catalog compilation will have something like { files => { '/etc/dns.conf': :content => 'some payload' } }
[15:41:57] <bblack>	 does it have built-in stuff to output the catalog's generated files (file or template stanzas) that were destined for some fs subpath?
[15:42:12] <hashar>	 and rspec puppet provides the boilerplate to do  it should.contain_file.with_content(/some payload/)
[15:42:23] <bblack>	 like output_catalog_files('/srv/authdns/where-they-go-in-prod-catalog/', '/staging_dir/')
[15:43:21] <bblack>	 (the point of "where-they-should-go" being to not bother generating every possible output file from base classes, just the ones we care about for this job)
[15:43:26] <hashar>	 it { p catalogue }
[15:43:32] <hashar>	 that display the whole array
[15:43:52] <hashar>	 I have a few draft notes on wikitech
[15:43:55] <hashar>	 the most useful is https://wikitech.wikimedia.org/wiki/Puppet_coding/testing#ruby_debugger
[15:44:11] <hashar>	 one can use the ruby debugger to stepin the rspec context
[15:44:24] <bblack>	 yeah I get the file contents are all in that dataset somewhere.  But I guess we have to write our own code to dig through it and write disk files in a staging dir for catalog file contents with original paths under /foo/ ?
[15:44:25] <hashar>	 and one then have a fancy colored command line to play with the catalogue variable
[15:44:32] <hashar>	 yeah correct
[15:44:47] <hashar>	 have to figure out how to write the files down to some chroot
[15:45:33] <hashar>	 which is probably "all about" iterating the list of files and replace ^/  by the job build area (eg on Jenkins could be $WORKSPACE/build 
[15:46:01] <hashar>	 there is another framework that is perfectly suited for the kind of test you want https://github.com/puppetlabs/beaker
[15:46:16] <hashar>	 which as I understand it boot a container, apply the manifests and let you then run acceptance tests against it
[15:46:25] <hashar>	 I havent played with it yet :(
[15:49:36] <hashar>	 bblack: I am having a team meeting now and got some other calls this evening
[15:49:49] <hashar>	 but I can demo you rspec-puppet basics tomorrow morning 
[15:50:24] <hashar>	 it is not so complicated, but the doc lacks in a few area and the stack is quite intimidating ( bundle, rake, rspec, rspec-puppet etc)
[15:52:50] <bblack>	 hashar: yeah the container-based approaches look appealing/simpler
[15:53:22] <bblack>	 "deploy role::foo to a new container with basic-debian-jessie as a starting point.  run command X in the container after deploy"
[15:53:27] <hashar>	 bblack: and as I understand it one ends up with a fully provisioned node and can run DNS queries against it :]
[15:53:55] <hashar>	 which might be a good lead for the varnish VCL tests as well
[17:04:49] <paravoid>	 bblack: so what's the kind of information that we need from puppet in order to lint?
[17:06:07] <volans>	 the compiled templates AFAIK
[17:07:33] <bblack>	 yes, compiled config files
[17:08:10] <volans>	 so basically a erb template compiler with hiera lookups :)
[17:08:18] <bblack>	 like... puppet
[17:08:22] <volans>	 eheheh :D
[17:08:41] <bblack>	 as I've said before, there's like 10 different ways we can ugly-hack things and make things work short term
[17:08:49] <bblack>	 the reason I resist those is because that's how we create messes
[17:08:55] <bblack>	 when do we go back and fix it?
[17:09:37] <paravoid>	 well yes sure
[17:09:37] <paravoid>	 but
[17:09:47] <paravoid>	 having the capability of generating a mock config is not by itself a bad idea
[17:10:03] <bblack>	 a mock config at which level do you mean?
[17:10:05] <mutante>	 like we could change the datacenter naming scheme to not depend on the vendor to unblock puppet changes for Asia dc by just calling it "asia" /me hides
[17:10:28] <bblack>	 mutante: agreed, but also kinda pointless, as the puppet change is trivial/mechanical
[17:10:30] <paravoid>	 a config that would allow our DNS config to be linted in the short-term
[17:10:53] <paravoid>	 I don't see how it's less tech debt to just disable DNS linting in general
[17:11:13] <bblack>	 yes, I think that's a reasonable view of those two things in isolation :)
[17:12:03] <bblack>	 we already have a mock test config in the currently-deployed setup, too
[17:12:12] <paravoid>	 config-geo-test you mean?
[17:12:15] <bblack>	 yeah
[17:12:34] <bblack>	 we could just add static lines to that config, which enumerate mocked discovery resources, or whatever
[17:12:42] <paravoid>	 yeah, that
[17:12:58] <paravoid>	 I don't think it's even worth re-templating with jinja
[17:13:20] <paravoid>	 just copy/pasting the discovery resources should be fine for now I think
[17:13:27] <paravoid>	 it's not like we have a thousand of them
[17:13:45] <bblack>	 yes
[17:14:00] <volans>	 and for now the discovery ones should be static right? (opt-in)
[17:14:12] <bblack>	 the reason this rankles me is we're just digging a deeper hole.  When do we do the project that makes this sane again and moves it to puppet and gives it proper linting there?
[17:14:29] <bblack>	 what quarterly goal for no functional gain will that be part of?
[17:14:56] <volans>	 next quarter DC-switchover goal?
[17:15:06] <bblack>	 it's no more necessary then than it is now
[17:15:12] <bblack>	 (functionally-necessary)
[17:15:13] <paravoid>	 it doesn't have to be part of a quarterly goal to happen :)
[17:15:48] <paravoid>	 and I know we're not great at tackling random tech debt outside of goals, but that's not a reason to rush a pretty serious change in
[17:15:57] <volans>	 but right now is a blocker to ensure that all the other pieces works together with this, as soon as we have it in prod better it is, we can improve the way CI works for it while already deployed I guess
[17:16:04] <paravoid>	 or to make the debt bigger/more noticeable to increase pressure to fix it :)
[17:16:11] <volans>	 lol
[17:16:15] <paravoid>	 (which is what disabling DNS linting does, basically :)
[17:16:34] <bblack>	 nevermind disabling linting, that's a strawman.  mocking is a fine solution with less trouble.
[17:16:42] <paravoid>	 ok :)
[17:17:06] <bblack>	 so there's a few scopes I view this problem at:
[17:17:42] <paravoid>	 and to be clear, I'm not suggesting this with any nefarious purpose (e.g. hating the change and putting it off for a later day which will never come)
[17:18:27] <paravoid>	 but time is running out and I don't feel great about rushing large architectural changes through, especially when not all the pieces aren't there yet AIUI (linting)
[17:18:36] <bblack>	 1) at the innermost scope, even today what we're doing with authdns kinda sucks.  We've re-implemented a narrow slice of puppet because of various raisins (fear of puppet failure, speed of deployment, speed/utility of CI), at the cost of, well, having separate implementations and scripts for templating, a different language, less power, and no single source of truth (hieradata and other puppet da
[17:18:42] <bblack>	 ta not being available for DNS)
[17:19:19] <paravoid>	 well and to be fair because all of this predates hiera and much of this predates puppet too :)
[17:19:26] <bblack>	 doesn't really matter, though
[17:20:01] <bblack>	 O dpm
[17:20:04] <bblack>	 grr keyboard
[17:20:17] <bblack>	 it's not really material how we got there, that's just how it is today
[17:20:30] <paravoid>	 sure, just saying, the reasons weren't as deliberate as you making them sound :)
[17:20:41] <paravoid>	 which is a good argument for re-thinking this
[17:21:02] <bblack>	 now we have a project that desires (for solid technical reasons, I think), to get deeper integration between authdns and our sources of truth and templating, and it's exposing all of the fault lines here.
[17:22:22] <bblack>	 ok yes, you can say my (1) statement above sounds like revisionist history.  it kinda does.  but I'm just looking at it from the blank slate of now, now how or why we got there.  It's just poor language to state it like it was a decision to be there in the present...
[17:22:37] <paravoid>	 yes, a project that we have commited to deliver in 11 days from now :)
[17:22:57] <paravoid>	 my approach to this would probably be different if today was Jan 20th, rather than Mar 20th
[17:23:14] <bblack>	 rewinding on the current project a bit, though, this is a rabbit-hole problem
[17:23:20] <paravoid>	 and this is just a small piece of the whole project
[17:25:24] <bblack>	 I think the decisions that lead us to this solution (about doing it via authdns, and the main authdns at that) are sound.  The unexpected rabbit-hole is that our infrastructure for authdns kinda sucks from a current POV, so there's no way to actually get there that doesn't have bad tradeoffs or debt or timeline-killers.
[17:26:07] <paravoid>	 I don't think it sucks fwiw
[17:26:27] <paravoid>	 but it seems kinda pointless to argue about that
[17:26:36] <bblack>	 I don't mean that to be judgemental, but it's just the present reality, for the reasons stated above.
[17:26:57] <bblack>	 it's a half-assed separate re-implementation of the tools we use for everything else, and disconnected from the single source of truth.
[17:27:15] <bblack>	 we may have gotten there for good reasons at the time, but it doesn't change the present reality
[17:27:41] <paravoid>	 I'd argue that the present reality is that we don't have a single source of truth :)
[17:28:21] <bblack>	 ok, we have a way of operating for literally everything else we do, which has a few kinds of truth in standard places with standard tools, right?
[17:28:27] <bblack>	 hieradata, etcd, manifests, etc
[17:28:54] <paravoid>	 not really?
[17:29:01] <paravoid>	 mediawiki-config is not in puppet either
[17:29:33] <paravoid>	 and I don't see how hieradata and etcd are a "single" source of truth
[17:29:34] <bblack>	 ok, let me scope "we" to operations-level work then?
[17:29:40] <paravoid>	 as they are two sources of truth :)
[17:29:52] <paravoid>	 even with that scope, that's still not accurate
[17:30:05] <paravoid>	 jaime's work involves depooling/repooling databases and that happens in mw-config
[17:30:05] <bblack>	 hieradata and etcd are different kinds of truth (configuration data vs volatile kv-state)
[17:30:45] <paravoid>	 the reality is that because of legacy we have traditionally had multiple places where stuff end up
[17:30:58] <paravoid>	 as we evolve, when it makes sense, we merge sources of data together
[17:31:21] <paravoid>	 (and other times we do or should the opposite, e.g. the growing pains of maintaining service config in puppet)
[17:31:43] <paravoid>	 we merged apache-config to puppet, we're talking about pushing parts of mw-config to etcd etc.
[17:32:04] <paravoid>	 and k8s will create yet another data source most likely, btw
[17:32:04] <bblack>	 do we have post-puppet ideas for service config?
[17:32:45] <bblack>	 my view of the world is we're trying to get to where there's one hieradata-like thing and one etcd-like thing we can use as standard solutions to these kinds of problems, and that hieradata and etcd are our current-best solutions
[17:32:51] <paravoid>	 in some cases they are in the respective deployment repos
[17:33:04] <paravoid>	 I think the longer-term idea is k8s but that's kinda handwavy at this point
[17:33:31] <bblack>	 (ok I thought you meant service config more in the sense of automating the crazy huge commits we make to puppet repo to deploy a new service, all over LVS and related classes, etc)
[17:33:56] <bblack>	 the thing we were scripting somewhat?
[17:34:03] <paravoid>	 yeah there's that too
[17:34:43] <bblack>	 so all of the above seems like you're trying to un-define things
[17:35:32] <paravoid>	 I'm only saying that DNS-in-puppet isn't something that we should do purely for standardization reasons (or, conversely, that the fact that this isn't the case now means that it sucks)
[17:35:41] <paravoid>	 if it makes sense for other reasons, that's fine, and let's do it
[17:35:53] <paravoid>	 but *by itself*, I don't see it as a problem personally
[17:36:11] <bblack>	 I disagree :)
[17:37:12] <bblack>	 there's a lot of handwaving above.  puppet (and hiera) is our standard for deploying configuration.  for storing config data, for templating.  it is a source of truth (e.g. the hieradata for LVS-related things reused in places to define services).
[17:37:31] <paravoid>	 and if we had chosen a different way to solve discovery (without DNS, w/ SkyDNS/CoreDNS, with a different set of gdnsd servers etc.) I don't think we'd be talking about it at all :)
[17:38:19] <paravoid>	 it's our standard for deploying /some/ configuration, yes
[17:38:23] <bblack>	 so we're going to change our solution to get over deficiencies in our ability to manage it?
[17:38:31] <paravoid>	 that's not what I said :)
[17:39:41] <bblack>	 the fact remains: modules/authdns == a collection of shellscripts and python scripts which re-creates a limited slice of the functionality of puppet+erb+hieradata, using a disconnected repo (no access to our core config hieradata), a separate linting and deployment system, etc.
[17:40:25] <bblack>	 the natural needs of this project expose that as a functional problem, but there's certainly other improvements that just haven't happened because of that disconnect but weren't as important.
[17:41:18] <paravoid>	 sure, ok
[17:41:43] <bblack>	 stepping out a layer of scope:
[17:42:27] <paravoid>	 an hour ago you said that a) this is a complex change that's not easy to test and b) that even if we went with the DNS-in-puppet work today, we'd still be without DNS linting, as there's a bunch of (yet unclear) work that needs to happen
[17:42:27] <paravoid>	 correct?
[17:42:59] <bblack>	 we have lots of tech debt problems (as an org, as a team).  it's hard to get tech debt as-a-project (as in: this quarter we're going to allocate N FTEs to just refactor to solve tech debt), and keep doing it routinely (do we allocate every Q3 to just tech debt maybe?)
[17:43:45] <bblack>	 one source of the tech debt problems in the long run is that when we run into obvious needs for refactoring that we could fix along the way, we don't want to expend the effort as part of the project that found the problem, basically.
[17:44:00] <paravoid>	 can we not solve every organizational or engineering problem now though? :)
[17:44:01] <bblack>	 I'm arguing that the right way to operate and keep debt low is to try to push changes like these through when you see them
[17:44:13] <_joe_>	 bblack: re post-puppet service config: we're already there. puppet just defines the ops-controlled variables as the other services urls and so on, as config variables that are integrated in the actual service config. k8s will do the same. I'll be very happy to talk about this tomorrow, now I'm off because I'm too tired
[17:44:14] <bblack>	 not to ignore the problem and hope we resolve the debt later
[17:44:40] <paravoid>	 what is it exactly that you recommend that we do now?
[17:46:42] <bblack>	 see if we can merge the change.  rely on post-merge linting in the interim, fix the linting asap (which is driven by the pain of revert from post-merge lint failings).
[17:46:52] <paravoid>	 how is post-merge linting not tech debt?
[17:47:08] <bblack>	 it's definitely tech debt
[17:47:35] <bblack>	 but it's tech debt as missing work on right solution
[17:47:54] <paravoid>	 so the proposal is to go for an architecturally improved but technically incomplete work (= not a 100% replacement)
[17:48:57] <paravoid>	 and leave more painful tech debt behind, in the hope that it'll annoy us enough to fix it soon? :)
[17:49:06] <bblack>	 not a 100% replacement for current workflow, maybe I would say? We're losing the ability to have a linter validate a DNS commit before it's merged.
[17:49:38] <bblack>	 we have tons of cases where no amount of linting or puppet-compiler will catch an issue until after it's deployed and broken, it's not like it's a major regression in our overall standards here.
[17:50:00] <bblack>	 we still have a linter in the workflow, but it happens after merge, but before any live effect on the real world.  when it fails, you have to revert.
[17:50:15] <paravoid>	 we have tons of cases where we replicate config into multiple repositories too
[17:50:33] <bblack>	 yes, we do :)
[17:51:15] <paravoid>	 I don't think rushing through a change in that it's a) complex b) difficult to test c) an incomplete replacement
[17:51:21] <bblack>	 or we can go stab mock data in config-test-geo for all the services we expect to deploy in the near future and move on.
[17:51:40] <paravoid>	 beats duplicating a few lines in a text file in an existing repository
[17:51:54] <bblack>	 rushing is not an artifact of my desired goal, it's an artifact of artificial pre-decided deadlines + variable/unknown rabbitholes in the work to get there.
[17:52:01] <paravoid>	 and have it all be working so that _joe_ can test an entire quarter worth of his effort this week
[17:52:33] <bblack>	 this goes back to the whole argument that for software, having pre-decided features + release dates is always bad.  for our ops-level software it's the same, but it's the model we operate under.
[17:52:33] <paravoid>	 yeah we can't be going down every rabbithole or we'll never deliver anything ever, I think we can agree to that :)
[17:53:11] <bblack>	 you deliver what's actually ready when the standard delivery times arrive.  just like calender-time and feature-independent software releases that most agree is the right model there.
[17:53:39] <paravoid>	 that's not the reality we're currently in, let's not get too philosophical
[17:54:05] <paravoid>	 if you're proposing to drop the goal and the DC switchover on Apr 19th just in order to fix this perfectly, then I'm definitely going to have to disagree with that
[17:54:09] <bblack>	 well the reality we're currently in is that we're going to deploy mock config likes in config-geo-test
[17:54:17] <bblack>	 s/likes/lines/
[17:54:31] <paravoid>	 that sounds like the least worse outcome to me so far, yes
[17:55:05] <bblack>	 I'm just trying to make it clear why I think that's wrong.  I think the wrongs run too deep to fix them all in the next 11 days, but that's neither here nor there.
[17:55:21] <bblack>	 it's still important to expose all N layers of pain-point here and how they map back to how we're operating on a grander scale
[17:55:53] <paravoid>	 we have discovery, _joe_ gets unblocked, we keep pre-merge linting in place, and it all can happen soon with very little risk
[17:56:07] <paravoid>	 that's a pretty good outcome _for this week_
[17:57:03] <paravoid>	 just think of it as lock avoidance :)
[17:57:33] <bblack>	 I think part of the disconnect is that I'm optimistic about my own ability to make the other solution work faster than you think, and less-riskily.
[17:57:43] <paravoid>	 otherwise it's all one big ball of "joe depends on bblack who depends on hashar who depends on labs adding capacity" and whatnot
[17:57:48] <bblack>	 but it is yet another rabbit hole
[17:58:10] <bblack>	 and yes, my ability to influence outcomes drops off sharply as we move into the CI part of the problem :)
[17:58:21] <paravoid>	 yup
[17:58:36] <paravoid>	 and fwiw I think you're being overly optimistic too, but I may just as well be wrong :)
[17:59:11] <paravoid>	 I definitely foresaw "discovery using the main authdns cluster" being more complex than what you initially stated though, I think :)
[17:59:24] <bblack>	 I'd be willing to bet a fair amount I could merge that POC commit today with puppet disabled, do a few followup minor fixes and validate that it all works correctly and be done later today.
[17:59:39] <bblack>	 sans linting before merges, now linting after merges
[18:00:37] <volans>	 bblack: what forbid us to have that and then continuing to work towards the right solution while we have a dns discovery connected to etcd that can be used and tested by the other services?
[18:00:45] <paravoid>	 how many of these rabbitholes are you currently in? :)
[18:00:58] <paravoid>	 I think you could say the same for your etcd/varnish patchset, but that's still not merged in either
[18:01:12] <paravoid>	 (unless something changed recently)
[18:01:50] <paravoid>	 only raising it because it's similar in nature
[18:02:07] <paravoid>	 (also switches to more standard config mgmt tools, also useful to the DC switchover)
[18:02:43] <paravoid>	 ok, I really need to go now, sorry :(
[18:03:20] <bblack>	 volans: nothing, but the pressure will be off to take the risks to push it aggressively, and there's a lot else to be done.
[18:03:47] <bblack>	 but we all perceive risks differently here.  worst case it's a revert, best case we move the architecture forward.
[18:08:29] <volans>	 but once we have somithing that works (also if statical for example) than we have in fact more time, and we can commit to fix it properly by the DC switchover date for example
[18:08:37] <volans>	 s/somithing/something/
[18:10:31] <volans>	 my only worry is that if we wait for the proper solution here and for any reason we get it too late, we'll not have the time to test/discover other rabbit holes/fix all the other moving parts
[18:12:43] <volans>	 the etcd go templates to set the up/down, the validation to ensure only one entry for -rw is up (not sure if was done), the services that uses the dns discovery (config, caching), the automation tooling that changes them
[18:15:13] <bblack>	 the etcd go templates and validate of 1x rw is up was already tested
[18:15:30] <volans>	 great (I was not sure)
[18:15:38] <bblack>	 I think one thing that has been pushed off a bit and is async, though, is trying to update our powerdns to do edns-client-subnet
[18:16:03] <bblack>	 (use jessie-backports one, update config a bit, test that it works for private addrs)
[18:16:34] <bblack>	 it's a niche concern, but it's still a real one
[18:16:42] <volans>	 sure
[18:17:25] <bblack>	 the fallout of not fixing the edns-client-subnet problem is that for active/active services, dns-discovery might sometimes give the wrong answer when both are up (e.g. restbase is up in both, but dns disc for it in eqiad returns the codfw hostname)
[18:18:00] <bblack>	 and quantifying "sometimes" is hard - it's however often a DNS query to the local recdns gets dropped/lost and glibc resolver falls back to asking the other DC's recdns for an answer.
[18:18:27] <volans>	 which are the active/active services as of now?
[18:19:19] <bblack>	 well none until further commits, but...
[18:19:46] <volans>	 I mean active/active capable/ready :)
[18:20:42] <bblack>	 I wouldn't know definitively tbh.  I thought _joe_ had a pending commit with all the services in it, but I don't see it now
[18:20:55] <bblack>	 e.g. https://gerrit.wikimedia.org/r/#/c/340992/2/hieradata/common/discovery.yaml
[18:21:10] <bblack>	 ^ I thought there was another commit pending that added several services there with the active_active flag set correctly
[18:21:37] <bblack>	 oh duh next one in the chain:
[18:21:38] <bblack>	 https://gerrit.wikimedia.org/r/#/c/340994/2/hieradata/common/discovery.yaml
[18:21:44] <volans>	 ok
[18:21:54] <bblack>	 basically "everything but mediawiki", I think
[18:41:27] <bblack>	 volans: also, failoid service?
[18:42:12] <volans>	 what's the question? apart that we have to do it :)
[18:43:06] <bblack>	 yeah that
[18:46:03] <volans>	 eheh, if you need some help for anything I can try to find some time this week between going forward with the automation tooling and it's testing
[18:47:53] <bblack>	 I wasn't planning to implement it, no
[18:48:17] <bblack>	 the nginx config is easy, but I don't even know how we provision some kind of virtual host for it in each dc
[18:49:08] <volans>	 me neither but I can ask alex and try to get them, 1 per DC or 2 per DC?
[18:49:33] <bblack>	 1 per DC I think, it's a stateless virtual host that emulates failure, surely it doesn't need real redundancy
[18:50:29] <volans>	 agree, then a basic jessie with nginx, nothing special right? 
[18:51:21] <bblack>	 yes, just a very basic nginx server which returns 503 over port 80
[18:51:34] <volans>	 80 only?
[18:51:41] <volans>	 any virtualhost I guess
[18:51:43] <bblack>	 although: rabbithole alert: ideally it should probably return 503 over 443 and have a cert that matches all internal service hostnames too :)
[18:52:05] <volans>	 yeah I was thinking about that
[18:52:34] <bblack>	 really the main point is to return a dysfunctional IP address
[18:52:57] <bblack>	 returning 503 seems "nicer", but you could also make the argument that it should just an IP address with no listener at all on common service ports.
[18:53:54] <bblack>	 (and let the client fail with connection refused)
[18:53:58] <volans>	 yeah
[18:57:05] <_joe_>	 yeah the https issue is not small
[18:57:16] <_joe_>	 we should indeed have failoid over https too
[18:57:55] <_joe_>	 you could get away with "all the public names we serve + *.{eqiad,codfw,...}.wmnet"
[18:58:09] <_joe_>	 but that's really, really annoying
[18:58:41] <_joe_>	 so I'd vote for "inexistent IP" instead
[18:58:42] <volans>	 probably easier to just not listen then
[18:59:11] <bblack>	 I think it's maybe worth it to have it be a real IP somewhere
[18:59:13] <volans>	 inexistent IP should have timeouts while connecting 
[18:59:20] <bblack>	 right, immediate fail vs timeout
[18:59:33] <volans>	 I prefer real IP without listening port
[19:00:09] <volans>	 it could also be a network device but probably safer to put this on a ganeti hosts as we said earlier
[19:02:29] <bblack>	 hah, and mock testing bites me in reverse
[19:03:34] <bblack>	 (can't deploy the mocked other service hostnames since the puppet hieradata part isn't actually deployed yet! gerrit authdns-lint passes, then it fails on authdns-update on the real hosts :P)
[19:07:20] <volans>	 bblack: by the way, god*g was asking me if it will be possible to test the dns discovery in deployment prep
[19:07:33] <volans>	 for the DNS caching
[19:07:40] <volans>	 client side
[19:07:46] <bblack>	 depends what you mean by that, I think
[19:08:05] <bblack>	 well actually maybe it doesn't
[19:08:16] <bblack>	 can deployment-prep even lookup prod .wmnet hostnames?
[19:08:47] <bblack>	 we don't have any deployment-prep equivalent to actual authdns like prod
[19:09:36] <volans>	 I guess just having the DNS answer 2 different IPs for a record in time to verify that rewrite.py in swift works correctly without restart
[19:10:07] <bblack>	 ?
[19:10:25] <bblack>	 I'm lost on some of that statement's meaning
[19:10:59] <bblack>	 back on the other stuff:
[19:11:01] <bblack>	 bblack@cp1065:~$ host appservers-ro.discovery.wmnet
[19:11:01] <bblack>	 appservers-ro.discovery.wmnet has address 10.2.2.1
[19:11:01] <bblack>	 bblack@cp1065:~$ host appservers-rw.discovery.wmnet
[19:11:01] <bblack>	 appservers-rw.discovery.wmnet has address 10.2.2.1
[19:11:07] <bblack>	 bblack@cp2001:~$ host appservers-ro.discovery.wmnet
[19:11:07] <bblack>	 appservers-ro.discovery.wmnet has address 10.2.1.1
[19:11:07] <bblack>	 bblack@cp2001:~$ host appservers-rw.discovery.wmnet
[19:11:07] <bblack>	 appservers-rw.discovery.wmnet has address 10.2.2.1
[19:11:48] <volans>	 last year switchover procedure required to merge a change in hiera for swift to change the rewrite_thumb_server
[19:11:51] <volans>	 nice!
[19:12:35] <bblack>	 the other hostnames matching _joe_'s commits are in: https://gerrit.wikimedia.org/r/#/c/343704/
[19:12:56] <bblack>	 but I had to revert it.  re-deploy it after the puppet side is deployed to the authdns servers.
[19:13:43] <volans>	 anyway I don't want to distract you right now ;)
[19:15:14] <bblack>	 volans: I think the short answer is no, there's nothing related to this in deployment-prep at all
[19:15:22] <bblack>	 (like so much of our infrastructure)
[19:17:26] <bblack>	 _joe_: should I just copy the bottom part of the file here https://gerrit.wikimedia.org/r/#/c/341000/1/hieradata/common/discovery.yaml and deploy that before all your stuff?
[19:17:50] <volans>	 ok, I figured, I'll find another way to have filippo test it, thanks
[19:17:53] <bblack>	 _joe_: the issue is it's chicken/egg - that hieradata has to exist before we can merge the hostname changes you want deployed before the rest of your commits
[19:22:05] * volans gotta go
[19:26:14] <bblack>	 _joe_: assuming the reasonable answer is yes, since currently only authdns consumes that hieradata AFAICS
[19:31:12] <bblack>	 of course, that had a - vs _ typo in it that I copied and pasted, and I gave up on the linter after waiting a few minutes :P
[19:33:43] <hashar>	 bblack: any leads as to how the dns confs are being generated?   Looks like  modules/authdns has a bunch of templates feed from hiera
[19:34:33] <bblack>	 hashar: it has to be refactored before the CI improvements are possible to model/experiment with anyways
[19:35:31] <bblack>	 hashar: (unmerged) https://gerrit.wikimedia.org/r/#/c/342887/17/modules/authdns/manifests/config.pp has authdns::config, which deploys all the config files to be checked, all underneath /srv/authdns/staging-puppet/ somewhere
[19:36:07] <bblack>	 if it's a container/chroot sort of thing, can leave it like it is.  otherwise we can paramaterize it so that the rspec invocation can change the destination root path
[19:37:52] <hashar>	 will try something with those :D
[19:53:34] <hashar>	 oh man
[19:53:43] <hashar>	 Could not find class ::passwords::puppet::database  :D
[19:56:28] <bblack>	 for authdns::config?
[19:56:48] <hashar>	 ah I am trying with a host
[19:56:52] <hashar>	 guess I should just use the role
[19:56:54] <hashar>	 ...
[19:57:13] <bblack>	 it depends what approach you're taking a I guess
[19:57:28] <bblack>	 even that commit doesn't have a correct role for this really
[19:57:39] <bblack>	 but authdns::config should be all you need
[19:58:03] <hashar>	 https://uproxx.files.wordpress.com/2013/05/i-have-no-idea-what-im-doing-03.jpg
[19:58:49] <bblack>	 role authdns::lint installs the gdnsd software package (which is what we use on CI hosts), and a few scripts as files
[19:59:12] <bblack>	 sorry, not "role authdns::lint", "class authdns::lint"
[19:59:26] <hashar>	 yeah we have it on the jessie hosts
[19:59:35] <hashar>	 so at least that part is solved
[19:59:53] <bblack>	 for the rest, you just want to apply "include authdns::config", which should be fairly self-contained
[19:59:56] <hashar>	 and authdns::config is the pending change https://gerrit.wikimedia.org/r/#/c/342887/17/modules/authdns/manifests/config.pp right ?
[20:00:10] <bblack>	 yeah, the whole pending change though, it does reference other classes from that change
[20:00:21] <hashar>	 oh my god
[20:00:42] <bblack>	 and other files, and templates!
[20:00:54] <bblack>	 it's also probably not valid yet
[20:00:57] <hashar>	  will first try to get the catalog to compile
[20:02:04] <bblack>	 I just realized an error in authdns::config stuff, but it won't stop compile, just makes invalid results I think
[20:02:17] <hashar>	 which really is mostly about copy pasting from the change I made after my summer vacation ( https://gerrit.wikimedia.org/r/#/c/308889/ ) as an exercise :]
[20:03:09] <hashar>	 grblblbmbmbmm
[20:03:15] <hashar>	 all our modules are so tightly coupled
[20:05:52] <bblack>	 PS18 has the fix for authdns::config - it needs two classparams
[20:06:14] <bblack>	 (it was inheriting them fine in the actual authdns server case, but to use it independently for CI would be different)
[20:06:42] <bblack>	         lvs_services       => hiera('lvs::configuration::lvs_services'),
[20:06:45] <bblack>	         discovery_services => hiera('discovery::services'),
[20:06:52] <bblack>	 ^ necessary arguments to authdns::config
[20:08:05] <bblack>	 ideally those would be passed down through some kind of lint class/role, but I donno what that looks like yet
[20:27:06] <hashar>	 DEBUG: 2017-03-20 21:26:52 +0100: Looking up lvs::configuration::lvs_services
[20:27:06] <hashar>	 /home/hashar/.gem/gems/hiera-1.3.4/lib/hiera/interpolate.rb:48:in `scope_interpolate': undefined method `[]' for nil:NilClass (NoMethodError)
[20:27:09] <hashar>	 moaar progress
[20:29:22] <hashar>	     should compile into a catalogue without dependency cycles
[20:38:08] <hashar>	 bblack:  https://gerrit.wikimedia.org/r/343747  is all I got for the moment
[20:38:21] <hashar>	 should compile the catalogue using production modules/manifests/hieradata
[20:38:24] <mutante>	 bblack: am i stepping on any toes by doing a "authdns-gen-zones -f /srv/authdns/git/templates /etc/gdnsd/zones && gdnsd checkconf && gdnsd reload-zones" on all servers?  reason: 2 new WP languages are addedin langs.tmpl
[20:38:44] <hashar>	 with an empty site.pp that is autogeneratd and a hiera.yaml file autogenerated based on the production one
[20:39:13] <hashar>	 once running the command listed in the commit message, that should give a repl interface that lets one explore the puppet catalogue that got generated
[20:43:08] <bblack>	 mutante: uh, no new toes or ongoing work, assuming that works in general
[20:44:07] <bblack>	 hashar: awesome
[20:44:20] <mutante>	 bblack: ok. yes, it does. it's part of  https://wikitech.wikimedia.org/wiki/Add_a_wiki  
[20:44:40] <hashar>	 puppet-3.7.5/lib/puppet/resource/catalog.rb  should have the methods/properties accessibles. From there one can find the file resources and their :content
[20:44:47] <hashar>	 and hopefulyl generate the config files
[20:45:01] <hashar>	 I am pretty sure joe solved that with the puppet compiler software
[20:45:13] <bblack>	 hashar: how do I do the stuff in your commitmsg "usage:"? log into a CI slave manually, do that as a regular user?
[20:45:27] <hashar>	 bblack: on your own machine
[20:45:33] <hashar>	 or in a vm if you dont trust rubygems
[20:45:35] <bblack>	 ok
[20:45:55] <bblack>	 well "my own machine" is probably lacking lots of deps or has wrong versions of puppet-like things
[20:45:57] <hashar>	 bundle is used to define dependencies for a project (they are listed in Gemfile)
[20:46:03] <bblack>	 ok
[20:46:06] <hashar>	 it install random unsecure crap from the internet on your machine
[20:46:20] <hashar>	 each dependency (== a gem) is installed somewhere under ~/.gems
[20:46:27] <hashar>	 when running "bundle exec  foo"
[20:46:47] <hashar>	 bundle changes ruby loading path order so the gem at ~/.gems are looked up first then run foo
[20:46:56] <hashar>	 so you end up running "foo" with the set of gems defined in Gemfile
[20:47:09] <hashar>	 CI does exactly that:  bundle install && bundle exec rake test
[20:47:33] <hashar>	 we tried to make Jenkins jobs as dumb as possible and delegate all the config/commands to developers 
[20:47:38] <hashar>	 including dependencies
[20:47:47] <bblack>	 template-only updates to add new languages: yet another pain that would be eased by moving DNS to puppet :)
[20:48:01] <hashar>	 well
[20:48:15] <hashar>	 I am surprised you havent moved gdnsd to json and a rest api!
[20:48:31] <bblack>	 it does output json, just doesn't consume it for config
[20:48:40] <bblack>	 and it does have a readonly rest api of sorts :)
[20:48:58] <hashar>	 but how I can edit the DNS from the wiki??? :D
[20:49:10] <bblack>	 :P
[20:49:46] <hashar>	 want a quick hangout so I demo it to you ?
[20:50:05] <bblack>	 root@eeden:~# curl  http://localhost:3506/json
[20:50:05] <bblack>	 {
[20:50:05] <bblack>	 	"uptime": 4159,
[20:50:05] <bblack>	 	"stats": {
[20:50:05] <bblack>	 		"noerror": 5276202,
[20:50:08] <bblack>	 		"refused": 128793,
[20:50:10] <bblack>	 		"nxdomain": 77774,
[20:50:20] <hashar>	 \o/
[20:50:31] <bblack>	 hashar: I can't atm, but I think when I next circle around to this (later tonight or tomorrow), I can figure things out from what you said above
[20:50:42] <hashar>	 hopefully
[20:51:04] <hashar>	 will spend a few more minutes to dig potentially interesting properties  and reply back on the gerrit change
[20:51:19] <hashar>	 then hopefully it would be fairly straightforward to generate the config files
[21:40:40] <hashar>	 bblack: https://gerrit.wikimedia.org/r/#/c/343747/ patchset 2 should dump a list of files and their content
[21:40:59] <hashar>	 one can then figure out how to rewrite the /srv/... path to some build directory
[21:41:13] <hashar>	 and then invoke the lint command against that sub tree of config files
[21:41:36] <hashar>	 theroically on jenkins one would rewrite ^/  -->  ENV['WORKSPACE']/build
[21:41:59] <hashar>	 then maybe do something like:   authdns-lint --dir "$WORKSPACE/build"
[21:42:09] <hashar>	 and maybe your trouble is all magically solved 
[21:42:59] <bblack>	 well we could also give authdns::config a path parameter and let linting supply a different one and avoid rewrites
[21:43:23] <bblack>	 I figure the less magic outside the manifests the better
[22:00:33] <hashar>	 bblack: you can try patchset 2 on your local machine
[22:00:49] <hashar>	 else CI rake job should show in the console the list of files and their content
[22:01:00] <hashar>	 but CI is a bit overloaded, so the test result will take a while to land back
[22:01:05] <hashar>	 I am off for some sleep *wave*