[08:26:09] Traffic, Operations, ops-eqiad: cp1080 - kernel / bnxt_en failures - https://phabricator.wikimedia.org/T203194 (ArielGlenn) p:Triage>Normal
[08:58:53] Traffic, Operations, Goal, Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (Vgutierrez) With the two users approach (certcentral / www-data) we just stop nginx from writing in /etc/certcentral. We should also co...
[09:09:37] Traffic, Operations: Content purges are unreliable - https://phabricator.wikimedia.org/T133821 (ArielGlenn)
[11:36:37] Traffic, Operations: prometheus-varnish-exporter@frontend.service: Unit entered failed state - invalid character 'C' - https://phabricator.wikimedia.org/T203191 (ema) p:Triage>Normal
[12:13:41] Traffic, Operations, Goal, Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (Krenair) That makes sense, so we're preferring one of these: * Set the group of the files to be www-data, chmod the files 640. * Put ww...
[13:31:00] vgutierrez, do you see what I mean wrt. different nameservers or backend http hosts?
[13:33:11] if we have to push the challenge to multiple hosts because LE's requests could go to one of many of them, I think we should check against all
[13:44:26] Krenair: yup, for http-01 there is no doubt
[13:45:10] but I'm worried about services with a lot of backend servers
[13:45:30] we should cache the result for intermediate states?
[13:45:34] well for dns-01 it's easier
[13:45:43] there are only a few nameservers
[13:45:55] yup.. I need an effective way of getting the NS records but yes
[13:45:59] 3 of them currently for the prod domains
[13:46:39] hmmm
[13:47:02] come to think of it, for http-01 we'd have to get access to the list of backend servers too
[13:47:06] that might prove difficult
[13:47:28] for some we could pull from confd (?), for some we wouldn't be able to...
[13:47:28] well.. it should be the authorized host list for that certificate, right?
[13:47:33] uh well
[13:47:36] ideally
[13:47:49] but you might want to add a new host so you authorise it first and then you actually pool it
[13:48:01] also there we should have the complete list, but maybe some of them are currently depooled
[13:48:04] oh but that doesn't matter
[13:48:11] as long as it proxies the challenge to us it should be ok
[13:48:17] indeed
[13:48:25] the http-01 challenge must be proxied to us
[13:48:34] alright so I suppose that's it
[13:48:43] and if a new host is authorized we shouldn't issue a new certificate
[13:49:10] just let the new host access the existing certificate
[13:51:23] yeah
[13:51:32] true
[13:52:24] for DNS we just use the NS records (hoping that our SSH-to-all-DNS-servers script worked), for HTTP we pull the authorised hosts list (which should be a superset of the hosts actually pooled, but even the non-pooled ones should be configured for it anyway)
[13:52:47] vgutierrez, do we want to do this in a separate commit?
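Editor's note: as a sketch of the "check against all nameservers" idea discussed above, the snippet below resolves the zone's public NS records and queries each nameserver directly for the dns-01 TXT value before issuance proceeds. It assumes the third-party dnspython library; the function name and the way the expected value is passed in are illustrative, not certcentral's actual code.

```python
# Hedged sketch: pre-issuance dns-01 check against every public nameserver.
# Assumes dnspython; all names here are illustrative only.
import dns.exception
import dns.resolver


def challenge_visible_on_all_ns(zone, record_name, expected_txt):
    """Return True only if every NS of `zone` serves `expected_txt` for `record_name`."""
    ns_names = [str(r.target) for r in dns.resolver.resolve(zone, 'NS')]
    for ns in ns_names:
        ns_ip = str(dns.resolver.resolve(ns, 'A')[0])
        direct = dns.resolver.Resolver(configure=False)
        direct.nameservers = [ns_ip]  # query this nameserver, not the local recursor
        try:
            answers = direct.resolve(record_name, 'TXT')
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer, dns.exception.Timeout):
            return False
        txt_values = {b''.join(rr.strings).decode() for rr in answers}
        if expected_txt not in txt_values:
            return False
    return True
```

As bblack notes later in the log, this public-NS view is unreliable behind anycast or load balancers, which is why an explicitly configured validation list ends up being preferable for the production case.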
[13:53:37] hmmm the pooled/depooled state is a tricky thing for http-01
[13:53:48] if we are not aware of the hosts that are currently depooled
[13:54:08] we could block the certificate issuance till the depooled hosts come back online
[13:54:37] so I don't know if we should keep the simplified approach for http-01 till we have confd integration
[13:54:46] oh right, because a depooled host could actually be shut down or something, instead of having a working webserver (just not getting any incoming traffic)
[13:54:55] indeed
[13:55:15] even confd integration wouldn't be enough as this thing needs to work for misc stuff
[13:55:27] would need the ability to provide a list of pooled hosts in config I guess
[13:55:36] I'll work right away on the dns-01 and I'll amend the current commit
[13:56:16] ok
[13:56:22] then we follow up for http-01?
[13:57:02] yup
[13:58:34] I'll make a little ticket
[14:02:18] Traffic, Operations: http-01 challenge checking on *all* working backend hosts - https://phabricator.wikimedia.org/T203396 (Krenair) p:Triage>Low
[14:02:28] Traffic, Operations: certcentral: http-01 challenge checking on *all* working backend hosts - https://phabricator.wikimedia.org/T203396 (Krenair)
[14:03:52] Traffic, Operations: certcentral: http-01 challenge checking on *all* working backend hosts - https://phabricator.wikimedia.org/T203396 (Krenair)
[14:03:59] btw, checking all backend servers means that certcentral must have network access to port :80 (or whatever the port is) on every backend server that needs a certificate managed by certcentral
[14:04:10] Traffic, Operations: certcentral: http-01 challenge checking on *all* pooled backend hosts - https://phabricator.wikimedia.org/T203396 (Krenair)
[14:04:20] vgutierrez, yeah
[14:04:28] we're also assuming that the server runs on port 80 in the backend, and that could be true or not
[14:04:47] if it doesn't they're going to have problems handling the challenge vgutierrez
[14:04:52] nope
[14:05:06] the load balancer in front of the server should allow the connections on port 80
[14:05:29] the backend server could be running on port 80 or not
[14:05:32] and then proxy to a different backend port?
[14:05:33] bah
[14:05:34] true
[14:05:52] I guess that we should allow setting that in the configuration for the http-01 challenge
[14:06:32] beginning to wonder if it's worth the effort for http-01
[14:06:47] not for the first release
[14:07:46] Traffic, Operations: certcentral: http-01 challenge checking on *all* pooled backend hosts - https://phabricator.wikimedia.org/T203396 (Krenair) btw, checking all backend servers means that certcentral must have network access to port :80 (or whatever the port is) on every backend server t...
[14:08:48] re: challenge hosts (http-01 backend lists or dns-01 server lists):
[14:09:38] for http-01, it seems simpler to just require that if a service is being provisioned with LE using http-01, it's on the puppetization of that service to route challenges back to certcentral, rather than try to push anything explicitly.
[14:10:00] that's the intention
[14:10:04] especially given there can be great variance in HTTP server software (e.g. apache, nginx, ATS, Varnish, and gerrit's java process, just to think of a few)
[14:10:09] that's how I've got it set up now
[14:10:36] it's not going to be solvable in a generic fashion
[14:11:15] and for dns-01, I guess pulling NS records for the zone could be some kind of default/fallback for others' reuse of the software?
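Editor's note: for the http-01 side of T203396, here is a minimal sketch of what "checking all backend servers" could look like, including the configurable backend port discussed at [14:05:52]. It uses the requests library; the function and parameter names are hypothetical, not certcentral's actual code.

```python
# Hedged sketch: verify that every backend host serves the http-01 challenge body.
# Uses the third-party requests library; names and defaults are illustrative only.
import requests


def challenge_reachable_on_all_backends(cert_hostname, token, expected_body,
                                        backend_hosts, port=80, timeout=5):
    """Return True only if each backend answers the ACME well-known URL correctly."""
    path = '/.well-known/acme-challenge/{}'.format(token)
    for backend in backend_hosts:
        url = 'http://{}:{}{}'.format(backend, port, path)
        try:
            # Connect to the backend's own address/port, but send the Host header
            # that Let's Encrypt will use, so name-based vhosts route it correctly.
            resp = requests.get(url, headers={'Host': cert_hostname},
                                timeout=timeout, allow_redirects=False)
        except requests.RequestException:
            return False
        if resp.status_code != 200 or resp.text.strip() != expected_body:
            return False
    return True
```

This also makes the network requirement raised at [14:03:59] explicit: the certcentral host needs direct access to that port on every backend being checked.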
but in general it's not a reliable method, and it won't work here either.
[14:11:40] because the public view of NS records doesn't always map to the actual set of DNS servers behind them.
[14:12:24] well.. but we should check the public view on the DNS side, right?
[14:12:25] (e.g. we're expecting the case where today we'd have 10 total authdns servers (2 per DC in 5x DCs), but just 2x NS records pointing at 2x public nameserver IPs, which are anycasted)
[14:12:49] aka ns[0-2].w.o answer with the proper TXT records
[14:13:07] and trust our DNS system to avoid inconsistencies between backend DNS servers
[14:13:08] oh I forgot our DNS IPs did anycast
[14:13:30] even in the absence of anycast, some deployers put multiple authdns servers behind authdns loadbalancers
[14:14:17] vgutierrez: re: validation, validating against the public NS stuff would be better than no validation at all, but it's an unreliable validation (the test queries could be loadbalanced/anycasted to 1/N servers that happens to verify correctly, but LE hits another that failed)
[14:14:41] ack
[14:14:49] well the current situation is validation against our local recursor
[14:15:01] which isn't nothing
[14:15:45] for our specific case in our infra, we probably want to have puppetization push a list of authserver hostnames, and that list is used to SSH out the challenge, and also used as validation targets (direct DNS queries against the IPs of the authservers' hostnames, which will be different than the public view of NS)
[14:15:54] we could make it check the NS records, so it checks each of the public IPs, or we could give it a list of all the real backend servers I guess
[14:16:42] for the generic case, it's not an awful default/fallback, in the absence of other config, to use the public NS records both as where-to-push and what-to-validate, I guess, so long as there's also some useful default way of pushing (nsupdate?)
[14:16:48] we already have that list of authdns servers in puppet
[14:16:53] right
[14:17:03] the one used to build our authdns-update script
[14:17:39] eventually that may change a little (there could be a future split where, for 1x actual authdns server, it's reached by SSH on one hostname/IP and reached for DNS protocol validation on another)
[14:18:04] that can be sorted out in puppetization when it happens, but maybe best to parameterize them separately and just happen to pass in the same list at present.
[14:18:58] the future scenario under full anycast is we'll have 2x dns servers per DC for 10 total, named like "dns100[12].wikimedia.org", which handle both authdns and recdns anycast.
[14:19:40] and since auth/rec are separate and both listen on 53, neither will be configured to listen on the ANY address, and at most one of them could be listening on the IP for the canonical hostname dns1001
[14:20:03] so validation might be against a separate per-server IP that's just for healthchecks (and this) like dns1001-authdns
[14:20:11] whereas ssh would still go to "dns1001"
[14:20:32] I take it there won't be any funny network restrictions preventing us from sending DNS queries out (to hosts other than our recursor) and getting responses back. I know some networks restrict this
[14:21:03] to our local authdns no
[14:21:23] ok good
[14:21:33] I'd assume the certcentral hosts will be deployed in the internal network though, and wouldn't be able to directly query other public nameservers, and that should be fine.
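Editor's note: a sketch of the "parameterize them separately" suggestion above, with one configured list of hostnames used to SSH the challenge out and a second list queried directly for validation, falling back to the zone's public NS records only when no validation list is given (the generic default/fallback from [14:16:42]). The config keys and function name are invented for illustration; dnspython is assumed.

```python
# Hedged sketch: keep the push (SSH) targets and the validation (DNS query)
# targets as two separate, independently configurable lists, as suggested above.
# Assumes dnspython; config keys are hypothetical, not certcentral's real schema.
import dns.resolver


def dns01_server_lists(config, zone):
    """Return (ssh_push_hosts, validation_hosts) for dns-01 handling of `zone`."""
    push_hosts = config.get('authdns_push_hosts', [])
    validation_hosts = config.get('authdns_validation_hosts')
    if not validation_hosts:
        # Generic default/fallback: the zone's public NS records. Unreliable
        # behind anycast or load balancers, but better than no validation at all.
        validation_hosts = [str(r.target) for r in dns.resolver.resolve(zone, 'NS')]
    return push_hosts, validation_hosts
```

Keeping the two lists separate also accommodates the future split described at [14:17:39], where SSH goes to one hostname (dns1001) while validation queries a per-server service IP (dns1001-authdns).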
[14:21:41] yeah
[14:21:51] I expect so
[14:21:53] and I guess use our generic proxies to reach LE HTTPS APIs
[14:22:24] so we should expect the list of DNS servers in a config file and fall back to resolv.conf in case that file isn't there? does that make sense to you bblack & Krenair?
[14:22:25] yep
[14:22:36] vgutierrez, yeah
[14:22:48] nice, I'll implement that approach then :)
[14:22:49] yes, but expect two separate lists in the config, for authdns pushing and authdns validation
[14:23:01] if you set challenge dns-01 for a cert, then you can provide a list of backend hosts for that
[14:23:11] otherwise it falls back to NS records
[14:25:00] that won't be needed (the dns server list per certificate), because we will be triggering the authdns pushing "always" against the same set of auth dns servers I guess
[14:25:27] always --> taking into account that authdns servers can be depooled, and the list will get updated to reflect that
[14:25:31] yeah I think that sounds right, until someone provides a counterexample I guess
[14:25:56] a lot of these design questions depend on how hard we want to push for this being a generic open source project geared for easy reuse by others (for some value of "easy") vs just meeting internal needs
[14:26:18] I tend to, by default, think of that in library-vs-daemon sort of terms
[14:26:55] genericize the library of interesting code so it can be reused widely, but the daemon itself is relatively thin on code and makes more assumptions for current/our use-cases, and is easy to replace for others.
[14:27:21] to do that for dns-01 we should abstract a little bit our code that spawns the authdns-update and let a 3rd party plug in their implementation to support AWS Route53 for instance
[14:27:33] it's doable, but IMHO out of scope for this quarter
[14:27:58] Traffic, Operations, Patch-For-Review: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (Urbanecm)
[14:28:32] right
[14:29:12] well for labs I expect to replace it with a script that talks to the Designate API
[14:29:22] when in doubt, don't over-abstract for future possibilities that don't exist today. wait for someone to ask for the feature that requires that abstraction, or it's easy to get lost in writing 10x more code than will ever be really executed/needed :)
[14:29:29] bit of a different API to Route 53 but doing more or less the same type of thing
[14:30:40] Krenair: yup... in that case, IMHO I'd recommend moving the responsibility of spawning the authdns script out of our CertCentral class and coming up with a base class and 2 implementations, one that lets you run a bash script and another one that implements the integration with the Designate API
[14:46:06] alternatively we just have the bash script run something against the designate API
[14:51:39] vgutierrez, do you still want to do this change for dns-01 in this commit?
[14:51:43] or shall we merge and follow-up?
[14:51:54] I'm conscious of the chain of commits building up
[14:54:43] hmm go ahead and I'll add it in a new one
[15:02:08] otherwise I'm going to die rebasing commits :)
[15:09:14] ok
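Editor's note: a sketch of the base-class-plus-two-implementations split recommended at [14:30:40], so the daemon depends only on an abstract "publish this TXT record" interface, with one implementation shelling out to a script (authdns-update style) and a placeholder where a Designate- or Route 53-backed implementation could plug in. The class and method names are hypothetical, not certcentral's actual API.

```python
# Hedged sketch of the abstraction discussed above; all names are hypothetical.
import abc
import subprocess


class DNS01ChallengePublisher(abc.ABC):
    """Interface the daemon calls to make a dns-01 TXT record visible."""

    @abc.abstractmethod
    def publish(self, record_name, txt_value):
        raise NotImplementedError


class ScriptPublisher(DNS01ChallengePublisher):
    """Runs an operator-supplied script, e.g. one wrapping authdns-update."""

    def __init__(self, script_path):
        self.script_path = script_path

    def publish(self, record_name, txt_value):
        subprocess.run([self.script_path, record_name, txt_value], check=True)


class DesignatePublisher(DNS01ChallengePublisher):
    """Placeholder: would create the record through the OpenStack Designate API."""

    def publish(self, record_name, txt_value):
        # Left unimplemented here; a Cloud VPS deployment could either implement
        # this class or simply point ScriptPublisher at a script that talks to
        # Designate, as suggested at [14:46:06].
        raise NotImplementedError
```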
[15:12:49] sigh.. some tests are running faster than pebble
[15:12:50] AssertionError: !=
[15:12:58] (and not always)
[15:13:42] I'll handle that in the tests to avoid these annoying errors
[15:14:54] I'll update the ticket I filed earlier for dns-01 too
[15:15:00] thx
[15:16:48] Traffic, Operations: certcentral: challenge checking on *all* pooled backend hosts - https://phabricator.wikimedia.org/T203396 (Krenair)
[15:17:49] Traffic, Operations: certcentral: challenge checking on *all* pooled backend hosts - https://phabricator.wikimedia.org/T203396 (Krenair) After some discussion in -traffic with @vgutierrez and @bblack I've expanded the scope of this ticket to include dns-01
[15:19:19] vgutierrez, I'll also take care of the packaging commit when I've spoken to legoktm
[15:21:32] ack
[15:21:43] I'll have some updates for the README.md there
[15:21:53] but those can be included in following commits as well
[15:22:14] i.e.: provide an example of a valid config file
[15:29:48] yeah good idea
[21:03:55] Traffic, Operations, TechCom-RFC, Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (Krinkle) @mobrovac I think as a first step we should: * Standardise the name of the header (for services that can/do set a head...
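Editor's note: regarding the tests racing pebble ([15:12:49]-[15:13:42]), one common way to keep a test from outrunning the ACME test server is to poll the assertion until a deadline instead of checking once. The minimal sketch below illustrates that idea; it is not necessarily the fix that was actually applied, and the names are illustrative.

```python
# Hedged sketch: poll a condition until it holds or a timeout expires,
# so tests don't fail just because pebble hasn't caught up yet.
import time


def wait_until(predicate, timeout=10.0, interval=0.2):
    """Return True as soon as predicate() is truthy, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return bool(predicate())

# Example use in a test (names illustrative):
#     self.assertTrue(wait_until(lambda: order.status == 'valid'))
```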