[09:44:03] 10Traffic, 10Operations: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (10ema) p:05Triage>03Normal [09:45:36] 10Domains, 10Traffic, 10Operations: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10ema) p:05Triage>03Normal [09:47:07] 10Traffic, 10Operations: certcentral: delay deployment of renewed certs to wait out skewed client clocks - https://phabricator.wikimedia.org/T204997 (10ema) p:05Triage>03Normal [09:47:34] 10Traffic, 10Operations: Puppetise OCSP stapling for all one-off HTTPS servers - https://phabricator.wikimedia.org/T204992 (10ema) p:05Triage>03Normal [09:47:56] 10Traffic, 10Operations: Consider adding Must-Staple header to enforce revocation checking - https://phabricator.wikimedia.org/T204987 (10ema) p:05Triage>03Normal [10:12:17] vgutierrez, hey [10:12:53] so while we wait for other people to review the main puppet commit, should I sort out the problems with the follow-up DNS commit? [10:13:23] ack [10:14:32] yes? [10:14:42] indeed [10:14:45] ok [10:21:31] Krenair: so.. the current dns-sync.py doesn't specify the user in the destination system [10:21:45] Krenair: so I guess it's assuming certcentral in both systems [10:22:28] dunno if using certcentral as a username in the dns host could backfire in some scenario, cause it's the user "required" by the certcentral debian package [10:23:33] if you wanted to run a certcentral instance on an authdns server itself yes there could be problems [10:23:53] you'll notice it's a User defined in modules/profile/manifests/authdns/certcentral_target.pp [10:24:11] indeed [10:24:15] 10:23:40 modules/role/manifests/authdns/server.pp:42 wmf-style: Found hiera call in class 'role::authdns::server' for 'certcentral::cert::certcentral_host' [10:24:31] what do I do about this? lint:ignore? [10:25:17] maybe it's worth adding the -l flag to the ssh CMD in dns-sync.py [10:25:21] hmmm [10:25:57] if there are certcentral hosts, they should be defined by the certcentral module, right? [10:26:17] don't understand the question [10:26:44] hosts get defined in manifests/site.pp which include roles which include profiles, one of which will include the certcentral module [10:26:49] certcentral::cert::certcentral_host --> that should defined by the certcentral puppet module and not by the authdns one [10:27:15] or the naming is confusing me [10:27:17] hieradata is defined in hiera... not in any manifests [10:28:08] it needs to be pulled into the certcentral::cert class so hosts can know where to get their certs [10:28:19] it needs to be pulled into the authdns stuff so we can let certcentral SSH into authdns [10:28:49] we could theoretically make a second hiera variable for this, but I don't know if it makes sense here [10:29:23] oh damn.. I misunderstood the linter error message [10:29:28] forgive me :) [10:30:54] it would be repeating ourselves [10:31:04] but it would get us past the style lint [10:45:32] the linter message is more about moving the hiera() call into a role/profile, which then passes it as a parameter for the authdns module stuff [10:45:53] instead of hiera in the authdns module itself [10:47:47] so it can still be one hiera variable, but put the hiera() call into modules/role/manifests/authdns/server.pp, to set params of "class authdns", and a new param there, etc... 
[10:48:55] this isn't going into the authdns module, just the authdns role [10:49:53] it doesn't like that I've got this: $certcentral_ferm = hiera('certcentral::cert::certcentral_host') [10:50:00] inside this block: class role::authdns::server [10:50:09] where's the change that failed lint? [10:51:33] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/459809/7/modules/role/manifests/authdns/server.pp [10:52:57] ehm... hiera calls should be only in profiles ;) [10:53:09] https://wikitech.wikimedia.org/wiki/Puppet_coding#Hiera [10:54:01] volans, that doesn't seem to apply to this manifest [10:54:09] it's doing this already: [10:54:14] hiera('lvs::configuration::lvs_services') [10:54:17] hiera('discovery::services') [10:54:26] so it should let me do hiera('certcentral::cert::certcentral_host') [10:54:49] Krenair: the linter complains for new code, not existing [10:54:59] otherwise it will fail always [10:55:05] lint:ignore it is then [10:55:09] as we don't have yet migrated it [10:55:14] *all of it [10:58:51] ok [10:59:05] why not moving the ferm rule to the profile where it should be? [10:59:07] well there's some other stuff too [10:59:09] I mean just the new one [10:59:39] the certcentral_dns_ssh seems to belong to profile::authdns::certcentral_target [10:59:40] the other profile would/should throw the same lint error anyways, it just doesn't because 2x conflicting things in present patch [10:59:44] not role::authdns::server [10:59:52] volans, it will presumably fail there too? [10:59:59] why? [11:00:01] it will be a parameter [11:00:05] with a hiera call [11:00:13] volans: it's about the mixture of namespaces [11:00:23] because I will be looking up certcentral::cert::certcentral_host ? [11:00:41] ooh hang on [11:00:42] we have two "unrelated" namespaces, and we need a common dataset (a list of hostnames) brought in from hiera for both [11:00:50] looks like I did already duplicate this for the security rule [11:01:02] oh wait, I'm still confused [11:01:14] hostnames might be doen with puppetdb query too, maybe is an option here, dunno, I didnt' checked the code thouroghly [11:01:17] it's two different pieces of data: the certcentral hosts' puppetization needs a list of authdns servers [11:01:25] the authdns profile needs a list of certcentral servers [11:01:30] so independent :D [11:01:38] (not puppetdb, it needs to be explicit) [11:01:48] why? [11:02:33] because that's crazy, you should always be able to say "I want this to operate on exactly servers A+B, not C+D that also happen to have the same profile/role/whatever in puppetdb, because those might be on the way out for decom, or freshly commed but not ready yet, or testing, etc.. [11:03:03] this is why we have an explicit list of authdns server hostnames in puppet too, vs just ask puppetdb where the role is applied [11:04:05] if the solution is hardcoding them I think we've lost the whole infrastructure as code paradigm [11:04:14] s/lost/failed/ [11:04:41] if there is a concept of pooled/depooled that can be taken into account [11:04:46] of course [11:04:53] in a puppet manifest? [11:05:10] (or do you mean in every integration script that operates on a set of hostnames?) [11:05:46] I mean in general, as best practice [11:05:55] the get_clusters stuff looks at what hosts have a specific class elsewhere. magical. 
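For context on the "just ask puppetdb where the role is applied" side of this argument: resolving a host list dynamically means querying PuppetDB for every node whose catalog contains a given class, roughly as in the sketch below. This is a hedged illustration, not WMF's get_clusters code; the PuppetDB URL is a placeholder and the class name is only an example. It also shows the objection being raised: any host with the role in its catalog appears in the result, whether or not it is actually ready to be operated on.

```python
# Rough sketch: ask PuppetDB (query API v4) for every node whose catalog
# contains a given class. Placeholder URL and example class name; not the
# actual get_clusters implementation.
import json
import requests

PUPPETDB = "https://puppetdb.example.org:8081"  # placeholder; PuppetDB's default TLS port

def hosts_with_class(class_title: str) -> list:
    """Return every node whose compiled catalog includes the given class."""
    query = ["and", ["=", "type", "Class"], ["=", "title", class_title]]
    resp = requests.get(
        f"{PUPPETDB}/pdb/query/v4/resources",
        params={"query": json.dumps(query)},
    )
    resp.raise_for_status()
    return sorted({r["certname"] for r in resp.json()})

# The "magical" view: every host that currently has the role compiled in.
# Contrast with an explicit hieradata list, which an operator edits
# deliberately, so "has the role" and "should be operated on" can differ
# (fresh installs, half-decommissioned hosts, machines pulled for testing).
print(hosts_with_class("Role::Authdns::Server"))
```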
[11:05:56] hardcode should be the last resort IMHO [11:05:58] it doesn't seem like a practical practice [11:06:56] but I don't want to be a blocker on this [11:07:02] so back to the hiera calls [11:07:14] it seems to me that the definition of the ferm rule certcentral_dns_ssh [11:07:14] well nevermind certcentral, the same argument applies to the existing list for authdns in general [11:07:33] doesn't belong to role::authdns::server, but belongs to profile::authdns::certcentral_target [11:07:37] (which should be hieradata anyways and is currently a list in a manfiest, but that would otherwise be easy to fix) [11:07:39] if I apply that profile I want that ferm rule [11:07:55] there's multiple layers to this onion, really. [11:08:02] indeed [11:08:18] let me rewind a little, because I think I get the whole context now [11:08:43] so: [11:08:56] also we're hardcoding IP in hiera, not even hostnames, maybe constants.pp however horrible is better here? [11:09:28] the IPs I assume you mean $ns_addrs? [11:09:47] profile::authdns::certcentral_target::certcentral_hosts: [11:09:50] https://gerrit.wikimedia.org/r/c/operations/puppet/+/459809/9/hieradata/common.yaml#594 [11:09:52] okay [11:10:26] yeah those shouldn't be there clearly, those should probably come from resolve calls [11:10:32] resolve calls? [11:10:34] you mean in ferm? [11:11:29] or should resolve in the deployed software, if at all possible [11:12:19] anyways, that's not the part I'm trying to peel back, we can fix the other obvious problems [11:12:37] this is the layers of the onion on the real wmf-style standards conflict: [11:13:30] 1) modules/role/manifests/authdns/data.pp holds some constant data variables that should arguably be hieradata. We can pretend we fixed that for now, but they'd be in some profile::authdns namespace anyways [11:13:54] 2) One of those constant data is $ns_addrs, which is a list of our live authdns servers to operate on [11:14:36] 3) It needs to live in the authdns namespace because it does get consumed there (for authdns-update to know the peer set of servers to operate on) [11:15:34] 4) Separately, there's profile::certcentral::server , which is also going to need that same list of authdns servers (to send commands to them) [11:15:52] where do we store this list once, in hieradata, for those two separate profile namespaces, without violating wmf-style? [11:16:14] common.yaml [11:16:21] * volans will not repeat this in public thoough ;) [11:16:26] lol [11:16:32] but that's a file, not a namespace [11:16:39] yeah it's kinda horrible [11:16:46] no namespace [11:16:49] authdns_hosts: [11:16:57] the point is, even if it's in common.yaml, it's going to be certcentral::foo or authdns::foo, and the hiera calls in the two separate profiles, one of them will be a wrong-namespace hiera call [11:17:00] or authdns_servers: [11:17:26] it shouldn't [11:17:28] IIRC [11:17:39] rewinding to the argument about puppetdb and etcd vs hieradata host lists: [11:18:09] there's in practice at least 3 layers of state on "is serverX a live part of groupY" for this sort of thing [11:19:30] does it have the right role/profile applied? is it considered "in production" (as opposed to testing, partly commed, partly decommed, temporarily out for investigation, etc)? and then a third "is it pooled?" (I guess, assuming etcd integration with everything that matters) [11:20:05] and I'd totally buy the argument that "is it pooled?" 
should include that middle one about testing/decom/etc, if "pooled" wasn't a binary flag [11:20:19] but it's a binary flag, that automated scripts flip, so it's the most-ephemeral of states [11:20:40] we already have the problem that we manually depool nodes and scripts flip them back pooled when doing something that's meant to be a temporary depool of a live thing [11:20:44] testing/decom/etc should be covered by netbox statuses now [11:21:04] how does that integrate? [11:21:08] not yet integrated into puppet ofc, but they will be [11:21:29] but that piece of logic should be mostly covered [11:21:33] so, we'd flip the "this isn't part of the live set" bits in the netbox UI? [11:21:34] going forward [11:22:06] we flip the is about to be installed/commissioned, is active, is about to be decommed [11:22:26] then once is active the pooled/depooled should cover the live part [11:22:27] well there's going to be othe reasons than comm/decomm to remove a server from the live set [11:22:53] and pooled depooled is out of scoper for netbox, is live state, not config [11:22:56] the real problem here, that I've railed on several times before, is that pooled is a bit and not a stack of reasons [11:23:58] not all services have the same pooled, we have yes/no/inactive already, we could have a reasons stack too I guess for other cases [11:23:58] this isn't a service-specific issue IMHO, it's a general design one [11:24:22] if the only other state we have to go on is puppetdb role application and in the future "is being commed/decommed", that leaves a bunch of reasons to flip the pooled bit, and actors don't know who last flipped the bit for what reason they should or shouldn't overide [11:24:42] agree [11:24:53] but if each actor only sets/clears its own reason for depooling in an array of reasons to be depooled [11:25:12] and consumers treat it as "empty set == pooled, non-empty == depooled", that solves that problem [11:25:57] yeah [11:30:26] we have some hackarounds in varnish-backend-restart and such for this problem, but it's still a race condition. [11:31:03] and we've still ended up in states where we know we depooled a server because of some hardware ticket, yet two weeks later we find it repooled for an unknown reason (because some script or person wasn't aware of why it was depooled, and cycled a depool->repool blindly) [11:32:16] there's a whole other dimension to the problem when you consider the "inactive" part, or other such states, too [11:32:50] e.g. the cp servers lists, right now we have one conception of them from etcd for lvs and varnish<->varnish definitions to route traffic. [11:33:14] and then we still have a separate list of them in hieradata for ipsec [11:33:30] (which will eventually go away because ipsec, but ipsec is just an example) [11:33:55] it's another case like "inactive" where you really need two distinct layers of final resulting "pooled" state [11:34:11] I'm not even sure how you tackle both problems together [11:34:36] nested pooled-like states of differing names? or possibly there's some that don't want nesting [11:35:36] but you could imagine a reason-set for "inactive?" 
and then a reason set for "depooled?", and perhaps an implicit dependency in some cases that inactive also implies depooled even if there's no other reason to be depooled [11:35:52] and that there could be multiple different labels like "inactive" and "depooled" that things could need [11:36:24] somewhere there's an elegant-enough solution that all use-cases fit together neatly [11:36:35] :) [11:36:43] (yet can still be represented in simple state data in something like etcd/zk/etc) [11:38:29] anyways, the TL;DR is that a singular state isn't enough to operate reliably. in the cp-servers case we've chosen to live with it and suffer mildly and complain [11:38:54] in the authdns server case, we've chosen to explicitly define the "live set" inside puppet and skip it [11:41:29] and that's ok for now [11:42:00] so [11:42:02] for certcentral [11:42:08] keep a list of hosts authorised to SSH to authdns [11:42:21] separate to the name of the server that nodes should get their certs from? [11:43:08] ok, so yeah rewinding to the problems at hand! [11:43:57] 1) We should do a quick refactor and move authdns's data.pp stuff to hieradata somewhere, at least $nameservers [11:44:17] 2) We should move it to common.pp as some un-namespaced thing [11:44:25] 3) certcentral profile should consume that [11:44:45] the other direction is relatively trivial, because certcentral does needs its own server hostname list [11:44:59] sorry, gotta go for lunch, I'll read backlog later [11:45:11] so we can define that as data of profile::authdns::certcentral* stuff, the hieradata list of certcentral hosts to authorize for ssh [11:51:01] (looking at 1 now, will take a few!) [11:52:35] of course, there is no profile::authdns::anything yet heh [11:55:10] it seems like we have to lift all of our role::authdns::* to copes of them at the profile level, I guess [11:55:20] enforced levels of hierarchy! :P [12:05:23] lol, I concede defeat to puppet, let's skip the major refactoring [12:05:29] role::authdns::server already contains: [12:05:32] lvs_services => hiera('lvs::configuration::lvs_services'), [12:05:35] discovery_services => hiera('discovery::services'), [12:05:49] which will be illegal anywhere I put them, and I didn't even design that part, and it looks hairy to fix :P [12:06:16] how about just the minor refactor of authdns_servers to common hieradata [12:14:52] the major refactoring being the quick refactor? [12:20:53] I gave up trying to fully fix the authdns classes/roles/profile/whatever [12:21:02] I'm just moving the nameservers list to common.yaml [12:32:17] ok so in light of that quick shove through jenkins' objections [12:32:29] :) [12:34:16] well let's step through the related issues one at a time here [12:34:19] https://gerrit.wikimedia.org/r/c/operations/puppet/+/459809/9/hieradata/common.yaml#594 [12:34:24] ^ why is this IPs and not hostnames? [12:35:22] no reason AFAIK [12:35:28] we can swap it with the DNS names [12:35:42] does that actually work? [12:35:46] it's used in ferm rules... [12:35:59] I have no idea, just asking [12:36:06] it should! :) [12:36:12] hmmm you can set hostnames in fw rules [12:36:15] actually we do it all the time [12:36:19] ok! 
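Stepping back to the depool-reasons idea from the pooled/depooled discussion above: the proposal is that "pooled" stops being a single bit that humans and scripts flip back and forth over each other, and becomes a set of reasons, where each actor only adds or removes its own entry and a host counts as pooled only when the set is empty. Below is a minimal Python sketch of that model; the class, the reason strings, the hostname, and the in-memory dict standing in for etcd/conftool are all illustrative assumptions, not how pooled state is actually stored.

```python
# Minimal sketch of the "depool reasons" idea discussed above: instead of one
# pooled=yes/no bit, each actor records its own reason for depooling, and a
# host is pooled only when the reason set is empty. The plain dict below is a
# stand-in for whatever store (etcd/conftool) would really hold this state.

class DepoolState:
    def __init__(self):
        # server -> set of depool reasons
        self._reasons = {}

    def depool(self, server: str, reason: str) -> None:
        """Record one actor's reason for keeping this server out of service."""
        self._reasons.setdefault(server, set()).add(reason)

    def repool(self, server: str, reason: str) -> None:
        """Clear only our own reason; other actors' reasons stay in place."""
        self._reasons.get(server, set()).discard(reason)

    def is_pooled(self, server: str) -> bool:
        """Pooled means nobody currently has a reason to keep it depooled."""
        return not self._reasons.get(server, set())


state = DepoolState()
state.depool("cp3042", "dcops:hw-fault-ticket")    # manual depool for a hardware issue
state.depool("cp3042", "varnish-backend-restart")  # a script depools for a restart
state.repool("cp3042", "varnish-backend-restart")  # the script repools only its own reason
assert not state.is_pooled("cp3042")               # the hardware depool still holds
```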
[12:36:35] they get resolved on rule addition time [12:36:35] ditto for this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/459809/9/hieradata/labs/deployment-prep/common.yaml#514 [12:36:55] right [12:36:59] although I think there's more issues there than just the IPs [12:37:08] IIRC Krenair needs to clean that yaml [12:37:22] to get rid of all the alexmonk.uk references [12:37:28] [12:37:28] sync_dns_servers: [12:37:28] - deployment-certcentral-testdns [12:37:29] validation_dns_servers: [12:37:49] so this bit: (1) I guess deployment-certcentral-testdns is a reference to some other data with a nameserver list? [12:38:07] (2) why is validation a separate list? I feel like this came from some discussion where I argued for it, but I don't recall why at the moment [12:38:31] hmm that's right you argued that :) [12:39:00] where does deployment-certcentral-testdns get defined at? [12:39:00] the sync dns servers are feeded to the sync script, and the validation dns servers are the ones checked before submitting the challenges as validated to LE [12:39:23] dunno, Krenair ^^ [12:39:41] sigh.. I feel really out of my place with the labs environment [12:39:43] vgutierrez: do you remember why I argued they should be separate? [12:40:12] because right now, I feel like that should be chalked up to temporary insanity on my part [12:40:35] bblack: hmmm actually cause ns[012].wm.o have different IPs than authdns[12]001.wm.o, right? [12:41:01] oh, hmmm... that bit [12:41:23] yeah that alexmonk.uk stuff is there because we can't actually delegate out of wmflabs.org subdomains through the UI right now, but I wanted to test gdnsd stuff [12:41:29] it still doesn't make sense though. I don't think any part of this should actually look at the ns[012] hostnames at all [12:41:48] need to do it properly using the API for labs dns [12:42:03] validating against ns[012] doesn't tell you much without some awareness of the loadbalancing/routing behind them. [12:42:09] so validation wise, before submitting the challenges to LE makes sense to double check that what they're seeing it's what we expect that they are going to see [12:42:19] well, yes [12:42:34] the thing is, doing one set of queries against ns[012] doesn't acutally tell you that with any certainty [12:43:09] because ns[012] are virtual ideas. there can be all kinds of routing or loadbalancing or anycast behind them, meaning your test and LE's views don't hit the same real servers anyways [12:43:56] I think we have to assume that if the Administrator configures "the list of DNS servers to push challenge data to is X", that that's also the same list to validate against to prove that the public view is gauranteed correct [12:44:29] the only other thing might be some differentiation about dns-listen-ip vs ssh-listen-ip for the same server, but I think we can avoid that in engineering. [12:45:07] are the IPs for SSHing and verifying the same? [12:45:21] that's what I was saying above [12:45:36] right [12:45:37] they are right now, they might not be someday in the future? 
but we can cross that bridge when we get there [12:46:00] so at the certcentral-the-software level, having them separate gives flexibility [12:46:08] but in our puppetization, they should be defined from the same data at present [12:46:09] yes [12:47:58] (which is not the list of ns[012] or their IPs, but the list that has hostnames like authdns1001, multatuli, etc) [12:48:10] (which I just moved to be available from hiera('authdns_servers')) [12:49:57] so vs PS9 [12:50:07] profile::authdns::certcentral_target::certcentral_hosts should be hostnames rather than IPs [12:50:39] which may need some semantic fixups to the variable name $ips in profile::authdns::certcentral_target [12:51:59] and then I guess "hieradata/labs/deployment-prep/common.yaml" is labs-specific anyways, so whatever is clean and works there [12:52:35] but for the prod version of that hieradata for " [12:52:37] profile::certcentral::server::config" [12:53:02] sync_dns_servers + validation_dns_servers should come from hiera('authdns_servers') [12:53:34] which... I'm not sure how that works if it's deep in a structure [12:54:01] I guess v1 of that is over in: https://gerrit.wikimedia.org/r/c/operations/puppet/+/441991/50/hieradata/role/common/certcentral/server.yaml [12:54:04] but it's currently empty [12:54:26] 18<bblack18> profile::authdns::certcentral_target::certcentral_hosts should be hostnames rather than IPs [12:54:49] that's fine for ferm but I don't know if security::access::config likes that [12:55:20] what's security::access::config? [12:56:01] ugh a whole other rabbithole [12:56:54] anyways [12:57:21] security::access::config is apparently to say "the certcentral user on the authdns hosts can only log in from machines [X]" [12:57:34] separately from the ssh key and the sudo rules as other layers of ACL [12:57:38] IIRC it's to do with labs PAM (certcentral user won't be in the deployment-prep LDAP group), can probably stick it in a realm branch [12:57:52] ah ok [12:58:46] also, I don't really care, but if you want you can remove the "--" from the gdnsdctl args now, that was fixed a while back [12:58:57] unless it confuses some other part of the stack I guess [13:00:02] so the part I'm still stuck on is hieradata profile::certcentral::server::config having a sub-structure that needs to reference other hieradata (authdns server list) [13:03:16] ignoring any practical issues, I'm guessing the root of that problem is defining /etc/certcentral/config.yaml as a "file" resource using ordered_yaml($config) from the above. [13:03:36] effectively shoving a hieradata deep structure straight into a file, without a way to parameterize the parts at the puppet level [13:04:16] (it may as well just be a literal files/config.yaml at that point rather than hieradata, if we have to override the entire structure to change one thing) [13:04:38] and that's kind of how I look at the solution so far, naively, not having tried it [13:05:26] move it to a templates/config.yaml, put static information in there directly if it doesn't need to be parameterized, and put a variable in the template for the dns server lists, which we can define from class params and ultimately from hieradata [13:09:17] it's a tricky issue [13:09:35] the list of certs should clearly be a hieradata chunk, I think? 
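The refactor being settled on above, in data-flow terms: keep the certificates block as one hieradata chunk, but build the dns-01 challenge section from the shared authdns server list instead of shoving one monolithic structure through ordered_yaml(). The Python below only models what the resulting /etc/certcentral/config.yaml roughly contains; the real change is a Puppet profile plus an ERB template, and apart from sync_dns_servers/validation_dns_servers (and the pinkunicorn test cert and authdns hostnames mentioned in the discussion) the field names are assumptions for illustration.

```python
# Rough model of the config assembly being discussed; illustrative field names.
import yaml  # PyYAML

# Shared authdns host list, i.e. what hiera('authdns_servers') now provides.
authdns_servers = [
    "authdns1001.wikimedia.org",
    "authdns2001.wikimedia.org",
    "multatuli.wikimedia.org",
]

# The certificates block can stay a single hieradata chunk.
certificates = {
    "pinkunicorn": {  # the throwaway test cert mentioned above; keys are assumptions
        "CN": "pinkunicorn.wikimedia.org",
        "SNI": ["pinkunicorn.wikimedia.org"],
    },
}

config = {
    "accounts": {},  # ACME account/regr details, also sourced from hieradata
    "certificates": certificates,
    "challenges": {
        "dns-01": {
            # Both lists come from the same authdns host list for now;
            # certcentral keeps them separate in case they ever diverge.
            "sync_dns_servers": authdns_servers,
            "validation_dns_servers": authdns_servers,
        },
    },
}

with open("/etc/certcentral/config.yaml", "w") as fh:
    yaml.safe_dump(config, fh, default_flow_style=False)
```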
[13:10:19] yeah [13:11:11] so re-clarifying context: https://gerrit.wikimedia.org/r/c/operations/puppet/+/441991/50/hieradata/role/common/certcentral/server.yaml [13:11:35] the scope of ::config there is too big to be a single static block that varies only on realm or whatever [13:11:48] yes [13:12:02] ok [13:12:08] move that structure to a literal yaml block in a template for the config.yaml file, and hieradata-ize the necessary bits at smaller scopes [13:12:20] so that the dns servers can be plugged in at least [13:12:50] but you can probably still leave the "certificates" sub-structure as a single hieradata variable with that whole block [13:12:57] you could put the config in a manifest, have it fetch the accounts and certificates from hiera, and have it build the challenges stuff with a call to get the nameservers out of hiera elsewhere [13:13:49] step 1 I think is replace file { .. ordered_yaml() } with template{} and the outer structure in the template file itself [13:15:20] and instead of having $config as a paramter to certcentral::server, have some smaller chunks there like $certificates, $challenges, etc [13:15:44] $certificates, the profile can pull as hieradata block and stuff into the parameter there [13:16:02] $challenges, the profile might have to build that in the manifest, so it can put hiera('authdns_servers') into it [13:16:26] etc [13:16:56] yeah [13:17:43] nice [13:24:39] Krenair: should I do it, or you? just to avoid overlapping between us [13:47:41] so we run nginx on MXes, for the sole purpose of validating Let's Encrypt [13:47:50] is certcentral going to be able to change that? [13:48:05] bblack: what type of request data did you find was reset on vcl switching? https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/misc-frontend.inc.vcl.erb#L5 [13:48:19] paravoid: it those certificates can be validated by dns-01, for sure [13:48:45] paravoid: and AFAIK, every certificate can be validated with dns-01 [13:48:50] (not the other way around though) [13:49:00] bblack: this test passes without the duplicate vcl code in misc-frontend, I wonder if we can get rid of the duplication? https://github.com/wikimedia/puppet/commit/55233020e9a2d8575f405c3ae36589abb6a57791 [13:49:21] paravoid: yes, we won't have to run nginx on MXes anymore [13:49:34] ok [13:49:38] bblack: BTW, does it make any sense to support http-01 in our production certcentral environment? [13:49:41] that's the status of certcentral deployment? [13:50:02] herron is planning on deploying new smarthosts, should we pilot certcentral for this? [13:50:18] (cf. https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/463143/ ) [13:50:23] paravoid: final bikeshedding about the puppetization patches is ongoing before we merge them to real prod puppet repo. certcentral[12]001 hosts exists and are awaiting that puppetization. [13:51:05] paravoid: once it's deployed, we're going to do a test cert that's not in real used (e.g. a pinkunicorn cert or whatever) to validate things are working and shake out bugs [13:51:31] paravoid: after that, yeah we're going to start targeting real cases, the users of the older letsencrypt stuff being the obvious targets [13:51:44] yup.. integrations tests against pebble are happy, https://github.com/letsencrypt/pebble but we need to test it against the Real Thing(TM) [13:53:06] vgutierrez, you can do it [13:53:32] vgutierrez: I don't think so. 
it's nice to have the option for certcentral-the-software, but for our puppetization I think we'll only have the singular dns-01 challenge setup [13:53:38] paravoid, is that the first volunteered domain? :) [13:53:47] bblack: cool [13:54:13] Krenair: so we should parametrize allowing incoming traffic in 80/TCP [13:54:17] btw there is a new challenge type which might be nice to have as an option for certcentral, I haven't looked into it yet [13:54:30] tls-alpn-01 [13:54:48] vgutierrez, ok [13:55:36] Krenair: it looks like it's outside the ACME specification, right? [13:56:09] I mean.. different RFC [13:56:11] https://tools.ietf.org/html/draft-ietf-acme-tls-alpn-05 [13:56:15] well yeah [13:56:31] I think it's come along after the original ACME RFC [13:57:01] although [13:57:03] which itself still isn't yet an RFC :) [13:57:05] https://datatracker.ietf.org/doc/draft-ietf-acme-acme/ [13:57:11] it seems the ACME RFC is still only proposed [13:57:21] seems strange [13:57:27] you'd think they'd go back and amend it in [13:57:44] and we are already with ACMEv2 /o\ [13:58:07] the IETF process is rather arduous and rigorous, it can cause a lot of disruption to add new things when you're already late in the process [13:58:13] ah [13:58:41] acmev2 is the one in the draft I linked [13:58:48] I don't think acmev1 went to IETF at all [14:00:16] (v2 isn't really v2 in the IETF sense. It's more like "v2 of letsencrypt's API, which happens to match the state of the finally-solidying first IETF document on ACME" [14:00:19] ) [14:00:41] anyway [14:00:51] vgutierrez, do you have everything you need to get the DNS commit ready? where are we with the main commit? [14:01:45] so... the refactoring of the config, can be done in any of them [14:01:57] moving it to a template I mean [14:02:01] config refactor should probably take place in the main commit [14:02:04] yup [14:02:18] and making http-01 support optional too [14:02:24] so I'm with that right now [14:02:52] ok [14:52:05] bblack: btw https://gerrit.wikimedia.org/r/c/operations/software/cumin/+/465612/2 for you :) thanks for letting me find the bug [14:52:52] haha [14:53:07] I completely don't understand any of the context outside of that diff [14:53:34] ahahaha, that's normal, no need to review but wanted to add you as you were involved :) [14:53:39] but it smells fishy for the long-term view. Aren't there other metachars that matter? [14:54:06] from puppetdb docs: "Every backslash character must be escaped with an additional backslash. " [14:54:20] Thus, a sequence like \d would be represented as \\d, and a literal backslash (represented in a regexp as a double-backslash \\) would be represented as a quadruple-backslash (\\\\). [14:54:29] don't ask me why :) [14:55:38] oh, I thought the problem was at a different scope [14:55:57] (e.g. bash -> cumin -> python-library -> puppetdb end, and getting the slashes for the regexes through all *that*) [14:56:31] yeah there are many places where this can fail, and actually was tested, but stupidly just with the \\ and not the \. 
or \d [14:57:37] ofc r'\\' != '\\', just to be clear in case it's not [15:02:45] :) [15:03:40] if I had the excess time, I would spend it trying to run the URI regex on a puppetdb fact through cumin just to see all the bugs fly :) [15:04:13] lol [15:04:19] "the URI regex" == I'm thinking of the one from the back of that old oreilly regex book, that was like a page or three long [15:04:54] even if you make it all the way to puppedb I'm sure it will crash :-P [15:05:22] also, another nice trick [15:05:38] the regex are passed directly the DB backend [15:05:43] so in this case postgresql [15:05:56] or was it an email address regex. Now I can't even find what book I was thinking of [15:06:39] https://www.regular-expressions.info/email.html maybe? [15:07:16] none of those are nearly big enough [15:09:16] the one I remember, was near the end of some paper book on regexes, and gave an example that was a printed page or more long for fully-validating all possible legal , either email addresses or URLs or URIs [15:09:23] which nobody would ever use, but it was amazing [15:12:45] 10netops, 10Operations, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144 (10Volans) 05Open>03Resolved [15:12:55] totally [15:14:32] I bet it was email, and I bet it was at least very similar to the "Perl/Ruby" variant shown in: [15:14:35] https://emailregex.com/ [15:15:43] oh my! yeah RFCtoRegex has always been a total mess [15:15:57] or similarly this: https://stackoverflow.com/questions/20771794/mailrfc822address-regex [15:16:47] RFC-to-Regex being a total mess is really just a symptom of a bigger problem, especially with older RFCs [15:16:48] the funny part is that no matter how precise you do it, an email is valid only if its MX tells you so [15:17:25] and it might also be temporarily invalid too (full) :-P [15:17:47] the bigger problem being that standards committees often didn't think from the perspective of whether multiple implementors could actually realistically and fully implement the spec without wanting to kill someone [15:18:02] rotfl [15:18:22] either it was a 1-implementation sham process of documenting someone's horrible buggy C code into a "standard" [15:18:36] or it was an academic-style standard where they just published a thought experiment without ever really trying it [15:19:58] (the oldest DNS RFCs have a lot of the former, and DNSSEC a lot of both) [15:21:06] I was going to say, this is sounding familiar [15:22:30] lol [15:23:58] DNS gripe of the day: DNS response "compression". they could've just tacked something onto the end that allowed e.g. basic huffman-compressing the whole response packet or something. [15:25:25] instead, it's basically: the echo'd question is uncompressible, all the response metadata is uncompressible. All response text or binary data is uncompressible. Some domainnames in the response are compressible, but only old ones from the original set of types (no compression for newer RR-type standards). [15:25:52] and in that one case where domainnames are "compressible", it's by embedding pointers inside the encoding of the names, to point at a duplicate copy of that data that occured earlier in the packet [15:26:47] so as you generate an output packet, every time you output a name, you need to keep track of it for future compression opportunity, and also scan over all the ones you stored so far and find compression matches that alter the output with pointers. 
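A toy version of the compression scheme bblack is describing: while serializing names into the response, remember the packet offset of every name suffix already written; a later name can stop at a matching label boundary and emit a two-byte pointer (top two bits set, 14-bit offset, hence the 16383 limit) back to the earlier copy. The sketch below is Python for illustration only, not gdnsd's implementation, and it skips details such as case-insensitive label matching.

```python
# Toy RFC 1035 name compression: remember the offset of every name suffix
# already emitted; later names stop at a matching label boundary and emit a
# 2-byte pointer (0xC0.. | 14-bit offset). Pointer targets must sit below 16K.

def write_name(packet: bytearray, name: str, offsets: dict) -> None:
    labels = name.rstrip(".").split(".")
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in offsets and offsets[suffix] < 0x4000:
            # Emit a compression pointer to the earlier occurrence of this suffix.
            ptr = 0xC000 | offsets[suffix]
            packet += ptr.to_bytes(2, "big")
            return
        # Remember where this suffix starts, so later names can point at it
        # (only usable as a target while the offset stays below the 16K mark).
        offsets.setdefault(suffix, len(packet))
        label = labels[i].encode("ascii")
        packet.append(len(label))
        packet += label
    packet.append(0)  # root label terminates an uncompressed name


pkt = bytearray(b"\x00" * 12)  # stand-in for the 12-byte DNS header
offsets = {}
write_name(pkt, "en.wikipedia.org", offsets)        # written in full
write_name(pkt, "text-lb.wikimedia.org", offsets)   # "wikimedia.org" becomes a pointer
write_name(pkt, "wikimedia.org", offsets)           # the whole name becomes a pointer
```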
[15:26:48] eww [15:27:05] which happen on label boundaries and have to match the remainder of the name [15:27:32] it's extremely annoying and inefficient to deal with. yet it's the only way to reduce output packet sizes, and output packet size with DNS is at a premium [15:28:38] it was well within reasonability at the time of the original standards being written, for someone to skip all that bullshit and just say "if bitX is set in the header, the whole thing is compressed with " [15:29:17] which would be the 80s equivalent of making an easy decision today like "gzip all this shit before sending" [15:30:43] --- [15:31:22] the other fun interaction there: DNS response outputs can be legally up to 64K in length (those over 512 going over TCP of course, or using edns0 to declare you can receive larger-than-512 over UDP). [15:31:46] but those inter-domainname compression pointers can only target names in the bottom 16K of the packet (because the maximum legal offset value is 16383) [15:32:09] lol [15:32:19] you can have names in the 16K-64K region of the output *use* compression pointers to data below 16K, but they can't contribute new data to be compressed against. [15:32:50] this is all loads of fun for implementors :P [15:33:19] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, and 2 others: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Ottomata) [15:33:47] for gdnsd-3.x, I decided to just hardcode the fact that response sizes over 16K are ludicrous. the output code assumes all outputs are smaller than 16K and ignore that problem. the zonefile parsing code validates (painstakingly) that you haven't created zonefile data that would cause the generation of >16K-sized output response. [15:36:18] does it take into account the dynamic TXT records bblack? [15:36:26] yes [15:36:28] * Krenair hides [15:36:44] huh [15:36:52] and mixing them with static data (if you configured a static challenge response in the zonefile, and then also added dynamic ones at runtime) [15:37:09] so theoretically the acme-dns-01 command can fail if it would result in too much data? [15:37:16] yes [15:37:36] it would take a crazy amount of separate challenges for the same name for that to happen though. they encode to like 55 bytes a peice :P [15:38:03] but if you manage to do so, it will just output all it can up to the 16K mark and not output the rest [15:39:55] manage to do -> dynamically create ~300+ separate challenge responses for a single challenge domainname for different certs/sans, in the 10 minute default TTL window of challenges. [15:41:17] (and probably before that breaks down, something else will. does letsencrypt actually iterate 300 challenge TXTs looking for the one it wants, or does it give up sooner? seems edgy) [15:42:56] realistically, you wouldn't put the same domainname (for challenge purposes) in a SAN list more than twice (twice being the case for *.example.com + example.com in the same san list) [15:43:21] so it would be over a hundred unique certs, all referencing the same root-domain+wildcard, in that 10 minute window [15:43:40] some other ratelimiter would kick in I think :P [15:44:27] (and if everything else worked... the challenges would eventually work if they were spaced out in time. 
as old ones expired, newer ones that were being left out would become visible with the 16K outputs) [15:46:14] that reminds me, push a new gdnsd-beta release through to a new package in stretch-wikimedia today [15:46:38] (no useful updates for us, but there's some freebsd bugfixes in master, docs updates, and I need to fix the debianization a bit on the postinst script anyways) [15:46:50] hopefully, the last beta release [15:48:52] maybe I should look at openvz first though, apparently it has users stuck on old 2.6 kernels, I don't know how that fares [15:53:51] eh if that proves it needs fixes and worth fixing, can always cut another release [16:25:26] bblack: what's the upper bound of gdnsd floating point version precision? :-P [16:25:40] Krenair: we need to keep the deployment-prep/common.yaml changes? [16:25:49] Krenair: in that case, you'll need to ammend the regr data [16:25:55] 10Domains, 10Traffic, 10Operations: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) Thank you for the detailed explanation. I will get back to Legal and MarkMonitor about it. [16:25:58] well [16:26:02] maybe some of it [16:26:03] 10Domains, 10Traffic, 10Operations: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) a:03Dzahn [16:26:06] some of it needs to change [16:28:02] we can throw out the current cert in there [16:28:06] I need to make a dns-sync.py script for designate [16:28:24] I need to add the regr data for the accounts there [16:31:38] sigh.. your rebase messed with my commit update :_( [16:32:25] volans: I think for most purposes it's virtually-infinite, until something like autoconf fails :) [16:33:12] volans: but, there's an internal concept of version-numbering used for control-socket compatibility issues (where gdnsdctl<->gdnsd, or gdnsd<->gdnsd, see each other's version over the socket and decide how to handle backwards/forward-compat when necc) [16:33:33] volans: and that one is limited to integers in the range 0-255 for each of the major.minor.patch fields [16:34:24] volans: (which I've already exceeded in several recent beta builds for fun. the net result is it rolls over modulo 256 for controlsocket protocol versioning, so I've just had to be careful that the modulo output continues to move forwards and not backwards as I pick new arbitrary numbers) [16:34:41] the new .9942-beta == .214 [16:35:38] vgutierrez, oops, sorry :( [16:36:49] nah.. no problem.. but let me ammend this before you push anything else plz [16:39:11] vgutierrez, ack [16:44:00] 10 beta relases 2.99.x-beta so far, with x values: 5 6 7 8 9 42 1729 9161 9930 9942 [16:44:30] it seems like a healthy range, and I'm running out of things to go validate/check/test/re-read [16:45:57] I've done the deployment-prep labs/private commit [16:46:51] hmm jenkins is not fond of my ammend [16:47:20] 16:46:13 modules/certcentral/manifests/server.pp:51 WARNING top-scope variable being used without an explicit namespace (variable_scope) [16:48:00] nice typo [16:48:11] oh, that's what it is :D [16:48:55] yeah, it should state $account_details instead of $accounts_details [16:49:05] was sat there thinking WTF [16:50:51] Krenair: so.. 
the config template expects that you provide reg strings for every defined account [16:50:57] *regr [16:51:45] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441991/54/hieradata/labs/deployment-prep/common.yaml VS https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441991/54/hieradata/role/common/certcentral/server.yaml [16:53:06] looks like this is a result of our commits crossing [16:53:14] I did add regr stuff for beta [16:53:23] vgutierrez, are you amending? [16:56:20] not right now [16:56:50] ok [16:56:55] I was panicking cause I wasn't able to find my passport [16:56:57] (solved) [16:57:50] and I'm going to another continent tomorrow.. pretty important thing to have my passport with me [16:58:25] heh [16:58:26] yeah [16:58:34] where you heading to? [16:58:37] Morocco [16:58:53] that's been on my list for a long time, enjoy it! [16:59:00] nice [16:59:07] have fun [16:59:08] https://goo.gl/maps/ebLUgwnavCo [16:59:26] hopefully you'll have me around working next week from there [16:59:34] I spent a couple years in Tunisia as a child and have a fondness for the region. It's somewhere in my bucket list to go spend some time in Morocco as an adult. [16:59:39] (tomorrow is a bank holiday here, so i'll be offline) [17:00:23] lol and ack (versioning) [17:00:41] Krenair: regarding https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441991/55/hieradata/labs/deployment-prep/common.yaml [17:00:59] Krenair: take into account that if you don't define a default account, certcentral is going to use the first one... so the production one [17:01:34] yeah I know [17:01:45] ok [17:02:34] Krenair: so, we need to add a flag that enables/disables http-01 in manifests/server.pp [17:03:13] alright, well [17:03:26] conditionally add ferm service [17:03:30] conditionally add nginx config [17:03:32] yup [17:03:34] anything else? [17:03:39] nothing else AFAIK [17:04:17] vgutierrez, want me to do the http-01 disabling? [17:04:32] I can do it as well, but only one of us please :) [17:05:57] it's a bit late for you isn't it vgutierrez ? [17:07:35] kinda :) [17:09:31] I'll do it [17:09:37] talk to you on monday [17:09:39] thx [17:12:31] 10netops, 10Operations: Configure v6 OOB for ulsfo - https://phabricator.wikimedia.org/T206778 (10ayounsi) p:05Triage>03Low [17:14:35] 10netops, 10Operations: Configure v6 OOB for ulsfo - https://phabricator.wikimedia.org/T206778 (10RobH) Your case # 00532252: Wikimedia Foundation, Inc._Existing Customer_San Francisco has been updated with the following: "IPv6 Network Information: Network: 2607:fb58:9000:7::/64 Gateway: 2607:fb58:900... [17:45:59] forgot to mention, please use puppet types for new class parameters [18:25:21] hey guys, the Unified GlobalSign cert expires on November 22nd. it has been noticed because the HTTPS check for planet started warning for under 45 days i think [18:25:29] i see a ticket about renewing them.. but 2017. which is: [18:25:34] https://phabricator.wikimedia.org/T178173 [18:25:49] on that ticket Krenair asked if it should be closed [18:26:02] i can make a new one for 2018 if you like? [18:26:30] (SSL WARNING - Certificate *.wikipedia.org valid until 2018-11-22 07:59:59) [18:28:06] also reading https://phabricator.wikimedia.org/T196248 now [18:47:34] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) 05Open>03Resolved [20:12:17] mutante: yes! 
thanks for the heads up. I knew this point was coming up "soon", but I needed the reminder and hadn't looked :) [20:12:52] I'll make a ticket, there's some details I need to fill in there [20:14:00] heh, and close the old one [20:14:14] when we have so many open tickets, it's easy to miss stale ones left done+open [20:15:52] bblack: sounds all great:) thank you [20:20:20] 10Traffic, 10Operations, 10Patch-For-Review: Renew unified certificates 2017 - https://phabricator.wikimedia.org/T178173 (10BBlack) 05Open>03Resolved a:03BBlack Yes, these certs are long-deployed :) [20:36:17] good that they haven't been awaiting for one year xD [20:42:18] 10Traffic, 10Operations: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (10BBlack) p:05Triage>03Normal