[06:57:56] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (elukey) I am back from vacations! I am still seeing some https traffic to lists.w.o and text-lb from stat1005 though, so I think that I...
[09:45:34] Wikimedia-Apache-configuration, Operations: Redirect 2030.wikimedia.org to the new movement strategy portal - https://phabricator.wikimedia.org/T202498 (Reedy)
[12:04:57] netops, Operations: set up NAT from 208.80.155.15 to frpig1001 - https://phabricator.wikimedia.org/T202520 (Jgreen) p:Triage>Normal
[13:06:30] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (elukey) For the moment I captured only these flows: ``` elukey@stat1005:~$ grep https ipv6_after_changes.log| while read line; do endp...
[13:53:16] netops, Operations, fundraising-tech-ops: adjust NAT for 208.80.152.231 (codfw bastion) to point to frbast2001 (10.195.0.67) - https://phabricator.wikimedia.org/T202536 (Jgreen) p:Triage>Normal
[14:08:50] netops, Operations, fundraising-tech-ops: adjust NAT for 208.80.152.231 (codfw bastion) to point to frbast2001 (10.195.0.67) - https://phabricator.wikimedia.org/T202536 (Jgreen)
[14:57:47] damn ocsp nonce generation :/
[14:58:15] netops, Operations, fundraising-tech-ops: adjust NAT for 208.80.152.231 (codfw bastion) to point to frbast2001 (10.195.0.67) - https://phabricator.wikimedia.org/T202536 (ayounsi)
[15:04:18] wireshark doesn't approve a sha1(uuid.uuid4()) as a valid ocsp nonce
[15:04:55] but it looks like 20 bytes should suffice :/
[15:11:26] what ocsp are you testing?
[15:11:50] I've found a pure python pyasn1 based ocsp client
[15:11:59] so I've adapted it to work with my x509 implementation
[15:12:08] and the OCSP request itself works as expected
[15:12:34] but it fails if I attempt to add a proper nonce
[15:12:42] well.. I get a valid OCSP response
[15:12:49] but wireshark complains about the OCSP nonce
[15:12:53] I meant at a different level, like, I didn't think the certcentral stuff would be doing ocsp fetches
[15:13:02] hmmm
[15:13:09] (or is this ocsp-checking as a client, on LE's https endpoint?)
[15:13:16] how is it supposed to detect if a certificate has been revoked or not?
[15:13:59] "a certificate" it has issued, or LE's cert for the API comms?
[15:14:31] a certificate it has issued
[15:15:51] rfc2560 is pretty vague regarding the nonces https://www.ietf.org/rfc/rfc2560.txt :/
[15:15:58] the idea being, I'm assuming... periodically check existing issued certs for ocsp revocation, to know to re-issue them earlier than it otherwise would based on expiry?
[15:16:16] indeed
[15:16:53] I guess I'm assuming revocations themselves would be manual, but must involve the same set of tooling and data somehow
[15:17:22] if you add some kind of "revoke-cert" CLI functionality that operates using the stored certs+keys+metadata, etc....
[15:17:33] then you'd implicitly know when it was revoked and not have to check
[15:18:00] (and revoke would I guess also immediately reissue with a new private key)
[15:18:28] can revocation be done by anyone other than us?
[15:18:47] you need the private key to revoke
[15:18:48] the CA itself could do it *technically* speaking
[15:18:56] the CA shouldn't be able to do it itself?
[15:19:17] well.. the CA could at least fake it at OCSP levle
[15:19:20] *level
[15:20:10] true, but they can't actually-revoke, just screw up OCSP-vs-reality?
[15:20:34] oh, apparently you can also revoke on LE using the account private key
[15:20:41] https://letsencrypt.org/docs/revoking/
[15:20:49] but what's reality here? a certificate is revoked if it's included in the CRL list or the OCSP response says it has been revoked
[15:21:11] I thought revocations had to be signed by the original PK, but maybe I'm wrong, maybe that's just part of Standards that PK ownership should be proven.
[15:21:47] either way, by LE's documented practices you need the account privkey or cert privkey to do the revoke
[15:22:06] so
[15:22:11] only if one of our secrets were to leak
[15:22:14] would that situation actually occur
[15:22:26] in which case we don't want anything automatically renewing
[15:22:28] which is exactly the situation in which you want to revoke
[15:22:39] but yeah, we don't want auto-revoke+reissue hiding that fact
[15:22:53] someone should notice the problem, fix the problem, and then revoke+reissue
[15:22:58] because they'd likely be able to get the new one exactly the same way
[15:23:02] yes
[15:23:10] ack
[15:23:26] so we should detect the certificate has been revoked and issue a warning :)
[15:23:27] I guess the only other situation is our CA goes rogue and revokes it :)
[15:23:28] so, my point is, it might be simpler just to have some CLI tool/argument/whatever for manual revoke+reissue and just assume no revocation otherwise.
[15:23:35] at which point we have far bigger problems
[15:24:10] I agree
[15:24:11] as far as rogue revokes go, the endpoint hosts that deploy/use the certs, will have OCSP fetchers and OCSP validation, etc... icinga will trip on a random revocation screwing those up
[15:24:24] yes
[15:24:45] do we do OCSP monitoring on all our certs or just the main wiki ones?
[15:25:50] we only do OCSP at all for the main wiki ones deployed on the cache terminators. We could/should for everything else, we just haven't hooked all the bits together.
[15:26:10] so including the misc-web stuff but not e.g. gerrit
[15:26:34] the puppetization of that doesn't terribly-much care where the cert came from, it's just fed input public key paths and dumps OCSP staple data to a directory for nginx to consume (but I've never tried to integrate apache)
[15:27:06] gerrit would be even more problematic. We could run it anyways just for the validation, but I don't know if gerrit's TLS termination has a way to serve up the live staples
[15:27:54] they're really two separate issues I guess: self-checking/fetching your own certs' OCSP data (the existing puppetized scripts probably work fine for that for any cert, assuming provider does sane OCSP)
[15:28:27] and actually using the fetched OCSP, stapling it to TLS connections live (only puppetized for nginx, not all terminators may support external staple data like that, or stapling at all)
[15:30:07] puppet's modules/sslcert/manifests/ocsp/ has the existing stuff we use on the cache terminators
[15:31:05] and modules/tlsproxy has the nginx integration to serve them (requires our patched nginx, in the case of multi-cert for ECDSA+RSA)
[15:31:28] <%- @certs_nginx.each do |cert| -%>
[15:31:28] ssl_stapling_file /var/cache/ocsp/<%= cert %>.ocsp;
[15:32:10] but the current legacy stuff in puppet for LetsEncrypt doesn't try to use any of that if the cert deployed via nginx tlsproxy happens to be an LE cert
[15:42:58] alright so in summary
[15:43:15] 1) don't attempt to detect revocation inside certcentral
[15:43:56] 2) do provide a CLI tool to manually revoke and trigger certcentral reissuing
[15:44:27] *sigh* :)
[15:44:47] I don't like the fact that certcentral would miss that a certificate has been revoked
[15:44:47] 3) at some point in the future, close the gaps in our OCSP monitoring? but not in scope for this work
[15:44:57] bblack, have I got that right?
[15:44:59] something to think about, but lowest-priority, is having a flag (per configured cert) to ask for the ocsp must-staple attribute on the issued certificate. I think LE supports that, via params set in the req.
[15:45:35] we'll eventually want that, at least for cases where we know the termination can do it
[15:45:58] Krenair: I think so, modulo valentin's discomfort :)
[15:46:38] vgutierrez, certcentral would miss it but we have separate monitoring for it, and don't necessarily want certcentral to be taking action on that right now
[15:47:00] ack
[15:47:11] so I'm getting rid of CertificateStatus.REVOKED for the time being
[15:47:14] ok
[15:47:26] BTW
[15:47:45] right now we only check if certificates are going to expire soon on config load
[15:47:50] on the general topic of must-staple: the real purpose there is that if the endpoint gets compromised, the attacker can steal both the private key and the signed public cert. It's easier to use the existing signed public cert to impersonate us, vs going out and getting a new one issued to that private key (or another).
[15:48:18] but if the stolen, signed public key has the ocsp must-staple attribute, then the attacker has to staple ocsp for the cert to work, and thus as soon as we revoke it revokes for their hacks as well
[15:48:20] so only once when certcentral is spawned... I guess we want to do this every X hours, once a day or whatever
[15:48:53] vgutierrez, so we know what certificates we have issued already
[15:48:58] how about timers based on their expiries?
[15:49:10] (vs if the cert doesn't have must-staple, and browsers don't require stapling, and the attacker doesn't stupidly staple, they could keep impersonating beyond revocation in theory, for at least some clients)
[15:49:19] or is it simpler to just regularly check all?
[15:49:48] Krenair: right not the simplest approach is forcing certcentral to reload the config every 12 hours (or whenever we want)
[15:49:55] s/right not/right now/g
[15:50:15] that would trigger a check on all the configured certificates
[15:50:41] on that topic: checking on routine reload is probably fine to get it out the door.
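The "check every configured certificate on each reload" idea discussed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not certcentral's actual code: the `CertificateStatus` name is taken from the discussion, but its members here, the 30-day renewal window, and the `status_for` / `check_all` helpers are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Dict


class CertificateStatus(Enum):
    # Hypothetical members; REVOKED is deliberately absent, matching the
    # decision above not to detect revocation inside certcentral.
    VALID = "valid"
    NEEDS_RENEWAL = "needs_renewal"
    EXPIRED = "expired"


# Assumed renewal window; the log does not state a value.
RENEWAL_WINDOW = timedelta(days=30)


def status_for(not_after: datetime, now: datetime,
               window: timedelta = RENEWAL_WINDOW) -> CertificateStatus:
    """Classify one certificate from its notAfter timestamp."""
    if now >= not_after:
        return CertificateStatus.EXPIRED
    if not_after - now <= window:
        return CertificateStatus.NEEDS_RENEWAL
    return CertificateStatus.VALID


def check_all(expiries: Dict[str, datetime],
              now: datetime) -> Dict[str, CertificateStatus]:
    """Run the expiry check over every configured certificate,
    as a routine config reload would."""
    return {name: status_for(not_after, now)
            for name, not_after in expiries.items()}


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(check_all({"example.org": now + timedelta(days=10)}, now))
```

Calling `check_all` from whatever handles the reload would cover the "every 12 hours" approach; an expiry-driven timer would replace the periodic call rather than the classification itself.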
[15:50:48] could probably have puppet reload it each time :)
[15:51:14] IIRC I made it listen for SIGHUP and reload config
[15:51:14] in the long term, yeah, timers would be nice, because they can be a bit more efficient and reactive.
[15:51:28] Krenair: yep, that's still there
[15:51:35] e.g. if cert renewal is failing, it can slowly increase retry speed as expiry approaches
[15:55:21] if renewal fails we probably want big red warnings
[15:56:02] past a certain point anyway, it's possible for the LE API to go down for a bit and that'd be fine
[15:56:14] netops, Operations, fundraising-tech-ops: adjust NAT for 208.80.152.231 (codfw bastion) to point to frbast2001 (10.195.0.67) - https://phabricator.wikimedia.org/T202536 (ayounsi) Open>Resolved
[15:56:44] so.. now that OCSP is out of the equation
[15:56:59] I've in my TODO two big points
[15:57:04] actually 3
[15:57:08] 1. DNS-01
[15:57:11] 2. Logging
[15:57:15] 3. prometheus integration
[15:57:46] right now for http-01 I already have integration tests that show that certcentral works as expected
[15:58:51] I think for DNS-01 with gdnsd we'd have some mechanism that involves it SSHing out to the auth dns servers to run commands to add the challenge
[15:59:00] and for designate I guess just contact the designate API
[15:59:07] as discussed with bblack, dns-01 should be as simple as writing the challenges on disk and spawning a subprocess calling a script (to be provided by bblack)
[15:59:29] ok
[15:59:31] so no big issue for us
[16:00:19] right, at the end of the day, basically we need to iterate over the list of authdns server hostnames (provided by puppet), ssh to them all, and run "gdnsdctl acme-v2-challenge example.org q9348yr9weyf9qyew9qew8f98h"
[16:00:46] we should implement a check to know if the Challenge is already deployed (requests.get for http-01 and a TXT lookup for dns-01)
[16:01:05] I think I'll add multiple challenges per line too, just to make the whole process more-efficient
[16:01:20] right, at the end of the day, basically we need to iterate over the list of authdns server hostnames (provided by puppet), ssh to them all, and run "gdnsdctl acme-v2-challenge example.org q9348yr9weyf9qyew9qew8f98h example.com q309tuq039u4t0q34uralkewjd www.example.com 8432qerhaoi8yeiawh4 ...."
[16:01:26] ^ that
[16:01:36] regarding the logging... should we log to stdout and let systemd/journald handle everything?
[16:01:40] so when you have 23 SNIs for one cert, they can all be shoved out the door in one command
[16:01:52] (I think each SNI will have its own challenge like that, anyways)
[16:02:23] or taking into account some journald fiascos like pybal, also log to disk?
[16:03:24] I'd send it all to stderr and let systemd deal, at least for first stab it's simpler and supposed to work
[16:03:46] ack
[16:04:00] regarding prometheus, what we should send there?
[16:04:13] number of certificates being handled, certificates on each state...
[16:04:17] I have no idea
[16:04:20] netops, Operations: set up NAT from 208.80.155.15 to frpig1001 - https://phabricator.wikimedia.org/T202520 (ayounsi) Open>Resolved Done. ``` $ nc -zv 208.80.155.15 443 Connection to 208.80.155.15 443 port [tcp/https] succeeded! ```
[16:04:48] if you're reloading every 12 hours and reload resets exported stats that are happening at a very low rate, it may be hard to get any consistency in grafana anyways
[16:04:57] (e.g. logging a counter event on failed cert fetches)
[16:05:56] well... ideally we should have almost the 100% of the certificates in the VALID status, and from time to time, some of them switching to NEEDS_RENEWAL and back to VALID
[16:07:15] and when adding new certs they'd briefly be in one of the earlier states, and updating of existing certs...
[16:07:35] yup
[16:08:08] I'll give it a think to the timers approach
[16:08:20] maybe we could go with them since day #1
[16:09:21] BTW... for switching from http-01 to dns-01, shall I implement it as a config parameter?
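The batched dns-01 push described above (one command per authdns host carrying every name/token pair) might look something like this. It is a sketch only: the `gdnsdctl acme-v2-challenge` argument layout is copied from the example quoted in the discussion and is not checked against any released gdnsd, and the plain `ssh` transport, the host names, and both function names are assumptions made for illustration.

```python
import shlex
from typing import List, Sequence, Tuple

Challenge = Tuple[str, str]  # (DNS name, challenge token)


def build_challenge_command(challenges: Sequence[Challenge]) -> List[str]:
    """Flatten all (name, token) pairs into a single gdnsdctl invocation,
    following the argument order quoted in the discussion."""
    if not challenges:
        raise ValueError("at least one challenge is required")
    argv = ["gdnsdctl", "acme-v2-challenge"]
    for name, token in challenges:
        argv.extend([name, token])
    return argv


def build_ssh_commands(hosts: Sequence[str],
                       challenges: Sequence[Challenge]) -> List[List[str]]:
    """One ssh command per authdns host (hypothetical transport).
    The remote command is shell-quoted because ssh hands it to a shell."""
    remote = " ".join(shlex.quote(arg)
                      for arg in build_challenge_command(challenges))
    return [["ssh", host, remote] for host in hosts]


if __name__ == "__main__":
    # Hypothetical host names and tokens; nothing is executed here.
    pairs = [("example.org", "token-a"), ("www.example.org", "token-b")]
    for cmd in build_ssh_commands(["authdns1.example", "authdns2.example"], pairs):
        print(cmd)
```

Each returned list could be handed to `subprocess.run(cmd, check=True)`; building the argument vectors separately keeps the batching logic testable without touching any DNS server.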
[16:09:43] so we can switch from http-01 to dns-01 seamlessly?
[16:09:44] alternative being just detect when it's a wildcard cert
[16:09:58] but then everything has to implement the /.well-known/acme-challenge proxying
[16:09:59] we're already able to do that
[16:10:10] but AFAIK bblack wants to go dns-01 a 100%
[16:10:11] so yeah just make it a field on the cert in config
[16:10:24] ack
[16:11:03] (well, everything that needs to be covered by a cert without a wildcard anyway)
[16:11:43] the refactor branch needs some reviewing.. it's starting to be a PITA to handle the rebases on the 3-4 standing commits
[16:11:58] ok
[16:12:02] https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/451867/
[16:12:11] let's see if we can close this one by the end of the week
[16:12:21] I'd like volans to check it as well, but he's busy as hell :)
[16:12:22] those only became non-WIP like today didn't they?
[16:12:37] yup
[16:12:39] ok
[16:12:56] I'll see if I find the time, quite busy with the switchdc stuff so far, sorry
[16:13:13] np volans
[16:26:09] Traffic, Maps, Maps-Sprint, Operations, Reading-Infrastructure-Team-Backlog: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (Gehel) for reference, the mediawiki implementation of cache invalidation: https://github.com/wikimedia/mediawiki/blob/0ac1ee6...
[17:04:11] netops, Operations, decommission, ops-eqiad: unrack/decom pfw1-eqiad and pfw2-eqiad - https://phabricator.wikimedia.org/T183390 (ayounsi)
[17:21:15] Traffic, Operations, ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (RobH) Email to EQ: > Shaun, > > Our Equinix portal lists you as our account rep for SG3, so I'm hoping you can assist me in a recent issue we're having. > > We have a defecti...
[17:21:23] Traffic, Operations, ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (RobH) a:RobH
[17:52:41] HTTPS, Traffic, Beta-Cluster-Infrastructure: https://sv.wikipedia.beta.wmflabs.org/ has invalid certificate - https://phabricator.wikimedia.org/T202564 (matmarex)
[17:52:49] HTTPS, Traffic, Beta-Cluster-Infrastructure: https://sv.wikipedia.beta.wmflabs.org/ has invalid certificate - https://phabricator.wikimedia.org/T202564 (matmarex)
[17:53:04] HTTPS, Traffic, Beta-Cluster-Infrastructure, Operations: https://sv.wikipedia.beta.wmflabs.org/ has invalid certificate - https://phabricator.wikimedia.org/T202564 (matmarex) >>! In T191184#4523999, @Arlolra wrote: >> Host: sv.wikipedia.beta.wmflabs.org. is not in the cert > > ``` > ssh deployme...
[17:56:04] HTTPS, Traffic, Beta-Cluster-Infrastructure, Operations: https://sv.wikipedia.beta.wmflabs.org/ has invalid certificate - https://phabricator.wikimedia.org/T202564 (Krenair) Pretty much the same thing as {T199387}
[18:09:57] Traffic, Maps, Maps-Sprint, Operations, Reading-Infrastructure-Team-Backlog: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (Mholloway) a:Pnorman>Mholloway
[18:10:37] Traffic, Maps, Maps-Sprint, Operations, Reading-Infrastructure-Team-Backlog (Kanban): Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (Mholloway)
[18:16:33] Traffic, Operations, monitoring: False alarms on varnish-http-requests 70% GET drop in 30 min alert - https://phabricator.wikimedia.org/T201630 (fgiunchedi) If we can avoid false positives I believe the alert has value, also because AIUI a traffic drop might not necessarily result in visible errors o...
[18:42:07] Traffic, Varnish, Operations, Maps (Tilerator): Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (Mholloway) a:Mholloway
[18:42:24] Traffic, Varnish, Maps-Sprint, Operations, Maps (Tilerator): Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (Mholloway)
[18:42:42] Traffic, Varnish, Maps-Sprint, Operations, and 2 others: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (Mholloway)
[20:52:03] Traffic, Maps, Maps-Sprint, Operations, and 2 others: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (Mholloway) Deployed to beta cluster. Note that we won't be able to deploy the updated max-age to production until the production upgrade to Stretch (T1...
[21:00:17] Traffic, Maps, Maps-Sprint, Operations, Reading-Infrastructure-Team-Backlog (Kanban): Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (Mholloway)
[22:00:12] Traffic, Operations, monitoring, Patch-For-Review: False alarms on varnish-http-requests 70% GET drop in 30 min alert - https://phabricator.wikimedia.org/T201630 (ayounsi) a:ayounsi
[22:02:08] netops, Operations, Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (ayounsi) A chassis reboot cleared that specific issue.