[08:58:19] 10Traffic, 10Operations, 10Patch-For-Review: Add ex cp-misc_codfw to text and upload - https://phabricator.wikimedia.org/T208588 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ` ['cp2018.codfw.wmnet', 'cp2025.codfw.wmnet'] ` The log can be found in `/var/lo...
[09:07:17] 10Traffic, 10Operations: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 (10MoritzMuehlenhoff)
[09:10:34] 10Traffic, 10Operations, 10User-ArielGlenn: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 (10ArielGlenn)
[09:13:00] 10Traffic, 10Operations, 10Patch-For-Review: Add ex cp-misc_codfw to text and upload - https://phabricator.wikimedia.org/T208588 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2018.codfw.wmnet', 'cp2025.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2018.codfw.wmnet', 'cp2025.codfw.wmnet'] `
[09:51:45] 10Traffic, 10Operations, 10Patch-For-Review: Add ex cp-misc_codfw to text and upload - https://phabricator.wikimedia.org/T208588 (10ema) 05Open>03Resolved
[10:06:18] morning vgutierrez
[10:06:40] morning
[10:09:47] your release commits look okay, though I see you merged them all already :p
[10:21:38] Cherry-picking existing commits should be harmless :)
[11:39:29] so... initial tests in one node look promising
[11:39:42] I wasn't able to trigger the "valid status" error
[11:39:59] and checking CT logs, only 2 certificates have been issued on each test round
[11:41:28] so I'm going to add certcentral2001 to the equation
[11:56:00] cool
[11:57:49] indeed
[11:57:59] I'm going to need a new SAN besides pinkunicorn.wm.o
[11:58:17] cause it's already validated till Nov 18th
[11:58:51] so thanks to the latest optimizations certcentral is not triggering the challenge validation anymore
[11:59:04] also I want to test the scenario that bblack described yesterday
[11:59:14] pinkunicorn2.wm.o? :P
[11:59:23] asking for one cert for pinkunicorn.wm.o (already validated) + pinkunicorn2.wm.o (not validated)
[11:59:40] so if that newly created order comes with pinkunicorn.wm.o signaled as validated
[11:59:48] we can take advantage of that as well
[11:59:52] yeah
[12:00:10] would our code mishandle that currently or just put up more challenges than necessary?
[12:00:39] I suppose if LE doesn't provide challenge data for a domain then we really have nothing to use?
[12:00:54] just put up more challenges than necessary
[12:01:11] ok
[12:18:13] 10Certcentral, 10Patch-For-Review: Avoid using acme.client poll_and_finalize() method - https://phabricator.wikimedia.org/T208967 (10Vgutierrez) Test results against LE staging environment are really promising: `name=certcentral1001 Nov 09 12:05:31 certcentral1001 certcentral-backend[30803]: SIGHUP received No...
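The "just put up more challenges than necessary" behaviour discussed above could eventually be tightened by skipping any authorization that the new order already reports as valid. A minimal sketch of that idea using the acme Python library certcentral builds on; the helper name and structure are illustrative, not certcentral's actual code:

    from acme import challenges, messages

    def pending_dns_challenges(orderr):
        """Yield (fqdn, dns-01 challenge) pairs only for authorizations that
        Let's Encrypt has not already marked as valid, e.g. pinkunicorn.wm.o
        re-ordered inside its existing validity window."""
        for authzr in orderr.authorizations:
            if authzr.body.status == messages.STATUS_VALID:
                # already validated: no TXT record needs to be published
                continue
            for challb in authzr.body.challenges:
                if isinstance(challb.chall, challenges.DNS01):
                    yield authzr.body.identifier.value, challb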
[12:19:28] I'm going to switch manually to pinkunicorn2.wm.o
[12:19:43] hopefully we should see 1 DNS auth per node
[12:19:55] and the second cert on each node reusing the first DNS auth
[12:29:24] sigh
[12:29:32] almost /o\
[12:35:05] 10netops, 10Operations: Investiagate network issues in codfw that caused 503 errors - https://phabricator.wikimedia.org/T209145 (10ema)
[12:35:14] 10netops, 10Operations: Investiagate network issues in codfw that caused 503 errors - https://phabricator.wikimedia.org/T209145 (10ema) p:05Triage>03Normal
[12:37:29] 10Certcentral, 10Patch-For-Review: certcentral "wrongly" assumes that a new order always implies fulfilling new challenges - https://phabricator.wikimedia.org/T208948 (10Vgutierrez) I ran a test to issue new certificates for a non already challenged hostname: pinkunicorn2.wikimedia.org, these are the results:...
[12:38:55] 10Certcentral, 10Patch-For-Review: certcentral "wrongly" assumes that a new order always implies fulfilling new challenges - https://phabricator.wikimedia.org/T208948 (10Vgutierrez) A second attempt, 10 minutes shows that certcentral1001 is able to fetch the certificates this time: `Nov 09 12:37:42 certcentral...
[12:41:31] so....
[12:41:35] as you can see in https://phabricator.wikimedia.org/T208948#4734797
[12:41:56] at the first attempt, certcentral2001 was able to fetch the two certificates
[12:42:19] and shows the expected behaviour: it fulfills 1 challenge for the first cert, and the second one benefits from that
[12:42:33] certcentral1001 fails to get both certificates
[12:42:51] certcentral reports that Let's Encrypt has rejected the dns-01 challenges
[12:43:06] and fetching the authz manually shows the same
[12:43:50] the error provided by LE in the authz is "detail": "Incorrect TXT record \"b6_IMxhS361pfcNPJ_X3RlpXO75LMZ_5zC6IpJxXCrU\" found at _acme-challenge.pinkunicorn2.wikimedia.org"
[12:44:17] that token is the one used successfully by the certcentral2001 account
[12:45:34] a manual DNS query showed that the 3 challenges were there
[12:45:43] https://www.irccloud.com/pastebin/LX4LPTPk/
[12:47:58] and as we discussed in the past, boulder checks every TXT record available at the _acme-challenge name: https://github.com/letsencrypt/boulder/blob/master/va/va.go#L873-L879
[12:48:42] the "Incorrect TXT record" error message comes from https://github.com/letsencrypt/boulder/blob/master/va/va.go#L891
[12:49:33] so at that point we must assume that from LE's point of view, the _acme-challenge.pinkunicorn2.wikimedia.org TXT record only contained b6_IMxhS361pfcNPJ_X3RlpXO75LMZ_5zC6IpJxXCrU
[12:52:41] I'm wondering if this could be avoided by fixing T207461
[12:52:41] T207461: Validate DNS-01 challenges against every DNS server - https://phabricator.wikimedia.org/T207461
[12:53:53] funny enough
[12:54:04] this could be avoided by sharing the same ACME account between certcentral nodes
[12:54:31] 10netops, 10Operations: Investiagate network issues in codfw that caused 503 errors - https://phabricator.wikimedia.org/T209145 (10elukey)
[12:54:40] 10netops, 10Operations: Investiagate network issues in codfw that caused 503 errors - https://phabricator.wikimedia.org/T209145 (10elukey)
[12:55:44] bah
[12:57:15] dunno if T207461 is the culprit or LE DNS caching is messing with us
[12:57:32] would it be crazy to set a TTL of 0 for those TXT records?
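For debugging, querying every TXT value published at the _acme-challenge name reproduces the set boulder iterates over in va.go before failing with "Incorrect TXT record". A quick sketch assuming dnspython is available; the authdns IP in the usage comment is a placeholder for querying one of our auth servers directly:

    import dns.resolver  # dnspython

    def acme_txt_records(fqdn, nameserver=None):
        """Return every TXT value published at _acme-challenge.<fqdn>, i.e. the
        set boulder walks through before rejecting the dns-01 challenge."""
        resolver = dns.resolver.Resolver()
        if nameserver:
            resolver.nameservers = [nameserver]  # query an auth server directly
        answer = resolver.query('_acme-challenge.' + fqdn, 'TXT')
        return [b''.join(rdata.strings).decode() for rdata in answer]

    # e.g. acme_txt_records('pinkunicorn2.wikimedia.org', nameserver='<authdns IP>')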
[12:58:01] bblack: ^^
[12:59:14] 10netops, 10Operations: Investigate network issues in codfw that caused 503 errors - https://phabricator.wikimedia.org/T209145 (10Aklapper)
[13:16:13] interestingly enough.. https://letsencrypt.org/docs/integration-guide/ recommends 1 account and using dns-01
[13:18:43] 10Traffic, 10Operations, 10ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10faidon) Why is this still pending?
[13:35:57] hmm how could we debug this?
[13:44:07] from https://github.com/letsencrypt/boulder/issues/2516#issuecomment-274500957 it looks like they are querying the auth servers directly, that could be useful to debug this
[13:58:57] vgutierrez, I take it we can't easily inspect the packets moving between the LE network and our auth DNS servers?
[14:02:45] the auth DNS servers handle a large amount of queries.. that makes it harder
[14:03:10] yes
[14:03:36] I'm assuming therefore there is no logging of who makes what request when and what we respond with :)
[14:17:26] yeah but I think it's pretty easy to see what's going on here
[14:18:06] they have a cache on their end, they've already cached the set of TXT records for that hostname for our advertised 600s, and they trust their cache and thus don't re-query to see the larger set after the second account/server creates its records
[14:18:44] and realistically, 0s TTLs don't actually "work", there will be some minimal caching and it'd just be a race condition
[14:19:17] single-account would fix it, the question is whether going single account would screw up other parts of our grander scheme here
[14:23:44] according to https://github.com/letsencrypt/boulder/issues/1088#issuecomment-209518797
[14:24:00] they do cache the DNS queries for 300 secs
[14:24:59] heh so not even a minimal race, pretty much guaranteed :)
[14:27:34] so there's ratelimits, but I think we're probably ok there
[14:27:47] are there other things that significantly change by sharing the account?
[14:28:52] nope, the biggest concern is rate limiting
[14:29:32] besides that, as I mentioned before, they do recommend sharing accounts in https://letsencrypt.org/docs/integration-guide/
[14:34:04] BTW the limit of 300 pending authorizations refers to authorizations and not challenges
[14:34:32] oh good
[14:34:52] I was just picking through that in the back of my head, thinking about our 100-SAN certs we'll eventually do for all the non-canonical names
[14:34:52] from the https://letsencrypt.org/docs/rate-limits/ section "Clearing pending authorizations"
[14:35:07] and thinking "wow we'll hit 400 if we just do ecdsa+rsa on 2 servers"
[14:36:41] of course now that we can do wildcards, we don't necessarily have to do chunks of 100 SANs for that scheme either, we could break it down to something more manageable and real
[14:37:19] I guess we should come up with some ideal ratio that best stays under all applicable ratelimits, for that case
[14:37:45] and the key word is "pending", the authorization object includes the 3 challenges {dns-01, http-01, tls-alpn-01}
[14:38:36] e.g. if you figure we have ~700 domains in the non-canonical set, and many of them will want 2 SANs (wildcard and root domain), that's 1400 total SANs.
[14:39:36] you could split that up as e.g. ~14x 100-SAN certs, or 28x50, 35x40, 40x35, 50x28, 100x14, etc...
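The split shapes at the end are just the divisors of 1400 that fit under Let's Encrypt's 100-names-per-certificate cap. A small back-of-the-envelope helper; the 1400 and 100 figures are the example numbers from the discussion, not a settled plan:

    # 1400 total SANs = ~700 non-canonical domains x (wildcard + root domain)
    TOTAL_SANS = 1400
    MAX_SANS_PER_CERT = 100   # Let's Encrypt's names-per-certificate limit

    def split_options(total=TOTAL_SANS, max_per_cert=MAX_SANS_PER_CERT):
        """Yield (number_of_certs, sans_per_cert) shapes that cover the set."""
        for per_cert in range(1, max_per_cert + 1):
            if total % per_cert == 0:
                yield total // per_cert, per_cert

    # list(split_options()) includes (14, 100), (28, 50), (35, 40), (40, 35),
    # (50, 28) and (100, 14), i.e. the shapes mentioned above; choosing among
    # them is a question of which ratelimit (pending authz, certs/week) binds first.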
[14:40:05] there's probably a way to think about the relative looseness of different LE ratelimits that guides the correct shape for bulk certs like those
[14:43:25] I'm doing a quick test against pinkunicorn3 with both nodes sharing the same account
[14:43:40] let's see what the behaviour is with that setup
[14:59:16] as expected.. shared account means both nodes get the same dns-01 tokens
[15:00:07] right
[15:00:35] plus even if they somehow didn't, if they're in a race window of time, they're within the much longer time the first challenge has already succeeded and thus it doesn't even need the second one.
[15:01:03] so the 2x servers will set up redundant TXT outputs with the gdnsd servers, but that's fine.
[15:02:55] are we planning to do any direct alerting from/about certcentral's cert management (e.g. certX failed to validate/renew for some time, or certY is <7d from expiry so clearly renewal has been failing for a while, etc)....
[15:03:12] or just leave that up to the puppetized cert monitoring on the consuming endpoint hosts (which I think would work just as well)
[15:06:43] we should get those alerts for free from Let's Encrypt to noc@wm.o
[15:06:58] at least the expiry ones
[15:07:47] well and we'll also have puppetized monitoring on the end-hosts, we do that already for existing cases
[15:07:47] the other ones we'd have to provide somehow
[15:07:53] it's not hard
[15:08:28] I just didn't know what the current thinking was, on whether we should also monitor cert status things on CC hosts themselves for all managed certs (but my guesstimate is it would just be redundant with the other monitoring and pointless complexity)
[15:09:12] basically if it's a new cert issue and CC is failing, obviously whoever's doing the new cert deployment will notice (or something will functionally alert) if they can't get their new certs
[15:09:44] and for existing, CC will have some policy to renew well before expiry, and icinga is checking the actual TLS endpoint hosts for certs that are too-close to expiry, implying CC couldn't renew them when expected by policy
[15:10:56] it seems like it would be nice to have some kind of heads-up/warning if CC is failing a bunch of cert fetches (esp renewals), but then we don't want it to spam on transient issues that go away on retries and aren't a big deal, either.
[15:16:55] ARG!
[15:16:58] no fricking way
[15:17:14] Nov 09 15:13:35 certcentral1001 certcentral-backend[11095]: Location: https://acme-staging-v02.api.letsencrypt.org/acme/order/7090084/12884174
[15:17:22] Nov 09 15:13:34 certcentral2001 certcentral-backend[15024]: Location: https://acme-staging-v02.api.letsencrypt.org/acme/order/7090084/12884174
[15:17:58] the 2 hosts running at the same time send the same set of DNS names, they get the same fricking order
[15:18:41] bblack: would you hate me if I add a unique SAN per certcentral host?
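The endpoint-side monitoring mentioned above boils down to "alert when the served certificate is too close to expiry". A rough sketch of that check in Python, independent of whatever the puppetized icinga plugin actually does; the 7-day threshold is just the example figure from the discussion:

    import datetime
    import socket
    import ssl

    def days_until_expiry(host, port=443):
        """Connect with SNI, grab the served certificate, and return the number
        of days left until its notAfter timestamp."""
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()['notAfter']
        expiry = datetime.datetime.strptime(not_after, '%b %d %H:%M:%S %Y %Z')
        return (expiry - datetime.datetime.utcnow()).days

    # e.g. warn/alert if days_until_expiry('pinkunicorn.wikimedia.org') < 7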
/o\
[15:19:15] haha
[15:19:30] I don't think that's the right answer, we need to think a bit here
[15:23:23] delaying the non-active node would solve it also
[15:24:20] this is proving much trickier than I expected
[15:24:29] I think, what's going on here
[15:24:31] with the ACME protocol, LE trying to be clever, and LE's arbitrary limits
[15:24:41] between the DNS-level issue pointing at shared accounts, and shared accounts pointing at shared orders
[15:25:01] is that LE/ACME are, in practice, pushing us towards a more active/passive setup
[15:25:20] yep I'm afraid that's the case
[15:25:31] and really, we could restructure the deployment very slightly and do that, it's not the end of the world
[15:25:56] in my high-level and possibly faulty view, I see it like this:
[15:26:30] 1) I assume certcentral's total set of state (config, certs/keys/etc) all lives on disk in perhaps two directories (one for config input, one for managed cert data)
[15:27:26] 2) whichever one we're currently calling "passive" for the TLS-client side of this, we also call "passive" for the CC server side of it, which means it won't (at least not immediately) actually take any actions (but more on this later...)
[15:28:09] 3) and we set up a bidirectional rsync job between the two hosts' state dirs to keep them in loose sync, so that if the active one gets an axe to the head, it's easy to flip puppet active-ness and resume operations without reissuing the world.
[15:28:40] as for how the passive-side really operates, I see two basic options:
[15:28:57] so in puppet we just ensure the passive backend service is stopped?
[15:29:06] and set up rsync between them?
[15:29:24] a) either passive really means doing nothing at all. you could have the daemon just not running, but it might be simpler from a server config/monitoring perspective to just give it a passive config flag or whatever that tells it to run and sleep and do nothing.
[15:29:33] or:
[15:31:06] b) if you wanted to get fancy, you could have the passive (or maybe in this case "less-active") side simply have a difference in its timing/thresholds. e.g. active server renews by at worst 20 days to expiry of old cert, and the "passive" one doesn't even start trying to renew unless there's 19 or fewer days left. And for new issues, active tries immediately, and passive doesn't try until a new cert config has sat for several hours with no cert appearing from the active side on the FS.
[15:31:35] so that it more or less automatically takes over, even if the other daemon is just inexplicably "stuck" and no other monitoring quite points that out
[15:31:46] or something along those lines
[15:31:55] but I tend to think (b) isn't worth the complexity at this stage, and (a) would suffice!
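A minimal sketch of option (a): the backend runs on both hosts, but a puppet-driven passive flag makes the standby node load its config and then idle. The class and attribute names here are illustrative, not the real certcentral internals:

    import signal
    import time

    class CertCentralBackend:
        def __init__(self, config):
            self.config = config
            signal.signal(signal.SIGHUP, lambda *_: self.reload_config())

        def reload_config(self):
            pass  # re-read config.yaml / conf.d in the real daemon

        def certificate_management(self):
            pass  # issue/renew certificates via ACME in the real daemon

        def run(self):
            while True:
                if self.config.get('passive', False):
                    # passive host: keep the process (and its monitoring) alive,
                    # but take no ACME actions until a reload flips the flag
                    time.sleep(60)
                    continue
                self.certificate_management()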
[15:32:22] I'd suggest going for (a) now and getting fancy in the (near) future
[15:32:41] really it's hard to justify (b), the edge cases are tricky
[15:33:01] and it also kind of falls into a defensive-systems-programming anti-pattern as well
[15:33:19] on the bright side, this showed me a small bug I introduced yesterday by not surrounding the finalize_order with a try/except clause :)
[15:33:23] where you're designing a system to hide errors so well that you're going to be unaware of issues you should've been aware of, until it's way too late :)
[15:33:30] yup
[15:44:52] Krenair: I missed this yesterday: https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/472676/
[15:45:06] shame on me :(
[15:47:11] this kinda thing happens :)
[15:47:26] as bblack suggested, if we go towards a master/slave, active/passive or whatever you can label this without being politically incorrect nowadays...
[15:48:10] I think it makes sense to implement it as a config flag, to let certcentral run anyway on the passive node
[15:48:26] it would be easier regarding monitoring and so on
[15:49:23] so
[15:49:33] what would certcentral be doing while in passive mode?
[15:49:39] just sitting there running with an empty config?
[15:50:01] load the config and sleep in the main runloop, waiting for future restart/reload/whatever
[15:50:23] for now anyway. maybe at some point in The Future we might make passive-mode do something more interesting?
[15:50:46] the alternative is not giving CC a passive-mode like that, and handling it in puppetization
[15:50:57] where the non-active node stops the daemon and host icinga checks don't expect it to be running, etc
[15:51:11] I was assuming we'd do it in puppet
[15:51:31] if the passive node gets all the certificate data via rsync it can still be an active node for certcentral-api
[15:51:37] it will make for some annoying alerts on transitions, given the lag time to next puppet runs -> icinga check changes, etc
[15:51:41] yes
[15:52:14] except the rsync won't be realtime
[15:52:43] however often we run it, there will be that much lag time in certs showing up there, causing unnecessary delay windows of client failure until the cert appears, if one tried to use it actively for the cert api
[15:54:32] and of course rsync is going to consider the master node as the source and the slave as the destination, right?
[15:54:43] and not a two-way sync
[15:55:23] no, you can do bidirectional
[15:55:35] and have them both run it at offset times
[15:56:03] (but then you don't get deletes, hmmm)
[15:57:57] we don't want the passive server which doesn't have a backend service running to be pushing stuff to the active server which does have a backend service running
[15:58:20] I guess the "right" model in terms of monitoring, failure modes, not leaving junk around for lack of deletions, etc....
[15:58:45] would be to run the rsync command itself from the current active one, pushing data with --delete towards the passive one.
[15:59:09] and maybe an icinga check that goes critical if the data directory is empty, that runs on both.
[15:59:44] and make the rsync command non-fatal (if the passive server is unreachable/dead/failing, it's not a failure that causes spam on the active one)
[16:00:00] yeah that gets messy too
[16:00:07] how about:
[16:00:45] active has the puppetized rsync pushing to the passive one with --delete, and does alert on failure (which we can ack/downtime if we know there's a dead passive)
[16:02:26] and the script/whatever that drives the rsync also touches a .status file in the directory just before each push, just for the mtime.
[16:02:51] and the passive side has a puppetized icinga check that fails if .status doesn't exist or has an overly-stale mtime
[16:03:38] (which will catch the case that both hosts are up, but the rsync is persistently failing to do its job, leaving overly-stale data in place for the next failover)
[16:05:10] alerting on the master side of the rsync failing is tricky, but I guess it could/should be a cronjob anyways, and cron can just alert on bad exit value via email heh
[16:05:31] the icinga .status check on the passive side will give us the icinga side of things
[16:06:26] that or just store the rsync exit code in a file and let icinga check it
[16:06:41] hmmm yeah
[16:06:48] so the master node gets a set of icinga alerts and the slave a different one
[16:06:52] beats more cronspam emails for someone to get annoyed at
[16:06:58] indeed
[16:07:13] so somewhere in CC puppetization
[16:08:59] if $active { service { "cc": ensure => running, enable => true, ... } cron { "rsync_push_script": ... } icinga_check { "rsync_push_script_exitval" ... } } else { service { "cc": ensure => stopped, enable => false } icinga_check { .status mtime } }
[16:09:29] well, maybe service should be enable=>false on both sides, so that it never starts on boot, and only starts when puppet starts it
[16:09:59] otherwise if the active suddenly dies and we flip masters, then fix hardware and boot the old active back up, it might initially do bad things before it puppetizes over to passive mode
[16:10:18] hmmm that argument applies to the rsync push script too
[16:10:33] so maybe that should be an exec that happens in the agent run, rather than a cron.
[16:10:58] it'll still push data every ~30 mins, which is fine if all we're trying to prevent is reissuing ALL the certs when we flip masters. If we have to redo the last ~30 mins of activity it's ok.
[16:13:45] maybe this has been pointed out already but just to make sure, it seems rsync will also sync secret material? if so I guess rsync will need some encryption
[16:13:50] 10Traffic, 10Operations, 10Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10RobH)
[16:13:52] 10Traffic, 10Operations, 10Wikimedia-Incident: Add maint-announce@ to Equinix's recipient list for eqsin incidents - https://phabricator.wikimedia.org/T207140 (10RobH) 05Open>03stalled p:05High>03Low Vivian @ EQ Singapore fixed it, adding in maint announce to their alerts. We should get the next ale...
[16:13:59] 10Traffic, 10Beta-Cluster-Infrastructure, 10DNS, 10Operations, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10MoritzMuehlenhoff) >>! In T...
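The push script described above (touch .status for the mtime check, rsync with --delete over ssh, record the exit code for icinga) could look roughly like this. The paths, the peer hostname and the ssh key location are placeholders for whatever the puppetization ends up defining:

    #!/usr/bin/env python3
    import pathlib
    import subprocess

    DATA_DIR = pathlib.Path('/var/lib/certcentral')   # managed cert data only
    PEER = 'certcentral2001.wikimedia.org'            # current passive host
    EXIT_FILE = pathlib.Path('/run/certcentral-rsync-push.exit')

    def push():
        # touch .status first: the passive side alerts if this mtime goes stale
        (DATA_DIR / '.status').touch()
        result = subprocess.run([
            'rsync', '-a', '--delete',
            '-e', 'ssh -i /etc/certcentral/ssh/push_key',  # placeholder key path
            str(DATA_DIR) + '/',
            'certcentral@{}:{}/'.format(PEER, DATA_DIR),
        ])
        # leave the exit code where an icinga check (or puppet exec) can read it
        EXIT_FILE.write_text('{}\n'.format(result.returncode))
        return result.returncode

    if __name__ == '__main__':
        raise SystemExit(push())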
[16:14:04] godog: yeah I assume this is rsync+ssh :)
[16:14:26] (with some accounts to manage going across the nodes, like we do for CC->authdns executions over ssh)
[16:15:09] bblack: oh ok, I had rsync::quickdatacopy in mind, nevermind
[16:16:53] I hadn't even seen that :)
[16:27:09] 10Traffic, 10Beta-Cluster-Infrastructure, 10DNS, 10Operations, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10Krenair) >>! In T153468#473...
[16:29:40] 10Certcentral: switch certcentral servers from active/active to active/passive - https://phabricator.wikimedia.org/T209161 (10Vgutierrez)
[16:29:50] 10Certcentral: switch certcentral servers from active/active to active/passive - https://phabricator.wikimedia.org/T209161 (10Vgutierrez) p:05Triage>03High
[16:38:31] so.. for rsync+ssh we need to grant SSH access, aka generate an SSH key, add it to keyholder and so on, right?
[16:44:35] I assume so
[16:44:42] or taking into account that the only involved hosts are the certcentral ones
[16:44:49] maybe we could reuse the authdns_certcentral SSH key?
[16:47:08] that's probably okay
[16:47:16] the keys to that are held by the certcentral hosts
[16:47:26] yes
[16:47:33] would just mean putting the public parts on the certcentral hosts and granting the appropriate permissions
[16:49:19] just make sure that the certcentral user can write to /etc/certcentral
[16:49:38] yep
[16:49:43] it'll be able to already
[16:49:50] at this point I'm wondering if we should keep using /etc/certcentral or move the certificates and so on to /var/certcentral
[16:50:15] but that could be easily refactored later
[17:09:09] vgutierrez: I'd say it makes more philosophical sense to split the managed data from the input config (/etc/ for puppet-driven config input, /var/ for managed cert data outputs)
[17:11:05] yep
[17:12:36] so I guess /etc/certcentral should just have config.yaml, conf.d/, and accounts/ (with all its current contents, which seem to all come from puppet)
[17:13:23] and all the other bits there like new_certs/, live_certs/, csrs/, dns_challenges/, ... should all be in some /var/ dir
[17:13:37] probably you could bikeshed a bit on *nix/LSB standards about which subdir of var
[17:15:02] right.. I'll move that after the active/passive switch
[17:15:04] probably /var/lib/certcentral/ makes the most sense
[17:15:33] I'm guessing that will involve a code patch though, because probably current CC code has one root directory for all the things
[17:15:44] and then we only rsync the one in var
[17:19:26] hmmmm
[17:19:51] yeah.. but it won't be that hard to change
[19:32:02] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar), and 4 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10CDanis)
[20:06:55] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050 (10faidon) Can this task be resolved, given we have T178592 to track the bast4001 decom?
[21:01:30] 10Traffic, 10Beta-Cluster-Infrastructure, 10DNS, 10Operations, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10Krenair) >>! In T153468#469...
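A rough sketch of what the /etc vs /var/lib split discussed above might look like in the code: puppet-provided inputs stay under /etc/certcentral, everything the daemon writes moves under /var/lib/certcentral, and only the latter is what the active host rsyncs across. The class and attribute names are illustrative, not a real certcentral API:

    import pathlib

    class CertCentralPaths:
        def __init__(self,
                     config_path=pathlib.Path('/etc/certcentral'),
                     state_path=pathlib.Path('/var/lib/certcentral')):
            # puppet-driven inputs
            self.config_file = config_path / 'config.yaml'
            self.confd_path = config_path / 'conf.d'
            self.accounts_path = config_path / 'accounts'
            # daemon-managed outputs (the part that gets rsynced)
            self.csrs_path = state_path / 'csrs'
            self.dns_challenges_path = state_path / 'dns_challenges'
            self.new_certs_path = state_path / 'new_certs'
            self.live_certs_path = state_path / 'live_certs'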
[21:34:09] 10Traffic, 10Beta-Cluster-Infrastructure, 10DNS, 10Operations, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10Krenair) So that seems to w...