[00:25:17] netops, Operations, ops-eqiad: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (Papaul)
[00:29:08] netops, Operations, ops-eqiad: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (Papaul) Went on site today to do my hand scan, get my access code and have an idea where things are. Got the router from shipping and racked it already.
[00:45:10] netops, Operations, ops-eqiad: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (Papaul) @chris can you please update the cr2-eqord Custom Fields in Netbox? When I am on site tomorrow I will put in the asset tag information. Or if you have the purchase task number you can just link it to t...
[09:29:31] netops, Cloud-Services, Operations: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (aborrero) I can investigate how difficult this is and give a better guesstimate of the disruption to end users. I'll try the approach that @chasemp suggested, h...
[10:16:36] netops, Operations, ops-eqiad: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (Papaul) @ayounsi What time do you want to start working on this today? I can be on site by 10:30 am Chicago time.
[12:36:34] Traffic, Operations: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (jijiki) a:Vgutierrez
[12:36:45] Traffic, Operations: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (jijiki) p:Triage>Normal
[12:52:59] Traffic, Operations: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (ema)
[12:53:01] Traffic, Operations, Patch-For-Review: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 (ema) Open>Resolved a:ema Upgrade finished!
[13:11:55] netops, Operations, ops-eqiad: Interface errors on cr2-eqiad:xe-4/0/0 - https://phabricator.wikimedia.org/T203719 (ayounsi) Open>Resolved Seems all solved.
[13:15:31] netops, Operations, ops-eqiad: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (ayounsi) 11:00 Chicago time works for me, I sent you a calendar invitation.
[13:15:51] netops, Operations, ops-eqiad: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (ayounsi)
[13:24:31] netops, Analytics, Analytics-Kanban, Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Ottomata) Ayounsi, for this ticket, shall we ask for these to be set up in the public VLAN?
[13:39:22] netops, Analytics, Analytics-Kanban, Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (ayounsi) That sounds good to me, but I will have @faidon double-check. Ideally please distribute those servers across multip...
[15:07:08] netops, Analytics, Analytics-Kanban, Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Ottomata) Faidon asked for a diagram to help understand the data flow. Here we go! {F26768261}
[15:21:34] Krenair: hi!
[15:21:43] when you have a spare moment I've a couple of CRs for you
[15:21:59] https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/469407/ --> exponential backoff
[15:22:28] https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/469446/ --> avoid getting stuck in CHALLENGES_PUSHED status in the T207737 scenario
[15:22:28] T207737: LE rejects issuing two certificates with the same CSR on a short timespan - https://phabricator.wikimedia.org/T207737
[15:22:49] I saw that second one and noticed you assigned it to volans
[15:23:05] yeah, I wanted to discuss some python stuff with him first
[15:23:10] regarding __setattr__
[15:23:41] and the best way of making some attributes RO
[15:24:49] hm
[15:28:31] vgutierrez: the "two certificates with same CSR in short timespan thing".... I assume it's the combination of CSR + account key?
[15:28:50] and the CSR would of course differ for the RSA+ECDSA crypto-duplicates
[15:29:20] bblack: hmm actually I'm not so sure about the crypto-duplicates
[15:29:30] bblack: let me test it
[15:29:35] well I know the CSR wouldn't be bit-identical
[15:29:44] (for ECDSA vs RSA but same SAN list)
[15:29:50] yeah, but it doesn't matter to LE
[15:30:01] right now every time we generate a CSR we are generating a new private key
[15:30:08] oh right
[15:30:10] and still LE considers it the same one
[15:30:43] I know we've talked about issuing ECDSA+RSA, I don't remember if we decided that should be default / every-time, or just a per-cert option or whatever.
[15:31:01] (but if there's no good reason not to, we should just get every cert with both for now)
[15:31:18] we always issue both
[15:31:22] ok
[15:31:50] could make it configurable if someone came up with a use case
[15:32:13] someday we might be able to start slowly migrating towards singular ECDSA-only, especially for lesser services than the unified projects cert.
[15:32:27] it would just be an optimization though, to not generate a pointless/unused RSA copy
[15:32:51] our old LE stuff is RSA-only, which isn't ideal either of course
[15:33:02] well..
[15:33:22] today the LE staging directory is letting me issue the same certificate...
[15:33:28] again and again
[15:33:33] heh ok :)
[15:33:47] I would've expected that, at least until you hit the documented ratelimiters
[15:34:28] after the 6th attempt I didn't get the same behavior as we've been seeing on Friday and yesterday
[15:34:47] right
[15:34:59] maybe ratelimits are more relaxed in the test endpoint though, who knows
[15:35:32] notably, the duplicate certificate ratelimit in https://letsencrypt.org/docs/rate-limits/ doesn't mention the RSA+ECDSA issue. I wonder if that counts double for those purposes.
[15:36:06] hmmm
[15:36:15] I'm wondering if that duplicate certificate limit applies to ACMEv2 as well
[15:36:25] --> For users of the ACME v2 API you can create a maximum of 300 New Orders per account per 3 hours.
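(On the 15:23 __setattr__ question above: a minimal sketch, assuming nothing about the actual certcentral code, of one common way to make selected attributes read-only after they are first set by overriding __setattr__. The class and attribute names here are made up for illustration.)

    class CertState:
        """Toy example: attributes in _READ_ONLY can be set once, then become frozen."""
        _READ_ONLY = frozenset({'csr_id', 'key_type'})

        def __init__(self, csr_id, key_type):
            self.csr_id = csr_id      # first assignment passes the guard below
            self.key_type = key_type

        def __setattr__(self, name, value):
            # Reject reassignment of read-only attributes once they exist.
            if name in self._READ_ONLY and hasattr(self, name):
                raise AttributeError('{} is read-only'.format(name))
            super().__setattr__(name, value)

    state = CertState('pinkunicorn', 'rsa-2048')
    state.status = 'CHALLENGES_PUSHED'   # regular attributes stay writable
    # state.key_type = 'ec-prime256v1'   # would raise AttributeError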
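(And on the crypto-duplicates point: a rough sketch, assuming a reasonably recent python3-cryptography rather than showing certcentral itself, of what the log describes — an RSA and an ECDSA CSR built from freshly generated keys for the same SAN list. The encoded CSRs are not bit-identical, but LE's duplicate-certificate limit keys on the set of names, so both issuances land in the same bucket. The hostname is just a placeholder.)

    from cryptography import x509
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import ec, rsa
    from cryptography.hazmat.primitives.serialization import Encoding
    from cryptography.x509.oid import NameOID

    SANS = ['pinkunicorn.wikimedia.org']   # placeholder SAN list

    def build_csr(private_key):
        """Build a CSR for SANS, signed with the given (freshly generated) key."""
        return (
            x509.CertificateSigningRequestBuilder()
            .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, SANS[0])]))
            .add_extension(x509.SubjectAlternativeName([x509.DNSName(n) for n in SANS]),
                           critical=False)
            .sign(private_key, hashes.SHA256())
        )

    rsa_csr = build_csr(rsa.generate_private_key(public_exponent=65537, key_size=2048))
    ec_csr = build_csr(ec.generate_private_key(ec.SECP256R1()))

    # Different keys and key types, so the encoded CSRs differ...
    assert rsa_csr.public_bytes(Encoding.DER) != ec_csr.public_bytes(Encoding.DER)

    # ...but the name set, which the duplicate-certificate limit cares about, is the same.
    def name_set(csr):
        san = csr.extensions.get_extension_for_class(x509.SubjectAlternativeName).value
        return frozenset(san.get_values_for_type(x509.DNSName))

    assert name_set(rsa_csr) == name_set(ec_csr)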
[15:36:54] I think 300 new orders per 3h, means, "regardless of the SANs, only 300x non-renewal requests / 3h"
[15:37:24] it does sound like the "5 dupes per week" limit will apply to our ECDSA+RSA+2servers scenario
[15:37:39] yeah, but if the 5 per week applies I wouldn't be able to get any more new certificates for pinkunicorn
[15:37:57] maybe test has relaxed limits though
[15:38:10] indeed
[15:38:12] (you'd think it would, otherwise it's hard to test very quickly when you screw up)
[15:38:15] that could be a reason
[15:38:43] under the 5-dupes rule, a brand-new cert issued for ECDSA+RSA on cc1001+cc2001 will eat 4/5 of the dupe count for the week.
[15:38:53] so we're "ok" there, but not much room for error
[15:39:32] (and splayed renewals are still probably statistically likely to often hit within a 1w span of each other on the different hosts, but still ok)
[15:47:06] so..
[15:47:22] we can move from one LE account to two without (almost) any hassle
[15:48:10] I don't think that's against LE rules after reading this
[15:48:11] You can create a maximum of 10 Accounts per IP Address per 3 hours. You can create a maximum of 500 Accounts per IP Range within an IPv6 /48 per 3 hours.
[15:50:00] Krenair: thx for the review, I've already addressed your comments
[15:50:28] right
[15:50:49] I think with our old per-host puppetized LE script, we're actually creating a new account for every unique cert basically
[15:51:00] so 2 accounts for the 2x CC hosts should be np
[15:51:36] 1 account per cc host seems reasonable and more manageable regarding revoking certs and so on
[15:51:42] but even with 2 accounts, the 5-dupe limit will still apply even cross-account. but again, we're just barely fine on that with 2 hosts and ECDSA+RSA
[15:52:17] if we get any problem with that
[15:52:23] we can go for an ACME account per certificate
[15:52:24] we can ask for a raise, right
[15:52:33] it could be implemented easily
[15:52:45] we create the ACME accounts programmatically right now
[15:52:50] no, the ratelimits about duplicate SAN lists / domains / etc... they don't take accounts into account
[15:52:58] hmmm
[15:53:25] so I can DoS an organization using LE by asking for certificates against their domains?
[15:53:35] if you can actually authorize them!
[15:53:39] that's the key
[15:53:45] right, issued certificates
[15:53:52] so if the process fails in the middle
[15:53:54] it's not an issue
[15:54:04] it's still a good idea I think, to do 2 separate accounts
[15:54:07] it does help in some areas
[15:54:08] what we cannot do is lose certificates :)
[15:54:09] so wait
[15:54:12] why don't we have
[15:54:15] for each host
[15:54:20] one account for RSA and one for ECDSA?
[15:54:31] that doesn't help with the dupe limit
[15:54:36] as bblack said, it's a matter of SANs and domains
[15:54:37] not accounts
[15:54:46] bah
[15:54:49] :)
[15:54:52] so to recap their ratelimits to be clear:
[15:55:29] 1) 50 certs per registered domain per week (e.g. anything.wikipedia.org is somewhere in SAN list, keying on wikipedia.org), account doesn't matter
[15:55:40] 2) 100 names per cert
[15:56:11] 3) 5 Duplicates per week (exact same CN/SAN list, account/crypto doesn't matter).
[15:56:36] 4) There's an exception to (1) if it's a renewal of an existing cert, but not an exception to (3)
[15:57:07] 5) 5 failed validations per account, per hostname, per hour
[15:57:39] 6) raw request limit for most important endpoints at 20-40/sec
[15:57:57] 7) Account creation: 10 per IP per 3 hours
[15:58:27] 8) 500 new accounts per ipv6/48 per 3 hours
[15:59:09] 9) 300 pending authorizations (as in, mid-process without finishing up) per account (so this is like, a parallelism limit per account)
[15:59:34] 10) ACMEv2: 300 new orders (non-renewals) per 3 hours per account
[16:00:00] so, having dual accounts (or more) helps with 9/10, but going crazy with it could risk 7/8
[16:00:16] accounts don't really affect 1-6
[16:00:45] and our certcentral hosts will be coming from the same webproxy IP
[16:00:59] or will they...
[16:01:07] was it one webproxy per DC?
[16:01:19] either way, that's an implementation detail we can fix independently of CC
[16:01:37] right now we have 2 webproxies, one each in eqiad+codfw
[16:01:41] and other DCs CNAME over to those
[16:01:54] yeah no it's per-site, no problem
[16:02:25] ok
[16:02:33] but realistically, it's really hard to hit the 9/10 limits in practice, even in our worst medium-term future cases
[16:02:47] I'm worried about #3
[16:02:49] I'd say just do 1 account per DC to make them independent of each other
[16:03:28] #3 is worrying. hypothetically if everything's working bug-free it won't affect us, but you can imagine this scenario:
[16:03:33] realistically that means we can issue one cert per week?
[16:03:41] no
[16:04:23] we can issue one cert per week, for an exact duplicate of the same cert (which is something we should only really need once per renewal period, e.g. 2-3 months)
[16:04:44] but one logical cert on our end is 4 LE requests
[16:04:50] which is less than 5 :)
[16:04:57] yeah
[16:05:08] the bad scenario for us on that limit unfolds like this:
[16:05:18] that doesn't help us though
[16:06:04] 1) We request a new issuance (or renewal) of cert X, which within one week (or mere minutes if it's brand-new) does 4 duplicate certs for cc1001+2001/ECDSA+RSA
[16:06:27] 2) The request succeeds (authorization works, we are handed the 4 certs), which consumes 4/5 of the dupe limit for this cert
[16:07:12] 3) After LE has given us the good certs, some bug in certcentral causes a late failure on our end, where we buggily drop those on the floor, lose them and call the whole thing a failure.
[16:07:35] 4) Now we want to retry, and the first host+crypto combo will succeed, hitting 5/5, and the other 3 will fail the ratelimiter.
[16:07:41] I suppose if we wanted to amend a cert we'd get a different CN/SAN list
[16:07:46] and now we have to wait a week to try again
[16:07:52] yes
[16:08:31] if we're e.g. adding a new SAN element to what we think of as an existing cert, that isn't a dupe anymore, so we can do that more often (until we hit the 50/dom thing or something)
[16:08:55] yeah
[16:09:44] we should probably document the implications of our RSA+ECDSA and multiple hosts thing and the ratelimits, in the config file in puppet
[16:10:14] I think the important POV here is we have to be clear about what the classes of failure are and whether we retry.
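(A back-of-the-envelope version of the 4/5 arithmetic above — the numbers come from the discussion and https://letsencrypt.org/docs/rate-limits/; this is just the bookkeeping, not certcentral code.)

    DUPLICATE_CERTS_PER_WEEK = 5   # LE "Duplicate Certificate" limit
    HOSTS = 2                      # cc1001 + cc2001
    KEY_TYPES = 2                  # RSA + ECDSA

    # One logical cert on our end means HOSTS * KEY_TYPES issuances of the same SAN list.
    per_logical_cert = HOSTS * KEY_TYPES                       # 4
    headroom = DUPLICATE_CERTS_PER_WEEK - per_logical_cert     # 1 spare slot per week

    # The bad scenario: a late local failure after LE already issued everything,
    # followed by a full retry, needs another 4 slots but only 1 remains.
    retry_needed = per_logical_cert
    blocked = max(0, retry_needed - headroom)                  # 3 of the 4 retried requests hit the limiter
    print('headroom: %d, retries blocked by the limiter: %d' % (headroom, blocked))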
[16:10:39] it's ok to just fail and require human investigation; it maybe needs good log outputs of the LE responses and the CSRs used, etc
[16:11:01] but spamming retries on any generic failure anywhere in the process is likely to cause us ratelimiter pain
[16:11:54] especially if the LE part of the process is successful but we fail later in the process (e.g. maybe the write of the cert to the FS fails because the FS is full?), we have to stop on those late local failures and just fail hard.
[16:12:11] next release is going to include exponential backoff, so it shouldn't be spammy anymore
[16:12:30] that scenario is a PITA TBH
[16:12:46] yeah but there's a whole class of such local failures, they will eventually happen
[16:13:13] indeed
[16:13:38] when thinking about retries and backoffs, another useful thing to consider (probably in more ways than just this) is that a new issue with no existing cert is quite different from a renewal.
[16:14:23] e.g. if we're attempting to start renewing an existing cert at ~60/90 days into the old one's expiry (probably subject to some random swizzle), and it fails for any reason, there's no need for a "fast" retry at all.
[16:14:48] you can log a hard fail and retry like once a day, and we alert on the expiry drawing near in icinga checks and/or the CC software logging the failure.
[16:15:13] and in a disk-full type of scenario it leaves plenty of time for humans to sort out the mess.
[16:16:04] a brand-new cert is trickier, there are probably people waiting on the service to start working, and there's no existing cert to fall back on that's still valid.
[16:16:17] so it makes more sense to want to retry faster (with exp backoff)
[16:17:08] but even then, I'd argue it's not an emergency, if you're turning on a new service that didn't exist before and it fails.... we don't expect failure, and any failure probably means ratelimits, or a software bug somewhere, or some generic host issue.
[16:17:26] so maybe fast retry doesn't make sense for any case?
[16:17:46] things to ponder!
[16:18:43] hmmm fast-retrying can mitigate getting stuck due to intermittent network issues and so on
[16:19:07] a timeout sshing one authdns server or validating one TXT record
[16:19:15] that would be embarrassing IMHO
[16:19:18] if we had some idea (in the code, on failure) that that was the kind of reason
[16:19:45] 15:57 < bblack> 5) 5 failed validations per account, per hostname, per hour
[16:20:02] ^ there's also this, which looks at all the things. fast retries on auth failure could cap us out on this quickly.
[16:20:27] I assume if you hit 5 failed validations, you simply can't validate any more until you get out of that hour
[16:20:29] we only submit a challenge to LE after it's been validated by us
[16:20:51] right, but it could also be an intermittent network failure of LE->ourdns
[16:21:09] ugh I've gotta run
[16:21:13] :)
[16:26:26] vgutierrez, I'm thinking 16 max retries
[16:26:47] the last retry will wait for 0.75 days, but that means in total you'll get about 1.5 days
[16:27:13] http://www.wolframalpha.com/input/?i=sum+2%5Ex%2F60%2F60%2F24+from+1+to+16
[16:28:27] alternatively we could go for powers of 3 with max 10 retries
[16:29:11] ok, 16 then
[16:30:36] PS3 has your comments addressed :D
[16:30:44] arg, PS6
[16:31:58] vgutierrez, also commit message :)
[16:32:01] (sorry :p)
[16:32:33] nice catch :D
[16:45:15] vgutierrez, you taking care of the debian branch commit?
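(For reference, the arithmetic behind the 16-retries figure above — sleeps of 2**attempt seconds, as in the WolframAlpha link. This only reproduces the numbers being discussed; the actual certcentral backoff code may look different.)

    MAX_RETRIES = 16

    def backoff_seconds(attempt):
        """Seconds to wait before retry number `attempt` (1-based): 2, 4, 8, ..."""
        return 2 ** attempt

    last = backoff_seconds(MAX_RETRIES) / 86400.0
    total = sum(backoff_seconds(a) for a in range(1, MAX_RETRIES + 1)) / 86400.0
    print('last wait: %.2f days, cumulative: %.2f days' % (last, total))   # ~0.76 and ~1.52

    # The alternative floated in the log, powers of 3 with 10 retries, gives:
    alt_total = sum(3 ** a for a in range(1, 11)) / 86400.0                # ~1.02 days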
[16:45:35] indeed
[16:48:31] Tagged 0.3
[16:48:45] thx
[18:31:40] netops, Operations, ops-eqiad, Patch-For-Review: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (ayounsi)
[18:43:21] netops, Operations, ops-eqiad, Patch-For-Review: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (ayounsi) a:Cmjohnson>Papaul
[22:08:28] netops, Operations, ops-eqiad, Patch-For-Review: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (ayounsi) Email sent to Equinix so they update their MAC filtering.
[22:31:18] Traffic, Operations, Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (ayounsi)
[22:31:20] Traffic, Operations: Document eqsin power connections in Netbox - https://phabricator.wikimedia.org/T207138 (ayounsi) Open>Resolved a:ayounsi I imported everything that was not the servers' uplinks (for the reason mentioned above).