[00:17:45] netops, Analytics, Analytics-Kanban, Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (ayounsi) > If this does go into the 'public' VLAN, could we restrict access to these nodes using some simple ferm rules?...
[00:23:11] netops, Analytics, Analytics-Kanban, Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Krenair) >>! In T207321#4687651, @ayounsi wrote: >> Where are the labsdb hosts going to live if they are being moved out...
[06:34:43] Krenair: sorry! I took yesterday off to be able to recover from my stomach sickness
[07:45:32] I'm debugging the certcentral2001 error we saw on Friday
[07:45:42] Oct 19 15:35:22 certcentral2001 certcentral-backend[20885]: acme.messages.Error: urn:ietf:params:acme:error:malformed :: The request message was malformed :: Order's status ("valid") is not acceptable for finalization
[07:45:46] that one
[07:46:45] we misinterpreted how dns-01 challenge validation works
[07:47:02] Traffic, Wikimedia-Apache-configuration, DNS, Operations, Patch-For-Review: Add punjabi.wikimedia.org to DNS and Apache - https://phabricator.wikimedia.org/T207583 (Urbanecm) Thank you @Dzahn!
[07:47:22] it's not enough to have the TXT record at least returning the required challenges
[07:47:32] we must return only those
[07:48:25] so in certcentral2001 the first certificate issuance worked, so we got the rsa-2048 one, but the ec-prime256v1 one failed cause the rsa-2048 challenge responses were still there
[07:49:23] this is a big issue if we need to run certcentral1001 and certcentral2001 as active/active regarding the certcentral-backend service
[07:50:57] even worse, some similar projects like jetstack/cert-manager (a sort of certcentral for k8s) have the same issue https://github.com/jetstack/cert-manager/issues/593 regarding wildcard certificates
[07:51:40] so, if we need to issue a certificate for pinkunicorn.wm.o and *.pinkunicorn.wm.o, we must validate first pinkunicorn.wm.o, WIPE the TXT record, and only then validate *.pinkunicorn.wm.o
[07:52:45] BTW, this limitation is not present on pebble (the CI version of boulder), otherwise we'd have found it a long time ago
[07:52:48] *sigh*
[07:55:41] I'm wondering if gdnsdctl acme-dns-01-flush allows specifying a concrete TXT record
[07:55:48] otherwise that needs to be refactored as well
[07:57:25] acme-dns-01-flush - Flush (remove) all ACME DNS-01 payloads added above
[07:57:37] :_(
[07:59:22] vgutierrez, damn
[07:59:33] what's worse
[07:59:42] ACME RFC doesn't require it
[08:00:04] "3. Verify that the contents of one of the TXT records match the digest value"
[08:00:17] https://datatracker.ietf.org/doc/draft-ietf-acme-acme/?include_text=1 --> page 63
[08:00:52] vgutierrez, where does LE document this requirement then?
[08:01:00] I'm checking that now
[08:01:12] I must say I'm surprised by this
[08:01:20] It doesn't make much sense to require this?
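For reference, the rule quoted from the ACME draft above ("one of the TXT records match the digest value") can be sketched roughly as follows. This is only an illustration, not certcentral or boulder code: the function names are hypothetical, and it assumes the dns-01 digest is the unpadded base64url encoding of the SHA-256 hash of the key authorization, as the ACME specification defines it. Under that rule, extra TXT values on the same name would not make validation fail.

```python
import base64
import hashlib
from typing import Sequence


def dns01_txt_value(key_authorization: str) -> str:
    """Expected TXT value: base64url(SHA-256(key authorization)), padding stripped."""
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def challenge_satisfied(txt_records: Sequence[str], key_authorization: str) -> bool:
    """True if at least one TXT record equals the expected digest value.

    Unrelated records (stale or bogus ones) are ignored rather than
    treated as a failure, matching the "one of the TXT records" wording.
    """
    expected = dns01_txt_value(key_authorization)
    return any(record == expected for record in txt_records)
```

A SHA-256 digest is 32 bytes, so the resulting TXT value is always 43 characters long once the padding is removed.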
[08:02:33] https://github.com/letsencrypt/boulder/blob/master/docs/acme-divergences.md
[08:02:55] hmmm that's documented against the draft-15
[08:02:59] and the latest one is draft-16
[08:04:20] the same point is in draft-15, but on page 62
[08:04:32] also from draft-15 page 62
[08:04:32] The client SHOULD de-provision the resource record(s) provisioned for
[08:04:32] this challenge once the challenge is complete, i.e., once the
[08:04:33] "status" field of the challenge has the value "valid" or "invalid".
[08:04:47] SHOULD != MUST (obviously)
[08:07:35] well.. let's check boulder source code
[08:11:20] well
[08:11:21] https://github.com/letsencrypt/boulder/blob/master/va/va.go#L861-L869
[08:11:33] it looks like the code honors the RFC
[08:11:49] let's keep digging
[08:23:20] hmmm I just issued the two certificates from codfw right now
[08:23:23] and it just worked
[08:23:46] but the two challenges appended...
[08:23:47] willikins:~ vgutierrez$ host -t txt _acme-challenge.pinkunicorn.wikimedia.org
[08:23:50] _acme-challenge.pinkunicorn.wikimedia.org descriptive text "t7_OX4Ohp_WNz8g2Sr6V1NV_UVWOxjbEpi0-rsOLkAU"
[08:23:53] _acme-challenge.pinkunicorn.wikimedia.org descriptive text "t7_OX4Ohp_WNz8g2Sr6V1NV_UVWOxjbEpi0-rsOLkAU"
[08:24:18] are fricking identical
[08:25:04] I'm going to add a bogus one manually and repeat the process
[08:25:06] let's see what happens
[08:32:20] so.. initial status for the TXT record
[08:32:25] _acme-challenge.pinkunicorn.wikimedia.org descriptive text "it_is_a_bogus_token_of_43_bytes_and_no_more"
[08:33:27] and I'm able to get the certificate...
[08:33:29] _acme-challenge.pinkunicorn.wikimedia.org descriptive text "it_is_a_bogus_token_of_43_bytes_and_no_more"
[08:33:33] _acme-challenge.pinkunicorn.wikimedia.org descriptive text "t7_OX4Ohp_WNz8g2Sr6V1NV_UVWOxjbEpi0-rsOLkAU"
[08:33:47] so I don't see a good reason for what we saw on Friday
[08:40:49] so.. I think I'm focusing on two things
[08:41:12] T207478 + being able to check challenge validation status using LE API
[08:41:13] T207478: Avoid infinite attempts on issuing a certificate on permanent LE side errors - https://phabricator.wikimedia.org/T207478
[08:46:00] Certcentral: Check challenges status on LE side before asking for a certificate - https://phabricator.wikimedia.org/T207725 (Vgutierrez)
[08:46:20] Certcentral: Check challenges status on LE side before asking for a certificate - https://phabricator.wikimedia.org/T207725 (Vgutierrez) p:Triage>Normal a:Vgutierrez
[08:46:26] I'm beginning with T207725
[08:46:27] T207725: Check challenges status on LE side before asking for a certificate - https://phabricator.wikimedia.org/T207725
[08:47:01] Certcentral: Check challenges status on LE side before finalizing the order and fetching the certificate - https://phabricator.wikimedia.org/T207725 (Vgutierrez)
[09:00:29] vgutierrez, ok, thanks
[09:26:13] hmm
[09:26:15] interesting
[09:26:26] we already check the challenges status
[09:26:42] so
[09:26:51] essentially we *do* ask LE if it's good
[09:26:52] and it says yes
[09:26:56] so we say go for it, and it fails?
[09:27:00] nah
[09:27:12] we submit a finalization order
[09:27:26] and LE says... WTF! I already have a certificate for you
[09:27:35] I'm not going to issue a new one
[09:28:23] here
[09:28:24] Oct 19 15:35:22 certcentral2001 certcentral-backend[20885]: acme.messages.Error: urn:ietf:params:acme:error:malformed :: The request message was malformed :: Order's status ("valid") is not acceptable for finalization
[09:28:28] that's a 400
[09:28:58] we should send a finalization order iff order status === ready
[09:29:05] o "ready": The server agrees that the requirements have been
[09:29:05] fulfilled, and is awaiting finalization. Submit a finalization
[09:29:06] request.
[09:29:14] while valid means
[09:29:21] o "valid": The server has issued the certificate and provisioned its
[09:29:24] URL to the "certificate" field of the order. Download the
[09:29:26] certificate.
[09:31:31] I'm pretty sure we are hitting some race condition
[09:31:32] and presumably it becomes valid when you submit a finalisation request?
[09:31:38] usually yes
[09:31:39] and we try to send twice for some reason?
[09:31:52] hmmm that
[09:31:58] or overlapping between servers
[09:32:54] don't they generate two different CSRs that presumably need signing separately? why would LE confuse them?
[09:33:48] hmmm they do share accounts
[09:33:56] so the key signing the CSR is the same
[09:34:10] the only thing that's different is the timestamp for the CSR
[09:34:41] and the private key to be used for the cert?
[09:34:51] different of course
[09:34:59] but that key doesn't reach LE ever
[09:35:05] well of course not
[09:35:34] anyway to me this seems like enough for LE to treat the two CSRs as distinct?
[09:38:08] if this does turn out to be the problem we could add something host-dependent
[09:38:17] like an extra SAN
[09:38:38] still I'm not convinced
[09:39:18] me neither
[09:39:24] I'd like to be able to reproduce this
[09:41:35] ema / bblack: your eyes on https://gerrit.wikimedia.org/r/c/operations/puppet/+/468320 would be welcomed!
[09:43:37] got i! :)
[09:43:41] *it
[09:47:07] Krenair: https://phabricator.wikimedia.org/P7711
[09:47:34] first attempt, the certificate is fetched
[09:48:02] and this happens consistently?
[09:49:13] yep
[09:49:20] right now I'm stuck in CHALLENGES_PUSHED
[09:49:48] and every time I try to fetch the certificate I get the error 400
[09:50:42] cause the server already issued the certificate the first time, so I should go and fetch it, instead of finalizing the order
[09:50:49] and getting a new one
[09:52:58] so... this consistently happens with 1 server
[09:53:08] let's see if certcentral1001 and 2001 interfere with each other
[09:54:17] how about we fix it for 1 server first and then throw a second server into the mix?
[09:55:09] I expect it will still fail in the same way on both servers right now
[09:57:51] gehel: any chance to test the untested code paths? :)
[09:58:19] also vgutierrez might want to take a look too!
[09:58:22] ema: I might be able to write some rspec
[09:58:54] but I doubt it will help much
[09:59:03] Krenair: dunno, cause as you said, the keys for the certificate are different
[09:59:48] gehel: I was rather thinking of configuring multiple proxies on the same fqdn somewhere in labs perhaps
[09:59:52] And I'm not even sure there are real use cases for those. Things like multiple different certs, with multiple proxies on different ports and with a redirection port configured...
[10:00:10] ema: my next patch is doing exactly that on relforge
[10:00:20] perfect!
[10:00:44] it might be possible to test that on labs at reasonable cost
[10:01:05] or just fix the issue after the fact
[10:06:02] Krenair: hmm but that doesn't matter
[10:06:41] Krenair: cause in a new certificate generation CertCentral._new_certificate() we always issue a new private key
[10:08:00] https://github.com/wikimedia/certcentral/blob/master/certcentral/certcentral.py#L393-L395
[10:11:44] ema: lunch time here, i'll see what i can do on labs after
[10:13:22] so yeah.. two servers with the same LE account are going to interfere if they ask for the same certificate
[10:13:39] the easiest solution to that would be to use one LE account per server
[10:14:23] gehel: bon appetit!
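The order handling discussed above (finalize only when the order is "ready", and download the certificate when it is already "valid") could look roughly like the sketch below. This is a hedged illustration and not the certcentral implementation: the Action enum and next_action() are hypothetical names, and the handling of "pending", "processing" and "invalid" is an assumption based on the order states defined by the ACME specification, which the log does not discuss.

```python
from enum import Enum


class Action(Enum):
    """What the client should do next for a given ACME order (hypothetical)."""
    WAIT = "wait"          # nothing to submit yet, poll the order again later
    FINALIZE = "finalize"  # submit the CSR to the order's finalize URL
    DOWNLOAD = "download"  # fetch the certificate that was already issued
    GIVE_UP = "give_up"    # permanent failure, do not retry this order


def next_action(order_status: str) -> Action:
    """Map an ACME order status to the next client action."""
    if order_status == "ready":
        # Requirements fulfilled, server is awaiting finalization.
        return Action.FINALIZE
    if order_status == "valid":
        # Certificate already issued: finalizing again yields the
        # 'Order's status ("valid") is not acceptable for finalization' error.
        return Action.DOWNLOAD
    if order_status in ("pending", "processing"):
        return Action.WAIT
    if order_status == "invalid":
        return Action.GIVE_UP
    raise ValueError(f"unexpected order status: {order_status!r}")
```

Checking the status before every finalization attempt would turn the repeated 400 seen in the log into a plain certificate download on the retry.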
[10:15:38] Certcentral: Check challenges status on LE side before finalizing the order and fetching the certificate - https://phabricator.wikimedia.org/T207725 (Vgutierrez) Open>Invalid
[10:36:18] Certcentral: LE rejects issuing two certificates with the same CSR on a short timespan - https://phabricator.wikimedia.org/T207737 (Vgutierrez)
[10:36:38] Certcentral: LE rejects issuing two certificates with the same CSR on a short timespan - https://phabricator.wikimedia.org/T207737 (Vgutierrez) p:Triage>Normal
[10:38:08] Krenair: so, as I mention in T207737, this can be partially solved by T207478
[10:38:09] T207478: Avoid infinite attempts on issuing a certificate on permanent LE side errors - https://phabricator.wikimedia.org/T207478
[10:38:09] T207737: LE rejects issuing two certificates with the same CSR on a short timespan - https://phabricator.wikimedia.org/T207737
[10:41:36] vgutierrez: you mentioned T207478 twice there
[10:50:56] vgutierrez, shouldn't we report this as an upstream bug in boulder?
[10:50:58] this is not technically the same CSR is it? they'll have different public keys etc.?
[10:51:10] as well as work around it locally with multiple accounts?
[10:51:34] paravoid: you mean "This issue could be fixed partially by T207478, but only relying on T207478 would result in delayed certificate issuance in some scenarios"?
[10:51:35] T207478: Avoid infinite attempts on issuing a certificate on permanent LE side errors - https://phabricator.wikimedia.org/T207478
[10:52:32] yes
[10:52:42] oh I see, that was on purpose
[10:53:01] indeed :)
[10:56:28] Krenair: even with different keys, if the same account asks for the same certificate twice, I assume they are going to say that there is no reason to not use the same certificate
[10:57:05] they should at least document this and advise to use different accounts for cases such as ours
[11:02:35] it's a matter of minutes btw
[11:03:16] 2-3 minutes is enough to ask again for the same CSR and get a new certificate
[11:37:05] Hello! We got quite a number of "Certificate about to expire" emails in the Ops Maintenance google group, would love to confirm that someone is looking into them or something..
[11:38:08] vgutierrez, ^
[12:17:57] httpd 2.4.37 supports TLS 1.3 \o/
[12:35:53] netops, Cloud-Services, Operations: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (aborrero) >>! In T207663#4687081, @faidon wrote: > It's sad to hear that's a major disruption :( Would it make sense to do this now when it's early in the migratio...
[12:53:18] elukey, what about ESNI?
[12:54:26] I am not sure, I didn't see any note about it but worth checking!
[12:55:02] the main changes were making mod_ssl work with TLS 1.3 (various libs, libressl, openssl, etc..)
[12:56:40] esni is still in experimental state
[12:56:46] https://tools.ietf.org/html/draft-ietf-tls-esni-01
[12:57:44] ah
[12:59:43] Certcentral: Avoid infinite attempts on issuing a certificate on permanent LE side errors - https://phabricator.wikimedia.org/T207478 (Vgutierrez) a:Vgutierrez
[13:11:30] netops, Analytics, Analytics-Kanban, Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Ottomata) Ok, great, then it sounds like this should go in the public VLAN, with ACLs in the Analytics VLAN to allow us t...
[14:09:39] netops, Operations: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 - https://phabricator.wikimedia.org/T207753 (ayounsi) p:Triage>Normal
[14:41:00] netops, Analytics, Analytics-Kanban, Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (bd808) >>! In T207321#4687656, @Krenair wrote: >>>! In T207321#4687651, @ayounsi wrote: >>> Where are the labsdb hosts go...
[14:50:19] Traffic, DNS, Operations: Create redirect to integration.wikimedia.org/zuul - https://phabricator.wikimedia.org/T207008 (jijiki) p:Triage>Normal
[14:51:48] Traffic, Wikimedia-Apache-configuration, DNS, Operations, and 2 others: Remove *.cz domains from WMF's infrastructure - https://phabricator.wikimedia.org/T206923 (jijiki) p:Triage>Low
[14:54:33] Traffic, Wikimedia-Apache-configuration, DNS, Operations, and 3 others: Remove *.cz domains from WMF's infrastructure - https://phabricator.wikimedia.org/T206923 (jijiki) a:jijiki
[15:14:53] hi, the 30+ certs warnings in icinga can be acked?
[16:02:59] gehel, how long do these certs have left?
[16:03:03] godog, ^ sorry gehel
[16:04:12] Krenair: 29 days
[16:04:27] Traffic, Cloud-Services, Operations: Traffic project in labs cannot talk HTTP with deployment-prep any longer - https://phabricator.wikimedia.org/T207763 (ema)
[16:04:44] that's for the wikipedia wildcard, 2x for rsa+ecdsa
[16:04:47] I'd ask Brandon but he's at the conference
[16:05:09] hmm
[16:05:47] godog, GlobalSign?
[16:05:59] This one is https://phabricator.wikimedia.org/T206804
[16:06:25] so he's aware of it
[16:06:25] the alert doesn't mention the issuer but I'm assuming so yeah
[16:06:34] is it normal practice to ack the alert when there's a task?
[16:06:40] he's assigned it and everything so...
[16:09:18] Traffic, Cloud-Services, Operations, Beta-Cluster-reproducible: Traffic project in labs cannot talk HTTP with deployment-prep any longer - https://phabricator.wikimedia.org/T207763 (Krenair)
[16:09:41] ack, yeah I don't know what's the workflow
[16:10:18] thanks for looking into it Krenair, it can wait for bblack to be back
[16:11:46] Traffic, Cloud-Services, Operations, Beta-Cluster-reproducible: Traffic project in labs cannot talk HTTP with deployment-prep any longer - https://phabricator.wikimedia.org/T207763 (Krenair) Was traffic-ats-stretch using 172.16.2.180 when this broke? Is it possible you got migrated across regions...
[16:12:53] godog, worse comes to worse there's still the Digicert cert :)
[16:14:02] heheh indeed, if we manage to ignore the alerts for 30 days we deserve it too
[16:15:38] ema, this was a cross-project traffic flow that I was unaware of... what do you need to talk to deployment-mediawiki* directly for?
[16:16:52] Krenair: that's how I've been testing new varnish versions and varnish changes for a while, now using the same approach for ats as well
[16:17:11] ah
[16:17:18] That's very interesting to hear.
[16:18:14] I'll sort this out
[16:18:18] (hopefully)
[16:18:43] ema, just port 80 right?
[16:19:01] Krenair: thanks very much. Yes, just port 80.
[16:19:29] ema, please can you try now?
[16:19:57] Krenair: it works :)
[16:20:01] what was it?
[16:20:31] security groups like I thought
[16:20:50] Traffic, Cloud-Services, Operations, Beta-Cluster-reproducible, cloud-services-team (Kanban): Traffic project in labs cannot talk HTTP with deployment-prep any longer - https://phabricator.wikimedia.org/T207763 (aborrero)
[16:20:56] you were relying on the default 10/8 port 80 rule which didn't match you anymore because you changed source IPs
[16:21:31] ah, deployment-prep's security group I assume
[16:21:42] Traffic, Cloud-Services, Operations, Beta-Cluster-reproducible, cloud-services-team (Kanban): Traffic project in labs cannot talk HTTP with deployment-prep any longer - https://phabricator.wikimedia.org/T207763 (Krenair) Open>Resolved a:aborrero>Krenair ema, this was...
[16:21:54] ema, of course
[16:22:22] Krenair: wonderful, many thanks! :)
[16:22:23] I don't think we really do many egress security groups around here, though I'd hope you'd have checked for that if you were using them :)
[16:23:39] yeah, I did. Good to close the working day on a positive note :)
[16:23:42] see you
[16:23:51] cya
[17:00:33] yes, the expiring cert is known and can be acked, or I can do it in a bit
[17:02:27] godog, ^
[17:02:52] [done]
[17:40:13] netops, Cloud-Services, Operations: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (Andrew) There are currently 23 projects running in the new region, and we're moving more over every day. This would have been a reasonable request when were origi...
[18:11:24] netops, Cloud-Services, Operations: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (faidon) This is essentially part of T122406, which we resolved earlier in the week with the intention of making it more specific with this task (among others). Ba...
[21:42:43] netops, Analytics, Analytics-Kanban, Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (ayounsi) Not impacting that task, but for labsdb10[08|09|10], the presence of sensitive data + need to be reached from Cl...