[00:10:19] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10User-Smalyshev: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Smalyshev) Looking at the distribution of Special:EntityData fetches, if we cache entities under 10K, we wi... [09:19:54] 10Acme-chief, 10Traffic, 10Operations, 10Goal, 10Patch-For-Review: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (10Vgutierrez) [09:19:57] 10Acme-chief: Expose not-yet-live certs to clients so they can handle OCSP stapling - https://phabricator.wikimedia.org/T207295 (10Vgutierrez) 05Open→03Resolved [09:30:33] I'm going to try to issue the unified certificate against LE staging environment [09:31:06] so that shouldn't interfere with our production account if we run into some acme-chief bug/issue/misbehaviour [09:59:01] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Addshore) >>! In T217897#5056213, @Smalyshev wrote: >> I'm still a bit confused about this logic inside the updat... [10:10:29] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Addshore) >>! In T217897#5060748, @Smalyshev wrote: > Looking at the distribution of Special:EntityData fetches,... [10:22:53] Krenair: I think I've found a small bug in the acme-chief config puppetization [10:24:07] adding a default account doesn't work as expected: https://puppet-compiler.wmflabs.org/compiler1002/15378/acmechief1001.eqiad.wmnet/ [11:36:14] vgutierrez, good idea re staging env [11:36:22] vgutierrez, interesting... does ordered_yaml usually add colons to 'default' like that? [11:36:32] indeed [11:48:31] so... 
I've tested it manually [11:48:40] and it's been a 50% success [11:48:58] acme-chief was able to get the ec-prime256v1 cert but it failed to obtain the rsa-2048 one [11:49:48] regarding the dns-01 challenge script handling all the parameters, we're ok [11:49:53] no issues there [11:53:25] I'm not sure if it's actually ordered_yaml doing this [11:53:53] but it looks like maybe you have a fix in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/499426/ ? [11:54:45] nope.. quoting default doesn't solve it [11:54:51] the output is even weirder [11:55:13] I suspect the problem may be in how puppet reads in from hiera, but I could be wrong [11:55:31] https://puppet-compiler.wmflabs.org/compiler1002/15380/acmechief1001.eqiad.wmnet/change.acmechief1001.eqiad.wmnet.pson [11:55:39] you can check the config.yaml content there [12:03:34] hmm I do have the feeling that we've seen this before: https://acme-staging-v02.api.letsencrypt.org/acme/order/7090084/28568534 [12:10:42] looks pretty similar to https://phabricator.wikimedia.org/T208948 [13:50:30] 10netops, 10Operations: allow bast2002 to connect to mgmt network - https://phabricator.wikimedia.org/T219384 (10Dzahn) p:05Triage→03Normal [13:51:42] 10netops, 10Operations: allow bast2002 to connect to mgmt network - https://phabricator.wikimedia.org/T219384 (10Dzahn) [13:54:15] how can I purge a page on a non-wikipedia website? trying to purge https://performance.wikimedia.org/asreport/ from varnish [13:54:37] curl -X PURGE gives me a 204 response, but the page doesn't seem to get purged [13:59:58] gilles: what's the curl command you're using exactly? [14:00:14] just curl -X PURGE https://performance.wikimedia.org/asreport/ [14:00:21] reedy@deploy1001:~$ echo "https://performance.wikimedia.org/asreport/" | mwscript purgeList.php enwiki [14:00:21] Purging 1 urls [14:00:21] Done! [14:00:22] is that insufficient? [14:00:25] gilles: WFM :P [14:00:31] Report generated on 2019-03-27.
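[Editor's aside] What `curl -X PURGE <url>` actually sends is just a plain HTTP request with a non-standard method, so it only reaches whichever cache instance answers the TCP connection (the frontend layer); that's why the 204 above doesn't mean the whole CDN was purged. A minimal sketch of the equivalent raw request (the `build_purge_request` helper is hypothetical, not part of any Wikimedia tooling):

```python
# Sketch: the raw request a `curl -X PURGE <url>` boils down to.
# It is delivered to a single cache instance only, unlike the multicast
# HTCP purge that purgeList.php triggers.
from urllib.parse import urlsplit

def build_purge_request(url: str) -> str:
    """Build the raw request line + headers that a PURGE for `url` consists of."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return (
        f"PURGE {path} HTTP/1.1\r\n"
        f"Host: {parts.hostname}\r\n"
        f"Connection: close\r\n"
        f"\r\n"
    )

print(build_purge_request("https://performance.wikimedia.org/asreport/"))
```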
[14:00:37] ah yes, thank you [14:00:58] good thinking [14:01:38] It's a hack [14:03:00] it works [14:03:02] but we like those around here [14:03:07] and it's not permanent! [14:04:18] yeah so, mwscript is documented here https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging#One-off_purge [14:05:32] sending a PURGE as you did should (1) not be allowed, and (2) only PURGE the first caching layer (frontend) and not the rest of the CDN [14:08:20] yeah wm_common_recv_purge does not look right [14:32:56] morning bblack, could you remind me the maximum number of TXT challenges that gdnsd will hold in memory? [14:32:58] maybe 100? [14:34:26] vgutierrez: there's no limit on the number of TXT challenges gdnsd will track (within reason, I guess at some point you'll get some kind of OOM-related error, way way past sanity). [14:34:47] ack [14:34:54] vgutierrez: but there is a 100 limit per gdnsdctl invocation, and a ~200 limit on challenges stacked up for the exact same domainname. [14:35:09] I've tried to get the unified wildcard cert this morning against the LE staging environment [14:35:19] 50% success rate [14:35:30] the ec-prime256v1 got issued, the rsa-2048 failed to validate [14:35:34] ok [14:35:43] on smaller certs, only 1 gets validated [14:35:54] and the other one reuses the challenges [14:36:08] that makes sense due to how challenges are calculated [14:37:05] so either LE's server-side stops reusing challenges when there's so many SANs? Or alternatively, for some reason the two requests don't actually match (the rsa has a different set than the ecdsa)? [14:37:15] I'm comparing the challenges that acme-chief handed to gdnsd and the ones that LE apparently expected... and it doesn't make any sense [14:37:24] hmm that's easy to check...
let me verify the CSRs [14:38:01] I was going to warn that it's possible that LE might have sanity-checks about top-N domains and reject wikipedia [14:38:05] but if ecdsa worked at all, I guess not [14:39:25] nah, the rejection cause is an invalid TXT challenge [14:39:35] ok [14:39:49] can you dump whatever info you've got that doesn't make sense? [14:40:43] but realistically, it seems unlikely that gdnsd had all the right TXTs to validate the ecdsa, then somehow didn't have them again for the rsa [14:41:15] nah, I'm not blaming gdnsd [14:41:23] so.. this is the failed order: https://acme-staging-v02.api.letsencrypt.org/acme/order/7090084/28568534 [14:42:05] https://www.irccloud.com/pastebin/ZqGg3Nsx/ [14:42:21] and those are the tokens that were sent to fulfil it [14:43:47] I know you said earlier that LE tends to reuse challenge success if it's all the same, but does acme-chief still actually configure separate challenges (towards gdnsd) for ecdsa+rsa? [14:43:57] nope [14:44:02] acme-chief is able to detect that [14:44:07] bblack: hi! Could you take a look at https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/499488/ please?
Manually tested on pinkunicorn and it doesn't seem to break local PURGEs (while actually returning 405 on PURGEs from the outside) [14:44:14] and reuses the challenges if it's able to do so [14:44:21] so it doesn't hit gdnsd twice [14:45:41] you can see an example from this morning, when apt certificate renewal was triggered [14:45:50] https://www.irccloud.com/pastebin/3mkNqFzp/ [14:46:30] at 08:00:03 gdnsd got the challenges for the ec-prime256v1 [14:46:53] ema: local_host is used (and wants the current definition) in wikimedia-frontend too [14:46:59] I think [14:47:24] oh you have that [14:47:27] bblack: right, the patch should update that too [14:47:55] and at 08:00:11 got a response from LE signaling that the challenges were already fulfilled for the rsa-2048 one, so acme-chief skipped sending them to gdnsd again [14:54:04] vgutierrez: when I look at the pasted failure and go through all the staging challenge URIs to look at the individual results on the SANs, every single one of them failed [14:54:32] yeah, 30 of 39, cause only 30 authzs are listed [14:54:37] and we request 39 SANs [14:54:38] there are 40 [14:55:09] (at the bottom of: https://acme-staging-v02.api.letsencrypt.org/acme/order/7090084/28568534 ) [14:55:29] oh wait, now I get 39 [14:55:40] dashes are annoying [14:56:39] but all of them report a failure [14:57:58] yep, checking the tokens reported there [14:58:08] they don't match at all with the ones passed to gdnsd [14:58:16] that's what got me puzzled [14:58:38] did you dig them at gdnsd too?
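[Editor's aside] Digging the challenges by hand means knowing which `_acme-challenge` names to query, and (as noted later in the log) a wildcard and its base domain collapse to the same TXT name. A minimal sketch that maps a SAN list to the dig commands one would run against gdnsd (the `challenge_names` helper is hypothetical, not acme-chief code):

```python
# Sketch (hypothetical helper): map a cert's SAN list to the
# _acme-challenge TXT names one would dig at gdnsd. A wildcard and its
# base domain collapse to the same name, which is why e.g.
# *.wikipedia.org and wikipedia.org stack two values under one record.
def challenge_names(sans):
    names = []
    for san in sans:
        base = san[2:] if san.startswith("*.") else san
        name = f"_acme-challenge.{base}"
        if name not in names:
            names.append(name)
    return names

sans = ["wikipedia.org", "*.wikipedia.org", "w.wiki", "*.wikiversity.org"]
for name in challenge_names(sans):
    print(f"dig @ns0.wikimedia.org {name}. TXT")
```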
[14:58:44] they're probably all expired out by now [14:58:46] yep [14:59:02] I'll write a script to dump all of them for proper debugging [14:59:16] 39 is a bit too much for manual testing [14:59:59] hang on, I'm still looking at some things before you bother re-resting [15:00:03] *re-testing [15:00:22] 10Traffic, 10Operations: Make authdns-update compatible with local emergency changes - https://phabricator.wikimedia.org/T219400 (10Volans) [15:00:33] bblack: followup of what we said yesterday ^^^ [15:01:57] heh I got side-tracked by my ISP, apparently they started intercepting all DNS reqs again (they weren't for a while, but used to long ago) [15:02:17] so I can't dig @some-remote-authserver and get a reliable response, it's always intercepted by an ISP cache that fucks things up :P [15:03:26] ouch, that sounds bad, especially for you :) [15:04:22] also, they're translating NXDOMAIN into SERVFAIL heh [15:04:33] (effectively. I guess it means their cache is broken) [15:04:48] that's confidence-inspiring [15:06:16] vgutierrez: do you have the staging order link for the ecdsa that worked? [15:06:52] nope [15:07:15] I got the challenges that acme-chief handed to gdnsd if that helps [15:08:21] yeah [15:08:26] one sec [15:09:04] https://www.irccloud.com/pastebin/DLNtCuVo/ [15:12:46] ok so the first thing to note is, I guess (unlike all past orders?), for this one you got different challenges for ecdsa-vs-rsa?
[15:13:20] indeed [15:13:34] note that this is the first one in a loong time that we've run against the staging environment [15:14:17] I guess I can run a small one like wikiba.se against the staging environment as well to see how it behaves nowadays [15:14:46] yeah might want to try, just to confirm that's the new LE behavior (which I guess means their prod will do so too, eventually) [15:14:58] from our point of view the dns challenges were fulfilled as expected [15:15:04] (acme-chief verifies that as well) [15:15:38] right [15:15:58] what I'm seeing just looking at w.wiki as 1/39 of the examples is: [15:16:07] the ecdsa order configured it for gdnsd as: [15:16:20] '_acme-challenge.w.wiki', 'adW92E-brLB7ZF2D44seEBENiXouxdau8WAJej1nxI8' [15:16:35] the rsa hit gdnsd as: [15:16:38] '_acme-challenge.w.wiki', 'wP0qO5iQZ5BJUZICN-n2B60vvLKMQAb8r_q-AZYOmgw' [15:16:56] and then the failed rsa order's output (the detailed link for that particular SAN verification fail) says: [15:16:56] bblack: (if it is not a good moment I'll come back another time) - I am finally rolling out interface::rps to the mc10XX, doing it one host at a time very slowly. mc1035 has been running for months with the setting without any issue reported. The question that I have is - I guess that rollback, in case of some weird issue, is not as simple as git revert right?
[15:17:09] "detail": "Incorrect TXT record \"adW92E-brLB7ZF2D44seEBENiXouxdau8WAJej1nxI8\" found at _acme-challenge.w.wiki", [15:17:34] so it seems to be seeing the ecdsa challenge value when it wants the rsa one [15:18:00] boulder loops over every TXT record for a hostname [15:18:03] and the timestamps of your two commands to send it to gdnsd are only 15s apart [15:18:13] so it can't have expired or whatever [15:18:27] some of the failures say "and 1 other", like: [15:18:51] "detail": "Incorrect TXT record \"eMqWBOXl3keVIxzYsYbMWTS6NMiAPZR9dyx0UP1VYkE\" (and 1 more) found at _acme-challenge.wikiversity.org" [15:19:09] if that wasn't the case we couldn't have gotten the first one issued either, cause *.wikipedia.org and wikipedia.org share the same challenge TXT hostname [15:20:32] note that in wikiversity's case with the wildcard, we have that double-entry case. [15:20:38] the failed rsa ones were: [15:20:44] '_acme-challenge.wikiversity.org', 'ye2OnnIbzY0P7MJF5eFYSWpPmxSxlpubCEM_cL736oQ' [15:20:47] '_acme-challenge.wikiversity.org', 'jkAnokTgkVqqSlF06kc0rF_NjH8ipDtk31mvajLizTo' [15:20:56] and the working ecdsa ones were: [15:20:58] '_acme-challenge.wikiversity.org', 'eMqWBOXl3keVIxzYsYbMWTS6NMiAPZR9dyx0UP1VYkE' [15:21:01] '_acme-challenge.wikiversity.org', 'gnlHgIuboB59HixQOtS-Bdd1WekIvXnmIdwryNLmRhw' [15:21:12] so again, the rsa failure notes "incorrect TXT record" showing the value from the ecdsa challenge [15:21:29] (well, one of them anyways) [15:21:36] and here is the fragment where boulder loops against all the TXT records returned: https://github.com/letsencrypt/boulder/blob/master/va/va.go#L568-L574 [15:22:07] it's possible this is just a gdnsd bug with the hacky challenge stuff, and it's not layering them like expected, at all, when they come from two different commands [15:22:25] should be easy to test that with fake names and values independently [15:23:08] (doing that now) [15:25:19] hmmm in simple scenarios it works fine anyways [15:25:30] maybe I need a
repro that's closer to what we're actually doing [15:27:46] vgutierrez: I can't trigger it with any simple-ish example (I've tried separate gdnsdctl invocations on a test instance, which configured 2x challenge for the same name in one command, and then do that a few separate times (separate commands) with all different values, and they all show up in dig as expected... [15:28:34] vgutierrez: yeah so maybe try to repro again, and capture via "dig @ns0.wikimedia.org _acme-challenge.w.wiki. TXT" or similar to confirm what LE should be seeing? [15:28:44] ack [15:29:00] it sure seems like gdnsd must not be returning the second set of tokens [15:29:07] (based on LE's error reports) [15:31:02] vgutierrez: does anything log the output of the gdnsdctl command? I'm kind of assuming we at least check its exit value. [15:31:15] nope, we just check the exit value [15:32:29] hmmm there is something weird in the LE staging environment [15:32:39] I can reproduce it as well with the way shorter wikiba.se cert [15:32:48] well [15:33:03] so I assume first of all, that the simple wikiba.se cert case did use separate challenges? [15:33:12] yes [15:33:14] and then they failed the same way? [15:33:19] indeed [15:33:20] (I guess ecdsa ok, rsa fail?) [15:33:23] ok [15:33:29] https://www.irccloud.com/pastebin/CDIQQjaL/ [15:33:36] and https://acme-staging-v02.api.letsencrypt.org/acme/authz/tW9M7Bg-P2qeN9aSVOFKBS4K__ZttyEG01nIuKxcv_4 [15:33:55] so it's all my fault for being a coward and testing in the LE staging environment /o\ [15:34:14] what? [15:34:56] I mean, things should work in staging right? :) [15:35:06] yeah, I'm joking :) [15:35:32] so I see you had two different "host" outputs there [15:35:49] and we should've expected the second one to have 4x TXT records, right? [15:36:00] why? [15:36:10] this cert asks only for wikiba.se and www.wikiba.se [15:36:21] so we should get 2 records per SAN [15:36:31] oh ok [15:36:38] so why the two different outputs?
[15:37:06] your paste has 4 total different TXTs showing, in two sets of two [15:37:11] yeah [15:37:16] oh, www [15:37:20] I missed that part [15:37:23] :) [15:38:01] any news from LE, maybe a blog post or draft standard update or something, that might explain new staging behavior? [15:39:26] oh it did get an RFC recently [15:39:36] https://datatracker.ietf.org/doc/rfc8555/ [15:40:12] hmmm [15:40:25] if that's the case... I should be able to trigger against the latest build of pebble [15:40:29] * vgutierrez checking [15:46:27] anyways, I suppose the double-challenge for different cert algorithms would be legal either way, just unexpected based on past observation. [15:46:43] the real question is why does it seem to not be looking through all the TXT records? [15:47:56] (or maybe it is, and maybe we still have some kind of timing issue we're not catching here about the sequence of events... is there some race issue going on in acme-chief, where it's not configuring the second set of challenge TXTs into the DNS until after the second challenge has failed?) [15:47:58] so.. what has me puzzled is that apparently the challenge expected for www.wikiba.se is NSeApe8U06qii9HI6dXV990rE7a1Gn3Iz4nxuHq6NS8 [15:48:21] but acme-chief sets SvDoS54YOUnapsstAetLXBJy1GEpUhEWDRZSkoGiago and SzUmUtlVwrEol3rggh5fzAv1BklR_kUR-jjyJcFVJYY [15:48:22] :/ [15:49:22] oh hmmm [15:49:33] are you sure that token field in the failed dns-01 response is the token it was looking for? [15:49:42] because that was also true in the w.wiki example if so [15:52:02] another dumb question from the outside POV: do we copy tokens from LE, or do we compute them independently in acme-chief somehow? [15:52:24] they're computed independently in acme-chief [15:52:33] hmmm [15:53:08] maybe something has changed about the computation, but only for the RSA case? what happens if you do just-rsa and no ecdsa with the staging wikiba.se test? [15:53:36] (or if you edit acme-chief code to do the rsa first?) 
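[Editor's aside] The value that acme-chief and LE have to compute independently is pinned down by RFC 8555 (section 8.4): the dns-01 TXT value is the unpadded base64url of the SHA-256 of the key authorization, which is the token joined with the base64url JWK thumbprint of the account key. A mismatch in any input yields exactly the "Incorrect TXT record" failures above. A stdlib-only sketch (token and thumbprint here are made-up placeholders):

```python
# Sketch of the dns-01 computation (RFC 8555 section 8.4):
#   TXT value = base64url(SHA-256(token || "." || base64url(JWK thumbprint)))
# with no '=' padding anywhere.
import base64
import hashlib

def b64url(data: bytes) -> str:
    """Unpadded base64url, as ACME requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def dns01_txt_value(token: str, jwk_thumbprint: bytes) -> str:
    key_authorization = f"{token}.{b64url(jwk_thumbprint)}"
    return b64url(hashlib.sha256(key_authorization.encode("ascii")).digest())

# Placeholder inputs, not real account material:
txt = dns01_txt_value("evaGxfADs6pSRb2LAv9IZf17Dt3juxGJ-PCt92wr-oA",
                      hashlib.sha256(b"fake-account-key").digest())
print(txt)  # 43 base64url characters, no padding
```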
I poked at the diff from draft-15 -> RFC, and there were some subtle changes that are hard to pin down. maybe acme-chief has some assumption like "the nonce will be the same for both", but it's not anymore [15:56:45] it could be the case that our acme library (the one developed by Let's Encrypt) is deprecated and can no longer be used against the staging environment [15:57:13] acme-chief builds a wrapper on top of that, but stuff like challenge response is delegated to LE code [16:00:26] what version of the library do we use? [16:02:02] I'm assuming it's some packaging of certbot/acme ? [16:02:05] https://github.com/certbot/certbot/commits/master/acme/acme [16:02:13] indeed, we're using 0.31.0 IIRC [16:02:19] 10netops, 10Operations: allow bast2002 to connect to mgmt network - https://phabricator.wikimedia.org/T219384 (10RobH) FYI: Please note that even when the ACL is setup, bastions allow SSH proxy but not HTTPS proxy. Alternatively, you can setup proxy via cumin servers to get both. This should be fixed (maki... [16:02:48] Installed: 0.31.0-1 [16:03:11] this commit is new in 0.32 and sounds suspiciously-possibly-related: [16:03:13] https://github.com/certbot/certbot/commit/339d034d6a5a57d296607795a4706203f81d7059#diff-7208aa0b55d4995dfb4012f8facddc03 [16:03:57] so I'm guessing that we shouldn't find any issue in the prod. environment [16:05:01] yeah but either way if our acme-chief is failing against staging, then we can't be far off (in time) from it failing with prod when they next update prod I'm guessing [16:06:13] seeing this.. I think I'm going to come up with an instance running everything against the staging environment [16:06:23] something like pybal-test [16:06:28] but for acmechief [16:07:48] it's not that easy to test locally, also..
pebble is not very helpful in this scenario [16:08:08] pebble master version doesn't work with acme 0.32.0 (latest version released) [16:10:07] odd [16:10:27] can we try acme-chief + acme 0.32.0 -vs- staging easily? [16:11:15] hmmm not that easily cause we don't have python3-acme 0.32.0 available in debian [16:12:19] bblack: patch merged with puppet disabled on A:cp, testing it on cp1008. Do you have any specific test in mind other than checking that: (1) can't PURGE from the outside anymore (2) purges from vhtcpd still work on both varnish-fe and varnish-be (3) X-Client-IP and X-Connection-Properties still go through [16:12:24] and I don't want to mess with our production environment that way [16:12:48] that's why a staging instance in Ganeti would be really useful :) [16:12:54] ema: sounds reasonable [16:13:36] vgutierrez: maybe just manually copy the library files somewhere or whatever. it would just be useful to know at this point I think, whether the library version is even the issue [16:16:33] 10Acme-chief: acme-chief fails to issue certificates against LE staging environment - https://phabricator.wikimedia.org/T219414 (10Vgutierrez) p:05Triage→03High [16:18:34] elukey: it's probably not as simple as a git revert, no [16:19:30] bblack: that could work... let me try triggering it on acmechief2001... to avoid messing too much with the active instance [16:19:43] elukey: it may be that the simplest way to revert would be to do the git revert, and then reboot the host. It can be done online of course, but probably very manual.
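[Editor's aside] "Manually copy the library files somewhere" can be done without touching the installed python3-acme package: unpack the newer acme release into a scratch directory and put it ahead of the distro version on sys.path before acme-chief imports it. A minimal sketch (the path and helper are made-up examples, not part of acme-chief):

```python
# Sketch: shadow the system-installed library with a locally unpacked
# copy (e.g. acme 0.32.0 from PyPI) by prepending its directory to
# sys.path. The path below is a hypothetical example.
import sys

def prefer_local_lib(path="/home/vgutierrez/acme-0.32.0"):
    """Prepend a locally unpacked library dir so it shadows the installed one."""
    if path not in sys.path:
        sys.path.insert(0, path)

prefer_local_lib()
# import acme  # would now resolve against the local copy first
```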
[16:21:02] bblack: ah okok, revert+reboot seems like something easy that can be done if things get bad for some reason, without the need to call you crying for help :) [16:21:20] I don't expect any problem, but I'll roll out the change very slowly [16:23:25] thanks :) [16:28:53] 10netops, 10Operations: allow bast2002 to connect to mgmt network - https://phabricator.wikimedia.org/T219384 (10ayounsi) a:05Dzahn→03ayounsi [16:35:20] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Smalyshev) > the cache we are talking about there would be unnecessary if the wdqs just hit varnish. It is probl... [16:37:22] bblack: same issue with 0.32.0 [16:41:34] have you tried rsa-only, or rsa-then-ecdsa? I'm curious if rsa is even the factor, or if it's just whichever is second [16:42:04] I'm pretty sure that rsa-only would work.. let me test it [16:42:21] I mean, I suspect what's happening in the original failure is that for 1/2 cert algs, acme-chief is calculating a different token than LE staging is, but that's not a sure thing yet [16:44:05] if the above does turn out to be correct, then either the RSA calculation is wrong, or just the second algorithm for which we're issuing the same basic cert is wrong [16:45:37] bblack: change applied, thanks for your help in today's episode of "let's play with fire"! :) [16:47:18] 10netops, 10Operations: allow bast2002 to connect to mgmt network - https://phabricator.wikimedia.org/T219384 (10Dzahn) @Robh I can't confirm this. I can proxy via bast2002 just like i can via bast2001. Using "`ssh -D 8081 bast2002.wikimedia.org` and setting my browser's proxy settings to SOCKS5 and "localhost... [16:53:44] bblack: rsa only works as expected [16:56:59] this is going to require some fine debugging :) [16:59:25] vgutierrez: earlier you mentioned something about....
in the common case today when we issue ecdsa,rsa and LE happens to ask for the same authorization for the second one, acme-chief doesn't bother with the redundant DNS provisioning? [16:59:39] vgutierrez: do you have some pointer to where that optimization happens? [16:59:49] sure [17:03:22] https://github.com/wikimedia/acme-chief/blob/master/acme_chief/acme_requests.py#L402 [17:04:26] oh I see [17:04:32] basically LE returns the order to acme-chief in state ready if challenges are already satisfied [17:04:41] this doesn't happen in this case [17:07:48] seeing that we handle the number of challenges as expected [17:07:56] I'll try against the prod environment tomorrow morning [17:08:09] and I'll debug all this mess later [17:09:42] ok [17:10:02] thx for your help :) [17:24:59] np! :) [20:07:13] 10HTTPS, 10Traffic, 10Operations, 10Toolforge, 10Patch-For-Review: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10bd808) Live config from tools-proxy-03.tools.eqiad.wmflabs shows only the `add_header Strict-Transport-Security "max-age=86400";` in the...
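[Editor's aside] The reuse optimization discussed above (around acme_requests.py#L402) amounts to: only push TXT records for authorizations that are still pending, and skip gdnsd entirely when LE hands back the order already in "ready" state. A stand-alone sketch of that logic with simplified stand-in data shapes, not the real acme-chief/acme library objects:

```python
# Sketch of the reuse optimization: after creating an order, only push
# dns-01 challenges for authorizations still "pending". If LE already
# marked them (or the whole order) satisfied - as it used to for the
# second key type - there is nothing to send to gdnsd.
def challenges_to_push(order):
    """Return the dns-01 challenge tokens that still need DNS provisioning."""
    if order["status"] == "ready":  # all authzs already satisfied
        return []
    return [authz["challenge"]
            for authz in order["authorizations"]
            if authz["status"] == "pending"]

order = {
    "status": "pending",
    "authorizations": [
        {"status": "valid", "challenge": "tok-A"},    # reused from ecdsa run
        {"status": "pending", "challenge": "tok-B"},  # must go to gdnsd
    ],
}
print(challenges_to_push(order))  # ['tok-B']
```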