[09:29:30] Acme-chief: Implement server-side OCSP stapling - https://phabricator.wikimedia.org/T219765 (Vgutierrez) [10:26:32] interesting.. globalsign embeds the certificate of the OCSP responder in the OCSP stapling response [10:27:27] https://www.irccloud.com/pastebin/DykYiZFn/ [10:28:34] of course.. that affects the OCSP stapling size a lot... 527 bytes for LE versus 1607 bytes for the globalsign one [10:56:24] Acme-chief, Patch-For-Review: Implement server-side OCSP stapling - https://phabricator.wikimedia.org/T219765 (Vgutierrez) p:Triage→Normal [13:17:46] Krenair: the issue regarding notify => Service['nginx'] is that it is triggering an nginx restart [13:18:01] and that's a no-go, it doesn't matter if the acme-chief is being used to serve traffic or not [13:18:08] oooh wait [13:18:12] a restart instead of reload? [13:18:16] yep [13:18:16] indeed [13:19:09] vgutierrez: what's jerkins problem? [13:19:15] indentation [13:19:15] indentation [13:19:17] fixing it right now [13:19:52] so how do we do this in prod right no [13:19:54] now [13:20:07] there is some mechanism to reload nginx if the cert changes in puppet right? [13:20:25] or does it happen so rarely that someone rolls the cert out in one step and then uses cumin or something to reload stuff?
[13:20:52] Krenair: OCSP stapling puppetization reloads nginx every 12 hours [13:21:02] ok [13:21:14] fixed in PS2 ema :) [13:21:19] that should be sufficient for renewals [13:21:32] may cause confusion around deliberate changes to the cert [13:21:38] indeed [13:21:47] we should review our current nginx based acme-chief clients [13:21:54] cause I'm assuming we're causing restarts there [13:22:23] is not that worrisome compared to cp servers but still [13:27:24] nice, now puppet/nginx behaves as expected on a NOOP puppet agent run [13:29:15] ema: https://gerrit.wikimedia.org/r/c/operations/puppet/+/499780/20 this could be merged now without messing with eqsin cache servers [13:30:08] looking [13:32:46] vgutierrez: what do you mean when you say "without messing with [...]"? :) [13:33:01] without breaking them :) [13:33:13] pink unicorn is tamed and playing with acme-chief as expected [13:33:42] so we shouldn't have nasty surprises on cp5* [13:34:25] of course I'll stop puppet in cp eqsin servers before merging that and let puppet run in one server from each cluster before re-enabling it again [13:35:29] vgutierrez: sounds good. Maybe depool the host first too, to err on the side of caution [13:35:37] of course [13:36:22] vgutierrez: also remember that the "Notice: tlsproxy [...]" and "Notice: /Stage[main]/Profile::Cache::Ssl::Unified [...]" lines are expected, we get them at every puppet run :) [13:36:30] yes :) [13:36:50] vgutierrez: +1 [13:37:59] I'm being paranoid and re-running pcc first [13:38:27] cause the pcc run posted in that CR predates abandoning my change on tlsproxy and adopting Krenair patch [13:38:34] vgutierrez: I did already https://puppet-compiler.wmflabs.org/compiler1002/15462/ [13:38:50] I love the extra paranoid mode [13:47:52] cp5001 depooled... let's run puppet there [13:51:37] vgutierrez: looks fine I think? 
[13:52:19] puppet run as expected, acme-chief unified cert has been deployed and stapled as expected \o/ [13:52:30] nice :) [13:52:42] cool [13:53:09] what is the next step? [13:53:13] first acme-chief client in eqsin BTW [13:53:14] roll out to other DCs? [13:53:37] nope, for the time being it will remain on eqsin servers [13:53:47] wikibase cert will hit every DC though [13:53:59] as soon as I make ema's reviews happy [13:54:00] is the next step to serve traffic in eqsin with it then? [13:54:18] yes, but not before Q4 [13:54:26] we need to test that the renewal is handled as expected :) [13:54:40] Q4 is what, April-June? [13:54:43] indeed [13:54:48] we're in Q4 [13:55:03] not before the first renewal cycle then :) [13:55:06] :) [13:55:08] this is true [13:55:11] so in 3 months more or less [13:55:22] we wanna play safe with the cp servers [13:55:48] but the wikiba.se cert will be a nice testing environment [13:55:54] yeah [13:55:55] or a nice test case [13:57:37] so realistically Q1 for unified LE certs live then [13:58:00] indeed [13:58:18] the goal for Q3 (ended on Friday) was deploying the cert and get it stapled successfully [13:58:24] not serving traffic with it [13:58:45] running puppet on cp5007 (text cluster) [13:58:49] k [13:59:03] it should be the same as cp5001... [13:59:17] says the incident report [13:59:22] ahahha [14:00:53] so this is the current service definition for nginx [14:00:58] https://www.irccloud.com/pastebin/o1Lc4ehK/ [14:02:00] I'm wondering if we should set the restart command to issue a reload [14:02:31] (everything looking good in cp5007) [14:04:50] ema: let's reenable puppet on eqsin cp servers? [14:04:55] vgutierrez: +1 [14:06:59] vgutierrez: I imagine there's a good reason why we're using things like `before => [Service['nginx'], Exec['nginx-reload']]` instead of having notify trigger a reload? 
[14:07:25] hmmm maybe is a good but ancient|deprecated reason [14:07:36] but yeah, I'm pretty sure that's not a coincidence [14:07:47] where is Exec['nginx-reload'] defined BTW? [14:08:13] hmm I saw that before... [14:08:17] one sec [14:09:24] https://github.com/wikimedia/operations-puppet-nginx/blob/master/manifests/init.pp#L61-L64 [14:10:44] ah! [14:10:46] yeah there's some old complex hacks, which should someday be refactored, around trying to make initial installs work better, that interact with all that [14:11:23] but really step 1 should probably be to un-submodule nginx [14:11:36] so that the rest of the related CRs can make sense heh [14:11:43] yes :) [14:18:07] ema: regarding $monitoring on wikibase.pp.. that's the same behaviour that we have for unified.pp [14:18:21] so.. I could fix that providing two checks.. one with ocsp stapling and one without it [14:20:00] wikibase has ocsp right? [14:22:10] ema: also while looking at related things, so the trafficserver/backend.yaml, for A/A cases like these tends to only define one side and comment out the other? I guess we decided to rely on disc-dns and many services don't have it? [14:22:54] bblack: yes [14:23:13] bblack: BTW, I've realized that the OCSP response provided by LE is way smaller than the one provided by globalsign [14:23:18] bblack: yes (re: backend.yaml) [14:23:26] see https://www.irccloud.com/pastebin/DykYiZFn/ [14:23:32] vgutierrez: yeah, digicert also gives the smaller form, only globalsign has the huge version [14:23:43] ack :) [14:23:55] nice improvement in the TLS handshake I guess [14:23:57] vgutierrez: but we've looked at it before, and while it's not ideal, it's not enough to push us over the edge of requiring another RTT on estab [14:24:12] (assuming IW10, which is mostly a decent assumption these days) [14:25:02] vgutierrez: why does the $letsencrypt side of the wikibase branch set do_ocsp => false?
[14:25:24] ema: cause the old LE puppetization doesn't provide OCSP stapling support [14:25:38] booo [14:26:03] we don't need the old LE puppetization at all though, we could/should kill it from tlsproxy to be less-confusing [14:26:09] err.. you're booing the wrong guy ;P [14:26:11] unless deployment-prep still relies on it? [14:26:35] bblack: i think that's still the case, and I've kept it to be consistent with unified.pp [14:26:49] we can get rid of it in both unified and wikibase on a following commit [14:26:54] ok [14:27:27] so.. I've refactored wikibase.pp to be able to provide useful monitoring for both acme-chief and letsencrypt [14:27:41] and make ema happy [14:27:49] (he is one tough reviewer) [14:29:02] yeah, so... the monitoring variant without ocsp is for? [14:29:15] $letsencrypt [14:29:39] which we think we don't actually use anywhere? [14:30:42] I guess Krenair will use it as soon as we merge that into production in the deployment-prep environment [14:31:10] sok [14:31:23] vgutierrez: your adulation worked and you got a +1 [14:31:26] so deployment-prep *does* still use the old LE support? [14:32:01] ema: https://gerrit.wikimedia.org/r/c/operations/puppet/+/499823/ needs it as well [14:33:00] bblack: so wikibase.pp it's brand new, as soon as we merge it is going to break puppet in deployment-prep and it will require setting the proper hieradata to pick letsencrypt or acme_chief [14:33:16] I'm not aware of the current state there TBH [14:33:26] Krenair can provide more information about that environment [14:39:21] bblack: I see that we support "If-Cached: $etag" in vcl. It doesn't look like anyone is using it though, and the standard way to do similar stuff seems to be `CC: only-if-cached`? Can/should we get rid of it? [14:40:31] ema: IIRC the reason for If-Cached was something to do with some Swift-related scripts.... something about using the caches to speed up replication or something like that...
[14:40:43] ema: whether we still use those scripts, etc... I have no idea [14:41:09] (or if we do, how easy it would be to switch them to some other mechanism [14:41:32] ) [14:41:50] maybe godog remembers something related [14:43:11] ooh interesting [14:47:37] bblack: yes! https://github.com/wikimedia/operations-software/blob/master/swiftrepl/swiftrepl.py#L64 [14:50:55] Acme-chief, Traffic, Operations, Goal, Patch-For-Review: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (Vgutierrez) [14:52:32] https://www.irccloud.com/pastebin/4N82ZemA/ [14:52:35] * vgutierrez amazed [14:53:03] https://pinkunicorn.wikimedia.org still works.. so it's not a total disaster [14:53:04] ema: I think the problem with CC: only-if-cached is it wouldn't give an exact etag match (could be a cached older version than swift wants) [14:54:08] willikins:~ vgutierrez$ echo|openssl s_client -connect pinkunicorn.wikimedia.org:443 -servername wikiba.se 2>/dev/null|openssl x509 -noout -issuer -subject [14:54:08] issuer= /C=US/O=Let's Encrypt/CN=Let's Encrypt Authority X3 [14:54:08] subject= /CN=wikiba.se [14:54:13] bblack: yeah, I had no idea this was actually part of something we use, I thought it was some nice-to-have way for the world to ask for cached stuff [14:54:32] * vgutierrez dances a little bit on the chair [14:54:36] yeah but cp1008 goes to the eqiad ATS-be's heh [14:54:40] and they say can't connect [14:54:57] < HTTP/2 502 [14:55:01] < server: ATS/8.0.3 [14:55:05] < x-cache: cp1071 miss, cp1008 miss [14:55:05] < x-cache-status: miss [14:55:11] yeah.. probably because they didn't get the puppet change yet?
[14:55:25] and an HTML error message that probably needs unification with our standard error outputs eventually heh [14:55:27] the eqiad ATS-be's I mean [14:56:03] I've included the ATS patch in the wikiba.se change [14:56:19] meaning https://gerrit.wikimedia.org/r/c/operations/puppet/+/499825/12/hieradata/role/common/trafficserver/backend.yaml [14:56:21] oh yes the ATS error message! Funny how things todo just find you without even looking [14:56:30] OCSP does work [14:57:10] yeah I guess cumin the ats-bes and get them updated [14:57:59] vgutierrez: they're A:cp-ats [14:58:17] lovely .D [14:58:26] bblack@haliax:~/repos/puppet$ curl --cert-status -4 -vI https://wikiba.se/ --resolve wikiba.se:443:208.80.154.42 [14:58:43] [...] [14:58:45] * Server certificate: [14:58:46] * subject: CN=wikiba.se [14:58:46] * start date: Mar 28 13:09:12 2019 GMT [14:58:46] * expire date: Jun 26 13:09:12 2019 GMT [14:58:46] * subjectAltName: host "wikiba.se" matched cert's "wikiba.se" [14:58:48] * issuer: C=US; O=Let's Encrypt; CN=Let's Encrypt Authority X3 [14:58:50] * SSL certificate verify ok. [14:58:53] * SSL certificate status: good (0) [14:58:55] looks good! [14:58:59] \o/ [14:59:30] wonderful [15:00:15] I wonder how acme-chief-api will behave with ~80 new clients [15:00:39] perfectly of course, because it's high-performance and bug-free like all production software :) [15:01:09] keep that in mind for the upcoming performance review }:) [15:01:13] haha [15:01:30] (it's that time of the year again) [15:02:03] (if it starts failing 1/N API requests, call it a ratelimiting feature to prevent abuse via random request denial ftw) [15:03:24] curl --resolve wikiba.se:443:208.80.154.42 https://wikiba.se -o /dev/null -v it's giving a 200 OK here [15:04:21] I get a 200 for wikiba.se but a 502 for www.wikiba.se [15:04:30] I guess ATS-be stuff doesn't match the www [15:04:50] right [15:05:11] I wonder if we're even supposed to support that directly, or redirect it? 
[15:05:12] I dunno if our wikiba.se puppetization itself supports www.wikiba.se [15:05:15] guess could check existing behavior [15:05:42] the existing HTTP-only site (hosted elsewhere) serves both name variants with the same content, no redirect [15:06:00] yeah [15:06:09] arguably, https://wikiba.se/ is canonical and the other should just be a redirect (but redirect in the static-misc puppetization, not in the caches, I think) [15:06:18] but in puppet I've only found references to wikiba.se (without the www) [15:07:10] mainly https://gerrit.wikimedia.org/r/c/operations/puppet/+/389949 [15:07:42] that's why I've omitted the www.wikiba.se entry [15:08:41] we should probably support it, but I don't see an existing microsite example that has aliases [15:08:48] I can sync with daniel tomorrow regarding that [15:08:58] till we don't swap the DNS records it shouldn't be a problem [15:09:17] right [15:09:33] so... green light for https://gerrit.wikimedia.org/r/c/operations/puppet/+/499981/? [15:09:43] of course disabling puppet first and being careful :) [15:09:58] (I don't want that t-shirt with the last commit) [15:10:08] So right now, the www.wikiba.se support exists in the cert SAN list, and in the tlsproxy configuration, and I guess in varnish-be? [15:10:28] not in varnish-be nor ATS-be [15:10:32] ok [15:10:46] regarding varnish/ATS only wikibas.se exists [15:10:49] *wikiba.se [15:11:53] yeah so we should probably fix that at the varnish/ats-be level, but it can be done separately/after [15:12:08] hmmm the microsite supports www.wikiba.se?
[15:12:20] it doesn't, but that's a separate concern [15:12:26] either way it's going to fail until the whole stack has it [15:12:35] right [15:12:59] I'll ping daniel tomorrow regarding that and I'll proceed now with deploying wikiba.se along our text cluster [15:13:00] but we can do the varnish/ats part easily and independently, without wading into "how do you add redirected alias support to the standard microsite setup" [15:13:03] right [15:13:06] +1 [15:13:32] so.. I'll disable puppet, get that merged, depool one node in eqsin/ulsfo, and see what happens there [15:13:35] looks good? [15:13:37] or they can choose, I guess, to just have www as an alias and serve two copies of the content, but the 301 probably makes more sense. [15:13:45] vgutierrez: yes [15:13:56] awesome [15:14:38] BTW, we don't have HSTS at all in wikiba.se via pinkunicorn [15:15:09] oh, good catch [15:15:17] it should be added to our canonicals support in VCL for that [15:15:38] hmm HSTS isn't handled by nginx ssl_ciphersuite() populated settings? [15:15:54] no [15:16:06] for the cache clusters, it's handled in VCL, as is the HTTP->HTTPS 301 [15:16:20] I can throw up a patch right quick [15:16:32] ack [15:16:46] I mean, providing HTTPS for wikiba.se is already an improvement :) [15:16:58] but yeah, hsts should be there as well [15:17:03] vgutierrez: in case you want to make cp1008 use varnish-be instead of ATS, s/cp1071/cp1008/ in hieradata/hosts/cp1008.yaml [15:20:15] Traffic, Operations, monitoring: prometheus: slow dashboards due to suboptimal query_range performance - https://phabricator.wikimedia.org/T190992 (Volans) @ema given the speedup due to prometheus 2 do you think this still needs to be worked on or could be resolved?
[15:20:30] https://gerrit.wikimedia.org/r/c/operations/puppet/+/500472 [15:20:51] hmm that will result in wikiba.se being in the HSTS preload list [15:21:03] I know that's a requirement for our canonical domains [15:21:09] but I don't know if wikiba.se is already there [15:21:16] (ready to be in the HSTS preload list) [15:21:39] well it will result in it being possible to preload it [15:21:56] but yes, it's a good point, in theory the wikibase folks could still argue about the domain ownership transfer and back out of this [15:22:13] and the max-age set by us is kinda aggressive for a new domain [15:22:30] eh, for this case because of the above, maybe [15:22:40] in general, HTTPS is all that matters anymore [15:23:08] of course, in the general case the max-age is ok [15:26:16] splitting... [15:27:54] technically the 301 pollutes browsers too, so I guess we should hold on both in this particular transition case [15:28:08] since the existing site doesn't listen on 443 at all :P [15:28:13] yeah [15:31:22] targeting cp5007 for the first puppet run [15:31:39] cp1008 has been a NOOP as expected [15:33:49] https://www.irccloud.com/pastebin/JTZbixJu/ [15:33:52] cp5007 looks happy [15:35:46] vgutierrez@bast3002:~$ curl --resolve wikiba.se:443:10.132.0.107 https://wikiba.se -v -o /dev/null --> that looks as expected [15:37:20] and openssl s_client -connect 10.132.0.107:443 -servername wikiba.se -status looks good as well [15:41:46] let's hit one node on every dc [15:47:29] cp4032 looking good as well [16:01:25] Traffic, Operations, monitoring: prometheus: slow dashboards due to suboptimal query_range performance - https://phabricator.wikimedia.org/T190992 (ema) Open→Resolved a:ema >>! In T190992#5074344, @Volans wrote: > @ema given the speedup due to prometheus 2 do you think this still needs t...
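Note on the HSTS exchange at 15:20-15:23: the distinction drawn there is that an HSTS header with a preload directive makes a domain eligible for the preload list rather than putting it on the list. A minimal Python sketch of that eligibility check follows. The one-year threshold and the required directives are taken from the public preload submission requirements as an assumption, not from this log, and both header values are hypothetical examples, not what the patch under review sets.

```python
from typing import Dict, Optional

# Minimum max-age for preload eligibility (one year, in seconds).
# Assumption: taken from the public preload submission rules, not from the log.
PRELOAD_MIN_MAX_AGE = 31536000


def parse_hsts(value: str) -> Dict[str, Optional[str]]:
    """Split a Strict-Transport-Security value into lower-cased directives."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


def preload_eligible(value: str) -> bool:
    """True if the header would allow (not cause) a preload submission."""
    d = parse_hsts(value)
    try:
        max_age = int(d.get("max-age") or 0)
    except ValueError:
        return False
    return (
        max_age >= PRELOAD_MIN_MAX_AGE
        and "includesubdomains" in d
        and "preload" in d
    )


# Hypothetical header values, for illustration only.
AGGRESSIVE = "max-age=31536000; includeSubDomains; preload"
CAUTIOUS = "max-age=86400"

print(preload_eligible(AGGRESSIVE))
print(preload_eligible(CAUTIOUS))
```

A short max-age without the preload directive, like the second value, is the kind of cautious setting the "kinda aggressive for a new domain" remark points towards while the domain transfer can still be backed out.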
[16:01:34] Traffic, Operations, monitoring: prometheus-based graph significantly slower than statsd equivalent - https://phabricator.wikimedia.org/T212312 (ema) Open→Resolved a:ema [16:03:07] cp2023 looking good as well \o/ [16:50:44] Acme-chief, Traffic, Operations, Goal, Patch-For-Review: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (Vgutierrez) [16:51:01] 66% done, we should get the non-canonical certs issued (tomorrow) [16:56:02] bblack: so.. I managed to deploy a SNI based vhost in the text cluster without breaking anything apparently [16:56:18] thanks ema & Krenair :D [16:58:57] :) [16:59:59] you need to work on your tshirt-winning skills! [19:01:45] Traffic, netops, Operations, Patch-For-Review: ulsfo <-> codfw transit link flapping causing nginx availability alerts - https://phabricator.wikimedia.org/T219591 (ayounsi) Open→Resolved Link has been up for 1+ day. Got a notification saying the emergency maintenance was done. [21:28:06] netops, Operations: Add eqsin routing special cases to jnt - https://phabricator.wikimedia.org/T211930 (ayounsi) `lang=diff,name=cr1-eqsin [edit protocols bgp group Transit4] - import [ BGP_sanitize_in BGP_transit_in BGP_avoid_long_RTT_in BGP_community_actions ]; + import [ BGP_sanitize_in BGP_tran... [22:21:14] netops, Operations: Add eqsin routing special cases to jnt - https://phabricator.wikimedia.org/T211930 (ayounsi) `lang=diff,name=cr2-eqsin [edit protocols bgp group Transit4] - import [ BGP_sanitize_in BGP_transit_in BGP_avoid_long_RTT_in BGP_community_actions ]; + import [ BGP_sanitize_in BGP_tran... [23:01:27] netops, Operations: Add eqsin routing special cases to jnt - https://phabricator.wikimedia.org/T211930 (ayounsi) Open→Resolved Pushed progressively and confirmed with the looking glasses that only the proper communities were received on the other side.
As well as the proper local_pref was applied...
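Note on the OCSP size discussion (10:28:34 and 14:23-14:24): the claim there is that the larger GlobalSign staple is not enough to push the TLS handshake past the initial congestion window under IW10. The arithmetic can be sketched in Python as below. The two staple sizes come from the log; the MSS of 1460 bytes and the size of the rest of the server's first flight are assumptions picked for illustration, so the result is a rough estimate, not a measurement.

```python
import math

MSS_BYTES = 1460        # assumed TCP payload per segment (1500 MTU, IPv4, no options)
INITCWND_SEGMENTS = 10  # IW10, as assumed in the log

# OCSP staple sizes quoted in the log at 10:28:34.
OCSP_LE_BYTES = 527
OCSP_GLOBALSIGN_BYTES = 1607


def segments_needed(flight_bytes: int, mss: int = MSS_BYTES) -> int:
    """Number of full-size TCP segments needed to carry flight_bytes."""
    return math.ceil(flight_bytes / mss)


def fits_in_initial_window(flight_bytes: int,
                           mss: int = MSS_BYTES,
                           initcwnd: int = INITCWND_SEGMENTS) -> bool:
    """True if the server's first flight fits without waiting for an ACK."""
    return segments_needed(flight_bytes, mss) <= initcwnd


# Hypothetical size of everything else in the server's first flight
# (ServerHello, certificate chain, key exchange); not taken from the log.
OTHER_HANDSHAKE_BYTES = 5000

for name, ocsp in (("LE", OCSP_LE_BYTES), ("GlobalSign", OCSP_GLOBALSIGN_BYTES)):
    total = OTHER_HANDSHAKE_BYTES + ocsp
    print(name, total, segments_needed(total), fits_in_initial_window(total))
```

The difference between the two staples is 1080 bytes, less than one assumed segment, which is consistent with the remark that the larger response is not ideal but does not by itself cost an extra round trip.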