[07:54:18] morning! Just rebooted cp2010, it was frozen for some reason [07:54:36] checked varnishlog and pooled=yes entries, didn't find anything weird [09:56:49] <_joe_> win 19 [09:56:54] <_joe_> grrr damn synergy [10:09:42] :) [10:33:19] elukey: thanks! but also, please depool first in a case like that [10:33:23] 07:47 < icinga-wm> PROBLEM - Freshness of OCSP Stapling files on cp2010 is CRITICAL: CRITICAL: File /var/cache/ocsp/unified.ocsp is more than 18300 secs old! [10:33:26] 07:49 < elukey> cp2010 seems working fine, pooled and varnishlog shows traffic [10:33:34] 08:00 < icinga-wm> RECOVERY - Freshness of OCSP Stapling files on cp2010 is OK: OK [10:34:09] (depool until healthchecks are ok) [10:35:43] ah snap I thought it was ok that OCSP was briefly out of date [10:39:16] it can be, but it depends [10:39:29] bblack: what service(s) should I depool? service=nginx or all? [10:39:42] right, didn't think about it sorry [10:40:00] for the general-case unthinking rule, depool the whole machine until healthchecks are OK (and maybe until crash is understood too) [10:40:33] for OCSP nginx would be enough, and even then, could also manually verify that the stale OCSP was OK, or manually refresh it [10:40:33] sure, will do it the next time [10:40:50] if the stale OCSP isn't ok, FF users get errors [10:41:16] (well at least FF users, possibly others depending on how bad the OCSP data is) [10:44:12] bblack: sure, sorry I'll be more careful next time (== depool and wait) [11:20:38] <_joe_> wow I was about to say "brandon is here sooner and sooner" [11:20:54] <_joe_> then realized we are just 5 hours away now [11:21:12] <_joe_> so it's damn early but not as much as I thought :P [11:22:00] :) [11:22:13] yeah I also thought I woke up in the middle of the night [11:22:23] s/I woke/he woke/ [11:22:30] we're too lazy to bother getting our extra hour of sleep back sooner [11:23:05] <_joe_> but hopefully, next week america will be great again [11:23:13] lol [11:23:25] <_joe_> couldn't resist, sorry :P [11:23:33] next week on tuesday! [11:23:47] <_joe_> I don't think I'll sleep much [11:23:49] <_joe_> that night [11:23:56] I doubt anyone will [11:24:11] <_joe_> it's one of the two times I won't go to sleep waiting for things to happen in the US [11:24:25] the race isn't as close as it looks, though. the numbers are close, but projections are putting clinton's odds of winning at 90% [11:24:28] <_joe_> when there is a presidential election and when the Spurs play the NBA finals :P [11:24:42] <_joe_> bblack: I fear the poll-shame effect [11:24:52] <_joe_> that happened with Berlusconi on his first election here [11:25:02] <_joe_> people felt judged for wanting to vote for him [11:25:16] <_joe_> by "liberals and radical-chic elites" [11:25:32] <_joe_> so the polls before voting day dramatically underrepresented his success [11:25:39] https://twitter.com/UpshotNYT/status/791606382189502464/photo/1 [11:26:12] <_joe_> I usually use http://projects.fivethirtyeight.com/2016-election-forecast/ [11:26:15] the electoral college system works out very differently than the random popular-vote polls do [11:26:22] <_joe_> and jeez, clinton lost a lot of ground [11:27:31] all of those maps are scary on a different level about how divided the country is in the long term [11:28:40] e.g.
in the 538 one, clinton has a 0.2% chance of winning Oklahoma, and Trump has a <0.1% chance of winning california [11:28:54] it's not good when state populations are so dramatically out of sync in their political opinions :( [11:28:58] <_joe_> yes [11:29:18] <_joe_> well it's pretty natural everywhere, I mean italy has the same [11:30:16] <_joe_> basically it's a conservative catholic country with leftist-liberal major cities (that's mixed in milan, though), and a "red belt" of former socialists in the center [11:31:05] <_joe_> and yes, it's bad [11:31:13] <_joe_> sorry for the OT [11:31:14] <_joe_> :P [11:32:06] eh [11:32:15] it's the same here underneath it all [11:32:29] it's mostly the battle between the urban and the rural [11:32:52] <_joe_> sorry, "red" is usually associated with social-democrats here, and blue with conservatives, which makes all these kinds of discussions pretty confusing :P [11:32:53] there's a long-term trend of urbanization, and then a backlash of those holding to rural values and lifestyle [11:34:00] that's a lot of the longer-term undercurrent, whoever the candidates and whatever the issues-of-the-day are here [11:34:38] (and the electoral college gives a bit of a nudge to the rural vs their weight in raw popular-vote terms) [11:35:45] the map shows it plain as day though. the blue states here are mostly the ones dominated by urban populations or in their sphere of influence, and vice-versa [11:37:18] relevant: http://waterfordwhispersnews.com/2016/10/21/clinton-set-to-become-first-woman-president-to-use-nuke/ [11:38:29] <_joe_> bblack: well texas has big cities, I wouldn't call it "rural" [11:38:56] yeah Texas is an outlier in that regard. You would think the major metro areas it has would turn it blue based on the above alone. [11:39:11] (and the major metro areas in TX do vote blue in local matters) [11:39:25] <_joe_> also half the population is of latino descent, or so, that's why I can't understand how people can vote for Trump [11:39:38] <_joe_> well actually TX was blue until the 70s IIRC? [11:39:53] well sure but everything was different the further back in history you go [11:40:18] if you change your cutoff date, the democrats were pro-slavery and the republicans were the ones more aligned with modern civil rights, too [11:42:27] in any case, CA is comparable in size and scope, but CA's population is 95% urban and TX's population is only 85% urban (2010 numbers) [11:43:27] oklahoma from the earlier example above is 66% urban [11:45:25] I guess there's some tipping point where a state's finances and politics are dominated by urban issues (or by the gravity well of other nearby states' urban issues), to flip it to the blue side. [11:46:03] and in TX, a lot of the "urban" population is still heavily involved with rural issues, so the numbers are fuzzier there.
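[Editor's note: to make the depool step discussed above (10:33-10:44) concrete, here is a minimal sketch using conftool's confctl. The selectors and hostnames are illustrative assumptions, not a verified runbook for this cluster.]

    # Depool the whole machine from all services until healthchecks are OK
    # (the general-case unthinking rule described above).
    confctl select 'name=cp2010.codfw.wmnet' set/pooled=no

    # For a pure OCSP-staleness problem, depooling only the TLS terminator
    # would be enough:
    confctl select 'name=cp2010.codfw.wmnet,service=nginx' set/pooled=no

    # Repool once Icinga healthchecks recover (and the crash is understood):
    confctl select 'name=cp2010.codfw.wmnet' set/pooled=yes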
[11:51:32] <_joe_> https://www.bloomberg.com/news/articles/2016-10-31/centurylink-agrees-to-buy-level-3-for-34-billion-in-cash-stock this is huge [11:52:15] jeez [11:54:22] was looking at the openssl read_ahead issue in boringssl's fork [11:54:49] the main reason read_ahead is necessary is DTLS, although it's a supposed efficiency win to do it for the non-DTLS case [11:56:06] but boringssl basically turned it on for DTLS and killed it for traditional TLS (making the compatible call a no-op), and then they now have a TODO to split the SSL_read_n() code to make it less confusing and mixed between the readahead and non-readahead cases [11:56:27] (aka DTLS and TLS cases) [11:56:41] that was ~1.5 years ago [11:56:53] how very prescient of them :P [11:58:21] in the long view, it probably makes more sense to concentrate those efficiency hacks in the DTLS direction anyways. that will probably be the future a few years down the road, and already increasingly is for Google since they control both ends with QUIC [11:58:37] but even for the rest of us, we're going to end up on IETF-QUIC w/ DTLS-1.3 [11:59:24] ( https://tools.ietf.org/html/draft-hamilton-early-deployment-quic-00 ) [12:00:03] "QUIC current handshake will be replaced by TLS 1.3 in the future" [12:01:19] 10Traffic, 06Operations, 13Patch-For-Review: nginx SSL_do_handshake spam filling disks - https://phabricator.wikimedia.org/T148893#2756139 (10BBlack) 05Open>03Resolved a:03BBlack wmf13 nginx package fixes this [13:32:24] 10Traffic, 06Multimedia, 06Operations, 15User-Josve05a, 15User-Urbanecm: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917#2756346 (10BBlack) >>! In T148917#2739171, @BBlack wrote: > Can anyone still repro this is... [13:37:04] bblack: is the stream.wm.org certificate expiry handled by you and/or robh? [13:38:31] paravoid: stream.wm.o doesn't need a separate cert anymore [13:39:01] is there a ticket on it already? [13:39:13] I don't know, I'm just seeing an icinga alert [13:39:50] where? [13:40:11] in any case, that would be our unified [13:40:14] (if it's monitored) [13:40:20] which is expiring on Dec 10th or so [13:40:29] bblack: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=stream.wikimedia.org&service=HTTPS+stream.wikimedia.org [13:41:25] yeah so the 40d warning in that link, that's our unified cert expiry [13:41:30] I think we suppressed it in other cases [13:42:35] yeah for some reason stream is factored differently for monitoring, I guess a holdover from when it was separate [13:44:23] fixing up puppetization to kill that separate alert [13:45:17] ( https://gerrit.wikimedia.org/r/#/c/318921/1 ) [13:53:06] paravoid: while we're on related things, any thoughts about monitoring OCSP directly from check_sslxNN or similar? Now that we've got icinga on jessie, Net::SSLeay is technically new enough to do it. [13:53:33] paravoid: but is it worth continuing to embrace+extend check_ssl? or should we start over with trying to make it more-efficient and purpose-suited, etc? [13:54:32] more efficient/purpose-suited how? [13:54:54] I don't mind either way, but I'm wondering what you're thinking of :) [13:55:11] I donno. I was looking at it all late last week, how check_sslxNN wraps over check_ssl. The extensions to do OCSP are going to be even more invasive in that sense than just parallelizing.
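[Editor's note: as background for the stapling discussion above and below, a quick way to eyeball what a server is actually stapling, and whether the response is stale, is openssl s_client with -status. A hedged sketch; the host and SNI name are examples.]

    # Print the stapled OCSP response's status and validity window for one
    # cache host. s_client of this era sends no SNI unless -servername is
    # given; echo closes the connection once the handshake is done.
    echo | openssl s_client -connect cp2010.codfw.wmnet:443 \
        -servername en.wikipedia.org -status 2>/dev/null \
        | sed -n '/OCSP Response Data/,/Next Update/p'

[If "Next Update" is in the past, or no OCSP response is printed at all, Firefox users would be the first to see errors, as noted above.]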
[13:55:34] I guess the alternative is write something better, or google search for a better icinga ssl checker someone's already written. [13:56:15] the OCSP stuff requires hooking into Net::SSLeay a bit more deeply than check_ssl easily allows for. it's possible, but it might get pretty ugly. [13:56:24] I did before I wrote check_ssl, it was all pretty sucky [13:57:32] check_ssl also uses both IO::Socket::SSL and Net::SSLeay [13:58:02] iirc, the latter was so that I could do some more lower-level checks, rather than relying on IO::Socket::SSL magic [13:58:09] basically the dependency chain here goes like this: nginx internal stapling is overall "better" - it gets rid of some limitations in our current approach, especially e.g. for supporting Digicert's certs, but other stuff too. But, the way things currently work, we rely on the external python stapler sanity-checking everything, and icinga checking that its output hasn't gone stale due to failure to validate OCSP contents. [13:58:13] but OCSP we may be able to rely on IO::Socket::SSL [13:58:38] so if we want to switch to nginx internal stapling, we lose the ability to monitor by just checking external file freshness, and we need to confirm OCSP outputs from nginx directly. [13:59:03] my $ocsp_cache = IO::Socket::SSL::OCSP_Cache->new; [13:59:06] my $client = IO::Socket::SSL->new( [13:59:06] PeerAddr => $dst, [13:59:06] SSL_ocsp_mode => SSL_OCSP_FULL_CHAIN|SSL_OCSP_FAIL_HARD, [13:59:06] SSL_ocsp_cache => $ocsp_cache, [13:59:07] ); [13:59:16] yeah, looks easier [13:59:27] have to run off to a meeting [13:59:49] oh I didn't see that [14:00:03] I was looking at the Net::SSLeay API for it, but that seems better / higher-level [14:00:30] (assuming it can be forced to require stapling, instead of falling back to its own external ocsp fetch) [14:32:12] the caching bit is tricky, but basic validation of things we care about is easy [14:32:43] ideally we'd use SSL_ocsp_cache to avoid redundant OCSP checks, esp on intermediates where it has to make a network check [14:33:05] but the way we parallelize things currently, the threads probably can't share an OCSP cache in parallel [14:43:51] I donno, now that I check timing, it doesn't seem to slow it down much [14:44:21] that may be because we ship our intermediates into the local cert stores, and thus our own local checks don't consider the intermediates to be a peer cert that needs checking [14:45:05] (that's something that has bugged me before: why do we effectively add our intermediates to the local cert store as if they're trusted roots??) [15:46:11] bblack: speaking of jessie/check_ssl... https://gerrit.wikimedia.org/r/#/c/318949/ [15:46:22] I've had that queued up locally for quite some time [15:47:58] this doubles the number of checks too, and einsteinium may not have the necessary headroom [15:48:12] generally, check_ssl/sslxNN is becoming more of a test suite of all potential SSL issues [15:48:33] while they are all useful, I'm not entirely sure if it makes much sense to run those very often against each individual server [15:49:07] maybe we need to run the full suite against the service IP, and then against each cp* either a more limited set of tests *or* the full suite less often (e.g. every 10 minutes?)
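[Editor's note: on the question above of forcing IO::Socket::SSL to require stapling rather than falling back to its own OCSP fetch, the module also exports an SSL_OCSP_MUST_STAPLE mode for exactly that. A minimal sketch building on the snippet pasted above; the hostname is an example, and this is not the actual check_ssl patch.]

    use strict;
    use warnings;
    use IO::Socket::SSL;

    # Fail the handshake unless the server staples a good OCSP response;
    # FAIL_HARD additionally turns soft verification errors into hard
    # failures instead of silently passing.
    my $client = IO::Socket::SSL->new(
        PeerAddr      => 'en.wikipedia.org:443',
        SSL_hostname  => 'en.wikipedia.org',    # SNI
        SSL_ocsp_mode => SSL_OCSP_MUST_STAPLE | SSL_OCSP_FAIL_HARD,
    ) or die "handshake/OCSP failed: $IO::Socket::SSL::SSL_ERROR\n";
    print "stapled OCSP verified OK\n";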
[15:49:53] otoh, it is plausible that we may fuck up the config of one individual server that is pooled and on rotation, and taking up to 10 or 15 minutes to find out might be unacceptable [15:50:47] well [15:51:03] it's complicated :) [15:51:08] yeah it is [15:52:16] I think it's probably actually-important that we negotiate+verify ECDSA+RSA-based connections, and that we check OCSP on them [15:52:42] (and do that on every cache) [15:52:56] the other high-level question is whether it makes sense to check each individual domain w/ SNI these days [15:52:59] but probably what we could dump is the xNN part that checks all the various SNI hostnames [15:53:02] heh [15:53:27] and just dump out the SNI list from the cert and validate that it's exactly what we expect it to be, if anything [15:53:45] with the next round of renewals (right now), I'm adding the last two cache_misc certs into unified anyways [15:53:59] basically all cache terminators will always have a single unified cert covering everything, and anything else doesn't [15:54:01] another idea would be to just check the fingerprint maybe [15:54:14] eh [15:54:19] what we really care about on the per-host check is that they're all serving the same certificate [15:54:30] eventually we'll add HPKP, and we'll probably want check_ssl to validate that the PKP hash matches [15:54:33] which is basically the same thing [15:54:44] and then have a service-level check that validates that that certificate covers the domains we expect it to [15:54:59] we don't actually want them all serving the same cert: esams+asia vs US vendor split [15:55:00] in theory that one could just be a jenkins test but it's a little too complicated for jenkins [15:55:09] right, true [15:55:36] in any case, our unified SAN list is a fairly fixed thing that changes slowly [15:56:05] I'd just strip back what we're doing today on the xNN front to do a single non-SNI check for cert properties (ECDSA+RSA, OCSP, basic validity) using wikipedia.org [15:56:22] and then also tack on, right after that, a sorted comparison that the SAN list matches the expected unified list [15:56:46] xNN is only used for our unified anyways, it's basically a custom check_ssl_unified [15:58:33] so (1) replace check_sslxNN with check_ssl_unified (single check of either non-SNI or cn==en.wikipedia.org + quick in-memory comparison of SAN list to expected list) [15:58:39] (2) merge up ECDSA+RSA changeset [15:58:43] (3) merge up OCSP changeset [15:59:14] (1) is not as easy [15:59:25] because we can't get at the san list in the cert? [15:59:29] yeah [15:59:36] I don't see why we wouldn't be able to in theory [15:59:39] but we can add a check_ssl option where you give it a list of domains or something [15:59:54] sslxNN is a wrapper around check_ssl [15:59:59] it basically calls check_ssl a few times and gets the output [16:00:02] yeah we'll have to extend check_ssl [16:00:08] anyway [16:00:09] meeting time :) [16:00:11] but in check_ssl, surely we can dig into the ssl object [16:00:21] (and get the raw san list) [17:35:01] 10netops, 06Operations: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2757566 (10mark) Row A-D uplinks to cr2-eqiad have all been moved from fpc 5 to fpc 3. Remaining: - pfw1 uplinks (xe-5/0/3) - Zayo wavelength to codfw (xe-5/2/3) - Equinix Ashburn port (x...
[17:49:54] 10netops, 06Operations: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2757634 (10mark) Reverse DNS (interface names) should also be updated for all moved ports... [18:14:28] paravoid: https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:check_ssl_stuff [18:15:07] ^ adds OCSP stapling, authalg name output, and SAN-list checking (in a single connect), then replaces check_sslxNN for the caches' unified with 2x singular invocations of that for ECDSA+RSA [18:15:13] ? [18:15:49] it's going to double the number of embedded script invocations for the certpair, but the checks should run way way faster with only a single connection each [18:20:14] cp3035: [18:20:16] bblack@einsteinium:~$ time /usr/lib/nagios/plugins/check_sslxNN -H 10.20.0.170 -p 443 [18:20:22] real 0m3.536s [18:20:22] user 0m4.312s [18:20:22] sys 0m0.284s [18:20:50] time ./check_ssl -H 10.20.0.170 -p 443 -o must-staple --authalg ECDSA --cn en.wikipedia.org --sans '..... [18:20:56] real 0m0.825s [18:20:56] user 0m0.304s [18:20:56] sys 0m0.016s [18:33:42] reviewed [18:34:00] so, check_sslxNN right now does a check with SNI enabled and one without [18:34:15] er, one without and multiple with, that is [18:34:28] nginx may serve a different certificate in those cases [18:34:38] it's impossible to know [18:35:04] it would need a hell of a misconfiguration for this to happen but it is plausible [18:35:11] well, the basis of getting rid of the many separate connections is that we know our nginx is only configured with 1x certificate [18:35:23] 1+1, but yeah [18:35:29] in the SNI sense anyways [18:35:42] but we do add more certificates at times, right? [18:35:49] we have the phabusercontent now for instance [18:36:07] so it's plausible that for some fucked up reason the non-unified becomes primary and is served when no SNI is present [18:36:07] those are being added to the unified this time around (*.planet.wikimedia.org + *.wmfusercontent.org) [18:36:22] right, but we may add another one at some point [18:36:45] I donno [18:36:51] *if* that happens and we screw this up, it's possible that the site will break for non-SNI capable UAs and we wouldn't even notice it [18:36:57] as we all run recent browsers [18:37:15] well, my thinking on that is: [18:37:47] 1) The set of domains in "unified" is pretty static. We're not adding to it often, and shouldn't be, and the unified cert should be the default cert. [18:38:43] 2) If someone really has a brand-new 2LD we need to add as canonical, in between yearly renewal cycles, with such importance that we need to get it a temporary second cert... [18:38:59] 2a) Dealing with configuring that is part of the pain they're asked to endure to ask that of us, and [18:39:07] 2b) It would be a separate non-default cert [18:39:18] (with its own separate check) [18:39:32] sure, all that is true [18:39:34] 3) The check on unified does an explicit SNI for en.wikipedia.org, so it's matching whatever matches that [18:39:56] if we accidentally switched the default around, the only thing that could get messed up is non-SNI browsers [18:40:01] yeah, exactly [18:40:09] which means that we won't otherwise notice [18:40:40] most of them either already can't connect to us, or won't connect to us in the near-ish future (IE8/XP) [18:40:51] perhaps we should keep sslxNN (but +ECDSA/RSA +OCSP) for the service IPs? [18:40:52] do we know what percentage of traffic comes from non-SNI browsers?
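[Editor's note: for the SAN-list comparison in the changes above, roughly the same check can be approximated by hand with openssl, which also exercises the non-SNI default-cert path discussed just below. A hedged sketch; the IP and the expected-list filename are examples.]

    # Dump the live SAN list from whatever cert the server presents with
    # no SNI (s_client of this era sends none by default), normalize it,
    # and diff it against a pre-sorted expected unified list.
    echo | openssl s_client -connect 10.20.0.170:443 2>/dev/null \
        | openssl x509 -noout -text \
        | grep -A1 'Subject Alternative Name' | tail -n1 \
        | tr -d ' ' | tr ',' '\n' | sed 's/^DNS://' | sort \
        | diff - expected_unified_sans.txt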
[18:40:59] just do the full test there [18:41:10] it's expensive, but it's a dozen service IPs, so whatever? [18:41:15] volans: yes, roughly something on the order of <1.5% [18:41:27] (I don't know more specifically than that, could be much lower) [18:41:53] my point being, if it's so low, a hypothetical downtime of 15 minutes for <1.5% is not "that bad"(TM) [18:42:13] so some more expensive checks to ensure this rare case could run with less frequency [18:42:32] yeah I don't think so either. we may continue to support non-SNI browsers in some sense, but if they fail it's more on them than us in 2016 and beyond IMHO :P [18:42:47] either way, we *still* don't have to check all the connections to the public even [18:43:06] we could just do a single non-SNI connection to the public and validate that it matches en.wikipedia.org, and that suffices to catch the edge case [18:45:01] paravoid: on to other points: the reason --sans isn't exhaustive (exact match) is because we do sometimes add to the list on renewal. this way we don't have a race condition between deployment and monitoring (deploy new cert with +1 names, then update monitoring post-deployment) [18:45:22] it's technically not an issue if we have *extra* SANs from the monitoring pov, so long as we match all the SANs we're known to expect. [18:46:05] yeah ok [18:46:21] ~1.5% is in the range of a "dangerous" percentage [18:46:32] not big enough to notice yourself/by users on IRC [18:46:36] big enough that it matters [18:46:41] :) [18:46:54] hence something that I think we should be catching with monitoring [18:47:31] would you be ok with just adding another check_ssl with --no-sni against the publics that must match wikipedia.org (which is in the unified and wouldn't move from it)? [18:47:47] yeah, that works too [18:47:53] that's also a lot of checks though [18:48:10] 8x IPs, 10x post-asia [18:48:21] oh just for the publics [18:48:22] yeah, sure [18:48:27] I even suggested to keep sslxNN for the publics [18:48:45] it's just to confirm lack of gross misconfiguration on our part, it won't catch a single misbehaving cache [18:49:29] but it's the kind of thing you'd like a monitoring system to let you know about [18:49:37] this is even less expensive than keeping sslxNN [18:49:37] (one misbehaving) [18:50:13] one misbehaving in this case would be due to 1x misconfigured because ... I guess someone turned off puppet and edited? I donno, it seems like we'll have other things around that [18:51:14] or deploying something on one production host before deploying to all [18:51:15] we could also just keep sslxNN exactly as it is and run it once an hour or once a day [18:53:25] yeah [18:53:55] I think at some point we'll be forced to have the per-cp check to just be a consistency check [18:54:15] ? [18:54:16] and then expiries, subjects etc.
to be on a service IP level [18:54:34] if the certificate is about to expire, you don't want every single cp* check to go WARN or CRIT [18:54:38] you want one, and a paging one at that [18:54:44] well [18:54:56] that could easily happen for a number of reasons on just one host, too [18:54:58] what you really care about is that each cp* is consistently configured [18:55:09] yes as long as the cp* checks that they have the same cert with fingerprint checks maybe [18:55:09] because puppet is retarded and restores an old file, or one host doesn't get upgraded with a last-minute renewal [18:55:10] same cert, same fingerprint for example [18:55:38] and then one check that this specific fingerprint also has the correct expiry and subject and whatnot [18:55:45] I'm much more worried about "wrong cert actually deployed on a single cache host because of some fuckup / software bug" than I am about us misconfiguring SNI [18:55:53] (well default_server for SNI) [18:56:37] and if you're making a singular SSL connection to get a fingerprint, I doubt a few extra string compares to go ahead and validate the full SAN list and OCSP matter much for monitoring load [18:57:09] (and we also still need to check OCSP on all caches constantly no matter what. there are a ton of real scenarios where 1x cache host could screw up automatic OCSP stapling) [18:57:14] it doesn't, it'll just be annoying to see a hundred WARN alerts close to expiry [18:57:21] yeah, I was about to say this about OCSP [18:57:51] spam is really a separate problem [18:58:13] I think I've said the same thing 10 times since I got here: icinga-wm should be more concise when spam appears [18:58:51] after it has spewed 10x messages in a short time window, it can just buffer up and do something like "397 alerts in the last minute, check https://icinga.wikimedia.org/" [18:59:01] I've suggested that too :) [18:59:17] but in this case it's not exactly the same [18:59:37] what you really care about is that each host in a certain cluster runs the same config [18:59:42] not that the cert it serves is valid [18:59:53] e.g. with the GlobalSign shit [19:00:01] if one of the servers was misconfigured with the wrong chain [19:00:13] it would still pass check_ssl with flying colors [19:00:28] but it's clearly an inconsistency that you'd like to deal with [19:00:44] or if in our dual-vendor future you manage to get only one server in eqiad with Digicert but all the rest with GlobalSign [19:00:46] well we can add that pretty easily, just like --issuer? [19:00:48] then that's also a problem [19:00:53] (or maybe issuer already does that?) [19:01:13] issuer is a string regexp right now IIRC, so it depends on the check you want to do [19:01:28] yeah ok, the string didn't change in the GS case [19:02:05] at a certain point if our config management tool is broken it's broken though [19:02:32] (if the cert had failed to deploy to 1x host for whatever reason and we're not getting broken/disabled puppet alerts) [19:03:32] I'd say wait till we're doing the puppetization for the DC-split with 2x vendors to attack validating the intermediate fingerprint. [19:04:25] it has the same update problem as the SAN list of course :) [19:04:47] at the risk of being a bit offtopic, do you think that this whole ssl-related stuff is so complex that it might deserve dedicated software that in turn sends passive checks to Icinga? (with all the deduplication logic, etc..) [19:05:12] with anything like that we have monitoring races.
I guess the only general solution is to allow multiple values and use them during transition [19:05:13] * volans mostly thinking out loud, feel free to ignore him ;) [19:05:22] yeah, multiple values++ [19:05:24] volans: shouldn't be that complex really [19:05:41] we could do that with the SAN list too, make it explicit and support multiple separate lists [19:10:19] (seems annoying vs the existing solution in that case, though - supporting the extra syntax and array-of-array, etc) [19:21:11] is check_interval in minutes? [19:21:19] we can slow down the existing one, too heh [19:21:30] 1/min is crazy considering we're hitting all 100x caches [19:21:51] bblack: it's in "time units", should be in minutes in our config [19:22:09] well I guess I'm not thinking about the 3x fails to reach alert though [19:22:43] yes, it's in minutes [19:25:15] paravoid: what if I just add --no-sni to the new check_ssl_unified? it doesn't need SNI and ensures we're checking unified-is-default [19:25:33] doesn't that fix everything without more checks? [19:26:39] (I guess what it doesn't catch is that we stupidly add a second certificate for SNI==wikiversity.org when it's already part of unified's SAN list, but there's only so much stupid that can be prevented...) [19:28:05] 10Traffic, 06Operations, 10Wikimedia-Stream, 13Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2758037 (10Krinkle) >>! In T147845#2749511, @BBlack wrote: > Ok, I was only considering the websockets case. Still, since the python code is unaware of X-Client-IP..... [19:32:26] * volans dinner [19:32:31] 10Traffic, 06Operations, 10Wikimedia-Stream, 13Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2758046 (10BBlack) I just don't really want to support that at the end of the day - the complexity cost is too high for all the rest of our stack (not that websockets... [20:03:47] paravoid: I switched the ocsp patch to default-off, so it doesn't screw up existing check_ssl for internal certs like cassandra stuff, and merged up all the feature stuff, just not the final switch from check_sslxNN [20:22:27] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review, 07Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#2758245 (10BBlack) [20:22:30] 10Traffic, 06Operations, 13Patch-For-Review: Extend check_sslxnn to check OCSP Stapling - https://phabricator.wikimedia.org/T148490#2758242 (10BBlack) 05Open>03Resolved a:03BBlack Fixed now in check_ssl itself. [20:24:10] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review, 07Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#1149975 (10BBlack) [20:24:13] 10Traffic, 06Operations, 07Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2758247 (10BBlack) [20:27:28] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review, 07Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#2758256 (10BBlack) This task has continued to evolve. Basically, the remaining steps on the current path to resolution are: 1. Deploy... [20:38:24] 07HTTPS, 10Traffic, 06Operations, 10Wikimedia-Blog: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2758264 (10BBlack) @EdErhart-WMF are you the person now working on this? Can we get a status update fixing the remaining issue (correct HSTS header)?
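[Editor's note: on the check_interval exchange above: in Icinga 1.x these values are in "time units", where interval_length in icinga.cfg (conventionally 60 seconds) defines the unit, so in a stock config they behave as minutes. A hedged sketch of slowing a per-host check down while keeping fast retries; the service name and template are illustrative.]

    # icinga.cfg: one "time unit" = 60 seconds (the stock default)
    interval_length=60

    define service{
        use                  generic-service   ; hypothetical local template
        service_description  HTTPS unified (per-cache)
        check_command        check_ssl_unified
        check_interval       10  ; every 10 time units (minutes) instead of 1
        retry_interval       1   ; once failing, re-check each minute
        max_check_attempts   3   ; the "3x fails to reach alert" noted above
        }

[With max_check_attempts 3 and retry_interval 1, a real failure still reaches a hard alert within a few minutes even though the normal interval is long.]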
[22:32:48] question: with the wikimedia varnish setup, does the backend server know the IP of the client at the other end? I.e. do we have x-forwarded-for or the like? [23:00:21] SMalyshev: yes [23:00:38] bblack: thanks! Is it x-forwarded-for? [23:01:25] SMalyshev: we parse X-F-F at the front edge to look at things like trusted external proxies (e.g. OperaMini), and there's some complex logic around that [23:01:49] bblack: sure, but what does the backend see? [23:01:55] SMalyshev: but then there's a set of standardized headers exposed to the middle layers and the application. X-Client-IP is the one that carries our notion of the "true" client IP [23:02:18] SMalyshev: there's also X-Carrier in the case of a known mobile carrier, and X-Trusted-Proxy in the case of one of those [23:02:47] bblack: aha, thanks, X-Client-IP is great [23:05:01] https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb#L199 [23:05:33] ^ that's the closest we have to documentation about it. it will eventually be part of the traffic "contract" we're supposed to be writing up, to document how the layer behaves as a whole [23:09:07] bblack: thank you! [23:12:21] 10Traffic, 06Operations: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#2758735 (10BBlack) [23:13:21] 10Traffic, 06Operations: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#2684468 (10BBlack) After some IRC discussion, it seemed better to host the content pages on metawiki. It will look more-official, and it will also be easier to develop an...
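[Editor's note: for the header contract described in the 23:01 exchange, a sketch of how a backend application behind the cache layer might consume it. The header names (X-Client-IP, X-Carrier) come from the discussion above; the fallback logic is illustrative, not Wikimedia's actual code.]

    # Minimal PSGI app: trust the X-Client-IP set by the front edge;
    # REMOTE_ADDR here is only the nearest cache/LVS hop, not the client.
    my $app = sub {
        my $env       = shift;
        my $client_ip = $env->{HTTP_X_CLIENT_IP} // $env->{REMOTE_ADDR};
        my $carrier   = $env->{HTTP_X_CARRIER} // 'none';  # known mobile carriers only
        return [ 200, [ 'Content-Type' => 'text/plain' ],
                 ["client=$client_ip carrier=$carrier\n"] ];
    };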