[07:06:02] morning
[07:06:10] what happened last EU night in esams? https://grafana.wikimedia.org/dashboard/db/load-balancers?panelId=12&fullscreen&orgId=1&from=1524175207254&to=1524181431662
[07:06:25] or is it just a glitch in the monitoring?
[07:06:43] yeah I think bast3002 had issues
[07:07:30] vgutierrez: see backscroll by Brandon in the other channel
[07:08:06] ack :)
[11:54:09] 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Daniel: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#4145675 (10daniel)
[12:22:27] I would like to merge the bug fixes to the BGP FSM code, as exposed by the recently merged FSM unit tests: https://gerrit.wikimedia.org/r/#/c/423997/ etc.
[12:22:31] any objections?
[12:26:23] nope
[13:15:56] <_joe_> mark: you're on your tech friday, right? And working on pybal? If so, maybe you're interested in https://phabricator.wikimedia.org/T192437
[13:16:14] * _joe_ shops his wishlists around the team
[13:16:54] _joe_: regarding wishlists....
[13:17:14] _joe_: how can we speed up updating nginx on confd instances?
[13:17:45] <_joe_> vgutierrez: what do you mean?
[13:18:00] https://phabricator.wikimedia.org/T164456#3723386
[13:18:57] <_joe_> vgutierrez: you're talking about the switch to nginx-light?
[13:19:12] <_joe_> the nginx-full upgrade can be done now
[13:19:26] <_joe_> well, maybe on monday
[13:19:29] the nginx-full upgrade
[13:19:44] <_joe_> but it should be generally almost painless
[13:19:56] right, we're trying to align on the most recent version of our nginx-full first, to make the nginx-light transition less painful
[13:19:58] <_joe_> also, all of our applications should survive etcd dying pretty well
[13:20:12] <_joe_> if they don't, it's a design flaw
[13:20:19] hypothetically nginx upgrades are seamless even from apt, but still
[13:20:22] <_joe_> I was thinking of doing what google does with chubby
[13:20:32] it was more about coordinating with owners Just In Case
[13:20:38] <_joe_> a planned N-minute outage every quarter
[13:20:57] <_joe_> to verify if any application comes to rely too much on etcd for its operations
[13:21:09] chaos monkey FTW
[13:25:46] conf[1001-1003].eqiad.wmnet,conf[2001-2003].codfw.wmnet,mw2139.codfw.wmnet
[13:25:54] seem to be the only tlsproxy-using hosts left with outdated nginx
[13:27:55] (not sure why that one mw server is the oddball, maybe it was down for repairs during a previous upgrade cycle or something)
[13:28:12] <_joe_> I guess so
[13:28:25] <_joe_> vgutierrez: so, ping me on monday I guess?
[13:30:37] _joe_: will do <3
[13:31:23] right now I'm pretty busy with my new traffic security duties.. but I'd like a chance to implement T192437
[13:31:23] T192437: Pybal support of configuration from the kubernetes API - https://phabricator.wikimedia.org/T192437
[13:33:46] BTW bblack, I was talking this EU morning with ema regarding a tiny issue with how we fill in the X-CP-* info in varnish
[13:36:10] considering this VSL query: "ReqMethod ne PURGE and ReqHeader:X-CP-Key-Exchange eq RSA"
[13:36:28] with the current VCL you also get every non-HTTPS request that we receive
[13:37:33] so I was wondering if we should wrap this code: https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb#L285-L341 within an if (req.http.X-Forwarded-Proto == "https")
[13:39:27] this currently happens because we use RSA as a fallback here: https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb#L313-L314
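A rough sketch of the guard being proposed here (this is not the production VCL; the real X-CP-* block is the one linked above at L285-L341, and the actual header extraction is elided):

    vcl 4.0;

    backend default { .host = "127.0.0.1"; .port = "8080"; }

    sub vcl_recv {
        # Proposed change: only run the X-CP-* block for requests that actually
        # arrived over TLS, i.e. came in through nginx with X-Forwarded-Proto: https.
        if (req.http.X-Forwarded-Proto == "https") {
            # ... the existing extraction of cipher / key exchange / TLS version
            #     from X-Connection-Properties would sit here, unchanged ...
            if (!req.http.X-CP-Key-Exchange) {
                # the RSA fallback (L313-L314 above); today it also runs for
                # plain-HTTP requests, which is what produces the false positives
                set req.http.X-CP-Auth = "RSA";
                set req.http.X-CP-Key-Exchange = "RSA";
            }
        }
    }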
[14:32:46] I've brought mw2139 in line, it was probably down for something dc-opsish when I upgraded the other nginx packages
[14:33:36] thx moritzm
[14:57:23] vgutierrez: mmh, actually https_recv_redirect already handles the xfp!=https case https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb#L35
[15:04:48] so why do I need the xfp==https condition in varnishlog / varnishncsa to avoid false positives?
[15:06:04] there are known cases of "internal" software setting xfp=https and hitting varnish directly
[15:06:12] can that be the cause of what you're seeing?
[15:07:47] but if they set xfp=https, filtering with xfp==https on the VSL query should be useless, right?
[15:08:09] that's a great point! :P
[15:11:16] vgutierrez: what's a false positive in this context?
[15:12:01] 301 TLS redirects, for instance, I assume
[15:12:01] something reporting RSA as a key exchange algorithm when it's actually a plain HTTP connection
[15:12:24] and by something I mean varnish :)
[15:14:32] vgutierrez: a way to rephrase my question is: how do you know it's actually a plain HTTP connection?
[15:15:11] xfp is not set
[15:16:00] and our nginx sets xfp to https for every request
[15:16:54] I'm not getting any output with `varnishncsa -n frontend -q 'ReqMethod ne PURGE and ReqHeader:X-CP-Key-Exchange eq RSA and ReqHeader:X-Forwarded-Proto ne https'`
[15:17:37] get rid of the X-Forwarded-Proto condition
[15:17:49] 'cause it's not there at all
[15:18:42] varnishlog -n frontend -q "ReqMethod ne PURGE and ReqHeader:X-CP-Key-Exchange eq RSA"
[15:18:48] I'm getting hits with that on cp3030 right now
[15:19:31] of course you'll have some that are actually TLS ones using AES128-SHA
[15:20:47] and some that are plain text, easy to spot because the TLS version is empty
[15:22:07] let me paste a full request dump on phabricator
[15:22:24] (without the user IP of course)
[15:24:38] so
[15:24:47] varnishncsa -n frontend -q 'ReqMethod ne PURGE and ReqHeader:X-CP-Key-Exchange eq RSA' -F '%{X-Forwarded-Proto}i %{X-CP-TLS-Version}i %m %s'
[15:25:22] https://phabricator.wikimedia.org/P7020
[15:25:27] this only outputs GET+301 or HEAD+200 it seems
[15:25:51] (for requests w/o TLS version)
[15:26:10] mmh no, there's some GET+200 too, interesting
[15:30:00] btw, http://www.wikipedia.com/ works but https://www.wikipedia.com/ returns an SSL certificate error, funny :)
[15:37:32] hmm I'm also seeing requests without xfp that are proper https connections :/
[15:37:43] * vgutierrez is missing something here
[15:38:26] hmm forget that, I misread -X behavior
[15:47:42] hmmm ok.. I got one of those GET 200 without a TLS version...
[15:47:55] it's an HTTP/1.0 request without a Host header
[15:49:50] ema: that's how they're "bypassing" the xfp redirection
[15:49:52] https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb#L40
[15:50:09] it falls into that non-existent else
[15:51:33] ah there you go
[15:52:19] curl -v -H 'Host:' -0 http://en.wikipedia.org/
[15:52:22] like this
[15:54:10] I guess we could stop that on the frontend varnish instead of getting it reach the mediawiki server?
[15:54:35] s/getting/letting/
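To illustrate the fall-through: a minimal sketch of the shape of the redirect logic being discussed, assuming a much-simplified stand-in for the canonical-domain regex (the real logic is around L35-L40 of wikimedia-frontend.vcl.erb):

    vcl 4.0;

    backend default { .host = "127.0.0.1"; .port = "8080"; }

    sub vcl_recv {
        if (req.http.X-Forwarded-Proto != "https") {
            # simplified stand-in for the canonical-domain regex in the real VCL
            if (req.http.Host ~ "(wikipedia|wikimedia|wiktionary)\.org$") {
                return (synth(301, "TLS Redirect"));
            }
            # No else branch: a request with no Host header at all (the HTTP/1.0
            # curl above) or a non-canonical Host such as www.wikipedia.com falls
            # through here with no TLS redirect, and later picks up the RSA
            # fallback in the X-CP-* block.
        }
    }

    sub vcl_synth {
        if (resp.status == 301 && resp.reason == "TLS Redirect") {
            set resp.http.Location = "https://" + req.http.Host + req.url;
            return (deliver);
        }
    }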
[16:16:51] vgutierrez: so, to correct what I said before, it's not that we allow requests w/o xfp from "local" IPs, rather we allow local IPs to set it
[16:19:34] perhaps we should be adding www.wikipedia.com to that regex
[16:20:32] in general, .com (eg: wikisource.com)
[16:23:45] yeah, most of the requests falling into the category vgutierrez is mentioning seem to be for www.wikipedia.com
[16:27:45] though we would need to add *.wikipedia.com to the certs... mmh
[16:29:28] so to recap:
[16:30:28] if the Host header is missing or doesn't match the regex (eg: wikipedia.com), there's no 301 TLS redirect and we report the (non-existent) key exchange as RSA
[16:30:49] mediawiki does redirect wikipedia.com to wikipedia.org
[16:31:20] our wildcard certs don't list *.wikipedia.com (and others)
[16:33:37] one option would be fixing the regex and the certs and ending up with: GET http://www.wikipedia.com -> 301 TLS redirect https://www.wikipedia.com -> 301 (mediawiki) https://www.wikipedia.org
[19:31:56] huh?
[19:32:26] I guess I should read more of the backscroll. I thought we weren't marking non-HTTPS reqs in HTTPS stats at all...
[19:34:09] anyways, that's definitely buggy, and could really be affecting some of our stats I think?
[19:36:05] X-Connection-Properties is only allowed to come from nginx, and all the other cipher stats should only be logged/generated if X-C-P was initially present.
[19:36:25] there are multiple non-HTTPS pathways into and through Varnish that should be no-ops for HTTPS stats, but apparently aren't
[19:37:54] luckily, I think it only sets:
[19:37:55] set req.http.X-CP-Auth = "RSA";
[19:37:55] set req.http.X-CP-Key-Exchange = "RSA";
[19:38:14] from that else clause, but doesn't actually set a cipher (I guess a blank cipher over in grafana? I dunno)
[19:38:46] we should probably block out the whole thing so that none of the X-CP stuff runs unless X-C-P is defined
[19:40:22] or maybe they don't leak into the graphs at all, not sure
[19:41:25] re: specifically wikipedia.com, it's just the most popular of hundreds of non-canonical domains; there are some tickets and long-term work to do about the lot of them. For now it's expected that they fail HTTPS and pass through unmolested for normal HTTP (where hopefully MediaWiki either redirects or errors)
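For completeness, a sketch of the stricter variant suggested at [19:38:46]: gate the whole X-CP-* section on X-Connection-Properties being present at all, so every non-HTTPS pathway into Varnish is a no-op for the HTTPS stats. The ACL name and addresses here are placeholders, assuming (as implied above) that X-C-P is stripped from anything that isn't the local nginx TLS terminator:

    vcl 4.0;

    backend default { .host = "127.0.0.1"; .port = "8080"; }

    # placeholder ACL standing in for "the local nginx TLS terminator"
    acl local_tls_terminator {
        "127.0.0.1";
        "::1";
    }

    sub vcl_recv {
        # X-Connection-Properties is only allowed to come from nginx: strip it
        # from anything else so clients can't spoof their way into the TLS stats
        if (client.ip !~ local_tls_terminator) {
            unset req.http.X-Connection-Properties;
        }

        # only generate X-CP-* data when X-C-P was actually present
        if (req.http.X-Connection-Properties) {
            # ... existing cipher / key exchange / TLS version extraction ...
            if (!req.http.X-CP-Key-Exchange) {
                set req.http.X-CP-Auth = "RSA";
                set req.http.X-CP-Key-Exchange = "RSA";
            }
        }
        # plain-HTTP requests, Host-less HTTP/1.0 requests and internal clients
        # that set xfp=https themselves simply carry no X-CP-* headers
    }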