[00:10:47] cache_maps is going to end up with zero lines of cluster-specific VCL code heh
[00:11:21] cache_misc can probably get there, if we move a few generic features upstream with switches
[00:11:41] cache_upload might get close post-varnish4, close enough that we could do the same
[00:12:19] cache_text may be the lone exception, I don't see it becoming cluster-specific-code free anytime in the foreseeable future. we can still cut it back a bit, though.
[01:34:36] Traffic, Operations, Phabricator: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2381584 (mmodell) stalled>Open
[09:42:59] Traffic, Operations: Parametrization of VCL is inconsistent - https://phabricator.wikimedia.org/T137747#2382136 (ema) p:Triage>Normal
[09:54:14] Traffic, Operations: Parametrization of VCL is inconsistent - https://phabricator.wikimedia.org/T137747#2382164 (BBlack) Hiera usage isn't specific to VCL, it's all over the ops/puppet repo in several design patterns. Deciding to go against that is out of scope here, IMHO. That's not necessarily a defe...
[09:55:04] bblack: re: x-cache int/err/bug. Is it even possible to end up in vcl_backend_error with status < 500?
[09:55:18] * ema stares at https://gerrit.wikimedia.org/r/#/c/293721/1/modules/varnish/templates/vcl/wikimedia-backend.vcl.erb,unified
[09:59:25] Traffic, Operations: Parametrization of VCL is inconsistent - https://phabricator.wikimedia.org/T137747#2382175 (BBlack) Also, I don't think the "3 scopes" section above understands hiera, or I don't. Setting `varnish_version4` as a top-scope hiera variable does not make it available anywhere in puppet...
[10:00:28] ema: I don't know, but probably the "err" flag should be outside the conditional regardless.
[10:01:12] bblack: agreed
[10:01:55] I think we just moved that conditional from vcl_error (v3), but is probably not necessary given the split between _synth and _backend_error
[10:02:15] and probably, should switch to "err" in v3 around a bit too
[10:03:10] hmmm wait other way around
[10:03:19] well as the patch is now, a synth 404 in v3 would be considered 'err' instead of the more appropriate 'int'
[10:03:20] should probably set to err in vcl_synth in the >400 case
[10:03:39] I think 'err' is probably more-appropriate, anytime we're using our synthetic error page call
[10:03:52] oh I see now!
[10:04:00] we need to agree on the semantics of err :)
[10:04:41] cool, if we show the errorpage it's an err, everything else (eg: redirects) is an int
[10:04:51] I'd say 'err' means both of these things, which I think are currently VCL-equivalent: we use our synthetic HTML error template, and the status is >=400 but not 413
[10:05:43] I'm not sure why 413 is an exception really, should figure that out and document it
[10:06:36] maybe historically it's a special case for large/chunked uploads to commons and has its own special error output we don't want to mask, or something
[10:08:50] 9f6a5ad9b52f52acdb23cf394ce09b8babc81919
[10:09:06] "Treat HTTP status 400/413 specially"
[10:09:58] mark: do you happen to remember the reasoning behind that?
[10:10:59] hmm gerrit doesn't see that hash?
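The working definition of 'err' versus 'int' proposed at 10:04:41 and 10:04:51 can be written down as a small classifier. This is only a sketch in Python of the rule as stated in the chat, not the actual VCL: the function name is hypothetical, and the 413 exception is kept only because the discussion had not yet established whether it is still needed.

```python
def classify_synthetic(status: int) -> str:
    """Classify a response that varnish generated itself (hypothetical helper).

    Rule as stated in the discussion: 'err' when the synthetic HTML error
    template is used, i.e. status >= 400 but not 413; everything else
    (for example redirects) counts as 'int'.
    """
    # Assumption: 413 stays exempt, as in the chat; its rationale is unknown.
    if status >= 400 and status != 413:
        return "err"
    return "int"


if __name__ == "__main__":
    for code in (301, 404, 413, 503):
        print(code, classify_synthetic(code))
```

If the 400/413 special case turns out to be redundant, the exemption collapses to a plain `status >= 400` test.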
[10:11:10] oh yeah, not even >=
[10:11:18] mark: https://github.com/wikimedia/operations-puppet/commit/9f6a5ad9b52f52acdb23cf394ce09b8babc81919
[10:11:33] If82c582ad601a3225b157ab69d993e2c5a27cdbb
[10:11:46] https://gerrit.wikimedia.org/r/#/c/92864/
[10:13:12] 31 october 2013
[10:13:18] a few days before my 3 week holiday to thailand
[10:13:24] when I migrated squid text to varnish last minute ;p
[10:13:35] so it would have been text related
[10:13:37] likely
[10:13:58] but no, I don't remember why :(
[10:14:14] * mark checks email
[10:17:38] no I don't remember
[10:17:47] i'd say, monitor any 400/413 responses we have today
[10:17:50] it may well be redundant
[10:18:21] presumably i wanted to let any existing error message from upstream (apache/mw) through instead of generating the varnish error, but don't remember why
[10:20:34] ema: "err" patch updated, I think this one is more-correct. needs varnishxcache matching fixups first.
[10:27:04] mark: so it might be cargo-cult :) thanks for checking
[10:46:00] <_joe_> ema: quite disappointingly, our patch to logrotate resulted in hhvm logs not getting written anymore
[10:46:24] <_joe_> and it's expected, as they get written via syslog
[10:46:39] <_joe_> so, I have to change that, sigh
[10:48:14] <_joe_> it seems we can't use logrotate AND let unprivileged users peek at the logs
[10:48:25] really?
[10:48:31] <_joe_> at specific logs it is
[10:48:40] <_joe_> well, I have to think this through
[10:48:42] <_joe_> but
[10:49:30] <_joe_> I don't know why rsyslog is refusing to write to that file to be honest
[10:52:01] <_joe_> ah, nevermind. It actually wasn't writing just on one machine where I did my tests yesterday, and restarting rsyslog solved the issue there
[10:52:10] <_joe_> ok so, we're good :P
[10:52:15] _joe_: nice :)
[10:52:35] <_joe_> note to self: never check results just where you made some tests
[10:58:09] <_joe_> on the other hand, we stopped logging to /var/log/hhvm/error.log on all trusty machines around the 19th of may
[10:58:17] <_joe_> the logs are in upstart too, though
[11:11:29] bblack: puppetfails on cp*
[11:12:04] 'req.http.X-CDIS': cannot be set in method 'vcl_backend_error'
[11:13:21] yeah
[11:13:29] also, I don't think it can be set in vcl_backend_response either
[11:14:14] (which I was trying to do for hit_for_pass stuff)
[11:14:23] I reverted both for now
[11:14:50] ok
[11:15:46] bblack: I was taking a look into https://gerrit.wikimedia.org/r/#/c/276529/, old attempt at changing how we name backends
[11:16:33] 1) we need to do the same thing a few lines below where we iterate over the backends
[11:16:56] 2) maybe the etcd template can be changed accordingly, but for sure it's gonna mean blood
[11:17:32] eg: https://github.com/kelseyhightower/confd/blob/master/docs/templates.md#replace
[11:18:25] <_joe_> ema: I think we do regex mangling already
[11:18:39] <_joe_> also check if our current version of confd can do that
[11:18:50] _joe_: oh I thought we were splitting
[11:19:07] <_joe_> ema: I don't look at those files since 1 year
[11:19:08] <_joe_> :P
[11:19:58] .backend = be_{{ $parts := split $node "." }}{{ index $parts 0 }};
[11:20:47] <_joe_> ok so we use split and index
[11:20:59] <_joe_> wow, that's horrible :D
[11:21:06] yep! :)
[11:21:12] <_joe_> I did it!
[11:21:18] it works!
[11:21:52] _joe_: I guess if they don't mention regexps explicitly then Replace doesn't support regexps?
[11:21:55] https://golang.org/pkg/strings/#Replace
[11:22:22] <_joe_> it doesn't, no
[11:22:30] <_joe_> it would make your life too easy
[11:22:41] <_joe_> and Rob Pike doesn't like easy
[11:23:40] but still we can do a few ugly replaces
[11:24:25] like {{ replace $node "." "_" -1 }} and then {{ replace $node "wmnet" "" -1 }}
[11:25:10] it's probably not worth it in the long run, should abandon that change I think
[11:25:39] I'm grasping at the real-world reason we wanted it
[11:26:05] it basically gave us backend naming like appservers_svc_codfw + appservers_svc_eqiad right?
[11:26:24] be_rcs1001 -> be_rcs1001_eqiad
[11:26:37] be_palladium -> be_palladium_eqiad
[11:26:39] and so forth
[11:27:04] and be_appservers -> be_appservers_svc_eqiad
[11:27:30] the idea there is it allows us to define both in a single VCL file
[11:27:57] right now we're limited to defining one of the eqiad|codfw svc hostnames for whatever, in one VCL file (so in one datacenter)
[11:28:21] I think what brought it up was the codfw switchover testing for swift
[11:28:38] because we wanted to have both available and switch just thumbs over independently of originals
[11:28:58] hence the hack (in place of that broken commit) where we defined duplicate backends for cache_upload for thumbs + originals
[11:29:17] anything against using the FQDN with s/./_/ ? because that would be easy to do
[11:29:22] but yeah, in the long run for active:active, we need something like this
[11:30:15] before we also did s/-/_/ for some reason, too
[11:30:25] maybe variable-naming restrictions at some layer?
[11:30:35] possibly, yes
[11:30:44] s/before we also did/currently we do/ :)
[11:31:05] probably VCL itself doesn't allow dash in backend names or something
[11:32:22] mmmh
[11:32:56] food, bbl
[11:39:47] re: X-Cache, the other thing is the tie-in with loop-detection.
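The backend-naming change discussed between 11:19 and 11:31 can be compared side by side. This is a rough Python sketch rather than the confd template itself: both function names are hypothetical, the example FQDNs are assumed for illustration, and dropping a trailing `.wmnet` before substituting is one possible reading of the two `replace` calls quoted in the chat.

```python
def current_backend_name(fqdn: str) -> str:
    """Mimic the split/index approach: keep only the first DNS label."""
    first_label = fqdn.split(".")[0]
    # Dashes are mapped to underscores, as the chat says is currently done.
    return "be_" + first_label.replace("-", "_")


def proposed_backend_name(fqdn: str, drop_suffix: str = ".wmnet") -> str:
    """Use the whole FQDN so names from two datacenters no longer collide."""
    if fqdn.endswith(drop_suffix):
        fqdn = fqdn[: -len(drop_suffix)]
    return "be_" + fqdn.replace(".", "_").replace("-", "_")


if __name__ == "__main__":
    # Assumed example hostnames, chosen to match the names quoted in the chat.
    for host in ("appservers.svc.eqiad.wmnet", "appservers.svc.codfw.wmnet"):
        print(current_backend_name(host), "->", proposed_backend_name(host))
```

With the first-label scheme both service hostnames map to `be_appservers`, which is why only one of them can be defined per VCL file; the FQDN-based scheme keeps them distinct.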
[11:40:03] if we want to use X-Cache for that, it would have to change fundamentally to work on the requesting-side
[11:41:06] well really, skipping over a bunch of other thinking, the bottom line is X-Cache needs to work roughly like it does, and we can't use it for loop-detection heh.
[11:41:06] actually, I should start get going (amsterdam!) see you later o/
[11:41:12] have fun!
[11:42:02] we could use X-Forwarded-For for loop-detection. the only minor issue there is IPv4 vs IPv6. but the node could know both of its own IPs.
[11:43:33] oh, the other issue is XFF already looks like looping. in the case that nginx->fe->be stays on the same node, we get 3x XFF entries all the same.
[11:43:44] well 2x
[11:43:46] hmmmmm
[11:44:23] well, we can do something a lot like XFF, but only set in varnish-backends, and may as well use the hostname at that point
[11:45:13] not the hostname, the dcname
[11:45:38] so if a request normally flows ulsfo->codfw->eqiad, and eqiad tries to send it back to codfw...
[11:46:02] it would arrive at codfw for the second time with X-LoopDetect: ulsfo, codfw, eqiad, and codfw would see itself in the list and 503.
[11:46:40] (or some other appropriate error code)
[11:47:34] we could steal "508 Loop Detected" from WebDAV, although it's meant for a slightly different purpose
[12:01:18] https://gerrit.wikimedia.org/r/#/c/294478/ for ^
[12:16:13] Traffic, Operations, Patch-For-Review, codfw-rollout: Varnish support for active:active backend services - https://phabricator.wikimedia.org/T134404#2382315 (BBlack) Some further thinking: without changing the cache-level stuff discussed above, this would also support a config like: ``` restbase...
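The X-LoopDetect idea from 11:44 to 11:47 amounts to a per-datacenter check-then-append on a comma-separated header. The following is a minimal Python sketch of that logic, assuming the header format quoted in the chat; the function name and return convention are hypothetical, and it is not the VCL from the linked gerrit change.

```python
def loop_check(header, my_dc, loop_status=503):
    """Return (status, header_value) for a request arriving at my_dc.

    status is None when the request may be forwarded; otherwise it is the
    error status to return because my_dc already appears in the list.
    """
    seen = [dc.strip() for dc in (header or "").split(",") if dc.strip()]
    if my_dc in seen:
        # This datacenter already handled the request once: a loop.
        return loop_status, ", ".join(seen)
    seen.append(my_dc)
    return None, ", ".join(seen)


if __name__ == "__main__":
    header = None
    # Normal flow ulsfo -> codfw -> eqiad, then eqiad wrongly sends it back.
    for dc in ("ulsfo", "codfw", "eqiad", "codfw"):
        status, header = loop_check(header, dc)
        print(dc, status, header)
```

Passing `loop_status=508` would model the "508 Loop Detected" alternative floated at 11:47.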
[13:16:42] Traffic, Operations, Wikimedia-Stream, Patch-For-Review: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2382462 (BBlack) @ori @Krinkle - any thoughts or pointers on getting this tested more-broadly and then switching before the cert expiry date?
[14:06:11] Traffic, Operations, Services: Define a standardized config mechanism for exposing services through varnish - https://phabricator.wikimedia.org/T110717#2382584 (BBlack)
[16:00:09] Traffic, Operations, Patch-For-Review, codfw-rollout: Varnish support for active:active backend services - https://phabricator.wikimedia.org/T134404#2264674 (GWicke) >>! In T134404#2382315, @BBlack wrote: > Where `restbase.svc.wmnet` is defined in gdnsd and uses the closest underlying service end...
[16:02:18] Traffic, DBA, Labs, Operations, Tool-Labs: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2382892 (Antigng_) >>! In T137707#2380275, @jcrespo wrote: > BTW, the API is definitely faster, one just need to use it efficiently: > > > ``` > $ time curl 'htt...
[16:03:27] bblack: hangout?
[16:06:55] Traffic, DBA, Labs, Operations, Tool-Labs: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2382924 (Antigng_) >>! In T137707#2379997, @jcrespo wrote: > For the API part, I would like to add that API infrastructure (application servers and databases) is s...
[18:43:05] HTTPS, Traffic, Operations, Wikimedia-Stream: stream.wikimedia.org doesn't redirect to HTTPS - https://phabricator.wikimedia.org/T137915#2383513 (BBlack)
[18:43:32] Traffic, Operations: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#2383531 (BBlack)
[18:43:35] HTTPS, Traffic, Operations, Wikimedia-Stream: stream.wikimedia.org doesn't redirect to HTTPS - https://phabricator.wikimedia.org/T137915#2383532 (BBlack)
[18:45:31] Traffic, Operations, Performance-Team, Wikimedia-Stream, Patch-For-Review: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2383541 (ori) p:Normal>High
[18:45:42] bblack: i'll look asap
[18:48:45] ori: look at the other ticket first, that one's just a blocker-reminder :)
[18:49:09] oh sorry you're one step ahead of me, by other ticket I meant the cache_misc you just bumped heh
[18:50:24] Traffic, Operations, Patch-For-Review: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#2383564 (BBlack)
[18:54:35] Traffic, Operations: Clean up DNS/redirects for TLS - https://phabricator.wikimedia.org/T102824#2383571 (BBlack)
[21:46:24] Traffic, Operations, Phabricator, hardware-requests: codfw: (1) phabricator host (backup node) - https://phabricator.wikimedia.org/T131775#2383998 (mmodell)
[22:34:53] Krinkle: could you port one of your tools which use RCStream to hit misc-web-lb.wikimedia.org (but with Host: rcstream.wikimedia.org, so that misc varnish routes the request correctly)?
[22:35:04] see and the preceding comments
[22:35:26] ori: The ones I have use client-side JS.
[22:35:36] socket.io doesn't have a way to override the host header afaik
[22:35:46] I can use etc/hosts though I suppose
[22:36:08] yeah, that'd be great if you could
[22:36:56] 91.198.174.217 stream.wikimedia.org
[22:37:18] misc-lb instead of stream-lb
[22:39:48] 208.80.153.248 for me, but yes, that is correct
[22:39:52] ori: Seems to fail with Bad Gateway over ws:// protocol internally. Falls back to XHR polling, which seems to work (albeit with a redirect to http and then back to https)
[22:40:13] GET ws://stream.wikimedia.org/socket.io/1/websocket/776551653921 502 Bad Gateway
[22:40:58] hmm
[22:41:06] Using http://codepen.io/Krinkle/pen/laucI/?editors=0010
[22:41:09] see network
[22:43:47] Hm... seems to work fine now
[22:44:14] It was using http by default, which caused an issue. I don't know how common that is.
[22:44:22] I just updated this demo to use HTTPS instead.
[22:44:23] wss://
[22:44:37] https:// for socket.io, which will in turn result in wss://
[22:45:36] brandon found that most clients used http, so he figured we need to support that, at least for a little bit
[22:45:48] Yeah
[22:46:03] I didn't realise that it stayed over http
[22:46:13] I assumed for no reason that it would upgrade to https
[22:46:19] like a redirect
[22:46:27] But it never did
[22:46:46] the http fetch does redirect to https
[22:47:03] but socket.io doesn't see that happening, it continues with the hostname it is given (and defaults to port 80 like for the first fetch)
[22:49:17] maybe confusion here is about hostnames, I wasn't aware of "rcstream.wikimedia.org"
[22:49:30] actually that doesn't exist, just checked
[22:50:13] how are you using the codepen thing with IP changed?
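The behaviour described at 22:44:37 and 22:47:03 (the websocket URL is derived from the URL the client was given, not from wherever an HTTP redirect ended up) can be modelled in a few lines. This is an illustrative Python sketch of that derivation only; it is not socket.io's actual code, and the function name and default path are assumptions based on the request logged at 22:40:13.

```python
from urllib.parse import urlsplit, urlunsplit

# Scheme mapping described in the chat: http -> ws, https -> wss.
WS_SCHEME = {"http": "ws", "https": "wss"}


def websocket_url(configured_url, path="/socket.io/1/websocket/"):
    """Derive the websocket URL from the configured base URL (hypothetical).

    The scheme and host come from the configured URL, so a later
    http -> https redirect of the initial fetch never changes the result.
    """
    parts = urlsplit(configured_url)
    return urlunsplit((WS_SCHEME[parts.scheme], parts.netloc, path, "", ""))


if __name__ == "__main__":
    print(websocket_url("http://stream.wikimedia.org"))
    print(websocket_url("https://stream.wikimedia.org"))
```

This is why a client configured with an http:// URL keeps using ws:// unless the configured URL itself is changed to https://.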
[22:50:55] stream.wikimedia.org (Krinkle understood what I meant)
[22:51:06] he and I both edited /etc/hosts to point to misc-web-lb IPs
[22:51:11] ah ok
[22:51:15] nod
[22:51:41] and I can confirm from chrome://net-internals/#dns : stream.wikimedia.org IPV4 208.80.153.248
[22:51:44] I did the same here on my linux box, and if I switch the codepen code to http:// it still works
[22:52:07] it falls back to a different transport
[22:52:10] it's not using websockets
[22:52:12] stream.wm.o is exempted from the caches' http->https redirect code, because I found my earlier test clients would break on it
[22:52:36] well that's confusing then :)
[22:53:35] actually, for me it appears to work with websockets on http
[22:54:15] it worked for my little test clients, the ones from wikitech (python and local nodejs)
[22:54:31] unless those libraries have some silent fallback too
[22:54:35] Krinkle: is the 502 reproducible?
[22:54:38] it looked legit on the varnish logs side though
[22:54:54] * Krinkle re-tries
[22:57:52] interesting. the http fetch (for socket-io meta data) is redirected by HSTS preflight, not by our server. But HSTS, while active on *.wm.o, doesn't apply to ws:// apparently.
[22:57:57] Too bad :)
[22:58:02] oh yeah haha
[22:58:18] I wasn't thinking about all our browsers having HSTS already
[22:58:45] in any case, cli clients don't have that issue
[22:59:18] on chrome://net-internals/#sockets I only see an ssl socket for stream.wm.o (after etc/hosts change).
[22:59:23] I don't see the ws socket
[22:59:55] anyhow, yeah, it's working both ways
[23:00:05] I see the socket in network tab on the page, but not in chrome://
[23:00:13] I can't reproduce the bug I was seeing
[23:00:22] bblack@alaxel:~$ cat rcs.js
[23:00:23] var io = require('socket.io-client');
[23:00:23] var socket = io.connect('stream.wikimedia.org/rc');
[23:00:23] socket.on('connect', function() { socket.emit('subscribe', 'commons.wikimedia.org'); } );
[23:00:25] socket.on('change', function(data) { console.log(data.title); });
[23:00:28] bblack@alaxel:~$ nodejs rcs.js
[23:00:30] Category:Images from the New York Public Library
[23:00:33] ^ that's basically how I was testing
[23:00:40] bblack: looks good
[23:01:27] I'd say, with an announcement to wikitech and 1-2 days time this should be fine to apply without issues.
[23:01:31] re chrome anomalies, probably HSTS + HTTP/2 means it puts everything over the existing HTTP/2 connection (maybe even one it made for another of our sites in another tab heh)
[23:01:45] Yeah
[23:01:57] in theory ws can just be another stream inside an http/2 conn that's still used for other things
[23:02:04] donno if it's actually implemented like that, though
[23:02:18] I don't think so, but yeah, it could. But only if HTTPS though in that case.
[23:02:39] ok
[23:03:16] I don't think there's any real functional change for users (shouldn't be anyways), but we can warn them of the change, and use the opportunity to remind them to update things to https:// and/or wss://
[23:03:36] after we get through this, then we can tackle trying to close off insecure access
[23:03:54] which will probably require more warnings and hand-holdings, like the insecure post issue.
[23:04:06] because the js and python client libraries don't follow 301, so there's no soft landing
[23:05:00] it's wednesday anyways. post something to wikitech tomorrow, target actual switch to misc-web for monday?
[23:15:00] Krinkle, bblack: sounds good to me. who should write the message? i can volunteer, if you like.
[23:17:46] I'll assume it's on me (which is fine) unless one of you pings me to indicate otherwise
[23:19:54] ori: you sounds awesome to me :)
[23:20:27] np. thanks for taking care of this
[23:20:46] :)
[23:25:24] Traffic, Operations, Performance-Team, Wikimedia-Stream, Patch-For-Review: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2384236 (ori) >>! In T134871#2382462, @BBlack wrote: > @ori @Krinkle - any thoughts or pointers on getting this tested...