[02:55:01] Traffic, ArchCom-RfC, Commons, MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3048748 (Tgr) >>! In T66214#3030371, @cscott wrote: > The one exception is that the current thumbnail code arbitrarily quantizes sizes in order to reduc...
[08:06:26] Hello! I am running tcpdump on cp3010 to dump some traffic, I'd like to know what happens with those Fetch errors EOF
[08:17:47] ah Ipsec! Never encountered with tcpdump, nice..
[08:48:31] Summary of what I have been investigating in T154558 if anybody has new ideas
[08:48:32] T154558: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558
[08:50:01] from oxygen it is clear that there is a constant rate of 503s coming from piwik (bohrium in cache:misc). I placed varnishlog everywhere and the main theme seems to be "FetchError http first read error: EOF" emitted by the Varnish backends.
[08:51:10] originally the issue seemed to be traceable to pass(es) up to cp1* hosts emitting "FetchError no backend connection"
[08:51:24] (cp1* backends to be precise)
[08:51:54] After some tweaks on bohrium's apache it seems that the cp1* backend FetchErrors are gone
[08:52:12] but I keep seeing "FetchError http first read error: EOF" everywhere else
[08:54:25] timestamps are not helping, the failures are always happening after ~0.05s, so no match with first_byte_timeouts etc..
[08:54:34] ah no 50X on bohrium's apache
[08:55:51] from varnishadm I can see "timeout_linger 0.050 [seconds] (default)" but it is surely a red herring, I might want to see a root cause where it is not there
[08:57:08] nice to notice, varnishlog -q FetchError shows other similar issues with EOF not related to piwik
[08:58:23] Last thing that I tried was tcpdump on cp3010 but I've hit the ipsec wall, and to be honests I am a bit ignorant about how to circumvent it (or maybe I am missing something trivial)
[09:00:36] the EOF error to me seems a socket error while trying to read from connection X
[09:00:50] but atm I am a bit out of ideas :)
[11:32:09] elukey: https://wiki.strongswan.org/projects/strongswan/wiki/CorrectTrafficDump
[11:32:14] tl;dr: not easy
[11:32:36] oh yes I saw it and decided not to attempt anything on cp hosts :D
[16:08:06] bblack: something like this? https://gerrit.wikimedia.org/r/#/c/338953/
[16:08:19] (applayer backend probes)
[16:19:01] ema: yeah I think so
[16:19:17] I'm not yet sure what to do about the app_def_be_opts merge thing
[16:19:23] (which is not this change)
[16:19:54] probaby, unify the defaults across clusters, and then put the defaulting inside the VCL template
[16:20:04] so we avoid the manifest-level merge and can stuff the data over in hieradata
[16:21:23] yeah the merges aren't great
[16:21:52] before the merges, there were defaults at multiple levels heh
[16:22:07] I think I killed one in favor of the other in some earlier simplification, but picked the wrong one
[16:22:24] the other option of course is to not have defaults and explicitly list it for everything
[16:22:38] 'port' => 80,
[16:22:38] 'connect_timeout' => '5s',
[16:22:38] 'first_byte_timeout' => '185s',
[16:22:39] 'max_connections' => 100,
[16:22:45] ^ adding that to every backend def, basically
[16:23:23] also not great :)
[16:23:44] but VCL-level defaults aren't awful
[16:23:56] so the idea would be to put the defaults in the VCL and then allow to override those in hieradata?
[16:24:05] I'd rather the defaults were in hieradata though
[16:24:29] yeah like your "if director.key?('probe')"
[16:24:46] it would be if key port else hardcoded-default
[16:25:08] or I guess, we can put per-cluster defaults into hieradata too and have the VCL ref that
[16:25:27] even better yep
[16:25:39] the more we can move data to hieradata instead of manifests or VCL, the easier it is to think about porting things
[16:27:03] the ideal state would be that our VCL code/templates handle VCL-isms but have no real "data" hardcoded in the templates
[16:27:28] and our hieradata structures describe all of the functionality of the cache abstractly, to be taken in by ATS / lua to do the same stuff
[16:27:55] if wishes were fishes we'd never be hungry though
[16:28:06] :)
[16:28:07] there's lots of holes in that and we'll never reach the ideal, but it's nice to aim towards
[16:32:03] speaking of lua, I was thinking of providing a hiera flag to enable/disable lua support in tlsproxy
[16:32:16] any reasons not to?
[16:33:20] seems, reasonable?
[16:33:31] me and godog are taking a look at https://github.com/knyar/nginx-lua-prometheus/ which seems interesting
[16:33:36] ok
[16:33:56] tlsproxy is getting a bit overloaded in general, it will be interesting to see if we can keep using it for all such things while adding more special cache-related things
[16:34:05] but no reason not to keep expanding things for now
[16:35:10] cool
[16:38:11] Traffic, ArchCom-RfC, Commons, MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3050017 (GWicke) I think the discussion about restricting thumbnail sizes is orthogonal to this RFC. Nothing in this RFC limits our ability to later a)...
[17:27:42] <_joe_> bblack: regarding DNS discovery, do you think it would be wise to first test it on a standalone gdnsd instance, instead of enabling a new plugin on all of our main auth dns servers?
[18:01:47] _joe_: it would be wise to test the patch separately before merging it up to the primary authdns, yes
[18:02:12] _joe_: but the plugins are well-behaved, it shouldn't be a stability issue
[18:02:29] more about broken config in the patch or whatever
[18:08:23] <_joe_> bblack: ack, I trust your judgement
[18:11:08] <_joe_> I hope to get started with that tomorrow
[18:17:10] _joe_: awesome
[18:19:43] <_joe_> yeah no promises
[18:19:56] <_joe_> today I got down the rabbithole of a puppet refactor
[18:26:20] yeah I'm doing some of that today too
[18:33:19] how it usually goes for me http://i.imgur.com/cUeQ1Up.gifv
[18:36:09] :)
[18:39:39] Traffic, ArchCom-RfC, Commons, MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3050554 (GWicke)
[18:42:19] Traffic, ArchCom-RfC, Commons, MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3050575 (GWicke) >>! In T66214#2981032, @Gilles wrote: > Accept headers and Vary: Accept are missing from the current task description. I added a secti...
[18:51:09] is it correct that our varnishes do not cache POST queries?
[19:12:27] SMalyshev: I think that's correct
[19:13:18] SMalyshev: there is some support for POST-caching in Varnish4/5 in general, but it's non-default and limited in scope
[19:13:31] (and AFAIK we haven't tied to turn it on, and probably want to avoid that if we can)
[19:13:37] s/tied/tried/
[19:17:58] bblack: ok, thanks. since we've enabled POST queries for query.wikidata.org now it may be useful to cache those too maybe... though I'm not sure how efficient/worth it that is.
[21:00:26] HTTPS, Traffic, Operations, fundraising-tech-ops: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3051267 (RobH) I've emailed the new public cert and private key file (the key being pgp encrypted) over to @EWilfong_WMF.
[22:06:27] HTTPS, Traffic, Operations, fundraising-tech-ops: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3051412 (EWilfong_WMF) Thanks, @RobH, the new cert is in place on benefactorevents.wikimedia.org.
[22:06:59] HTTPS, Traffic, Operations, fundraising-tech-ops: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3051417 (RobH) Open>Resolved
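
Note on the 16:19-16:25 backend-defaults discussion: the "if key port else hardcoded-default" idea, with per-cluster defaults kept as data rather than in manifests, might look roughly like the Ruby sketch below. This is an illustration only, not the actual puppet or VCL template code. The four default values are the ones quoted in the log; the names CLUSTER_DEFAULTS and resolve_backend, the 'host' key, and the example hostname are all hypothetical.

```ruby
# Hypothetical per-cluster defaults, standing in for values that would be
# read from hieradata. Only the four key/value pairs come from the log.
CLUSTER_DEFAULTS = {
  'port'               => 80,
  'connect_timeout'    => '5s',
  'first_byte_timeout' => '185s',
  'max_connections'    => 100
}.freeze

# Resolve one backend definition: any key the backend sets explicitly wins,
# every other key falls back to the cluster default. Keys with no default
# (for example 'probe') stay absent unless the backend defines them, which
# is what a template check like key?('probe') relies on.
def resolve_backend(backend, defaults = CLUSTER_DEFAULTS)
  defaults.merge(backend)
end

# Example: a backend that overrides a single timeout (hostname is made up).
resolved = resolve_backend({ 'host' => 'appserver.example', 'first_byte_timeout' => '65s' })
puts resolved.inspect
puts "probe configured: #{resolved.key?('probe')}"
```

With this shape the defaulting happens in one place at render time, so there is no manifest-level merge and each backend definition lists only what differs from its cluster.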