[08:33:24] uhm for some reason `service nginx configtest` is failing on misc (and only on misc)
[08:33:51] duplicate listen options for [::]:443 in /etc/nginx/sites-enabled/unified:9
[08:34:09] the same config works fine on other clusters
[08:34:48] moritzm: have you ever encountered this during nginx rolling restarts? ^
[08:40:47] ooh, misc has more stuff under sites-available, that might be the reason
[08:42:56] don't remember that, but /etc/nginx/sites-available would've been my guess as well
[08:51:28] fascinating, removing fastopen=150 fixes the problem
[08:55:34] https://github.com/jfryman/puppet-nginx/pull/330#issuecomment-52406683
[08:57:18] right so we should only include it `if @default_server`
[09:22:58] Comodo trying to trademark Let's Encrypt https://letsencrypt.org//2016/06/23/defending-our-brand.html
[10:29:27] 10Traffic, 06Operations, 13Patch-For-Review: Investigate TCP Fast Open for tlsproxy - https://phabricator.wikimedia.org/T108827#2404243 (10ema) We're currently running with TCP Fast Open enabled on all tlsproxies, limiting the number of concurrent pending TFO requests to 150 to mitigate the risk of Resource...
[11:55:06] ema: sorry I should have noticed that yesterday
[11:55:23] in general, all of the listening options have to be default-server only
[11:58:22] (basically the default server is the one that triggers actual creation of sockets and setting global setsockopt-level stuff, the non-default servers just use the socket)
[12:00:17] ema: might want to add a note to tlsproxy's use of fastopen on the http-redirect port, or someone might cargo-cult it into other puppetized nginx...
[12:00:39] in general TFO isn't compatible with unencrypted HTTP, unless you're very careful about a lot of other things.
[12:01:09] tlsproxy is an exception because all of its unencrypted HTTP requests are answered idempotently with either 301 or 403
[12:01:15] bblack: good point
[12:01:18] (in a very strict sense, it can't do anything else)
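To make the morning's fix concrete, here is a minimal sketch of the listen arrangement being described; the surrounding server blocks are illustrative, not copied from the real unified/tlsproxy template. Socket-level listen options such as fastopen= belong only on the default_server listen line, since that is the server that actually creates the socket; repeating them on another server block for the same addr:port is what triggers the "duplicate listen options" error, hence the `if @default_server` guard mentioned above.

```bash
# illustrative nginx fragment, shown here as comments (not the real config):
#
#   server {
#       # default server: creates the socket, so it carries the socket options
#       listen [::]:443 ssl default_server fastopen=150;
#   }
#   server {
#       # additional servers just reuse the socket: no fastopen/backlog/etc. here
#       listen [::]:443 ssl;
#   }
#
# and always check the rendered config before any reload or restart:
service nginx configtest    # equivalently: nginx -t
```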
[12:03:23] is there a way to reset netstat's counters for tcpexts or in general?
[12:03:31] it would be nice to toss out the ones from before it was turned on
[12:03:36] my approach has been rebooting so far :(
[12:03:48] on my laptop of course
[12:05:39] I found out how to do something similar on HP/UX though :)
[12:09:06] 34 degrees here, I'll go out for a while enjoying the heat
[12:09:08] see you later!
[12:09:26] It would be nice to reboot for tcpmhash_entries someday anyways, but it's not a strong enough argument to go rebooting everything, I kinda figured we'd wait for another kernel update as a reason
[12:09:43] ema: see you later. I can't fathom "enjoying the heat" lol
[12:18:19] <_joe_> bblack: you don't live in berlin ;)
[12:32:02] some days I walk outside to do yardwork and by the time I've walked to the shed my shirt is wet with sweat already
[12:32:42] "some days" being most days, except for the rare cloudy days, and the slightly-less-rare days when it rains torrentially
[13:17:56] ema: for when you return: since I turned tcp_metrics back on the other day ( https://gerrit.wikimedia.org/r/#/c/295723/2/modules/role/manifests/cache/perf.pp ), "ip tcp_metrics" now shows useful stuff on caches, too :)
[13:18:22] ema: it's supposed to show per-client-ip TFO info as well, but I think our iproute2 is too old to know how
[13:18:54] still, it's fascinating to dig through "ip tcp_metrics show" and look at e.g. how the average metrics-saved cwnd differs from initcwnd 10 and such
[13:19:25] it may be that what it shows on some lines as "metric_5" and "metric_6" is tfo-related
[13:19:36] but maybe not, not sure yet
[13:21:44] busy servers are showing ballpark 100K tcp_metrics entries, which supports the (already-merged) patch to raise tcpmhash_entries from 16K to 64K (which would reduce average hashtable collision chains from ~6 to ~1.5)
[13:22:12] but also means it's not horribly-critical. avg 6 isn't awful, just non-ideal.
[13:22:28] (can wait for next natural reboots)
[13:26:16] I ran some stats on all the saved cwnd values on cp3040:
[13:27:05] 95% of the saved cwnd are 10 or higher (no worse than initcwnd), 5% are <10 (meaning congestion/loss remembered a cwnd worse than default initial value, which may or may not be a good thing)
[13:27:54] the overall average saved cwnd is 30. the average of the 95% cwnd10+ group is 32, and the average of the 5% sub-cwnd10 group is 6
[13:28:48] the saved RTTs have to be useful for getting congestion/pacing right earlier in the connection, too
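The exact commands behind those cp3040 numbers aren't in the log; below is a rough sketch of how saved-cwnd stats like these can be pulled out of the kernel's metrics cache, assuming nothing beyond iproute2's `ip tcp_metrics` and awk.

```bash
# how many entries the cache is holding (busy caches above were ~100K)
ip tcp_metrics show | wc -l

# average saved cwnd and share of entries at or above the initcwnd of 10
ip tcp_metrics show \
  | grep -o 'cwnd [0-9]*' \
  | awk '{ n++; sum += $2; if ($2 >= 10) ge++ }
         END { if (n) printf "entries=%d avg_cwnd=%.1f at/above initcwnd 10: %.1f%%\n",
                             n, sum/n, 100 * ge / n }'
```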
[13:36:47] I restarted my varnishlog experiment again this morning and on cp300[89] I can see a reduction of VSL timeouts with -T 600 (~ half of the ones registered with the default 120 seconds). With -T I don't see timeouts related to releases.tar.gz anymore, but only socket.io-related reqs
[13:37:07] (With -T 600)
[13:38:55] I am wondering two things:
[13:39:49] 1) what is the downside of setting a high -T with vsl-related consumers? Surely not head of line blocking issues, but is there anything else to consider?
[13:41:25] 2) Is there the possibility that Varnish worker buffers in misc are not flushed often enough, causing delays? (for example, one batch of log tags related to a req gets flushed and the rest takes long because it needs to wait for another round of the buffer filling up)
[13:55:37] 10Traffic, 06Operations: Backport iproute2 4.x from debian testing -> our jessie - https://phabricator.wikimedia.org/T138591#2404751 (10BBlack)
[13:59:08] bblack: oh I thought `ip tcp_metrics show` only reported client-side info
[14:04:40] it reports whatever we've saved about peers in general
[14:05:00] but before re-enabling tcp metrics, it only had basic route info (as in, which of our addrs that peer last communicated with)
[14:05:24] how it has rtt, cwnd, ssthresh, etc that are saved between connections from the same client
[14:05:30] s/how/now/
[14:05:51] but no TFO stuff then given that it gets cached by the peer initiating the connection
[14:06:46] I think we would have tfo info on our side of some kind in there, if our iproute2 was new enough
[14:07:38] or not? not even fo_mss?
[14:08:47] in any case, on my client side, chromium-on-linux with the chromium flag set manually, and a 4.x kernel with default settings, does TFO with us and shows it in client-side tcp_metrics
[14:09:01] 208.80.153.224 age 577.676sec cwnd 10 rtt 37331us rttvar 26778us fo_mss 1460 fo_cookie XXXXXX
[14:09:07] ^ text-lb.codfw
[14:09:23] are there any clients that do TFO by default?
[14:09:32] UA/OS combinations
[14:09:34] we're not really sure yet :)
[14:09:41] probably no widely-deployed ones
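As a sketch of how to answer the "does my client do TFO?" question by hand (assuming a Linux client with curl >= 7.49; the target URL is just an example of something fronted by the tlsproxies, not prescribed anywhere in the log):

```bash
# enable client-side TFO on the test box (bit 0 = client, bit 1 = server)
sudo sysctl -w net.ipv4.tcp_fastopen=1

# the first request fetches a TFO cookie; repeating it should then send the
# TLS ClientHello in the SYN payload
curl --tcp-fastopen -so /dev/null https://en.wikipedia.org/
curl --tcp-fastopen -so /dev/null https://en.wikipedia.org/

# the negotiated cookie ends up in the local metrics cache, like the
# text-lb.codfw entry quoted above
ip tcp_metrics show | grep fo_cookie
```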
[14:09:48] I think only the client is concerned with storing the server mss, but yes I've also seen that info with tcp_metrics on my machine! really cool
[14:09:50] possibly windows preview/developer builds of Edge already do
[14:10:28] it's also possible some chrome+android versions do it by default already, too
[14:11:14] root@cp4016:~# grep '^TcpExt:' /proc/net/netstat | cut -d ' ' -f 90-95 | column -t
[14:11:17] TCPFastOpenActive TCPFastOpenActiveFail TCPFastOpenPassive TCPFastOpenPassiveFail TCPFastOpenListenOverflow TCPFastOpenCookieReqd
[14:11:20] 0 0 734 79 0 78846
[14:11:26] 4016 faces asia + western us, so a little different than esams
[14:11:38] but since the enable of TFO in non-esams, it's had 734 TFO successes
[14:11:47] interesting
[14:11:50] esams has lower numbers IIRC
[14:12:08] in general, a lot of the hosts show more passivefail than passivesuccess
[14:12:15] which could be from e.g. mobile CGN
[14:12:22] (client IP shifting a lot)
[14:12:46] https://gerrit.wikimedia.org/r/#/c/295900/ <- if the patch works as intended we can build a dashboard for these stats :)
[14:13:23] I just wish we could get a reset on CookieReqd without a reboot, since those numbers have been incrementing since long before we turned it on, so no good comparison to open/openfail
[14:13:38] although I guess if we have rate graphs, we can still correlate the rate of increase across that
[14:17:33] what I do see on our server-side is metric_5 and metric_6 with no explanation (but newer iproute2 might know what they are)
[14:17:47] seems too common in the stats to be TFO though
[14:19:50] I'll be repooling esams soon fwiw
[14:27:58] 10Traffic, 06Operations, 06Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2404878 (10BBlack) Since the last update (past ~4 days): New usernames: ``` Electron_Bot Pahles KSFT Amalthea_(bot) Qsx753698 AlphamaBot `...
[14:28:42] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2404879 (10akosiaris) 05Open>03Resolved a:03akosiaris This is finally fixed in rOPUPb3ef0ad. labs VMs in https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_re...
[14:29:30] 10Traffic, 06Operations, 13Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2404883 (10Krinkle) @BBlack I agree that technically "Not Modified" is a lie from MediaWiki in that case, but I'm not convinced that behaviour is wrong or needs changing. In many cas...
[14:42:27] 10Traffic, 06Operations, 13Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2404938 (10BBlack) Well, it's certainly legal from some point of view. But if you want to claim Not Modified on what are considered minor non-breaking changes then you have to live w...
[14:47:50] bblack: puppet doesn't really complain if we mess up the nginx config, does it?
[14:48:12] perhaps we could notify `service nginx configtest` upon modification?
[14:48:37] ema: that's not a bad idea
[14:48:46] I think nginx is still a submodule (ewwwww)
[14:48:58] bblack: how about an icinga check? :P
[14:49:34] also: service nginx restart does not check if the config is valid before restarting
[14:49:34] puppet's probably better
[14:49:56] modules/nginx/manifests/init.pp has a "managed" parameter that tells whether to do config changes immediately
[14:50:06] so a "safer" restart would be `service nginx configtest && service nginx restart`
[14:50:16] we could extend that a bit to support the configtest notify and have tlsproxy use that
[14:50:41] 10Traffic, 06Operations, 13Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2404952 (10Krinkle) For as long as I can remember (at least 6 years), we've made countless breaking changes based on the basic assumption that caches roll over within ttl ("30 days")....
[14:50:48] having puppet do just a plain nginx config reload and notice the failure would be ok too
[14:50:54] right
[14:50:56] we just don't want puppet triggering nginx "restart"
[14:51:06] yeah, surely not :)
[14:51:31] some changes require restart (which we usually do manually with "upgrade" for a zero-loss restart)
[14:51:50] but I guess we can't ever really automate the distinction, and we don't want to do upgrade on every config change that only needs a reload
[14:51:59] yep
[14:52:26] I think triggering a reload should be fine, at least to be notified of broken nginx configs
[14:52:38] FWIW reload happens hourly anyways though, for OCSP Stapling updates, from a cronjob
[14:53:00] so triggering a reload is no worse than that on the automation front, and arguably better due to immediate puppetfail on bad config
[15:11:58] uh after this whole discussion I went ahead and implemented configtest instead of reload :)
[15:12:14] I guess the heat also has negative effects on the ability to focus
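Roughly what the configtest-on-change approach boils down to on the shell side (a sketch only; the actual wiring is a puppet notify in the nginx/tlsproxy modules, whose resource names are not shown here):

```bash
# validate the rendered config; a non-zero exit makes the puppet run fail
# loudly instead of leaving a broken config around for the next reload
service nginx configtest

# config-only changes are picked up with a reload (the same operation the
# hourly OCSP stapling cron already performs):
service nginx configtest && service nginx reload

# changes that really need a restart stay manual, using the zero-downtime
# "upgrade" action rather than a plain "restart":
service nginx configtest && service nginx upgrade
```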
[15:27:06] 10netops, 06DC-Ops, 06Operations, 10ops-esams: Set up cr2-esams - https://phabricator.wikimedia.org/T118256#2405121 (10faidon)
[15:31:44] :)
[15:49:51] bblack: I don't know what I'm doing, but: https://grafana-admin.wikimedia.org/dashboard/db/tcp-fast-open
[15:51:30] I think you want .rate? I'm not even sure
[15:51:54] btw I've mostly given up prefixing everything I say with "I don't know what I'm doing", I figured it's just implied by now :)
[15:52:33] ema: editing it for a bit...
[15:52:44] bblack: please go ahead!
[15:57:24] hmm ok I guess diamond just logs raw values, not samples with rates and such..
[15:58:13] oh maybe it is a rate, but that's all it is
[15:58:56] 10netops, 06Operations: Network ACL rules to allow traffic from Analytics to Production for port 9061 - https://phabricator.wikimedia.org/T138609#2405243 (10elukey)
[16:01:03] 10netops, 06Operations: Network ACL rules to allow traffic from Analytics to Production for port 9060 - https://phabricator.wikimedia.org/T138609#2405270 (10elukey)
[16:02:15] or per-minute, no idea
[16:02:28] it's definitely a rate-per-something, and CookieReqd shows up as zero
[16:03:44] ema: reload, I put it back to just cp* all in one for now while working out the rest so it's not 4x things to edit, and made it show all TCPFastOpen*, and called it events/sec but I have no idea what the rate really is
[16:03:50] is CookieReqd really not going up?
[16:04:34] I gotta run out, but look at the query as it is now and work from there?
[16:04:46] bblack: if it is going up, it's not growing fast
[16:05:07] bblack: thanks! I'll see what I can do without getting a BSc in grafana hopefully
[16:05:15] https://graphite.readthedocs.io/en/latest/functions.html <- helpful in building up the metrics queries manually, the dropdowns have lots of limitations/confusion.
[16:05:28] (but also, our graphite might be behind 0.10.0, so some may not work as documented)
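For the dashboard, letting graphite derive rates from the raw counters is probably the path of least resistance. The query below is purely hypothetical: the servers.cp*.tcp.* path is a guess at the diamond naming and not copied from the real dashboard, and nonNegativeDerivative yields change-per-interval rather than strictly per-second, so the units depend on the collection interval.

```bash
# hypothetical render-API query against an assumed diamond metric path;
# nonNegativeDerivative turns ever-growing counters like TCPFastOpenPassive
# into per-interval rates, which also makes CookieReqd's huge pre-enable
# baseline irrelevant for comparisons
curl -sG 'https://graphite.wikimedia.org/render' \
  --data-urlencode 'from=-1h' \
  --data-urlencode 'format=json' \
  --data-urlencode 'target=aliasByNode(nonNegativeDerivative(servers.cp4016.tcp.TCPFastOpen*),3)'
```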