[03:18:57] XioNoX: if you mean on the varnishd listening sockets, it's the timeout_idle parameter. currently 5s on frontends, 120s on backends.
[03:19:58] ok
[03:20:17] I have some interesting packet captures to show you tomorrow
[03:21:33] but like within a minute, the cp sends like 60 tcp retransmissions to a single ip and port
[03:22:08] and all of them are answered with as many icmp unreachables
[03:22:47] (the same segment is being sent over and over)
[03:25:39] ok
[08:50:30] upload-ulsfo has been running with only 4 cache hosts for the past few days without much fuss :)
[08:50:37] I've repooled cp4021 now
[09:22:27] bblack: ok, now https://gerrit.wikimedia.org/r/#/c/388064/ should include everything we've got in varnishxcps!
[09:26:15] \o/ \o/
[09:46:10] disabling puppet on cache nodes to merge and test the X-Cache-Status patch https://gerrit.wikimedia.org/r/#/c/387817/
[09:52:15] looks good, re-enabling puppet
[10:43:48] so, the majority of requests in misc eqiad/codfw result in int-front
[10:43:50] https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&panelId=2&fullscreen&orgId=1&var-cluster=misc&var-site=eqiad&var-site=codfw
[10:44:50] those are mostly GET requests for http://stream.wikimedia.org/socket.io/1/ from UA:Java/1.8.0_60-ea and a single GCE IP
[10:46:20] see varnishncsa -n frontend -q 'RespStatus eq 301' -F '%{X-Cache-Status}o %{X-Client-IP}i %{User-Agent}i %r %s' on any misc eqiad/codfw host
[10:47:13] the TLS redirects aren't followed, but if they were they'd result in 404s
[10:49:16] this isn't creating any issues, but come on, they could have a bit of decency
[10:53:05] https://goo.gl/fjtA6D
[12:42:02] ema: yeah we actually killed that API, that's the old streams API that got replaced by eventstreams
[12:42:33] (quite a while ago)
[15:28:11] 10Traffic, 10Operations, 10Browser-Support-Internet-Explorer, 10Patch-For-Review, 10User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3733373 (10MattFitzpatrick)
[15:28:33] bblack: 32 retransmits in 32 seconds https://www.irccloud.com/pastebin/mNkE7IZR/
[15:29:00] (I changed the last two bytes of the client IP)
[15:30:54] is that happening a lot, or is it an isolated case?
[15:31:07] I mean, the host could legitimately have become broken/disconnected mid-transfer or whatever...
[15:42:57] bblack: that seems to happen a lot, especially in esams. What surprises me is the number of retransmits
[15:43:37] default /proc/sys/net/ipv4/tcp_retries2 is 15
[15:45:38] we're at the default on that
[15:45:54] yeah
[15:47:54] the hosts behind the EX4500 perhaps?
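(A side note on the tcp_retries2 default mentioned just above: per the kernel documentation, Linux turns the setting into a time budget rather than enforcing it as a literal packet count. The hypothetical schedule is built from a 200 ms base RTO doubling up to a 120 s cap, which is where the documented ~924.6 s timeout for the default of 15 comes from. The sketch below only reproduces that arithmetic, it is not the kernel code; the point is that an on-the-wire retransmit count above 15 isn't by itself a contradiction of the sysctl.)

    # Sketch of the arithmetic behind tcp_retries2 as described in
    # Documentation/networking/ip-sysctl.txt: the value is converted into a
    # hypothetical timeout, not a hard cap on retransmitted packets.
    TCP_RTO_MIN = 0.2    # seconds, base RTO of the hypothetical schedule
    TCP_RTO_MAX = 120.0  # seconds, cap on the exponential backoff

    def hypothetical_timeout(retries, rto_base=TCP_RTO_MIN):
        """Cumulative time covered by the initial send plus `retries`
        exponentially backed-off retransmissions."""
        total, rto = 0.0, rto_base
        for _ in range(retries + 1):
            total += min(rto, TCP_RTO_MAX)
            rto *= 2
        return total

    # Default tcp_retries2 = 15 gives the ~924.6 s figure from the kernel docs.
    print(hypothetical_timeout(15))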
[15:47:59] need to get rid of that
[15:49:43] why would the ex4500 be related to that though?
[15:52:10] because it has bad buffer management and may have packet drops
[15:52:16] it's always behaved weirdly
[15:52:21] and we've never really investigated it properly
[15:52:48] tcp performance was often lower for boxes on that switch, for example
[15:53:59] interesting, that might explain why only esams has a high number of retransmits
[15:54:30] but I believe only a few boxes are behind that switch these days
[15:54:36] most of the servers on that switch are decom'ed
[15:54:47] just do check
[15:54:58] it's rack OE11, off the top of my head
[15:55:11] on the other hand, I'm trying to understand why the host sends that many retransmits (more than any kernel setting allows): here more than 30 of the same packet, while all the limits seem to indicate that 15 should be the absolute max
[16:02:49] hum, that packet capture is on a cp not connected to that ex4500
[16:03:25] ok
[16:49:10] so the %{VSL:Timestamp:Resp}x thing works on varnish 4.1.8 and 5.2.0 but not on 5.1.3 because it hasn't been backported
[16:49:15] https://github.com/varnishcache/varnish-cache/issues/2479
[17:03:39] in other news, this should be enough I think to start sending varnish daemon logs to logstash: https://gerrit.wikimedia.org/r/#/c/388482/
[17:04:11] you mean varnish's syslog output, not "varnishlog", right? :)
[17:08:00] yes :)
[17:46:05] hello people, cp1045 seems to have caused some 503s
[17:46:31] well, all ores related it seems, might be the backend
[17:47:46] yeah
[17:48:01] chashing will do that (seems to implicate 1x varnish-backend), if the errors are all coming from the same URL
[17:48:51] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[17:49:04] usually, as you mentioned, 500s there are appserver
[17:49:25] 503s can be either, they're probably thrown by varnish, but then you have to figure out which side of the varnish<->app connection was at fault heh
[17:49:42] trying to follow up with Aaron in #operations :)
[18:35:01] so I'm going through the 5.1.3 -> 5.2.0 commit logs, out of curiosity but also because 5.1.3 is unsupported already heh
[18:35:11] bblack: is this as bad as I think? https://github.com/varnishcache/varnish-cache/commit/6621dc2b06e6d5d15452b1ea785af926b94459c2
[18:39:12] or does calloc just multiply under the hood and nobody cares really?
[18:41:56] calloc mostly just multiplies under the hood
[18:42:10] there are lots of good reasons to put the arguments in the right order, but at the end of the day it's not going to cause a runtime bug
[18:42:56] there was something else someone brought up about 5.2 being a potential compat issue as well, IIRC
[18:43:02] I dunno if jumping straight to 5.2 is making our lives net-easier or not
[18:44:36] also, I'm kind of surprised that 5.1 is already out-of-warranty, what gives with that?
[18:44:54] https://varnish-cache.org/releases/index.html
[18:45:24] oh it was elu.key mentioning that 5.2 says "The VSM API for accessing the shared memory segment has been totally rewritten. Things should be simpler and more general."
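(To illustrate the "chashing will do that" remark from 17:48 above: when backend selection hashes the request, every request for a given URL keeps landing on the same varnish-backend, so one unhealthy backend shows up as 503s for a specific set of URLs rather than as a uniform error rate. The sketch below is only the general consistent-hashing idea with made-up backend names and replica count, not Varnish's actual chash/shard director.)

    # Minimal consistent-hash ring: the same URL always maps to the same
    # backend, so a single broken backend only affects the URLs that hash
    # to it.  Illustration only; the backend pool here is hypothetical.
    import bisect
    import hashlib

    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, backends, replicas=100):
            # Multiple points per backend smooth out the load distribution.
            self._points = sorted(
                (_hash("%s-%d" % (b, i)), b)
                for b in backends
                for i in range(replicas)
            )
            self._keys = [h for h, _ in self._points]

        def backend_for(self, url):
            # Walk clockwise from the URL's hash to the next backend point.
            idx = bisect.bisect(self._keys, _hash(url)) % len(self._points)
            return self._points[idx][1]

    ring = HashRing(["cp1045", "cp1051", "cp1058", "cp1061"])
    print(ring.backend_for("/ores/v3/scores/enwiki"))  # always the same host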
[18:45:31] that's what I was thinking of, could be a PITA for varnishkafka
[18:45:45] ah there you go
[18:46:24] https://varnish-cache.org/releases/index.html doesn't have much to say about policy really :P
[18:47:47] anyways, "supported" doesn't have a very concrete or meaningful meaning in the context of our usage of Varnish
[18:48:17] 5.1.0 release was just under 7 months ago
[18:48:36] 5.0.0 ~14 months back
[18:49:00] and 5.2.0 just came out <2 months ago
[18:49:22] seems like at least 5.1.x should still get support/releases by any sane standard, given 5.2.x has breaking changes
[18:49:28] yeah, it all started with me wondering about the varnishncsa format string differences between 4.1.8 and 5.1.3; I've been told that there's a fix in 5.2.0 and that 5.1.3 is "retired"
[18:50:15] but I do agree with you that it hasn't been out working long enough to enjoy retirement already :)
[18:52:24] it must be from
[18:56:08] ema: while you're here (shouldn't you already be enjoying a weekend?) - on the varnishxcps VCL stuff... I think I'm going to make a further change to the existing stats, and deprecate/kill output of the older stats, like in the next few days.
[18:56:20] ema: so may as well stall on that for a little bit and then make an easier transition
[18:58:21] the alignment of history between the old+new statsd stats is good enough in the current tls-ciphers + tls-ciphersuite-explorer grafanas that we can kill the old one, anyways.
[18:58:51] and I'm realizing now that the new hierarchical stats have some minor issues that haven't hit us yet because we have no TLSv1.3, but better to plan for it now with some minor changes.
[18:59:30] sounds good!
[18:59:52] (the dumbest part being that the actual "cipher" part of the chapoly strings changes from CHACHA20-POLY1305 to CHACHA20-POLY1305-SHA256 in TLSv1.3 just to make life fun, even though it's the exact same algorithms)
[19:00:59] the options there are either strip the -SHA256 from the TLSv1.3 variant (sucks to make the newest stuff the one with hacks, though), or hack it into the end of TLSv1.2 chapoly strings (better, but TLSv1.2 will be around a while too).
[19:01:26] or just split out hmac as another separate field like keyexchange, auth, cipher
[19:01:52] anyways, I'll figure it out Soon
[19:02:52] hmac isn't really even the right name for it, I dunno if there's one name for it that fits all historical cases
[19:03:08] could just call it "hash" I guess
[19:05:08] thanks for reminding me that the weekend should have started already! see you! :)
[19:21:59] cya :)
[22:53:56] 10Traffic, 10Commons, 10Multimedia, 10Operations, and 2 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3331800 (10Tgr) Preventing WP0 users from doing file patrolling seems like an acceptable level of collateral damage (maybe with a wiki wh...
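(On the keyexchange/auth/cipher/hash split discussed around 19:00-19:03 above: one way to keep the chapoly cipher name identical across TLS versions is to peel a trailing hash token off the cipher string into its own field. The sketch below only illustrates that option; the field names, example inputs and normalization rule are assumptions, not the actual varnishxcps change.)

    # Rough sketch of the "split the hash out as its own field" option:
    # strip a trailing hash token so TLSv1.2's CHACHA20-POLY1305 and
    # TLSv1.3's CHACHA20-POLY1305-SHA256 report the same "cipher" value.
    # Field names and example inputs are made up for illustration.
    KNOWN_HASHES = ("SHA384", "SHA256", "SHA")

    def split_suite(version, keyexchange, auth, cipher):
        hash_ = ""
        for h in KNOWN_HASHES:
            if cipher.endswith("-" + h):
                cipher, hash_ = cipher[:-(len(h) + 1)], h
                break
        return {
            "version": version,
            "keyexchange": keyexchange,
            "auth": auth,
            "cipher": cipher,
            "hash": hash_,  # empty when the suite name doesn't spell it out
        }

    # TLSv1.2 and TLSv1.3 chapoly now end up with the same "cipher" value:
    print(split_suite("TLSv1.2", "ECDHE", "ECDSA", "CHACHA20-POLY1305"))
    print(split_suite("TLSv1.3", "X25519", "ECDSA", "CHACHA20-POLY1305-SHA256"))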