[07:54:13] bblack: should we consider some form of autoremediation restarting after a certain value at this point? Or gerrit/376751 will mostly fix it already and will be deployed soon?
[08:50:47] elukey: thanks for taking care of T175473!
[08:50:47] T175473: Multiple 503 Errors - https://phabricator.wikimedia.org/T175473
[08:57:16] ema: :)
[08:58:03] volans: normally I'd say that cache backend autorestarts based on certain criteria are more likely to cause an outage than preventing it
[08:58:24] +1
[08:58:48] on the other hand it is true that the issue is coming up quite often now unfortunately :(
[09:01:14] yeah, ofc if we go in that direction should be with a lot of safe checks, and must be cluster-aware
[09:06:32] bblack: the comment here is fantastic https://gerrit.wikimedia.org/r/#/c/376751/1/modules/varnish/templates/vcl/wikimedia-backend.vcl.erb
[09:07:03] I like in particular the part where hash_ignore_busy is described as a cheaper hfp :)
[09:24:37] https://gerrit.wikimedia.org/r/#/c/376665/ merged, I'll update the LVS dashboards in a week or so once the new metrics will have some history as godog suggested
[09:25:14] the end result should be faster loading times for all relevant dashboards, and I imagine lesser strain on the server side too?
[09:28:50] yeah, less datapoints to fetch
[09:41:54] ema, bblack: modules/profile/manifests/lvs.pp are defining two salt::grains (lvs and lvs_class), can these be removed? I suppose you've moved to using cumin for all your cluster work?
[09:43:02] moritzm: yes, we're happy cumin users by now :)
[09:43:09] lol
[09:45:02] ok, I'll prepare a patch for these later on :-)
[12:30:41] alright, finally the future parser is happy https://gerrit.wikimedia.org/r/#/c/376242/
[12:30:48] bblack, _joe_ ^
[13:47:33] ema: I split my commit to just do (more) stats fixes separately first: https://gerrit.wikimedia.org/r/#/c/377255/
[13:48:36] also, I removed the hash_ignore_busy for now from the main patch.
It seems like it's probably not useful? hits don't stall, and misses are (instantly, without talking to a backend) converted to passes, which also don't stall. h_i_b would only affect a miss-fetch, which we're never doing in this case.
[13:49:38] well, the exception to the above thinking would be the blending of local-be + remote-be fetches of the same object
[13:50:27] an eqiad backend might be doing a normal miss-fetch (cacheable) of ObjectA for a client of an eqiad frontend, and simultaneously a request hits the same eqiad backend from a ulsfo backend on behalf of a ulsfo client.
[13:51:09] with h_i_b the ulsfo-derived fetch would proceed in parallel as a pass, without it the ulsfo fetch will stall and wait on the cache hit being generated for the local client. seems like a fine idea to not use h_i_b in this case though.
[13:52:10] if the timing were reversed, ulsfo's request would already be processing a stall-free pass when the local request comes through and does a true miss-fetch. the object still ends up in cache, and still nobody stalls that shouldn't.
[13:55:33] (all the cases where it's concurrent requests from multiple remote backends would be stall-free concurrent passes regardless of h_i_b. or concurrent stalls on the same miss-fetch for a local client)
[14:09:36] bblack: looking
[14:11:49] bblack: mmh I don't think ^hit$ is correct for hit-front?
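The hash_ignore_busy reasoning in the discussion (hits and passes never stall on a busy object; only a true miss-fetch waits, and h_i_b turns that wait into a parallel pass) can be modeled as a tiny decision function. This is a hypothetical illustration of the stall semantics being described, with invented names, not Varnish code:

```python
# Model: what a request does when another fetch of the same object is
# already in flight. `lookup` is the cache-lookup outcome for this request.
# (Invented names; a sketch of the reasoning above, not Varnish internals.)
def request_outcome(lookup, hash_ignore_busy):
    if lookup == "hit":
        return "serve"   # hits never stall
    if lookup == "pass":
        return "pass"    # passes fetch independently, no stall
    # lookup == "miss" is the only case hash_ignore_busy changes
    return "pass" if hash_ignore_busy else "stall"

# the eqiad/ulsfo case above: a remote-derived fetch arriving while a
# local miss-fetch of the same object is in flight
assert request_outcome("miss", hash_ignore_busy=True) == "pass"
assert request_outcome("miss", hash_ignore_busy=False) == "stall"
```

Since this config converts misses to passes before any backend fetch, the miss branch is never reached in practice, which is the argument for dropping h_i_b.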
[14:12:07] record is in the format "hit, hit"
[14:12:24] ema: re_simplify from above is applied first
[14:13:00] the original record is like "cp1111 miss, cp2222 miss, cp3333 hit", or for a frontend-only request "cp4444 hit"
[14:13:18] re_simplify turns those into "miss, miss, hit" and "hit", respectively
[14:14:17] the ", " separators come from the original VCL string, and aren't mucked with by the simplifier
[14:15:37] I guess my example is misleading, as the first one would be backwards from that in practice if you're looking at my fake numbers
[14:15:57] err no, ignore that last line
[14:16:10] the misses and hits are backwards from common view, though
[14:16:44] a realistic one might be "cp1071 hit, cp3037 miss, cp3038 miss" for a hit-remote
[14:16:54] or "cp3038 hit" for a hit-front
[14:17:05] I've tried the patched varnishxcache on cp1074 and it doesn't find any frontend hits
[14:17:08] and then simplifying those two to "hit, miss, miss" and "hit"
[14:17:27] printing record after re_simplify I don't see any "hit" but rather "hit, hit"
[14:17:53] how can there possibly be a double-layer hit? a hit implies not asking a next-deeper layer...
[14:19:47] oh, right
[14:19:56] the hit records the records from beneath it, that's what I'm forgetting
[14:20:11] yeah, like cp1062 hit/5, cp1074 hit/36
[14:20:34] if the object exists in the local backend, but doesn't in the frontend, two serial requests to the frontend will report "be hit, fe miss", then "be hit, fe hit"
[14:21:01] ok let me go stare at varnishxcache more, maybe that part was already correct
[14:21:21] yeah I think "hit$" should be fine
[14:23:30] yeah the whole front|local|remote part for hit/int actually is correct.
I just didn't think it was when I looked in there to look at the other issue, because this is all quite non-straightforward :)
[14:24:27] indeed :)
[14:25:39] ok re-uploaded with just the miss/pass stuff
[14:26:20] the way both logics work (X-Cache-Status + varnishxcache), if there was any hit|int anywhere in the string, we wouldn't reach the miss|pass(|bug) cases
[14:26:38] so once we get there we're dealing with some kind of string composed solely of misses and passes
[14:26:51] it seems once we start adding too many confusing cases where some layers pass and others don't, we should call it a miss if there's a true miss anywhere, and thus only a pass if everything passes
[14:28:13] you could almost make "pass" just the else-clause without checking, but better to have a catchall bug/unknown case where we never matched any of the 4 strings, just in case.
[14:36:33] back to ulsfo hw
[14:37:13] cp4021 has edac reporting a memory error in DIMM B1
[14:37:21] it's being corrected, but still :P
[14:38:49] both of the nodes I pooled friday (cp4021+cp4027) have a single dmesg line reporting:
[14:38:52] [Fri Sep 8 16:00:34 2017] TCP: eth0: Driver has suspect GRO implementation, TCP performance may be compromised.
[14:40:43] actually a lot of our nodes have that, not just the new ones
[14:41:05] maybe something changed (perhaps not with the actual GRO implementation, but with finally noting the issue?) with newer kernels
[14:41:55] 72% of cache nodes have that line in their current dmesg buffer, anyways. a few in esams are understandable, the old ones there I don't think have bnx2x at all
[14:42:00] (maybe?)
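The X-Cache parsing discussed earlier (re_simplify stripping hostnames; hit/int anywhere short-circuiting before the miss/pass cases; "a miss if there's a true miss anywhere, only a pass if everything passes"; catchall bug case) could be sketched roughly like this in Python. This is a simplified illustration of the logic as described in the conversation, not the actual varnishxcache code, and it assumes the entries are ordered deepest layer first with the frontend last:

```python
import re

def re_simplify(x_cache):
    # "cp1071 hit, cp3037 miss, cp3038 miss" -> "hit, miss, miss"
    return re.sub(r'cp[0-9]+ ', '', x_cache)

def classify(x_cache):
    # entries are ordered deepest layer first; the last one is the frontend
    entries = re_simplify(x_cache).split(', ')
    statuses = [e.split('/')[0] for e in entries]   # "hit/5" -> "hit"
    # any hit|int anywhere means some layer hit; position (from the
    # frontend end) decides front vs local vs remote
    for i, status in enumerate(reversed(statuses)):
        if status in ('hit', 'int'):
            if i == 0:
                return 'hit-front'
            return 'hit-local' if i == 1 else 'hit-remote'
    # only misses and passes left: a true miss anywhere makes it a miss,
    # and only all-pass counts as a pass
    if 'miss' in statuses:
        return 'miss'
    if all(s == 'pass' for s in statuses):
        return 'pass'
    return 'bug'   # catchall: never matched any of the known strings
```

With the examples from the conversation, `classify("cp3038 hit")` and `classify("cp1062 hit/5, cp1074 hit/36")` give `hit-front`, while `classify("cp1071 hit, cp3037 miss, cp3038 miss")` gives `hit-remote`.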
[14:42:34] could be uptime related (message ran off dmesg buffer after a while?), could be that the issue is only detected under certain traffic patterns too
[14:44:16] https://patchwork.ozlabs.org/patch/723007/
[14:45:08] oh so the check is wrong
[14:45:14] maybe
[14:45:17] it could be right, too
[14:45:26] but usually bnx2x driver is pretty high quality, not sure
[14:48:06] http://elixir.free-electrons.com/linux/latest/source/net/ipv4/tcp_input.c#L165
[14:50:32] so the "&& skb_is_gso(skb)" hasn't made it to linus tree at all yet?
[14:51:26] at least, not a release
[14:51:43] it doesn't look like, but the check in the latest kernel (link above) is different from the one we're running: http://elixir.free-electrons.com/linux/v4.9.25/source/net/ipv4/tcp_input.c#L166
[14:54:54] I don't get many google hits on new, real bnx2x+GRO problems
[14:55:15] lots of historical hits on the old known issues in much older kernels (we used to disable GRO for that reason, until we got to new enough kernels long ago)
[14:55:53] I'd say we call it a red herring unless we see evidence of a real problem to match with that message
[14:59:23] +1
[15:00:16] cp4021 had memory errors?
I thought cp4024 was the cursed one
[15:06:31] well both, in different ways
[15:06:47] cp4021 isn't actually failing, just has some spam of correctable memory errors from ECC
[15:07:11] they seem highly localized though, so we should replace the DIMM I think
[15:07:14] I'll make a ticket
[15:14:19] 10Traffic, 10Operations, 10ops-ulsfo: cp4021 memory hardware issue - DIMM B1 - https://phabricator.wikimedia.org/T175585#3597148 (10BBlack)
[15:14:38] 10Traffic, 10Operations, 10ops-ulsfo: cp4021 memory hardware issue - DIMM B1 - https://phabricator.wikimedia.org/T175585#3597162 (10BBlack)
[15:36:40] 10Traffic, 10Operations: Multiple 503 Errors - https://phabricator.wikimedia.org/T175473#3597248 (10Esc3300)
[15:37:36] 10Traffic, 10Operations: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932#3597250 (10Esc3300)
[15:38:11] 10Traffic, 10Operations: Multiple 503 Errors - https://phabricator.wikimedia.org/T175473#3597252 (10BBlack)
[15:38:16] 10Traffic, 10Operations: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932#3597253 (10BBlack)
[16:07:15] 10Traffic, 10Operations: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3597396 (10BBlack) So far, other nodes are testing ok on this front. This is likely a node-specific early hardware failure.
[16:07:31] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3597398 (10BBlack)
[21:51:52] 10Traffic, 10Operations, 10monitoring: prometheus -> grafana stats for per-numa-node meminfo - https://phabricator.wikimedia.org/T175636#3598644 (10BBlack)
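The correctable ECC errors that EDAC reports for cp4021 are exposed per memory-controller/csrow via sysfs. A rough Python sketch for spot-checking those counters on a node follows; the path layout assumes the kernel's legacy EDAC csrow ABI (`/sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count`), which may differ by kernel version, and `root` is parameterized so the function can be exercised against a fake tree:

```python
import glob
import os

# Sketch: collect EDAC correctable-error counters from sysfs, e.g. to see
# whether errors are localized to one DIMM channel (as on cp4021).
# Assumes the legacy csrow layout under /sys/devices/system/edac/mc.
def edac_ce_counts(root='/sys/devices/system/edac/mc'):
    counts = {}
    for path in glob.glob(os.path.join(root, 'mc*', 'csrow*', 'ch*_ce_count')):
        with open(path) as f:
            counts[os.path.relpath(path, root)] = int(f.read().strip())
    return counts
```

A nonzero count concentrated on a single `ch*_ce_count` file would match the "highly localized" pattern that prompted the DIMM replacement ticket.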