[07:54:13] bblack: should we consider some form of autoremediation restarting after a certain value at this point? Or gerrit/376751 will mostly fix it already and will be deployed soon?
[08:50:47] elukey: thanks for taking care of T175473!
[08:50:47] T175473: Multiple 503 Errors - https://phabricator.wikimedia.org/T175473
[08:57:16] ema: :)
[08:58:03] volans: normally I'd say that cache backend autorestarts based on certain criteria are more likely to cause an outage than preventing it
[08:58:24] +1
[08:58:48] on the other hand it is true that the issue is coming up quite often now unfortunately :(
[09:01:14] yeah, ofc if we go in that direction should be with a lot of safe checks, and must be cluster-aware
[09:06:32] bblack: the comment here is fantastic https://gerrit.wikimedia.org/r/#/c/376751/1/modules/varnish/templates/vcl/wikimedia-backend.vcl.erb
[09:07:03] I like in particular the part where hash_ignore_busy is described as a cheaper hfp :)
[09:24:37] https://gerrit.wikimedia.org/r/#/c/376665/ merged, I'll update the LVS dashboards in a week or so once the new metrics will have some history as godog suggested
[09:25:14] the end result should be faster loading times for all relevant dashboards, and I imagine lesser strain on the server side too?
[09:28:50] yeah, less datapoints to fetch
[09:41:54] ema, bblack: modules/profile/manifests/lvs.pp are defining two salt::grains (lvs and lvs_class), can these be removed? I suppose you've moved to using cumin for all your cluster work?
[09:43:02] moritzm: yes, we're happy cumin users by now :)
[09:43:09] lol
[09:45:02] ok, I'll prepare a patch for these later on :-)
[12:30:41] alright, finally the future parser is happy https://gerrit.wikimedia.org/r/#/c/376242/
[12:30:48] bblack, _joe_ ^
[13:47:33] ema: I split my commit to just do (more) stats fixes separately first: https://gerrit.wikimedia.org/r/#/c/377255/
[13:48:36] also, I removed the hash_ignore_busy for now from the main patch.
It seems like it's probably not useful? hits don't stall, and misses are (instantly, without talking to a backend) converted to passes, which also don't stall. h_i_b would only affect a miss-fetch, which we're never doing in this case.
[13:49:38] well, the exception to the above thinking would be the blending of local-be + remote-be fetches of the same object
[13:50:27] an eqiad backend might be doing a normal miss-fetch (cacheable) of ObjectA for a client of an eqiad frontend, and simultaneously a request hits the same eqiad backend from a ulsfo backend on behalf of a ulsfo client.
[13:51:09] with h_i_b the ulsfo-derived fetch would proceed in parallel as a pass, without it the ulsfo fetch will stall and wait on the cache hit being generated for the local client. seems like a fine idea to not use h_i_b in this case though.
[13:52:10] if the timing were reversed, ulsfo's request would already be processing a stall-free pass when the local request comes through and does a true miss-fetch. the object still ends up in cache, and still nobody stalls that shouldn't.
[13:55:33] (all the cases where it's concurrent requests from multiple remote backends would be stall-free concurrent passes regardless of h_i_b. or concurrent stalls on the same miss-fetch for a local client)
[14:09:36] bblack: looking
[14:11:49] bblack: mmh I don't think ^hit$ is correct for hit-front?
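The hash_ignore_busy reasoning in the discussion (hits and passes never stall on a busy object; only a true miss-fetch waits, and h_i_b turns that wait into a parallel pass) can be modeled as a tiny decision function. This is a hypothetical illustration of the stall semantics being described, with invented names, not Varnish code:

```python
# Model: what a request does when another fetch of the same object is
# already in flight. `lookup` is the cache-lookup outcome for this request.
# (Invented names; a sketch of the reasoning above, not Varnish internals.)
def request_outcome(lookup, hash_ignore_busy):
    if lookup == "hit":
        return "serve"   # hits never stall
    if lookup == "pass":
        return "pass"    # passes fetch independently, no stall
    # lookup == "miss" is the only case hash_ignore_busy changes
    return "pass" if hash_ignore_busy else "stall"

# the eqiad/ulsfo case above: a remote-derived fetch arriving while a
# local miss-fetch of the same object is in flight
assert request_outcome("miss", hash_ignore_busy=True) == "pass"
assert request_outcome("miss", hash_ignore_busy=False) == "stall"
```

Since this config converts misses to passes before any backend fetch, the miss branch is never reached in practice, which is the argument for dropping h_i_b.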
[14:12:07] record is in the format "hit, hit"
[14:12:24] ema: re_simplify from above is applied first
[14:13:00] the original record is like "cp1111 miss, cp2222 miss, cp3333 hit", or for a frontend-only request "cp4444 hit"
[14:13:18] re_simplify turns those into "miss, miss, hit" and "hit", respectively
[14:14:17] the ", " separators come from the original VCL string, and aren't mucked with by the simplifier
[14:15:37] I guess my example is misleading, as the first one would be backwards from that in practice if you're looking at my fake numbers
[14:15:57] err no, ignore that last line
[14:16:10] the misses and hits are backwards from common view, though
[14:16:44] a realistic one might be "cp1071 hit, cp3037 miss, cp3038 miss" for a hit-remote
[14:16:54] or "cp3038 hit" for a hit-front
[14:17:05] I've tried the patched varnishxcache on cp1074 and it doesn't find any frontend hits
[14:17:08] and then simplifying those two to "hit, miss, miss" and "hit"
[14:17:27] printing record after re_simplify I don't see any "hit" but rather "hit, hit"
[14:17:53] how can there possibly be a double-layer hit? a hit implies not asking a next-deeper layer...
[14:19:47] oh, right
[14:19:56] the hit records the records from beneath it, that's what I'm forgetting
[14:20:11] yeah, like cp1062 hit/5, cp1074 hit/36
[14:20:34] if the object exists in the local backend, but doesn't in the frontend, two serial requests to the frontend will report "be hit, fe miss", then "be hit, fe hit"
[14:21:01] ok let me go stare at varnishxcache more, maybe that part was already correct
[14:21:21] yeah I think "hit$" should be fine
[14:23:30] yeah the whole front|local|remote part for hit/int actually is correct.
I just didn't think it was when I looked in there to look at the other issue, because this is all quite non-straightforward :)
[14:24:27] indeed :)
[14:25:39] ok re-uploaded with just the miss/pass stuff
[14:26:20] the way both logics work (X-Cache-Status + varnishxcache), if there was any hit|int anywhere in the string, we wouldn't reach the miss|pass(|bug) cases
[14:26:38] so once we get there we're dealing with some kind of string composed solely of misses and passes
[14:26:51] it seems once we start adding too many confusing cases where some layers pass and others don't, we should call it a miss if there's a true miss anywhere, and thus only a pass if everything passes
[14:28:13] you could almost make "pass" just the else-clause without checking, but better to have a catchall bug/unknown case where we never matched any of the 4 strings, just in case.
[14:36:33] back to ulsfo hw
[14:37:13] cp4021 has edac reporting a memory error in DIMM B1
[14:37:21] it's being corrected, but still :P
[14:38:49] both of the nodes I pooled friday (cp4021+cp4027) have a single dmesg line reporting:
[14:38:52] [Fri Sep 8 16:00:34 2017] TCP: eth0: Driver has suspect GRO implementation, TCP performance may be compromised.
[14:40:43] actually a lot of our nodes have that, not just the new ones
[14:41:05] maybe something changed (perhaps not with the actual GRO implementation, but with finally noting the issue?) with newer kernels
[14:41:55] 72% of cache nodes have that line in their current dmesg buffer, anyways. a few in esams are understandable, the old ones there I don't think have bnx2x at all
[14:42:00] (maybe?)
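The X-Cache parsing discussed earlier (re_simplify stripping hostnames; hit/int anywhere short-circuiting before the miss/pass cases; "a miss if there's a true miss anywhere, only a pass if everything passes"; catchall bug case) could be sketched roughly like this in Python. This is a simplified illustration of the logic as described in the conversation, not the actual varnishxcache code, and it assumes the entries are ordered deepest layer first with the frontend last:

```python
import re

def re_simplify(x_cache):
    # "cp1071 hit, cp3037 miss, cp3038 miss" -> "hit, miss, miss"
    return re.sub(r'cp[0-9]+ ', '', x_cache)

def classify(x_cache):
    # entries are ordered deepest layer first; the last one is the frontend
    entries = re_simplify(x_cache).split(', ')
    statuses = [e.split('/')[0] for e in entries]   # "hit/5" -> "hit"
    # any hit|int anywhere means some layer hit; position (from the
    # frontend end) decides front vs local vs remote
    for i, status in enumerate(reversed(statuses)):
        if status in ('hit', 'int'):
            if i == 0:
                return 'hit-front'
            return 'hit-local' if i == 1 else 'hit-remote'
    # only misses and passes left: a true miss anywhere makes it a miss,
    # and only all-pass counts as a pass
    if 'miss' in statuses:
        return 'miss'
    if all(s == 'pass' for s in statuses):
        return 'pass'
    return 'bug'   # catchall: never matched any of the known strings
```

With the examples from the conversation, `classify("cp3038 hit")` and `classify("cp1062 hit/5, cp1074 hit/36")` give `hit-front`, while `classify("cp1071 hit, cp3037 miss, cp3038 miss")` gives `hit-remote`.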
[14:42:34] could be uptime related (message ran off dmesg buffer after a while?), could be that the issue is only detected under certain traffic patterns too
[14:44:16] https://patchwork.ozlabs.org/patch/723007/
[14:45:08] oh so the check is wrong
[14:45:14] maybe
[14:45:17] it could be right, too
[14:45:26] but usually bnx2x driver is pretty high quality, not sure
[14:48:06] http://elixir.free-electrons.com/linux/latest/source/net/ipv4/tcp_input.c#L165
[14:50:32] so the "&& skb_is_gso(skb)" hasn't made it to linus tree at all yet?
[14:51:26] at least, not a release
[14:51:43] it doesn't look like, but the check in the latest kernel (link above) is different from the one we're running: http://elixir.free-electrons.com/linux/v4.9.25/source/net/ipv4/tcp_input.c#L166
[14:54:54] I don't get many google hits on new, real bnx2x+GRO problems
[14:55:15] lots of historical hits on the old known issues in much older kernels (we used to disable GRO for that reason, until we got to new enough kernels long ago)
[14:55:53] I'd say we call it a red herring unless we see evidence of a real problem to match with that message
[14:59:23] +1
[15:00:16] cp4021 had memory errors?
I thought cp4024 was the cursed one
[15:06:31] well both, in different ways
[15:06:47] cp4021 isn't actually failing, just has some spam of correctable memory errors from ECC
[15:07:11] they seem highly localized though, so we should replace the DIMM I think
[15:07:14] I'll make a ticket
[15:14:19] 10Traffic, 10Operations, 10ops-ulsfo: cp4021 memory hardware issue - DIMM B1 - https://phabricator.wikimedia.org/T175585#3597148 (10BBlack)
[15:14:38] 10Traffic, 10Operations, 10ops-ulsfo: cp4021 memory hardware issue - DIMM B1 - https://phabricator.wikimedia.org/T175585#3597162 (10BBlack)
[15:36:40] 10Traffic, 10Operations: Multiple 503 Errors - https://phabricator.wikimedia.org/T175473#3597248 (10Esc3300)
[15:37:36] 10Traffic, 10Operations: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932#3597250 (10Esc3300)
[15:38:11] 10Traffic, 10Operations: Multiple 503 Errors - https://phabricator.wikimedia.org/T175473#3597252 (10BBlack)
[15:38:16] 10Traffic, 10Operations: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932#3597253 (10BBlack)
[16:07:15] 10Traffic, 10Operations: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3597396 (10BBlack) So far, other nodes are testing ok on this front. This is likely a node-specific early hardware failure.
[16:07:31] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3597398 (10BBlack)
[21:51:52] 10Traffic, 10Operations, 10monitoring: prometheus -> grafana stats for per-numa-node meminfo - https://phabricator.wikimedia.org/T175636#3598644 (10BBlack)
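The correctable ECC errors that EDAC reports for cp4021 are exposed per memory-controller/csrow via sysfs. A rough Python sketch for spot-checking those counters on a node follows; the path layout assumes the kernel's legacy EDAC csrow ABI (`/sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count`), which may differ by kernel version, and `root` is parameterized so the function can be exercised against a fake tree:

```python
import glob
import os

# Sketch: collect EDAC correctable-error counters from sysfs, e.g. to see
# whether errors are localized to one DIMM channel (as on cp4021).
# Assumes the legacy csrow layout under /sys/devices/system/edac/mc.
def edac_ce_counts(root='/sys/devices/system/edac/mc'):
    counts = {}
    for path in glob.glob(os.path.join(root, 'mc*', 'csrow*', 'ch*_ce_count')):
        with open(path) as f:
            counts[os.path.relpath(path, root)] = int(f.read().strip())
    return counts
```

A nonzero count concentrated on a single `ch*_ce_count` file would match the "highly localized" pattern that prompted the DIMM replacement ticket.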