[00:04:25] 10Wikimedia-Apache-configuration, 10Operations, 10Performance-Team: VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4063305 (10Krinkle) Further testing shows that while it matches the custom default VirtualHost on mwdebug1001 and mwdebug... [00:28:24] 10netops, 10Operations, 10ops-codfw, 10User-Elukey: Switch port configuration for mw2259-mw2290 - https://phabricator.wikimedia.org/T190115#4063349 (10Papaul) p:05Triage>03Normal [00:30:27] 10Wikimedia-Apache-configuration, 10Operations, 10Performance-Team: VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4063366 (10Krinkle) [01:39:26] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4063485 (10Volker_E) [03:59:47] 10Traffic, 10Operations, 10Patch-For-Review: varnish-be: rate of accepted sessions keeps on increasing - https://phabricator.wikimedia.org/T189892#4063721 (10BBlack) Some notes from digging around in related things: * Varnish docs claim that duplicate probes (e.g. due to vcl reloads) are coalesced into a sin... [04:57:44] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4063740 (10Bawolff) [05:15:12] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4063758 (10Prtksxna) Thanks @Dzahn {icon smile-o} [09:06:14] so we're having mailbox lag / failed fetches issues in text esams again, although less severe than last week [09:06:29] https://grafana-admin.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=13&fullscreen&orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=1521525394219&to=1521536740514 [09:06:53] just added this ^ [09:08:03] something I didn't really consider before is: we have fe<->be probes, do they fail during these events? [09:08:11] and the answer is yes [09:08:58] there is not per-probe failure stat in varnishstat, but there's the total number of happy probes (per backend) [09:09:08] s/not/no/ [09:09:53] so I guess that the node self-recovers after the probes mark it as sick and loses user traffic for a while? [09:10:00] (morning BTW) [09:11:56] morning! [09:12:37] so it would be great to have stats about which backend is considered sick and when [09:12:43] (nothing like that is available) [09:12:50] BTW, if the probe is failing as well that means connectivity issues [09:13:11] cause the probe at L7 is always going to return a HTTP 200 OK [09:15:22] well if varnish is in a confused state (as it is when mbox lag and similar things show up) it seems reasonable that it does not manage to respond to health checks too [09:15:36] I don't think we should conclude that there are connectivity issues [09:16:14] hmmm I didn't express myself correctly.. 
I didn't mean connectivity issues in terms of network outage [09:16:31] but that varnish fails to accept the incoming (probe) connection [09:16:37] oh yes, sure [09:17:51] note that the fe<->be probe does not time out [09:17:54] it just fails [09:17:57] for example: [09:18:04] 0 Backend_health - vcl-8d2bafd0-8037-43b1-9e91-ae5e068e1da6.be_cp3031_esams_wmnet Still sick 4--Xr-- 0 3 5 0.000000 0.018553 [09:18:43] I was lucky enough to capture this with `varnishlog -n frontend -g raw -i Backend_health | grep -v 'Still healthy'` while the probes were failing [09:19:26] see vsl(7) for an explanation of the fields [09:20:10] interestingly, the last two are: Response Time, Average Response Time [09:21:00] oh, wait, I assumed that if response time != probe_timeout (100ms) then the check did not time out [09:21:09] however here response time is 0.0 [09:21:19] what does that even mean [09:22:36] that r is signaling "Error received (no response from the backend)" [09:23:11] so 4 --> IPv4 OK, X --> Good Xmit (Request to test backend sent) [09:24:21] right, so if no response is received, response time is 0 [09:24:38] which seems to indicate we're timing out [09:24:58] perhaps 100ms is a bit ambitious as a timeout when varnish-be is having difficulties [09:25:52] and fe<->be probes failing, with the consequence of backends being marked as sick, might be worsening the situation actually (by messing up chashing) [09:30:43] how many varnish-be can we lose without degrading the service? [09:31:14] it depends :) [09:31:17] why are you asking? [09:32:05] BTW, we have ~300 quadrillion happy probes [09:32:18] cause the probes marking the misbehaving node as sick should be a good thing.. the node loses traffic (and load) and get time to recover [09:33:16] right, in the meantime traffic goes to other backends though, some of which might already have been close to their scalability limit [09:33:34] and now they get a ton of new objects in cache [09:33:57] so if we cannot lose a cp node then we have a serious problem [09:35:01] we can "lose" a cp node (we do reboot them often without consequences) [09:35:31] my theory is that flapping do to probes marking backends as healthy might be making the "mbox-lag" thing worse [09:36:41] s/do/due/ [09:36:50] s/healthy/sick-healthy/ [09:37:43] yep.. and the extra load from discarded VCLs' probes on fe isn't any good [09:38:45] I'd like to make something useful out of the varnish_backend_happy prometheus counter [09:39:36] so far I've plotted sum(varnish_backend_happy{job="varnish-$cache_type",layer="$layer",instance=~"($server):.*"}) by (backend) [09:39:50] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=12&fullscreen&edit&orgId=1 [09:40:37] I'm not sure how useful it is to know that we've probed cp1052 successfully 267 quintillion times though [09:41:38] maybe the rate could be useful [09:43:35] yeah but I mean what is that [09:44:02] we went from 258 quintillion probes to 498 in 4 minutes [09:44:11] that seems suspicious :) [09:49:30] right, so the value of varnish_backend_happy is 1.8446744073709552e+19 for all backends [09:51:54] hmmm [09:52:06] so yeah we've reached 64 bit unsigned int [09:54:21] I think that if we want to make anything useful out of varnish_backend_happy, we need an aggregation rule computing the rate, and use that (after resetting the counters with a fe restart) [09:54:36] perhaps godog has better ideas! 
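A rough sketch of the rate-based aggregation floated above, reusing the metric and label template variables from the dashboard query quoted at 09:39:36; the expression is illustrative only, and as noted it can only become meaningful once the saturated varnish_backend_happy counters have been reset by frontend restarts:

    # illustrative PromQL only: per-backend happy-probe rate over a 5-minute window
    sum(rate(varnish_backend_happy{job="varnish-$cache_type",layer="$layer",instance=~"($server):.*"}[5m])) by (backend)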
[09:58:32] even though I mean, what we really would like to know is "when and for how long was backend X considered sick" [09:59:03] the number of happy/sad probes is not that interesting really [09:59:39] seems more like a logging use case than a metrics use case, to me at least [10:00:02] we need to ingest those Backend_health stats [10:00:43] yeah I agree with both sentences :) [10:00:50] let's send backend_health to logstash? [10:01:03] when it's != happy [10:03:08] or even just the transitions really [10:05:49] the transition sounds good to me, is that something varnish logs natively? [10:06:40] it definitely says "Back healthy" [10:07:06] right, and "Went sick" [10:09:04] bbiab [10:46:56] vgutierrez: thanks! [10:49:40] np [10:49:53] you don't like the idea of moving the constants to its own file? [10:51:28] not really, not right now [10:51:46] if it becomes worse we'll create a bgp.constants some day perhaps [10:52:53] my OCD screams seeing those imports there though :_( [10:52:57] 10Traffic, 10Operations, 10Goal, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#3901180 (10Vgutierrez) varnishxcps metrics are being used in the following dashboards: * db/tls-ciphersuite-explorer (TLS - Ciphersuite Explorer) * db/tls-ciphers (TLS Ciphers... [10:53:52] there are imports further down the file as well [10:53:53] not many [10:54:08] but i don't see a problem with imports after constants [10:54:20] hm, coveralls reports a DECREASE in coverage [10:54:21] why is that [10:55:39] apparently it doesn't see the new test code itself as covered [10:56:47] 10netops, 10Operations: Can't login on netbox - https://phabricator.wikimedia.org/T190134#4064218 (10fgiunchedi) p:05Triage>03Normal [11:25:46] 10Traffic, 10Operations, 10Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#4064286 (10ema) [11:25:50] 10Traffic, 10Operations, 10Patch-For-Review: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932#4064284 (10ema) 05Resolved>03Open This [[https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521510908805&to=1521544427097 | occurred again... [11:35:03] so another interesting conclusion is that synth responses are also affected by the -sfile debacle apparently [11:36:55] ema: if/when you have 30s free: https://gerrit.wikimedia.org/r/c/419799/ [11:37:30] volans: trading that one for https://gerrit.wikimedia.org/r/#/c/420680/ [11:39:35] ema: that's unfair, 3 commented lines vs 180 lines :-P [11:39:56] anyway, I already saw that but was wondering if it's a new script or not, it seemed old from the copyright notice [11:40:02] and also, why not py3? :D [11:41:14] volans: it's new but gilles and I stole code from each other, hence the copyright [11:41:42] * ema regrets pinging volans [11:42:20] you asked for it! :D [11:55:53] bblack: tl;dr from this morning is: https://phabricator.wikimedia.org/T174932#4064284 and https://gerrit.wikimedia.org/r/#/c/420680/ [11:56:31] plus the open question of whether our current fe<->be probe definitions are ok in the context of T174932 [11:56:31] T174932: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932 [11:58:22] is the timeout too short? Or perhaps the window? Is that worsening the situation by removing from the pool of healthy servers a backend which might have soon recovered (while moving load elsewhere)? 
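A minimal sketch of the transition-only logging idea above, combining the varnishlog invocation quoted at 09:18:43 with the "Went sick" / "Back healthy" strings just mentioned; the logger command at the end is only a stand-in for whatever would actually ship these records to logstash:

    # emit only health-state transitions, skipping the steady "Still healthy" / "Still sick" records
    varnishlog -n frontend -g raw -i Backend_health \
      | grep --line-buffered -E 'Went sick|Back healthy' \
      | logger -t varnish-backend-health   # stand-in shipper, tag invented for illustration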
[11:59:42] anecdotally, I've seen a few quick flaps while looking at varnishadm -n frontend backend.list earlier on today, so a longer window might have helped [12:03:33] to rephrase: are we making things worse by shifting traffic here and there when backends are quickly going in and out of hospital [12:20:28] maybe? [12:20:37] good morning :) [12:20:41] does the backend in question go sick for all frontends? [12:21:10] because the TW pileup probably affects one frontend first (the one with the most probes) [12:21:36] and would cause failing/delayed probes/live-traffic from just that fe to that be [12:22:51] varnishhospital I assume logs the source host of the probe as the standard host= field [12:23:21] yes [12:24:07] many things are hard to tease apart cause/effect or best course of action here [12:24:35] but the spam of conn:close probes closing up connections to be's fits firmly in the cause side of the table, and seems decidedly-bad [12:25:15] what we know is that the fe<->be connections pileup does happen on multiple frontends [12:25:25] see for example: [12:25:26] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=6&fullscreen&orgId=1&from=now-12h&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=cp3030&var-server=cp3031&var-layer=frontend [12:26:14] was cp3043's be sick at 03:20ish? I don't think we know that [12:28:04] 3030 and 3031 have 11 and 9 stale-busy VCLs respectively [12:28:10] not too far off each other in total probe rate [12:29:45] on another note from above: would it help if probe responses didn't use synthetic()? Just headers with no content output? (are all consumers ok with that? pybal? icinga?) [12:31:14] I'm not sure whether it would help, but you could argue that if a certain varnish instance is not able to do synth it can be considered as being in trouble at that point in time [12:31:41] yes but you're also trying to argue we're being too sensitive to the trouble above and should make probes more tolerant of failure :) [12:32:02] it has never been a deep check, it's intended to be a lightweight check [12:33:39] I do often disagree with myself :) [12:33:48] in this case though I think that there is value in knowing that we can't do synth, but we should perhaps be more tolerant of brief flaps [12:34:00] whatever "brief" means [12:35:56] increasing probe intervals from 100 to 1000 ms did help quite a bit with accepted sessions rate: [12:35:59] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521450281845&to=1521500066644&panelId=7&fullscreen [12:36:16] ema: even though it only affected the newest VCL, not all of the old stale ones that are still running... 
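For reference, a hedged sketch of roughly what the fe<->be probe definition under discussion looks like in VCL: the 100ms timeout, the 1s interval from the change mentioned above, and the 3-out-of-5 window/threshold are taken from the conversation and the Backend_health record quoted earlier, while the request itself (path, Host header) is purely illustrative:

    probe be_probe {
        # the request path and Host below are illustrative, not the real probe request
        .request =
            "GET /check HTTP/1.1"
            "Host: varnishcheck"
            "Connection: close";
        .interval  = 1s;      # was 100ms before the interval change discussed above
        .timeout   = 100ms;
        .window    = 5;       # matches the "... 0 3 5 ..." fields in the Backend_health record
        .threshold = 3;
    }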
[12:36:39] compare the backend restart at 12:15 with the one at 21:35 [12:36:50] the rate of probe-induced closes won't drop off sharply until basically all the FEs are restarted [12:37:03] which is why sadly I think maybe we should put a weekly cron in for that, at this point [12:37:51] so, some numerical context / example on the time-wait stuff: [12:38:05] cp3030 a few minutes ago when I peeked, had ~220K timewait sockets total [12:38:30] ~24K of those are natural public ones on the public :80/:443 and similar (unavoidable and reasonable and non-problematic) [12:39:16] ~195K of those are on the nginx<->fe pipeline with its 8 ports and its special ability to kill them easier when necessary (because tw_reuse and both ends local) [12:39:23] err sorry [12:39:37] ~158K of those are on the nginx<->fe pipeline with its 8 ports and its special ability to kill them easier when necessary (because tw_reuse and both ends local) [12:40:06] the remaining 37K are fe<->be timewaits, [12:40:36] but this is on a node that's not currently in trouble in either direction afaik [12:42:02] 5776 10.20.0.168 [12:42:02] 5766 10.20.0.167 [12:42:02] 5249 10.20.0.165 [12:42:02] 4297 10.20.0.175 [12:42:02] 4205 10.20.0.178 [12:42:04] 4194 10.20.0.166 [12:42:07] 4184 10.20.0.176 [12:42:09] 4120 10.20.0.177 [12:42:22] ^ that's the ~37K timewaits split up by the frontend on the other side, with .166 being itself [12:42:51] those numbers aren't wonderful, but they're not in problematic range, either [12:43:02] (I don't think, anyways) [12:43:58] note when I looked on the other sides of these, I found no timewaits. so basically it's the BE that carries them (the FE closes first when probing). [12:44:44] err sorry that's backwards right? if the BE is carrying the TIMEWAIT that means the BE closed first? [12:45:44] I need a star-wars font t-shirt that says "Han suffered the TIME_WAIT state" to remind me :P [12:45:45] IIRC timewait is on the side that closed first [12:47:11] right [12:47:24] the side that sent FIN first, right [12:47:51] so, if one of these flows became time-wait problematic, it would be because the count of timewaits on a BE facing a given FE rose to the point that it choked off the necessary count of ephemeral ports we need to open for normal traffic. [12:48:20] but in our overall timewait graphs, I don't know how visible it would be [12:49:35] I kind of assume at points yesterday that this must be happening around the time we see a connection pileup, but maybe not [12:50:04] (a connection pileup in varnish stats, where it hits its threads/conns limits and is queueing, etc) [12:50:33] lots of trouble would be spared if https://gerrit.wikimedia.org/r/#/c/420395/ actually did fix the busy vcl issue :) [12:50:54] let's give it a try on pu/misc I'd say [12:50:55] I think that's a pure memory leak [12:51:05] it will save us lost memory, but won't free up the vcl refs [12:51:12] (small amounts of lost memory per vcl) [12:51:54] but sure, it's worth a shot [12:52:42] right, we would have liked to see a patch adding a VCL_Rel somewhere magic I guess [12:53:08] the stale VCLs problem in general, probably our easiest fixup is to move in the direction of not having confd do reloads, and adding a weekly fe restart to clean up true reloads. [12:53:39] (easier than figuring out whether we can fix VCL_Rel issues on our own) [12:54:40] but even with that mitigated and a slower baseline probe rate (which I'm not all that sure it's wise to keep. 
it's nice to have faster reactions to server outages when things are working right)... that probes would kill our persistent conns so routinely is problematic on a number of levels. [12:55:10] cache_text has 36 hosts globally [12:55:28] and 8 in eqiad specifically [12:55:58] well that latter part is unecessary, since be's don't probe their own dc's be's [12:56:37] so any given cache_text server in eqiad is probed by 36 other varnishds (8x local fe + 28x remote be) [12:57:04] even if we had no stale VCLs and a 1s probe time, it's 36 pointless conn:close towards it per second. [12:57:17] if we go back to the desirable 100ms interval it's 360 [12:57:27] so it might be useful to spend time on switching to persistent connections for probes [12:57:32] live rate of such probes last night with all the stale VCLs was more like ~1420/s [12:57:43] (into a single backend in eqiad) [12:58:41] yeah I don't know whether that's easy or hard, I haven't looked much yet [12:59:44] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4064480 (10kchapman) TechCom is declining because the use case is not current. This needs a new owner and use case. There is actually one current use cas... [13:00:34] I may be wrong, but my current reading of the probe client-side code seems like: [13:00:54] 1) The probes all run in a separate thread for probe-polling, but re-use the same pool of tcp conns as live traffic. [13:01:13] 2) The prober closes the connection on any error or timeout (probably reasonable) [13:01:45] 3) The prober's logic assumes connection-delimited response. in other words, it keeps reading response bytes until remote-close or timeout reached. [13:01:57] (it doesn't parse CL and find the logical end of the response, etc) [13:02:17] that latter bit being the core polling loop in vbp_poke() [13:02:30] and then it unconditionally closes its end once timeout or remote-close is reached [13:05:30] 10Wikimedia-Apache-configuration, 10Operations: Review/Merge/Deploy advisors.wikimedia.org apache vhost - https://phabricator.wikimedia.org/T190143#4064494 (10Reedy) [13:06:21] oh! [13:06:40] so, apparently this whole situation isn't as bad as it first appears, once you dig through the code enough [13:06:53] the probe code *does* share the TCP connection pool with live traffic. [13:07:58] but live traffic uses VBT_Get() on the pool to acquire a connection (which might be a recycled idle one, or a fresh one), whereas the prober uses VBT_Open(), which always opens a fresh connection. [13:08:19] so, therefore, probes do not actually kill off the persistent conns used by live traffic [13:08:43] but they do create new temporary ones constantly, which still adds to the timewait issue in general [13:11:58] ok, so rewinding brain <<<<<<< [13:12:56] the ~1420/sec probes landing on a cache_text node in eqiad do not mean we're closing off random live-traffic persistent conns at a rate of ~1420/sec. It just means that in addition to the live-traffic conns, we're creating->destroying ~1420 new tcp conns just for probing, /sec. 
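One way to reproduce the per-peer TIME_WAIT breakdown pasted at 12:42, sketched from memory rather than copied from the command actually used; the :3128 varnish-be listener port is an assumption:

    # count TIME_WAIT sockets owned by the varnish backend listener, grouped by peer address
    ss -tn state time-wait '( sport = :3128 )' \
      | awk 'NR > 1 { sub(/:[0-9]+$/, "", $4); print $4 }' \
      | sort | uniq -c | sort -rn | head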
[13:15:00] in esams it's more like ~663/sec presently, since it doesn't have remote DCs hitting it, etc [13:15:46] that was on cp3043 in particular [13:16:11] which also had 37522 be-port timewaits stacked up around the same time [13:16:32] 37522/663.2 = 56.577 [13:17:11] so probably that makes sense, that the 37K timewaits means timewaits are created at this ~663/s rate and last ~60s each to reach that number [13:18:26] but if this is never problematic, why does slowing down the probe rate seem to help things? maybe should re-examine what exactly it's helping and why. [13:20:28] I think it helps just because each probe opens a new fe<->be tcp connection, and what we're plotting there is the rate of accepted tcp connections on the backend side [13:20:46] so by reducing the rate of probes, we reduce the rate of accepted tcp conns [13:21:02] well around the time it went live, the rate of accepted conns went *up* [13:21:20] I assume by helpful you meant maybe that was letting live traffic flow in parallel more-freely, as indicated by the increase? [13:21:42] but it may well be that it just caused another reloaded VCL to add onto the pile, and the increase is just the actual probe connections themselves. [13:21:46] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521450281845&to=1521500066644&panelId=7&fullscreen [13:21:51] bblack: nice catch the VBT_Get VS VBT_Open [13:22:05] helpful as in 12:20 vs 21:35 [13:22:15] oh right [13:22:23] helpful as in, it's not going up by quite as much [13:22:31] yep [13:22:37] that makes basic sense, but really tells us little I guess [13:22:51] we reduced the rate of probe connections, and the total count of connections goes up by less on each reload :) [13:23:20] which makes me happy because the contrary would have required a tableflip [13:24:44] well, the backend connections accepted/sec mostly tracks probe rate from the total warm VCLs probing it [13:24:50] since they're the connections that constantly recycle [13:25:32] I think recycle means something else in varnishland, something like "reopen" you mean? [13:27:32] yes, I mean connections that are constantly opening and closing [13:27:53] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=51&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams%20prometheus%2Fops [13:28:07] ^ since the backend sessions graph there is fairly-close to the accepted-rate shown from the probes [13:28:24] I think we can deduce that under normal conditions, live traffic doesn't use much connection parallelism from fe->be? [13:29:00] in other words, not only do probes dominate the tcp connection opens/closes for a backend, they also dominate the total connection count [13:29:20] yes [13:30:28] eg, cp3030's frontend serves ~1.5k rps [13:30:32] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=14&fullscreen&orgId=1&var-server=cp3030&var-datasource=esams%20prometheus%2Fops&from=1521508182283&to=1521521601861 [13:30:54] and the reused connections rate is not too far I think last time I checked [13:31:36] (if you remove cache_hits from the number) [13:31:41] right [13:32:11] reminding myself from when I saw it in the be sessions graph above: "herd" is idle connections that haven't closed yet and could be reused. [13:34:14] meanwhile https://gerrit.wikimedia.org/r/#/c/420395/ hasn't broken anything on cp1008, ok to release 5.1.3-1wm5 and upgrade a cache_misc host? 
[13:35:22] sure [13:37:49] 10netops, 10Operations: Can't login on netbox - https://phabricator.wikimedia.org/T190134#4064592 (10ayounsi) a:03ayounsi Typo in the config, confirmed working for me now. https://gerrit.wikimedia.org/r/#/c/420699/ [13:40:20] what was the other fix you had floating that's not widely deployed yet? [13:40:25] upstream extrachance patch? [13:40:29] yup [13:40:40] deployed on misc@esams [13:41:35] is our old extrachance patch gone too? [13:41:44] or is this a merger of the two? [13:42:40] oh I see, removed from series-file [13:42:42] our old extrachance patch is gone since 5.1.3-1wm4 (w/ upstream extrachance fix), valentin's backport of the memleak fix is 5.1.3-1wm5 [13:43:35] interesting they also chose to decrement extrachance at the bottom of the loop :) [13:43:49] heh [13:44:25] they got inspired! [13:44:30] so basically their solution is also to only catch the edge-case-race once, and if it seems like the race, force a fresh connection and then don't consider the race again. [13:45:49] at least it won't loop infinitely [13:46:17] o gerrit, where art thou [13:46:21] but I vaguely recall the reason we noticed this was because RB would fail a request in a certain spectacular way which would appear to the extrachance logic to be a race-case, when it wasn't [13:46:26] 10netops, 10Operations, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#4064623 (10fgiunchedi) [13:46:29] 10netops, 10Operations: Can't login on netbox - https://phabricator.wikimedia.org/T190134#4064621 (10fgiunchedi) 05Open>03Resolved Confirmed working! Thanks @ayounsi ! [13:46:39] and it was a long-timeout case [13:47:01] so replacing ours with upstream may result in a cascade of double-requests for that case [13:47:20] (but at least that's better than a cascade of up-to-infinite retries) [13:48:13] we don't have to deploy it of course [13:48:34] we can also stick to our own patch :) [13:49:40] the original thinking was: this is most likely unrelated to the pileups, but let's give it a go as it brings us closer to what upstream does anyways [13:49:40] well what they're doing is an improvement for the actual race-case [13:50:01] but yeah the logic-hole is still there [13:50:23] the logic-hole in midst of vbe_dir_getfd() is this: [13:51:18] if varnish sends a request to a backend, then fails to get response headers at all (e.g. timeout before complete/any headers? I don't know all the cases), and the connection happens to be a re-used idle one, it assumes the failure was due to the race and wants to try again. [13:52:08] when we know there's another possibility: that there was no close-race, and the application layer simply hung forever processing the request and not generating (complete?) header output and then terminated the connection. [13:53:28] basically the code there lacks the information to tell true races from the above fake-races [13:53:46] I'm not even sure if anywhere within varnish's state about the conn/request, such information exists [13:56:55] (although trivially, it seems like a strong hint would be the amount of time that transpires. a 60s timeout waiting on headers is not a network race for the req-vs-fin) [13:57:55] (the stronger and less-heuristic hint would be to deep-dive for linux TCP_INFO as mentioned the other day and check for unacked data on conn-close) [13:59:12] interesting that we have thousands of connections in time_wait and then fail on a race when closing a connection [14:03:49] do we? 
[14:04:23] (ever really fail on that race significantly) [14:04:52] the thousands of timewaits are all for the probes, and the probes always use a fresh connection for a single request and then close it up without letting anything else use it. [14:06:12] but anyways, thinking this through, I think maybe we should flip back to our extrachance fix vs theirs, and keep the value at zero. [14:07:42] their solution is (a) still fooled by applayer misbehaviors with possibly long timeouts, which is not the intended race-case and (b) with our varnishd layering, even when they limit it to "single retry" it's going to cascade, in the worst cases retrying up to 8 times at the eqiad->applayer end of a request originating in eqsin. [14:08:36] I think? [14:08:57] yes, still cascades [14:09:44] I'd say merge the two approaches, but our setting extrachance to zero at the top conflicts with their force_fresh approach, so it would result in a fresh connection-per-request, which isn't great either :) [14:21:59] oh well yes, we can easily revert to wm3 then [14:22:20] wm4 is the upstream extrachance fix which has the issues you described [14:22:48] wm5 doesn't really save the world (busy vcl repro on cp3007) [14:27:48] well we can still leave in the memleak fix, maybe not worth a global restart until we have more fixes to try though [14:44:55] 10netops, 10Operations, 10ops-codfw, 10User-Elukey: Switch port configuration for mw2259-mw2290 - https://phabricator.wikimedia.org/T190115#4064892 (10Papaul) p:05Normal>03High [14:49:13] win 14 [14:52:09] 10netops, 10Operations, 10ops-codfw, 10User-Elukey: Switch port configuration for mw2259-mw2290 - https://phabricator.wikimedia.org/T190115#4064917 (10Papaul) a:05ayounsi>03RobH [14:56:10] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4064924 (10Anomie) >>! In T66214#4064480, @kchapman wrote: > TechCom is declining because the use case is not current. Everything listed in the task's des... [15:08:32] 10netops, 10Operations, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#4064962 (10ayounsi) Netbox has been upgraded to 2.3.1 which supports virtual chassis switches. Updating description for next steps. [15:14:56] 10netops, 10Operations, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#4064982 (10ayounsi) [15:27:56] ema: I've poked around at some of the events shown in https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=6&fullscreen&orgId=1&from=now-12h&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=cp3030&var-layer=frontend [15:28:24] but in no case have I found a correlation when looking at both of the fe+be in varnish-machine-stats at their tcp timewait counts. [15:28:38] if anything, when the connection volume spikes there, the timewaits drop a little [15:29:14] so I suspect the whole idea that the timewait pileup from probes is a cause of the Real problems is a bust [15:30:31] it would still be nice to reduce the probe-spam (by reducing stale vcl), and perhaps make it more resilient to blip failures with the windowing, I think. 
[15:30:43] but I don't think those things are going to solve any foundational problem [15:33:36] --- [15:34:24] I think the upcoming thought also doesn't solve any foundational problem, but it occurred to me when thinking through the general efficiency of our structure of caches, and may not be worth the trouble of pursuing: [15:35:37] [picking random numbers from various clusters as examples, obviously the numbers vary situationally] [15:36:40] right now we have a 200G fe mem cache initially, then chashing into an array of 8x nodes that each have 720GB of slower and more-problematic disk storage for a combined size at that layer of ~5.7TB. [15:37:33] it might be substantially more-efficient to add a virtual middle tier there, by structuring more like: [15:38:27] an initial 100G mem cache, which chashes into an array of 8x backend nodes. in each backend node, we hit a 100G memcache and then fall back to the 720G of slow disk. [15:39:05] so we go from effective set size tiering of 200G-mem -> 5.6TB disk, to 100G-mem -> 800G-mem -> 5.6TB disk [15:39:55] doing that "easily" would require the backend varnishd to support the idea of configuring disks and 100G malloc separately and actually tiering the caching internally, which I think is either difficult or impossible within VCL [15:40:28] 10netops, 10Operations: Add puppetmaster2001 to analytics-in4 - https://phabricator.wikimedia.org/T190167#4065148 (10fgiunchedi) [15:40:36] or having it [the backend varnishd] reconnect to itself on miss/pass with a new header set to indicate using the different storage layers. [15:42:12] 10netops, 10Operations, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#4065195 (10RobH) [15:42:34] internally-tiering it without leaving the initial VCL context, I don't know if it's possible. maybe with hooks in vcl_{hit|miss|pass} that switch storage and maybe have to do a request-restart. [15:43:05] but even that sounds fishy, I don't think it would pull the cache misses up into memory storage as they're accessed, either. [15:43:13] 10Traffic, 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4065212 (10Nuria) ping @bblack could we possibly get this review done next quarter? [15:46:20] 10Traffic, 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4065219 (10BBlack) Yes, we've had this on the discussion list for ops Q4 goals (the elimination of the need for ipsec for caches<->kafka-brokers... [15:49:15] 10netops, 10Operations, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#4065258 (10ayounsi) a:03faidon [15:49:51] 10netops, 10Operations: Add puppetmaster2001 to analytics-in4 - https://phabricator.wikimedia.org/T190167#4065267 (10ayounsi) 05Open>03Resolved a:03ayounsi Done. [16:11:24] 10netops, 10Operations, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#4065385 (10Volans) [16:34:31] eqsin to the British Indian Ocean Territory would go eqsin-> Los Angeles -> London -> Satellite -> IO [16:35:00] hell of a ride!
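Backing up to the virtual middle-tier idea sketched at 15:37-15:43: the one related primitive plain VCL does expose is per-object storage selection at fetch time. A hedged sketch follows (the storage names, varnishd -s arguments and size cutoff are invented, and, as noted above, this alone gives no promotion of hot disk objects into the malloc tier):

    vcl 4.0;
    import std;

    backend default { .host = "127.0.0.1"; .port = "80"; }  # placeholder so the sketch compiles

    # illustrative only: choose a stevedore per response in vcl_backend_response (Varnish 5.x)
    sub vcl_backend_response {
        if (std.integer(beresp.http.Content-Length, 0) < 1048576) {
            set beresp.storage = storage.mem;   # assumes varnishd was started with -s mem=malloc,100G
        } else {
            set beresp.storage = storage.main;  # assumes something like -s main=file,...,720G
        }
    }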
[16:35:20] ping over 9000 xD [16:36:33] so I have succesfully nerdsniped myself into backporting https://github.com/varnishcache/varnish-cache/commit/5cc47eaa8a174d6f072c68d996ff17b38ccd16eb to 5.1 [16:37:04] changes proposed as https://gerrit.wikimedia.org/r/420758, which also re-introduces our own extrachance patch [16:37:28] more in general, all 4.1.x patches might be interesting https://github.com/varnishcache/varnish-cache/blob/4.1/doc/changes.rst [16:39:23] volans: re T170144 and backups, yes, I though that was automatic :) [16:39:23] T170144: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144 [16:39:53] XioNoX: you're so optimistic :-P [16:40:33] volans: should I open the task or do you want to? [16:41:49] I can do it, I had it as a todo from yesterday's monitoring meeting already :) [16:43:13] thx! [16:49:03] 10netops, 10Operations: Netbox: setup backups - https://phabricator.wikimedia.org/T190184#4065562 (10Volans) p:05Triage>03Normal [16:49:09] 10netops, 10Operations, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#4065574 (10Volans) [16:49:14] done [16:53:15] ema: interesting.... the leaked objhead thing [16:53:28] I wonder if it could be contributing to various kinds of object storage pressures [16:55:03] looking at the other 4.1.x changes... [16:55:18] https://github.com/varnishcache/varnish-cache/pull/2555 sounds scary/interesting. it's amazing such a deep change is in a point-release :P [16:57:25] I'm having a hard time figuring out where the change you're mentioning was initially merged to [16:57:44] as in, was it a 5.x (6.x?) patch that then was painfully backported to 4.x? [16:58:25] oh no, that's the pull request for master, then the backport to 4.x is https://github.com/varnishcache/varnish-cache/commit/a02e4f277c3ff12c8ef05b692713f87813a85da9 [16:59:49] so many of the most-interesting bugs make references back to https://github.com/varnishcache/varnish-cache/issues/1799 [16:59:49] and the issue it tries to address is https://github.com/varnishcache/varnish-cache/issues/1799 (still open) [16:59:57] lol [17:00:00] yeah :) [17:01:32] we should probably look at the original archived trac link + 1799 and check our VCL for whether we can mitigate 1799 indirectly by changing our behavior [17:01:37] it sounds scary if we're hitting it... 
[17:02:26] open since 2015 [17:02:31] nice [17:02:55] maybe we need to simplify our grace windows and vcl_hit logic to avoid it, etc [17:06:25] but I could see this causing issues, it sounds like there might be plausible scenarios in our own traffic with hot objects and grace+waitinglist [17:06:40] (since we do set object grace higher than our common healthy-backend grace) [17:09:07] 10Traffic, 10Operations, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4065621 (10ayounsi) [17:10:49] I'll give a try to 5.1.3-1wm6 on cp3007 (tests green, all good on pinkunicorn) [17:13:42] 10netops, 10Operations: Netbox: setup backups - https://phabricator.wikimedia.org/T190184#4065669 (10ayounsi) Correct, plus: /srv/deployment/netbox/deploy/netbox/netbox/media/ and /srv/deployment/netbox/deploy/netbox/netbox/reports/ [17:13:55] 10Traffic, 10Operations, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4065670 (10BBlack) [17:24:36] bblack: objections to merging varnishospital? [17:32:12] ema: so long as it works as advertised, seems like useful data. added one nitpicky comment. [17:32:23] thanks! [17:32:47] so rewinding a bit, because I think I'm N layers off looking into other things too much [17:34:30] the functional problem that presents itself in the present is that we're getting 5xx spikes, usually esams in the eu morning-ish times, and they correlate with an "obvious" problem of a spike of fe->be connections, usually localized to 1x be at any one point in time. when the problem sustains long enough to do so, these connection spikes get plateau-limited when we run out of available threads. [17:35:12] (but even without hitting the the limiting case there, they still present 5xx problems to some degree, right?) [17:36:06] the problem moves around to different backends over time, as shown in e.g. https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&panelId=3&fullscreen&from=now-24h&to=now [17:36:20] which may be indicative of chash moving things around as healthchecks fail, etc [17:37:17] is that more-or-less accurate, if trying to stick to the basics that are relatively-certain and not our suppositions beyond? [17:37:25] it is [17:42:04] I'm not entirely sure about the 5xx-without-thread-limiting part though [17:42:31] hmmm ok [17:43:07] we can of course raise the limits, but I imagine the outcome isn't going to help us much :) [17:43:51] (it might result in a net reduction in 5xx, but more slowlog entries, and a harder time pinpointing the underlying cause of the pileup?) [17:45:11] it's the implicit action-at-a-distance that makes this hard [17:45:57] there's no obvious pattern in the failing requests. whatever the broken pattern is we're looking for, it was probably happening in a more-subtle way before the pileup starts happening and affecting $random_things badly. [17:47:14] and of course, we know from past investigations that one of the more-elusive causes can be long request timeouts to the applayer, too. 
[17:47:43] because they can start causing carnage just by existing and holding open threads and such, but they won't get logged about until much later when they finally time out and "complete" and record a shmlog entry [17:48:10] and by then you've got a storm of bad things being logged, and the original minority slow requests are lost in the noise [17:48:46] and I guess the 3.5-day restarts haven't had a huge impact [17:48:57] which is very interesting [17:49:22] given that usually we only did see backends running for >~ 4 days having problems [17:49:27] well [17:49:33] I think? [17:49:58] (a) even given a random distribution of which backends are failing, at any given time under weekly restarts, half of them would appear to be in the latter half of their age range. [17:50:13] (b) we're biased to look for aging varnishes as a problem due to past incidents [17:51:30] (c) aging varnishes have known but somewhat-mitigated performance issues. so it's not unreasonable that while the age isn't causative, it can be an exacerbating catalyst to make this unrelated problem show up more frequently or prominently. [17:52:37] because of all of these things, it doesn't surprise me that we observed problems being on the longer-age varnishes when looking, and it also doesn't surprise me that cutting the maximum age in half didn't make the problems go away. [17:53:00] (although I was hopeful at the time!) [18:09:56] on the other hand, what the 3.5-day restart change has done, is double the pace at which we create stale VCLs in the ever-living frontend instances :) [18:10:16] :) [18:10:46] I'm inclined, given findings today, to revert my 1s-probe-interval changes and all directly-related bits [18:11:20] and maybe as you were noting, follow up with even-longer windows that are a bit more blip-resilient, but still based on the original 100ms intervals [18:11:41] but maybe that part should wait on some hospital log evidence of the patterns there [18:13:46] I should sit and think a bit about the overall characteristics of the varnish probes' windowing system in general, I suspect I don't understand it as well as I should [18:14:12] I think it's just a threshold of N/M, e.g. 3/5 means 10101 results in healthy [18:14:40] I tend to prefer streak-based anti-flap that biases in the down direction [18:16:23] e.g. where the tunables are: N "fail" events (ever, windowless) transitions to the down state, but a streak of M "ok" events at any time resets the fail-counter (or transitions to up if you were already failed). [18:17:08] I think what varnish currently does is not so flap-resistant no matter what numbers you put in [18:18:07] * mark thinks the investigation of this issue over the past week already fills a book [18:19:08] hospital logs getting in (only a few triggered by me on cp1008 so far) [18:19:11] https://logstash.wikimedia.org/goto/2b8945e034e3a00184fc9b4881b58f5f [18:19:49] the first two do not include the host field because puppet didn't run on the logstash collector yet at that point, hence the magic in /etc/logstash/conf.d/20-filter_logback.conf wasn't there yet [18:23:14] merged [18:23:18] heh [18:23:33] merged 5.1.3-1wm6 as it is behaving fine on cp3007 [18:23:44] ok [18:24:10] it seems like a good idea to test it further on misc@esams [18:24:53] and then if nothing breaks on a text host, to compare stats and see if fixing the OH leak does help at some level [18:25:10] "and then, if nothing breaks, ..."
[18:25:59] yeah [18:26:23] some days this feels like playing jenga [18:26:44] if we can just pull this one more piece out, surely it won't fall apart on us... [18:27:38] headline from tomorrow: "So it turns out that leaked objheads were the only thing saving us from the more-brutal impacts of varnish bug #70818" [18:28:23] hahahaha [18:29:33] VSV0003 - DoS caused by unleaked objheads [18:30:27] Under certain conditions, usually on Tuesdays, a specially crafted request ... [18:31:05] speaking of probe flaps and such, we never did get around to taking advantage of the shard director's rampup/warmup features in v5, which might be generally-helpful. [18:32:10] IIRC the rampup parameters basically allow for, when a given backend transitions from unhealthy->healthy, slowly ramping it backup to full participation in the chash ring, instead of suddenly all at once [18:32:44] and warmpup parameters let you send 1/N requests under normal (all-healthy) conditions to the next-best chash destination, in hopes of pre-warming hot objects ahead of healthy->unhealthy transitions. [18:32:51] useful! [18:32:53] but it all defaults off and we never had it before [18:34:00] bbiab [18:34:05] me too [19:16:25] oh hey [19:16:26] https://logstash.wikimedia.org/goto/f9be59947c1d7a2567887f9fe89fc5c9 [19:16:57] backends are going to the hospital and back [19:24:20] yeah but those VCLs already don't exist [19:24:33] and also it's odd that they come in bursts from a given node to a bunch of other nodes [19:24:42] clearly the prober is more at fault than the probee [19:24:56] and they don't recover, do they [19:25:42] I forgot to include layer info (frontend vs backend) [19:25:48] https://gerrit.wikimedia.org/r/420823 [19:25:50] yeah but it can be inferred fairly easily [19:26:01] the point is though, it hasn't been that long since the last batch and the VCL UUID is gone [19:26:17] I tend to suspect that when a VCL gets discarded from going cold, all its probes go sick in the process? [19:27:33] (I'm inferring these are all be->be because fe->be doesn't probe x-dc) [19:28:08] some are same-dc, too, of course [19:29:20] right [19:29:47] but still, I suspect we're getting spammed by VCL going cold during discard [19:30:00] (but possibly there are legit cases hiding, too) [19:30:42] we might have to filter for probe results coming from the currently warm vcl only [19:31:13] yeah [19:31:19] which sucks :) [19:31:52] if I can come up with some sane series of commits that eliminates reloads on confd pool changes, any general objections to heading in that general direction? [19:33:00] nope [19:33:25] I think I've convinced myself it's possible to do so with creating new race conditions [19:33:42] but it's always hard to tell until you see the real implementation details [19:33:51] that's great, assuming you mean *without* new races :) [19:34:04] yeah, that too [19:34:30] (and without stalling a depool action waiting for an agent run to complete to enact a puppet disable) [19:35:16] the idea being that all backends are defined in vcl once, and they get depooled/pooled by changing their state with varnishadm? [19:35:57] it would be nice if, in place of puppet disables and other such things, we had some kind of host-level "doing things to varnishes states" mutex that all scripts coming from puppet, cron, cumin, confd, etc... which touch on varnish things could wait their turn on to serialize execution. [19:45:59] ema: yes. 
my thinking is I can make something work more-or-less along the lines in the ticket update yesterday, wherever that was. [19:46:46] the 1-5 in the middle of: https://phabricator.wikimedia.org/T189892#4063721 [19:48:40] so the vcl will create a director and do something like director.add_backend(x) for all origin servers [19:49:01] everything else will be done with set healthy/sick [19:49:18] right, all the stuff that's in the confd-templated directors.vcl would be in the static puppet-driven templates like it used to be pre-confd [19:49:59] and persist the healthy/sick state based on pooling to a file, confd runs the set_healths from that file after it updates it, for all live VCLs. vcl_reload also runs the set_healths on new VCLs it creates, after load and before use [19:50:20] and then some BS will hook into ExecStartPost to bring it onto the boot VCL on a daemon restart, too [19:51:29] whatever script runs these set_health will of course need to ignore errors, since backends could be created or destroyed (actual new nodes or decoms) while VCLs don't all reference the same set [19:52:12] and I think, what looks like race conditions between confd-pool-changes and puppet-based actual vcl_reloads for this case, really aren't, so long as operations are done in the proper order. [19:54:38] now that I really think about it, I don't know that we've ever really confirmed the sanity of how the shard director interacts with depool-via-delete-and-reload. maybe we did and I forgot? [19:55:08] no, we didn't [19:55:22] but in any case, in the general sense depooling via admin_health sick states seems less-likely to cause a complication with any such module. [19:55:58] and it kills a source of reload spam and stale VCLs [19:56:21] it may not solve our problems, but it seems a net win and might bring some clarity [19:57:37] yes [19:57:59] anyways, I think I have 90% of a working final picture in my head that's sane. next step is factoring that into sanely-deployable steps that don't create havoc during the transition heh [19:58:04] meanwhile I've confirmed that if you manually set a backend as sick, probes still think it is healthy (if indeed that's the case) [19:58:23] Backend name Admin Probe Last updated [19:58:30] vcl-4d359c57-d34b-47d3-be57-bd3b9ac552e2.be_cp1008_wikimedia_org sick Healthy 5/5 Tue, 20 Mar 2018 19:57:07 GMT [19:58:45] which means we're not gonna get spammed by varnishospital for every admin action [20:01:05] well that's a plus :) [20:01:12] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4066449 (10Vgutierrez) varnishxcache metrics (varnish.xcache.*) are being used in the dashboard db/varnish-caching (Varnish Caching). Migration to Promet...
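A bare-bones sketch of the admin-health depool/repool flow described above (19:48-19:58); the backend name is the cp1008 one from the backend.list output just pasted, the glob pattern is illustrative, and the frontend instance would need the usual -n frontend:

    # depool: force the backend's admin state to sick in every loaded VCL that defines it
    varnishadm backend.set_health '*.be_cp1008_wikimedia_org' sick
    # verify: the Admin column reads "sick" even while Probe stays "Healthy 5/5"
    varnishadm backend.list | grep be_cp1008_wikimedia_org
    # repool: hand the decision back to the probes
    varnishadm backend.set_health '*.be_cp1008_wikimedia_org' auto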
[20:02:42] vgutierrez: I don't think the ciphersuite explorer graphs are yet equivalent between the old version and the prometheus one [20:03:01] bblack: ack [20:03:05] vgutierrez: this may come down to just needing the prometheus queries and grafana settings massaged, I donno, I haven't had time to look deeply [20:03:17] bblack: I'll do that [20:03:41] but if you look at the top graph of both (what really matters) for the same timespan, the two issues that stand out: [20:04:00] 1) All the shapely lines in the old one become flat lines in the new one, clearly something's amiss there [20:04:18] 2) The graph axis shows %, but the tooltip is showing some raw metric [20:04:32] I already know how to fix #2 [20:05:19] it's fixed here for instance: https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-caching-wip?refresh=15m&orgId=1 [20:05:41] 3) No "total", showing the sum percentage of what the selectors at the top have narrowed things to (which is how we find out what the total percentage of, say, "TLSv1.2 connections that use RSA"/"all" is. I think the new extra graphs at the bottom of the prometheus one are attempting to make up for that in limited cases) [20:06:47] 4) Off-by-one-errors in my lists [20:06:50] anyways :P [20:08:21] sure.. tomorrow I'll give to the prometheus tls cipher explorer dashboard some love before going for the next one [20:10:23] also in general, even on the non-prometheus one, the legacy-vs-new numbers never quite aligned. I suspect this is because one or both are doing statistically-invalid things with how summing and averaging is done :) [20:10:49] I'm inclined to not care much, since we really haven't had anyone with the necessary background and rigor try to figure that out, but FYI [20:11:26] but it might be nice someday if someone sat through the rigor of figuring out that the numbers are valid, instead of just kinda-ballpark-close-at-least-for-certain-time-windows or whatever [20:13:08] that's mostly my fault of course. a lot of these older graphs were the result of me hammering on the graphite query language + grafana settings until I got something that looked like what I wanted. [20:13:39] the query syntax ended up very obtuse and probably-statistically-wrong, but it was enough effort to get something reasonable displayed at all :P [20:15:14] e.g. 
for the non-prom ciphersuite explorer's main graph, if you ignore the outer sorting and label-aliasing junk: [20:15:20] summarize(asPercent(varnish.clients.tls.$tlsv.$kx.$auth.$ciph.rate,sumSeries(varnish.clients.tls.*.*.*.*.rate)), '$avgwin', 'avg', true) [20:16:13] I honestly couldn't tell you how the original input data -> graphite -> .rate -> sumSeries() -> asPercent() -> summarize() works out [20:16:28] probably statistically-invalidly :) [20:16:59] it was just the first incantation I managed to hammer together that created outputs looking like the shapes I wanted [20:18:39] improper averaging of averages of rates from different kinds of averaging, etc [20:19:52] (that's probably why total never comes out to 100.0% too) [20:20:40] so I need to come with some prometheus dashboards massaged enough to be bblack approved :P [20:21:12] nah, jokes aside, I'll try to mimick the graphite backed ones as much as possible [20:21:28] well really I'm saying the opposite :) [20:21:52] if you can make it more-statistically-valid along the way by not just cloning my graphite syntax over to a prometheus equivalent, so much the better [20:22:17] and once we are happy with the new ones, let's toss the old system and let's continue to improve the prometheus based ones [20:22:30] I'll try that, but I cannot assure anything [20:22:45] whatever gets us there with (a) the least pain and (b) not losing functionality during transition [20:24:15] 10netops, 10Operations, 10ops-codfw, 10User-Elukey: Switch port configuration for mw2259-mw2290 - https://phabricator.wikimedia.org/T190115#4066539 (10RobH) Done with b3 > asw-b3-codfw > mw2259 ge-3/0/8 > mw2260 ge-3/0/9 > mw2261 ge-3/0/10 > mw2262 ge-3/0/11 > mw2263 ge-3/0/12 > mw2264 ge-3/0/13 > mw22... [20:40:17] 10netops, 10Operations, 10ops-codfw, 10User-Elukey: Switch port configuration for mw2259-mw2290 - https://phabricator.wikimedia.org/T190115#4066620 (10RobH) 05Open>03Resolved The remainder are now done as well. ``` [edit interfaces interface-range vlan-private1-d-codfw] member-range ge-1/0/0 { ..... [20:40:31] I declare bankrupcy on trying to follow what's going on in this channel [20:49:55] nobody actually knows what's going on in this channel [20:50:21] we've just evolved a system of senseless chatter that looks somewhat self-consistent to outsiders so it seems like something inscrutable is happening. [20:52:10] Someone finally found a usage for AI [20:53:33] bblack: totally unrelated, do we have usages for let's encrypt wildcards? [20:53:59] yes [20:54:14] do we have plans to eventually use them for those usages? [20:54:15] yes [20:54:40] have we made any recent progress on the several technical and implementation hurdles in the way on our end recently? [20:54:43] no [20:55:12] ok, that's all I was curious about [20:56:33] we need an LE-as-a-service thing running in ganeti redundantly in codfw/eqiad, that caches and other edges can forward HTTP auth requests to, and fetch updated keys/certs from (which are generated there). [20:57:11] and then we need to add DNS-based (vs HTTP) auth to that for ACMEv2 to support wildcards [20:57:35] and then we need to create our junk-domain redirector service, which uses the service above. 
[20:57:47] and also convert all our one-off cases like lists.wikimedia.org to use the above too [20:58:07] and also make our big public unified certs on the caches use it as well, at least for 1 of our 2 redundant cert vendors (the other remaining commercial) [20:58:27] Is there a task I can follow to not as the same questions over and over? [20:58:31] ask* [20:58:49] * ema pretends to go afk to make outsiders think he exists [20:59:00] there's some tasks for parts of this, but it hasn't been all brought together like the above in task structure yet, especially lately in light of live wildcard support [20:59:21] let's see [20:59:37] https://phabricator.wikimedia.org/T133717 was about initial conversion of one-off services, using our little LE-ification script that works for the small/single use-case [20:59:45] (almost done, except the MXes apparently) [21:00:28] https://phabricator.wikimedia.org/T134447 was about making that script more-scalable in various ways/directions, but doesn't quite reach the obvious end-goal of making it a separate service [21:01:12] https://phabricator.wikimedia.org/T141266 is about one of the other missing features there (parallel ECDSA+RSA certs) [21:01:35] https://phabricator.wikimedia.org/T133548 is about the junk/non-canonical domain redirector service that would rely on a more-scalable LE solution [21:02:11] thanks [21:02:19] all of those were from a timeframe when wildcards weren't available yet, so that changes a lot of things, the structure and contents of the various LE-related tasks need some re-thinking and updating at this point [21:47:01] ema: re-reading 1799, I follow different links out of it every time and see new fascinating things. it's like, the nexus of all serious varnish issues [21:47:24] ema: on this reading, I finally really noticed an odd hinty gem here: [21:47:30] ema: https://github.com/varnishcache/varnish-cache/issues/1799#issuecomment-352801596 [21:47:52] ema: "PR#2519 helps, but doesn't totally cure the problem in our 4.1.9 overloaded-40Gbit testing (we probably hit the late expiry thread case, no keep configured), we still get the fetch without busy object VCL_Error log message and some request storms - but fewer than before." [21:48:41] ema: first off, wtf is this person referencing when they say "we probably hit the late expiry thread case" like it's some known simple thing? because that sounds a lot like some of our problems. also "some request storms" sounds familiar too.... [21:51:07] ema: wtf he's referencing is probably some of the gems in https://github.com/varnishcache/varnish-cache/pull/2519 , there's a lot of text there, but some snippets to entice: [21:51:40] ema: "However, if a resource is very popular (like a manifest file in a live [21:51:43] streaming setup) and has 0s grace, and the expiry thread lags a little [21:51:46] bit behind, then vcl_hit can get an expired object even when obj.keep [21:51:49] is zero. In these circumstances we can get a surge of requests to the [21:51:51] backend, and this is especially bad on a very busy server." [21:52:14] (because this triggers the 1799 hit-returns-miss crap) [21:53:15] ema: everyone seems to like (and indeed, have merged) the 2519 fixups for the keep case, it just doesn't cover the grace case. still, it could be the keep case that's biting us more for all we know. 
and we could minimize the grace case by getting rid of our healthy-vs-sick grace differentials in vcl_hit [21:56:42] ema: aside from trying those VCL changes, perhaps we should be perusing some esams varnishlogs around the right time of the day to see if we're getting these "fetch without busy object VCL_Error" logs... [22:04:32] 10Traffic, 10netops, 10Operations: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4066932 (10ayounsi) [22:06:11] ema: it really looks to me like maybe porting in the 2519 fixup and then eliminating wm_common_hit_grace (and possibly changing the singular grace value we're then stuck with) might fix all related things? [22:52:19] 10Traffic, 10Operations, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4067011 (10Krinkle) [22:53:15] 10Traffic, 10Operations, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4036653 (10Krinkle) (Added a Status column to separate network analysis [Y/N/?] from current status [blocked/planned/done].) [23:03:11] ema: for tomorrow (not merging now, as I'm stepping out) - https://gerrit.wikimedia.org/r/#/c/420927/ + https://gerrit.wikimedia.org/r/#/c/420928/ ?
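For reference, the healthy-vs-sick grace split being discussed above follows the well-known upstream grace pattern; a hedged sketch is below (the 5m healthy-side grace and the placeholder backend are illustrative, not the actual wm_common_hit_grace setup), with the expired-hit return(miss) path that #1799 and PR 2519 revolve around marked:

    vcl 4.0;
    import std;

    backend default { .host = "127.0.0.1"; .port = "80"; }  # placeholder so the sketch compiles

    sub vcl_hit {
        if (obj.ttl >= 0s) {
            return (deliver);                     # fresh hit
        }
        if (std.healthy(req.backend_hint)) {
            if (obj.ttl + 5m > 0s) {              # short grace while the backend is healthy
                return (deliver);
            }
        } else if (obj.ttl + obj.grace > 0s) {    # full object grace while the backend is sick
            return (deliver);
        }
        return (miss);                            # expired hit: when no busy object can be made,
                                                  # this is where the "fetch without busy object"
                                                  # request storms from #1799 come from
    }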