[09:01:47] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4152599 (10Vgutierrez) There is some stuff still missing, like setting the management password, if that's handled I think I can continue from there.
[09:04:09] mmh
[09:04:45] this change got diff'd in a pretty interesting way https://gerrit.wikimedia.org/r/#/c/428580/
[09:05:43] yup
[09:05:44] I've moved the XCPS parsing code from vcl_recv to its own sub (log_xcps_info) and the diff essentially shows changes to all surrounding code
[09:05:57] I was checking it and the diff gave me a hard time
[09:06:06] but it looks good
[09:10:23] testing in labs
[09:10:56] BTW, if you enable the full view on the diff it looks way better
[09:16:40] hmmm clush rcopy feature is awesome
[09:21:23] change works fine in labs, vtc test added
[09:22:22] :D
[09:37:07] added status dropdown to https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&orgId=1
[09:37:37] we can now filter by status and see caching info for 4xx only (for instance)
[09:38:20] filtering for 5xx is interesting too
[09:38:31] stop sniffing my screen :P
[09:39:02] 5xx is pretty interesting
[09:39:27] you can see the amount of errors at each layer (front, local, remote, applayer)
[09:40:45] you can see for example that yesterday's 21:50ish spike was due to applayer issues
[09:40:48] https://grafana.wikimedia.org/dashboard/db/varnish-caching?orgId=1&var-cluster=All&var-site=All&var-status=5&from=1524519653876&to=1524522038673
[10:14:22] bblack: for your reviewing pleasure! https://gerrit.wikimedia.org/r/#/c/428580/ and https://gerrit.wikimedia.org/r/#/c/428580/ (based on your latest regex)
[10:14:38] bblack: heh, the latter would be https://gerrit.wikimedia.org/r/#/c/428594/
[10:16:20] bblack: also, yesterday I upgraded misc to varnish 5.1.3-1wm7 (OH leak, #1799 fixup). OK to start upgrading a few text hosts?
[11:56:58] ema: the 1799 fix is just the initial one for the keep case that doesn't cover grace, right?
[12:04:57] bblack: it's the one that considers grace during lookup https://github.com/varnishcache/varnish-cache/commit/c12a3e5e2bc978edafaccfe1c586e5d5dab01fcb
[12:16:39] are we actually not affected by #1799 at all because of T192368?
[12:16:39] T192368: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368
[12:18:36] yes, the point of my patch that made vcl_hit unconditionally return (deliver) was to avoid 1799
[12:20:00] if I'm reading the upstream 1799 fix correctly, it basically moves some of this logic into varnish, so that if we don't care much about backend health (meh), we can stick with a return-deliver-only approach in vcl_hit and not suffer the bad effects of long keep acting like grace, I think.
[12:20:06] oh ok, I was under the impression that the unconditional return(deliver) helped with #1799 but did not fully mitigate it
[12:20:45] it fully mitigates it, but the questions that remain are: (1) was it a big problem for us and (2) what are the tradeoff costs of the mitigation
[12:22:29] I'm still parsing through how the upstream 1799 fix works, though, and how it interacts
[12:23:00] I think it may allow us to add back return(miss) if we're past grace and in keep/mboxlag, and not induce 1799 in the process.
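
For context, the T192368 mitigation being discussed boils down to a vcl_hit like the one below. This is only a minimal sketch (the backend definition is a placeholder so it compiles standalone), not the deployed wikimedia VCL:

    vcl 4.0;

    backend be_placeholder {
        .host = "127.0.0.1";    # placeholder so the sketch compiles; not the real applayer
        .port = "8080";
    }

    sub vcl_hit {
        # Unconditional deliver: even when obj.ttl has run out and the object is
        # only alive thanks to grace/keep, serve it (possibly stale) and let a
        # background fetch refresh it. The alternative hit->return(miss) path is
        # what triggers varnish-cache #1799, since such misses skip request
        # coalescing.
        return (deliver);
    }

The tradeoff, as discussed on the task, is serving stale responses while the background fetch is in progress.
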
[12:23:55] and the VCL changes discussed at the bottom of the patch's commitmsg are more about shifting the nature of recv-grace vs obj-grace in light of the patch, but we've already effectively made that shift and stopped varying grace on healthiness altogether for now.
[12:23:58] my understanding of the tradeoffs (2) is on T192368 (mostly stale responses while bgfetch is in progress and possibly skewed stats)
[12:23:58] T192368: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368
[12:24:07] right
[12:24:36] so to recap the overall situation: without the 1799 source fix you linked
[12:24:44] if you follow standard varnish recommendations
[12:25:02] you set obj.grace=$sick_backend_grace_time
[12:25:44] and req.grace you vary based on backend health to either be the same $sick_backend_grace_time or a shorter $healthy_backend_grace_time
[12:26:48] and then vcl_hit says "if object is expired but within req.grace, return deliver, else return miss" (and misses can always use a fully-expired object for IMS-matching, keep is more or less a parameter that equates to forced expiry lag)
[12:27:45] but this leads to a relatively-common scenario where obj.grace=$larger_sick_value, but req.grace is smaller, and it's a hit->return(miss) scenario, which triggers 1799.
[12:28:10] with the source patch applied, you're supposed to change strategy on grace:
[12:28:40] set obj.grace=$healthy_backend_grace_time, and set keep to >= $sick_backend_grace_time
[12:29:06] so that there's not a common window where the backend is healthy but there are common reasons to return (miss) from vcl_hit.
[12:29:33] we've lost that healthy/sick distinction for now anyways
[12:30:10] in our current VCL, we could/should be making the grace distinction (if past grace but in keep, return (miss)), but it would cause 1799 again
[12:30:22] with this patch, we should be able to make the grace distinction without triggering 1799
[12:30:42] (and if we want, virtually-extend grace during sick-backend times, using a fraction of the keep time)
[12:32:15] anyways, I'm staring at the other X-C-P patch now, it's curious that the diff gets so confused, it makes it hard for me to convince myself there's not something else amiss with the patch heh
[12:33:12] so the main issue with 1799 as far as I understand is that return(miss) in vcl_hit essentially disables request coalescing
[12:33:53] and c12a3e5e2bc978edafaccfe1c586e5d5dab01fcb fixes this by putting those requests on the waitinglist too
[12:35:23] bblack: I know right? That diff really doesn't help :)
[12:35:51] 11:10 < vgutierrez> BTW, if you enable the full view on the diff it looks way better
[12:36:06] I'm not sure what the "full view" is ^
[12:37:18] ema: by default gerrit hides unchanged lines
[12:37:19] it's just a very odd case I guess
[12:37:32] because the moved block is larger than the top of vcl_recv it's jumping over?
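
For reference, the shape of the X-C-P refactor being reviewed is roughly the following, assuming XCPS refers to the X-Connection-Properties header. The body of log_xcps_info here is hypothetical (the real sub parses the header with the regex from the patch, which isn't quoted in this log); the point is just the move-into-a-sub structure that confused the gerrit diff:

    vcl 4.0;

    import std;

    backend be_placeholder {
        .host = "127.0.0.1";    # placeholder so the sketch compiles
        .port = "8080";
    }

    # Hypothetical body: the real sub extracts connection/TLS details from the
    # header; here we just log the raw value via the std vmod.
    sub log_xcps_info {
        if (req.http.X-Connection-Properties) {
            std.log("XCPS: " + req.http.X-Connection-Properties);
        }
    }

    sub vcl_recv {
        # The parsing/logging code that used to sit inline at the top of
        # vcl_recv is now just a call; gerrit's default diff view renders the
        # surrounding unchanged code as modified, hence the confusing diff.
        call log_xcps_info;
    }
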
[12:39:58] ema: your understanding above I think is more-or-less correct, with the caveat that in order to accomplish its aims, the patch has to take some internal notice of grace-time that was never there before, and assumes a different meaning (that obj.grace is a minimal/healthy grace time, not a long one that we're likely to cut short in vcl_hit because a backend is normal+healthy)
[12:43:49] so when we just apply the source patch and keep our VCL as it is today, we still have the super-stale-response problem
[12:44:03] but, it allows us to then afterwards switch our vcl_hit to something like:
[12:44:17] if (obj.ttl + obj.grace > 0s) { return (hit); } else { return (miss); }
[12:44:27] right
[12:44:32] which will restore normal-ish stale-ness behavior without inducing 1799 misbehavior
[12:44:49] and then if we want to check for sick backends again, we have to do it using a fraction of keep instead
[12:44:58] well s/hit/deliver/ in your snippet above
[12:45:16] e.g. if we decide that a sick backend adds 30 minutes to our healthy value stored in obj.grace, it's:
[12:46:18] if (backend_sick) { sick_add_time = 30m } else { sick_add_time = 0 } if (obj.ttl + obj.grace + sick_add_time > 0s) { return (hit); } else { return (miss); }
[12:46:24] and yes, s/hit/deliver/ :)
[12:46:52] the added sick time comes from the keep part of the timeline now, instead of putting it in obj.grace and trying to cut that shorter-when-healthy
[12:50:47] right, so in the new world we'll reduce/eliminate stale responses when backends are healthy
[12:51:16] (without killing backends because 1799 will be a thing of the past)
[12:56:24] yeah I think so
[12:57:12] and it seems sanely-deployable from present state (first get the patch rolled out, then add back the basic vcl_hit->miss when outside obj.grace, then look at fixing our obj.grace back to a short healthy value and varying on sickness.
[12:57:16] )
[13:03:16] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4153488 (10faidon) @Cmjohnson I think @BBlack's question above was for you -- task description seems to point at a few of the steps on your side being still pending at least.
[13:11:28] 10Traffic, 10Operations: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368#4153519 (10ema) As of now, returning `deliver` instead of `fetch` is a valid mitigation for [[ https://github.com/varnishcache/varnish-cache/commit/c12a3e5e2bc978edafaccfe1c586e5d5dab01fcb| #1799 ]]. T...
[13:22:23] funny.. top UA using AES128-SHA looks bogus: "Mozilla/5.0 (Windows NT 6.5; rv:35.0) Gecko/20100101 Firefox/35.0" [TLSv1 RSA RSA AES128-SHA]
[13:22:33] (note NT 6.5)
[13:23:37] firefox 35;
[13:23:44] could be spoofed
[13:23:52] some bot spoofing their UA
[13:24:03] yup
[13:24:18] every request with that UA comes from the same cloud provider
[13:25:45] yeah likely spoofing
[13:26:38] NT 6.5 never existed, right?
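
Going back to the vcl_hit snippets from the 12:44-12:46 exchange above: put into compilable form, the post-fix vcl_hit would look something like the sketch below. Everything here is illustrative, assuming the upstream c12a3e5 fix is already rolled out; the backend/probe, the 5m/7d/30m values, and the use of std.healthy() as the "backend_sick" signal are assumptions, not the plan's actual parameters:

    vcl 4.0;

    import std;

    backend be_applayer {
        .host = "127.0.0.1";    # illustrative backend; a probe supplies the health state
        .port = "8080";
        .probe = {
            .url = "/healthcheck";
        }
    }

    sub vcl_backend_response {
        # New-world timings: a short "healthy" grace on the object, with the
        # sick-backend allowance carved out of keep instead of obj.grace.
        set beresp.grace = 5m;
        set beresp.keep = 7d;
    }

    sub vcl_hit {
        if (obj.ttl + obj.grace > 0s) {
            # Within TTL or the healthy grace window: deliver (a stale object
            # within grace still gets refreshed by a background fetch).
            return (deliver);
        }
        if (!std.healthy(be_applayer) && obj.ttl + obj.grace + 30m > 0s) {
            # Sick backend: virtually extend grace by 30 minutes, borrowed
            # from the keep part of the object's timeline.
            return (deliver);
        }
        # Past grace (and any sick allowance) but possibly still within keep:
        # miss, so the kept object can be used for IMS revalidation.
        return (miss);
    }

Per the discussion, the return (miss) branch only becomes safe once the upstream fix is deployed; without it, that path would re-trigger 1799.
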
[13:26:53] not according to Wikipedia
[13:27:14] from 6.3 (Windows Server 2012 R2) it jumps to 10.0 (Windows 10)
[13:32:24] but maybe that article was edited by that bot
[13:33:27] looks like the town of Livigno also got some requests from that UA :)
[13:33:29] http://webcache.googleusercontent.com/search?q=cache:5Hrisc9EydQJ:www.comune.livigno.so.it/stat/agent_201706.html+&cd=9&hl=en&ct=clnk&gl=de
[13:47:45] spoofing a never-actually-existed UA seems inscrutable
[13:49:06] less likely to be blocked than UA:banana and still able to tell your magic bot from other traffic?
[13:49:42] will conform to UA parsing handlers but also unique yeah
[15:53:30] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4154242 (10Cmjohnson) @Vgutierrez Sorry about that, it was set but I had an extra . in the subnet. Anyway, that is fixed. Also, I am not sure which image you want to instal...
[16:10:39] bblack: can we apply the same image to lvs1016 as for the older eqiad lvs instances?
[16:12:41] vgutierrez: you mean jessie-vs-stretch?
[16:14:26] nope, partman settings
[16:27:57] meaning modules/install_server/files/autoinstall/netboot.cfg entry
[16:30:51] cmjohnson1 already set DRAC password... so I could take lvs1016 from there
[16:32:49] vgutierrez: sorry, yes, raid1-lvm like most of the rest, *not* "flat" like lvs2 and lvs1007-12 (we don't want to use hw raid, just sw raid)
[16:33:40] ack :)