[09:01:47] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4152599 (10Vgutierrez) There is some stuff still missing, like setting the management password, if that's handled I think I can continue from there.
[09:04:09] mmh
[09:04:45] this change got diff'd in a pretty interesting way https://gerrit.wikimedia.org/r/#/c/428580/
[09:05:43] yup
[09:05:44] I've moved the XCPS parsing code from vcl_recv to its own sub (log_xcps_info) and the diff essentially shows changes to all surrounding code
[09:05:57] I was checking it and the diff gave me a hard time
[09:06:06] but it looks good
[09:10:23] testing in labs
[09:10:56] BTW, if you enable the full view on the diff it looks way better
[09:16:40] hmmm clush rcopy feature is awesome
[09:21:23] change works fine in labs, vtc test added
[09:22:22] :D
[09:37:07] added status dropdown to https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&orgId=1
[09:37:37] we can now filter by status and see caching info for 4xx only (for instance)
[09:38:20] filtering for 5xx is interesting too
[09:38:31] stop sniffing my screen :P
[09:39:02] 5xx is pretty interesting
[09:39:27] you can see the amount of errors at each layer (front, local, remote, applayer)
[09:40:45] you can see for example that yesterday's 21:50ish spike was due to applayer issues
[09:40:48] https://grafana.wikimedia.org/dashboard/db/varnish-caching?orgId=1&var-cluster=All&var-site=All&var-status=5&from=1524519653876&to=1524522038673
[10:14:22] bblack: for your reviewing pleasure! https://gerrit.wikimedia.org/r/#/c/428580/ and https://gerrit.wikimedia.org/r/#/c/428580/ (based on your latest regex)
[10:14:38] bblack: heh, the latter would be https://gerrit.wikimedia.org/r/#/c/428594/
[10:16:20] bblack: also, yesterday I upgraded misc to varnish 5.1.3-1wm7 (OH leak, #1799 fixup). OK to start upgrading a few text hosts?
[11:56:58] ema: the 1799 fix is just the initial one for the keep case that doesn't cover grace, right?
[12:04:57] bblack: it's the one that considers grace during lookup https://github.com/varnishcache/varnish-cache/commit/c12a3e5e2bc978edafaccfe1c586e5d5dab01fcb
[12:16:39] are we actually not affected by #1799 at all because of T192368?
[12:16:39] T192368: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368
[12:18:36] yes, the point of my patch that made vcl_hit unconditionally return (deliver) was to avoid 1799
[12:20:00] if I'm reading the upstream 1799 fix correctly, it basically moves some of this logic into varnish, so that if we don't care much about backend health (meh), we can stick with a return-deliver-only approach in vcl_hit and not suffer the bad effects of long keep acting like grace, I think.
[12:20:06] oh ok, I was under the impression that the unconditional return(deliver) helped with #1799 but did not fully mitigate it
[12:20:45] it fully mitigates it, but the questions that remain are: (1) was it a big problem for us and (2) what are the tradeoff costs of the mitigation
[12:22:29] I'm still parsing through how the upstream 1799 fix works, though, and how it interacts
[12:23:00] I think it may allow us to add back return(miss) if we're past grace and in keep/mboxlag, and not induce 1799 in the process.
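
For context, the T192368 mitigation being discussed boils down to a vcl_hit like the one below. This is only a minimal sketch (the backend definition is a placeholder so it compiles standalone), not the deployed wikimedia VCL:

    vcl 4.0;

    backend be_placeholder {
        .host = "127.0.0.1";    # placeholder so the sketch compiles; not the real applayer
        .port = "8080";
    }

    sub vcl_hit {
        # Unconditional deliver: even when obj.ttl has run out and the object is
        # only alive thanks to grace/keep, serve it (possibly stale) and let a
        # background fetch refresh it. The alternative hit->return(miss) path is
        # what triggers varnish-cache #1799, since such misses skip request
        # coalescing.
        return (deliver);
    }

The tradeoff, as discussed on the task, is serving stale responses while the background fetch is in progress.
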
[12:23:55] and the VCL changes discussed at the bottom of the patch's commitmsg are more about shifting the nature of recv-grace vs obj-grace in light of the patch, but we've already effectively made that shift and stopped varying grace on healthiness altogether for now.
[12:23:58] my understanding of the tradeoffs (2) is on T192368 (mostly stale responses while bgfetch is in progress and possibly skewed stats)
[12:23:58] T192368: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368
[12:24:07] right
[12:24:36] so to recap the overall situation: without the 1799 source fix you linked
[12:24:44] if you follow standard varnish recommendations
[12:25:02] you set obj.grace=$sick_backend_grace_time
[12:25:44] and req.grace you vary based on backend health to either be the same $sick_backend_grace_time or a shorter $healthy_backend_grace_time
[12:26:48] and then vcl_hit says "if object is expired but within req.grace, return deliver, else return miss" (and misses can always use a fully-expired object for IMS-matching, keep is more or less a parameter that equates to forced expiry lag)
[12:27:45] but this leads to a relatively-common scenario where obj.grace=$larger_sick_value, but req.grace is smaller, and it's a hit->return(miss) scenario, which triggers 1799.
[12:28:10] with the source patch applied, you're supposed to change strategy on grace:
[12:28:40] set obj.grace=$healthy_backend_grace_time, and set keep to >= $sick_backend_grace_time
[12:29:06] so that there's not a common window where the backend is healthy but there are common reasons to return (miss) from vcl_hit.
[12:29:33] we've lost that healthy/sick distinction for now anyways
[12:30:10] in our current VCL, we could/should be making the grace distinction (if past grace but in keep, return (miss)), but it would cause 1799 again
[12:30:22] with this patch, we should be able to make the grace distinction without triggering 1799
[12:30:42] (and if we want, virtually-extend grace during sick-backend times, using a fraction of the keep time)
[12:32:15] anyways, I'm staring at the other X-C-P patch now, it's curious that the diff gets so confused, it makes it hard for me to convince myself there's not something else amiss with the patch heh
[12:33:12] so the main issue with 1799 as far as I understand is that return(miss) in vcl_hit essentially disables request coalescing
[12:33:53] and c12a3e5e2bc978edafaccfe1c586e5d5dab01fcb fixes this by putting those requests on the waitinglist too
[12:35:23] bblack: I know right? That diff really doesn't help :)
[12:35:51] 11:10 < vgutierrez> BTW, if you enable the full view on the diff it looks way better
[12:36:06] I'm not sure what the "full view" is ^
[12:37:18] ema: by default gerrit hides unchanged lines
[12:37:19] it's just a very odd case I guess
[12:37:32] because the moved block is larger than the top of vcl_recv it's jumping over?
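
For reference, the shape of the X-C-P refactor being reviewed is roughly the following, assuming XCPS refers to the X-Connection-Properties header. The body of log_xcps_info here is hypothetical (the real sub parses the header with the regex from the patch, which isn't quoted in this log); the point is just the move-into-a-sub structure that confused the gerrit diff:

    vcl 4.0;

    import std;

    backend be_placeholder {
        .host = "127.0.0.1";    # placeholder so the sketch compiles
        .port = "8080";
    }

    # Hypothetical body: the real sub extracts connection/TLS details from the
    # header; here we just log the raw value via the std vmod.
    sub log_xcps_info {
        if (req.http.X-Connection-Properties) {
            std.log("XCPS: " + req.http.X-Connection-Properties);
        }
    }

    sub vcl_recv {
        # The parsing/logging code that used to sit inline at the top of
        # vcl_recv is now just a call; gerrit's default diff view renders the
        # surrounding unchanged code as modified, hence the confusing diff.
        call log_xcps_info;
    }
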
[12:39:58] ema: your understanding above I think is more-or-less correct, with the caveat that in order to accomplish its aims, the patch has to take some internal notice of grace-time that was never there before, and assumes a different meaning (that obj.grace is a minimal/healthy grace time, not a long one that we're likely to cut short in vcl_hit because a backend is normal+healthy)
[12:43:49] so when we just apply the source patch and keep our VCL as it is today, we still have the super-stale-response problem
[12:44:03] but, it allows us to then afterwards switch our vcl_hit to something like:
[12:44:17] if (obj.ttl + obj.grace > 0s) { return (hit); } else { return (miss); }
[12:44:27] right
[12:44:32] which will restore normal-ish stale-ness behavior without inducing 1799 misbehavior
[12:44:49] and then if we want to check for sick backends again, we have to do it using a fraction of keep instead
[12:44:58] well s/hit/deliver/ in your snippet above
[12:45:16] e.g. if we decide that a sick backend adds 30 minutes to our healthy value stored in obj.grace, it's:
[12:46:18] if (backend_sick) { sick_add_time = 30m } else { sick_add_time = 0 } if (obj.ttl + obj.grace + sick_add_time > 0s) { return (hit); } else { return (miss); }
[12:46:24] and yes, s/hit/deliver/ :)
[12:46:52] the added sick time comes from the keep part of the timeline now, instead of putting it in obj.grace and trying to cut that shorter-when-healthy
[12:50:47] right, so in the new world we'll reduce/eliminate stale responses when backends are healthy
[12:51:16] (without killing backends because 1799 will be a thing of the past)
[12:56:24] yeah I think so
[12:57:12] and it seems sanely-deployable from present state (first get the patch rolled out, then add back the basic vcl_hit->miss when outside obj.grace, then look at fixing our obj.grace back to a short healthy value and varying on sickness.
[12:57:16] )
[13:03:16] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4153488 (10faidon) @Cmjohnson I think @BBlack's question above was for you -- task description seems to point at a few of the steps on your side being still pending at least.
[13:11:28] 10Traffic, 10Operations: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368#4153519 (10ema) As of now, returning `deliver` instead of `fetch` is a valid mitigation for [[ https://github.com/varnishcache/varnish-cache/commit/c12a3e5e2bc978edafaccfe1c586e5d5dab01fcb| #1799 ]]. T...
[13:22:23] funny.. top UA using AES128-SHA looks bogus: "Mozilla/5.0 (Windows NT 6.5; rv:35.0) Gecko/20100101 Firefox/35.0" [TLSv1 RSA RSA AES128-SHA]
[13:22:33] (note NT 6.5)
[13:23:37] firefox 35;
[13:23:44] could be spoofed
[13:23:52] some bot spoofing their UA
[13:24:03] yup
[13:24:18] every request with that UA comes from the same cloud provider
[13:25:45] yeah likely spoofing
[13:26:38] NT 6.5 never existed, right?
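
Going back to the vcl_hit snippets from the 12:44-12:46 exchange above: put into compilable form, the post-fix vcl_hit would look something like the sketch below. Everything here is illustrative, assuming the upstream c12a3e5 fix is already rolled out; the backend/probe, the 5m/7d/30m values, and the use of std.healthy() as the "backend_sick" signal are assumptions, not the plan's actual parameters:

    vcl 4.0;

    import std;

    backend be_applayer {
        .host = "127.0.0.1";    # illustrative backend; a probe supplies the health state
        .port = "8080";
        .probe = {
            .url = "/healthcheck";
        }
    }

    sub vcl_backend_response {
        # New-world timings: a short "healthy" grace on the object, with the
        # sick-backend allowance carved out of keep instead of obj.grace.
        set beresp.grace = 5m;
        set beresp.keep = 7d;
    }

    sub vcl_hit {
        if (obj.ttl + obj.grace > 0s) {
            # Within TTL or the healthy grace window: deliver (a stale object
            # within grace still gets refreshed by a background fetch).
            return (deliver);
        }
        if (!std.healthy(be_applayer) && obj.ttl + obj.grace + 30m > 0s) {
            # Sick backend: virtually extend grace by 30 minutes, borrowed
            # from the keep part of the object's timeline.
            return (deliver);
        }
        # Past grace (and any sick allowance) but possibly still within keep:
        # miss, so the kept object can be used for IMS revalidation.
        return (miss);
    }

Per the discussion, the return (miss) branch only becomes safe once the upstream fix is deployed; without it, that path would re-trigger 1799.
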
[13:26:53] not according to Wikipedia
[13:27:14] from 6.3 (Windows Server 2012 R2) it jumps to 10.0 (Windows 10)
[13:32:24] but maybe that article was edited by that bot
[13:33:27] looks like the town of Livigno also got some requests from that UA :)
[13:33:29] http://webcache.googleusercontent.com/search?q=cache:5Hrisc9EydQJ:www.comune.livigno.so.it/stat/agent_201706.html+&cd=9&hl=en&ct=clnk&gl=de
[13:47:45] spoofing a never-actually-existed UA seems inscrutable
[13:49:06] less likely to be blocked than UA:banana and still able to tell your magic bot from other traffic?
[13:49:42] will conform to UA parsing handlers but also unique yeah
[15:53:30] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4154242 (10Cmjohnson) @Vgutierrez Sorry about that, it was set but I had an extra . in the subnet. Anyway, that is fixed. Also, I am not sure which image you want to instal...
[16:10:39] bblack: can we apply the same image to lvs1016 as for the older eqiad lvs instances?
[16:12:41] vgutierrez: you mean jessie-vs-stretch?
[16:14:26] nope, partman settings
[16:27:57] meaning modules/install_server/files/autoinstall/netboot.cfg entry
[16:30:51] cmjohnson1 already set DRAC password... so I could take lvs1016 from there
[16:32:49] vgutierrez: sorry, yes, raid1-lvm like most of the rest, *not* "flat" like lvs2 and lvs1007-12 (we don't want to use hw raid, just sw raid)
[16:33:40] ack :)