[09:52:40] 07HTTPS, 10Traffic, 06Operations, 15User-fgiunchedi: Enable HTTPS for swift clients - https://phabricator.wikimedia.org/T160616#3105367 (10Aklapper) [11:01:49] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, and 3 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3105544 (10ema) If I understand the main issue at hand correctly, the goal here is to make sure that developers can quic... [11:17:10] ema: gonna use cp1008 for some bbr testing [11:17:39] moritzm: cool! [11:23:49] nice [11:26:59] it [11:27:05] it's now enabled on cp1008 [11:30:20] moritzm: any specific tests you're going to perform now? [11:34:28] I haven't heard anything about whether it's useful (or even, harmful) for local near-zero RTT networks (e.g. inside of eqiad) [11:34:49] but all reports have been that it's awesome for both the public internet and for our own transport links inter-dc [11:35:14] google's last presentation on it: https://github.com/google/bbr/blob/master/Presentations/bbr-2017-02-08-google-net-research-summit.pdf [11:37:06] ema: first making sure it works in general [11:38:25] there's some others too: https://www.ietf.org/proceedings/97/slides/slides-97-iccrg-bbr-congestion-control-02.pdf [11:38:31] https://www.ietf.org/proceedings/97/slides/slides-97-maprg-traffic-policing-in-the-internet-yuchung-cheng-and-neal-cardwell-00.pdf [11:55:36] related, we should look at turning on fq_codel (either just on our edge stuff, or probably, everywhere) [11:55:56] apparently BBR gets slightly better (if anything) with it helping [12:06:00] I think the reason I avoided putting something like fq_codel in last time around, is that it runs a risk of de-tuning our scalability a bit [12:06:44] (the whole thing with the nginx edge traffic staying spread out across CPU cores along with RSS/RPS and several IRQs for the multi-queue NIC thing, etc) [12:07:20] it may matter less on the sending side anyways, though, and may be worth testing [12:12:11] paravoid: ping re DNS stuff, if you get some time before the office wakes up [12:12:56] uh [12:13:13] I'm about to have lunch [12:13:24] and I have to finish something afterwards [12:13:29] hopefully in an hour or so? [12:30:14] paravoid: yeah if we can squeeze it in during 13:xx somewhere that works. Probably I'll be away for an hour+ after that before I'm back again [12:33:18] https://www.bufferbloat.net/projects/codel/wiki/ says: "For servers with tcp-heavy workloads, particularly at 10GigE speeds, for queue management, we recomend sch_fq instead of fq_codel as of linux 3.12." [12:33:40] sch_fq is a bit safer under our scenario anyways, maybe worth looking at that first vs codel [12:41:09] 10Traffic, 06Operations, 05MW-1.28-release (WMF-deploy-2016-08-09_(1.28.0-wmf.14)), 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#3105821 (10BBlack) 05Open>03Resolved a:03BBlack This was resolved on the server side back in early Dec when the MW conf... [13:04:48] back on BBR: I think the big question is whether it's ok to just default to it and use it for DC-local traffic or not [13:05:10] if it is, then this is easy, we just change the global congestion control sysctl (at least on edge infra, if not globally) [13:06:41] if not, then we're looking at per-route congestion control using e.g. 
"ip route change 1.2.3.4/5 congctl bbr" [13:06:55] (or perhaps, setting bbr as default and using something else for dc-local routes) [13:07:15] right now we only have the local vlan and the default route, we'd have to add explicit routes for other dc-local vlans, etc. [13:08:17] and the congctl commands require a newer iproute package than jessie ships, which would make BBR config/deploy block on backporting that, which is already a task in: https://phabricator.wikimedia.org/T138591 [13:23:49] on the topic of the default; Stephen Hemminger (one of the core Linux network developers) mentions it doesn't do well in "some of the network corner cases": https://lwn.net/Articles/701463/ (but not sure how rare those cornercases are...) [13:26:12] hmmm [13:26:33] I think the only notable distro which made it default is the rasperry pi : https://github.com/raspberrypi/linux/issues/1784 [13:26:42] the google papers seem to cover the public-facing corner cases well (that it's amazing with e.g. 3rd-world internet access at high latency on lossy mobile networks) [13:27:05] so I can only imagine they're referring to the kinds of corner case you see at the other end of the spectrum (dc-local with near zero latency and loss) [13:27:28] also, I'm wondering how to contrast "corner cases where cubic performs poorly" to "corner cases where brr performs poorly"... [13:27:59] "Also it requires tc-fq" <- news to me, I had only seen that it "works better" with some kind of egress tc [13:29:26] someone should write a "how to BBR" FAQ :) [13:30:07] (with things that might read like: "You're going to need iproute congctl and set it only for public/WAN routes, and enable sch_fq on your interfaces too") [13:31:10] of course even if BBR does perform slightly worse for local traffic, and it's a PITA to set it up with the per-route stuff.... for our cache edge boxes it might still be worth turning it on as global default [13:31:23] (and accepting some efficiency loss on the cache-miss traffic in favor of optimizing our edge so much better) [13:36:05] yeah, that was my general thought as well (wrt default) [13:37:03] I hadn't see T138591 until now, I'll deal with that for 4.9, if we use 4.9 on stretch and jessie we should have the same tooling available [13:37:04] T138591: Backport iproute2 4.x from debian testing -> our jessie - https://phabricator.wikimedia.org/T138591 [13:37:21] 10Traffic, 06Operations: Backport iproute2 4.x from debian testing -> our jessie - https://phabricator.wikimedia.org/T138591#3105941 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:42:03] https://groups.google.com/forum/#!forum/bbr-dev [13:42:12] lots of threads there that might be informative [13:44:55] http://blog.cerowrt.org/post/birthday_problem/ is interesting too, as we're doing mq with about one queue per hardware cpu core [13:46:28] (but I think the banding he's talking about there, probably automatically becomes less an issue with the huge volume of flows (tcp conns) we have in our case) [13:54:25] ah, the Kconfig blob for TCP_CONG_BBR also mentions "It requires the fq ("Fair Queue") pacing packet scheduler." [13:57:09] bblack: I guess too late for you now? [13:58:04] and in tcp_bbr.c from Linux source: "NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing enabled, since pacing is integral to the BBR design and implementation." [13:58:04] paravoid: I've got a few mins [13:58:33] ok, same here :) [13:58:58] so where shall we begin? 
[13:59:21] _joe_ mentioned a few things the other day but I'm still not fully immersed into this conversation [14:00:01] paravoid: ok, so... [14:00:31] basically what it boils down to is that with discovery dns stuff being configured from hieradata->puppet->config-fragments for the configuration side (e.g. gdnsd plugin resources) [14:01:04] and then lines in zonefiles referencing that, it's hard to ever lint/validate as your change to add a new service or whatever is split over unmerged commits to 2x repos that need to match each other to validate [14:01:38] (you can deploy puppet first and then check DNS against it logically, but that might be when you learn the merged puppet change had a typo that was only apparently once you lint the ops/dns change against it) [14:02:14] and then we have CI complications in general, it's not really design for this (that e.g. an ops/dns lint check needs actualy templated-out production config files from latest puppet repo to do its job) [14:02:44] this all sorta loops back into what you were saying before anyways, about how do we automate this stuff better down the line with creating new discovery DNS records [14:03:45] the way I tried to do things (just updating the lint class to check using latest-deployed puppet stuff) still suffers from the latency of getting actual prod puppet merges agent-ized on CI hosts (which can be a long delay in some cases) [14:04:11] and still suffers from the "merge puppet", "lint dns, find +2'd earlier puppet change is the problem" scenario [14:04:34] yeah, ugh [14:04:39] and then on top of that, labs uses separate hieradata from prod, and it's the prod hieradata we want to base things on [14:05:05] my proposal, which you're the best person to shoot down with lots of arguments, is go ahead and move ops/dns content into ops/puppet/modules/authdns [14:05:11] (lol) [14:05:17] (on the first part) [14:05:27] but then still keep it as a two-stage process - puppet deplots the final-templated config+zone data to /srv/authdns/staging/ [14:05:38] authdns-update works from there to show final diff and deploy to /etc/gdnsd/ and such [14:06:08] hrm [14:06:29] it sounds crazy to me but I can't immediately explain why, perhaps I'm just too used to how we're doing things now [14:06:48] so maybe it's not so crazy :) [14:07:28] heh the irony [14:07:32] yeah it's one of those kinds of things. it might be a great idea, but it's hard to unglue your brain from existing practice and boil down which aspects of that were important [14:07:45] we're doing DNS discovery to un-tie stuff from puppet [14:07:56] and to do that, we're going to put all of our DNS into puppet :P [14:08:39] well, it's the real-world distinction that matters more than where things are stored. It's still nice to have one source of truth/templating. [14:08:45] nod [14:08:51] (the real-world distinction of apps abstracting through the DNS layer for this) [14:09:09] anyways, in the interest of "example code is worth 1000 words", I have a POC commit up for this [14:09:10] is it possible query hiera without compiling a catalog I wonder [14:09:17] possible to* [14:09:31] https://gerrit.wikimedia.org/r/#/c/342887/ [14:09:51] oh wow [14:09:58] it's not just to query hiera though - it's use hieradata to generate puppet/erb-templated config fragments based on full manifest eval [14:10:14] right [14:10:19] which part are we generating again? [14:10:49] gdnsd config file includes (fragments), which define e.g. 
plugin_geoip / plugin_metafo resources and the references to the statefiles that confd updates on etcd changes, etc [14:11:20] in the POC commit, I've already "fixed" the templating stuff too, jinja is replaced with equivalent erb stuff at puppet compile time [14:11:30] yeah I'm seeing that [14:11:33] scary stuff :P [14:11:57] <%= File.stat(@file).mtime.strftime('%Y%m%d%H') %> ; serial [14:11:57] brr [14:12:04] but I didn't fix authdns-local-update yet, to basically diff /srv/authdns/staging/ vs /etc/gdnsd/ for human review, and deploy by copying (well and run agent before) [14:12:23] serials are of dubious value anyways, but that's a close approximation to what we had there before [14:12:33] (it will be the mtime of the file on the puppetmaster for compilation, at 1h reso) [14:12:59] ("the file" being the template file under evaluation) [14:13:57] so basically to push a DNS config you'll need to ssh to one of the NSes, run puppet, run authdns-update? [14:14:24] I was thinking initially, just merge and run authdns-update, which will run agent for you on each NS before doing the rest [14:14:31] I think with cumin we can do better though [14:14:46] * volans at your service :) [14:14:59] that's going to be annoying, puppet can be slow [14:15:04] e.g. [agent run on all authdns hosts + output checksum of staging dir to ensure all same], [fetch diff from one for local human review], [copy into place on all] [14:15:16] also you'll race the agent by cron so sometimes it will just say that it's locked [14:15:24] puppet's been getting faster! [14:15:27] more importantly, it's kind of a runtime dependency [14:15:33] yes, it is [14:15:37] on the puppetmasters [14:15:46] and on puppet not being horribly broken in general [14:15:51] if the puppetmasters are de-synced, you'll sync different things (but something with a checksum could work) [14:16:13] also, if a puppetmaster dies, the first thing you'd do would be to remove the SRV record from DNS... [14:16:17] or we can adopt the model that you diff on the main host (wherever authdns-update was run) and sync that via ssh/rsync to all [14:16:17] and whoops :) [14:16:31] all things to mull :) [14:17:12] I have a lot more thoughts on that whole general topic of avoiding inter-dependencies (and I think, sometimes we try to do that but we probably still have them in reality anyways...) [14:17:17] but I've gotta run now for ~1-1.5h [14:17:29] mull and think and collect all the ways this is a horrible idea up :) [14:18:18] heh [14:18:24] yeah, there is a runtime dep on gerrit now too [14:18:57] but I remember coding a way to bypass that, although I don't remember it now, which kind of makes it a little moot :) [14:19:18] <_joe_> paravoid: sadly, we don't use those SRV records anymore [14:19:23] oh [14:19:29] <_joe_> because puppetlabs doesn't know how to cache DNS queries [14:19:33] then let's remove them? [14:19:51] anyway, besides the point [14:19:52] <_joe_> yes, I think alex left them there in hope they will fix their shit sooner or later [14:24:12] hrm [14:24:16] _joe_: what do you think? [14:24:45] also, even if we decide to go down that path, is it realistic to say that it will happen in the next 2 weeks? maybe we should find a more short-term solution first? [14:25:02] like replicating data to the other repository for example? 
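
(For reference, a rough sketch of the kind of config fragment and zonefile line under discussion; the resource name, addresses and record below are illustrative assumptions, not the actual production templates. The per-DC up/down state would come from the confd-managed state files mentioned above, omitted here.)

    # gdnsd config fragment (included under the plugins => { ... } stanza),
    # one metafo resource per discovery service:
    metafo => {
        resources => {
            disc-foo => {
                datacenters => [eqiad, codfw],
                dcmap => {
                    eqiad => 10.2.2.17,
                    codfw => 10.2.1.17
                }
            }
        }
    }

    ; and the matching zonefile line referencing that resource:
    foo    300    IN    DYNA    metafo!disc-foo
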
[14:31:14] <_joe_> yeah for now I was about to propose to manually create the files in ops/dns [14:32:29] <_joe_> so as far as cross-deps go, depending on puppet is not that horrible if you think puppet is pretty redundant now [14:33:20] <_joe_> but yeah there are circular deps with dns of course [14:33:57] <_joe_> so honestly for now I would be ok with just adding the metafo/geoip files for the records we're going to use in discovery by hand [17:51:19] re: circular deps - I was thinking about that the other day and honestly it would be difficult to even bootstrap now without running stuff [17:51:35] can't puppet the authdns servers without puppet, can't set up puppetmasters/clients without working authdns (and other such issues) [17:53:47] if we actually are comfortable moving down the "dns zonefiles in puppet repo" route, we could have it merged in a day or two if we really wanted [17:53:57] the question is whether that's a decent long-term idea or not at all [17:56:00] the downsides are the various scary concerns about inter-dependencies and whatever. the upside is it makes "single source of truth" sense. In the long run, hieradata/puppet drives everything, and the dual-repo barrier gets in the way. [17:56:14] and linting is just kind of the most-obvious fallout of that problem [18:01:02] we can hack anything to work of course, with enough duct tape [18:01:30] we can even make the patches I had before (just unreverts now) work, with the caveats of (1) Copy data from hieradata/discovery.yaml to hieradata/labs.yam [18:01:33] l [18:02:08] + (2) waiting either 10 minutes or up to 1 full day for CI slave hosts to catch up on the merged discovery puppet change before linting works for the corresponding ops/dns change (static vs dynamic CI slaves) [18:02:37] and then if the lint fails and the fault actually lay in the puppet commit rather than the authdns one under lint, you get to go back and fix puppet and try the above again. [18:03:14] that seems better than trying to copy the puppet-templated output files from one repo to another. [18:03:51] we could also hack the linter to just ignore the pattern output from gdnsd checkconf of lines matching "missing resource disc-foo" :) [18:04:32] but the road to hell is paved with lots of duct tape. I'd like to step back and look at the fundamentals of the problem and have an idea what the "right" solution is. [18:04:45] and if it's "put zonefiles in ops/puppet", I think we can still do that in reasonable time [18:05:52] <_joe_> ok [18:06:01] <_joe_> I am not sure I get all the implications of that move [18:06:22] <_joe_> (sorry, was in another convo) [18:07:45] well the first-order implication is dependencies. Before, so long as puppet worked at some point in the reasonable past (to deploy ssh keys and scripts), even if puppetmasters are dead, we can deploy DNS changes (providing ssh still works, and gerrit git repo still works, and python jinja package didn't get a breaking upgrade, etc, etc) [18:08:02] (and of course ssh still working between nodes depends on DNS itself too heh) [18:08:41] <_joe_> so, it's pretty easy to survive one puppetmaster backend failing (that heals automagically) [18:08:52] After that move, if puppetmaster isn't compiling stuff, we can't even do our normal deploy of a dns data change (e.g.
move a host->ip mapping in authdns) [18:08:53] <_joe_> and even a puppetmaster frontend failing [18:09:00] <_joe_> without touching dns [18:09:35] but also, in both scenarios (old and new), we always have the option of logging into all the authdns servers and editing zonefiles manually and doing a gdnsd reload, in a real emergency. [18:09:48] <_joe_> yes [18:09:52] what would be broken is the methods of automating that from a central and reviewed commit [18:10:13] the other downside faidon mentioned earlier is agent slowdown [18:10:33] authdns-update doesn't currently need to run agent anywhere, now we'd have to all on authdns before it can update the zone data to live [18:11:18] but puppet's been getting faster, and the catalogs on authdns are relatively-light [18:11:22] <_joe_> we can write a cumin procedure to handle that correctly, or rewrite authdns-update, even [18:11:30] I just timed wallclock "puppet agent -t" on radon and it's 26 seconds [18:12:17] plus CI slowdown... [18:12:27] which you can bypass of course [18:12:28] for comparison, the current authdns-update (which ssh's around serially and gets worse the more authdns servers we have) takes 14s to no-op [18:12:46] but cumin can parallelize either solution and speed it up [18:13:13] <_joe_> so the cumin task would be cumin -m sync 'R:class = role::authdns' 'puppet agent -t' 'authdns-update' [18:13:24] well, something like that [18:13:30] <_joe_> yes [18:13:43] <_joe_> that would first exec puppet on all nodes [18:14:01] <_joe_> then if successful run authdns-update or whatever [18:14:03] the other thing authdns-update does is give a visual diff (post all templating) for human confirm, like puppet-merge [18:14:08] <_joe_> on all nodes [18:14:49] in the current scheme there's two local scripts on all the authdns nodes [18:14:49] <_joe_> it's still significantly slower than now [18:15:23] authdns-local-update will pull latest git changes, diff compare for humans vs /etc/gdnsd/, and then confirm+deploy to /etc/gdnsd (but review/confirm can be skipped) [18:15:49] authdns-update manages doing the pull+compare on the host you invoke it on, then ssh's to other authdns nodes with a special account and syncs them up as well, after you confirm on the first one. [18:16:08] (and then there's a couple sub-scripts beneath local-update, but whatever) [18:16:23] <_joe_> ok that part would be handled by cumin [18:16:26] right [18:16:36] my feeling about this is that it circles back again on my "keeping AuthDNS super simple" which I think have widely different views on :) [18:16:37] I'm imagining if we wanted to keep something similar, cumin could do like: [18:17:12] [puppet agent all authdns; checksum outputs match], [show diff+confirm from 1x node's outputs], [do the actual update copy on all] [18:18:16] paravoid: that's kind of what I was waiting for, for you to circle back and say "see, this is why discovery should be a separate zone with a separate set of authdns servers" [18:18:36] heh [18:18:52] it doesn't change the risk profile much though. all inter-service stuff will depend on discovery-authdns. it still melts the world if it melts down. 
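
(The three-phase flow sketched above at 18:17 might look roughly like the following with plain ssh; cumin would just parallelize the same steps. Hostnames, paths and the use of sha256sum here are assumptions for illustration.)

    hosts="authdns1 authdns2 authdns3"   # placeholder list of authdns servers

    # 1) run the agent everywhere so puppet (re)templates /srv/authdns/staging,
    #    then checksum the staged tree on each host -- all outputs must match
    for h in $hosts; do
        ssh "$h" 'sudo puppet agent -t >/dev/null 2>&1;
                  find /srv/authdns/staging -type f -print0 | sort -z |
                  xargs -0 sha256sum | sha256sum'
    done

    # 2) human review of the templated output, fetched from any one node
    ssh authdns1 'diff -ru /etc/gdnsd /srv/authdns/staging'

    # 3) after confirmation, copy the staged tree into place everywhere and reload
    #    (a config change, as opposed to zone data, would need a full restart)
    for h in $hosts; do
        ssh "$h" 'sudo rsync -a --delete /srv/authdns/staging/ /etc/gdnsd/ &&
                  sudo gdnsd checkconf && sudo gdnsd reload-zones'
    done
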
[18:19:33] (and we get to authdns updating infrastructure in 2x different ways for 2x sets of servers, for double the complexity fun) [18:21:24] another way to think of it, re: keeping authdns separate and simple, vs single source of truth [18:22:26] imagine puppet does all the complex templating, and then all of the DNS config+zonefiles gets deployed by puppet to a "DNS Staging Server" named dnsstage01. And then authdns-update infra works much like today and is very indepedent and simple, but the initial rsync is from the DNS Staging Server. [18:22:50] or alternatively, as above but replace "DNS Staging Server" with "DNS Staging Git Repo" that puppet commits the templated outputs to. [18:23:24] what we're doing here really isn't that different from that, but it avoids a central resource (git repo or stage server) for better or worse [18:23:44] <_joe_> bblack: ok I have to say: after we moved the apache config to puppet, we had definitive advantages but also some pain points, but those were mostly tied to the fact that we have to run puppet on 400 hosts [18:23:44] and stages the output into /srv/ of all authdns, for "local" scripts and/or authdns-update inter-node sshing to copy data from. [18:24:44] (and we still always have the option of local edits, or local edits + rsync) [18:25:07] * volans reading backlog [18:25:24] (in that scenario that will happen probably once per 5-10 years on average, which is probably less frequently than we re-architect the whole thing anyways) [18:25:24] <_joe_> having puppet compile its output and commit it to a repo locally is a bit awkward, tbh [18:25:58] _joe_: the point of the comparison was to compare how comfortable you'd feel with authdns-update's separation/simplicity in that scenario [18:26:09] <_joe_> but has the advantage of keeping things loosely coupled while still retaining a source-of-truth link [18:26:11] because I think in practice the actual plan isn't much different on that level, maybe better [18:27:24] <_joe_> bblack: so in this scenario, you do the dns commit in puppet, do puppet-merge on a puppetmaster, ssh to the dns staging server, run puppet, then run authdns-update [18:27:27] <_joe_> right? [18:27:58] (and authdns-update would normally then sync data to all the real authdns servers after reviewing it, yeah, something like that) [18:28:12] I think that makes the separation/safety clear, but it's not a plan I'd actually want. [18:28:52] <_joe_> I don't dislike the workflow, I just dislike the idea of implementing this with puppet doing git commits, heh [18:28:53] the path in the POC commit is like a better version of that. no central staging server/repo. can still edit+rsync (or just edit) if puppet infra is dead. [18:29:37] <_joe_> ok [18:29:44] what I'm saying is the dnsstage01 or dns-staging.git variant is easier to swallow mentally, but the one we have in the POC commit that just puppets straight to /srv/authdns/staging on all boxes is functionally identical on most fronts and better on some [18:29:56] just harder to mentally swallow the change [18:29:59] <_joe_> yeah [18:30:21] so long as we have some scripts/infra still in place to make offline edits simpler [18:30:37] (e.g. 
"authdns-update" has a flag to rsync local changes to other nodes while puppet's dead) [18:30:46] <_joe_> oh [18:30:52] <_joe_> I wasn't aware [18:30:58] it doesn't, but it easily could [18:31:14] it already has the accounts, keys, etc to ssh between authdns nodes and rsync data around [18:31:45] the current model uses a git repo instead [18:32:26] <_joe_> I would prefer to go the "cumin -m sync" way, tbh [18:32:33] the default workflow of the current authdns-update it to pull from gerrit to a local repo clone, do the human diff/confirm, then ssh to all the other authdns nodes and have them pull git updates over ssh from the starting node's git repo and apply without review [18:32:38] <_joe_> but that takes more time [18:32:44] yes, cumin can do all things better in the long run, but that's later [18:32:45] <_joe_> we can do that later [18:33:10] so with the current model, if all other infrastructure was dead but you could still ssh into authdns nodes and needed to fix a record... [18:33:14] <_joe_> once volans and I have generalized what we're doing for the switchover, I guess [18:33:24] <_joe_> yes [18:33:32] you'd log into one, manually edit the local git clone and commit, and invoke authdns-update without the pull from gerrit to spread that around [18:33:49] bblack: cumin can work also without a working puppetdb backend, has the direct backed exactly for that purpose [18:33:53] with the puppet-derived POC commit model, you could make the scripts have a similar capability using rsync instead of git pulls, over the same accounts/keys [18:34:21] <_joe_> yup [18:34:46] (and to me, that seems like sufficient backup capability if all hell breaks loose and we need to update a record during) [18:35:21] (and the upsides of being able to use the full capabilities of puppet+erb+hieradata in all our dns config/zones is a big win long term even aside from the immediate problem) [18:35:28] <_joe_> (also, there is the slight issue of tying dns updates with cumin, of course; what if all cumin masters are down or partitioned away?) [18:35:29] what about doing this "merge" and visual confirmation on neodymium/sarin and then use cumin to ssh on the authdns hosts and pull the changes? [18:36:07] _joe_: cumin would just be parallelizing/automating the execution of scripts deployed locally on the authdns nodes. we can always ssh in and do it ourselves slowly/serially if cumin was dead too. [18:36:16] <_joe_> yes [18:36:33] <_joe_> bblack: well we could add some intelligence to the script that uses cumin as a library [18:36:44] <_joe_> sorry, dinner calls [18:36:47] eh I don't see why [18:36:49] <_joe_> bbl (maybe) [18:37:01] I like my tools separate in the oft-maligned unix philosophy way [18:37:28] there's no need for every script in the world to know about cumin. cumin just needs to be able to orchestrate all the well-behaved scripts in the world [18:37:39] this tool could also bes able to be run as a fallback from our laptops using our keys [18:37:48] right [18:37:57] it does that today, that's how we deploy DNS now :) [18:38:28] I can ssh (without even a bastion) from laptop+yubikey to each authdns server and command them to update themselves (if they can reach git), or update each other if they can reach each other. [18:38:57] actually I take that back, I bet since we put ferm on I can't ssh directly anymore [18:39:09] lol [18:39:13] (maybe we should fix that in this case!) 
[18:39:28] but anyways [18:40:09] anyways, I've convinced myself puppeted zonefiles are the way forward [18:40:19] but convincing myself is always easier than it should be! [18:40:48] this is the kind of change that probably needs some significant votes from others [18:41:34] [but since the POC commit, I think the initial commit should go ahead and add the offline-rsync mode stuff to authdns-update to be sure the backup plan is in place from the get-go] [19:07:35] 10Traffic, 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#3107532 (10fgiunchedi) [19:24:33] <_joe_> bblack: seems sane to me [23:17:54] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, and 2 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#2929493 (10Ejegg) Wouldn't Vary: Cookie explode caching all over the place due to things like the last visit date?