[13:02:43] Traffic, Operations, Pybal, monitoring, Patch-For-Review: pybal: add prometheus metrics - https://phabricator.wikimedia.org/T171710#3588053 (ema) >>! In T171710#3584226, @faidon wrote: > I know a bunch of work happened during the Wikimania hackathon, but what's the status of this? Most of th...
[13:10:46] bblack: yeah, longer keep in upload should reduce the expiries indeed
[13:16:14] ema: sorry I ended up typing that stuff across two channels, I donno if you lost the half in -ops that went with it heh
[13:16:50] in any case, I pushed https://gerrit.wikimedia.org/r/#/c/364605/ as a live experiment. once we get 24h past its merge, we should see if it has a positive effect on mailbox lag
[13:17:03] bblack: oh yeah, I was /away so I got both notifications conveniently together :)
[13:17:15] (for the first 24h we're still processing the normal rate of keep expiries)
[13:17:47] so I had another semi-related thought last night / this morning
[13:17:57] we should probably keep an eye on _upload 304s and the amount of traffic coming in from swift!
[13:18:13] for the most part, it's eqiad that suffers the mailbox lag issue
[13:18:45] we know codfw/esams/eqiad all have roughly the same storage (close enough anyways)
[13:18:54] they all have the same disks, codfw has 10 nodes, eqiad has 11, esams has 12
[13:19:17] ulsfo is/was smaller, but the new deploy when done will bring it up to esams storage size (6 nodes, but double disk size from before)
[13:20:04] anyways, I think the primary contributor to eqiad suffering mailbox lag worse than the others (and codfw doing the same during the last codfw transition, IIRC?), is the patterning
[13:20:34] esams backend caches mostly just see the esams users' patterns. ulsfo similar. codfw sees ulsfo+codfw patterns, but codfw's population is small, making it a strange edge case.
[13:20:44] eqiad sees the whole global pattern of misses, all together in one place
[13:21:24] right, and during the switchover we've noticed how codfw suffered a lot, which would confirm the patterning theory
[13:21:25] you can think of each locality having its own distribution of object hotness, based on local trends and languages and news outlets and social circles, etc
[13:21:41] if you had a cache per city everywhere, the patterns would be highly localized
[13:22:04] (and you'd need less storage per citycache to cover X% hitrate without causing a ton of storage churn)
[13:22:54] eqiad is at the opposite end of that spectrum - by being the ultimate backend for all the world's traffic, it has the whole global pattern to deal with, and thus needs more storage than any other site to cope with the pattern and achieve the necessary hitrate to avoid excess churn
[13:23:41] so, with that as the background theory...
[13:24:00] there are two reasons we do be->be fetches:
[13:24:34] 1) varnish-be has no outbound TLS to go direct securely (this is the primary hard reason, eventually solved by ATS transition)
[13:25:16] 2) be->be fetches lessen the impact of edge backends' warmup from empty, since some of their misses are contained in the be caches at the core DCs they'd pass through on the way to the origin anyways
[13:26:01] well, there's also a minor third point:
[13:27:22] 3) be->be fetches can also be used to pre-warm core backends if they've been wiped. e.g. if we depooled all user traffic from eqiad and wiped all its backend caches (and let's say, even routed everything through codfw backends for a while)... we can ease the transition back to the normal state by re-plumbing the live edge caches back through eqiad again, letting it fill in some storage from
[13:27:28] their misses before we put direct users on eqiad frontends again
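(A minimal illustration of what a be->be fetch looks like at the VCL level, assuming Varnish 4 syntax; the backend name, host, and port below are hypothetical placeholders, not WMF's actual configuration.)

```vcl
vcl 4.0;

# Hypothetical next hop: in an edge DC, a varnish-be's "origin" is another
# varnish-be in a core DC rather than the applayer, because varnish-be has
# no outbound TLS to reach the origin directly (reason 1 above).
backend be_core {
    .host = "be-core.example.internal";
    .port = "3128";
}

sub vcl_backend_fetch {
    # Every local backend miss becomes a be->be fetch to the core DC, so the
    # core DC's storage absorbs this site's miss pattern on top of its own.
    set bereq.backend = be_core;
}
```

With nothing else in place, the core backend treats such a fetch like any other miss and stores the object, which is how the merged global pattern ends up in eqiad's storage.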
[13:28:44] point 1 is just about transport protocol issues, nothing to do with actual caching issues. points 2+3 are insurance that we get some "help" against temporary situations.
[13:29:01] none of this matters to caching 99% of the time under normal conditions
[13:29:25] (except in the negative sense of causing the huge global pattern mixing in eqiad's storage)
[13:30:51] so, a straw proposal (maybe we can refine it better?): we can preserve all the benefits of 1+2, and only lose 3, if we have all be->be fetches set some header (X-Foo), and when the receiving be sees the "this is a be->be fetch" header, it operates a little differently for caching purposes
[13:31:17] same request transforms, but the cache behavior becomes "if this is in my cache, serve a hit, if not, pass through but don't store it in my cache"
[13:31:37] so that the remote traffic can take advantage of existing hits, but not disturb the remote cache's patterns
[13:31:48] so if it's cached serve a hit, hfp otherwise?
[13:31:54] not even hfp, just p
[13:31:58] ok
[13:33:07] perhaps hfp would work too for mbox lag purposes as they end up in transient storage
[13:33:28] yeah but really, there should be little reason to worry about hfp in this case, right?
[13:33:35] indeed
[13:33:59] theoretically you're not likely to get that object request again any time soon, the remote site that ultimately requested it has now cached it
[13:34:23] this mechanism should be transitive, too, which adds a little extra trickiness to the VCL
[13:34:53] e.g. you might have a be fetch chain of ulsfo->codfw->eqiad, where ulsfo has a natural local-be miss, codfw doesn't have the object and passes, but eqiad does and hits.
[13:35:17] (codfw still doesn't store the object in this case, only ulsfo does)
[13:36:12] I think this idea is probably a net win on churn->mboxlag, and probably a net win for eqiad-be's hitrate in general for its local frontends' traffic
[13:36:25] the loss of point (3) above is the negative though, and needs some thinking.
[13:41:30] one semi-hacky way is to handle (3) semi-manually
[13:41:52] backend nuked objects rate in eqiad is much higher than, say, esams, and that would arguably decrease by changing cache patterns as per your proposal
[13:42:08] compare eg https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=20&fullscreen&orgId=1&var-server=cp1074&var-datasource=eqiad%20prometheus%2Fops
[13:42:19] and https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=20&fullscreen&orgId=1&var-server=cp3038&var-datasource=esams%20prometheus%2Fops
[13:42:46] have a vcl config flag called something like "backend_warming" or something. when we find ourselves in situation 3, we set the backend_warming flag on eqiad/upload via puppet (ugh, but rare and not time-critical?), which pushes a VCL diff that turns off the pass-mode handling for received be->be fetches in eqiad, letting it cache them until we turn the flag back off.
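(A sketch of how the straw proposal might look in Varnish 4 VCL. X-Foo is the placeholder header name from the discussion; X-Foo-Recv, the backend name, and the host are hypothetical, and the pass-vs-hfp choice is left as the TTL knob debated above. Illustrative only, not the change that was actually deployed.)

```vcl
vcl 4.0;

backend be_next {
    # Next hop in the chain: another cache backend or, in the core DC, the applayer.
    .host = "be-next.example.internal";
    .port = "3128";
}

sub vcl_recv {
    # Latch whether the request we *received* was itself a be->be fetch, before
    # vcl_backend_fetch re-tags our own outbound hop below; this is the "extra
    # trickiness" needed to keep the mechanism transitive without accidentally
    # applying pass-mode to our own frontends' misses.
    if (req.http.X-Foo) {
        set req.http.X-Foo-Recv = "1";
    }
}

sub vcl_backend_fetch {
    set bereq.backend = be_next;
    # Tag our own outbound fetch as be->be (only appropriate when the next hop
    # really is another cache backend; assumed true in this sketch).
    set bereq.http.X-Foo = "1";
}

sub vcl_backend_response {
    if (bereq.http.X-Foo-Recv) {
        # This fetch was triggered by another cache backend: existing hits are
        # still served normally, but the missed object is not inserted into this
        # backend's main storage. uncacheable with a zero TTL is effectively a
        # plain pass; a short nonzero TTL would give hit-for-pass via transient
        # storage instead (either is acceptable per the discussion above).
        set beresp.uncacheable = true;
        set beresp.ttl = 0s;
    }
}
```

In the ulsfo->codfw->eqiad chain example, ulsfo's natural miss gets tagged, codfw sees the tag and passes without storing while re-tagging its own fetch to eqiad, and eqiad serves a hit or passes likewise; only ulsfo stores the object. The backend_warming escape hatch could then be a puppet-controlled flag that simply omits (or gates) the vcl_backend_response branch in the VCL generated for the DC being re-warmed.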
[13:48:02] (thanks for the verbose explanation people, really instructive)
[14:45:06] Traffic, Analytics, Operations: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie - https://phabricator.wikimedia.org/T174640#3568769 (BBlack) The model's a bit different in the wikimedia.org case, I'm not even sure there's a rational answer here. Can w...
[14:59:04] Traffic, Analytics, Operations: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie - https://phabricator.wikimedia.org/T174640#3568769 (Nuria) I think here we should not think of global unique devices for wikimedia.org domains and rather use just per-doma...
[14:59:29] Traffic, Analytics, Operations: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie - https://phabricator.wikimedia.org/T174640#3568769 (JAllemandou) Thanks @BBlack for the detailed explanations :) As for using the full `host` header value for wikimedia.or...
[15:21:04] Traffic, DBA, Operations, Patch-For-Review: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3588640 (Dzahn)
[16:39:44] HTTPS, Traffic, Operations, Research, and 6 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276#3588918 (Nemo_bis) >>! In T87276#3560795, @DarTar wrote: > has been in effect and confirmed by various partners Wikim...
[20:14:58] Traffic, Operations: Removing support for AES128-SHA TLS cipher - https://phabricator.wikimedia.org/T147202#3589720 (BBlack) I've re-done some of the sampled/informal `AES128-SHA` analysis from before, since it hasn't been done in about a year, and the past results were never recorded in detail. This is...
[21:27:41] bblack, moar BlueCoat fun: https://web.archive.org/web/20170311013249/https://bugs.chromium.org/p/chromium/issues/detail?id=694593
[21:28:27] these guys literally delay TLS 1.3
[23:00:56] MaxSem, they marked it as private?