[06:22:29] (I started another session of vk in tmux on cp4006)
[07:49:19] elukey: hi!
[07:49:34] o/
[07:50:53] elukey: any news regarding the data consistency checks?
[07:51:21] I've noticed that there are a few "Log acquired" messages in the logs for all scripts depending on varnishlog.py
[07:51:55] and they seem to be triggered more frequently when the CPU usage of varnishd increases, somehow
[07:52:31] working on finding a repro, from what I can see there are some dt:"-" that could be due to timeouts
[07:52:47] buuut still not super sure, I'd need a bit more time :)
[07:54:29] varnishkafka logs are a bit weird, but mostly kafka disconnects related
[08:03:04] "VSL":"store overflow"}
[08:03:28] so this is the other side of the timeout...
[08:04:06] we keep a maximum of 1000 records pending Req End, and when we exceed that the oldest one is flushed
[08:04:42] raising the maximum timeout causes more pending VSL records to be kept
[08:04:51] so it should only be a matter of raising the value
[08:05:41] ok, let's try!
[08:07:57] this is the description in varnishlog
[08:07:57] Sets the upper limit of incomplete transactions kept before the oldest transaction is force completed. A warning record is synthesized when this happens. This setting keeps an upper bound on the memory usage of running queries. Defaults to 1000 transactions.
[08:09:38] we have the option in vk so we could tune it
[08:09:50] maybe trying with 5k first?
[08:41:16] elukey: sounds good
[08:41:52] https://gerrit.wikimedia.org/r/#/c/307246
[08:42:06] ema: --^
[08:42:11] I think that 5k should be good enough
[08:42:14] and safe
[08:50:15] elukey: +1
[08:52:07] merging!
[09:03:05] manually ran puppet on 4005 and checked on stat1002 kafka events for the host, all good
[09:03:36] nice
[09:05:26] varnishkafka-webrequest disconnections seem to happen roughly every five minutes
[09:06:17] 5 or 10 minutes judging from the logs
[09:06:34] Aug 29 08:30:47
[09:06:35] with kafka you mean?
[09:06:35] Aug 29 08:40:47
[09:06:38] yep
[09:06:40] ah okok
[09:06:44] yeah really weird
[09:06:49] Aug 29 08:50:47
[09:06:54] Aug 29 08:55:47
[09:07:20] only on upload or even in misc/maps?
[09:07:30] I'm looking at cp4005 only
[09:08:23] so I checked cp3009 and same thing
[09:08:32] seems to have started Aug 8th
[09:09:50] 18:30 ottomata: restarting kafka broker on kafka1013 to test eventlogging leader rebalances
[09:09:57] that's from SAL on Aug 8th
[09:10:01] could it be related?
[09:10:26] yeah I was checking
[09:10:32] so theoretically no
[09:10:59] because he was working on upgrading EL's kafka client
[09:11:12] it happened again 10 minutes after restarting vk:
[09:11:12] but it is worth investigating
[09:11:16] Aug 29 09:00:22 cp4005 varnishkafka[611]: TERM: Received signal 15: terminating
[09:11:27] Aug 29 09:10:23 cp4005 varnishkafka[21640]: KAFKAERR: Kafka error (-195): kafka1018.eqiad.wmnet:9092/18: Receive failed: Disconnected
[09:11:56] never a joy with vk
[09:12:51] anyways, this has nothing to do with the data consistency checks, right?
[09:13:13] nono
[09:13:18] Aug 16 16:02:31 cp4006 varnishkafka[1297]: %3|1471363351.921|FAIL|varnishkafka#producer-1| kafka1020.eqiad.wmnet:9092/bootstrap: Failed to connect to broker at [kafka1020.eqiad.wmnet]:9092: Network is unreachable
[09:13:23] this is also weird
[09:13:37] another thing to bear in mind is that we haven't upgraded librdkafka yet
[09:14:11] IIRC paravoid mentioned that the new version should be much better (according to the author's comments)
[09:14:39] maybe better handling of kafka 0.9 too?
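[editor's note: a conceptual Python sketch of the "store overflow" / incomplete-transaction limit discussed above, not varnishkafka's actual C implementation. The idea is that at most `limit` transactions are kept pending their Req End record; on overflow the oldest is force-completed with a synthetic warning, which is what the -L option (default 1000, raised to 5000 in the gerrit change) bounds.]
```
from collections import OrderedDict

class PendingTxStore:
    """Toy model of a bounded store of incomplete VSL transactions."""

    def __init__(self, limit=1000):
        self.limit = limit
        self.pending = OrderedDict()  # transaction id -> accumulated VSL records

    def add_record(self, tx_id, record):
        self.pending.setdefault(tx_id, []).append(record)
        if len(self.pending) > self.limit:
            # Force-complete the oldest incomplete transaction ("store overflow").
            old_id, records = self.pending.popitem(last=False)
            self.flush(old_id, records, warning="store overflow")

    def complete(self, tx_id):
        # Normal path: the transaction's Req End arrived, flush it whole.
        self.flush(tx_id, self.pending.pop(tx_id, []), warning=None)

    def flush(self, tx_id, records, warning):
        print(tx_id, len(records), warning or "ok")
```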
[09:15:57] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2590240 (10Jhernandez) > Worst case we could simply skip sort-normalization in v3 for any query-string that contains brackets. That sou...
[09:17:52] ema: anyhow, the TERM should be systemd right?
[09:18:07] Aug 29 09:02:04 cp4006 systemd[1]: Stopping VarnishKafka webrequest...
[09:18:10] Aug 29 09:02:04 cp4006 varnishkafka[29415]: TERM: Received signal 15: terminating
[09:18:13] Aug 29 09:02:05 cp4006 systemd[1]: Starting VarnishKafka webrequest...
[09:18:42] I am still not clear on why
[09:25:11] 10Traffic, 06Operations: varnishkafka frequently disconnects from kafka servers - https://phabricator.wikimedia.org/T144158#2590248 (10ema)
[09:26:43] elukey: yes the term signal is systemd. Probably that specific restart was due to the config change
[09:27:19] 10Traffic, 06Operations: varnishkafka frequently disconnects from kafka servers - https://phabricator.wikimedia.org/T144158#2590260 (10ema) p:05Triage>03Normal
[09:30:53] ema: yeah but from journalctl I can see it happening regularly
[09:31:01] at least, I am checking on cp3009
[09:34:13] so I re-checked timings on cp300[89]
[09:34:30] the disconnects started on Aug 08 at around 18:30 UTC
[09:34:32] on both
[09:34:36] 18:30 ottomata: restarting kafka broker on kafka1013 to test eventlogging leader rebalances
[09:34:41] and this is from the sal
[09:35:18] elukey: yeah, see https://phabricator.wikimedia.org/T144158 :)
[09:36:01] good :)
[09:36:01] on all the hosts I've checked the disconnects start exactly at 18:30:27
[09:36:46] another thing worth noticing is that it tries to connect with both ipv6 and ipv4 addresses and receives (what appears to be) resets
[09:37:39] mmm checking puppet changes for that time
[09:41:47] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=32&fullscreen
[09:42:22] it may be due to brokers not acting as leaders
[09:43:01] there is one broker per topic partition that serves the data, the other ones are only in-sync replicas
[09:43:28] when you restart a kafka broker it leaves the duty to one of the replicas via leader election
[09:43:52] but then you need to explicitly get it back to the pool of leaders that can be elected
[09:44:08] rebalancing the topic partition leaders
[09:44:21] so this might have caused unbalanced traffic
[09:44:26] *might*
[09:46:29] cp4007 upgraded, looks good
[09:46:49] 10Traffic, 06Operations: varnishkafka frequently disconnects from kafka servers - https://phabricator.wikimedia.org/T144158#2590303 (10elukey) Two kafka brokers seem to handle less traffic from the same date: {F4412361} Maybe it is only a matter of rebalancing the kafka topic partition leaders?
[09:52:11] sorry the weird pattern happened on the 18th, not the 8th
[09:52:18] but it should be fixed anyway
[09:52:22] keep investigating
[09:52:25] I'll also ask Andrew
[10:15:11] 10Traffic, 06Operations: varnishkafka frequently disconnects from kafka servers - https://phabricator.wikimedia.org/T144158#2590327 (10elukey) There is also another interesting thing to notice: * from Aug 8th at ~18:30 UTC these messages were logged for various brokers: ``` ... Aug 08 18:30:28 cp3008 varnishk...
[11:22:17] any chance the kafka disconnects for rebalancing are what's causing the minor random grafana data spikes we see in some graphs?
[11:25:27] bblack: hi :) do you have some examples?
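[editor's note: a small illustrative Python sketch of the partition-leader point made above ([09:43]-[09:44]). Each Kafka partition has a replica list whose first entry is the preferred leader; after a broker restart, leadership stays on another in-sync replica until a preferred-replica election moves it back. The topic/broker data below is made up for illustration, not the actual cluster state.]
```
partitions = {
    # (topic, partition) -> (replica list in preference order, current leader)
    ("webrequest_upload", 0): ([1013, 1018, 1012], 1018),
    ("webrequest_upload", 1): ([1012, 1013, 1020], 1012),
    ("webrequest_upload", 2): ([1018, 1020, 1013], 1020),
}

def needs_rebalance(partitions):
    """Partitions whose current leader is not the preferred (first) replica."""
    return [p for p, (replicas, leader) in partitions.items() if leader != replicas[0]]

# Partitions returned here are still led by a non-preferred replica; a
# preferred-replica leader election would move them back and even out traffic.
print(needs_rebalance(partitions))
```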
[11:26:07] I don't think I have enough context for this one
[11:27:30] anyhow, I think that the problem of the disconnects is resolved ema
[11:28:09] I noticed that the disconnects were only for kafka1012/kafka1018 recently
[11:28:23] namely the ones not acting as leaders
[11:28:39] the disconnects stopped on all the hosts that I've looked into since this morning, when I rebalanced
[11:30:38] 10Traffic, 06Operations: varnishkafka frequently disconnects from kafka servers - https://phabricator.wikimedia.org/T144158#2590539 (10elukey) Most of the errors seemed to be related to kafka1012 and kafka1018, the ones that were not acting as leaders. Rebalancing the brokers helped and the disconnects stopped...
[11:32:14] elukey: https://phabricator.wikimedia.org/T101141#2567647
[11:40:07] thanks!
[11:44:37] bblack: cp4007 upgraded, looks good. We can now set upload ulsfo as v4 in hiera and proceed with upgrading cp401[3-5] if you agree (with puppet disabled on them during the process of course)
[12:01:14] ema: sounds good to me :)
[12:01:52] now that we're halfway through, it's better to finish up for vslp efficiency relatively-quickly unless there's a strong case we'll need to roll back
[12:02:42] codfw gets relatively-light traffic anyways, so I imagine the chash-vs-vslp thing in codfw's backends won't be a big deal if it persists a few days before we move on from ulsfo
[12:02:53] bblack: mmh, I just noticed that we have a few 503s in upload ulsfo though
[12:03:06] see cp4013.ulsfo.wmnet:~ema/varnishlog.log
[12:03:41] 18 FetchError c straight insufficient bytes
[12:03:42] I see the spikes in the graphs too
[12:03:42] ?!?
[12:04:12] there were two spikes of it earlier today, then a smaller one back on the 24th
[12:04:18] notice that cp4013 is still on v3
[12:04:18] perhaps they only happen during the transition process?
[12:04:46] yeah but, could be 4013 fe -> 400[567] be?
[12:04:51] indeed
[12:04:56] 18 ObjHeader c X-Cache-Int: cp1048 hit/1, cp2011 hit/1, cp4005 miss
[12:06:25] looking at the varnish code
[12:06:39] it looks like this means the backend gives insufficient bytes in a chunked response, like the error code says
[12:06:59] it would likely also happen naturally if a chunked response gets disconnected mid-transfer
[12:08:02] mmh we seem to have encountered a similar problem before in T62003
[12:08:03] T62003: HTTP 503 error when requesting linked data for large entities - https://phabricator.wikimedia.org/T62003
[12:09:32] in that ticket, the ultimate problem was the backend giving incorrect content-length
[12:09:42] (it was gzipping the output, but sending the ungzipped content-length header)
[12:10:54] (and by "the backend" above I mean the applayer)
[12:11:14] interestingly the 503 spikes started today at 7:30ish, which is when I upgraded cp4007
[12:11:38] right
[12:11:47] do the 2x spikes correspond to fe + be pooling?
[12:12:52] oh wait, I've done the upgrade at 9:28 really
[12:13:01] (got confused by UTC vs. CEST)
[12:13:31] the one on the 24th was around the same time of day as the first one today
[12:14:32] SAL says 09:44 for cp4007 pooling
[12:15:14] correct, at 9:28 I started the upgrade procedure and repooled at 9:44
[12:15:42] so the spike at 7:30 is likely unrelated
[12:16:09] when did depool happen?
[12:16:32] shortly after the SAL entry at 9:28
[12:16:36] hmmm
[12:16:53] so, cp2011 was a hit
[12:17:03] 4005 missed and was presumably filling in cache
[12:18:50] trying to capture other 503s on cp400[5-7] backends
[12:19:14] the few I got looking at cp4013 frontend all came from cp4005 judging from X-Cache-Int
[12:20:32] hmmm
[12:20:52] (and they were all misses)
[12:20:53] the absolute rate of them isn't huge anyways, but it's surprising how long the rate-spike is sustained in those spikes
[12:21:05] (for any kind of transitional event)
[12:21:12] yep, it's also interesting how they didn't happen during the last few days
[12:21:53] yeah there could be some human interaction
[12:22:02] (with us logging into it and running certain commands)
[12:22:57] do you have a raw varnishlog of a 503 from 4013 that went through 4005 with that code?
[12:23:44] bblack: not raw, just the whole varnishlog text output
[12:23:48] I'm trying to grab one now, but this all assumes any low-rate 503 we see now is similar in nature to the spiky ones
[12:23:52] well yeah that
[12:24:08] cp4013.ulsfo.wmnet:~ema/varnishlog.log
[12:25:12] interesting
[12:25:23] you can see in that log entry it tries twice as expected
[12:25:35] (we already do a single retry of 503 on frontends, in case of transient error)
[12:25:42] gets the same problem on both backend fetch attempts
[12:26:23] also, no CL on the response from 4005 seems odd
[12:30:08] I've confirmed with the same url (same jpeg file as your logged error) that the file is usually-ok
[12:30:30] I can pull it through cp4013 successfully right now, and it missed via 4005 etc like the failure
[12:30:40] was able to get a successful hit directly on 4005-be right after as well
[12:30:50] mmmh
[12:30:56] < Content-Length: 29371
[12:30:59] in both cases too
[12:31:35] I haven't managed to log any 503s on the v4 backends in the meantime
[12:31:38] so in any case, I don't think it's data-specific or even specific to something wrong with cp2011's cached copy or anything
[12:32:18] it may be that there's some subtle interop problem between v3 and v4 in some chunked encoding edge-case, which only turns up when v3 fetches from v4?
[12:32:28] but not persistently for any given file
[12:33:00] it could be, in that case finishing the ulsfo upgrade would solve the issue
[12:33:11] it's a small file, so it's not a case where v3 is bytes-triggered to turn on non-default do_stream
[12:34:51] yeah
[12:35:08] and ensuring we work from the outside edge inwards in terms of DC tiers, which was already the plan due to vslp-vs-chash
[12:35:33] if the problem is only with v3 fetching from v4, finishing ulsfo should make them go away, it's a good test
[12:37:11] bblack: https://gerrit.wikimedia.org/r/#/c/307282/ :)
[12:43:30] 10Traffic, 10Varnish, 10Beta-Cluster-Infrastructure, 06Operations, and 2 others: On beta cluster varnish stats process points to production statsd - https://phabricator.wikimedia.org/T116898#2590674 (10hashar) I have rebased the Puppet patch https://gerrit.wikimedia.org/r/#/c/249490/ that let us vary the...
[12:43:35] hello bblack and ema!
[12:43:51] hi :)
[12:44:08] got a lame patch for your radars!
[12:44:33] the bunch of python scripts that emit varnish metrics to statsd are all hardcoded to statsd.eqiad.wmnet and I got a patch that changes that to use a hiera() call
[12:45:03] on beta cluster, we have to send the metrics to the labs statsd server. There is no urgency, the puppet patch is cherry picked on labs already :D
[12:46:07] 10Traffic, 10Varnish, 10Beta-Cluster-Infrastructure, 06Operations, and 2 others: On beta cluster varnish stats process points to production statsd - https://phabricator.wikimedia.org/T116898#2590718 (10hashar) The patch has been cherry picked on beta cluster for quite a while already so it is essentially f...
[12:46:17] hashar: LGTM
[12:46:23] hopefully :/
[12:46:30] but I haven't thoroughly tested it besides on beta
[12:46:38] and not sure of what kind of crazy impact it might have
[12:46:47] well
[12:46:50] but then I am probably just too paranoid
[12:46:55] what defines the hieradata for the prod hosts?
[12:47:16] hieradata/common.yaml:statsd: statsd.eqiad.wmnet:8125
[12:47:21] oh not statsd_server, statsd, right
[12:47:36] I was thinking of the similar: hieradata/codfw/mediawiki/jobrunner.yaml:statsd_server: statsd.eqiad.wmnet:8125
[12:47:57] yeah that is slightly confusing :(
[12:48:52] and labs has: hieradata/labs.yaml:statsd: labmon1001.eqiad.wmnet:8125
[12:52:14] there, one less cherry-pick
[12:52:16] cp4013 upgraded and repooled
[12:52:57] waiting ~30 min. for its cache to fill up and then I'll carry on with cp4014 and cp4015
[12:53:34] I'm watching 4013 fe varnishlog for 503s manually now
[12:53:57] bblack: awesome :]
[12:56:18] I caught 1x so far, but it was a true 503 response from the backend, which was a v3 backend
[12:56:51] - RespHeader Age: 70
[12:56:55] ^ odd on a 503 ...
[12:57:15] uh
[12:57:21] but we always have a low rate of real 503s, networks not being perfect and whatnot
[12:57:38] - RespHeader X-Cache-Int: cp4014 miss
[12:57:48] (it was the 4014 v3 host that generated the 503)
[12:59:25] also interesting, but not entirely unexpected, is that those 503s are counted as misses
[12:59:36] we don't really look at status code when encoding or decoding X-Cache
[13:00:02] it would be nice to get back to trying to see a unique disposition like "err", which I tried before, but ran afoul of VCL limitations
[13:00:13] maybe post-v4 it will be simpler, while not trying to do it across versions
[13:00:49] (especially with the help of vmod_var)
[13:01:39] I can't wait for that refactor, it will simplify so many things if we don't have to think about which headers are real network headers in some direction vs internal book-keeping because we don't have variables in VCL
[13:01:50] yes vmod_var looks very useful indeed
[13:02:37] been logging for 10 minutes now and only that one 503
[13:02:52] (which is not related to this specific problem)
[13:03:01] and all is quiet on varnish-aggregate-client-status-codes
[13:04:28] looks like the perfect time for a coffee
[13:28:08] ema: on the v3->v4 fetch incompatibility front:
[13:28:14] https://www.varnish-cache.org/trac/ticket/1506
[13:28:23] and our old friend https://www.varnish-cache.org/trac/ticket/1627
[13:28:52] either or both of those could be related, and are probably effectively non-issues for v4->v4, but could cause strangeness for v3 fetch from v4, I think.
[13:29:07] that first link was fixed in 4.0 timeframe, but likely the bug existed in 3.x too
[13:29:15] (or some variant of it anyways)
[13:30:12] and I think #1627 is the one fixup in 3.0.7 we never backported to our 3.0.6 because it's incompatible/difficult to reconcile with our hacked "plus" code re gzip/streaming.
[13:30:26] ah
[13:31:05] and also, our v3 frontends are effectively using http/1.0 on both ends, whereas v4 uses http/1.1, which is also a factor in some of this
[13:31:27] (because nginx on v4 frontends chooses http/1.1 even though it's non-keepalive, whereas in v3 we still use http/1.0 reqs from nginx->varnish)
[13:33:16] assuming the 503s go away as we finish ulsfo, I think we just mark this down for things to remember when we do text next quarter
[13:33:38] to upgrade from the outside in (in DC terms), and expect some minor 503 issues for a half-converted DC until completion.
[13:33:51] bblack: so in practical terms this means we should try to minimize the amount of time we spend in a half v3/half v4 situation on a given cluster
[13:33:57] yes
[13:34:39] very well
[13:34:48] just killed my log on 4013 frontend, there was never a second 503 heh
[13:35:18] I think we can live with 1x 503 every 30 minutes, that's probably normal background error rate imperfection
[13:35:36] sounds good. I'll proceed with cp4014 now
[13:35:48] ok
[13:42:47] 07HTTPS, 10Traffic, 06Operations, 07Browser-Support-Internet-Explorer: Internet Explorer 6 can not reach https://*.wikipedia.org - https://phabricator.wikimedia.org/T143539#2590963 (10BBlack) I don't think it's just the SHA-1 problem. We also don't support SSLv3 (since back when the POODLE attack appeared...
[13:43:52] 07HTTPS, 10Traffic, 06Operations, 07Browser-Support-Internet-Explorer: Internet Explorer 6 can not reach https://*.wikipedia.org - https://phabricator.wikimedia.org/T143539#2590964 (10BBlack) (also, it's possible there's non-default settings on IE6/XP under some conditions / service-packs / patches that al...
[13:44:19] cp4014 upgraded and repooled
[13:48:43] 07HTTPS, 10Traffic, 06Operations, 07Browser-Support-Internet-Explorer: Internet Explorer 6 can not reach https://*.wikipedia.org - https://phabricator.wikimedia.org/T143539#2590972 (10BBlack) @Florian - On a related note, we've recently created https://wikitech.wikimedia.org/wiki/HTTPS:_Browser_Recommendat...
[13:54:09] the average frontend cache hit rate goes up quite fast :)
[13:56:09] 10Traffic, 10Varnish, 10Beta-Cluster-Infrastructure, 06Operations, and 2 others: On beta cluster varnish stats process points to production statsd - https://phabricator.wikimedia.org/T116898#2590998 (10hashar) 05Open>03Resolved Has been kindly reviewed and merged in by @BBlack. One less cherry pick!
[13:59:41] ema: yeah, coalescing really helps with that. I'm always surprised by how quickly caches recover for the hottest stuff
[14:00:41] for "normal" frontend restarts, I've pushed the limits a few times to see what we can sustain. My tightest experiment to date was all upload+text frontends restarted (without any backends restarting/wiping) in a 30 minute window
[14:01:13] you can see spikes/increases in hit-local (vs hit-front) during an event like that, but it comes together pretty smoothly and still recovers fine with no real user impact.
[14:02:41] nice
[14:03:23] for a frontend restart, after ~15 minutes we get close to the cache hit rate of a warm machine
[14:03:33] yeah
[14:04:03] but the shape of that 15 minute curve is weighted heavily at the front. even after just a few minutes, it's "close enough", the rest is just trailing less-hot stuff
[14:04:33] text recovers faster than upload of course, too
[14:04:50] text's dataset is small enough that everything remotely-hot commonly fits in frontends
[14:05:03] upload not so much, hence the persistently higher hit-local even under normal conditions
[14:05:53] another implication of that is that upload-frontend here is probably a pretty extreme edge case for varnishd
[14:06:20] in that it's a sizeable memory cache, and it can't possibly keep up with demand on hot objects... things are constantly nuking and recycling space from not-yet-expired objects.
[14:07:13] (luckily the local backends are essentially zero-latency away to back them with a fuller dataset)
[14:08:43] we could probably do better at tweaking that to work more efficiently in light of the dataset, either by knowing our own object size/hotness stats better or experimentation
[14:09:30] set a lower ceiling on the object size we're willing to cache at all in the frontends, until they don't turn over so fast and act more like the text model for smaller objects, and most of the hit-local handoff is the larger objects.
[14:10:16] we have that cutoff at 50MB today. I had experimented with it a bit, it was previously ~32MB for a long time.
[14:10:59] at some smaller value than 32MB, we'd probably see a lot less churn in the frontend cache contents, resulting in longer avg age values on front-hits and lower nuke rates.
[14:11:54] the experimental raise from 32->50 was to test the opposite theory: that the reason for relatively-low front-hit rate was that we were refusing to cache larger objects that could fit just fine
[14:12:12] since it had virtually zero impact on front-hit, it's the opposite problem: we have too many hot objects under the size threshold.
[14:12:25] perhaps some combination of object popularity and object size could help decide what to pass at the frontend level
[14:12:55] it's hard to predict popularity on a miss, though. except for the one-hit-wonder filtering idea.
[14:13:32] oh yeah that makes sense :)
[14:13:54] but also, this is an area that shows the weakness of a naive LRU in general
[14:14:16] ideally, you replace LRU with something a little fancier that can take hit-count into it, too
[14:15:29] (because if you ordered all cache objects at one instant in time by recent-use, and by hit-count, the lists would have some overall similarity statistically, but there are definitely cases where at any one point in time outliers are arranged very differently)
[14:16:09] a low-hit-count object that just happened to have a recent rare hit goes way up to the top of the recent-use list even though it's a better candidate than what's now at the bottom (very briefly)
[14:16:29] candidate for eviction I mean
[14:17:10] right
[14:17:43] predicting the popularity on a miss could perhaps be done querying the other frontends sidewise somehow, I was thinking
[14:17:52] most of the time, it still mostly works out with LRU, but there are inefficiencies to eliminate there
[14:18:24] another indicator of popularity-on-miss is whether it's a backend hit and what the hit-count is there, too.
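[editor's note: a toy Python illustration of the naive-LRU weakness described above, not Varnish's actual eviction code. With pure LRU, one stray hit pushes a cold object to the top of the recency list; a score that also weighs hit-count keeps such one-hit objects near the bottom of the eviction order. The scoring formula is made up for the example.]
```
import time

class TinyCache:
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.meta = {}  # key -> {"last": last-access time, "hits": hit count}

    def touch(self, key):
        now = time.monotonic()
        if key not in self.meta and len(self.meta) >= self.capacity:
            self.evict(now)
        entry = self.meta.setdefault(key, {"last": now, "hits": 0})
        entry["last"] = now
        entry["hits"] += 1

    def evict(self, now):
        # Pure LRU would evict min(last-access). Here a high hit-count protects a
        # hot object even when a one-hit object happens to have a more recent access.
        victim = min(self.meta, key=lambda k: self.meta[k]["hits"] /
                     (1.0 + now - self.meta[k]["last"]))
        del self.meta[victim]
```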
[14:18:34] for cache<->cache anyways
[14:19:22] cp4014 looks fine, I'll upgrade cp4015 and we're done with ulsfo upload
[14:19:37] there's probably a way to do one-hit-wonder filtering in a very complex way based on that in pure VCL with no bloom dataset, that kinda works most of the time
[14:20:26] on miss in any cache that's not backend-most (fetching from another cache), if you get a miss when fetching from the other cache, don't cache it here on this one fetch. it will get cached beneath though, so you'll cache it if you see it again quickly.
[14:20:42] based on parsing X-Cache-Int
[14:21:30] it would also make X-Cache less-mind-boggling, since you wouldn't see patterns like "hit, hit, miss, hit". hits would always be backed by deeper hits.
[14:22:53] I'm not sure what the right way is to express "don't cache this just for this one fetch" in VCL across v3 and v4 anyways, though. probably obj.ttl = 0s + beresp.uncacheable / hit_for_pass.
[14:23:36] we don't want to create an hfp object that's visible outside this one request, we just want to avoid nuking any other object to store anything about this one fetched object in the present.
[14:25:02] the downside is it would slightly slow recovery curves on empty caches that we were talking about at the beginning of this
[14:25:37] but if VCL had access to varnishd uptime (it might already for all I know, or we can grab it from inline C?), we could disable one-hit-wonder filter until uptime > 15m or something.
[14:26:02] varnishstat does show uptime info
[14:26:28] so yeah perhaps there is some way to export it in a vmod or something
[14:26:51] hopefully child uptime rather than parent, too :)
[14:26:58] both :)
[14:27:47] since the one-hit-wonder problem matters more in frontends than backends, we could experiment with all the above as a frontend-only thing.
[14:28:09] something to do in our infinite free time we have with this total lack of backlogged tasks and experiments :)
[14:28:45] just checked invalid timestamps registered for webrequest and the have dropped right after the vk patch this morning
[14:28:57] *they have dropped
[14:29:07] so the -L 5000 value seems to be working correctly
[14:29:23] will keep checking during the migration
[14:29:33] elukey: nice!
[14:29:39] cp4015 repooled, ulsfo upload is now v4-only
[14:30:52] \o/
[14:31:21] so, comparing the miss-based one-hit-wonder idea to bloom filters, what comes to mind is this:
[14:32:22] 1. for the frontends, the miss-based one actually sounds more-robust than the bloom filter, because we get the same basic savings, but the miss-based one is effectively like a bloom filter that's shared across frontends (they won't consider it a one-hit-wonder if a separate frontend already saw that object)
[14:32:44] 2. for the intermediate backends, given chashing, the effect is about the same either way on the above
[14:33:04] 3. for backends that talk directly to applayer, of course we don't get the miss-filter at all, you'd need bloom filter there to do it at all.
[14:39:05] so yeah, maybe we look at miss-based one-hit-wonder for frontends with an uptime hack, and see how that plays out. and leave the backends alone until/if we look at proper bloom filters. one-hit-wonder is just less a problem there anyways, given massive chashed storage.
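[editor's note: a conceptual Python sketch of the miss-based one-hit-wonder filter plus uptime warm-up described above. The real thing would live in VCL (vcl_backend_response with beresp.uncacheable / hit_for_pass, as discussed); the header parsing and the 15-minute threshold here are illustrative only.]
```
def should_cache_here(x_cache_int, child_uptime_seconds, is_backend_most=False):
    """Frontend admission decision for one fetched object.

    x_cache_int: X-Cache-Int header from the cache we fetched through,
                 e.g. "cp1048 hit/1, cp2011 hit/1, cp4005 miss".
    """
    if is_backend_most:
        return True   # no deeper cache to consult; would need a bloom/cuckoo filter
    if child_uptime_seconds < 15 * 60:
        return True   # warm-up: don't slow the recovery curve right after a restart
    # If the nearest (last-listed) cache also missed, nothing else has seen this
    # object recently: treat it as a one-hit-wonder and skip caching it on this fetch.
    nearest = x_cache_int.split(",")[-1].strip()
    return not nearest.endswith("miss")

print(should_cache_here("cp1048 hit/1, cp2011 hit/1, cp4005 miss", 3600))    # False
print(should_cache_here("cp1048 hit/1, cp2011 hit/1, cp4005 hit/7", 3600))   # True
```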
[14:46:53] I'm making a ticket for this since we keep discussing it with no tracking
[14:47:12] now I can't remember the PDF link to that whitepaper that I first heard of the idea in (bloom for one-hit-wonder in a shared cache)
[14:47:28] it was some general paper on cache optimizations at some large site or CDN, where that was just one of the major bullet points
[14:51:28] Algorithmic Nuggets in Content Delivery?
[14:51:41] ema: also you had another link about cuckoo-based bloom too
[14:52:38] yeah that was the right one for the one-hit-wonder bloom idea, thanks :)
[14:53:27] and the other one is Cuckoo Filter: Practically Better Than Bloom
[14:54:08] bblack: ^
[15:03:30] 10Traffic, 06Operations: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2591156 (10BBlack)
[15:03:37] thanks!
[15:04:32] 10Traffic, 06Operations: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2591183 (10BBlack)
[15:40:48] 10Traffic, 06Operations, 13Patch-For-Review: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181#2591288 (10BBlack) https://gerrit.wikimedia.org/r/#/c/306935/ probably should've linked here. This is a sort of temporary measure to start bugging users to upgrade...
[15:52:03] 10Traffic, 10MediaWiki-extensions-CentralNotice, 06Operations: Varnish-triggered CN campaign about browser security - https://phabricator.wikimedia.org/T144194#2591355 (10BBlack)
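[editor's note: a minimal Python sketch of the bloom-filter admission idea referenced above (the "cache only on the second request" trick from the Algorithmic Nuggets in Content Delivery discussion), the option for backend-most caches where the miss-based filter doesn't apply. Sizes and hash count are illustrative; a real deployment would rotate or age the filter, which is where cuckoo filters (which support deletion) come in.]
```
import hashlib

class BloomAdmission:
    def __init__(self, bits=1 << 20, hashes=4):
        self.bits = bits
        self.hashes = hashes
        self.bitset = bytearray(bits // 8)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.blake2b(f"{key}#{i}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def seen_before(self, key):
        """True if key was (probably) requested before; records the key either way."""
        positions = list(self._positions(key))
        hit = all(self.bitset[p // 8] & (1 << (p % 8)) for p in positions)
        for p in positions:
            self.bitset[p // 8] |= 1 << (p % 8)
        return hit

admit = BloomAdmission()
print(admit.seen_before("/wikipedia/commons/x/xy/Example.jpg"))  # False -> don't cache yet
print(admit.seen_before("/wikipedia/commons/x/xy/Example.jpg"))  # True  -> admit to cache
```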