[04:54:58] 10Traffic, 06Operations, 07Availability, 07Wikimedia-Multiple-active-datacenters: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#2546816 (10aaron)
[07:14:06] 10Traffic, 10DBA, 06Operations, 06Performance-Team, and 2 others: Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#2546887 (10jcrespo)
[08:01:52] 10Traffic, 06Operations: varnish-backend crashed on cp1047 (maps) - https://phabricator.wikimedia.org/T142810#2547002 (10ema)
[08:03:27] 10Traffic, 06Operations: varnish-backend crashed on cp1047 (maps) - https://phabricator.wikimedia.org/T142810#2547014 (10ema) p:05Triage>03Normal
[08:11:58] 10Traffic, 06Operations: varnish-backend crashed on cp1047 (maps) - https://phabricator.wikimedia.org/T142810#2547042 (10ema)
[08:15:23] 10Traffic, 06Operations: varnish-backend crashed on cp1047 (maps) - https://phabricator.wikimedia.org/T142810#2547046 (10ema) Full panic log on cp1047:~ema/varnishd-backend-crash.log.
[09:54:55] 10Traffic, 06Operations: varnishd: Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417 - https://phabricator.wikimedia.org/T142810#2547256 (10ema)
[10:13:04] 10Traffic, 06Operations: varnishd: Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417 - https://phabricator.wikimedia.org/T142810#2547296 (10ema) This error is happening only on maps nodes. We never noticed because usually varnishd starts a new child process if one dies. However, yeste...
[11:06:52] 10Traffic, 06Operations: varnishd: Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417 - https://phabricator.wikimedia.org/T142810#2547422 (10ema) Steps to reproduce while varnish.service is running fine with two processes, assuming varnish-modules is installed: - apt-get remove varni...
[12:14:46] ema: on the ticket about maps/cp1047....
[12:15:39] the smp_oc_getobj() thing has been seen before with crashing backends. it means they crashed in such a spectacular (or fast, after startup) way that the storage is corrupt. So yeah, wiping persistent storage is the only recourse AFAIK
[12:16:12] the modules thing, seems like an upstream bug
[12:16:31] although I'm not sure what the desired behavior should be there
[12:16:56] bblack: well if the child cannot be started varnish is not gonna serve requests
[12:17:04] :)
[12:17:12] so probably exiting would be a good approach?
[12:17:18] maybe two bugs at this level then
[12:17:31] 1) If the child really can't be started, parent should die
[12:17:57] 2) But... the child totally should've been startable, "This is no longer the same file seen by the VCL-compiler." is a bad reason
[12:18:18] couldn't it just have recompiled the VCL for itself in this case?
[12:18:56] right, the function being called there is mgt_push_vcls_and_start
[12:19:17] which sounds like it really just pushes whatever compiled vcl it has compiled before
[12:19:23] yup
[12:19:46] you could argue that perhaps we should've been aware of this and pushed around a VCL reload after updating vmods
[12:19:53] perhaps it's better to do that though. What if the VCL files have changed in the meantime and don't compile?
[12:20:00] but I don't think it's documented well that it works like this, and it can reload for itself, too
[12:20:29] it always compiles VCL on master startup doesn't it?
[12:21:08] yeah when you start varnishd it always compiles it
[12:21:21] leaving the old compiled VCL in place if compilation fails might be a reasonable fallback plan, maybe
[12:21:36] but not even trying to compile it when it has detected that vmods have changed, that doesn't seem right
[12:21:44] didn't we have a "storage clean" command in an init script or so
[12:21:47] for this reason
[12:22:01] (the storage corruption)
[12:22:12] for the smp_oc_getobj() thing?
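The "wiping persistent storage is the only recourse" step mentioned above could be wrapped up as a small shell helper. This is only a sketch: the storage directory, the `varnish*` file glob, and the function name are assumptions for illustration, not the actual puppet/initscript layout.

```shell
#!/bin/sh
# Hypothetical "clean" action for corrupted varnish persistent storage.
# In real use you would stop varnishd first (e.g. systemctl stop varnish),
# remove the silo files, then start varnishd again so it recreates them empty.
clean_varnish_storage() {
    storage_dir="$1"
    # Glob and path are illustrative; the real silos lived under /srv/sd*/.
    rm -f "$storage_dir"/varnish*
}

# Demo against a throwaway directory instead of real silos:
demo=$(mktemp -d)
touch "$demo/varnish.main1" "$demo/varnish.main2"
clean_varnish_storage "$demo"
ls -A "$demo"   # prints nothing: both silo files are gone
```

The demo only touches a temp directory; pointed at a real storage path, the same function would implement the stop -> wipe -> start recovery discussed later in the log.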
[12:22:47] I don't think we've ever programmatically dealt with corrupted persistence
[12:23:06] so: 1) we should reload all v4 varnishes 2) find out what happens with the smp_oc_getobj thing because it does happen quite often and the child just gets restarted. But if storage is corrupted maybe it's not behaving properly? 3) report the mgt_push_vcls_and_start bug
[12:23:46] i'm sure we did it for squid coss
[12:23:49] for varnish, not sure
[12:23:57] we had /etc/init.d/squid clean
[12:24:10] which wrote nulls all over the coss block device
[12:24:36] in the varnish case, I guess an initscript "clean" action would stop -> rm -f /srv/sd*/varnish*; start
[12:24:42] yeah
[12:24:45] but it's rare we do that
[12:25:16] so varnish 4 does have persistence?
[12:25:22] yes
[12:25:27] but it's buggy? :)
[12:25:29] uh actually varnish reload doesn't work. It fails with the "vmod has changed" error
[12:25:50] mark: the bug here has nothing to do with persistence really, it's a different thing with vmods and VCL
[12:26:42] there's two different bugs wrapped up in this ticket: the fallout of replacing a vmod while varnishd is running, and varnishd's crashing on startup (or right after) with smp_oc_getobj() errors
[12:26:54] only the latter is persistence-related
[12:26:58] right
[12:27:02] and we've seen it on V3 before
[12:27:21] definitely
[12:27:29] it just means the persistence file is trash, which kinda makes sense if we had a repeat crash or a fast crash after startup, there was never time to seal a legit silo or whatever.
[12:35:01] bblack: ok so it looks like compiling the VCL is not enough to make varnish happy about the new vmod. An actual restart is necessary :(
[12:35:18] well it's a step in the right direction I guess
[12:35:26] before we had to ship a new varnish package to update a vmod
[12:35:33] now we just have to restart varnishd to update a vmod :P
[12:35:39] :)
[12:36:31] doesn't matter guys, the containers crowd has you covered
[12:38:04] "-1 needs more docker" should be a standard code review
[12:39:05] if we only had all our caches in docker instances, we could just replace the running image on every VCL update and never worry about reloads or restarts
[12:39:09] it makes so much sense now
[12:39:14] exactly
[12:40:02] ema: we should probably just move maps to -sfile for now anyways
[12:41:32] the way the storage stuff is factored (in puppet) is kinda backwards to do it elegantly, but it can use the same 4-line hack misc does for now
[12:41:46] it's kinda not worth fixing the abstraction if we expect them to all be sfile on v4 anyways
[12:42:05] I should re-open that textfile I was working on and finish up the argument -> phab
[12:42:08] why not go with persistence in v4 btw?
[12:42:50] mark: every time we have that conversation, I end up typing like 2 pages into IRC, and then there's one or two open questions to look into as well, etc
[12:43:02] I'm trying to persist the anti-persist argument to phab now :)
[12:43:03] i can wait for phab, was just curious :)
[12:43:18] i thought it wasn't in v4, but apparently was wrong
[12:43:34] it's there, but it was renamed from -spersistent to -sdeprecated_persistent :)
[12:43:47] oh ic
[12:43:49] mark: https://www.varnish-cache.org/docs/trunk/phk/persistent.html
[12:44:05] thanks
[12:44:23] there's a bunch of obvious anti-persistence arguments, based on the phk link above and it being deprecated, more-complicated, probably slower, etc...
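For reference, the storage choice being debated here comes down to varnishd's `-s` argument. The fragment below is a sketch only: the storage name, device path, and size are made-up examples, not the actual puppet-managed values.

```shell
# Deprecated persistent storage (what maps was running; the cache can
# survive a clean restart, at the cost of the silo-corruption failure
# modes discussed above):
#   varnishd ... -s main1=deprecated_persistent,/srv/sda3/varnish.main1,100G
#
# Plain file storage (the proposed -sfile switch; simpler and supported,
# but the cache is cold after any varnishd restart):
#   varnishd ... -s main1=file,/srv/sda3/varnish.main1,100G
```

The trade-off is exactly the one named later in the log: `-sfile` gives up the "survive crashes/restarts" upside in exchange for avoiding the deprecated, corruption-prone persistence code.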
[12:44:41] the hard part is arguing against the obvious persistence upsides about surviving crashes/restarts in various operational scenarios
[12:44:53] and it actually seems to crash quite often, see neodymium.eqiad.wmnet:~ema/varnish-backend-last-panic.log
[12:46:07] $ awk '/line/ {print $6}' varnish-backend-last-panic.log | sort | uniq -c
[12:46:10] 1 cache/cache_ban.c
[12:46:12] 2 cache/cache_busyobj.c
[12:46:15] 6 cache/cache_obj.c
[12:46:17] 56 storage/storage_persistent_silo.c
[12:46:20] heh
[12:46:38] I'm pretty sure the cache_ ones are due to persistence too
[12:46:53] likely, only maps hosts are affected
[12:51:57] bblack: I'm gonna prepare a patch for the switch to -sfile
[12:55:33] ok
[13:39:59] bblack: I'll do a rolling restart of maps varnish backends as soon as puppet is done doing its thing
[13:50:54] ok
[14:39:08] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2547945 (10BBlack)
[14:39:29] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2547959 (10BBlack)
[14:39:31] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2547958 (10BBlack)
[14:39:59] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2168822 (10BBlack)
[14:40:01] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2547945 (10BBlack)
[14:40:14] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2547945 (10BBlack)
[14:40:16] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2168822 (10BBlack)
[14:47:00] ^ I still get confused by parent/child task setup in phab sometimes heh
[14:51:35] +1
[15:40:52] lol, I managed to crash the varnishd backend child on varnish3 (live on cp1065), by executing a related varnishadm command that should be readonly for information :P
[15:50:24] bblack: that's crazy
[15:53:01] FTR, there's a hidden command "debug.persistent". It works in its base form, but don't do e.g. "debug.persistent main1 dump"
[15:53:36] I'm trying to get to the bottom of a documentation mystery relevant to the persistence thing
[15:54:17] varnish docs for persistent say:
[15:54:20] Varnish will split the file into logical *silos* and write to the
[15:54:20] silos in the manner of a circular buffer. Only one silo will be kept
[15:54:20] open at any given point in time. Full silos are *sealed*. When Varnish
[15:54:23] starts after a shutdown it will discard the content of any silo that
[15:54:25] isn't sealed.
[15:55:08] which makes it sound like one persistent "file" contains multiple "silos", and that the sealing thing (discard most-recent open thing on startup) applies to silos. Thus if a persistent file has, say, 10 silos inside it, you only lose 10% on restart.
[15:55:31] but it appears in the code that one file == one silo, and a silo is split up into "segments"
[15:55:59] so either the docs are wrong in mentioning silos, and mean to say that within each silo (file), segments are sealed with one open for writing, etc...
[15:56:00] maybe the docs should say "segments" instead of silos
[15:56:23] or the segments split is just for free-space management and such, and it's dropping a whole file on restart (out of the two files/silos we have)
[15:57:02] the code should make it obvious what's happening, but I've been staring at it for like 30 minutes now. Varnish code is notoriously opaque :P
[15:57:40] debug.persistent output definitely confirms that one file == one silo, and one file/silo contains many "segments"
[15:59:43] I'm leaning towards the docs being at fault and that that paragraph should s/silo/segment/ then s/file/silo/
[16:00:04] that's an easy patch to make
[16:00:18] yeah
[16:00:21] and hopefully one that will save the next sysadmin many headaches
[16:01:13] 'seal' is also not a verb in the code, so mostly the hard part is mapping the doc's "seal" verb to some actual code that operates on either a silo or a segment
[16:01:34] (there are 'close' operations at both levels)
[16:50:44] 10Traffic, 06Operations: Stop using persistent storage in our backend varnish layers. - https://phabricator.wikimedia.org/T142848#2548194 (10BBlack)
[16:58:30] 10Traffic, 06Operations: Stop using persistent storage in our backend varnish layers. - https://phabricator.wikimedia.org/T142848#2548382 (10BBlack)
[16:59:05] 10Traffic, 06Operations: Stop using persistent storage in our backend varnish layers. - https://phabricator.wikimedia.org/T142848#2548194 (10BBlack) (Added two forgotten items (xkey + purge/ban) in the downsides list)
[18:47:55] back on the strange Chrome/41-on-Windows UA
[18:48:15] almost all of the responses are 301 (request for / -> /Main_Page) or 304 Not Modified when they ask for Main_Page
[18:48:29] it really is like the browser is stuck in a self-redirect/reload loop
[22:16:42] is it a single user?
[22:16:49] maybe it is a spoofed UA
[22:17:28] I remember a similar story in the daily wtf
[22:18:23] http://thedailywtf.com/articles/The-Most-Favoritest-Icon
[22:38:08] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2547945 (10faidon) That would add at least a round-trip latency on every true miss that hits eqiad/esams/ulsfo (new or just purged page), won't it?
[22:39:47] Platonides: no, it's many users all over the globe. The common factor is Chrome/41 on Windows 7-10
[22:40:07] weird
[22:40:29] they just request that?
[22:43:29] Platonides: https://phabricator.wikimedia.org/T141786 is where I started documenting it.
[22:44:11] Earlier on I assumed the requests were "normal" and their TLS profile had regressed. Now I think it's possible this UA probably always had a bad TLS profile and the new behavior is just the spam of repeated queries.
[22:44:17] it's possible it's some combination of both
[22:45:53] in a recent log (today), I sampled 5 minutes of traffic on a single eqiad text frontend, only from Chrome/41 UAs. virtually all the requests were for / (301) and /Main_Page (304). There were usually many requests per client IP in the sample, up to as many as ~600 (which would mean the /->Main_Page cycle once a second)
[22:46:16] it's baffling, and clearly something buggy is happening
[22:46:43] they're not fake UAs from a bot, they have too many legit organic characteristics, like session/login/centralnotice/geoip cookies, etc...
[22:47:20] either somehow a Windows Update to a system library bugged out these old Chrome/41 installs, or something we changed on our end did it.
[22:47:35] it's certainly not that they just all installed Chrome/41 in the past month
[22:48:21] I dug around a lot today on our end, looking for some kind of regression in our HTML output or JS code that could have maybe tickled a Chrome/41 bug and caused it to spam-reload, but I didn't come up with anything solid.
[22:52:13] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2549720 (10BBlack) On a true miss, yes, it would add a codfw<->eqiad round-trip. That's ~35ms though, which may be hard for MW to beat on average. True miss should be rare though, except whe...
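The per-client-IP counting described above can be reproduced with a short pipeline. The log below is a synthetic stand-in (real sampling would come from varnishncsa output on a cache frontend), so the field layout is an assumption made up for the demo.

```shell
# Synthetic sample standing in for 5 minutes of Chrome/41 frontend traffic;
# the format here is simply "client_ip method url status". IPs are from the
# documentation ranges, not real clients.
cat > /tmp/chrome41-sample.log <<'EOF'
203.0.113.5 GET / 301
203.0.113.5 GET /Main_Page 304
203.0.113.5 GET / 301
198.51.100.7 GET /Main_Page 304
EOF

# Requests per client IP, busiest first; a looping client would float to
# the top with a count in the hundreds in a real 5-minute sample.
awk '{print $1}' /tmp/chrome41-sample.log | sort | uniq -c | sort -rn
```

The same triage shape as the earlier panic-log `awk | sort | uniq -c` one-liner: pick the interesting field, then count and rank occurrences.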
[22:55:59] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2549747 (10BBlack) Oh, re-reading your question, you mentioned specific DCs. In the current layout where only eqiad is "primary", the side-checks from eqiad to codfw would only happen for req...
[23:05:16] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2549781 (10BBlack) Of course, if MW can beat an eqiad<->codfw trip for the same page... we could look at other ways to structure this so it doesn't kick in all the time. Perhaps trigger it on...
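The "Sideways Only-If-Cached" idea in the ticket builds on the standard `only-if-cached` request directive from RFC 7234: a receiving cache must satisfy such a request from cache or answer 504 (Gateway Timeout) rather than fetch from the origin. A sketch of what the sideways probe from one core DC's backend to the other might look like on the wire (the URL, host, and headers here are illustrative examples, not the actual implementation):

```shell
# Raw form of the hypothetical sideways probe a codfw cache backend could
# send to its eqiad peer on a miss: hit = serve it, 504 = go to the applayer.
printf 'GET /wiki/Main_Page HTTP/1.1\r\nHost: en.wikipedia.org\r\nCache-Control: only-if-cached\r\nConnection: close\r\n\r\n'
```

As the phab comments note, the cost of this probe on a true miss is one extra codfw<->eqiad round-trip (~35ms), which is the trade-off being weighed.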