[04:54:58] 10Traffic, 06Operations, 07Availability, 07Wikimedia-Multiple-active-datacenters: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#2546816 (10aaron)
[07:14:06] 10Traffic, 10DBA, 06Operations, 06Performance-Team, and 2 others: Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#2546887 (10jcrespo)
[08:01:52] 10Traffic, 06Operations: varnish-backend crashed on cp1047 (maps) - https://phabricator.wikimedia.org/T142810#2547002 (10ema)
[08:03:27] 10Traffic, 06Operations: varnish-backend crashed on cp1047 (maps) - https://phabricator.wikimedia.org/T142810#2547014 (10ema) p:05Triage>03Normal
[08:11:58] 10Traffic, 06Operations: varnish-backend crashed on cp1047 (maps) - https://phabricator.wikimedia.org/T142810#2547042 (10ema)
[08:15:23] 10Traffic, 06Operations: varnish-backend crashed on cp1047 (maps) - https://phabricator.wikimedia.org/T142810#2547046 (10ema) Full panic log on cp1047:~ema/varnishd-backend-crash.log.
[09:54:55] 10Traffic, 06Operations: varnishd: Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417 - https://phabricator.wikimedia.org/T142810#2547256 (10ema)
[10:13:04] 10Traffic, 06Operations: varnishd: Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417 - https://phabricator.wikimedia.org/T142810#2547296 (10ema) This error is happening only on maps nodes. We never noticed because usually varnishd starts a new child process if one dies. However, yeste...
[11:06:52] 10Traffic, 06Operations: varnishd: Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417 - https://phabricator.wikimedia.org/T142810#2547422 (10ema) Steps to reproduce while varnish.service is running fine with two processes, assuming varnish-modules is installed: - apt-get remove varni...
[12:14:46] ema: on the ticket about maps/cp1047....
[12:15:39] the smp_oc_getobj() thing has been seen before with crashing backends. it means they crashed in such a spectacular (or fast, after startup) way that the storage is corrupt. So yeah, wiping persistent storage is the only recourse AFAIK
[12:16:12] the modules thing, seems like an upstream bug
[12:16:31] although I'm not sure what the desired behavior should be there
[12:16:56] bblack: well if the child cannot be started varnish is not gonna serve requests
[12:17:04] :)
[12:17:12] so probably exiting would be a good approach?
[12:17:18] maybe two bugs at this level then
[12:17:31] 1) If the child really can't be started, parent should die
[12:17:57] 2) But... the child totally should've been startable, "This is no longer the same file seen by the VCL-compiler." is a bad reason
[12:18:18] couldn't it just have recompiled the VCL for itself in this case?
[12:18:56] right, the function being called there is mgt_push_vcls_and_start
[12:19:17] which sounds like it really just pushes whatever compiled vcl it has compiled before
[12:19:23] yup
[12:19:46] you could argue that perhaps we should've been aware of this and pushed around a VCL reload after updating vmods
[12:19:53] perhaps it's better to do that though. What if the VCL files have changed in the meantime and don't compile?
[12:20:00] but I don't think it's documented well that it works like this, and it can reload for itself, too
[12:20:29] it always compiles VCL on master startup doesn't it?
[12:21:08] yeah when you start varnishd it always compiles it
[12:21:21] leaving the old compiled VCL in place if compilation fails might be a reasonable fallback plan, maybe
[12:21:36] but not even trying to compile it when it has detected that vmods have changed, that doesn't seem right
[12:21:44] didn't we have a "storage clean" command in an init script or so
[12:21:47] for this reason
[12:22:01] (the storage corruption)
[12:22:12] for the smp_oc_getobj() thing?
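The "wiping persistent storage is the only recourse" step mentioned above could be wrapped up as a small shell helper. This is only a sketch: the storage directory, the `varnish*` file glob, and the function name are assumptions for illustration, not the actual puppet/initscript layout.

```shell
#!/bin/sh
# Hypothetical "clean" action for corrupted varnish persistent storage.
# In real use you would stop varnishd first (e.g. systemctl stop varnish),
# remove the silo files, then start varnishd again so it recreates them empty.
clean_varnish_storage() {
    storage_dir="$1"
    # Glob and path are illustrative; the real silos lived under /srv/sd*/.
    rm -f "$storage_dir"/varnish*
}

# Demo against a throwaway directory instead of real silos:
demo=$(mktemp -d)
touch "$demo/varnish.main1" "$demo/varnish.main2"
clean_varnish_storage "$demo"
ls -A "$demo"   # prints nothing: both silo files are gone
```

The demo only touches a temp directory; pointed at a real storage path, the same function would implement the stop -> wipe -> start recovery discussed later in the log.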
[12:22:47] I don't think we've ever programmatically dealt with corrupted persistence
[12:23:06] so: 1) we should reload all v4 varnishes 2) find out what happens with the smp_oc_getobj thing because it does happen quite often and the child just gets restarted. But if storage is corrupted maybe it's not behaving properly? 3) report the mgt_push_vcls_and_start bug
[12:23:46] i'm sure we did it for squid coss
[12:23:49] for varnish, not sure
[12:23:57] we had /etc/init.d/squid clean
[12:24:10] which wrote nulls all over the coss block device
[12:24:36] in the varnish case, I guess an initscript "clean" action would stop -> rm -f /srv/sd*/varnish*; start
[12:24:42] yeah
[12:24:45] but it's rare we do that
[12:25:16] so varnish 4 does have persistence?
[12:25:22] yes
[12:25:27] but it's buggy? :)
[12:25:29] uh actually varnish reload doesn't work. It fails with the "vmod has changed" error
[12:25:50] mark: the bug here has nothing to do with persistence really, it's a different thing with vmods and VCL
[12:26:42] there's two different bugs wrapped up in this ticket: the fallout of replacing a vmod while varnishd is running, and varnishd's crashing on startup (or right after) with smp_oc_getobj() errors
[12:26:54] only the latter is persistence-related
[12:26:58] right
[12:27:02] and we've seen it on V3 before
[12:27:21] definitely
[12:27:29] it just means the persistence file is trash, which kinda makes sense if we had a repeat crash or a fast crash after startup, there was never time to seal a legit silo or whatever.
[12:35:01] bblack: ok so it looks like compiling the VCL is not enough to make varnish happy about the new vmod. An actual restart is necessary :(
[12:35:18] well it's a step in the right direction I guess
[12:35:26] before we had to ship a new varnish package to update a vmod
[12:35:33] now we just have to restart varnishd to update a vmod :P
[12:35:39] :)
[12:36:31] doesn't matter guys, the containers crowd has you covered
[12:38:04] "-1 needs more docker" should be a standard code review
[12:39:05] if we only had all our caches in docker instances, we could just replace the running image on every VCL update and never worry about reloads or restarts
[12:39:09] it makes so much sense now
[12:39:14] exactly
[12:40:02] ema: we should probably just move maps to -sfile for now anyways
[12:41:32] the way the storage stuff is factored (in puppet) is kinda backwards to do it elegantly, but it can use the same 4-line hack misc does for now
[12:41:46] it's kinda not worth fixing the abstraction if we expect them to all be sfile on v4 anyways
[12:42:05] I should re-open that textfile I was working on and finish up the argument -> phab
[12:42:08] why not go with persistence in v4 btw?
[12:42:50] mark: every time we have that conversation, I end up typing like 2 pages into IRC, and then there's one or two open questions to look into as well, etc
[12:43:02] I'm trying to persist the anti-persist argument to phab now :)
[12:43:03] i can wait for phab, was just curious :)
[12:43:18] i thought it wasn't in v4, but apparently was wrong
[12:43:34] it's there, but it was renamed from -spersistent to -sdeprecated_persistent :)
[12:43:47] oh ic
[12:43:49] mark: https://www.varnish-cache.org/docs/trunk/phk/persistent.html
[12:44:05] thanks
[12:44:23] there's a bunch of obvious anti-persistence arguments, based on the phk link above and it being deprecated, more-complicated, probably slower, etc...
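For reference, the storage choice being debated here comes down to varnishd's `-s` argument. The fragment below is a sketch only: the storage name, device path, and size are made-up examples, not the actual puppet-managed values.

```shell
# Deprecated persistent storage (what maps was running; the cache can
# survive a clean restart, at the cost of the silo-corruption failure
# modes discussed above):
#   varnishd ... -s main1=deprecated_persistent,/srv/sda3/varnish.main1,100G
#
# Plain file storage (the proposed -sfile switch; simpler and supported,
# but the cache is cold after any varnishd restart):
#   varnishd ... -s main1=file,/srv/sda3/varnish.main1,100G
```

The trade-off is exactly the one named later in the log: `-sfile` gives up the "survive crashes/restarts" upside in exchange for avoiding the deprecated, corruption-prone persistence code.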
[12:44:41] the hard part is arguing against the obvious persistence upsides about surviving crashes/restarts in various operational scenarios
[12:44:53] and it actually seems to crash quite often, see neodymium.eqiad.wmnet:~ema/varnish-backend-last-panic.log
[12:46:07] $ awk '/line/ {print $6}' varnish-backend-last-panic.log | sort | uniq -c
[12:46:10] 1 cache/cache_ban.c
[12:46:12] 2 cache/cache_busyobj.c
[12:46:15] 6 cache/cache_obj.c
[12:46:17] 56 storage/storage_persistent_silo.c
[12:46:20] heh
[12:46:38] I'm pretty sure the cache_ ones are due to persistence too
[12:46:53] likely, only maps hosts are affected
[12:51:57] bblack: I'm gonna prepare a patch for the switch to -sfile
[12:55:33] ok
[13:39:59] bblack: I'll do a rolling restart of maps varnish backends as soon as puppet is done doing its thing
[13:50:54] ok
[14:39:08] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2547945 (10BBlack)
[14:39:29] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2547959 (10BBlack)
[14:39:31] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2547958 (10BBlack)
[14:39:59] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2168822 (10BBlack)
[14:40:01] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2547945 (10BBlack)
[14:40:14] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2547945 (10BBlack)
[14:40:16] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2168822 (10BBlack)
[14:47:00] ^ I still get confused by parent/child task setup in phab sometimes heh
[14:51:35] +1
[15:40:52] lol, I managed to crash the varnishd backend child on varnish3 (live on cp1065), by executing a related varnishadm command that should be readonly for information :P
[15:50:24] bblack: that's crazy
[15:53:01] FTR, there's a hidden command "debug.persistent". It works in its base form, but don't do e.g. "debug.persistent main1 dump"
[15:53:36] I'm trying to get to the bottom of a documentation mystery relevant to the persistence thing
[15:54:17] varnish docs for persistent say:
[15:54:20] Varnish will split the file into logical *silos* and write to the
[15:54:20] silos in the manner of a circular buffer. Only one silo will be kept
[15:54:20] open at any given point in time. Full silos are *sealed*. When Varnish
[15:54:23] starts after a shutdown it will discard the content of any silo that
[15:54:25] isn't sealed.
[15:55:08] which makes it sound like one persistent "file" contains multiple "silos", and that the sealing thing (discard most-recent open thing on startup) applies to silos. Thus if a persistent file has, say, 10 silos inside it, you only lose 10% on restart.
[15:55:31] but it appears in the code that one file == one silo, and a silo is split up into "segments"
[15:55:59] so either the docs are wrong in mentioning silos, and mean to say that within each silo (file), segments are sealed with one open for writing, etc...
[15:56:00] maybe the docs should say "segments" instead of silos
[15:56:23] or the segments split is just for free-space management and such, and it's dropping a whole file on restart (out of the two files/silos we have)
[15:57:02] the code should make it obvious what's happening, but I've been staring at it for like 30 minutes now. Varnish code is notoriously opaque :P
[15:57:40] debug.persistent output definitely confirms that one file == one silo, and one file/silo contains many "segments"
[15:59:43] I'm leaning towards the docs being at fault and that that paragraph should s/silo/segment/ then s/file/silo/
[16:00:04] that's an easy patch to make
[16:00:18] yeah
[16:00:21] and hopefully one that will save the next sysadmin many headaches
[16:01:13] 'seal' is also not a verb in the code, so mostly the hard part is mapping the doc's "seal" verb to some actual code that operates on either a silo or a segment
[16:01:34] (there are 'close' operations at both levels)
[16:50:44] 10Traffic, 06Operations: Stop using persistent storage in our backend varnish layers. - https://phabricator.wikimedia.org/T142848#2548194 (10BBlack)
[16:58:30] 10Traffic, 06Operations: Stop using persistent storage in our backend varnish layers. - https://phabricator.wikimedia.org/T142848#2548382 (10BBlack)
[16:59:05] 10Traffic, 06Operations: Stop using persistent storage in our backend varnish layers. - https://phabricator.wikimedia.org/T142848#2548194 (10BBlack) (Added two forgotten items (xkey + purge/ban) in the downsides list)
[18:47:55] back on the strange Chrome/41-on-Windows UA
[18:48:15] almost all of the responses are 301 (request for / -> /Main_Page) or 304 Not Modified when they ask for Main_Page
[18:48:29] it really is like the browser is stuck in a self-redirect/reload loop
[22:16:42] is it a single user?
[22:16:49] maybe it is a spoofed UA
[22:17:28] I remember a similar story in the daily wtf
[22:18:23] http://thedailywtf.com/articles/The-Most-Favoritest-Icon
[22:38:08] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2547945 (10faidon) That would add at least a round-trip latency on every true miss that hits eqiad/esams/ulsfo (new or just purged page), won't it?
[22:39:47] Platonides: no, it's many users all over the globe. The common factor is Chrome/41 on Windows 7-10
[22:40:07] weird
[22:40:29] they just request that?
[22:43:29] Platonides: https://phabricator.wikimedia.org/T141786 is where I started documenting it.
[22:44:11] Earlier on I assumed the requests were "normal" and their TLS profile had regressed. Now I think it's possible this UA probably always had a bad TLS profile and the new behavior is just the spam of repeated queries.
[22:44:17] it's possible it's some combination of both
[22:45:53] in a recent log (today), I sampled 5 minutes of traffic on a single eqiad text frontend, only from Chrome/41 UAs. virtually all the requests were for / (301) and /Main_Page (304). There were usually many requests per client IP in the sample, up to as many as ~600 (which would mean the /->Main_Page cycle once a second)
[22:46:16] it's baffling, and clearly something buggy is happening
[22:46:43] they're not fake UAs from a bot, they have too many legit organic characteristics, like session/login/centralnotice/geoip cookies, etc...
[22:47:20] either somehow a Windows Update to a system library bugged out these old Chrome/41 installs, or something we changed on our end did it.
[22:47:35] it's certainly not that they just all installed Chrome/41 in the past month
[22:48:21] I dug around a lot today on our end, looking for some kind of regression in our HTML output or JS code that could have maybe tickled a Chrome/41 bug and caused it to spam-reload, but I didn't come up with anything solid.
[22:52:13] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2549720 (10BBlack) On a true miss, yes, it would add a codfw<->eqiad round-trip. That's ~35ms though, which may be hard for MW to beat on average. True miss should be rare though, except whe...
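The per-client-IP counting described above can be reproduced with a short pipeline. The log below is a synthetic stand-in (real sampling would come from varnishncsa output on a cache frontend), so the field layout is an assumption made up for the demo.

```shell
# Synthetic sample standing in for 5 minutes of Chrome/41 frontend traffic;
# the format here is simply "client_ip method url status". IPs are from the
# documentation ranges, not real clients.
cat > /tmp/chrome41-sample.log <<'EOF'
203.0.113.5 GET / 301
203.0.113.5 GET /Main_Page 304
203.0.113.5 GET / 301
198.51.100.7 GET /Main_Page 304
EOF

# Requests per client IP, busiest first; a looping client would float to
# the top with a count in the hundreds in a real 5-minute sample.
awk '{print $1}' /tmp/chrome41-sample.log | sort | uniq -c | sort -rn
```

The same triage shape as the earlier panic-log `awk | sort | uniq -c` one-liner: pick the interesting field, then count and rank occurrences.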
[22:55:59] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2549747 (10BBlack) Oh, re-reading your question, you mentioned specific DCs. In the current layout where only eqiad is "primary", the side-checks from eqiad to codfw would only happen for req...
[23:05:16] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2549781 (10BBlack) Of course, if MW can beat an eqiad<->codfw trip for the same page... we could look at other ways to structure this so it doesn't kick in all the time. Perhaps trigger it on...
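The "Sideways Only-If-Cached" idea in the ticket builds on the standard `only-if-cached` request directive from RFC 7234: a receiving cache must satisfy such a request from cache or answer 504 (Gateway Timeout) rather than fetch from the origin. A sketch of what the sideways probe from one core DC's backend to the other might look like on the wire (the URL, host, and headers here are illustrative examples, not the actual implementation):

```shell
# Raw form of the hypothetical sideways probe a codfw cache backend could
# send to its eqiad peer on a miss: hit = serve it, 504 = go to the applayer.
printf 'GET /wiki/Main_Page HTTP/1.1\r\nHost: en.wikipedia.org\r\nCache-Control: only-if-cached\r\nConnection: close\r\n\r\n'
```

As the phab comments note, the cost of this probe on a true miss is one extra codfw<->eqiad round-trip (~35ms), which is the trade-off being weighed.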