[03:16:32] 10netops, 10Continuous-Integration-Infrastructure, 6Operations, 10Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2185883 (10mmodell) 5Open>3Resolved a:5mmodell>3chasemp Thanks @chasemp! It work... [03:16:48] 10netops, 10Continuous-Integration-Infrastructure, 6Operations, 10Phabricator, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2185887 (10mmodell) [10:13:39] 10Traffic, 6Analytics-Kanban, 6Operations: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2186374 (10elukey) @faidon do you prefer to upgrade the vk package anyway or the above solution would suffice for the moment? [10:22:13] 10Traffic, 6Analytics-Kanban, 6Operations: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2186378 (10faidon) The Varnishkafka package shipping an rsyslog config is a bug in itself, so we should queue a fix for it. As you've already handled the issue at hand via puppet, it's okay... [10:31:49] ema: ---^ we should patch vk's package before rolling it out $everywhere for the varnish4 upgrade [10:36:39] sounds good [11:54:35] ema: on the misc VCL, it's the setting of req.backend for all of them that's messy right? the rest with TLS and pass-behavior, not so much? [11:55:13] right [11:55:50] the whole if/elsif in cluster_be_recv_applayer_backend basically [11:57:16] bblack: the rest is fine, even though defining TLS/pass-behavior with data rather than code would be awesome of course :) [11:59:07] misc is a great example for T110717 anyways [11:59:07] T110717: Define a standardized config mechanism for exposing services through varnish - https://phabricator.wikimedia.org/T110717 [11:59:16] oh cool stashbot [11:59:50] essentially we decide whether to cache or not according to the Host header [12:00:18] well, yeah, kinda [12:00:49] and enforce the presence/absence of certain headers [12:00:56] but looking a little further out: 1) The TLS thing is a temporary hack because they're not all TLS-ified, but should be, at which point they'd use the shared TLS code that's already in wikimedia VCL [12:01:36] 2) the pass-behavior probably is per-req.http.Host more than per-backend, whereas the host-matching would be defined per-backend, so it's a little more complex to structure for [12:02:01] and really, it's not a direction we want to encourage anyways. if an application doesn't want to be cached, it could always send no-cache headers on its own :P [12:02:17] that makes sense :) [12:03:36] which brings to the question: why don't we send no-cache headers in people, etherpad and friends? 
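[Editor's note] For readers following along: the if/elsif block being called "messy" above is Host-header routing in the misc backend VCL. Below is a minimal sketch of that pattern in Varnish 3 syntax; the hostnames, backend addresses and the 404 text are illustrative, not the actual contents of cluster_be_recv_applayer_backend.

    # Sketch only (Varnish 3): names and addresses are made up.
    backend etherpad { .host = "etherpad.example.internal"; .port = "80"; }
    backend people   { .host = "people.example.internal";   .port = "80"; }

    sub vcl_recv {
        if (req.http.Host == "etherpad.wikimedia.org") {
            set req.backend = etherpad;
        } elsif (req.http.Host == "people.wikimedia.org") {
            set req.backend = people;
        } else {
            # unknown Host: synthesize an error rather than guessing a backend
            error 404 "Domain not served here";
        }
    }

Every new misc service means another elsif branch here, which is the code-vs-data problem T110717 is about.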
[12:04:04] because it's a PITA to patch them all I imagine [12:04:45] yeah maybe [12:05:04] most of them would be behind some nginx/apache/whatever instance though, where it's surely some singular add-header config line [12:05:44] it makes hit-for-pass instead of raw pass so it's slightly less efficient, but we're not talking about huge traffic where efficiency is that important anyways [12:06:36] (also, relatedly, I think misc is one of the cases where we're probably skipping important default VCL for things like cookies/auth/etc) [12:06:58] you would think misc would be the one where we'd most-want the default behaviors, since it's a very generic cluster [12:07:29] :q [12:07:39] hmmm this is not vi [12:07:56] :) [12:09:21] lunch & [13:02:20] 10Traffic, 10DNS, 10Mail, 6Operations, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2186521 (10chasemp) @bblack -- I hope this is what you are asking for, but our server using our mx's is the only valid source of a phabricator.wikimedia.org email. W... [13:10:33] 10Traffic, 10DNS, 10Mail, 6Operations, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2186525 (10BBlack) ok, so we should amend the patch to use `-all` and merge that [13:12:02] https://gerrit.wikimedia.org/r/#/c/282151/ looks great! [13:12:58] testing in puppet compiler now [13:13:16] the ruby code is blind, for all I know it contains horrible syntax errors, much less does the right thing [13:14:43] heh yeah rubyfail :) [13:15:04] very newbie question: what is the reason behind directors with only one backed? Future expansion, better configuration, etc..? [13:15:51] elukey: most applayer directors should only have one backend. there are only two exceptions: wdqs and logstash on the misc cluster [13:16:30] the "one backend" should be a virtual service hostname like foo.svc.eqiad.wmnet, which uses LVS+pybal to manage multiple backend hosts with healthchecks and monitoring [13:17:05] oh you got rid of yarn from the backend vcl [13:17:16] well it's unused, it's just hostname=>404 [13:17:33] bblack: right, forgot LVS in the backend, makes sense :) [13:18:00] wdqs and logstash should be that way too, someone needs to go through the pain of defining their LVS services though :) [13:18:51] bblack: sure it makes sense to move it to the frontend. It also makes the data-driven thingie easier [13:19:12] * ema adds a test for yarn as well [13:22:25] bblack: so the main reason for still keeping yarn in our VCL is to return the custom 404 response I guess [13:23:27] yeah [13:30:02] I think the compiler output looks right-ish now, but it's a shame the diff is so messy due to re-ordering [13:30:11] maybe if I inject a commit first that just sorts them, it will be simpler [13:48:05] ema: if the VTC tests work, let's merge them before my misc changes that are hard to human-parse diffs on? 
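[Editor's note] The single-backend directors elukey asks about above look roughly like this in Varnish 3; this is a sketch, and the service name, address and probe values are invented rather than taken from a production definition.

    backend example_be {
        .host = "example.svc.eqiad.wmnet";   # LVS service hostname; pybal balances the real pool
        .port = "80";
        .probe = {
            .url = "/";
            .interval = 5s;
            .timeout = 3s;
            .window = 5;
            .threshold = 3;
        }
    }

    # Only one member: fan-out to multiple app servers happens in LVS/pybal,
    # not in VCL, so varnish only needs the one virtual service hostname.
    director example_director random {
        { .backend = example_be; .weight = 1; }
    }

As noted above, wdqs and logstash are the two exceptions because they have no LVS service yet, so their directors list individual hosts instead.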
[13:48:17] may as well for the validation [13:48:48] bblack: they should all work except the last one I've added which depends on your frontend changes for yarn [13:48:54] those are merged [13:48:59] ah :) [13:49:09] I figured those are simple enough to human-verify :) [13:49:36] let me just double check on my instance and then I'll merge the tests [13:50:37] what's outstanding now is all changes to the set req.backend stuff (reorder existing, merge dupes in existing, then replace it with ruby code) [13:51:00] and I guess we're not testing setting req.backend really anyways, but at least it validates that the VCL still compiles [13:51:36] 10Traffic, 10DNS, 10Mail, 6Operations, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2186555 (10faidon) `-all`, like other antispam features (DMARC etc.) works poorly with mailing lists/reforwarders. If we're sure that there aren't any Phabricator ema... [13:57:19] mmh for some reason /usr/share/varnish/tests/misc/ does not get created [13:58:46] yeah, using the proper branch helps [14:00:32] bblack: tests merged [14:02:07] bblack: question about the outage that just happened.. what is the effect on the caches when a storm of 500s gets returned from mw? [14:02:16] (whenever you have time of course) [14:02:58] it forwards them on to users :) [14:03:14] with its regards [14:03:22] either way it's the same basic request volumes flowing in and out of the caches at various places, just a bunch of 200s get replaced with 500s [14:04:12] the nice thing is it doesn't pollute cache hits for anons, but there's probably enough dynamic-ish page elements that browser pages experience some strangeness for users [14:04:36] like I donno if the whole anonymous browsing experience falls apart when MW falls apart just because of some banner-loader code or whatever [14:04:52] but in theory, the basic HTML fetch of the article still works [14:05:19] logged-in users get hosed, since most of their fetches are uncacheable intentionally [14:05:51] and of course all of our most-vocal volunteers everywhere use logged-in accounts, so it's a much bigger outage to them than it is the anons [14:06:33] ahh ok the pollution part was my concern [14:06:59] so anon traffic remain cached and the small outage doesn't pollute it, meanwhile whoever is bypassing the cache gets a 500 [14:07:14] yeah with a few exceptions to note: [14:07:40] 1) some pages may randomly expire from caches during the outage window, and those will go 500 even for anons possibly. but that's probably pretty rare for a short outage + any kind of hot page [14:08:02] we have some grace-mode logic to prevent that, but I'm really not sure how effective it is in practice [14:08:23] and 2) again, it's questionable how much of the browser UI is dynamic and how that reacts [14:09:04] it may be that the initial article fetch is a cache hit and all fine, but then some javascript tries to load a bunch of dynamic uncacheable things for stuff like banners at the top, and that failure causes the browser UI experience to suck or throw errors [14:10:10] super 1) would have been my next question, thanks! 
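[Editor's note] The grace-mode logic referred to in point 1) is, in broad strokes, the stock Varnish 3 pattern below. The durations are illustrative, not what production uses, and (as bblack says) how effective it is in practice is a separate question.

    sub vcl_recv {
        if (req.backend.healthy) {
            set req.grace = 5m;      # healthy backend: tolerate slightly stale hits
        } else {
            set req.grace = 1h;      # backend sick/down: serve stale rather than 500
        }
    }

    sub vcl_fetch {
        set beresp.grace = 1h;       # keep expired objects around so grace can use them
    }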
[14:10:51] cleaning up grace-mode is something to do as we get into the outstanding ticket about reducing our max TTLs, but I'm kind of deferring that to post-varnish4 at this point [14:10:59] so we don't have to figure out the magic twice in two different worlds [14:12:52] another related thing we haven't done in a while is the usual cycle of "peek at the top backend->app requests on the text cluster in varnishtop, gasp in horror, complain about what they are and get some of them maybe fixed" [14:13:45] like, one of the top things the caches have to fetch from MW, probably pointlessly, is: [14:13:49] /w/index.php?title=MediaWiki:Wikiminiatlas.js&action=raw&ctype=text/javascript&smaxage=21600&maxage=86400 [14:14:00] I mean surely that javascript isn't updated that frequently, but it probably has no-cache headers :P [14:20:50] [6~:D [14:20:53] argh [14:20:53] /wiki/Special:HideBanners?duration=604800&category=fundraising&reason=close is second only to api.php [14:20:56] that seems awful [14:21:18] yep [14:22:06] the next two are /wiki/Speciaal:Willekeurig (which is dutch for Random) and /wiki/Special:BlankPage (which we use to monitor health) [14:22:33] I wonder why so much Special:Random, probably on just nlwiki? [14:22:59] and I have no idea what this is, but it's #4 on the list: [14:23:01] /w/index.php?title=%D9%85%D8%B3%D8%AA%D8%AE%D8%AF%D9%85%3Anull%2F%D9%88%D8%B5%D9%84%D8%A7%D8%AA%D9%8A&action=render [14:24:12] somewhere around #12 on the list is: /.well-known/apple-app-site-association :P [14:24:46] lots of these contributors to the relative traffic rates of internet-inbound to applayer-outbound are questionable [14:26:29] bblack: 282156 looks good to me and to varnishtest [14:26:34] yay [14:43:51] bblack: I'd say ship it! [14:43:52] https://puppet-compiler.wmflabs.org/2347/cp1061.eqiad.wmnet/ [14:48:23] I'm in a perfectionist mood, I want to see if I can make the declarative patch a no-op :) [14:53:11] https://puppet-compiler.wmflabs.org/2348/cp1045.eqiad.wmnet/ [14:53:16] ^ no-op on VCL output! :) [15:02:38] \o/ [15:03:28] bblack: is there any way to assign a topic to a gerrit CR once it's out? [15:03:41] without opening a new one if possible :) [15:04:18] yeah I think you can just re-push it with a new topic, what topic was it? [15:04:27] varnish4-upgrade [15:05:06] ah it doesn't like it with no other change [15:05:13] if someone merges something else I can do it during a rebase :) [15:05:23] or maybe git-review can do it [15:07:44] yeah it can't [15:07:53] I have to edit the patch or rebase it to submit it with a changed topic name :P [15:08:00] I guess I can reword commitmsg [15:08:17] while we're on a perfectionist mood... :) [15:08:38] awesome, it works [15:09:18] so, aside from the generic syntax changes [15:09:26] that whole misc_fetch_large_object things is a biggy [15:09:57] we don't need conditional do_stream=true anymore [15:10:12] we probably do still want a size limit on caching in the frontend only, but not in the backend [15:10:28] I mean, unless we anticipate objects so large they don't belong in a ~300GB storage chunk [15:11:01] I guess we could set a sanity limit at 1GB just in case, given the randomness of cache_misc's services [15:11:10] but on misc that's unlikely right? 
Especially given that we pass datasets.w.o [15:11:20] still, a sanity limit sounds good [15:11:47] in general (and here) there's still a little bit of thinking to do about varnish4's default do_stream + TE:chunked, too [15:12:08] if the backend does TE:chunked, and we leave default do_stream on everywhere... [15:12:22] we never get size info and never cache an application's TE:chunked response, even when it's small [15:12:36] (if we have size limits at all for caching at every layer) [15:12:56] what that argues for is probably a behavior like this: [15:13:35] 1. on 'direct' backends that talk to the applayer, the 1GB cutoff to hit-for-pass should not match TE:chunked (in other words, the backend always slurps a full TE:chunked into cache before passing upstream, and thus can send on size info) [15:14:13] 2. indirect backends can include >size + TE:chunked in the things they hit-for-pass (so at least the >1GB monster TE:chunked don't infect other backends' storage) [15:14:22] 3. frontends the same, at a smaller >size check [15:14:51] the hole in that VCL plan is if applayer services are sending "too big even for backend storage" objects (>1GB) which are TE:chunked, that really sucks for us. [15:15:07] but I think at that point we have grounds to say that's insane and you need your applayer to give us content-length on something so huge or else. [15:15:54] hmmm I should've added in (1): for TE:chunked, we set do_stream=false so that we get the buffered slurp, too [15:16:14] so recap: [15:16:53] direct-be: if !TE:chunked and size>1GB, hit-for-pass. if TE:chunked, do_stream=false (so we have size info for upper layers). [15:17:08] indirect-be: if TE:chunked||size>1GB, hit-for-pass [15:17:20] fe: if TE:chunked||size>10MB, hit-for-pass [15:18:15] in theory, we shouldn't even get chunked responses at indirect-be or fe, but including them in the hit-for-pass is just an escape-hatch to make sure the size limit is absolute. [15:18:43] it should only happen due to some strangeness where varnish has gzip or gunzip on the fly for cache<->cache, which shouldn't happen since they share do_gzip rules [15:20:51] the check for TE:chunked is just implicit btw. it's whether the Content-Length size-check defaults to zero or defaults to the cutoff value, since it's not present for the TE:chunked's [15:22:11] we could implement the same logic and behavior today in varnish3, inverting do_stream (assume false in direct-be, set it true in all the other cases), but then we're facing the stall of fast clients for slow concurrent ones if we turn on too much cacheable do_stream in varnish3 [15:24:28] an implicit assumption that needs testing in all of this, is that do_stream=false converts a TE:chunked backend response into one with Content-Length to the requesting client. [15:24:43] I think that's true, I think I observed it in varnish3 at least [15:25:06] (assuming no on the fly gzip/gunzip on the client side of things!) [15:25:25] (which would cause varnish to output TE:chunked regardless of what was happening with the backend) [15:27:23] if the gzip hole has real effects (e.g. a non-gzip client does the initial miss-fetch of an object and gets it cached ungzipped, then varnish has to on-the-fly gzip those cache responses from there out)... 
[15:27:57] we could perhaps also force AE:gzip on bereq just to close that up, and I think varnish would still gunzip-on-the-fly in the frontend for the rare non-AE:gzip clients [15:32:55] frontend's bereq, that is [15:33:03] hmmmm [15:33:36] chunked streaming of unpredictable output length is just a huge hole in HTTP anyways, it kind of sucks [15:33:52] 10Traffic, 10DNS, 10Mail, 6Operations, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2186995 (10greg) Unless some team has a team list specified for their team project contact (I highly highly doubt it), the only thing I know of is/was wikibugs-l, whi... [15:33:53] they should've required a size estimate header for TE:chunked at least. rough order of magnitude? [15:34:27] surely most sources could do that. "This is chunked, but should be ~1MB" or some such thing [15:36:18] Stupid comment while I try to catch up with all of this: you mean that there should be an indication of the total size? Because with chunked transfers you do get the chunk size, right? [15:37:05] well at least an estimate, so that when a chunked stream starts, the receiver has some notion of the difference between a 1KB stream and a 1TB stream about to come at it :P [15:38:00] * ema nods [15:38:20] TE:chunked doesn't send any size info as a header [15:38:35] within the chunked formatting each block has a size, I think you're right there [15:38:42] but that doesn't really tell us much up front about the total stream [15:39:39] given all caches have finite storage, and thus would want to sanity-cutoff what they store at all vs what they just have to stream through because it's too big [15:40:30] I guess the cache could make that decision more-dynamically, but that would be a perhaps heavy varnish code change [15:41:11] you'd have to store the object somewhere temporarily at first in the TE:chunked case, and when temp grows >X and is still growing, you dump the temp storage and go stream-only without caching [15:41:35] otherwise you eventually copy it to cache storage (possibly evicting other objects) when done streaming it in [15:42:24] but since the only interesting cases are huge transfers, that's going to be hard, it would have to be disk storage and excess object copy i/o and blah blah [15:43:57] we could invent our own size-estimate header and support it for the rare case our apps usefully tell us such things [15:44:56] if we have access to the app itself we could even add a header with the actual size I guess [15:45:04] X-Content-Length :) [15:45:17] well often the app doesn't really know, due to design [15:45:40] but maybe even when the app doesn't know what it's going to emit exactly, it may have an estimate [15:46:16] the developer may be able to say "even though we chunk-stream all output, no output from this app is >10MB due to design constraints" and emit a fixed estimate-cap header [15:46:44] or even better "this is dynamically streamed as we fetch database rows, but based on the table definition and row-count caps, it can't be more than 134MB for this particular query" [15:48:04] oh cool so X-Content-Max-Length [15:48:12] yeah something like that [15:48:16] 10Traffic, 10Analytics, 10Analytics-Cluster, 6Operations: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 (or 0.10?)
- https://phabricator.wikimedia.org/T121562#2187115 (10Ottomata) [15:49:04] we could support it in our VCL for all the mess above right off the bat, because even if everything else works as well as it can with all the above, we have two potential issues: [15:49:45] 1. Applications send TE:chunked responses with multi-gigabyte output, and it thrashes the direct backends' storage - in which case we scream and say "don't do that please" [15:50:33] 2. Applications send TE:chunked for reasonable responses, but don't like the fact that we're buffering rather than streaming in the first layer of cache, causing small perf loss vs ideal. In which case we tell them "If you send us this custom max-len header with a reasonable estimate, things would improve" [15:52:28] even if they can only do a fixed cap at "no more than 1GB", they would pass the check and get streamed at all layers [15:52:46] surely MW can do that for articles and such (noting this because MW does TE:chunked for article output) [15:53:33] if they can also tell us which are under 10MB (or whatever the frontend hit-for-pass limit is), they get FE-caching too, instead of the (very fast anyways) fetch from local be [15:55:10] the longest wiki article, https://en.wikipedia.org/wiki/1918_New_Year_Honours according to https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_records#Articles ... [15:55:19] but then would all MW articles be frontend misses? [15:55:29] still has a gzipped length of 314946 [15:55:47] that's well under any size cap we'd care about for cache limits, we just need to actually know that [15:56:49] if we did all the plans above, and MW doesn't change what it's doing at all, it would get cached at all layers, and streamed through all layers except the bottom-most (direct-be), which would buffer it to get the length to tell the rest [15:57:20] if the emitted an estimate <1GB, they could skip that one level of buffering, but then they'd need to also get their estimates accurate enough to be under the FE limit to avoid cache-miss there [15:58:01] or simply send us content-length instead of TE:chunked, if that's easy for MW [15:59:26] but also keep in mind: today they get buffered at every layer in varnish3. buffering at just 1/3-4 layers is still a win. [16:00:07] right [16:01:24] the alternative tradeoff (which might be better) would be to *assume* nobody will ever chunked-stream us object so big they're a problem for a ~300GB cache storage file (so >~1GB-ish) [16:02:09] then we have no size limit in the backend, but we still want to buffer chunked responses there to get length estimates to the frontend for its own size limit that's much smaller [16:02:59] and I guess we could still offer the max-length feature if they want to skip that buffering? but none of this would ever skip caching at backend layers [16:03:49] I donno, it's not much different in the net [16:03:56] there's no reason not to sanity-cap explicit content-length [16:04:45] all of this is about trying to be general purpose and handle pretty messed-up cases (giant streamed outputs, and giant outputs in general) [16:05:22] really those things only apply in practice to misc+upload. text has known-reasonable behavior today without worrying about it, so long as it holds that MW doesn't emit objects >~10MB [16:05:58] I'd like to think that, but then there's commonswiki... 
[16:06:13] which probably deserves all the large-object treatments that upload.wm.o has [16:07:09] well maybe not, it does seem like everywhere I look on commonswiki, it's using upload.wm.o image links [16:07:46] I would've assumed where it wouldn't [16:07:51] s/where/somewhere/ [16:09:29] another way to turn all this around in terms of how we're thinking: today in varnish3, we have the downsides of do_stream, thus it not being default and all the buffering... [16:09:47] in varnish4 we're happy that do_stream lacks the big downsides and is on by default and less buffering [16:10:22] but one downside is still there: do_stream=false converted TE:chunked with unknown size to an output with a fixed content-length. in a do_stream=true default world, we lose that size info through the cache stack. [16:11:22] for text cluster on varnish3 today, we rely on that. MW emits TE:chunked to direct-be, direct-be buffers it and emits a real content-length to upper layers and eventually to the user (if the user does AE:gzip. if they don't, varnish emits chunked at the front edge why it gunzips-on-the-fly) [16:11:59] since we don't have any size checks on text today, it kind of doesn't matter there, so long as we can keep assuming we don't need them. [16:12:24] but misc and upload both have reasons to have size-checks, so we don't want to lose the information we were getting from the initial buffering with do_stream=false. [16:13:56] if you need size checks, someone has to buffer and send a complete content-length'd output somewhere. the app does it, or varnish direct-be layer does it. [16:15:20] it would be more-efficient for the app to do it IMHO (most apps are behind apache or nginx or whatever, they can always configure that to slurp+buffer as it revproxies outwards) [16:15:50] because buffering over localhost is faster than across to our varnish direct-be (which is at least a different machine in the same DC, but could be x-dc with codfw switching and such) [16:17:10] but we don't need any of that mess in the text case probably, because we can set the reasonable constraint that no service of the text cluster sends singular chunked cache objects which are large enough to hurt the frontend memory caches [16:17:59] and in the upload case, we're actually kinda lucky today in that swift does send lengths on its file outputs, but I donno if that will hold for future thumbnailing changes [16:18:33] misc is a giant mixed bag though, and we have known service that play in this areas (WDQS, datasets, etc) [16:19:27] anyways, enough thinking out loud. we can stew on this a bit and come up with a decent plan. [16:20:47] nice [16:21:42] changing topic: I gave a shot to the vsm file permissions issue without much success https://gerrit.wikimedia.org/r/#/c/281918/ [16:21:49] suggestions welcome [16:22:53] yeah modules/admin is more about humans I think [16:23:32] I've never had to muck with /etc/groups here, so I'm not sure. we need whatever abstraction perhaps already exists to "put this existing user in this existing group" where those were created by package installs [16:23:45] and requires on the relevant package installs? [16:25:15] we could just write a new module that ends up execution sed to edit /etc/groups on the fly, and does it on every run of puppet, just to give the rest of the group something to scream expletives about [16:25:23] s/execution/executing/ [16:29:34] I'm out for a bit for lunch, etc. 
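[Editor's note] To make the direct-be / indirect-be / fe recap from earlier concrete, here is one way it could look in Varnish 4 VCL. This is a sketch, not the eventual change: the subroutine names (which would be called from vcl_backend_response at the appropriate layer), the 1GB/10MB cutoffs and the hit-for-pass TTL are placeholders, and TE:chunked is detected implicitly via the missing Content-Length, as described above.

    import std;

    # Direct (applayer-facing) backend: buffer chunked responses so upper
    # layers see a Content-Length; hit-for-pass only for known-huge objects.
    sub cluster_be_backend_response {
        if (!beresp.http.Content-Length) {
            # TE:chunked from the app: slurp it so we learn the size
            set beresp.do_stream = false;
        } elsif (std.integer(beresp.http.Content-Length, 0) > 1073741824) {
            # > 1GB: don't cache, create a hit-for-pass object instead
            set beresp.ttl = 601s;
            set beresp.uncacheable = true;
            return (deliver);
        }
    }

    # Frontend (indirect backends would be the same with a 1GB cutoff):
    # default a missing Content-Length to "over the limit" so chunked
    # responses and anything > 10MB become hit-for-pass here.
    sub cluster_fe_backend_response {
        if (std.integer(beresp.http.Content-Length, 10485761) > 10485760) {
            set beresp.ttl = 601s;
            set beresp.uncacheable = true;
            return (deliver);
        }
    }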
ttyl [16:30:56] enjoy [16:53:51] 10Traffic, 10DNS, 10Mail, 6Operations, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2187405 (10Nemo_bis) http://markmail.org/search/?q=from%3Aphabricator.wikimedia.org is empty, unlike http://markmail.org/search/?q=from%3Agerrit.wikimedia.org , so th... [17:56:58] bblack: Hey, I appreciate your help the other day regarding our Varnish hit ratio problem after upgrading to MW 1.26. Unfortunately the change I tried, not caching only session|token cookies rather than all cookies, did not have any impact on our hit ratio, so I'm still stumped by this. [18:03:19] 10Wikimedia-Apache-configuration, 6Operations: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#2187697 (10Varnent) 5Resolved>3Open We would like to change the redirect for an announcement on Monday (if possible) to this URL: https://policy.wikimedia.org/stopsurveillance So http://w... [18:03:37] 10Wikimedia-Apache-configuration, 6Operations: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#2187700 (10Varnent) p:5Normal>3High [18:46:39] justinl: did you use the same regex we have? there are some near-misses from various MW extensions if you simplify it [18:48:19] otherwise as general advice, I'd say run "varnishlog -b ..." (whatever -m matches make sense as you dig) and look at the requests exiting the backside of varnish and their responses. Are Cache-Control response headers sane? in the varnishlog output it will show you if it decided to create a hit-for-pass object too [18:48:37] here's the current output with the slight change to the vcl: http://pastebin.com/p7ChZLVS [18:48:43] top dozen or so [18:49:05] the change I made to the vcl was this: if (req.url ~ "^/images/thumb/.*\.(jpe?g|png|gif)$" || req.url ~ "/favicon.ico") { [18:49:11] so it has to be something to do with ResourceLoader I'd think [18:49:11] wrong line... [18:49:18] if (req.http.Authorization || req.http.Cookie ~ "([sS]ession|Token)") [18:49:25] changed that from just req.http.Cookie [18:49:31] on the right of the || [18:49:34] ah, so we have an = there too [18:49:43] req.http.Cookie ~ "([sS]ession|Token)=" [18:49:51] so it only matches at the end of the cookie name, not elsewhere [18:50:06] I can give that a try [18:50:10] can't hurt! [18:50:40] changed on one of four servers, let me watch the metrics for a few minutes [18:50:41] either it's RL-related, or obviously it could be a general-case problem and RL is just naturally at the top of the list anyways [18:52:20] I wouldn't think, though, that a fairly vanilla mediawiki installation should require any special varnish configuration, regardless of the mediawiki version, since it was doing so well prior to the upgrade [18:53:34] well you'd think a lot of things too, e.g. that the apache config for a basic MW install would be pretty vanilla :) [18:53:53] it turns out MW in practice is more than just MW, it's all the supporting bits it spreads itself over too [18:54:00] (in practice) [19:00:02] justinl: looking back at your VCL paste from the other day, I'd look at the hit_for_pass stuff in vcl_fetch, too. Maybe a lot of MW URLs are now emitting Set-Cookie in your upgraded install for some reason? [19:03:37] Maybe but I don't really do anything with mediawiki other than install/upgrade the software so if it's making changes to how it generates cookies, I'm not familiar with it, and the varnish code I'd just pulled from that mediawiki docs page. 
So my code, except for the thumbnail images, wasn't anything I wrote for us. [19:04:30] I'm doing a quick varnishlog check and so far the only set-cookie headers I'm seeing are for database session cookies [19:04:32] still, it could be that with the upgrade, something subtle has changed that requires some VCL support, or some mediawiki config changes to work with normal VCL, or something [19:05:10] I'm just wondering if anyone else that uses MW + Varnish is seeing a similar impact [19:05:17] 7HTTPS, 10Traffic, 10MediaWiki-General-or-Unknown, 6Operations, 10Wikimedia-General-or-Unknown: securecookies - https://phabricator.wikimedia.org/T119570#2187877 (10BBlack) [19:05:20] 7HTTPS, 10Traffic, 7Varnish, 6Operations, 13Patch-For-Review: Mark cookies from varnish as secure - https://phabricator.wikimedia.org/T119576#2187875 (10BBlack) 5Open>3Resolved a:3BBlack [19:05:35] or do most people not run a large enough installation for it to matter or be noticeable [19:06:27] Most of the time when I'm troubleshooting I can't even really find references to people running load balanced configurations, most I see are single server [19:06:30] it's hard to say [19:06:37] aside from wikipedia, of course [19:06:42] and that's a whole other scale [19:06:45] right, there probably aren't many *giant* installations with large traffic [19:07:14] and our situation here at WMF is very different: we're doing continuous MediaWiki release rollouts multiple times a week between the big published versions you see [19:07:30] and constantly adjusting our VCL config to match that and improve/optimize in general [19:07:50] so it's not really obvious from my perspective that e.g. MW 1.25 requires a specific line of VCL code that's new since 1.24 [19:07:57] yeah, so we're kinda in the middle, needing good caching due to traffic and load but not big enough for really custom stuff [19:08:54] actually our wikis are performing quite well even at 37% hit ratio, though if I turned off varnish, the web servers would fall over [19:08:59] I'm assuming the = on the cookie regex didn't do much right? [19:10:12] it's been 15 minutes so it's still ramping up but so far it's no better than before [19:10:39] i only changed one server though, not sure if purges from the other three servers could affect the changed server's ratio, though [19:11:38] probably not I'd think [19:12:55] what might be interesting, if you're really running out of ideas, is stop processing purges for an hour or so and see what happens, in case it is a massive rise in PURGE volume that's causing issues [19:13:12] we have had that sort of thing here lately, but it hasn't hit us too bad. there were some changes to how/when PURGE traffic is generated [19:13:23] hmm... [19:13:25] I really don't think that's it, but it's a hint to keep in mind [19:13:37] ok, i'll investigate that [19:14:20] basically in your vcl_hit code comment out the "purge;" part so that it doesn't actually do anything [19:15:12] another related thing to think about... I'm not sure what you're using to generate your hit/miss stats (and really, there's a lot of other states besides those two anyways) [19:15:30] but if it's naive, you may be counting the hit/miss of the PURGE request itself in your stats [19:15:36] i do see a lot of purging but there's plenty of room in the cache, it's not full, no pages being nuked, etc. 
and not that many edits [19:16:02] in which case perhaps the cached content and user hit-ratio is fine, but you've seen a large increase in PURGEs of non-hot articles that aren't doing much, but just generating PURGE misses [19:16:05] in graphite i'm using hits/(hits+misses) the same as varnishstat shows [19:16:11] right... [19:16:14] and the same as multiple web sites regarding this topic show [19:16:16] so that would include PURGE traffic [19:16:42] when the PURGE actually PURGEs an object, that's a hit. when it's a purge for something that wasn't in cache, that's a miss [19:16:53] it would be better to somehow exclude all PURGE traffic from your hit/miss stats [19:16:55] which has always been included, even prior to this upgrade, so either there wasn't much purging going on, even with only 4 GB allocated to varnish as opposed to 8 now [19:17:20] the collectd varnish plugin doesn't break it out, just provides the varnishstat metrics [19:17:21] yeah, I'm just saying, the MW-side change could be a massive ramp-up in pointless PURGE requests that miss, which skews your hit/miss stats but doesn't really affect users [19:17:33] hmm, interesting though [19:17:34] thought [19:17:45] I think in our varnish3, we actually had a patch for that, looking... [19:18:33] yeah we do [19:18:34] https://github.com/wikimedia/operations-debs-varnish/blob/3.0.6-plus-wm/debian/patches/0010-varnishd-cachehitmiss-ignore-purge.patch [19:18:46] ^ we run that source patch on our varnish3, so that PURGE do not increment hit/miss counters in varnishstat [19:19:48] that is interesting. wondering if i can roughly calculate it in our existing varnishes from varnishlog for example since we just use varnish out of the ubuntu repo [19:20:25] well the first question would be whether PURGE request volume is significant compared to GET, or else it wouldn't swing your whole stats [19:21:05] maybe "varnishtop -c -i RxRequest" [19:21:49] doing something like that now, running the varnishtop and passing it through awk to count up the RxRequest types [19:21:55] varnishlog* [19:22:11] another thing you can do, that I do here for a different perspective on the various meanings of "hit ratio", is actually compare input-side requests to output-side requests (that comparison wouldn't get killed by PURGE, since PURGE generates no output-side request) [19:22:57] in varnishstat terms, that would be compared client_req to backend_req [19:23:25] to get a real "our total inbound requests are reduced by X% before reaching the backend" perspective on "hit ratio" [19:24:57] hmm, well right now client_req/backend_req seems to be about a 6:1 ratio, so that seems in-line with the 85% we used to see in the hit ratio graph [19:25:22] yeah that's telling... [19:25:25] let me throw those parameters into our graphs for comparison, i just grabbed those numbers from varnishstat [19:25:56] how does your request method (PURGE, GET, etc) stats break down in varnishtop too? 
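[Editor's note] For context on why purges land in the hit/miss counters at all: in Varnish 3 a PURGE is looked up like any other request and ends up in vcl_hit or vcl_miss. A common pattern is the stock one sketched below (the ACL addresses and response texts are examples). It does not change varnishstat's counters — that is what the source patch linked above does — but the distinct status codes make purge volume and purge hit/miss easy to separate in varnishlog/varnishncsa output.

    acl purgers {
        "127.0.0.1";
        "10.0.0.0"/8;        # example: wherever MediaWiki sends purges from
    }

    sub vcl_recv {
        if (req.request == "PURGE") {
            if (!client.ip ~ purgers) {
                error 405 "Purging not allowed";
            }
            return (lookup);
        }
    }

    sub vcl_hit {
        if (req.request == "PURGE") {
            purge;
            error 200 "Purged";          # object was in cache (counted as a hit)
        }
    }

    sub vcl_miss {
        if (req.request == "PURGE") {
            purge;
            error 404 "Not in cache";    # nothing to purge (counted as a miss)
        }
    }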
[19:26:27] I mean I see stuff like this lately at WMF: [19:26:27] 17533.74 RxRequest GET [19:26:27] 13917.95 RxRequest PURGE [19:26:27] 244.88 RxRequest POST [19:26:28] 74.60 RxRequest HEAD [19:26:41] we have some known issues with excessive PURGE volume, which you could be seeing too [19:26:56] it's not really hurting the article caching so bad, it's just a lot of excess junk on the PURGE requests in practice [19:27:23] (which would be the "this isn't a real problem, but PURGE is skewing your hit/miss and spamming your servers processing pointless PURGEs" idea of what you're facing) [19:28:07] in a sane world, PURGE should be much smaller than GET, especially with a smaller count of varnishes [19:28:09] running varnishtop -i RxRequest now [19:29:34] looks like our numbers are going to be comparable to yours... [19:29:44] 3829.63 RxRequest GET 2348.89 RxRequest PURGE 57.31 RxRequest POST 1.77 RxRequest HEAD [19:29:54] that was actually one of my better-looking samples heh [19:30:30] another check [19:30:33] 28548 GET 9 HEAD 423 POST 27712 PURGE [19:30:34] if I let it run a while, PURGE tends to outnumber GET for us on one server [19:31:03] i'm going to add new graphs of the client_req / backend_req and RxRequest counts [19:31:16] but if this was really trashing your article content cache in practice, you wouldn't still be seeing 6:1 req ratios [19:31:20] sounds like purges may indeed be skewing things now [19:31:26] I think it's just the purge spam itself that's the problem [19:31:58] yeah, the fact that performance seems great did seem at odds with the terrible hit ratio [19:32:35] thank you again for the help! nice to know i'm not going insane :) [19:33:02] :) [19:33:44] I'd still file a bug for MW devs to look at and ask "Why did my PURGE volume go way up with this MW version upgrade?" [19:33:52] because I'd be interested to see the answer too heh [19:34:53] I'm trying to find our bug here, which isn't directly pointed at MW core updates itself, but kinda covers the same ground [19:36:17] I'll do that :) [19:36:35] ah here it is: https://phabricator.wikimedia.org/T124418 [19:36:49] you may want to read/ref that as well, it's from digging into massive PURGE increase here since December-ish [19:36:58] which came in a few waves, some of which we found reasons for [19:37:40] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 3 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#2187915 (10BBlack) [19:37:56] I fixed that first link, the relative date window is out of date heh [19:37:57] http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1453466175.547&target=MediaWiki.jobqueue.inserts_actual.htmlCacheUpdate.rate&from=-180days [19:48:27] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 3 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#2187948 (10BBlack) {F3845100} [19:48:57] https://phabricator.wikimedia.org/T124418#2187948 above now has a clickable image of our actual PURGE rate increase vs non-PURGE since Nov [19:50:53] wow that's a huge change in purges [19:52:14] some of that was delayed effect. our purges go through a job queue, which had some limitations and implicit throttling [19:52:34] the problem started building earlier than the big spike, but they backlogged in the jobqueue and such. 
then someone "fixed" the jobqueue and it released the hounds [19:52:55] which is why the ticket starts with an htmlCacheUpdate graph, which shows the earlier rate increases before the PURGE spiked up [19:53:45] but keep in mind, our infrastructure's layout makes the problem look worse than it is. our GETs are spread over a ton of servers, and our PURGEs are mirrored to all of them [19:54:15] so there's an implicit multiplier. if you were trying to more-accurately look at the whole picture, you'd sum the GETs on all the servers, vs the PURGEs that arrive at just one (or avg them, same thing) [19:54:30] that graph just sums both across them all [19:55:39] that sounds like ours as well, except with only 4 servers. all 4 servers run both varnish on port 80 talking to apache on 8080, and $wgSquidServers has all 4 servers. [19:56:37] yeah for us, the purges are hitting all our text and upload frontends in that graph [19:57:17] so instead of an x4 multiplier we're looking at something like x70 [19:57:36] (across 4 distinct datacenters) [19:59:12] we also don't use wgSquidServers the normal way. we send multicast HTCP traffic to fan out the PURGE traffic, and then local daemons on each varnish host listen to the multicast UDP flow and make an HTTP connection to varnish over localhost, and translate it into PURGE requests there. [20:02:20] even in a medium scenario that's better. say if you had 25 MediaWiki servers and 25 caches, it would be 625 pairings of them to do HTTP connections from MW->cache and send purge, hopefully persistent ones heh [20:04:05] bblack: should we merge 282173? Confirmed noop on v3. [20:04:52] ema: yeah we could, but I was thinking as long as I'm there, I should fix the error statement too right? [20:05:21] bblack: oh it's fine I have another patch almost ready for that [20:05:52] ok [20:06:56] ema: also note from ops meeting stuff at the top, I'm not here tomorrow [20:07:06] will be back monday though :) [20:07:38] alright! [20:08:04] (and will have normal laptop/phone/access if something emergency comes up, just call me first) [20:09:50] bblack: logstash_director uses 'random' as the director type, with a comment saying "maybe-wrong? but current value before this commit! XXX" [20:09:58] sorry, it uses 'hash' [20:10:02] right [20:10:06] any reason not to use random instead? [20:10:37] I'm not sure, they might actually want hash, I just don't know [20:10:48] OK [20:10:49] even so, we'd probably rather than on chash/vslp [20:11:00] then we have to do some more magic in the ruby part [20:11:12] s/than/them/ [20:11:56] hash.backend() needs to be called with a string to pick the backend [20:11:58] but really, as I think I've mumbled about a few times but done nothing to address yet: logstash and wdqs should be LVS services living at foo.svc.eqiad.wmnet with LVS doing the random part, and just one backend host def on our end [20:12:51] if they really did need "hash" though and we can't escape that, LVS can't do URL hashing of course [20:12:59] we could at least fix up wdqs though [20:13:05] right. Who is they? :) [20:13:41] whoever runs logstash :) [20:13:45] I donno! :) [20:13:50] * bd808 raises hand [20:14:18] bd808: so the question is: do we need the requests to logstash100x to be hashed on the URL for some perf/caching reason at the logstash level? [20:15:06] for the kibana web front end or for the various transports that feed log data in? 
[20:15:27] kibana should be stateless [20:15:28] for logstash.wikimedia.org traffic, so kibana [20:16:01] right now logstash.wikimedia.org traffic is configured to hit logstash100[123], but it's hashing distinct subsets of the URL space to them [20:16:08] if we can switch it to random, our lives get easier [20:16:16] should work fine [20:16:21] awesome, thanks! [20:16:35] I probably copy/pasted badly at some point in the past :) [20:17:00] alright, let's switch to random then! Thanks bd808 [20:43:00] bblack: so obviously we still have to sort out the streaming part, but at least for syntax changes -> https://gerrit.wikimedia.org/r/#/c/282235/ [20:43:06] noop on v3 [20:46:09] and it passes the tests on v4 [21:28:26] 10Wikimedia-Apache-configuration, 6Operations: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#2188311 (10Krenair) Why was this reopened instead of a new task being created? This task was resolved months ago... [21:37:27] 7HTTPS, 10Traffic, 6Operations, 5MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1451751 (10bd808) The kibana dashboard at (NDA required) shows the long tail of b... [21:43:09] 10Wikimedia-Apache-configuration, 6Operations: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#2188377 (10Varnent) I was following the same process as the last time the redirect was changed - which is what Dzahn suggested we do (see earlier comments). [22:22:01] 7HTTPS, 10Traffic, 6Operations, 5MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2188513 (10BBlack) Thinking ahead a little (because I'm sure the long tail will still be long after we go through the announcement -> cutoff date phase): it would... [22:42:32] 7HTTPS, 10Traffic, 6Operations, 10Wikimedia-Fundraising, 7Blocked-on-Fundraising-Tech: links.email.donate.wikimedia.org should offer HTTPS - https://phabricator.wikimedia.org/T74514#2188594 (10CCogdill_WMF) Closing the loop on this, finally! We no longer need to support Silverpop's domains (mtk41 or mkt4...
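[Editor's note] The hash-to-random switch discussed above lands differently in the two Varnish versions; in the Varnish 4 world being migrated to, directors come from a vmod and are instantiated in vcl_init. Below is a sketch of the random case; the backend addresses, port and the Host match are illustrative, not the production config.

    vcl 4.0;
    import directors;

    backend logstash1001 { .host = "logstash1001.eqiad.wmnet"; .port = "80"; }
    backend logstash1002 { .host = "logstash1002.eqiad.wmnet"; .port = "80"; }
    backend logstash1003 { .host = "logstash1003.eqiad.wmnet"; .port = "80"; }

    sub vcl_init {
        new logstash_director = directors.random();
        logstash_director.add_backend(logstash1001, 1.0);
        logstash_director.add_backend(logstash1002, 1.0);
        logstash_director.add_backend(logstash1003, 1.0);
    }

    sub vcl_recv {
        if (req.http.Host == "logstash.wikimedia.org") {
            set req.backend_hint = logstash_director.backend();
        }
    }

With directors.hash() the .backend() call needs a string to hash on, which is the extra ruby-side magic mentioned above; random() takes no argument, which is part of why it is the simpler choice once kibana is confirmed stateless.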