[08:08:24] hi ema, bblack and elukey. Hopefully I will be more online again soon. The thesis is almost on its way to being printed and then I want to see if I can look into some things going on :)
[08:09:21] Snorri: nice!
[08:31:29] our vcl looks better now without the v3 compatibility mess
[08:34:31] ok so the current vcl_hit situation looks a bit strange to me
[08:35:15] 1) we don't include the builtin vcl, claiming that it's just return(deliver), which is not true at least under v4 https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb#L291 https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/templates/vcl/wikimedia-backend.vcl.erb#L70 https://www.varnish-cache.org/docs/4.1/users-guide/vcl-grace.html
[08:35:49] 2) wm_common_recv_grace is empty https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb#L204
[08:42:50] we do set beresp.grace to 60m in wm_common_backend_response though, so the else part of wm_common_recv_grace seems to be taken care of
[08:47:18] what I'm wondering now is: we're not doing the 5m grace part anywhere else AFAICT, is that ok? Is it fine to skip the builtin VCL in this case?
[08:47:52] should we remove wm_common_recv_grace altogether?
[10:40:30] 10Traffic, 06Operations: Huge increase in cache_upload 404s due to buggy client-side code from graphiq.com - https://phabricator.wikimedia.org/T151444#2817348 (10ema)
[10:42:07] 10Traffic, 06Operations: Huge increase in cache_upload 404s due to buggy client-side code from graphiq.com - https://phabricator.wikimedia.org/T151444#2817361 (10ema) p:05Triage>03Normal
[10:56:41] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2817405 (10Gilles) >>! In T66214#2816123, @GWicke wrote: > - Supply original format and size in the URL or metadata, and let the client choose between SVG...
[13:52:38] ema: yeah our grace handling is pretty bad, it's something I've been avoiding until post-v3-cleanup. We probably need to have some design conversations around what we want to do there, and what kind of limits we use on which clusters, etc.
[13:53:48] ema: on the graphiq.com issue, should we set resp.reason as well, to a string similar to the WWW-Authenticate stuff?
[13:56:34] bblack: yep, not sure what the best message would be though :)
[13:58:43] bblack: I've just updated the patch using the same string for now
[14:01:31] ema: that string is fine I think. Maybe also loosen the referer regex to cover other subdomains and non-HTTPS? e.g. "^https?://([^/]*)graphiq.com"
[14:01:54] I don't know why I put parens in there, but something like that
[14:02:16] really just "graphiq.com" would be simpler and still work heh
[14:19:58] bblack: [^/]*graphiq.com could potentially match other websites too [nitpick mode on]
[14:48:41] yeah :)
[14:48:59] (/|$)
[14:49:07] and escaping dots, and blah blah
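A minimal sketch of the tightened Referer check being discussed, in Varnish 4 VCL. The regex follows the nitpicks above (subdomains, http/https, escaped dots, anchored end of domain); the 403 reason string is a placeholder, not the one actually used in the patch:

    sub vcl_recv {
        # Block requests whose Referer points at graphiq.com or any of its
        # subdomains, over HTTP or HTTPS, without also matching e.g. "notgraphiq.com".
        # The second synth() argument becomes resp.reason on the way out.
        if (req.http.Referer ~ "^https?://([^/]*\.)?graphiq\.com(/|$)") {
            return (synth(403, "Hotlinking from this referrer is not permitted"));
        }
    }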
[14:49:24] bblack, hey. think I've found a bug in the varnish puppet manifests
[14:49:32] not surprising, what?
[14:49:48] it relies on several packages that come from the experimental package repo
[14:49:58] oh that's "normal" for now
[14:50:00] which is fine in itself, but it doesn't appear to actually set up that repo
[14:50:20] we kind of don't want to puppetize experimental broadly
[14:50:25] the only thing I found that did was role::lvs::balancer
[14:50:38] we're now at the point where we should move the packages to the main repo instead of experimental
[14:50:51] it was to avoid accidentally installing varnish4 on varnish3 (which there aren't any of, any more)
[14:50:58] aha
[14:51:09] Okay, I'm going to manually fix deployment-cache-text04 for this
[14:51:13] ema: ^ yet another post-v4 cleanup task: move packages
[14:51:26] Krenair: we have some instructions about manually turning on experimental I think
[14:51:46] in there: https://wikitech.wikimedia.org/wiki/Varnish#Upgrading_from_Varnish_3_to_Varnish_4
[14:51:47] yeah it looks pretty simple, deployment-cache-upload04 already has it somehow
[14:52:08] thanks
[14:53:14] yeah that contains what I was going to run
[14:53:30] I think either I or someone else used those instructions on -upload at some stage
[15:02:37] bblack, want a task for cleaning this up, or..?
[15:04:00] Krenair: looks like it's already noted in T150660, just not done yet.
[15:04:01] T150660: Post Varnish 4 migration cleanup - https://phabricator.wikimedia.org/T150660
[15:04:06] ok
[15:04:19] yeah that's the last step I guess
[15:04:50] before closing the task that is :)
[15:06:07] ema: re grace stuff, we've probably still got some bad edge cases happening live around the ttl=0s mark, too
[15:06:15] well bad for caching, not bad for users
[15:06:49] oh wait, I worked around that, it's ok for now
[15:07:23] I was worried about 0s hits (hit right as it's expiring) + our "ttl <= 0s" hit-for-pass, but it looks at X-Cache-Int to identify hits
[15:11:46] bblack, ema: the finished thesis is in your inbox.
[15:12:19] \o/
[15:12:32] awesome!
[15:13:05] on the grace questions, there's a few different angles to think about. roughly rephrasing the docs (and assuming we modified our VCL to match more like their 4.1 examples in these regards):
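For reference, this is roughly the shape of the grace handling in the 4.1 builtin/example vcl_hit that the morning's links point at; a paraphrase from memory of the 4.1 docs, not our VCL:

    sub vcl_hit {
        if (obj.ttl >= 0s) {
            # still within TTL: a plain hit
            return (deliver);
        }
        if (obj.ttl + obj.grace > 0s) {
            # expired but within grace: serve the stale object and let
            # varnishd refresh it with a background fetch
            return (deliver);
        }
        # beyond grace: fetch synchronously (4.1 prefers return(miss) over
        # the deprecated return(fetch) here)
        return (miss);
    }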
[15:13:31] while obj.ttl>=0, a cache hit is a normal cache_hit
[15:14:09] after that, obj.grace still allows a cache hit, but also triggers a background refresh of the content for future requests
[15:14:54] if obj.keep > obj.grace, the object will persist in storage for that additional "keep" time, but never be used as a hit, only as an IMS candidate for a 304 from the backend
[15:16:30] and then we have req.grace and req.keep on the request side
[15:16:53] those effectively cap the acceptable grace/keep windows allowed from objects in storage, in the scope of a particular request
[15:17:26] ignoring keep for a moment (as that's an entirely different topic and still requires a functioning backend)
[15:17:55] the basic idea is we decide on a value G which is the maximal grace we're willing to use in "oh shit backend is dead" conditions
[15:18:32] we set obj/beresp.grace to that on new fetches, so that we know they'll stick around that long for grace purposes in storage (unless pushed out of course)
[15:19:13] we pick another value H which is the grace we allow when things are healthy (just to avoid latency hit to the user on background refresh of common objects shortly after expiry)
[15:19:21] H < G
[15:19:45] on the request side, we normally set req.grace=H, but if we detect backend issues, we flip to setting req.grace=G
[15:21:35] (related "saint mode" doesn't really apply to how we operate here, for better or worse)
[15:22:51] and in terms of our contract with the applayer/developers, we say that the maximum time an object lives in caches (that they need to plan around with their TTLs and deployments and deprecations, etc) is ttl+H under normal conditions, but ttl+G under exceptional (outage-y) conditions.
[15:23:47] probably, reasonable values for H are in the 5-60m range, and reasonable values for G are in the 1-7d range.
[15:24:08] interrelated with all of this is how TTLs work with layering
[15:25:05] the expiry time set by cache-control + age *should* be transitive. If the applayer indicates via CC/Age that an object has 9 days to live, it should never live beyond 9 days ttl in caches (well, +grace on top of that).
[15:25:42] but when we do our ttl capping on obj.ttl (e.g. applayer says 9 days, but our varnishd caps obj.ttl to 7d), that has no transitive effect, it only shortens the TTL within this one varnishd.
[15:26:17] and probably Surrogate-Control plays a role in how we fix up those parts
[15:27:08] (we start by using it for inter-layer TTL control between varnishds, and then publish a plan for how applayers should transition to using it to control varnish TTLs/behaviors as well, leaving CC unmolested and only consumed by UAs outside of our infra)
[15:28:39] probably our capping via obj.ttl is "wrong" - if we want to enforce maximal cache lifetimes that cap what the applayer says, we should do it by modifying/setting Surrogate-Control (or Cache-Control if not that)
[15:28:55] so that it doesn't matter how many layers it passes through, to determine the effective maximum age
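One possible shape of that transitive capping, as a sketch in the backend-most varnishd's vcl_backend_response; the 1d cap and the bare max-age handling are illustrative, not a worked-out design:

    sub vcl_backend_response {
        # Cap how long this object may be cached and advertise the cap via
        # Surrogate-Control, so upstream cache layers inherit the shortened
        # lifetime instead of re-reading the applayer's longer Cache-Control.
        if (beresp.ttl > 1d) {
            set beresp.ttl = 1d;
            set beresp.http.Surrogate-Control = "max-age=86400";
        }
        # Cache-Control itself is left alone for UAs outside our infra.
    }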
[15:30:06] another bit of the puzzle: we only do healthchecks for varnish<->varnish inter-layer stuff. We don't currently healthcheck LVS-based applayer backends because LVS is already healthchecking and handling pooling, etc.
[15:30:47] but if we want the req.grace switching to work for the final leg of backendmost-varnish->app, we have to get healthcheck into there (maybe with different parameters than normal), to hopefully witness total failure of ability to reach a backend service.
[15:34:16] that's pretty much all of my rambling thoughts on that topic
[15:34:48] I think we have multiple open related tickets, and now that we're past the varnish4 transition we can work on cleaning all of the above up, and reducing our effective maximum cache times from current values, too.
[15:35:53] currently we do the obj.ttl (per-layer, non-transitive) capping at 7d, and MW is sending I believe 14d in Cache-Control
[15:37:06] I'd like to get things down to where we're, say, capping off healthy normal TTL at 1d transitively (1d total for the whole of Traffic as a black box, regardless of layering), but also have, say, 7d of maximal grace built in as well for handling strange situations.
[15:37:39] but there's some thinking to do on what those strange situations are and how it plays out.
[15:38:13] for covering a quick outage->recovery (say of esams->eqiad link), we really don't need a huge grace time. We need just enough to react and depool esams, basically.
[15:38:42] for caches->app at the last leg, similarly we only need enough grace to cover the timeframe of reported issue -> fixes, which is hopefully short too
[15:39:23] it's the other cases that are harder to define. if we depool esams for 4 days and then bring it back, does having longer grace (or even keep) help us get through initial inrush? etc
[15:41:24] maybe we don't need 7d grace. maybe we just need 7d keep so that IMS can reduce the bandwidth burst of the inrush of content refresh in that case.
[15:43:38] I don't know that we've ever really tried to enumerate what all the possible (well, reasonably predictable ones anyways) operational scenarios are that could affect the desired values here.
[15:50:14] So... I think I almost understand what you are talking about (the grace thing is still a bit unclear).
[15:50:56] Maybe I could help to determine some bounds from the time with the v3 varnish.
[15:52:11] In my data there are some cases where esams went offline and I was routed to eqiad instead. But I do not have latency data, just the collected header data.
[15:56:08] "grace" is basically: if we find an object in cache whose TTL is expired, but the TTL expired less than "grace" seconds ago, go ahead and serve it like a cache hit, while also fetching a new copy into cache for future requests in the background asynchronously
[15:57:19] grace can be set on a backend response coming into the cache, which sets a maximal grace for hits against that object and affects (along with "keep") when it will be evicted on time (if not pushed out of storage to make room earlier)
[15:58:01] and then grace can also be set on a request, which caps the grace amount that request will allow from an object in cache. the object could have 7d grace, but the request could specify it only wants to accept up to 5min grace.
[15:59:33] another thought: we tend to think in terms of absolutes on ttl/grace/keep, but it might be better to think in terms of fractions of the applayer TTL
[16:00:49] (then again, at request-time we don't know the applayer TTL for picking a healthy-mode grace. We can only calculate off of that at response time for obj.grace maximal value)
[16:03:53] Mhhh I see
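Putting the H/G idea together with the 4.1-style vcl_hit, one possible sketch; H=5m and G=7d are illustrative values, and std.healthy(req.backend_hint) stands in for whatever health signal ends up being available for each leg:

    import std;

    sub vcl_backend_response {
        # G: the largest grace window we ever want available in storage
        set beresp.grace = 7d;
    }

    sub vcl_hit {
        if (obj.ttl >= 0s) {
            return (deliver);
        }
        if (std.healthy(req.backend_hint)) {
            # backend healthy: allow only a small grace window (H) to hide
            # the latency of the background refresh
            if (obj.ttl + 5m > 0s) {
                return (deliver);
            }
            return (miss);
        }
        # backend looks dead: fall back to the full grace window (G)
        if (obj.ttl + obj.grace > 0s) {
            return (deliver);
        }
        return (miss);
    }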
[16:08:21] I think for normal mode a good grace time would be like 10 minutes. Ideally this time would depend on the popularity of the object.
[16:08:57] right
[16:09:55] the case we care about most with grace-mode, under healthy conditions, is relatively hot objects not stalling out for the backend fetch once in a while when they expire. but if they're hot enough to matter, we don't need much grace to catch them in that grace window, either.
[16:28:00] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2818140 (10GWicke) > you could pass it through an anchor that wouldn't make it back server side. I.e. restbase or the API would give this to the client:...
[16:38:41] I will think about it a bit till tomorrow
[16:54:28] bblack, I'm not missing something obvious here, am I? https://gerrit.wikimedia.org/r/#/c/322423/
[17:00:20] (I mean, fundamentally, I'm not asking you to review the whole thing right now)
[17:22:51] Krenair: I <3 u for that
[17:22:58] Those apache configs -- bane of my existence
[17:23:25] I've uploaded a bunch of patches to clean things up
[17:23:35] Krenair: in theory yes we can get rid of those. However, they're also a safety net for now for those wikis (which were deemed private enough that they needed forced HTTPS before we had it for all)
[17:23:58] both simplifying production ones (e.g. see above) and getting rid of the silly beta forks
[17:24:02] Krenair: we thought about removing that safety net before, but it's still possible we could fuck up a varnish commit and accidentally drop the forced-HTTPS redirects
[17:24:32] Krenair: so I think we're blocking on getting over the hurdle of conditional HTTPS-redirects in the Traffic stuff before we remove the internal protections
[17:24:42] hmm. don't we have tests for varnish doing this sort of thing?
[17:24:49] still
[17:24:52] ok
[17:25:04] it's a complex regex and there are exceptions (e.g. stream.wikimedia.org), and it's crazy logic
[17:25:21] the goal is to get to where Traffic's HTTPS redirects are unconditional and plain-HTTP is impossible at that layer
[17:25:33] and then it would be safer to remove internal protections (or even look at XFP)
[17:25:59] The main thing this commit allows for is https://gerrit.wikimedia.org/r/#/c/322425/1
[17:26:16] It may be possible to achieve that while maintaining the existing apache-level TLS enforcement
[17:28:22] Krenair: You can have multiple ServerAlias directives. Probably best to split those one per line for easier diffing when entries are added/removed
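As an illustration of both points (one ServerAlias per line, and an apache-level HTTPS safety net that honors X-Forwarded-Proto from the cache layer), a hypothetical vhost fragment; the hostnames and the exact redirect rule are placeholders, not the production config:

    <VirtualHost *:80>
        ServerName privatewiki.example.org
        # one alias per line for easier diffing
        ServerAlias otherprivatewiki.example.org
        ServerAlias yetanotherwiki.example.org

        RewriteEngine On
        # safety net: force HTTPS unless the request already arrived over TLS
        # at the cache layer (signalled via X-Forwarded-Proto)
        RewriteCond %{HTTP:X-Forwarded-Proto} !=https
        RewriteRule ^/(.*)$ https://%{HTTP_HOST}/$1 [R=301,L,NE]
    </VirtualHost>

The %{HTTP_HOST} substitution here is the same technique mentioned a bit further down for the /upload redirects.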
[17:28:34] probably
[17:30:30] bblack: On a semi-related note, we can land https://gerrit.wikimedia.org/r/#/c/305536/ whenever
[17:30:40] (next to last nail in the bits.wm.o coffin)
[17:31:10] Last nail is the docroot in mw-config, but needs the vhost gone first out of paranoia :)
[17:31:26] oh, hang on
[17:31:37] ostriches: yeah I kept not having time to test mine, and then Krenair proposed an identical one
[17:31:39] I've got a dupe of that: https://gerrit.wikimedia.org/r/#/c/322420/
[17:31:45] sorry
[17:31:47] so I +1'd his hoping he'll actually push his through :)
[17:32:06] I can't merge puppet patches
[17:32:09] Yay dupes, let's abandon one
[17:32:36] they both have +1s and are simple, it's just a question of someone jumping through the verification hoops to make sure it doesn't blow something up
[17:32:57] the domain no longer exists
[17:32:58] we have rules around scary MW apache changes and testing on X-WM-Debug and/or deployment-prep, etc
[17:33:12] deployment-prep's bits is gone
[17:33:19] sure, it's just the black magic of apache config we're worried about, if it ends up affecting some other domain indirectly
[17:33:24] ok
[17:33:32] it needs ops to push it through
[17:34:13] yeah
[17:35:00] Luckily this isn't scary! :)
[17:39:58] After my last adventure into the land of apache configs in a puppet swat window, I don't think we allow that anymore?
[17:41:30] Looks like the next one I can attend without changing my timetable is Tuesday 13th
[17:41:53] assuming there is actually one that day; the deployment calendar doesn't go out that far ahead
[17:43:38] I tried to do https://gerrit.wikimedia.org/r/#/c/321916/ in puppetswat but was told not today :(
[17:43:41] (also trivial)
[17:45:28] so I guess we need custom deployment windows for apache config changes?
[17:46:28] * ostriches mutters something about being Agile :p
[17:46:39] Move fast! Break shit! Don't care!
[17:46:52] Agile™
[17:57:10] ostriches, bblack: okay, take 2: https://gerrit.wikimedia.org/r/#/c/322425/2
[17:58:11] that technique to get %{HTTP_HOST} is currently in use in a bunch of places for /upload redirects
[18:41:09] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 10media-storage: Mediawiki thumbnail requests for 0px should result in http 400 not 500 - https://phabricator.wikimedia.org/T147784#2818492 (10Gilles) a:03Gilles
[18:41:40] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Performance-Team, 10media-storage: Mediawiki thumbnail requests for 0px should result in http 400 not 500 - https://phabricator.wikimedia.org/T147784#2703298 (10Gilles)
[18:43:05] 10Traffic, 06Operations, 10media-storage: Unexplained increase in thumbnail 500s - https://phabricator.wikimedia.org/T147648#2699497 (10akosiaris) According to SoS, 5.3.0 iOS app has been shipped last week, so we should start seeing traffic for 0px requests dropping
[18:50:34] bblack, what would you recommend doing to get apache config commits pushed through?
[19:00:10] 10netops, 10DBA, 06Labs, 10Labs-Infrastructure, and 3 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2818586 (10jcrespo) 05Open>03Resolved a:05jcrespo>03Cmjohnson The servers are workin...
[19:23:08] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations: CentralNotice: Review and update Varnish caching for Special:BannerLoader - https://phabricator.wikimedia.org/T149873#2818653 (10AndyRussG) >>! In T149873#2767345, @aaron wrote: > Another idea is to add a cache-busting pa...
[19:50:50] 10Traffic, 06Operations: restrict upload cache access for private wikis - https://phabricator.wikimedia.org/T129839#2818683 (10fgiunchedi) @bblack we'll need to translate the "dbnames" in `$private_wikis` to actual names used in urls, I don't think there can be a correspondence in the path alone, the hostname...
[20:30:50] Krenair: I don't remember to be honest. I know there's a procedure (for doing some limited regression testing on X-Wikimedia-Debug, I think?), but I rarely have a reason to use it.
[20:31:47] Krenair: https://wikitech.wikimedia.org/wiki/Application_servers has some stuff to test on deploy, but seems manual. I thought there was a script.
[20:32:43] I imagine the testing procedure would involve disabling puppet across the cluster except for the debug machines, merging the patch on the puppetmaster, and applying puppet on the debug machines
[20:32:49] then X-Wikimedia-Debug requests
[20:33:09] then if all is well, slowly re-enable and apply puppet across the cluster
[20:34:45] Seems reasonable ^
[20:34:50] yeah something like that
[20:35:28] I just thought there was a script to do a standard set of X-Wikimedia-Debug fetches and check their outputs or something. I know one was discussed at one point, but that wikitech link doesn't have it
[20:36:07] I think I've heard of there being some tests in the old apache-config repository
[20:36:33] from back when us mere mortals could change apache configs
[20:40:18] bblack, ostriches: https://github.com/wikimedia/operations-apache-config/tree/023f767801cb284cf3cfa88771243cb035c58722/test
[20:41:00] jo.e probably knows best
[20:41:30] 10Traffic, 06Operations, 06Performance-Team, 07Regression: Investigate major HTTP 500 spike since 2016-09-23 - https://phabricator.wikimedia.org/T151078#2818855 (10Krinkle) 05Open>03Resolved a:03Krinkle Looks like that was it. It's coming back down now: {F4828618} Might take a while to return fully...
[21:55:48] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Performance-Team, 10media-storage: Mediawiki thumbnail requests for 0px should result in http 400 not 500 - https://phabricator.wikimedia.org/T147784#2703298 (10Tgr) See T88412 for similar issues in the past.
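A shell sketch of the manual testing flow described above; the X-Wikimedia-Debug header value is a placeholder (the real format and debug host list are on wikitech), and the cluster-wide disable/re-enable steps would be driven by whatever mass-execution tooling is current rather than run host by host:

    # 1. disable puppet on the affected appservers (everywhere except the debug machine)
    sudo puppet agent --disable "testing apache config change"

    # 2. once the patch is merged on the puppetmaster, apply it on the debug machine only
    sudo puppet agent --test

    # 3. exercise the change through the debug machine and eyeball the responses
    #    (header value below is a placeholder; see wikitech for the current format)
    curl -sI -H 'X-Wikimedia-Debug: 1' 'https://en.wikipedia.org/wiki/Main_Page'

    # 4. if all looks well, re-enable and run puppet across the rest of the cluster
    sudo puppet agent --enable && sudo puppet agent --test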