[09:12:53] ATS planning to add IETF QUIC support next year https://lists.apache.org/thread.html/43d6c349de22e77ef5a9bd0eeb5dd1b889e1a0f86132fcd21ec8580b@%3Cusers.trafficserver.apache.org%3E
[10:57:23] this morning I saw https://github.com/ngtcp2/ngtcp2 which looks really promising
[11:08:33] elukey: nice, that's from the author of spdylay/shrpx/nghttp2
[12:19:10] 10Traffic, 10Operations: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3386169 (10faidon) Let's not forget to actually revoke those certificates too. We're getting a little off-topic here though, so perhaps @Jgreen / @CCogdill_WMF / @DKaufma...
[12:27:12] 10Traffic, 10Operations: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3386198 (10Jgreen) 05Open>03Resolved a:03Jgreen >>! In T137161#3386169, @faidon wrote: > Let's not forget to actually revoke those certificates too. We're getting a...
[12:27:14] 10HTTPS, 10Traffic, 10Operations, 10Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#3386203 (10Jgreen)
[12:29:39] 10HTTPS, 10Traffic, 10Operations, 10Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#3386206 (10Jgreen)
[12:29:42] 10Traffic, 10Operations: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3386204 (10Jgreen) 05Resolved>03Open I didn't intentionally close this task...
[15:31:46] 10Traffic, 10Discovery, 10Maps, 10Operations, 10Interactive-Sprint: Rate-limit browsers without referers - https://phabricator.wikimedia.org/T154704#3387243 (10Gehel) a:03ema Significant work has already be done on T163233. @ema is aware of this task and will come back to us with some idea / plan / or...
[15:50:31] gehel: I don't know about your existing discussions, but a couple minor points: (1) obviously, a referer can be added by proxy abuse or whatever to traffic that would normally lack it (2) in the mostly-HTTPS world, it's not common to get a cross-origin referer even with legit traffic, unless the other site is configured to allow a cross-origin referer to be sent...
[15:51:13] sending a referer when there is one is polite, but it increasingly may be non-default, and it's not much real protection against anything persistent
[15:52:05] this initial proposal shows our limited understanding of what is going on :)
[15:52:30] The initial proposal is based on what OSM themselves have implemented, but this might not be the right solution.
[15:52:34] we have a similar issue with UA headers
[15:52:56] I believe we ask or require that our API users set a UA string so we can find them and identify the code, etc
[15:53:30] and it sort of makes mental sense to more-severely limit (or maybe someday even block) UA-less requests on the grounds that UAs should identify themselves and it's not an unreasonable request
[15:54:04] but at the end of the day, a lot of the persons and/or code that come to us UA-less today would just send "User-Agent: ." or whatever to evade the spirit of the rule
[15:54:18] The idea is that we probably want *some* protection so that when the next pokemon go app starts using us as a generic tile server and generates tons of traffic (or any other kind of large traffic not directly related to the wikipedias), we have a way to answer that
[15:54:53] and then we could arms-race on the length of the UA and they'll send "XXXXXXXXXXXXXX", and then I guess we could arms-race on checking the entropy of the UA string and they'll just send us UAs filled with random hash values or UUIDs, perhaps auto-transformed into dictionary word lookups or whatever
[15:55:53] basically, the root of the "UA enforcement" problem is that the space of legitimate UA strings is virtually infinite, and there's no reasonable test of what's legitimate.
[15:56:19] so in spite of our instincts to do something about UA-less requests, there's really not much point
[15:56:27] Ok, understood, there is no way to block a determined attacker here. But there is a way to send a signal back to someone unknowingly abusing our tile servers
[15:57:33] As an example, we had a case a while back (and I think it was a pokemon go app) which generated tons of traffic. Once contacted, the owner did respond fairly quickly. All in good faith...
[15:58:21] I puzzle at the idea that someone can write a complex phone app and not have the basic understanding necessary to realize what they're doing when they abuse our tile server for something unrelated :)
[15:58:52] in any case, test the HTTPS thing too, though
[15:59:08] which HTTPS thing?
[15:59:25] I'm pretty sure that unless the HTTPS site emits special headers to allow it, the browser won't send referer at all when legitimately including one of our tiles from a separate domain's page
[16:00:01] we've had debates about this on our end (the other way around) that have gone public, it's a controversial topic how to set it
[16:00:06] but the point is the default is to not send one
[16:01:14] we put this in our page outputs to change the default:
[16:01:16]
[16:01:42] so when we link to another site and the user browser follows the link, it will send "Referer: en.wikipedia.org" (but not the full article title URL)
[16:02:06] before we added that meta tag, once we had switched to HTTPS our outbound referer headers went dark (default, for privacy/security reasons under https)
[16:02:51] so my point there is legit browsers legitimately referring a link to a tile from a blog post or whatever may legitimately send no referer header
[16:03:10] ok, so the initial request is wrong (and provides a solution, not a problem)
[16:04:08] So the question should probably be: "should we proactively do something about potential abusers (T154717), and if yes, what can we do about it"
[16:04:08] T154717: Maps Dashboard - add notation for increased tile usage - https://phabricator.wikimedia.org/T154717
[16:04:43] Or do we want to drop that completely and handle it on a case by case basis if the problem appears?
[16:08:22] bblack: do you mind if I paste that conversation to the phab task?
[16:08:24] it's really a soft-policy question first I think, and then from there you can dive into "what can we do about policy violation" (which may be nothing)?
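The arms-race point above ([15:54:53]) is easy to demonstrate: any statistical test on User-Agent strings is trivially gamed. A minimal Python sketch, purely illustrative and not anything deployed, showing that a random-UUID UA scores as "high entropy" as a genuine browser UA:

```python
# Illustrative sketch only: why checking the length or "entropy" of a
# User-Agent string can't separate legitimate clients from evasive ones.
import math
import uuid
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character of the string."""
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

real_browser = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/59.0.3071.115 Safari/537.36")   # example real-looking UA
lazy_evasion = "."                  # the "User-Agent: ." case from the chat
padded_evasion = "X" * 40           # passes a length check, fails entropy
random_evasion = str(uuid.uuid4())  # passes length *and* entropy checks

for ua in (real_browser, lazy_evasion, padded_evasion, random_evasion):
    print(f"{shannon_entropy(ua):5.2f} bits/char  {ua[:50]!r}")
```

There is no statistical property that distinguishes the UUID from a legitimate identifier, which is the point being made: the space of valid UAs is effectively infinite.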
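The meta tag itself was stripped from this copy of the log at [16:01:16]; from the surrounding description it is a referrer-policy meta tag that makes the browser send only the origin. A rough Python model of how a browser's Referer decision depends on the page's referrer policy (the policy names are real values from the Referrer Policy spec, but the function is a simplified illustration, not real browser code):

```python
# Simplified model of browser Referer behaviour for a few Referrer-Policy
# values (illustration only; real browser behaviour has more cases).
from urllib.parse import urlsplit

def referer_sent(policy, page_url, target_url):
    """Return the Referer value a browser would send, or None for no header."""
    page = urlsplit(page_url)
    target = urlsplit(target_url)
    origin = f"{page.scheme}://{page.netloc}/"
    if policy == "no-referrer":
        return None
    if policy == "no-referrer-when-downgrade":   # the legacy default
        if page.scheme == "https" and target.scheme == "http":
            return None                          # HTTPS -> HTTP: referer goes dark
        return page_url                          # otherwise: the full URL
    if policy == "origin":
        return origin                            # origin only, never the article path
    raise ValueError(f"policy not modelled: {policy}")

article = "https://en.wikipedia.org/wiki/Some_Article"
print(referer_sent("no-referrer-when-downgrade", article, "http://example.org/"))      # None
print(referer_sent("origin", article, "https://maps.example/tile/1/2/3.png"))          # https://en.wikipedia.org/
```

This is why the chat concludes a Referer requirement is unreliable: what the tile server sees depends entirely on policy choices made by the embedding site, and the privacy-preserving choices legitimately suppress it.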
[16:08:46] we had a meeting about maps policy a long time ago, and now I don't recall where the output went (if any was significant)
[16:09:18] T141815
[16:09:18] T141815: Define tile usage policy - https://phabricator.wikimedia.org/T141815
[16:09:22] the policy-level questions are like: is this considered an open tile resource for the Internet? (in the way the wikis are similarly an open knowledge resource?)
[16:09:54] or is the intent that this only works for supporting embeds in our own sites, and all other usage is policy-violating? Or do we make exceptions for other non-profits? that sort of idea space
[16:10:14] so https://wikimediafoundation.org/wiki/Maps_Terms_of_Use
[16:10:16] the current policy is open for third party usage, with limitations
[16:10:18] yep
[16:11:16] well there's some soft limitations there, like "respect our limited resources"
[16:11:31] and there's some hard ones too, like a UA requirement, and apparently also a UA+Referer requirement
[16:12:09] "Using the service without compliance with any license or copyright terms" is a hard policy limitation, but one that can't be auto-enforced. Humans will have to notice and take action.
[16:13:09] so in the broad strokes, I don't think the spirit of that Maps ToU differs significantly from how the wikis are in practice, except for:
[16:13:12] "If you are developing an application that uses the Wikimedia Maps service, you must provide a valid HTTP User-Agent that includes your application, version, and sufficient information to easily contact you (e.g., your email address)."
[16:13:24] and "Accessing the service without a proper HTTP User-Agent or HTTP referer"
[16:13:27] The initial ticket comes from Paul and his experience working on OSM. They do have a more aggressive approach than we do to limit abuse.
[16:14:21] They probably need this more aggressive approach as their implementation is more resource intensive (tiles generated on the fly, high frequency of refresh, ...).
[16:14:26] As noted above, I'm not sure that the referer requirement makes sense. They may not easily be able to set the meta-tag, or may view the meta-tag as onerous on their own users' privacy (as some in our community have done in reverse)
[16:14:43] (or we might effectively be pushing them to not use HTTPS, which is even worse)
[16:15:04] the UA requirement totally makes sense, but falls into the UA arms race discussed above
[16:15:10] (for technical measures, anyways)
[16:16:09] so overall, I don't think our technical response should really be much different from whatever we continue to evolve for the wikis and other APIs and such
[16:16:36] per-IP ratelimits, and we could perhaps limit harder for short/no-UA just to catch un-malicious cases and stop there
[16:17:31] beyond that it's more like we're going to ignore minor problems and do manual incident response when someone abuses heavily enough to notice, like everything else.
[16:17:43] So basically, that means that we don't want to do anything specific for maps, and we (the maps team) just trust you (traffic team) to do whatever makes sense?
[16:18:37] well, in terms of mechanisms, yes. I think it's reasonable that we talk about maps-specific limits, though (what is reasonable for per-IP limits for the maps service, given how tiles and embedded maps work, etc)
[16:19:27] Makes sense. I'll need to dig a bit into the stats...
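The ToU wording quoted above asks applications to identify themselves and their maintainers. A minimal example of what a compliant client request looks like; the application name, version, contact address and tile URL below are all made up for illustration:

```python
# Example of the kind of identifying User-Agent the Maps ToU quoted above
# asks for. App name, version, URLs and contact address are hypothetical.
import requests

HEADERS = {
    "User-Agent": "MyHikingApp/1.2 (https://example.org/myhikingapp; maps-admin@example.org)"
}

resp = requests.get(
    "https://maps.wikimedia.org/osm-intl/7/63/42.png",  # example tile request
    headers=HEADERS,
    timeout=10,
)
resp.raise_for_status()
```

As the discussion notes, this only helps with the good-faith cases: it gives operators someone to contact, but it is not something that can be meaningfully enforced against a determined evader.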
[16:19:30] in the long run I think the right model with any of our services is to require authentication of a legitimate account for serious levels of use, but even on the other APIs we're not there yet
[16:20:19] (have a reasonably small per-IP ratelimit for single humans, and bypass it when we can confirm authentication, in which case it's a matter of service admins setting non-default applayer ratelimits per-account on request)
[16:20:55] you'd want to manage rate-limiting at the app layer?
[16:21:13] for a case like maps tiles, the authentication code doesn't necessarily have to be part of kartotherian itself. We can just require them to use the centralauth API to log into some wikimedia.org account that will send auth cookies in the direction of maps.
[16:21:31] that looks very much like a cross-cutting concern to me (with specific limits per service, but the same mechanism)
[16:21:48] gehel: I'd want to manage the exception cases at the app layer, per real account.
[16:22:08] e.g. in my ideal future world, an example with made-up stuff:
[16:22:46] 1) For "normal" requests to /w/api.php, we ratelimit at some reasonable human-UA/single-user cutoff per-IP, let's say it's 10/s, at the Varnish layer
[16:23:22] 2) If Varnish sees (and cryptographically verifies) a login-session cookie, it doesn't apply the ratelimit (or at least, applies a much higher maximum cap of some kind)
[16:23:57] 3) In MediaWiki, there's a default ratelimit for authenticated accounts, which might be higher like 50/s (as these users have accounts - we can contact them, or we can disable their account for abuse, etc... they're now a known quantity)
[16:24:31] 4) For exceptional cases where a 3rd party has a legit need for much higher limits we approve of, they can ask some level of administrator to set a higher limit in their account settings.
[16:25:46] (all of this modulo 10 other ideas too, like maybe parallelism makes more sense than reqs/sec, and maybe APIs can return variable costs that affect the limiting in terms of endpoint or per-call expense, etc)
[16:27:29] which is at least simpler than credits like Amazon's API, but similar at the end of the day
[16:27:49] you can't stop the resource abuse when you have an unknown mix of client legitimacy, policies, and rates
[16:28:25] and it's reasonable that if someone wants exceptional resources out of us, they should properly notify us in the more-formal sense of creating a legitimate login for their code to use and asking for an increased limit with justification.
[16:29:15] I have not given it as much thought as you have, but I would have moved the rate-limiting itself completely out of the app layer.
[16:30:09] well, there might be a way to do that as an optimization (e.g. when MW answers authenticated requests, it also sends back response metadata informing the caches of this account's ratelimit)
[16:30:10] For something like maps, the service does not care about users, except for this technical constraint of limiting abuse, so I would much rather have this concern managed externally.
[16:30:35] but ultimately it's up to the application to have any conception of a specific user's authorized increased limits
[16:30:56] I disagree...
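The four-step "made-up stuff" scheme above maps cleanly onto a pair of token buckets. A hypothetical Python sketch of that shape; none of the numbers, function names or data structures reflect real WMF configuration:

```python
# Hypothetical sketch of the four-step scheme described above: a per-IP
# token bucket at the edge, a bypass for verified login sessions, and
# per-account limits (with admin-approved overrides) at the app layer.
import time

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate                # tokens added per second
        self.burst = burst              # bucket capacity
        self.tokens = burst             # start full
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

ANON_PER_IP_RATE = 10                            # step 1: "normal" per-IP cutoff, e.g. 10/s
AUTH_DEFAULT_RATE = 50                           # step 3: default for authenticated accounts
ACCOUNT_OVERRIDES = {"TrustedTileUser": 500}     # step 4: admin-approved per-account bumps

ip_buckets = {}
account_buckets = {}

def edge_allows(client_ip, session_cookie_valid):
    """Steps 1/2: the cache layer only ratelimits unauthenticated traffic."""
    if session_cookie_valid:
        return True          # step 2: defer to the app-layer account limit
    bucket = ip_buckets.setdefault(client_ip, TokenBucket(ANON_PER_IP_RATE, ANON_PER_IP_RATE))
    return bucket.allow()

def app_allows(account):
    """Steps 3/4: per-account limit, with per-account overrides."""
    rate = ACCOUNT_OVERRIDES.get(account, AUTH_DEFAULT_RATE)
    bucket = account_buckets.setdefault(account, TokenBucket(rate, rate))
    return bucket.allow()
```

The split matters for the discussion that follows: the edge only needs to answer "is this an authenticated request or not", while knowledge of *which* account and *what* limit it has stays with the application's owners.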
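Step 2 above says the edge "cryptographically verifies" a login-session cookie before lifting the per-IP limit. One hypothetical way to do that without the edge holding any user database is an HMAC-signed cookie; the cookie format, key handling and account name below are invented for illustration and are not how CentralAuth sessions actually work:

```python
# Hypothetical sketch of step 2: the edge verifies an HMAC-signed session
# cookie so it can exempt authenticated traffic from the per-IP limit
# without holding a user database. Cookie format: "account:expiry:signature".
import hashlib
import hmac
import time

EDGE_SHARED_KEY = b"shared-secret-distributed-out-of-band"  # hypothetical

def sign_session(account, ttl=3600):
    expiry = int(time.time()) + ttl
    payload = f"{account}:{expiry}"
    sig = hmac.new(EDGE_SHARED_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def verify_session(cookie):
    """Return the account name if the cookie verifies and isn't expired, else None."""
    try:
        account, expiry, sig = cookie.rsplit(":", 2)
    except ValueError:
        return None
    payload = f"{account}:{expiry}"
    expected = hmac.new(EDGE_SHARED_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    if int(expiry) < time.time():
        return None
    return account

cookie = sign_session("TrustedTileUser")
print(verify_session(cookie))             # "TrustedTileUser"
print(verify_session(cookie + "tamper"))  # None
```

In the chat's framing, kartotherian could ignore this cookie entirely; only the cache layer needs to be able to tell "verified session" from "anonymous".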
[16:31:11] or it depends on the definition of application
[16:31:39] if the application has no conception of what a user is, or of authorizing specific increases, then there's no way to differentially support both "ratelimit the common low-end case" and also "allow legitimate 3rd-party high traffic usage"
[16:32:13] so you just get to pick one: tell high-volume users that you're fond of that you just don't support that volume, or let everyone go crazy with high ratelimits for all
[16:32:41] you need to have a concept of user and of limit; it does not have to be in the application
[16:33:31] ok, replace "in the application" in the code sense with "managed by the application's owners, who would validate 3rd party requests for increases according to policy, and understand the meaning in the context of their API calls, etc"
[16:34:02] yep, then I agree :)
[16:34:43] but unless you're advocating moving the whole concept of users up into the traffic layer, it doesn't make sense for the varnish caches to host a user database with user metadata either
[16:35:13] it most certainly does not make sense for varnish,
[16:35:33] it probably makes sense for MW (it already has this database of users and some metadata)
[16:36:01] it most probably does not make sense for kartotherian, which has no concept of users
[16:36:42] yes but now you're basically saying "I agree kartotherian logically needs policy capabilities that differentiate users, but I don't feel like implementing users to do it"
[16:37:05] ideally though, of course we don't want every app reimplementing user dbs and auth from scratch
[16:37:22] we'd want kartotherian to just make simple API calls (or send the user to them) for CentralAuth or some such
[16:38:13] Ideally, I don't want kartotherian to know anything about users and/or rate limiting. I want a proxy (in the generic sense) in front of it handling that transparently
[16:38:17] in other words, if you want a high maps limit, you need to create a WMF CentralAuth account to authenticate with, and go log in there, which sets a *.wikimedia.org-level session cookie, which karto perhaps ignores on a technical level, but Varnish can use to find metadata on a user's maps ratelimit
[16:38:54] (in some sense, lots of handwaving in some of those statements)
[16:40:02] That reminds me of discussions about API gateways in another life... Not a happy memory :)
[16:40:39] Time to take a break! Thanks a lot for the discussion, as always!
[16:40:50] I'll try to summarize some of that in the phab ticket...
[16:42:28] TL;DR - there's lots of fancy thoughts to have about the long term, but pragmatically there's not much we can do at all except what we're doing in all the other apps' cases: set fairly high per-IP ratelimits, *maybe* set them lower for empty/super-short UA strings as a warning, and anything more complicated is a year+ away when something or other is better-resourced and we can make loftier plans.
[16:44:39] Thanks!
[19:05:34] bblack: re: codfw switch upgrades tomorrow, cp-wise we only have two text and two upload nodes in row A (cp2001/2004 and cp2002/2005) -- and then one misc and one maps. I was wondering if it'd make more sense to just depool those instead of the whole DC (and routing around it)?
[19:07:45] ema: from just the caches' perspectives, maybe, but there might be some other considerations here
[19:11:23] ema: mostly I'm thinking about the fact that we do have some A/A services now, where if the traffic enters via codfw (or ulsfo->codfw), it will route to apps in codfw
[19:12:09] and those apps could also be impacted and maybe they don't all have separate downtimes for it (or at least, we'd have to disable their codfw sides in the cache backending config too)
[19:13:01] oh yeah, T168462 mentions 'Switch services served from codfw (es: restbase-async, citoid) to be served from eqiad' among the prep steps
[19:13:01] T168462: codfw row A switch upgrade - https://phabricator.wikimedia.org/T168462
[19:13:23] so I thought moving all A/A services to eqiad only was part of the plan, but I might have misinterpreted
[19:13:45] in some quick grepping, for A/A's or otherwise "active in codfw", cache_misc has: noc, eventstreams, pybal_config, wdqs, ores
[19:14:00] and then cache_text has cxserver and mediawiki-debug
[19:14:26] well you could go either way about it
[19:14:42] but it seems simpler to depool codfw traffic from geodns and inter-cache
[19:15:01] yeah actually it does :)
[19:15:07] and then you know nothing is going to try to backend through there unless it has to (e.g. mediawiki-debug, which for whatever reason is codfw-only right now, but also totally not that important in a short outage)
[19:15:14] and I don't think we had any significant impact last time around
[19:16:16] if we don't do the "depool codfw traffic" route and just depool the few nodes, and then ores@codfw or whatever fails due to the network work, ores will fail publicly
[19:16:43] (unless we also go depool all of those on their codfw side too, or are sure it can't impact them based on servers in rows)
[19:17:41] ok, let's depool the whole DC and route around it like we did last time, agreed :)
[19:18:45] bblack: also, today I had my share of fun with puppet/hiera adding a new class for pu https://gerrit.wikimedia.org/r/#/c/361844/
[19:19:06] does it seem ok to you?
[19:21:53] aaand, varnish 4.1.7 is out, they've finally backported the nuke_limit fix https://www.varnish-cache.org/lists/pipermail/varnish-announce/2017-June/000721.html
[19:23:59] changelog here https://github.com/varnishcache/varnish-cache/blob/4.1/doc/changes.rst
[19:24:32] 10HTTPS, 10Traffic, 10MW-1.30-release-notes, 10Operations, and 2 others: Enable HTTPS for swift clients - https://phabricator.wikimedia.org/T160616#3388198 (10aaron) Looks like SwiftRepl is the last element of this task. @fgiunchedi , how difficult does look to add to the replication script? I know you sa...
[19:27:46] +1 conceptually awesome, I have no idea if it works in practice :)
[19:29:06] we'll have to think about nuke_limit. we had a working value there before, and lots of related behavior changed on 3.x->4.x, but since it's been broken in all of 4.x we got no real experience with how the tuning could've changed...
[19:32:17] right, we'll have to see whether, with the limit being honored, it gets harder to nuke and how that affects mailbox lag
[19:34:21] I just dug through some git history to make sure I remembered right
[19:34:34] historically under 3.x, we had an explicit setting (on all clusters) of nuke_limit=300
[19:35:14] and then during the original mailbox lag investigations, we bumped to nuke_limit=1000 (which did nothing) and lru_interval=31 (which may have helped)
[19:36:13] I think the lru_interval thing makes a lot of sense. only if we had plenty of headroom on related issues (which we don't!) would we consider reducing it to maybe very very slightly increase long-term hitrate or something
[19:36:29] but nuke_limit is a tricky thing
[19:37:12] if it's too low, and we have lots of fragmentation and size-spread within a storage, we're going to at least sometimes see outright "failed to allocate storage", which I'm guessing causes 5xx.
[19:37:49] if it's too high, there might be pathological cases where we could've just returned a rare 5xx, and instead we waste lots of locking and time churning on tons of pointless nukes and start screwing everything up (inducing mailbox lag, etc)
[19:38:54] I tend to think we've limited the pathological case considerably with the storage-splitting on the BEs and the absolute size limit (256K I think) on the FEs, in the upload case
[19:38:58] but who knows
[19:39:03] yeah, and presently it's for sure too high (being unbounded!)
[19:39:26] so true heh
[19:39:46] hopefully having any kind of limit improves the situation
[19:39:55] it's gonna be interesting to see, yep
[19:40:19] is there any kind of stats-inference we can make about current nuke rates?
[19:40:35] even like, a counter of total nukes vs total allocations, to derive a long-term average nukes-per-alloc?
[19:41:24] so we do plot the nuked objects rate
[19:41:43] allocator requests and allocator failures too
[19:41:58] although honestly I wonder if they're even in the same units
[19:42:12] allocator requests could be per-chunk and nukes per-object (or both per-object)
[19:42:51] yeah
[19:43:19] nuke_limit seems like it would be nukes-per-chunk I think?
[19:43:31] (as in, when trying to allocate 1 chunk, it can nuke up to X objects trying to find room)
[19:43:46] Units: allocations
[19:43:52] Maximum number of objects we attempt to nuke in order to make space for a object body.
[19:44:06] oh good, allocator requests are in units of allocations, that makes things clear :)
[19:44:23] I think that description of nuke_limit can't quite be right
[19:44:30] maybe if you squint
[19:44:50] because underneath it all, when allocating space for an object, varnish actually allocates chunks, sometimes multiple for one object
[19:45:12] e.g. in the case of a chunked-streaming transfer of cacheable content, it can't pre-allocate a whole object for sure
[19:45:18] but it allocates chunks as the object builds
[19:46:13] I don't know whether, in the upload-like case where we always have CL data up front, it still has the same basic streaming-allocation behavior, or tries to make one giant chunk for the known CL.
[19:46:34] (or maybe it allocates it all upfront, but in multiple chunks... hmm)
[19:47:47] it's possible that the new definition of nuke_limit could fit the claimed description I guess
[19:48:00] if they're tracking the nuke_limit number in a broad context across all chunk allocations for an object as you go
[19:48:12] at any rate, once a chunk has been deallocated one has to deallocate all other chunks for the same object right? So it might make sense to limit on the number of objects rather than chunks I guess
[19:49:16] yeah I'm sure nuke_limit has to be in units of "whole objects nuked, inclusive of all of their underlying chunks"
[19:49:50] the side that seems questionable to me is whether you get nuke_limit nuked objects per allocated new object, or nuke_limit nuked objects per allocated chunk for a new object
[19:50:40] I'd like to think it's the former!
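A toy Python model of the behaviour being debated above, purely illustrative and not a model of Varnish internals beyond what the chat describes: an allocation evicts LRU objects until it fits or nuke_limit is reached, and the long-term nukes-per-allocation ratio proposed at [19:40:35] falls straight out of two counters:

```python
# Toy model: when an allocation doesn't fit, evict least-recently-used
# objects until it does, but give up after nuke_limit evictions
# ("Failed to allocate storage"). It deliberately ignores chunked
# allocation, which is the ambiguity discussed in the chat.
from collections import OrderedDict

class ToyStevedore:
    def __init__(self, size, nuke_limit):
        self.free = size
        self.nuke_limit = nuke_limit
        self.objects = OrderedDict()   # insertion order stands in for LRU order
        self.c_req = 0                 # allocator requests
        self.n_lru_nuked = 0           # objects nuked

    def allocate(self, name, length):
        self.c_req += 1
        nuked = 0
        while self.free < length:
            if not self.objects or nuked >= self.nuke_limit:
                return False           # the "Failed to allocate storage" case
            _, victim_len = self.objects.popitem(last=False)
            self.free += victim_len
            self.n_lru_nuked += 1
            nuked += 1
        self.objects[name] = length
        self.free -= length
        return True

store = ToyStevedore(size=1000, nuke_limit=3)
for i in range(20):
    store.allocate(f"small-{i}", 50)          # fill the storage with small objects
ok = store.allocate("big", 400)               # would need 8 evictions, limit is 3
print(ok, store.n_lru_nuked / store.c_req)    # False, plus the nukes-per-alloc ratio
```

The `n_lru_nuked / c_req` ratio is the long-term average asked about at [19:40:35]; in real varnishstat output the roughly corresponding counters are MAIN.n_lru_nuked and the per-stevedore allocator-request counters, with the caveat from the chat that their units (objects vs. chunks) may not match exactly.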
[19:51:18] if (wrk->strangelove-- <= 0) {
[19:51:27] appropriate for nuking ^
[19:53:03] BTW we're gonna get a different error in varnishlog when nuke limit kicks in https://github.com/thomsonreuters/varnish-cache/commit/a267b2acb8de6a2a9229850e3d5b68ee87192fdc#diff-d702f43a039b6bfda12f579bc0ede008R333
[19:53:24] not the good old 'Failed to allocate storage'
[19:54:49] ok so I'm gonna package 4.1.7 together with the new counters for transient storage added here https://gerrit.wikimedia.org/r/#/c/361845/
[19:55:20] lol at the dr strangelove :)
[19:55:28] yeah :)
[19:58:42] uh, I'm not sure how I ended up on thomsonreuters' fork of varnish (link above) instead of github.com/varnishcache but anyways
[20:00:19] it's interesting to look at the (usually few) commits in someone's fork, they have a file_exists in the std vmod: https://github.com/thomsonreuters/varnish-cache/commit/50c4ba797f445acc8a6af3622ba72cf1c55ee9f9
[20:00:51] so they can do things like: if(file_exists("/tmp/foo")) { synth(503, "Maintenance Mode!"); }
[20:01:11] and it just maps to a raw stat() call, but I guess the OS level should be efficient-enough at caching that
[20:11:57] oh cool
[20:12:07] there doesn't seem to be much in varnish "network" though
[20:12:37] https://github.com/varnishcache/varnish-cache/network
[20:16:08] I was going to say that file-check booleans seem inelegant, but then again cumin-touching a file seems far more elegant for a runtime state than pushing a puppet hieradata change (as we did for the "traffic_shutdown" setting) :)
[20:17:23] puppet hieradata change which changes the vcl!
[20:17:43] so yeah 'touch' not too bad after all :)
[20:20:48] 10Traffic, 10ArchCom-RfC, 10Operations, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3388413 (10GWicke) @anomie, this RFC is primarily about better addressing the...
[22:27:05] 10Traffic, 10Operations: Select or Acquire Address Space for Asia Cache DC - https://phabricator.wikimedia.org/T156256#3389037 (10DFoy) My only reference was the conversion of all the zero partners (around 75) from URL-based whitelisting to IP whitelisting when we adopted HTTPS-only, and that took about 5-6 mo...
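The file_exists() idea above (a one-line VCL check backed by a raw stat()) is the same "runtime flag via touched file" pattern in any language. A tiny Python sketch of it for comparison, with the flag path taken from the chat's own example:

```python
# Same pattern as the file_exists() VCL example above: a runtime flag
# toggled by touching/removing a file, so flipping it needs no config
# deploy. The flag path is just the example path used in the chat.
import os

MAINTENANCE_FLAG = "/tmp/foo"

def in_maintenance_mode():
    # os.path.exists() is a stat() underneath; the kernel's dentry cache
    # keeps this cheap even when checked on every request.
    return os.path.exists(MAINTENANCE_FLAG)

def handle_request():
    if in_maintenance_mode():
        return 503, "Maintenance Mode!"
    return 200, "OK"
```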