[01:22:32] 10netops, 10Operations: Merge AS14907 with AS43281 - https://phabricator.wikimedia.org/T167840#3346480 (10BBlack) What are the real pros and cons on this? We could even go in the other direction and have a unique ASN per region/continent. How does this impact future anycasting? Note https://tools.ietf.org/ht...
[07:14:00] 429s seem to be stable at less than 2K/s https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=All&var-status_type=4&from=1497366173660&to=1497404757392
[07:14:40] bblack: merged https://gerrit.wikimedia.org/r/#/c/358028/ as legoktm confirmed it's a good idea, mentioning that it affects quite a lot of projects: https://meta.wikimedia.org/wiki/Special:SiteMatrix
[07:42:02] 10Traffic, 10Wikimedia-Apache-configuration, 10Operations, 10Mobile, 10Patch-For-Review: Accessing zh-classical.wikipedia.org on a mobile device does not redirect to zh-classical.m.wikipedia.org - https://phabricator.wikimedia.org/T167492#3347046 (10ema) 05Open>03Resolved a:03ema As @Legoktm mentio...
[09:15:20] 10Traffic, 10Operations, 10RESTBase-API, 10Patch-For-Review: [feature request] Redirect root API path to docs page - https://phabricator.wikimedia.org/T125226#3347209 (10ema) p:05Triage>03Normal
[09:28:23] bblack: re: T125226, I've worked on https://gerrit.wikimedia.org/r/#/c/306979/. Now that I'm done, though, I wonder if it wouldn't be much better to just remove the trailing slash in https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/cache/text.yaml#L69
[09:28:23] T125226: [feature request] Redirect root API path to docs page - https://phabricator.wikimedia.org/T125226
[11:41:23] 10netops, 10Operations: Merge AS14907 with AS43281 - https://phabricator.wikimedia.org/T167840#3347521 (10akosiaris)
[12:29:01] ema: hypothetically we could have other services that aren't part of rest_v1 and use the /api/foo/ schema, and one of those might end up being e.g. /api/rest_v11/. So basically I don't think it's correct to remove the trailing slash completely, but we could make it optional in the regex.
[12:29:11] ema: as in ^/api/rest_v1/?
[12:29:30] err I guess that doesn't solve it either
[12:29:51] ^/api/rest_v1(/|$)
[12:29:56] maybe that does?
[12:33:13] ema: but also, there are unfortunately several other mentions of /api/rest_v1/ in the cluster VCL too; they probably all need similar updates
[12:38:30] it is the cleaner approach, though
[13:02:26] ema: other thoughts on the 429 business... the RFC says:
[13:02:28] The response representations SHOULD include details explaining the
[13:02:28] condition, and MAY include a Retry-After header indicating how long
[13:02:31] to wait before making a new request
[13:04:03] so on the SHOULD for putting a text response on the page: while it seems standards-better to include at least a snippet of synthetic text/plain (or, even better, a proper error page), I think at the end of the day efficiency has to win here. The fewer response bytes the better.
[13:05:29] but Retry-After might be worth experimenting with. Unfortunately the RFC that defines that header says it can only be integer seconds, so there's no way to say e.g. "Retry-After: 0.1"
[13:06:17] but perhaps we should include a "Retry-After: 1" for whatever UAs might actually pay attention? Maybe some of the ones that don't decrease their rate in response to 429 behave that way because they're not given any retry-after advice?
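That Retry-After suggestion amounts to a one-line change wherever the synthetic 429 is built. A minimal sketch, assuming Varnish 4 semantics and that the 429 is produced as a synthetic response (illustrative only, not the actual production VCL):

```vcl
sub vcl_synth {
    # Hint to well-behaved clients that they may retry after one second.
    # The header only allows integer seconds, hence "1" rather than "0.1".
    if (resp.status == 429) {
        set resp.http.Retry-After = "1";
    }
}
```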
[13:25:57] bblack: re: 429 responses, we're currently returning the text/html error page
[13:27:35] (as of this morning, I've merged https://gerrit.wikimedia.org/r/#/c/358057 which removes the return(deliver) in vcl_synth)
[13:28:50] `Retry-After: 1` seems like a good idea!
[13:32:01] TIL: Retry-After can be returned on 503s too
[13:36:05] yeah, we've had large debates around 5xx in the past that touched on that :)
[13:36:44] the interesting meta-point is that usually, if you're a server generating an explicit 503, you don't have any good idea when whatever failure you encountered will be resolved anyways.
[13:37:13] unless it's "planned downtime" I imagine
[13:37:50] good point, I guess :)
[13:38:25] but that's not really the case we care about at the edge. we'd never just "503 Down for Maint" for the public-facing services, I don't think.
[13:38:54] if our stack sends a 503 to a user, most likely none of the code involved really understands the scope of the problem
[13:39:57] (or, importantly, whether it's really a server failure that will impact virtually all requests to this domain/resource, or a server bug triggered by anomalies in the client's specific and technically-legitimate (not 4xx-worthy) request)
[13:44:55] anyways, all of the past debate on this was centered on how retries should work throughout our stack
[13:45:05] 10Traffic, 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3347979 (10Anomie)
[13:45:35] the idea being that if lots of different layered software components in our stack are retrying requests in various ways (hopefully with a limited retry count and maybe some retry delay?)...
[13:46:03] when something deep is at least temporarily completely borked, user requests get amplified
[13:47:26] e.g. hypothetically (the applayer isn't quite this nested, but to make an extreme case): a user sends 1 req to nginx, which forwards to varnish-fe, to varnish-be, to RB, to parsoid, to MW-API, to memcache. And memcache fails in a way that generates 503s at the bottom layer here
[13:47:55] so there are 6 layers above memcache. even if they all conservatively have a policy of "on 503, retry my request to the next layer once only"
[13:48:26] you end up with 2^6=64 requests fired at memcache for 1x user request during the time that these 503s are happening
[13:49:12] so it causes terrible request storms during times of deep failure
[13:50:49] so our answer to that is basically that nothing should retry 503s (or, similarly, timeouts/hard-errors in even reaching the next service, which also become 503s); they should always pass right back up the stack
[13:51:40] and then we make a lone exception at the varnish-fe level, where it's allowed to retry a 503 exactly once on behalf of the user, as a pragmatic tradeoff to hide rare/intermittent errors from users without huge multiplicative request storm costs further down.
[13:52:14] right
[13:52:38] which seems reasonable :)
[13:54:14] I've prepped https://gerrit.wikimedia.org/r/#/c/358965/ to add Retry-After meanwhile
[13:55:16] so in the retry-storm scenario on "deep" 503s, do you think retry-after could help? Perhaps different values at different layers
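For reference, the varnish-fe "retry a 503 exactly once" exception described above can be expressed along these lines in VCL — a sketch assuming Varnish 4 semantics, not the actual production configuration:

```vcl
sub vcl_backend_response {
    # Lone exception to the "never retry 503s" rule: the frontend may
    # retry once on behalf of the user. bereq.retries counts how many
    # times this backend request has already been retried, so the
    # condition limits us to a single retry per user request.
    if (beresp.status == 503 && bereq.retries == 0) {
        return (retry);
    }
}
```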
[13:56:04] assuming we want to change the retry-once semantic currently in place, and mostly for the sake of discussing alternatives
[13:57:43] no, not really
[13:58:11] so we kind of have to assume that any policy we make is applied at many places throughout an arbitrarily-deep stack of services handling a request
[13:59:10] for any given service, there are two kinds of 503 cases: (1) it received a 503 from a lower layer (options: retry or forward), or (2) it failed to contact the lower layer successfully, and is thus in the position to retry or generate a new 503 for the next consumer upstream.
[14:00:00] all of them start as case 2 somewhere, and then case 1 applies as the 503 traverses back up the stack
[14:03:26] in the case-2 case, the 503-generating service really has no ability to predict the nature of the failure, and thus none of the case-1's that process it on the way up know either
[14:03:57] it could be that in the original case it was a transient one-off thing and an immediate retry would be the best strategy, if you had an omniscient view of the issue
[14:04:03] or not
[14:04:08] but there's no way to know
[15:39:07] bblack: oh, http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1 is a 404 anyways, so I guess https://gerrit.wikimedia.org/r/#/c/306979/ would be the way to go to fix T125226
[15:39:07] T125226: [feature request] Redirect root API path to docs page - https://phabricator.wikimedia.org/T125226
[15:39:20] without trailing slash that is
[15:39:49] looks like cache_upload in codfw isn't particularly happy with the load
[15:40:22] since I didn't spot any horrors with swift 2.10 in codfw and I'm out tomorrow/friday, I'm ok with reverting back to eqiad: https://gerrit.wikimedia.org/r/#/c/358620/ https://gerrit.wikimedia.org/r/#/c/358621/ https://gerrit.wikimedia.org/r/#/c/358622/
[15:40:43] bblack ema ^
[15:41:05] yeah, I've restarted varnish-be on cp2014 and cp2017 because of 503s due to lag
[15:49:34] 10netops, 10Operations: Merge AS14907 with AS43281 - https://phabricator.wikimedia.org/T167840#3348613 (10faidon) >>! In T167840#3346810, @BBlack wrote: > What are the real pros and cons on this? We could even go in the other direction and have a unique ASN per region/continent. How does this impact future an...
[16:11:06] 10Traffic, 10DNS, 10Operations: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3348680 (10ema)
[16:12:28] 10Traffic, 10Analytics, 10Operations, 10Patch-For-Review: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3348682 (10Nuria)
[16:44:14] 10Traffic, 10DNS, 10Operations: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3316252 (10Dzahn) Check this for ALL of the other language prefixes too. We once had them all; it was a long discussion with the owner of .wiki years ago. Then WMF decided to not use...
[16:45:04] 10Traffic, 10DNS, 10Operations: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3348768 (10Dzahn) de.wiki, fr.wiki, it.wiki, etc etc...
[16:45:53] 10Traffic, 10DNS, 10Operations: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3348771 (10Dzahn) T145907 and T88873
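Going back to T125226 for a moment: since the bare API root 404s at RESTBase, the redirect approach discussed above could look roughly like this in VCL. This is a hypothetical sketch, not the content of gerrit change 306979; the internal 752 status and the choice of a 301 are assumptions for illustration:

```vcl
sub vcl_recv {
    # The bare API root (no trailing slash) 404s at the backend, so
    # bounce it to the slashed docs page via a synthetic redirect.
    if (req.url == "/api/rest_v1") {
        return (synth(752, "Redirect to docs"));
    }
}

sub vcl_synth {
    # Turn the internal 752 marker into an actual redirect response.
    if (resp.status == 752) {
        set resp.status = 301;
        set resp.http.Location = "https://" + req.http.Host + "/api/rest_v1/";
        return (deliver);
    }
}
```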
[17:04:04] 10Traffic, 10Analytics, 10Operations: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3348807 (10Nuria)
[17:06:08] 10Traffic, 10Analytics, 10Operations: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1798302 (10Nuria) >as well as the pageview API, which is currently low on backend capacity. Correction: the pageview API has been rebuilt since the last comment and it can handle a LOT...
[17:07:42] 10Traffic, 10Analytics, 10Operations, 10Patch-For-Review: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3190763 (10Nuria) The fact that no requests have been throttled of late in PageviewAPI (see 429 graph below) kind of tells me that PageviewAPI has received too f...
[18:29:30] 10Traffic, 10Operations, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3349120 (10GWicke)
[18:31:38] 10Traffic, 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3349135 (10Jdlrobson)
[18:32:11] 10Traffic, 10Analytics, 10Operations, 10Patch-For-Review: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3349142 (10GWicke)
[18:32:15] 10Traffic, 10Analytics, 10Operations: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3349137 (10GWicke) 05Open>03Resolved a:03GWicke @bblack and myself looked into this yesterday after the deployment of the more aggressive global limits, and found that leg...
[18:41:26] 10Traffic, 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, and 3 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3349181 (10Krinkle)
[18:41:56] 10Traffic, 10ArchCom-RfC, 10Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3349184 (10Krinkle)
[18:42:17] 10Traffic, 10Operations, 10Performance-Team: enwiki Main_Page timeouts - https://phabricator.wikimedia.org/T104225#3349189 (10Krinkle)
[19:48:55] 10Traffic, 10Analytics, 10Operations: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3349563 (10Nuria) >which matches metrics end points explicitly limited at 100/s per client IP. mmm... looking at the pageview API dashboard I can see some lawful traffic (spike...
[20:08:47] 10Traffic, 10ArchCom-RfC, 10Operations, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3349597 (10GWicke)
[20:30:11] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3349706 (10chasemp)
[20:32:04] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3349637 (10Reedy) Can you advise what API queries you're actually making? And any suggestion of magnitude?
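For context on the limits under discussion in T118365/T163233: a per-client-IP limit like the "100/s per client IP" one on the metrics endpoints can be expressed in Varnish with vmod-vsthrottle, roughly as below. The path regex, key prefix, and one-second window are illustrative assumptions, not the production configuration:

```vcl
import vsthrottle;

sub vcl_recv {
    # Deny with a 429 once a client exceeds 100 requests within a
    # 1-second window on the metrics endpoints. client.identity
    # defaults to the client IP address.
    if (req.url ~ "^/api/rest_v1/metrics/" &&
        vsthrottle.is_denied("metrics:" + client.identity, 100, 1s)) {
        return (synth(429, "Too Many Requests"));
    }
}
```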
[20:50:19] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3349772 (10Haiku-narrative) This will be the general format of the vast majority of calls: https://en.wikipedia.org/w/api.php?action=query&format=json&redirects=&continue=&prop=extracts%7...
[20:53:04] TLS 1.3 in Apple's products: https://mailarchive.ietf.org/arch/msg/tls/38hn9mRARfDpNdwVKXSpCFhRXs4
[21:29:07] 10Traffic, 10Discovery, 10Maps, 10Operations: Make maps active / active - https://phabricator.wikimedia.org/T162362#3349967 (10debt) Moving off the sprint board - the Discovery team won't be able to do this work at this time.
[21:35:22] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3350029 (10faidon) What kind of User-Agent will you be using? Please have a look at the [[ https://www.mediawiki.org/wiki/API:Etiquette | API Etiquette ]] and especially the User-Agent sec...
[22:45:53] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3349637 (10GWicke) You could consider using https://en.wikipedia.org/api/rest_v1/#!/Page_content/get_page_summary_title instead, which is a fully cached version of the page summary informa...