[00:10:55] 10Traffic, 10DNS, 10Operations: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3350353 (10Bawolff) Hmm. W.wiki seems to be similar in that we own the domain but the A record points to AWS. It has the additional interesting thing in that its included in the su... [00:17:32] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3350376 (10BBlack) >>! In T167920#3349772, @Haiku-narrative wrote: > we actually haven't been given the numbers yet, but I expect us to handle a few million requests over the course of a c... [00:18:11] 10Traffic, 10DNS, 10Operations: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3350377 (10Bawolff) Oh, i see, the A record points to a page which is hosted by the registrar for .wiki (who happens to use aws- i didnt originally think of navigating to 54.148.61... [04:44:04] 10Traffic, 10Commons, 10Multimedia, 10Operations, and 2 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3350562 (10Dispenser) The ongoing abuse has costs: in disk space 1 TB wasted so far, for administrators blocking dozens of accounts and d... [05:25:17] 10Traffic, 10Commons, 10Multimedia, 10Operations, and 2 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3350573 (10Marostegui) p:05Triage>03Normal [05:25:30] 10Traffic, 10Commons, 10Multimedia, 10Operations, and 2 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3350574 (10zhuyifei1999) >>! In T167400#3350562, @Dispenser wrote: > Then (somehow) configure Varnish to understand WP0 IP ranges It [[h... [06:25:51] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3350596 (10Marostegui) p:05Triage>03Normal It would be helpful also if you could provide the start/end hour of your tests so we can identify those in our graphs. [06:55:03] 10Traffic, 10Commons, 10Multimedia, 10Operations, and 2 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3331800 (10Bawolff) > Wikipedia Zero traffic is tied to IP addresses, not users. So it definitely could be performant. Have MediaWiki set... [09:00:12] 10Wikimedia-Apache-configuration, 10Wikidata, 10Wikimedia-Site-requests, 10Patch-For-Review, and 3 others: [RFC] should wikidata.org/entity/Q12345 do content negotiation, instead of redirecting to wikidata.org/wiki/Special:EntityData/Q36661 first? - https://phabricator.wikimedia.org/T119536#3350745 (10Ladsg... [09:00:30] 10Wikimedia-Apache-configuration, 10Wikidata, 10Wikimedia-Site-requests, 10Patch-For-Review, and 3 others: [RFC] should wikidata.org/entity/Q12345 do content negotiation, instead of redirecting to wikidata.org/wiki/Special:EntityData/Q36661 first? - https://phabricator.wikimedia.org/T119536#1828684 (10Ladsg... [09:20:24] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3350830 (10Haiku-narrative) Run time will be Thursday 1800EST to Friday 0600EST, we're looking into updating our user agent now [09:21:07] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3350832 (10Haiku-narrative) >>! 
In T167920#3350164, @GWicke wrote: > You could consider using https://en.wikipedia.org/api/rest_v1/#!/Page_content/get_page_summary_title instead, which is... [09:33:51] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3350859 (10Marostegui) >>! In T167920#3350830, @Haiku-narrative wrote: > Run time will be Thursday 1800EST to Friday 0600EST, > > we're looking into updating our user agent now For the r... [10:36:20] 10Traffic, 10Analytics, 10Operations: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3350950 (10ema) >>! In T118365#3349563, @Nuria wrote: > mmm... looking at pageview API dashboard I can see some of lawful traffic (spikes we could have handled) seems to have b... [12:30:32] 10Traffic, 10Analytics, 10Operations: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3351181 (10BBlack) That top client appears to be CrossRefEventDataBot from https://www.crossref.org/services/event-data/ , running on a hosted server at Hetzner in DE. [12:35:29] 10Traffic, 10Analytics, 10Operations: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3351184 (10ema) >>! In T118365#3351181, @BBlack wrote: > That top client appears to be CrossRefEventDataBot from https://www.crossref.org/services/event-data/ , running on a hos... [12:40:52] bblack: hey :) should we go ahead with Retry-After? https://gerrit.wikimedia.org/r/#/c/358965/ [12:45:24] ema: yeah, but I would simplify and just set it to "1" universally on the deliver end. Any rate we specify is going to be bigger than 1/s, so even 1 is a sufficient delay penalty that they'll have multiple credits after that. [12:46:06] and honestly, I bet most clients ignore it anyways, but only one way to find out [12:46:26] yeah [12:50:13] the problem we'll continue to run into with complaints about/from "legit" bulk clients is the lack of any good way to set higher limits on a case-by-case basis [12:50:28] although I think we have a range of general answers: [12:51:01] 1) At certain rates not far from where we're limiting now, surely for many purposes bulk data downloads would be better for all involved [12:51:26] 2) If they'd use an authenticated bot account, they'd bypass today's restriction (and its eventual replacement with signed tokens) [12:54:41] 3) It's not that hard to limit themselves to serial (single-concurrency), and/or self-pace if they're already serial. Decide on a max rate that works, like 10/s (1 req per 100ms). On each cycle of their requesting software, record request start and end timestamps. duration=end-start. if duration<100ms, sleep(100ms - duration). 
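A minimal sketch of the serial self-pacing approach described at 12:54:41, assuming Python with the requests library as the client; the 10/s target, the function name, and the User-Agent string are illustrative only, not anything the limiter mandates:

    import time
    import requests  # assumption: any HTTP client works the same way

    MIN_INTERVAL = 0.1  # 100ms per request, i.e. the ~10/s ceiling suggested above

    def fetch_serially(urls):
        """Fetch URLs one at a time, never exceeding ~10 requests/second."""
        for url in urls:
            start = time.monotonic()
            resp = requests.get(url, headers={"User-Agent": "ExampleBot/0.1 (ops@example.org)"})
            duration = time.monotonic() - start
            if duration < MIN_INTERVAL:
                # the request finished early: sleep off the rest of the 100ms window
                time.sleep(MIN_INTERVAL - duration)
            # ... handle resp (status checks, parsing) here ...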
[12:55:43] well really it's probably more efficient to time the entire cycle from start of req#1 to start of req#2, so they can count other processing they're doing into the time window and avoid some unecc delay [12:55:47] but whatever [12:58:50] but honestly, I think most of the violators must be high concurrency [12:59:41] if I benchmark with "ab" from inside of eqiad itself asking for https://en.wikipedia.org/wiki/Special:BlankPage (which seems like a pretty simple case, and is always uncacheable for some stupid reason, so it represents what should be a cheap cache miss/pass case like the limiter hits on) [13:00:12] the mean and median timing for req->resp is ~55ms [13:00:17] and that's from inside our own DC [13:01:05] if you assume most cache misses are hopefuly more expensive than Special:BlankPage and most users have >0ms latency to our edge, it should be rare that someone can sustain a 10/s rate at single concurrency... [13:01:11] yeah and some of those bots running on hetzner are doing > 100 misses/s [13:02:06] so clearly not single concurrency [13:02:48] one has to wonder of course, how the hell it takes us 55ms to answer a single request for an uncacheable Special:BlankPage when the client and all servers are at ~0ms latency from each other :P [13:03:01] but the fact remains, I doubt other misses are faster [13:05:56] in somehow related news, the top User-Agent in the last 6h is "-" [13:06:02] :) [13:07:59] complete lack of a UA string is kind of awful, and it is something we could consider bad etiquette and ratelimit lower, but... [13:08:10] I fear then people will just stick "asdf" in there or whatever [13:08:18] or "-" :) [13:08:32] I think the '-' is from us [13:08:39] (means empty / no header) [13:09:03] ah [13:09:37] right that's probably the varnishkafka fallback [13:12:35] there's just no reasonable technical measure by which we can validate the existence of a sane/useful UA string [13:12:45] FTR the vast majority of requests without UA are to /w/api.php, then a fair amount to / [13:12:57] the hardest case being UAs that copy strings from other UAs (e.g. some python bot claiming to be Firefox) [13:15:12] uh CrossrefEventDataBot does seem to have slowed down significantly after we merged Retry-After [13:15:38] in 10s it hit us 1k times earlier today, now ~150 [13:15:45] excellent! maybe there's hope [13:16:21] and maybe they have concurrency=15 :) [13:16:31] so for that case, we have the RB limit at 100/s with a burst of 1000 [13:17:22] so if they're actually obeying Retry-After with an explicit 1s pause, they'll gain 100 credits during that time. [13:17:59] so they'd fall into a pattern of (1s-pause, N reqs, 1s-pause, N reqs, ...) [13:18:14] where N will be at least 100, plus however much more credit they gain during the time it takes to do the first 100, etc [13:18:24] probably at most 200 per chunk [13:43:25] bblack: bump https://gerrit.wikimedia.org/r/#/c/306979/, it looks like http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1 (without trailing slash) returns a 404 so I think the redirect is needed after all [13:44:33] hmmm [13:46:37] we could also just rewrite the req url in this case, if that's simpler [13:47:02] if (req.url == "/api/rest_v1") { req.url = "/api/rest_v1/" } [13:47:29] but I hate that we're adding any more custom crap [13:48:29] gwicke: should http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1 go to the docs just like http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1/ ? 
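To make the credit math from 13:16-13:18 concrete, here is a toy simulation of a 100/s, burst-1000 token bucket being hit by a client that honours "Retry-After: 1". The 100/s and 1000 figures are the RB limits quoted above; the client's raw speed of 500 req/s is an arbitrary assumption just to show the chunking pattern, and this is not vsthrottle itself:

    RATE = 100.0      # credits regained per second
    BURST = 1000.0    # bucket capacity
    CLIENT_RPS = 500  # hypothetical raw client speed when unthrottled

    def simulate(seconds=20):
        tokens = BURST
        t = 0.0
        step = 1.0 / CLIENT_RPS
        chunk = 0
        while t < seconds:
            if tokens >= 1:
                tokens = min(BURST, tokens - 1 + RATE * step)  # spend 1, regain a little
                chunk += 1
                t += step
            else:
                # 429 + "Retry-After: 1": an obedient client pauses a full second
                print(f"t={t:5.1f}s  served {chunk} requests, now pausing 1s")
                chunk = 0
                tokens = min(BURST, tokens + RATE * 1.0)
                t += 1.0

    simulate()
    # prints one large initial burst, then repeating chunks of ~125 requests per pause,
    # i.e. the "(1s-pause, N reqs)" pattern with N somewhere between 100 and 200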
[13:50:00] gwicke: (while I'm thinking about related hackery - is RB ever planning to start paying attention to the Host header so we can get rid of the rewriting and just send it en.wikipedia.org/api/rest_v1/ ? ) [13:51:37] ditto mediawiki on the X-Subdomain problem. it would be so nice if it would just look at the .m. in the Host header itself :P [13:56:56] ema: in the meantime, since we're still emitting quite a few 429s, I'm still kind of on the fence about raising our default limit further. I mean, those clients were working last week and not abusive enough to cause concern, so it's hard to justify [13:57:17] bblack: yeah fair enough [13:58:08] ema: as gwicke was pointing out the other day, the real problem facing us here is the disparate costs of different API calls (or even the same call with different arguments or at a different time). Any sane rate we set for lightweight API calls will allow problematic levels of traffic to a heavy-cost API. [13:58:17] the bulk of it is /w/api.php, should we perhaps apply a rate similar to RB's to those? [13:59:28] so maybe long-term we need to get costing into vsthrottle, and then start pusing apps to send us cost data (or alternatively or as a default if they don't - we could perhaps calculate a fake cost based on ms response latency from backendmost->applayer...) [14:00:44] ema: +1, let's try bumping /w/api.php up to RB levels and then see what remains? [14:01:12] * ema nods [14:06:39] bblack: https://gerrit.wikimedia.org/r/#/c/359137/ [14:09:58] or is it more efficient to match ~ '^/(api/rest_v1/|w/api.php)' ? [14:12:17] my bet is on the latter :-) [14:12:56] bblack: re docs, you mean https://en.wikipedia.org/api/rest_v1 ? [14:12:58] yeah 1x regex [14:13:10] gwicke: yes [14:13:32] gwicke: that public URL fails right now because we don't send it to RB at all unless a trailing slash is added [14:13:33] if that is easy to do, then yeah that would be great [14:13:48] but if we add it, then http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1 is a 404 anyways [14:14:04] we have had several users tell us that it doesn't work, and all they were missing was the trailing slash [14:14:11] the expectation is certainly pretty strong [14:15:05] yeah, we don't have a redirect at the RB level -- and that wouldn't really work anyway [14:15:21] it doesn't have to be a redirect, the one with a slash isn't either [14:15:39] unless Varnish followed the redirect on behalf of the user, which I recall it doesn't [14:15:44] no it doesn't [14:16:06] really we'd want the user to learn about the slash [14:16:23] so ideally, we'd send a redirect response that sent the client to rest_v1/ [14:16:26] the working https://en.wikipedia.org/api/rest_v1/ resolves to http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1/ and works [14:16:40] that works too, but you can't do it in RB anywhere? [14:17:25] I can check -- but I think we are using relative redirects so far [14:17:38] and this would introduce a need to know about the external path [14:17:45] the direction we've been trying to move forever is to avoid putting hacks in varnish for something that's theoretically what the applayer should do. in other words, move towards varnish being functionally-transparent other than caching. [14:18:22] well, block it on fixing the external path part first? 
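For what it's worth, the app-layer direction floated above (respect the Host header and handle the bare /api/rest_v1 path with a relative redirect, rather than relying on a cache-layer rewrite) could look roughly like this toy WSGI sketch; only the two public paths come from the discussion, everything else is invented:

    # toy WSGI app: answers on its public path, reads the Host header itself,
    # and fixes the missing trailing slash with a relative redirect, so no
    # rewrite hack is needed in front of it
    def app(environ, start_response):
        host = environ.get("HTTP_HOST", "")   # e.g. en.wikipedia.org
        path = environ.get("PATH_INFO", "/")
        if path == "/api/rest_v1":
            # relative Location resolves to /api/rest_v1/ whatever prefix sits in front
            start_response("301 Moved Permanently", [("Location", "rest_v1/")])
            return [b""]
        if path.startswith("/api/rest_v1/"):
            body = ("docs/content for %s%s" % (host, path)).encode()
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [body]
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]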
[14:18:34] which goes back to: [14:18:36] gwicke: (while I'm thinking about related hackery - is RB ever planning to start paying attention to the Host header so we can get rid of the rewriting and just send it en.wikipedia.org/api/rest_v1/ ? ) [14:18:53] yeah, I think that would be a lot better [14:19:05] we definitely thought about it before, but haven't tackled it [14:19:16] in general, as a model for disparate APIs that might live alongside RB, I think it's a better model [14:19:18] I might have written a task already [14:19:49] otherwise we have this cache-rewriting table from /api/rest_v1/ -> rb.wmnet:/$host/v1/ and /api/foo_v3 -> foo.wmnet:/bar/baz/x/, etc [14:19:52] it's confusing [14:20:07] it's better that application code be configurable in its base url and understand Host headers [14:20:36] and when we decide to integrate "foo" as /api/foo_v1/, we configure that root url in the application's config and pass the routed traffic through with URLs unmangled [14:21:47] yeah, or at least /foo_v1/ [14:23:12] don't see a task right now, let me create one [14:23:39] speaking of tasks, I wrote up https://phabricator.wikimedia.org/T167906 as a follow-up on concurrency limiting [14:24:09] yeah I saw. I started to reply, but it's such a complex topic it needs a lot of thought to say something coherent :) [14:24:49] yeah, the description is also really more of a sketch at this point [14:26:17] in the end, we probably need both cost-aware ratelimiting and concurrency limits. and both have some sane and fairly low values for anonymous traffic, and let a lot more through for authenticated, where it's expected the application layer has configurable per-account limits for authenticated traffic. [14:27:24] cost-aware ratelimiting solves a whole lot of problems in one fell swoop, but the costs won't be known in all cases until the response is already generated (at which point we'd always send the response we already did the work for, but add the cost into the ratelimiter to potentially impact the next request) [14:27:41] yeah, that's what I like about concurrency limiting [14:27:53] but without concurrency controls, that doesn't leave any way to stop an initial burst that opens 100x parallel requests for very heavy-cost responses [14:27:59] it covers the cost relatively well, without being complex [14:28:32] we don't have a lot of end points that take a long time, but are not using significant resources on our end [14:28:41] my issue with doing just concurrency (without also cost-based ratelimiting) is that even at concurrency=1, a single client can send an abusive load of requests to some API endpoints. 
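A rough sketch of the "charge the cost after the fact" idea from 14:27:24; since (as noted later in this conversation) vsthrottle has no cost API yet, this is only a generic token-bucket illustration with invented names and illustrative rate values:

    import time
    from collections import defaultdict

    RATE = 100.0    # credits regained per second (illustrative)
    BURST = 1000.0  # bucket size (illustrative)

    _buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

    def is_denied(key):
        """Checked *before* serving; each request pays a provisional cost of 1."""
        b = _buckets[key]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
        b["ts"] = now
        if b["tokens"] < 1:
            return True       # over the limit -> 429
        b["tokens"] -= 1
        return False

    def charge(key, cost):
        """Called once the response exists and its real cost is known. The
        response we already did the work for is still sent; the extra cost
        just drains credits for the *next* requests (bucket can go negative)."""
        _buckets[key]["tokens"] -= max(0.0, cost - 1.0)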
[14:28:59] the vast majority really ties up a thread or disk io somewhere while the request is being processed [14:30:10] it could make sense to start with concurrency limit as a baseline, and then further cost / rate limit specific end points [14:30:33] the alternative is concurrency-limiting-only and requiring that no single API request can ever cost much (all APIs are reasonably-cheap enough that they're not overwhelming when faced with a low-latency continuous serial caller) [14:30:43] or fix those endpoints to not be so expensive any more [14:30:49] but eliminating all heavy API endpoints seems much further off than just getting them to return cost info [14:31:31] or as I was saying earlier, even if we can't get them to do code patches that return cost info, we could have varnish estimate cost based on response latency from the applayer to the cache or something [14:31:45] (or bytes transfered, or some other magic factors) [14:31:47] btw, re: authenticated bots, I was talking to Dan in Vienna [14:32:05] they apparently have a pretty big project in Analytics about recognizing bot traffic that doesn't have "bot" in the UA string [14:32:13] and they're thinking of building machine learning models for it [14:32:24] to recognize bot-like behaviors, rather than human behavior [14:32:45] they want this to be able to exclude big spikes in traffic coming from bots from traffic stats etc. [14:32:58] and I was pointing out that it's probably the wrong layer and solution to solve this [14:33:04] it would seem much simpler to say "if you have a bot that's putting heavy enough load on our systems to trip our standard anonymous limiters, you need to register and use an account for it to get a higher limit" [14:33:12] yeah, exactly that [14:33:16] was what I was saying [14:34:09] yeah, IP-based is always limited, with NATs etc [14:34:39] and machine learning models sounds like a too complicated solution for what's otherwise a simple enough problem to me [14:34:54] NATs with large client counts is really an intractable problem for the anonymous limiting by-client-IP. But I don't think there are that many cases that matter, and we can try to identify and exclude them [14:34:54] hey, it's hot right now [14:35:33] most of the large-client-counts-behind-one-IP aren't just NATs, they're proxying too (e.g. 
Opera), so we can look to our trusted proxy lists and XFF processing to fix those [14:35:41] (higher limits for known trusted proxies) [14:35:41] yeah, fortunately the AOL proxies are long gone [14:36:30] if we had some known IPs for true NAT gateways with large client counts, we could also put them in the trusted-proxy list similarly actually [14:37:13] and just be sure the XFF-processing code understands how to handle them a little differently (don't trust their XFF data, do set X-Trusted-Proxy=FooCarrier + X-Client-IP= [14:37:17] ) [14:45:52] 10Traffic, 10HyperSwitch, 10Operations, 10RESTBase-API, 10Services (next): Respect host header in RESTBase, and redirect v1 to v1/ - https://phabricator.wikimedia.org/T167972#3351561 (10GWicke) [14:46:12] 10Traffic, 10HyperSwitch, 10Operations, 10RESTBase-API, 10Services (next): Respect host header in RESTBase, and redirect /rest_v1 to /rest_v1/ - https://phabricator.wikimedia.org/T167972#3351574 (10GWicke) [14:50:15] ^ don't forget the /api/ part :) Unless public+internal URLs are identical, we'll always have the cache layer doing unecessary rewrites, and we'll always have confusion when looking at logs and/or errors and/or traces as the URL changes from layer to layer [14:50:57] (all this also applies to MediaWiki's mobile subdomain problem, where varnish rewrites en.m.wikipedia.org to en.wikipedia.org + "X-Subdomain: m") [14:57:14] bblack: should we? https://gerrit.wikimedia.org/r/#/c/359137/ [14:58:42] bblack: re tend to use relative redirects wherever possible, so if we just redirect to rest_v1/, then that should work regardless of the external prefix path [14:58:47] *we tend [15:03:58] ema: yes, please :) [15:12:03] bblack: merged and changes applied to esams, you can tell the difference just by staring at varnishncsa :) [15:12:17] there's still one crazy client, python-requests/2.10.0 [15:12:36] on cp3033 [15:17:58] looking much better https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=2&fullscreen&orgId=1&from=1497536737198&to=1497539814710&var-site=esams&var-cache_type=text&var-status_type=4 [15:36:23] looking at the non-api ones now [15:43:13] a few yahoo IPs, several MSN bot IPs, a few googleusercontent IPs, some random hosted servers, some strange IP from Belarus, [15:43:30] that about covers the top 10-20-ish IPs that are hitting 429s on non-API URLs [15:44:43] 10Traffic, 10Analytics, 10Operations: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3351748 (10Nuria) Thanks for the prompt response, when the number of changes I did not see when these took effect, it is true that we do not see on our end 429s at all times, bu... [15:44:45] it's hard to find any that look like end-user IPs [15:46:03] 10Traffic, 10Analytics, 10Operations: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3351751 (10Nuria) If you look at 404s however, looks like the throttling had a positive effect on removing "garbaage-y" traffic. [15:46:13] if we had automatic latency costing, we could do a sort of loose concurrency control with it [15:47:23] what's automatic latency costing? The applayer letting us know the "price" of a certain resource? 
[15:48:00] well, us guessing the price by looking at the latency of the final varnish-be->applayer req->resp loop [15:48:23] and establishing some kind of "normal" request latency for cost=1, then charging higher costs the longer it takes the applayer to respond [15:48:45] right, the other one wouldn't be automatic :) [15:49:28] I think actually o.ri hacked in a response header for mediawiki that covers something similar already [15:50:15] < backend-timing: D=76990 t=1497535402884266 [15:50:26] I think that means MW spent 76ms generating the response [15:52:35] anyways, vsthrottle doesn't have a cost API yet, but it's easy to add [15:54:30] so, looking at non-API 429s, the worst offender over the past 24h was hitting per-minute rate peaks of ~15K, which is ~250/sec [15:54:50] (of 429s, so we'd have to add in the 10/s of request that made it through as well) [15:55:33] although that particular offender dropped back off sharply around 07:00 UTC today? [15:55:56] maybe in response to some of our changes [15:55:59] bblack: can you send me the IP in a /msg? [15:57:31] oh I'm failing to exclude RB URLs too [15:57:57] but I don't think it matters in this case [15:58:10] and that IP is only hitting QNNNNN (wikidata) URLs heh [15:58:31] excluding that one, the next common ones are closer to 12k/min aka 200/sec [15:58:48] and /wiki/Wikidata:Main_Page [15:59:49] there there's a decent batch of several high-rate clients that are closer to ~42/sec in 1-min avgs [15:59:51] Weird, we had a large spike of inbound traffic: https://librenms.wikimedia.org/graphs/to=1497542100/id=11512%2C11511%2C11600%2C11144%2C11105%2C7140%2C6820%2C4109%2C5944%2C4151%2C6862%2C6821%2C7139%2C11100%2C11146%2C4110%2C7224%2C6841%2C8288%2C11510%2C11623%2C11142%2C8209%2C7199/type=multiport_bits_separate/from=1497455700/ [15:59:51] Which matches, with this spike of inbound icmp to esams lvs: https://grafana.wikimedia.org/dashboard/db/network-performances-global?orgId=1&from=now-24h&to=now [16:00:25] XioNoX: spoof reflection stuff? [16:00:57] no idea [16:03:39] 10netops, 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-codfw: codfw: labtestpuppetmaster2001 switch port configuration - https://phabricator.wikimedia.org/T167321#3324750 (10ayounsi) The switch was showing "Carrier transitions" errors on that interface and no inbound traffic. We tried changing the... [16:05:13] so I've tried a few random articles (looking for short/simple articles) on enwiki [16:05:40] and I can't find any that have apachebench timings faster than the ~55ms observed for Special:BlankPage when queried from the local DC straight to the appserver over HTTP [16:05:56] most real articles are more like 70-100ms for reasonable-length/complexity ones [16:06:50] if we assume there's commonly a ~50ms timing floor for non-API cache misses [16:07:43] then even a near-zero-latency remote client (one hosted elsewhere in eqiad, basically), a serial client shouldn't be able to break about 20/s or 1200/60s [16:08:04] to get significantly higher than that, they're probably using concurrency [16:09:33] I checked a few wikidata QNNNNN too, same basic floor seems to apply [16:10:34] so maybe we should double up our 600/60s limit for non-API, and then we can say with fair confidence that we're probably only limiting client IPs that are attempting concurrency>1 [16:14:08] (well concurrency>1 and commonly cache misses, that is. 
obviously we're still doing nothing for cache hits, which is fine) [16:51:18] bblack: What're your thoughts on https://gerrit.wikimedia.org/r/#/c/341729/? [16:51:27] (it's the last remaining use of non-module files/*) [16:52:02] We talked before and I fixed up the utils/create_ecdsa_cert script [16:53:19] 10netops, 10Operations: Merge AS14907 with AS43281 - https://phabricator.wikimedia.org/T167840#3352086 (10ayounsi) How I understand it, the increased complexity of running two "networks" outweighs its advantages. And our customers are networks we manage and have control over. A single ASNs (using confederation... [17:11:12] RainbowSprinkles: it seems like the right thing to do. We should compiler-check that it's still a no-op in critical machines right quick (I can do that in a few) [17:11:39] Ah good point, didn't think of that [17:12:09] or if you want to run it, the important thing is check a cache just in case [17:12:13] e.g. cp1065.eqiad.wmnet [17:13:47] Doing one in dallas and dc each [17:14:14] Whoops, wrong change! [17:14:15] Hah. [17:14:18] That won't show anything [17:16:22] Posted on the change too: https://puppet-compiler.wmflabs.org/6789/ - just manifest changes, no actual on-disk changes to files [17:21:03] ok [17:21:56] the only paranoid thing I can think, is we should probably disable puppet on the caches before merging and then wait a few minutes to run them, to avoid puppetfails due to race conditions [17:22:23] (puppet has this notorious race condition where when you move files and move their references in the same commit, some clients who run right after the merge get an inconsistent view and fail) [17:22:24] Should we grab an official deploy window then? [17:22:46] I can take a few minutes and try it now [17:23:03] Ok. I can't do much other than just watch on this one :) [17:38:51] 10netops, 10Operations: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841#3352281 (10ayounsi) Best practices for confederation is to limit IGP within each confederation. The main advantages are reducing the blast radius if OSPF miss-behaves, and increasing convergence speed.... [18:34:41] 10Traffic, 10Operations: uploads.wm.o commons archive 20170615014039!Adsalm.webm visible despite file deleted on Commons - https://phabricator.wikimedia.org/T168002#3352611 (10zhuyifei1999) [18:59:42] 10Traffic, 10Operations: uploads.wm.o commons archive 20170615014039!Adsalm.webm visible despite file deleted on Commons - https://phabricator.wikimedia.org/T168002#3352693 (10zhuyifei1999) Also T129845#3351290: https://upload.wikimedia.org/wikipedia/commons/archive/3/3e/20170615122111%2120170615014039%21Youne... [19:01:49] paravoid, CyrusOne said they cleared the circuit between cr2-codfw and eqdfw, I'm going to remove the preference over the other circuit and see if alarms come back [19:15:56] 10Traffic, 10Discovery, 10Maps, 10Operations, 10Interactive-Sprint: Rate-limit browsers without referers - https://phabricator.wikimedia.org/T154704#2921080 (10debt) @Gehel will chat with the #traffic team about this. [19:23:59] 10Traffic, 10Operations, 10Interactive-Sprint, 10Maps (Kartographer), 10Regression: Map tiles load way slower than before - https://phabricator.wikimedia.org/T167046#3352835 (10debt) 05Open>03Invalid We haven't been able to reproduce this, closing. 
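Circling back to the latency-costing idea from ~15:48: a sketch of turning the backend-timing D= field (microseconds, so the D=76990 quoted at 15:50 is the ~76ms read above) into a cost factor that could feed a limiter like the charge() sketch earlier. The 50ms baseline is only the Special:BlankPage miss floor observed in this conversation, not a decided value:

    import re

    BASELINE_US = 50_000.0  # ~50ms, the rough cache-miss floor discussed above

    def cost_from_backend_timing(header_value, baseline_us=BASELINE_US):
        """Map 'D=76990 t=...' to a cost: 1.0 for a baseline-speed response,
        proportionally more for slower (presumably more expensive) ones."""
        m = re.search(r"D=(\d+)", header_value or "")
        if not m:
            return 1.0        # header missing: fall back to a flat cost of 1
        return max(1.0, int(m.group(1)) / baseline_us)

    # cost_from_backend_timing("D=76990 t=1497535402884266") -> ~1.54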
[19:59:18] 10netops, 10Operations: Find a new PIM RP IP - https://phabricator.wikimedia.org/T167842#3352909 (10ayounsi) I don't have a great visibility on how our ip space is divided, but looking at DNS it looks like for example 208.80.154.200 could be a good choice. Close to the eqiad loopback IPs. Once we have an IP, I... [20:05:47] 10netops, 10Operations: Faulty link between cr2-codfw and cr1-eqdfw - https://phabricator.wikimedia.org/T167261#3352918 (10ayounsi) 05Open>03Resolved Circuit has been back to the same levels of traffic as before the issue for 1h, no more issues reported in Icinga or Smokeping. [20:16:29] 10netops, 10Operations: Find a new PIM RP IP - https://phabricator.wikimedia.org/T167842#3346542 (10BBlack) Multicast has its uses in general. Even if we kill HTCP another use may pop up. I did quick survey to try to find active uses. Filtering for just v4, removing the standard references you'd see to 224.... [20:43:23] 10Traffic, 10MobileFrontend, 10Operations: Remove disableImages handling from VCL - https://phabricator.wikimedia.org/T168013#3352978 (10MaxSem) [20:44:39] bblack: until we have a proper IPAM, multicast IPs could go on https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations [22:25:17] 10Traffic, 10MobileFrontend, 10Operations, 10Reading-Web-Backlog: Remove disableImages handling from VCL - https://phabricator.wikimedia.org/T168013#3353345 (10Jdlrobson)