[01:51:43] 10netops, 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-codfw: codfw: labtestpuppetmaster2001 switch port configuration - https://phabricator.wikimedia.org/T167321#3343355 (10Papaul) 05Resolved>03Open [04:22:47] 10Traffic, 10DBA, 10Operations, 10Performance-Team, 10Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3343495 (10aaron) @daniel , can you look into the amount of purges happening in ChangeNotification jobs? I don't see an... [08:28:12] 10Traffic, 10DBA, 10Operations: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3343718 (10Marostegui) 05stalled>03Open p:05Triage>03Normal Let's close this for now [08:29:26] 10Traffic, 10DBA, 10Operations: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3343722 (10Marostegui) 05Open>03stalled [09:14:11] 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests, 10Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3214712 (10Ladsgroup) The patch is merged but not deployed so I think we should wait... [09:27:55] FYI I'm almost done with swift 2.10 upgrade in codfw, only two machines left, I published https://gerrit.wikimedia.org/r/#/c/358376/ and https://gerrit.wikimedia.org/r/#/c/358377/ to point varnish upload to codfw, I was thinking of merging that later today and leave it running for say 24h, or longer but thurs/friday I'm out [09:28:16] bblack ema ^ [09:30:32] godog: ack [10:00:53] 10Traffic, 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965#3344106 (10Marostegui) p:05Triage>03Normal [12:59:33] ema: have you looked at API req rates in general? We might have a fair number of legit clients that go faster than that who will get annoyed. [13:00:05] ema: on that subject - last time around we started at 50/sec, and RB wasn't well-cached at the time, and Kiwix was complaining the limitation on RB reqs was going to hobble them or something [13:00:14] (I think that's roughly how I recall it anyways) [13:00:59] of course they can just source from more IPs and hit more caches in parallel (hoping for fair hash distribution). it's complicated :) [13:02:25] it might be a good idea to come up with some general schema for reasonable misspass ratelimits for text (and then later for upload/maps too) [13:02:41] (well, fe misspass, which I think is pretty reasonable to target) [13:03:44] keeping in mind that later in the evolution of this, we might be able to really identify anon vs authenticated, but for now that's easily spoofed even if we make a surface distinction by looking for a valid session cookie regex or whatever [13:04:08] just some hard outer limits to prevent stupid cases from few IPs [13:05:02] (any limit is still an improvement on no limit at all, and we can see how many complain at a given level and bring it down until there's some unacceptable pain, then back off a little?) 
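The "surface distinction" between anonymous and seemingly-authenticated clients mentioned above would amount to something like the sketch below; the cookie-name pattern is an illustrative guess, not the exact production regex, and as noted any client can spoof it.

import re

# Illustrative guess at MediaWiki-style session cookie names (e.g. enwikiSession);
# not the exact pattern production VCL would use.
SESSION_COOKIE = re.compile(r"(^|;\s*)[A-Za-z_]+[Ss]ession=[0-9a-z]+")

def seems_authenticated(cookie_header):
    # Surface check only: trivially spoofable, so at best it exempts well-behaved
    # logged-in traffic from a blunt anonymous rate limit.
    return bool(cookie_header and SESSION_COOKIE.search(cookie_header))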
[13:05:06] bblack: so I was looking at API requests in pivot and it looks like we get peaks of 50 requests per minute from the most active IPs [13:05:31] ok [13:05:53] we can ratelimit misses in general too, I'd just be careful to carve out supposedly-auth'd clients first [13:06:02] (to non-api) [13:06:41] I wonder what the overall top reqs/min cases are, when the path is ignored [13:08:37] http://bit.ly/2sjEtIl [13:08:42] I do like the idea of thinking in "per minute" or longer terms though. it gets us away from caring about short spikes that are inevitable with real UAs [13:10:13] that's 150/min? [13:10:31] up to ~200/min I think [13:11:49] is the filter search in the UI already sorting the IPs to chose by overall rate or something? (how do you find the top ones?) [13:12:01] I believe so! [13:12:39] elukey: ^ is the IP dropdown filter sorted by request rate? [13:13:51] I don't think so [13:13:59] oh [13:13:59] it should be only a collection of IPs [13:14:03] I *think* [13:14:09] they don't have any natural sorting that's obvious [13:14:17] but, it does seem like the top ones in the list tend to have the highest rates [13:14:34] * elukey asks to the Druid masters [13:14:38] it may just be that it's listing them in "most recently seen in the data" order, and thus higher-rate clients are statistically likely to be higher on the list [13:14:38] thanks! [13:15:42] ema: going back to a week-long view and then doing the math, their worst hour this week averaged 15k/hr (4/sec) [13:15:50] (that one client IP that seems the worst) [13:16:35] also, that IP belongs to MSFT, maybe it's a bing crawler [13:17:32] the druid masters are not sure :) [13:17:36] ema: the numa stuff can't be compiler-checked (yet) because the first commit introduces a new fact that the compilers don't have data for, and the rest depend on it [13:17:57] elukey: maybe a better question is: "How do we find the higher-rate client IPs in that UI?" [13:18:15] s/er/est/ :) [13:18:20] bblack: oh, so maybe we could go ahead and merge the first commit then, the one introducing the numa fact [13:18:24] I tried adding a split by IP, that seems to do it, it sorts by hits first [13:18:53] so I split by IP with https://goo.gl/ZypxwT [13:19:03] ah exactly what godog suggested :) [13:19:15] ah there we go :) [13:19:28] nice [13:19:52] so 1h, 1d, 1w views all agree that one MSFT IP is the worst one [13:19:53] yeah pivot is impressive, I hope 'subset' is as good or better [13:20:03] superset! [13:20:12] although it might help to check the next few down the list to see if they're burstier [13:21:11] haha elukey oops [13:21:38] anyways so turning our nominal 25/5s rate into a per-minute [13:21:52] 300/60s? [13:23:31] "300/60s is all the ratelimit anybody would ever need on a computer" [13:24:03] (since it's an MSFT IP after all) [13:24:07] :) [13:24:21] you mean for api.php or as a general limit? [13:24:50] well that webrequest table is I assume at least all of text, if not all combined webrequests? [13:25:12] all combined I think [13:25:41] yeah so, that seems like a safe "upper limit" that should get legitimately violated and complained about rarely then [13:26:09] especially since we're looking at all edge traffic in that graph (cache hits too), and we're only applying the limit to miss/pass cases [13:28:00] true! 
Would be good to have X-Cache-Status as a dimension too, wouldn't it [13:28:29] well in this case it wouldn't necessarily help much, as hit/miss there is for all layers combined, and it's only FE-hit that matters for this case [13:29:06] (I thought about trying to move the limiter to the backends actually, but then we've swapped hashing IPs->cpmachines for hashing URLs->cpmachines, so it doesn't really work fairly from there) [13:36:06] root@cp4021:~# facter -p numa [13:36:08] {"device_to_htset"=>{"eth0"=>[[0, 24], [2, 26], [4, 28], [6, 30], [8, 32], [10, 34], [12, 36], [14, 38], [16, 40], [18, 42], [20, 44], [22, 46]], [13:36:11] ... [13:36:41] I don't recall now how/where/when the compiler updates facts from the nodes though [13:38:47] bblack: manually :) [13:39:18] https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet3-diffs (reachable by searching puppet compiler on wikitech, I've added a redirect) [13:39:22] well I just tested the possibility of "automatically and immediately", and that clearly wasn't the answer [13:39:36] basically run modules/puppet_compiler/files/compiler-update-facts from your local puppet repo [13:39:41] but you need to wait at least 30m [13:39:51] to have a puppet run on all hosts [13:40:51] ok working on that [13:46:06] bblack@alaxel:~/repos/puppet$ modules/puppet_compiler/files/compiler-update-facts [13:46:10] Authentication failed. [13:46:23] (it did prompt to accept the hostkey the first time, but both runs gave that) [13:46:54] probably issues with my labs keying setup [13:47:24] * volans wonders if you have to be added to the project on labs to do that [13:47:41] let me run it now for you and then we can check it later [13:47:41] yeah you do [13:47:48] I can do ssh root@compiler02.puppet3-diffs.eqiad.wmflabs [13:47:50] but not as myself [13:48:03] ok [13:48:05] ok, and it doesn't use root becaue not everyone has root in labs [13:48:10] thanks! [13:48:29] I probably shouldn't have root in labs anyways, at least not directly like that without sudo :P [13:48:48] I think I asked for it once because I got tired of having to ask to be added to $random_project to investigate $random_labs_related_thing [13:49:05] right [13:51:01] bblack: whould be done [13:51:28] it's the first time I run it since the last few modifications done recently but didn't throw errors, so I guess it worked ;) [13:52:50] s/whould/should/ ofc :) [13:57:21] I've also added you to the project, not sure if need to add you also a projectadmin... [14:01:44] I believe default sudo policy is ALL even for reg users but projectadmin is like create/destroy instances etc volans [14:02:22] chasemp: ok, thanks. Also I cannot add people there, it says: You have requested an invalid special page. [14:02:46] where is 'there'? [14:03:13] sudo page? 
[14:03:24] on wikitech, clicking add members to the projectadmin box in https://wikitech.wikimedia.org/wiki/Special:NovaProject [14:03:32] ah ok [14:03:34] for the puppet3-diffs project [14:04:03] volans: that's...new, can you ping andrew on that he has been in there doing things recently iirc [14:04:23] or file a task and I'll bring it up in our meeting today :) [14:04:28] sure, I'll ask him, maybe something has changed and now are managed in a different place :) [14:04:39] sudo yes I think is now in horizon [14:04:54] afaik projectadmin is still on wikitech but maybe one effected teh other unexpectedly [14:05:32] [moving convo elsewhere so I'm not polluting your channel :)] [14:06:43] bblack ema FYI I'm going ahead with testing swift 2.10 in codfw by point varnish traffic there, https://gerrit.wikimedia.org/r/#/c/358376 and https://gerrit.wikimedia.org/r/#/c/358377 [14:06:44] yeah, I'm asking in -labs ;) [14:06:51] thanks for the info chasemp! [14:06:53] essentially the same thing we do for the switchover [14:09:19] (waiting for +1 or an ack tho) [14:09:46] godog: +1 :) [14:10:51] hehe thanks [14:14:35] 10Traffic, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3344847 (10Ottomata) Ooook, I just checked some things. - x_forwarded_for was only being used by legacy pageview code in refinery, which itself... [14:36:21] bblack: updated https://gerrit.wikimedia.org/r/#/c/358583/ using 300/60s. I've updated the wikiScrape regex as it's now called "wikiScrapeFOCUSCloud" apparently [14:38:34] arms race lol [14:40:46] :( [14:41:22] there's also a significant increase in api requests using "-" as the UA [14:42:14] I think "you must specify a UA string" was at some point in the past one of our API requirements on paper, and we talked about blocking or applying much lower limits to those that don' [14:42:18] t [14:42:43] but I think it's moot. for cases that matter, UA strings are always malleable anyways. it's a weak upping of the ante. [14:43:28] https://goo.gl/x2WRPz [14:43:40] :( [14:44:58] oh I forgot to include our reasoning in the commit message, amending [14:45:50] yeah I guess better there than in gerrit comment [14:46:43] the only thing we could be missing here mathematically is fast legitimate tiny bursts [14:47:42] e.g. maybe there's an edge case that certain real browsers, if starting fresh on completely empty cache and you hit the main page and then click through a few articles rapidly, they hit 300 reqs over 4-5 article pageviews in under a minute, even though they'd never *average* that much over a longer term. [14:48:21] you mean our frontend cache being completely empty or the browser cache? [14:48:48] browser [14:48:54] right [14:49:05] I know a full main page load in an empty browser does generate a ton of reqs [14:49:21] but you'd think followup article links would be minimal (just text + images, not all the RL stuff and such) [14:49:29] but few of those should be fe-misses right? [14:49:54] probably [14:49:55] you'd hope! [14:50:17] there's logged-in users to think of though, as those would all be passes [14:50:22] right [14:50:49] so one thing we maybe could/should do for the general case out of an abundance of caution, is skip the seemingly-authenticated requests [14:50:58] agreed [14:51:05] does the tool keep some history and allow to manage bursts over time? 
[14:51:17] if someone actually takes the time to look at our VCL and inject fake session cookies to help an attack (unlikely), we can cross that bridge later [14:52:37] volans: it's a https://en.wikipedia.org/wiki/Token_bucket algorithm . So in this case the bucket has 300 token capacity, and refills perfectly-smoothly at an average rate of 300 new tokens per 60s. [14:53:19] a fresh client (or freshly restarted cache all of whose clients are thus fresh) start with the full 300 credits, so they can ram in 300/s for the first second or whatever if they want. [14:54:10] but once their bucket is empty they'll be limited to 300/60s so 1 new credit allowing 1 new request enters the empty bucket every 200ms [14:55:51] got it [14:56:48] and you think that a clean client would use all the credits in the first few page visits? [14:58:30] not necessarily, it's just a mathematically possibility that our stats don't account for, since we only checked for maximal clients in reqs/min [14:58:43] or really that was just for a day. over the past week, we looked at hourly averages [14:59:25] so the data says nobody went over 300/min for the past 24h, and nobody went over 18k/hr for the past week [15:00:06] but that doesn't gaurantee anything about shorter bursts that stayed under those averages [15:00:24] ideally we should have some higher limits for short bursts and lower for longer term, can we have multiple timeframes? [15:00:39] (although I guess if there were any legit ones over the past day, the burst would have to have been split over two separate minutes to stay under the radar) [15:01:23] well, any token bucket implicitly specifies a "burst" that's allowed over an infinitely-short term, and a single long-term rate [15:01:55] yeah [15:02:04] but we don't really want our long-term rates to be calculated over super-long times, either. because we have lots of client IPs, and it has to maintain data on each one it sees for at least that duration [15:02:45] so e.g. if we did something like the current 300/min, and also another one at say 10k/hr, the memory usage and garbage collecting and general cpu waste to track it goes way up [15:03:14] indeed [15:03:22] if we stick to just a per-minute long term rate, we know we're only wasting token bucket slots on those client IPs that have talked to this cache over the past minute at all. [15:03:32] (roughly) [15:04:50] don't know how hard could be to do it, but one possibility could be to track longer term only the IPs that consume all the credits [15:05:04] during the shorter 1m period [15:05:23] the other risk here is we haven't tried vsthrottle under varnish4 ratelimiting lots clients before (just wikiscrape). We know the equivalent past solution (vmod_tbf + varnish3) ended up leaking the tracking data memory on every vcl reload too, which would eventually oom the machines. We're hoping v4+vsthrottle doesn't, but time will tell. [15:05:53] oh, great :) [15:05:54] volans: that would be fairly trivial I think, it's a good idea, if we want to do that at all. [15:06:47] oh hmm, no it's not trivial to nest them and have them operate independently [15:06:56] but we could nest them in the way that they're not independent [15:07:30] maybe the longer term one could change the initial credit for those IPs in the 1m one? 
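A minimal sketch of the token-bucket behaviour described above, assuming a 300-token bucket refilled smoothly over 60 seconds (one token every 200ms once drained); the class and names are illustrative, not vsthrottle's actual internals.

import time

class TokenBucket:
    # 300/60s: fresh clients start with a full 300-token burst and then refill
    # at 5 tokens per second (one every 200ms).
    def __init__(self, capacity=300, period=60.0):
        self.capacity = capacity
        self.rate = capacity / period
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def is_denied(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return False          # request allowed
        return True               # over the limit; the caller would answer 429

# one bucket per client IP, mirroring the per-IP limiting discussed here
buckets = {}
def denied(client_ip):
    return buckets.setdefault(client_ip, TokenBucket()).is_denied()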
dunno the internals of it so likely throwing random ideas [15:08:06] unfortunately the API is pretty deficient, I talked about that a bit in the ticket [15:08:47] it doesn't allow custom costs (just 1 token per request), and doesn't allow specifying rates+bursts in different units (which is why we can't spec it as "300 burst + 5/sec rate", we have to spec it as "300/60s" to get that effect) [15:09:15] too bad [15:09:23] but I guess we could submit API patches for it [15:09:34] they'd have to be backwards compat for users, so maybe new calls or whatever [15:09:55] e.g. is_denied_cost() that takes an extra cost argument vs current is_denied() [15:10:44] also I'm not a fan of how it garbage collects and some of the built-in constants [15:11:20] are you volounteering to rewrite it? :-P [15:11:20] basically GC happens automatically every 1/1000 reqs, and it GCs the table by killing entries whose last timestamp (last req) was more than the rate's time period ago (so 1 minute ago in this case). [15:12:42] it would seem to be more efficient and kill entries faster to instead GC any entry that has a full bucket (has gone back to the initial state, essentially, which it might do sooner than a minute), but also there's already locking so it could GC on time intervals too. 1/1k is kinda heavy for us [15:13:09] but then I also hate that the locking has a fixed scaling constant in it too [15:13:21] it's easy to nitpick someone else's code though. it clearly "works" :) [15:13:40] rotfl [15:17:39] 10Traffic, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3345019 (10Nuria) a:03Ottomata [15:19:53] bblack: https://gerrit.wikimedia.org/r/#/c/358583/ amended [15:20:15] oh and https://gerrit.wikimedia.org/r/#/c/358057/ would need a review :) [15:24:52] ema: LGTM - I still haven't really vetted the idea that it might be possible for an app response to specify e.g. "200 TLS Redirect" and then somehow land in vcl_synth() and trigger the conditionals... but: [15:25:14] (a) If they do that it's their fault they picked the same crazy non-standard reasons we did or something... [15:26:10] (b) I don't *think* it's possible to get from "fetch a legit backend response that would flesh out resp.reason" to vcl_synth(). it doesn't seem likely to be possible, anyways, but I guess answering such questions definitively is tricky [15:26:37] 10Traffic, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3345058 (10Ottomata) @bblack, just one last double check: are you sure XFF is not useful for ops purposes? We can easily exclude this data fro... [15:30:19] bblack: vcl_synth is in the client side, can you get there from _backend_fetch? I thought not [15:31:44] 10Traffic, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3345069 (10BBlack) No I don't think we need it for non-immediate analysis like this. We still `zero`, `zeronet` and `proxy` in the X-Analytics... [15:32:13] 10Traffic, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3345072 (10Ottomata) Ok! Will merge this today then, thanks. [15:32:42] ema: I don't think so, I just haven't really tried to be sure yet. [16:17:40] XioNoX: hi! 
I am wondering if your [16:18:05] *you'd have some time next quarter to work on the ACLs for the analytics vlan with me [16:18:20] especially the ipv6 part [16:22:15] elukey: yep! [16:23:29] thanks! I am also wondering if we have other vlans like analytics (discovery?) and if we might come up with a common baseline [16:58:05] bblack: not sure if you wanna merge https://gerrit.wikimedia.org/r/#/c/358028/ but it seems good to go :) [17:00:39] 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests, 10Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3345360 (10daniel) [17:01:58] 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests, 10Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3214712 (10daniel) @Ladsgroup you mean, the patch that creates Special:PageData is m... [17:02:17] ema: it does, but the ticket is really just the wiki-users complaining and us responding with a patch. I'd like to get some feedback from someone on the MW side of things here, in case they have something to say like "those dashed language variant subdomains aren't appropriate for mobile redirects because of blah blah blah that you don't understand about how this is structured" or something [17:02:47] ema: can you track down someone who might have that kind of opinion? [17:03:24] 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests, 10Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3345365 (10daniel) p:05Triage>03Normal [17:19:07] bblack: sure [17:54:09] here's the 429 rate since introducing rate-limiting today: https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=2&fullscreen&orgId=1&from=1497366677181&to=1497375064948&var-site=All&var-cache_type=All&var-status_type=4 [17:55:14] ema: hmm that seems fishy doesn't it? [17:55:51] we supposedly set it at a higher rate than the worst client we can see, yet we're emitting ~2.4K/sec 429s? [17:55:52] bblack: not sure, looking at those in pivot, it's pretty much all api.php related (MSFT crawler, one ec2 IP and a local IP) [17:58:26] http://bit.ly/2sYS8CE (interesting to split on user-agent or ip instead of uri path) [18:05:03] numa stuff seems to function as expected [18:05:14] (on a test host with no real traffic) [18:05:16] nice [18:05:56] I doubt we'll see any massive change on existing cache hosts, but might be able to tell more-subtle effects on things like context switches and memory bandwidth, etc [18:09:17] ema: back on ratelimit, anyways, it still seems unexpected to me that we're emitting so many 429s given the high cap. I guess perhaps that Bing IP and other similar ones are just more-bursty than the 1m/1h averages in pivot make us think they are? [18:10:31] we could up the burst capacity by multiplying out a little further, e.g. 1500/300s instead of 300/60s? [18:10:40] but still, it seems odd that we should have to [18:10:46] pivot has all the requests or sampled ones? [18:11:14] ah good point [18:11:22] I assumed all, since all are sent to webrequest. but honestly I don't know for sure that it's not using sampling by the time it reaches that UI... 
[18:12:07] I think e.lukey mentioned that once, let me go through the irc logs [18:12:51] it's 1/128 [18:13:24] oh heh [18:13:29] that explains it [18:13:32] I was just figuring that out from comparing our grafana rates [18:14:31] yeah my rough math from reading numbers off of graphs imperfectly showed a ~100x sample rate [18:14:34] so makes sense [18:14:46] sooooooooo [18:16:02] that bind IP is really hitting hard then [18:16:04] *bing [18:16:14] yup [18:17:48] still, if we only seem to be throttling automated stuff and only a few worst offenders (sounded like that earlier?) I don't think it's unreasonable that they should honor 429 or live with it [18:18:21] agreed, looking at the user agents being rate limited, "Peachy Mediawiki Bot" and "Wikidata Query Service Updater" seem the only "legit" ones [18:19:14] then there's "-" which is hitting the api pretty hard, bingbot and yahoo [18:19:16] I'm surprised those aren't using sessions of some kind [18:20:10] maybe let's double it up as an experiment and see how much that kills off the 429 rates? [18:20:18] agreed [18:20:18] to 10/s (1200/60s)? [18:20:51] we could, perhaps, even have a higher limit for X-Client-IP ~ "^10." ? [18:22:17] bblack: also, 10/s should be 600/60s right? [18:22:23] oh right, in spirit. maybe not quite exactly like that [18:22:26] yeah 600/60 [18:22:40] are the peachy/wikidata ones internal IPs? [18:22:56] labs is in 10/8 too of course [18:23:04] wikidata is, let me check peachy [18:23:51] yup, peachy is also 10/8 [18:24:32] maybe let's start with 10/s and then optimize for local IPs after we see the impact [18:25:38] well [18:25:42] yeah ok [18:26:43] really we shouldn't give labs special rates here, but we should for true internal clients (e.g. if for some reason thumbor or parsoid or whatever makes requests directly back to the cache frontends) [18:27:42] https://gerrit.wikimedia.org/r/#/c/358774/ [18:27:45] we have an existing VCL ACL in wikimedia-common called "wikimedia_trust" [18:27:55] which puts together all our proper ipv4+ipv6 and excludes labs [18:28:02] ah nice [18:28:28] so maybe we can wrap the check in a non-match on that [18:29:06] I don't remember how that changed with v4, we used to have to use ipcast to do that [18:29:10] (match string from X-Client-IP against ACL) [18:30:29] crazy how those bing/yahoo bots don't seem to slow down at all regardless of the high 429 rate we're returning to them [18:34:39] bingbot is braindead [18:35:11] Cf T167465 [18:35:12] T167465: "Key contains invalid characters" when using MultiWriteBagOStuff - https://phabricator.wikimedia.org/T167465 [18:35:23] (which we only discovered because of absolutely stupid queries bing was sending us) [18:54:57] ema: seems to have lopped off about 1/3 of the 429 rate [18:56:25] bblack: yep. Notice that puppet might not have run everywhere yet, I've merged at 18:31 [18:58:43] oh ok [18:59:02] also, I looked in pivot at the 429'd IPs, the only 10/8 ones (or any other of our IPs that I see) are labs ones [18:59:22] so I guess we don't have any truly-internal clients using the external interfaces at high rate [19:00:47] still, probably better to exclude wikimedia_trust anyways [19:00:53] 10Traffic, 10Operations, 10Community-Liaisons (Jul-Sep 2017): Communicate this security change to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3345721 (10Johan) Sorry, it seems like this wasn't picked up by anyone. @BBlack, do you want this to be communicated as so... 
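Sketched in Python purely to show the logic of wrapping the limiter in a non-match against the trusted ACL: X-Client-IP strings that match trusted ranges are exempt, anything unparseable is treated as untrusted, and everyone else goes through the 600/60s per-IP bucket. The network list here is a placeholder, not the real wikimedia_trust ACL.

import ipaddress

# Placeholder documentation ranges standing in for wikimedia_trust, which lists
# production networks and deliberately excludes labs.
TRUSTED_NETS = [ipaddress.ip_network(n) for n in ("203.0.113.0/24", "2001:db8::/32")]

def is_trusted(x_client_ip):
    # Anything that fails to parse as an IP is treated as untrusted.
    try:
        ip = ipaddress.ip_address(x_client_ip)
    except ValueError:
        return False
    return any(ip in net for net in TRUSTED_NETS)

def should_ratelimit(x_client_ip):
    # Only untrusted clients are fed into the 600/60s per-IP token bucket,
    # and only on frontend miss/pass; cache hits never reach this check.
    return not is_trusted(x_client_ip)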
[19:01:27] https://gerrit.wikimedia.org/r/#/c/358779/ [19:02:11] oh wait the fallback to 127.0.0.1 is a bit silly there [19:02:29] hmmm [19:02:37] it would be probably better to fallback to an ip not in wikimedia_trust [19:02:49] yeah probably [19:03:01] 192.168.1.1 might be a decent pick [19:03:07] or something like that, in a private range we don't use [19:04:09] we use 192.0.2.1 elsewhere in vcl, I'd go for that for consistency [19:04:55] in any case, 600/min (or in other words, 10/s with a burst capacity of 600) seems like a very generous rate to offer anonymous clients on non-cache-hits, I'm fine with waiting for complaints to come in at this point. [19:05:05] yeah that makes more sense even (192.0.2.1) [19:06:24] at some point we (the broader organization-level we) will need to do something more coherent about ratelimiting policy, but we can probably hack through things for now [19:08:00] "something more coherent" being that we ask those complaining about 429s to authenticate with a bot account instead of anon, and that we do apply some sort of much-higher sanity-check limit even for auth'd access, and the applayer manages real ratelimits for auth'd access (at a slower default rate + 429 responses, with the option for administrators to up the rates for legitimate needs for specifi [19:08:06] c accounts) [19:08:34] and the traffic layer hashing and signing auth tokens so it can properly validate them and not get spoofed by invalid sessionids [19:09:59] yeah 10/s with burst of 600 really does seem generous enough :) [19:13:05] 10Traffic, 10Operations, 10RESTBase, 10RESTBase-API, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3345748 (10GWicke) To summarize the options using a single domain only: ## Use www.wikimedia.org only ### Pros - Follows the common www. co... [19:15:59] ema, bblack: would this apply to the REST API as well? [19:17:09] gwicke: yeah everything, so yeah it raises the old ticket about kiwix rates and all that again. but... [19:17:20] 10netops, 10Operations: set up a looking glass for WMF ASes - https://phabricator.wikimedia.org/T106056#3345763 (10ayounsi) After looking at the various looking glass, bird-lg seems indeed the best option (doesn't need ssh access to the routers, open-source, user-friendly, supports multiple regions). That's wh... [19:17:37] gwicke: on the other hand, we're only applying it to non-cache-hits, and RB tends to be better-cached these days. [19:18:08] yeah, but RB has explicitly documented limits per entrypoint [19:18:41] oh? [19:18:52] and we should try to be predictable for users [19:19:06] it's hard for users to guess percentages of Varnish cache hits [19:20:06] also, request rate limits have the issue of promoting more expensive compound requests [19:20:06] yeah I know "not hits" doesn't map well to policy. but it maps well to efficiency, and it maps well to what really matters (not overwhelming applayer things by scanning through all of the low-incidence corners of the namespace/api rapidly) [19:20:33] the answer to expenses is to attach costs in the response headers to account for them [19:21:01] so far, we have dealt with that by setting by-entrypoint limits [19:21:10] that you're enforcing in RB? 
[19:21:25] for example, https://en.wikipedia.org/api/rest_v1/#!/Transforms/post_transform_html_to_wikitext_title_revision is limited at 5/s [19:21:37] it's an uncacheable POST entry point [19:21:55] and yes, this is enforced in RB [19:22:06] there is also a global REST API limit that is enforced in RB [19:22:31] my general thoughts on that in the abstract was to ask applications to set a cost header. e.g. we normalize on a "request" usually being reasonably equivalent to other random requests and lightweight, but for heavier requests you send a response header like "X-Cost: 10", which means 10x the normal expense, and that adds 10 requests worth to the ratelimiter as the response flows out. [19:22:34] the other main approach is to use some pacing system [19:23:05] where clients are expected to modulate their request rates / delay the next request based on response headers [19:23:07] the nice thing about the X-Cost approach is while it could be set statically for expensive query types, it could also be dynamically calculated based on the resources consumed on a very particular query or something [19:23:51] there's Retry-After that we could be sending with the 429s, but we're not currently. it's be interesting to see if any clients even honor it. [19:24:27] in my experience, even a fixed request limit is hard to implement for most users [19:24:35] max request concurrency is easier [19:24:44] and arguably what matters more anyway [19:25:09] we've seen serial rates that get way higher than we'd like, it's not hard if they have low latency to us [19:25:40] with low latency but low concurrency, I wouldn't mind someone being quick [19:25:45] gwicke: which status code does RB return if the limit is exceeded? The doc you pasted doesn't mention 429 [19:26:00] concurrency is basically what captures the resource consumption on our end [19:27:00] not really, when the norm is a ton of total request, from a great number of fairly low-rate clients [19:27:02] ema: 429 [19:27:16] it's missing in the doc [19:27:34] ok [19:27:43] a given cache might service whatever it is, 100K distinct users over the period of an hour at fairly low rate, and one fast serial requestor might double the total request-rates through that cache [19:28:56] concurrency limits implicitly capture differences in costs [19:29:14] in any case, there's a lot to be said about how to design things properly in an application-generic way to support this ratelimiting (I touched on all of it but the X-Cost thing slightly above) [19:29:14] and tend to be conceptually easy to implement for users [19:29:42] the driver here is that we never actually get all that done "right" because it's a big design challenge and cuts across a bunch of things [19:29:48] 10netops, 10Operations: set up a looking glass for WMF ASes - https://phabricator.wikimedia.org/T106056#3345819 (10ayounsi) a:03ayounsi [19:29:52] we have been using concurrency limits internally for that reason, and they worked a lot better for us than rates [19:30:10] and back here in the real world, we have abusers that hurt uptime by spamming crazy request rates at us from low counts of IPs [19:30:19] and we have the ability to limit them with a rate-based tbf [19:30:23] so, we are :P [19:30:48] gwicke: what's the max of RB's advertised rates? [19:31:07] (also, how does the concurrency limit map to the rates you were talking about earlier, e.g. 5/s for html->wikitext?) 
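The X-Cost idea floated above, sketched as a token bucket that charges a per-response cost instead of a flat one token per request; a hypothetical illustration only, since vsthrottle has no cost argument today (which is what the earlier is_denied_cost() suggestion was about).

import time

class CostBucket:
    # Same shape as a plain token bucket, but the charge is taken as the response
    # flows out, using whatever cost the application reported.
    def __init__(self, capacity=600, period=60.0):
        self.capacity = capacity
        self.rate = capacity / period
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def charge(self, cost=1):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= cost           # may go negative: expensive work is still accounted
        return self.tokens >= 0       # False -> further requests get 429 until it refills

def account_response(bucket, response_headers):
    # e.g. an expensive transform sets "X-Cost: 10" (statically or computed per request),
    # a cheap lookup sends nothing and costs the default 1.
    return bucket.charge(int(response_headers.get("X-Cost", 1)))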
[19:31:07] yeah, I'm just concerned that our docs will be wildly inaccurate, and it will be very hard for users to figure out how to rate limit their clients [19:31:46] and I'm also concerned that doing a global rate limit penalizes entry points that return one thing at a time over those that return a bundle of X things [19:31:58] which isn't really what we want to encourage [19:32:17] overall advertised rate for the REST API is 200/s [19:32:24] lower for individual entry points [19:32:47] median miss response time is around 20ms [19:32:50] 200/s for a single client is kind of nuts [19:33:27] 10netops, 10Operations: Faulty link between cr2-codfw and cr1-eqdfw - https://phabricator.wikimedia.org/T167261#3345837 (10ayounsi) Circuit identified and troubleshooting started by CyrusOne. [19:33:29] so even at very low concurrency, you can reach 200/s at fairly low concurrency & with limited server side resource consumption [19:33:48] median latency for html->wikitext is a lot higher [19:33:58] 200/s is 1 req every 5ms [19:34:25] so 5/s would be a concurrency of perhaps 3 [19:34:42] do your documented rates talk about how averages or bursting work? is it enforcing it on a second-by-second basis, or does 200/s really mean 12k/minute averaged rate and some kind of burst bucket? [19:34:49] so for a typical content request, 200/s is a concurrency of maybe 6 [19:35:14] assuming low latency network to the client [19:35:25] in any case, your ratelimit isn't really capturing what's happening with cache hits of course [19:35:50] we have some bursting, yes [19:36:08] you can go slightly higher in the first 10s [19:36:11] but then it cuts off [19:36:28] is that how you're calculating rate in general, in a 10s sliding window or something? [19:36:45] anyways, maybe I'm diving off into too many details here [19:36:55] no, it's a decaying counter that is synced every 10s [19:37:18] sync cost is constant, and there is no per-request network requests for rate limiting [19:37:29] that's good, I always thought that sounded silly [19:37:39] (per-request network requests to ratelimiting "service") [19:38:05] my issues with concurrency as a way of accounting things (in general, not about RB): [19:39:12] it's hard to capture concurrency further down in the stack [19:40:25] 1) There's really no clean separation of which limitations are causing the concurrency. e.g. a given client might simply be configured to fire off one request every 5ms in a new thread (or event context or whatever) regardless of the response timing. [19:41:02] in which case their rate will always be 200/s, and their concurrency is determined by network+server latency in handling it [19:41:43] or they might have a client configured to run 4 serial streams of requests that don't send a req till the previous response is processed, in which case as network+server latency drops they approach a concurrency of 4. 
[19:42:06] the latter is a typical implementation strategy [19:42:36] yeah, even more typical is just a single thread running flat out but serially [19:42:36] the former is more typical for browsers, in which case I'm pretty sure some built-in concurrency limits apply as well [19:43:12] (but browser HTTP2 limits are likely to be much too high for our purposes) [19:43:52] but when the "average UA" is sending us something on the order of tens of requests per minute, and one low-latency serial streamer with a concurrency of 1 starts firing off 300 reqs/sec into the same infrastructure, it still does have an outsized cost, even though all clients are still concurrency=1-ish [19:44:25] 300 req/s with concurrency 1 seems hard to achieve [19:44:34] ~3ms [19:44:37] unless it's all cache hits, and they are basically in the same DC [19:44:47] they just have to have a server in virginia at the most popular datacenter in the world :) [19:45:28] and never hit our second varnish layer [19:45:31] always frontend hits [19:45:37] why? [19:45:59] IIRC going through two Varnish layers in eqiad was >10ms [19:46:10] I doubt that [19:46:11] so around 5ms per layer [19:46:37] easy to test [19:47:27] anyway, if they are in the same DC & only hit super cheap caches, then I'm inclined to say "more power to them" [19:47:47] but I'll also be surprised if we ever see that in practice [19:47:47] me too, which is why we're not ratelimiting cache hits in the frontends [19:48:05] we're ratelimiting the requests that make it through the frontend cache, the ones that start costing more [19:49:21] in the short term, would it be possible to configure different limits for different paths? [19:49:48] we can, but I'd rather wait for a user to complain and look at the use case so we know what we're doing a little better [19:50:08] we know that google is using higher rates [19:50:20] the highest rates we saw earlier were from MSFT [19:50:44] (and then some random ec2 instances) [19:50:59] for the action api, REST, or both? [19:51:19] for all requests [19:51:37] google is fetching HTML for each revision, but since they aren't the only ones doing that, they are likely to have some cache hits [19:51:51] edit rates are > 10/s though [19:52:32] that biggest-hitter IP from MSFT has seen peaks up to ~500 reqs/sec [19:52:58] ouch, that's excessive indeed, especially if it's for cache misses [19:53:14] (that's not a peak second either, it's a 500 reqs/sec rate averaged over an hour) [19:53:31] ;/ [19:53:58] the problem I see though with going very low, is that it still won't catch DOS attacks hitting very expensive end points [19:54:14] but it will start to affect users who are doing legitimate things, and are following the documented rate limits [19:54:22] and to put that in context, at esams (our highest-traffic edge), a singular cache_text node processes approximately 3K reqs/sec when averaged out over the whole day. [19:54:50] so someone hitting 500/s through one of those caches is consuming a large fraction of resources for a single client/IP, any way you slice it [19:55:06] those ~3K/sec of "normal" requests are serving millions of users, not one abuser [19:55:16] yeah [19:55:40] the 10/s proposal is per L2 varnish node, or overall? [19:57:34] interesting.. 
it seems that HTTP2 servers can limit the number of concurrent streams [19:58:04] via SETTINGS_MAX_CONCURRENT_STREAMS: https://tools.ietf.org/html/rfc7540#section-6.5.2 [19:58:23] the 10/s rate is per-IP (we happen to have our LVSes hashing on IP, so a given IP always goes to a single cache server) [19:59:23] the rate's a TBF set at 300/1m actually. So they can initially burst up to 300 requests at whatever rate they feel like before it really kicks in and starts limiting their average reqs/min. [19:59:38] sorry 600 reqs per 1 minute :) [19:59:54] we started at 300 but then doubled it [20:00:39] and starting a higher setting wouldn't help with the abusers? [20:01:18] *at a higher setting [20:02:43] well it would help with MSFT, but honestly MSFT's scanning is relatively-benign because they're not hitting expensive endpoints usually, so that gets back to the whole thing about expensive vs lightweight requests [20:03:17] the WikiScrape bot we've been specifically limiting to 5/s to curb it, I'm not sure how high they were before limiting [20:03:22] but they were doing expensive things [20:04:19] yeah, the problem is that the costs differ by several orders of magnitude [20:04:21] the light-vs-heavy request thing is an interesting point, I think the answer is probably a reasonable middle ground, because the bare concept of a request has it's own overhead costs [20:04:46] in general, I think it is a good idea to favor light requests that get answered quickly and tie up less concurrency [20:05:37] but at some lower bound, if a client is having to fire off a boatload of requests to get all the data they need, a batched API that can bundle up more data in a single req->resp cycle is going to be more efficient for all involved, so long as the cycle times aren't getting crazy long where timeouts and BDP limits become a factor in a single request [20:06:03] (meaning in practice, you should still be aiming for single-digit seconds at worst case for a batch response?) 
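The per-client concurrency limiting argued for in this exchange could look roughly like the following; an illustrative sketch with an assumed cap of 6 in-flight requests per client, not how RESTBase actually enforces its limits.

import threading
from collections import defaultdict

MAX_IN_FLIGHT = 6       # assumed per-client concurrency cap, for illustration only
slots = defaultdict(lambda: threading.BoundedSemaphore(MAX_IN_FLIGHT))

def handle(client_ip, do_request):
    sem = slots[client_ip]
    if not sem.acquire(blocking=False):
        return "429 Too Many Requests"    # already at the in-flight cap
    try:
        # Resource use is bounded by concurrent work rather than request rate,
        # so cheap, fast requests are not penalised.
        return do_request()
    finally:
        sem.release()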
[20:06:17] it's a balancing act between granularity & fragmentation [20:06:24] definitely [20:06:37] we are putting a lot of thought into that with the Reading team actually [20:06:46] you could take the lightweight idea to the absurb extreme with requests for text contents with byte-addressing, one resonse byte per request [20:06:56] /wiki/Foo?byte=3517 [20:07:22] ;) [20:08:13] I would love to find some solution that is both easy to understand & implement for users, and avoids the issues around widely differing request costs [20:08:56] it seems hard to achieve that with rates [20:09:48] but we have implementations of those, and I'm not sure how concurrency-based limiting would be implemented in practice [20:09:54] for example [20:10:04] yeah [20:10:37] there's definitely an argument to be made for fragmenting article text too, if most readers never really scroll down on most articles' length [20:11:01] but you certainly don't want a bulk downloader to then have to make several requests per article to get all the text [20:11:07] article text can be streamed as well [20:11:19] browsers are really good at rendering straight from a stream [20:11:27] and these days you can even control how much to stream [20:11:32] I think we do stream for the common case [20:12:01] but trying to hang back and control the streaming by limiting the output rate and hoping the user will disconnect and move on before you send it all seems like it ties up a lot of concurrency at the edge, too [20:12:12] but yeah, there is all kinds of talk about loading reference sections separately for example [20:12:48] we already have the lead / remainder split for mobile, but it's not clear if that is going to stay [20:12:53] or if streaming makes more sense there [20:12:54] what it all boils down to is first modeling all the types of users (not just types of human user patterns + UA behaviors, but also the users that are search engine scrapers and metadata bots, etc) [20:13:11] and mapping out reasonable range of APIs to support them all [20:13:29] yep [20:13:34] you may need to service fragments, whole articles, and reasonable batched chunks of articles all via different endpoints [20:13:41] we found that basically all clients that get the lead, also get the remainder [20:13:55] it was just a way to incrementally load JSON [20:14:38] anyway, this is fairly far OT ;) [20:15:19] re the rate limits, is there a dashboard we could monitor to make sure that no legitimate REST API users are being blocked? [20:15:28] (re "we do stream for the common case" earlier - I mean that the traffic stack in the text cluster is configured to not spool up whole applayer responses. 
it's supposed to forward through on reception of the first bytes from the appserver and stream as fast as it can) [20:16:05] ah, right - tee off straight to client [20:17:42] gwicke: probably pivot is your best bet, and yes some RB reqs are being limited [20:17:46] e.g.: [20:17:49] https://pivot.wikimedia.org/#webrequest/line-chart/2/EQUQLgxg9AqgKgYWAGgN7APYAdgC5gQAWAhgJYB2KwApgB5YBO1Azs6RpbutnsEwGZVyxALbVeAfQlhSY4AF9kwYhBkdmeANroVazsApU6jFmw55uOfETKUlxpq3adLvAUNHj8IhUt3OLZVUA/BkxACVicgBzcSUAEwBXBmI9XgAFOAA2AAkqZjBqKwBaAEZ5CrQgtPwo+KN6RzMXTCsCEkN7RtMQ13x3JWE5fEIwMCwJAtTEjUVq3vnzfAwAN2oGABtiHC6TJyW+4A3SQpSNqhXiDcSvdGZqMDgATywvYABlOHCASQA5AHEjBtqGJyGANLhNMAACwAJgAnMAALpzMAvN7 [20:17:55] vEBwBQVRQ6YJLZTkeq7JoLHjWDp2GjdfYtCl8aiCQaeXjJUgSLCpQi+RYtfyEiAcMC2DSknoHVq8Y6nK4XK43XhQbakKCOMASFalKC8oUiLlMXjkDAMERy3Eo5DaGl7ZqBBkDYBDN5SMLiOYAI0SEAA1g8AIIE/lB3hugBC3r9YCoSRSNWAmQArABZXnME1gQM1fHx9MMaPiun2tqO52SaSyd0JUhMEO1ZgQajEijRHFI5DkRIbDZKQgnMVQvvg5FW9ud7tAA= [20:17:59] good god what an awful URL [20:18:01] hang on [20:18:21] http://bit.ly/2spQeOz [20:18:35] thx! [20:19:16] seems to have spiked up a bit since we started talking, maybe you testing? :) [20:19:29] no, nothing on my end [20:19:36] but still, it's on the rough order of 1/s [20:19:43] (total 429 responses to RB clients) [20:20:00] did you deploy the new limit a couple hours ago? [20:20:54] the higher limit was around then, the lower one was earlier [20:21:24] basically there's no real 429 rate before the 300/min rate went in earlier [20:21:29] all of those are from varnish [20:21:33] 429 rates were close to zero before [20:22:31] checked up to a month back [20:22:37] this is the first 429 spike [20:23:06] yeah [20:25:29] seems like a lot of the RB limiting is on some metrics API [20:25:34] I guess we'll have to do some more querying to figure out if there any actual abuses in there [20:25:43] *are [20:26:06] yeah, metrics is basically uncacheable [20:26:46] glamtools [20:27:22] looks like an internal IP [20:27:45] or at least, the rate of requests to that endpoint from internal IPs dropped [20:27:58] ,http://tools.wmflabs.org/glamtools/glamorous/?mode=category&text=Photographs%20by%20Gianni%20Careddu,wikimedia.org,/api/rest_v1/metrics/pageviews/per-article/wikidata/all-access/user/Q3925699/daily/20170301/20170331,null,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36" [20:28:33] yeah, labs is considered internal for the REST API stats [20:28:35] I think those two URIs are referer,request-uri [20:28:53] the one I'm pasting above is from a non-labs public IP [20:29:16] http://tools.wmflabs.org/glamtools/glamorous/ [20:29:25] hm, okay [20:29:40] I'm guessing that glamorous tool ends up loading JS in the browser that ends up spamming that metrics endpoint [20:30:24] "glamorous tool" ;) [20:30:49] people shouldn't make software preload thousands of things for a user that they may never scroll through :P [20:31:03] I was looking at https://grafana.wikimedia.org/dashboard/db/restbase?panelId=15&fullscreen&orgId=1&from=now-12h&to=now, filtered on the metrics entrypoints [20:31:47] in that graph, only the internal requests to that endpoint had a significant drop in request rates around 4pm [20:32:13] yeah that dropoff in "internal" metrics is exactly when the first ratelimit went in [20:32:35] and the slight bump back up is when we raise it from 300->600/min [20:33:10] the ratelimiter is also excluding "internal" requests, but we specifically didn't include labs in the definition of 
internal [20:33:57] (but we could, it was a decision you could really argue either way for in the ratelimiter case. or at least, for different ones for labs) [20:35:09] so it looks like we are already limiting at least semi-legitimate requests that didn't cause issues so far [20:36:59] that's inevitable, it's a blunt instrument :) [20:37:27] (or I guess you could say that imposing a ratelimit policy redefines "legitimate", depending on your view) [20:37:29] yeah ;/ [20:37:41] but we can still try to minimize the damage [20:38:35] given that we have other rate limiting in place for the REST API, would it be okay to ask for a higher global limit for /api/rest_v1/ ? [20:39:10] the metrics end points have lower per-entrypoint limits as well [20:39:18] at the deepest root though, you're right that the real problem is that different URIs (well maybe even different request to the same URI over time in some cases) have wildly different costs and there's no good way to account for it [20:39:44] but solving that is not a quick fix [20:40:05] nod [20:40:18] so, 200/s for /api/rest_v1/ to match your ratelimit, at least for now in this world? [20:40:43] well the limiters won't exactly behave in the same manner, but we can make it close-ish [20:40:52] https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_per_article_project_access_agent_article_granularity_start_end documents 100/s for the metrics API end point [20:41:03] if you're doing 10s burst windows, we can do the TBF params as 2000/10s [20:41:13] even 100/s would probably be okay, considering that most legitimate uses will have some cache hits [20:41:28] 10Traffic, 10Operations, 10Community-Liaisons (Jul-Sep 2017): Communicate this security change to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3346178 (10Johan) And just to make sure I understand what's to be done: //The problem// IE8-on-XP will no longer be supp... [20:41:36] ok [20:41:56] I think I'll go ahead and exclude labs completely for now too, while we're still figuring things out [20:42:12] we're more-worried about the truly public case anyways. with labs worst case one of us can log into the node and kill it. [20:42:29] yeah [20:45:13] I'll keep my eyes open for ways to perhaps limit concurrency at some point [20:46:58] it's too bad the lvs iphashing hack doesn't descened to the nginx+varnish thread levels too [20:47:17] that would be an interesting way to make concurrency limits implicit or more-efficient to implement, either way [20:47:42] yeah [20:47:46] (if the balancing of the socket traffic to daemon threads/procs hashed on IP as well, so that per-IP tracking didn't have to be something mutex-protected cross-thread) [20:47:57] another could be at the nginx HTTP2 handling level [20:48:40] but, it'd be multiple nginx instances, so there is the communication problem [20:48:50] on connection setup at the very least [20:49:35] unless all connections from a single IP are mapped to the same nginx instance as well [20:53:45] gwicke: done [20:53:55] the metrics reqrate went back up too [20:54:06] thank you! [20:54:51] http://nginx.org/en/docs/http/ngx_http_limit_conn_module.html looks interesting, in that it counts each concurrent HTTP2 request as a "connection" [20:56:15] yeah but we try not to do anything conditional in our nginx. 
because it's acting as a mostly-transparent proxy into varnish, and our analytics pipeline lives behind it [20:57:04] yeah, it would also require IP hashing in LVS to work [20:57:09] (so if nginx were to turn on a small cache or conditionally reject requests with a 4xx because it thinks the URL looks ugly or whatever, analytics doesn't see those at all) [20:57:22] we do hash IPs in LVS [20:58:13] okay [20:58:54] there are also nginx? [20:59:15] it was so easy in the old days, with just about 3 layers… [20:59:20] now there are dozens :P [20:59:33] varnish would be just as good really; the only reason I looked at nginx first is that it could signal that limit to clients via the connection setting, but in practice it doesn't seem likely that it would do so [21:01:17] Platonides: varnish doesn't support TLS termination, so we had to put something in front of it [21:01:30] (for now) [21:01:41] it's pretty efficient though :) [21:03:00] heh I just noticed that Special:BlankPage is uncacheable :P [21:03:12] socat openssl-listen:433,fork tcp-connect:varnish:443 😈 [21:03:50] that would fall over and die :) [21:04:22] the whole ethernet->nginx->varnish pipeline is fairly-well optimized though [21:04:46] and nginx does do some nice things for us along the way to inject more analytics data and such [21:05:15] (and to some degree it funnels down the connection parallelism a bit before things reach varnish, too) [22:32:37] 10netops, 10Operations: Merge AS14907 with AS43281 - https://phabricator.wikimedia.org/T167840#3346480 (10faidon) [22:33:51] 10netops, 10Operations: Merge AS14907 with AS43281 - https://phabricator.wikimedia.org/T167840#3346497 (10faidon) [22:41:05] sigh, names and consistency: [22:41:05] AS14907 Wikimedia Foundation Inc. [22:41:28] AS43821 Wikimedia Foundation, Inc. [22:45:06] 10netops, 10Operations: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841#3346525 (10faidon) [22:47:25] 10netops, 10Operations: Find a new PIM RP IP - https://phabricator.wikimedia.org/T167842#3346542 (10faidon)