[09:10:45] ORES, Machine Learning Platform, Okapi, Operations, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (Joe) Checking in to report that calls from OKAPI have stopped tonight. Thanks @RBrounley_WMF (and the team)! So if we still see the starvatio...
[15:13:02] ORES, Machine Learning Platform, Okapi, Operations, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (calbon) I just had to do another service restart.
[15:15:42] chrisalbon: do you feel the situation is bad enough that we should ratelimit okapi's request rate to ORES at our edge cdn?
[15:17:02] I want to give them some time to fix it, but me manually restarting ORES three times a day isn't sustainable
[15:17:25] there's also a real impact on the communities when ORES is unavailable
[15:17:37] it kind of baffles me how we got into this situation in the first place...
[15:18:45] Not sure frankly, I'm still learning ORES.
[15:19:48] I'm okay with ratelimiting okapi to protect ORES.
[15:20:04] Ideally ORES would rate limit itself but we've passed that point.
[15:20:58] <_joe_> tbh I don't think it's the okapi causing the issues
[15:21:03] <_joe_> let me verify
[15:21:27] <_joe_> also it shouldn't be you chrisalbon restarting it
[15:22:05] lol yes, I agree I shouldn't be restarting it but otherwise it jams up and people can't use the service
[15:23:09] <_joe_> https://logstash.wikimedia.org/goto/9a38d0072add8f21cf0e3198bd092e28
[15:23:22] if it isn't OKAPI then I am really worried because the fix is going to be more complex
[15:23:26] <_joe_> chrisalbon: sure, I am saying we should have better alerting :)
[15:23:36] <_joe_> okapi is not calling a lot
[15:23:53] <_joe_> I would guess the issue is we're serving all the traffic from one DC
[15:24:03] <_joe_> but again, just a guess
[15:24:16] <_joe_> we can try to bump maxclients on the redis side
[15:24:23] <_joe_> to see if that helps avoid this runaway
[15:24:26] halfak: is ORES active/active or active/passive?
[15:24:38] <_joe_> it's active/active :)
[15:24:45] _joe_: okay, we should repool eqiad then
[15:24:50] it was depooled there for the switchover
[15:24:53] (I think)
[15:24:55] yeah
[15:25:16] <_joe_> cdanis: I'm not sure how much of a difference it will make, given everything else is still only running from codfw
[15:25:29] <_joe_> but at least it will funnel some traffic from the public to eqiad
[15:25:32] yeah
[15:25:48] ok, done
[15:25:55] <_joe_> cdanis: are you doing it?
[15:26:02] I did it
[15:26:15] thanks all. I have total love for all you SREs
[15:26:20] <_joe_> so the other thing we can try is to raise maxclients for rdb2003:6380 at runtime
[15:26:41] <_joe_> to rule out root causes
[15:27:09] <_joe_> it's probably a combination of all this stuff causing the issue
[15:27:16] <_joe_> increased traffic from edits
[15:27:23] <_joe_> all traffic to a single dc
[15:27:26] <_joe_> etc etc
[15:27:41] <_joe_> but more to the point, if ORES is unavailable, we should get paged
[15:29:01] <_joe_> now it's 5:30 pm on a friday, but we should pick up this discussion on monday :)
[15:30:53] ha, have a great weekend _joe_
[15:31:15] <_joe_> chrisalbon: I might try to raise maxclients on redis now
[15:31:22] <_joe_> to let you have a nicer weekend :P
[15:31:40] sounds great to me
[15:32:47] <_joe_> heh, we need to change the systemd unit for that :/
[15:33:04] <_joe_> so it's not as easy as I'd hoped
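For context on the maxclients discussion above: Redis silently lowers its effective maxclients when the process cannot open enough file descriptors, and for a systemd-managed instance that ceiling comes from the unit, which is likely why a pure runtime change turned out not to be as easy as hoped. A rough sketch of how one might inspect the relevant values on the host (instance name and port follow the rdb2003:6380 naming in the log; whether CONFIG SET maxclients is accepted at runtime depends on the Redis version, and it is in any case capped by the fd limit):

    # what the running instance will actually accept
    redis-cli -p 6380 config get maxclients
    # file-descriptor ceiling imposed by the systemd unit
    systemctl show redis-instance-tcp_6380.service -p LimitNOFILE
    # version-dependent: try raising it live, bounded by the fd limit above
    redis-cli -p 6380 config set maxclients 10000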
[15:34:17] <_joe_> cdanis: what do you think? we just hand-add an override on the server for now? I don't want to write a puppet patch into /that/ codepath
[15:34:33] <_joe_> (which I'm the author of, so I know how bad it is)
[15:34:50] _joe_: and you are thinking leave puppet disabled for the weekend?
[15:34:59] <_joe_> I don't think it's needed
[15:35:07] <_joe_> if I just drop a systemd override file
[15:35:11] oh!
[15:35:14] yeah that seems fine
[15:37:14] <_joe_> cdanis: can you double check? rdb2003:/etc/systemd/system/redis-instance-tcp_6380.service.d
[15:38:06] _joe_: lgtm
[15:38:57] <_joe_> I will restart the redis cache for ores, that will generate a brief service hiccup
[15:40:45] <_joe_> 127.0.0.1:6380> config get maxclients
[15:40:47] <_joe_> 1) "maxclients"
[15:40:49] <_joe_> 2) "10000"
[15:40:51] <_joe_> success
[15:41:06] <_joe_> this should at least guarantee we don't easily starve out of redis connection spots
[15:44:32] <_joe_> and at the very least, if we have further issues, we can rule out the redis config as a bottleneck
[15:57:33] ORES, Machine Learning Platform, Okapi, Operations, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (Joe) Small status update: in order to grant everyone a quieter weekend (hopefully!), we've repooled eqiad and manually raised the max client...
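The manual fix applied above boils down to a systemd drop-in plus a restart of the redis instance. A minimal sketch of what that could look like, assuming the knob that needed raising is the unit's file-descriptor ceiling rather than a maxclients value templated into the unit itself; the actual drop-in on rdb2003 is not shown in the log, so the file name and value here are illustrative:

    # hypothetical contents of
    # /etc/systemd/system/redis-instance-tcp_6380.service.d/override.conf
    [Service]
    # enough file descriptors for maxclients 10000 plus redis' own overhead
    LimitNOFILE=16384

    # apply and verify, matching the check pasted in the log
    systemctl daemon-reload
    systemctl restart redis-instance-tcp_6380.service
    redis-cli -p 6380 config get maxclients   # expect "10000"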