[13:10:35] 10Scoring-platform-team-Backlog, 10Wikilabels: Wikilabels translations not updated and cannot load old worksets - https://phabricator.wikimedia.org/T174343#3558487 (104shadoww)
[14:01:20] O/ any thing i can do for yall? Merging and such?
[14:06:34] o/
[14:06:36] * halfak is back @ home
[14:06:44] O/
[14:07:17] Im here if im needed to merge or do some cleanup
[14:07:30] o/ Zppix
[14:07:36] I'm catching up on things. will let you know.
[14:07:50] Kk
[14:08:01] Have fun :)
[14:09:31] o/ akosiaris got time now for a quick stress test?
[14:11:54] halfak: o/. yes I am around and will be for the next 3 hours
[14:12:14] Cool. Starting up a simple 2000 requests / minute right now.
[14:16:06] 100% timeout error
[14:16:13] So something is wrong.
[14:16:51] Well. 2% timeout and 98% overload.
[14:16:56] Yeah. Something is terribly wrong.
[14:17:15] celery workers seem to be down
[14:18:07] akosiaris, ^
[14:18:13] Is this on the new cluster?
[14:18:14] Maybe we got an OOM
[14:18:15] Yeah
[14:18:37] * akosiaris looking
[14:20:21] OSError: [Errno 24] Too many open files
[14:21:02] celery seems to have croaked, systemd restarted it after that
[14:21:18] but this isn't consistent across the 9 boxes
[14:21:49] or maybe I am wrong
[14:22:02] Looks to me like no workers are processing any requests right now
[14:22:32] Looks like some of the discussion is towards increasing ulimit
[14:22:39] yeah it's easy to bump
[14:22:47] but I wanna be sure first
[14:22:53] why are we hitting that limit ?
[14:23:01] Not clear to me.
[14:23:09] I would not expect celery ores worker opening so many files
[14:23:23] Could be that we're not closing model files after we open them
[14:23:32] Or something like that.
[14:23:37] How many is too many really?
[14:23:47] Are we erroring out in the hundreds, thousands, millions?
[14:25:47] 65k
[14:26:12] I say we need to close some files halfak
[14:26:22] ah sorry 4k
[14:26:36] Hmm... let's see. We have ~75 models. 75 * 400 = 30,000
[14:26:37] but I think it's not related to actuall files
[14:26:44] I'm not sure that we're *not* closing them
[14:26:44] akosiaris@ores1002:~$ sudo systemctl show -p LimitNOFILE celery-ores-worker
[14:26:45] LimitNOFILE=4096
[14:26:45] yet
[14:26:54] so... remember this is unix we are talking about
[14:27:07] everything is a file, including sockets aka network connections
[14:27:18] Right.
[14:27:19] lemme paste the full backtrace
[14:28:01] https://phabricator.wikimedia.org/P5930
[14:29:21] Not too helpful for me. Hmm. whatever it is, it's probably happening when celery starts forking like mad to get to N workers
[14:29:32] so billiard opens to many pipes to take to redis is my guess
[14:29:45] to talk to*
[14:30:10] Hmm... Ok so the workers don't need to talk directly to redis, but celery's code does manage its own redis connection for workers.
[14:30:28] We could make sure to drop the redis connection used for cache lookups in the celery workers.
[14:30:39] I've got to run to a meeting, but I can look into that today.
[14:30:50] who does the cache lookup ?
[14:30:54] uwsgi ? or celery ?
[14:30:57] I guess uwsgi
[14:31:04] if yes, it will not help
[14:31:29] It's uwsgi.
[14:31:36] yeah it's unrelated
[14:31:41] But I think that the celery workers maintain a redis connection for it anyway.
[14:31:56] yeah but it's not that redis is complaining it's receiving too many connections
[14:31:57] Celery doesn't really maintain independence between the workers and uwsgi.
[14:32:02] It's essentially running the same code.
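A note on the "why are we hitting that limit?" question above: every socket and pipe counts against LimitNOFILE exactly like a model file does, so the fastest way to tell what is piling up is to classify the entries under /proc/<pid>/fd for the celery worker processes. A minimal diagnostic sketch (Linux-only; matching worker processes by a "celery" substring in the command line is an assumption, not how the ores units are necessarily named):

```python
#!/usr/bin/env python3
"""Rough sketch: classify open file descriptors held by celery worker processes."""
import os
from collections import Counter


def fd_summary(pid):
    """Group a process' open descriptors by what they point at."""
    counts = Counter()
    for fd in os.listdir("/proc/{}/fd".format(pid)):
        try:
            target = os.readlink("/proc/{}/fd/{}".format(pid, fd))
        except OSError:
            continue  # descriptor closed while we were looking
        if target.startswith("socket:"):
            counts["socket"] += 1
        elif target.startswith("pipe:"):
            counts["pipe"] += 1
        elif target.startswith("anon_inode:"):
            counts["anon_inode"] += 1
        else:
            counts["file"] += 1  # regular files, e.g. model files
    return counts


def celery_pids():
    """Yield pids whose command line mentions celery (assumed match pattern)."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/{}/cmdline".format(entry), "rb") as f:
                cmdline = f.read()
        except OSError:
            continue
        if b"celery" in cmdline:
            yield int(entry)


if __name__ == "__main__":
    for pid in celery_pids():
        print(pid, dict(fd_summary(pid)))
```

If sockets and pipes dominate the counts, the limit is being eaten by connections created during forking rather than by unclosed model files, which is the distinction being debated above.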
[14:32:11] uwsgi forks before celery and far less often.
[14:32:24] it's different processes
[14:32:37] Different processes that all maintain the connections to redis
[14:32:54] yes but we are not talking about some shared resource here having problems
[14:33:15] this is the kernel saying to celery worker that as a process it's over its 4k limit
[14:33:22] uwsgi has it's own limit
[14:33:29] its*
[14:33:42] I could just bump the limit
[14:33:53] akosiaris, right. That would be a fine short term solution.
[14:34:00] but to what value is a good question
[14:34:16] In the long term, I'd like to make celery workers not hold onto redis connections that they aren't going to use
[14:34:18] I guess we can just answer it later down the line
[14:34:30] Or rather, solve the issue, maybe
[14:34:54] fine by me
[14:35:04] anyway I 'll double the limit and let's see what happens
[14:35:12] Great. Thanks.
[14:35:21] Might have to quadruple it since we just quadrupled workers.
[14:36:15] heh, as long as we don't overrun the system max of 6,5M I guess that's fine
[14:36:32] especially as long as we are doing stress tests
[14:43:17] halfak: ok file limit bumped to 4k temporarily
[14:43:39] wasn't that the old limit?
[14:51:33] er sorry 16k
[14:51:39] cool
[14:51:55] Can you do a rolling restart of the workers too?
[14:51:57] Or should I do that?
[14:51:59] already done
[14:52:03] cool.
[14:52:08] Start the stress test again?
[14:52:19] yeah go ahead
[14:52:23] Going!
[14:52:35] Working!
[14:53:02] ah.. now we finally have redis issues
[14:53:08] Aug 28 14:53:01 ores1002 celery-ores-worker[25618]: [2017-08-28 14:53:01,423: ERROR/Worker-319] Connection to Redis lost: Retry (14/20) in 1.00 second.
[14:53:44] Yup. Errors everywhere!
[14:54:15] 500s that are not 503
[14:54:29] Maybe a connection limit with celery?
[14:56:44] redis logs are not showing anything
[14:59:11] then again the redis servers are bound to the same 4k open file limit (but there should have been an error of somekind logged somewhere)
[14:59:46] I 'll bump it just for the sake of it but honestly > 4k simultaneous connections to a redis server is a bit much
[14:59:56] agreed.
[15:08:03] bumped to 16k for the 2 redis servers (6379 & 6380) as well
[15:08:16] Shall I retry?
[15:08:20] sure
[15:08:32] going!
[15:08:53] yeah nothing happened
[15:08:55] same error
[15:09:11] Yup. Lasted about the same amount of time before erroring out.
[15:09:14] ah
[15:09:20] max number of clients reached
[15:09:22] finally
[15:09:29] Aug 28 15:09:05 ores1002 celery-ores-worker[25618]: [2017-08-28 15:09:05,661: CRITICAL/MainProcess] Task ores.scoring_systems.celery_queue._lookup_score_in_map[enwiki:draftquality:0.0.1:697381357] INTERNAL ERROR: ConnectionError('max number of clients reached',)
[15:09:37] ok that happened after the 20 retries
[15:09:46] nice so we maxed out redis
[15:09:47] lol
[15:11:07] In Redis 2.6 this limit is dynamic: by default it is set to 10000 clients, unless otherwise stated by the maxclients directive in Redis.conf.
[15:11:08] However, Redis checks with the kernel what is the maximum number of file descriptors that we are able to open (the soft limit is checked). If the limit is smaller than the maximum number of clients we want to handle, plus 32 (that is the number of file descriptors Redis reserves for internal uses), then the number of maximum clients is modified by Redis to match the amount of clients we are really able to handle under the current operating system limit.
[15:14:08] halfak: could you try once more ? I 've amended a bit the limits
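The maxclients behaviour quoted above can be checked directly instead of being inferred from the "max number of clients reached" error. A small redis-py sketch; host is a placeholder, the two ports (6379 and 6380) are the ones mentioned for the redis servers above:

```python
import redis


def client_headroom(host="localhost", port=6379):
    """Report how close a Redis server is to its effective maxclients."""
    r = redis.StrictRedis(host=host, port=port)
    maxclients = int(r.config_get("maxclients")["maxclients"])
    connected = int(r.info("clients")["connected_clients"])
    return connected, maxclients


if __name__ == "__main__":
    for port in (6379, 6380):
        connected, maxclients = client_headroom(port=port)
        print("port {}: {}/{} clients".format(port, connected, maxclients))
```

Note that the effective maxclients reported here can be lower than the configured value when the redis process' own soft open-file limit (minus the 32 descriptors Redis reserves) is smaller, which is why bumping the limit on the redis units as well mattered in the exchange above.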
[15:14:14] kk
[15:16:31] Working just fine :)
[15:16:41] Made it well past the limit where we failed in the past.
[15:17:00] finally
[15:17:13] so .. where do we go from here ?
[15:18:21] I'll continue the tests until we fail and then we decide to move forward or make more modifications.
[15:19:56] Trying 3k requests per minute.
[15:39:45] 10Scoring-platform-team, 10WMF-NDA-Requests: Data and Shell access related to Scoring Platform project on drafts and page reviews - https://phabricator.wikimedia.org/T172720#3558911 (10Sumit)
[15:39:47] 10Scoring-platform-team, 10articlequality-modeling, 10draftquality-modeling, 10artificial-intelligence: Get Sumit access to deleted page data for quality modeling - https://phabricator.wikimedia.org/T172719#3558914 (10Sumit)
[15:42:16] 10Scoring-platform-team, 10Scoring-platform-team-Backlog, 10Research Ideas: Create machine-readable version of the WikiProject Directory - https://phabricator.wikimedia.org/T172326#3558924 (10Sumit)
[15:55:38] 10Scoring-platform-team, 10Continuous-Integration-Config, 10Easy, 10Patch-For-Review, 10User-Ladsgroup: Have CI merge research/ores/wheels changes - https://phabricator.wikimedia.org/T173251#3558986 (10hashar) 05Open>03Resolved Thanks :)
[16:03:20] akosiaris, we eventually started to overload with 3k requests. Did something do wrong on the machines?
[16:03:58] 10Scoring-platform-team, 10revscoring, 10artificial-intelligence: '!' doesn't work for threshold optimizations - https://phabricator.wikimedia.org/T173261#3522382 (10Halfak) a:03Halfak
[16:04:26] Maybe the workers died again?
[16:04:32] Aug 28 15:36:03 ores1002 celery-ores-worker[25618]: ValueError: filedescriptor out of range in select()
[16:04:53] I am in a meeting currently will post the phab paste later
[16:05:24] cool thanks
[16:06:56] not sure who hung
[16:16:35] halfak: https://phabricator.wikimedia.org/P5930#32584
[16:54:03] 10Scoring-platform-team, 10draftquality-modeling, 10artificial-intelligence: Project around page reviewing and drafts - https://phabricator.wikimedia.org/T172726#3559250 (10Halfak) Made an editing pass through the wiki page. Looks good to me.
[16:55:09] 10Scoring-platform-team, 10editquality-modeling, 10revscoring, 10artificial-intelligence: [Investigate] Get signal from adding/removing images - https://phabricator.wikimedia.org/T172049#3559263 (10Halfak)
[16:59:30] 10Scoring-platform-team, 10articlequality-modeling, 10draftquality-modeling, 10artificial-intelligence: Get Sumit access to deleted page data for quality modeling - https://phabricator.wikimedia.org/T172719#3559357 (10Halfak) 05Open>03declined Discussed with @Sumit. We were able to make due with his #...
[16:59:34] 10Scoring-platform-team-Backlog, 10Community-Liaisons: AWight staff account access to deleted text - https://phabricator.wikimedia.org/T174363#3559359 (10awight)
[17:01:01] 10Scoring-platform-team, 10draftquality-modeling, 10artificial-intelligence: [Discuss] draftquality on a sample, humongous everything, or something else? - https://phabricator.wikimedia.org/T168909#3559377 (10Halfak) a:03Halfak
[17:02:24] halfak: any small gap b/w meetings sometime today? about 5 min?
[17:03:00] codezee: what you needing help with perhaps i could help?
[17:06:15] codezee: i added irc notifications to the drafttopic repo
[17:06:42] codezee, in 55 minutes I'm available.
[17:07:10] Zppix: thanks! btw where did you add them?
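An aside on the "ValueError: filedescriptor out of range in select()" seen at 16:04:32 above: select() can only watch descriptor numbers below FD_SETSIZE (1024 on Linux), so once a process legitimately holds more descriptors than that, any code path still built on select() fails even though the new 16k LimitNOFILE was never reached. A self-contained reproduction sketch, assuming the hard RLIMIT_NOFILE on the machine allows more than 1024 open files:

```python
import resource
import select

# Raise the soft open-file limit up to the hard limit so we can get past fd 1024.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Burn descriptors until one lands above select()'s FD_SETSIZE (1024).
handles = [open("/dev/null") for _ in range(1100)]
print("highest fd:", handles[-1].fileno())

try:
    select.select([handles[-1]], [], [], 0)
except ValueError as err:
    # Prints the same message as the journal line above.
    print(err)
```

This suggests that raising the limit alone may not be enough wherever select() is still in the worker's code path.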
[17:07:32] repo settings -> integrations codezee
[17:08:40] oh, i see :)
[17:09:47] I setup/re/configured the other repos to the same in the past so i did this one to keep it standardized with each other
[17:10:39] halfak: I don’t think this is a big deal, but I wanted to push back on the thresholds API. If you’re interested.
[17:11:24] awight, what's up?
[17:11:26] All I would say is that it might be better to simplify the request structure.
[17:12:00] You were saying RoanKattouw was thrilled about the functionality, but was that specifically about the flexibility, or just that we were exposing this info?
[17:12:21] The thing about complex APIs is that they are annoying to maintain.
[17:12:31] Must be flexibility because we were exposing this in an inflexible way in the past
[17:12:38] define complex in this context
[17:12:55] I Agree apis are a pain in the arse to maintain
[17:13:07] complex that we can pass paths into the data and even filter queries
[17:13:34] awight, so basically all of the new functionality
[17:13:41] IDK even
[17:13:51] IDK what was there previously
[17:14:00] I think it’s great to return all the threshold info
[17:14:48] but it’s a little scary to expose something like “get X at Y” in full DSL syntax, rather than targeting a specific use case like /get/X/at/Y
[17:14:51] Sorry to interrupt but did we find a solution to celery?
[17:16:44] awight, seems like a difference in delimiters.
[17:16:51] Zppix: https://media.giphy.com/media/y7eQLoDeRpOzS/giphy.gif
[17:17:21] halfak: The only difference is between supporting “anything in the world” and supporting a specific use
[17:17:37] Lmao i needed that adam
[17:18:52] halfak: I’m not advocating we burn down any of your work. Just that we might provide strong stability guarantees for a specific API, but call the full query thing experimental.
[17:19:05] Zppix: Any time, devin :D
[17:25:55] 10Scoring-platform-team-Backlog, 10Community-Liaisons: AWight staff account access to deleted text - https://phabricator.wikimedia.org/T174363#3559428 (10Quiddity) 05Open>03Invalid See instructions at https://office.wikimedia.org/wiki/WMF_Staff_userrights_policy :-)
[17:34:35] 10Scoring-platform-team-Backlog, 10Community-Liaisons: AWight staff account access to deleted text - https://phabricator.wikimedia.org/T174363#3559450 (10awight) @Quiddity Thanks for the link!
[17:34:44] awight, is your concern that we'll change the structure of information or maybe that we'll change the querying strategy for a consistent structure?
[17:38:12] halfak: AIUI, the two main approaches to API design are * make it minimal, or * make it really easy to do all the common operations
[17:38:31] the full-query parameter doesn’t seem to have either of the desired properties...
[17:39:13] The only argument I see for allowing these queries is that our users might have sophisticated use cases we haven’t yet analyzed
[17:39:42] in that case, we should probably understand their use cases and make that access really straightforward
[17:40:10] awight, do you feel like my analyses of the use cases have not been complete enough?
[17:40:24] sorry—where would I read those?
[17:40:29] In the tasks :)
[17:40:43] I’m terribly underinformed about the whole thing, due to my own laziness.
[17:40:48] kk lemme see...
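To make the two positions above concrete: the disagreement is between one flexible query parameter that carries a small threshold expression, and a narrow fixed-path endpoint per blessed use case (the "/get/X/at/Y" idea). Both URL shapes below are purely illustrative sketches of those two styles, not the routes that actually shipped; the parameter and path names are hypothetical:

```python
from urllib.parse import quote

expression = 'maximum recall @ precision >= 0.9'

# Style 1: one general-purpose query parameter carrying a threshold expression (full DSL).
dsl_style = (
    "https://ores.wikimedia.org/v3/scores/enwiki/"
    "?models=damaging&model_info=statistics.thresholds.true." + quote('"%s"' % expression)
)

# Style 2: a fixed path per supported question, with strong stability guarantees
# and no query language to parse or document (hypothetical path, does not exist).
fixed_path_style = (
    "https://ores.wikimedia.org/v3/thresholds/enwiki/damaging"
    "/maximum_recall/at_precision/0.9"
)

print(dsl_style)
print(fixed_path_style)
```

Style 1 covers use cases nobody has analyzed yet at the cost of maintaining a query syntax; style 2 is easy to keep stable but has to be extended every time a new question becomes common.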
[17:41:23] Here's a big part of the discussion: https://phabricator.wikimedia.org/T162217
[17:41:47] lol
[17:40:43] I would basically not have to bother you ever again, and would have enough granularity to do all sorts of things
[17:41:58] I see the appeal ;-)
[17:42:17] :D
[17:45:07] Where do I learn how to write an expression like “maximum recall @ precision >= 0.9” ?
[17:55:43] awight, no docs yet
[18:00:44] o/ codezee
[18:00:54] Just got done. What's up?
[18:01:07] awight: there's an interesting example from last time on L82 of etherpad
[18:17:43] codezee: hehe that is certainly an interesting query!
[18:22:17] Im back
[18:38:44] Out to lunch. Back soon
[18:40:42] back in 10
[19:01:43] 10Scoring-platform-team-Backlog: AWight staff account access to deleted text - https://phabricator.wikimedia.org/T174363#3559689 (10Qgil)
[19:29:42] back
[19:33:29] 10Scoring-platform-team-Backlog, 10articlequality-modeling, 10artificial-intelligence: Implement feature for detecting clumps of text that lack references - https://phabricator.wikimedia.org/T174384#3559974 (10Halfak)
[20:01:16] 10Scoring-platform-team, 10ORES, 10Operations, 10Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3560090 (10Halfak) New test today. Moral of the story is **TOO MANY FILE HANDLES**. https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&...
[20:01:28] ^^ Please review, awight
[20:01:44] halfak: will do!
[20:20:05] awight: i saw your msg in #mediawiki lol
[20:20:34] halfak: forgive me if you answered me did you and akosiaris figure out how to prevent the overloaded celery issue
[20:21:00] see that last post. still haven't figured out why celery is going crazy for file handles.
[20:22:44] Ah ok
[20:23:18] Ill watch and see if at any point i could help find a fix
[20:23:23] But atm im clueless
[20:33:29] 10Scoring-platform-team, 10Edit-Review-Improvements-RC-Page, 10MediaWiki-extensions-ORES, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017): Reduce very long search times on RC Page when using ORES for rare combos - https://phabricator.wikimedia.org/T164796#3560250 (10jmatazzoni) ! In T164796#35536...
[20:34:44] 10Scoring-platform-team, 10Edit-Review-Improvements-RC-Page, 10MediaWiki-extensions-ORES, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017): Reduce very long search times on RC Page when using ORES for rare combos - https://phabricator.wikimedia.org/T164796#3245687 (10Zppix) Rare combos are likely...
[20:36:20] 10Scoring-platform-team, 10Edit-Review-Improvements-RC-Page, 10MediaWiki-extensions-ORES, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017): Very long search times on RC Page for "Very likely good faith" + "Likely have problems" - https://phabricator.wikimedia.org/T164796#3560258 (10jmatazzoni)
[20:36:57] 10Scoring-platform-team, 10Edit-Review-Improvements-RC-Page, 10MediaWiki-extensions-ORES, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017): Very long search times on RC Page for "Very likely good faith" + "Likely have problems" - https://phabricator.wikimedia.org/T164796#3245687 (10jmatazzoni) I j...
[20:40:32] halfak:
[20:40:44] halfak: not certain what do do with reviewing the stress test
[20:40:48] It look bad ;-)
[20:41:26] Shall I dig into the bug itself?
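For reviewing the stress test results above, the load pattern described earlier (a constant 2000 to 3000 requests per minute against the scores endpoint) is easy to reproduce. A minimal sketch; the URL is a placeholder, the revid is the one that appears in the draftquality error earlier, and this should obviously be pointed at a test cluster rather than production:

```python
import concurrent.futures
import time

import requests

URL = "https://ores-test.example.org/v3/scores/enwiki/"  # placeholder endpoint
REV_ID = 697381357        # sample revid taken from the error log above
RATE_PER_MINUTE = 2000


def score_once():
    """Fire one scoring request and return its outcome (status code or error name)."""
    try:
        r = requests.get(URL, params={"models": "damaging", "revids": REV_ID}, timeout=15)
        return r.status_code
    except requests.RequestException as err:
        return type(err).__name__


def run(duration=300):
    """Submit requests at a fixed rate for `duration` seconds and tally the outcomes."""
    interval = 60.0 / RATE_PER_MINUTE
    with concurrent.futures.ThreadPoolExecutor(max_workers=200) as pool:
        futures = []
        deadline = time.time() + duration
        while time.time() < deadline:
            futures.append(pool.submit(score_once))
            time.sleep(interval)
        outcomes = [f.result() for f in futures]
    return {o: outcomes.count(o) for o in set(outcomes)}


if __name__ == "__main__":
    print(run())
```

Counting 200s versus 503 "overload" responses versus timeouts gives roughly the breakdown halfak was reporting by hand above.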
[20:41:29] awight: they want our .02¢ on T164796
[20:41:30] T164796: Very long search times on RC Page for "Very likely good faith" + "Likely have problems" - https://phabricator.wikimedia.org/T164796
[20:42:27] Zppix: Thanks for surfacing
[20:42:36] Np
[21:26:28] 10Scoring-platform-team, 10Edit-Review-Improvements-RC-Page, 10MediaWiki-extensions-ORES, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017): Very long search times on RC Page for "Very likely good faith" + "Likely have problems" - https://phabricator.wikimedia.org/T164796#3560419 (10awight) I think...
[21:34:26] 10Scoring-platform-team, 10Edit-Review-Improvements-RC-Page, 10MediaWiki-extensions-ORES, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017): Very long search times on RC Page for "Very likely good faith" + "Likely have problems" (on en.wiki only?) - https://phabricator.wikimedia.org/T164796#3560470 (...
[21:35:11] 10Scoring-platform-team, 10Edit-Review-Improvements-RC-Page, 10MediaWiki-extensions-ORES, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017): Very long search times on RC Page for "Very likely good faith" + "Likely have problems" (on en.wiki only?) - https://phabricator.wikimedia.org/T164796#3245687 (...
[21:38:17] Ugh I am feeling so interrupt-driven.
[21:50:36] 10Scoring-platform-team, 10Edit-Review-Improvements-RC-Page, 10MediaWiki-extensions-ORES, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017): Very long search times on RC Page for "Very likely good faith" + "Likely have problems" (on en.wiki only?) - https://phabricator.wikimedia.org/T164796#3560510 (...
[21:55:22] 10Scoring-platform-team, 10Edit-Review-Improvements-RC-Page, 10MediaWiki-extensions-ORES, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017): Very long search times on RC Page for "Very likely good faith" + "Likely have problems" (on en.wiki only?) - https://phabricator.wikimedia.org/T164796#3560528 (...
[21:58:42] 10Scoring-platform-team, 10Edit-Review-Improvements-RC-Page, 10MediaWiki-extensions-ORES, 10Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017): Very long search times on RC Page for "Very likely good faith" + "Likely have problems" (on en.wiki only?) - https://phabricator.wikimedia.org/T164796#3560530 (...
[21:59:09] That was dirty… I managed to kick that one back over the fence.
[21:59:32] \o/
[21:59:51] awight, re. stress test if you have nothing to say really, that's OK.
[21:59:53] :)
[21:59:56] um wait
[22:00:01] hehe I’m sure I can say something
[22:00:09] but I think your guesses are right so far
[22:00:14] I think the next step is to address the file-handle issue.
[22:00:40] yeah me too. checking the etherpad, cos I thought I remember you making tasks already
[22:00:41] I think we're hitting a limit because of the massive number of duplicated processes. I think few people use celery to the extent that we are.
[22:00:55] nope. I assigned you the task-making
[22:00:58] In full manager mode :D
[22:02:15] lol right on
[22:07:22] 10Scoring-platform-team-Backlog, 10ORES, 10User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3560572 (10awight)
[22:14:07] 10Scoring-platform-team-Backlog: [Investigate] ORES worker threads shouldn't use Redis connection pool - https://phabricator.wikimedia.org/T174403#3560595 (10awight)
[22:14:24] halfak: ^ does that correctly state your current guess?
[23:09:41] damn. missed awight
[23:24:18] hi halfak
[23:29:06] 10Scoring-platform-team-Backlog: [Investigate] ORES worker threads shouldn't use Redis connection pool - https://phabricator.wikimedia.org/T174403#3560595 (10Halfak) I think the issue is that we use the connection pool looking up cached scores. The celery connection pool is important and useful. Unrelated, b...
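Following up on that last comment, one way to make celery workers "not hold onto redis connections that they aren't going to use" (halfak's phrasing earlier) is to disconnect the pool used for cached-score lookups when each forked worker process starts, so redis-py only re-opens a connection if a worker actually touches the cache. A minimal sketch, assuming a module-level redis-py pool; the names and broker URL are hypothetical, not the actual ores code:

```python
import redis
from celery import Celery
from celery.signals import worker_process_init

app = Celery("ores_sketch", broker="redis://localhost:6379/0")

# Hypothetical: the pool the web tier uses for cached-score lookups, which forked
# celery workers would otherwise inherit and keep open without ever needing it.
score_cache_pool = redis.ConnectionPool(host="localhost", port=6379, db=1)
score_cache = redis.StrictRedis(connection_pool=score_cache_pool)


@worker_process_init.connect
def drop_inherited_cache_connections(**kwargs):
    """Close sockets inherited across fork; redis-py reconnects lazily on first use."""
    score_cache_pool.disconnect()
```

Whether this meaningfully lowers the per-worker descriptor count is exactly what T174402 and T174403 are meant to establish.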