[14:09:43] I’m pushing the celery 4 code to ores* for fun...
[14:10:26] My hope is that I can do a stress test, and if needed dial down the workers until we can at least match current performance and switch production over.
[14:40:54] Scoring-platform-team, ORES, Wikimedia-log-errors: Notice: Undefined property: stdClass::$ores_damaging_threshold in /srv/mediawiki/php-1.31.0-wmf.6/extensions/ORES/includes/Hooks.php on line 602 - https://phabricator.wikimedia.org/T179830#3737445 (Addshore)
[14:42:46] o/
[14:43:16] awight, how's it going?
[14:43:50] halfak: hey there
[14:43:58] Hi!
[14:44:01] awight: is deploying to ores nodes last time I checked
[14:44:12] Yeah. Saw the pings in operations :)
[14:44:13] /o\
[14:44:23] awight@ores1001:/srv/deployment/ores$ pstree -pal | grep celery | wc
[14:44:24] 482 970 10704
[14:44:25] awight@ores1001:/srv/deployment/ores$ pstree -pal | grep celery | wc
[14:44:25] I'm writing the code to turn user id into user name in wikilabels
[14:44:26] 1 3 37
[14:44:30] nice!
[14:45:23] awight, all the workers crashing?
[14:45:29] I see 482 turn into 1
[14:45:30] hahaha
[14:45:37] * awight runs off the gangplank
[14:45:40] File "/srv/deployment/ores/venv/lib/python3.4/site-packages/redis/_compat.py", line 23, in select
[14:45:40] return _select(rlist, wlist, xlist, timeout)
[14:45:41] ValueError: filedescriptor out of range in select()
[14:45:50] lol damn
[14:45:52] so… I’ll dial down the workers.
[14:46:11] What business does redis have being fragile, though?
[14:46:41] oh. It’s gonna be anything that uses basic filesystem operations
[14:47:01] https://github.com/celery/celery/issues/3397
[14:47:08] we should file a bug
[14:47:09] we should consider pruning filehandles in the master process
[14:47:42] even better: have celery devs consider doing it :)
[14:47:50] Right. Looks like they already have.
[14:47:59] This bug is "closed"
[14:48:03] but it's still a problem.
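The ValueError in the traceback above can probably be reproduced without celery or redis at all. What follows is a minimal sketch, assuming a Unix system where select() is limited to descriptor numbers below 1024. The FD_SETSIZE constant and the probe_high_fd helper are illustrative names, not part of ORES, celery, or redis. The sketch duplicates one end of a pipe onto a descriptor number above the limit, shows select.select() rejecting it, and shows select.poll() handling the same descriptor.

```python
import os
import resource
import select

# Assumed value for Linux/glibc; the select module does not expose it.
FD_SETSIZE = 1024


def probe_high_fd(target_fd=FD_SETSIZE + 76):
    """Return (select_error, poll_ok) for a pipe duplicated onto target_fd.

    Returns None if the open-file limit cannot be raised above target_fd.
    Note: this may raise the soft RLIMIT_NOFILE of the calling process.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY and soft <= target_fd:
        if hard != resource.RLIM_INFINITY and hard <= target_fd:
            return None
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target_fd + 1, hard))
        except (ValueError, OSError):
            return None

    read_end, write_end = os.pipe()
    # Force a descriptor *number* above FD_SETSIZE, as a parent process
    # holding hundreds of worker pipes would eventually be handed.
    os.dup2(read_end, target_fd)
    try:
        os.write(write_end, b"x")  # make the pipe readable

        try:
            select.select([target_fd], [], [], 0)
            select_error = None
        except ValueError as exc:
            select_error = str(exc)

        # poll() takes descriptor numbers directly and has no such ceiling.
        poller = select.poll()
        poller.register(target_fd, select.POLLIN)
        poll_ok = any(
            fd == target_fd and bool(events & select.POLLIN)
            for fd, events in poller.poll(0)
        )
        return select_error, poll_ok
    finally:
        for fd in (read_end, write_end, target_fd):
            os.close(fd)


if __name__ == "__main__":
    print(probe_high_fd())
```

Only one high-numbered descriptor is needed to trigger the error, which fits the guess that redis was simply the first library to call select() after the parent process had accumulated enough open descriptors.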
[14:51:34] halfak:
[14:51:38] https://usercontent.irccloud-cdn.com/file/e00O7tZt/image.png
[14:51:47] Should I link to user page or user contribs?
[14:51:51] \o/ Looks good.
[14:51:59] I think user page on local wiki if that's OK.
[14:52:04] Or maybe the *home* wiki :)
[14:52:18] Since we're accessing the global account. Either way.
[14:52:31] If local is way easier, then that makes sense.
[14:52:34] I'm using lru cache also
[14:52:49] meh, it won't matter much
[14:53:12] awight, can you confirm your pypi account name?
[14:53:23] halfak: "adamw"
[14:53:29] thanks
[14:57:19] Scoring-platform-team (Current), ORES, Operations, Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3737484 (awight) @akosiaris FYI, I'm backing off from our attempt to use 480 workers. The change is ready for review in https://ge...
[15:00:22] Urgh. That redis code we’re hitting is only used when Python < 3.5
[15:00:46] redis code?
[15:01:00] See my paste above. We’re hitting select() limits in other libraries now.
[15:02:04] Oh I’m wrong about the 3.5 compat, I think we call select in either case.
[15:06:26] awight, oh! I see
[15:07:09] I just want to go back to a simple life, with only 240 workers.
[15:07:31] awight, i wonder if that is in celery's use of redis
[15:07:44] If not, we can cut it.
[15:07:54] k revscoring 2.0.9 is uploaded.
[15:07:58] hum
[15:07:59] Nice
[15:08:34] I’m shooting for an incremental approach, 240 is already 5x the number we have currently in production.
[15:08:41] good point
[15:08:43] I think we’ll want a few thousand eventually.
[15:08:43] Also more machines
[15:08:44] :)
[15:08:53] ah good call.
[15:09:25] This pipes thing is in fact killing me, though.
[15:09:36] pipes?
[15:10:00] That celery finds it appropriate to open all the file descriptors.
[15:10:05] Oh yeah.
[15:10:08] how about… UDP or something.
[15:10:18] actually! do we have that option?
[15:10:22] celery is supposed to not do that
[15:10:27] mm?
[15:10:39] That's the thing. We should get 2x the number of workers as file descriptors
[15:11:24] fd = 2x workers makes sense, why would it be the other way around?
[15:11:37] since each pipe is one-way
[15:12:41] "other way around"?
[15:14:44] sorry. Why would there be 2x workers as pipes?
[15:15:04] also: I’m digging through the manual for alternatives to this pipe nightmare
[15:15:11] "ask" said so in an issue. I'm looking for it.
[15:15:15] "select has a quirky limitation of 1024 file descriptors caused by FD_SETSIZE, and this limit is based on the number of file descriptors used by the program, and not the number of file descriptors in the set."
[15:15:27] “set” is unclear
[15:15:47] https://github.com/celery/celery/issues/3397#issuecomment-242603728
[15:15:54] "You have 500 processes, that will create about 1000 pipes."
[15:17:04] Maybe we should demo the issue with a super simple celery app
[15:17:05] Yes. That’s fixed in celery 4 AFAICT
[15:17:17] awight, but why are we still experiencing it?
[15:17:39] however, it seems that the total number of file descriptors used by a process causes an issue with any *other* lib that uses select()
[15:17:43] It’s only a guess...
[15:18:08] awight, I think it's a total per-process and redis hitting the wall just means it was the last straw.
[15:18:17] mmm
[15:18:19] yes true
[15:18:27] but the limit on FDs is much, much higher
[15:18:41] it’s only python+linux+select that has this issue
[15:19:04] so we could happily be at 50,000 FDs in a process, then some jerk comes along and tries to open a new thing with select(), and the FD >> 1024
[15:19:17] Do you see the error happen with redis select on all of the nodes?
[15:19:58] good question, lemme see.
[15:20:16] This is ores1002 so yes, I think so.
[15:20:54] for the lulz: https://phabricator.wikimedia.org/P6264
[15:24:33] Thinking about it, there’s no damn reason to have these pipes at all? Cos Celery workers get their jobs through a broker, not from the spawning process.
[15:36:50] awight, looks like we should file a bug against kombu re. use of redis select that limits the # of worker processes.
[15:37:53] kombu is calling a public API in redis, connection.can_read, that should be fine.
[15:38:01] It’s these dang pipes that are messing us up.
[15:38:23] I’m pretty certain they can be closed as workers are spawned.
[15:40:08] s/pretty certain// I haven’t read that code yet ;-)
[15:55:20] I’m less certain. It’s possible that the parent process on each machine is responsible for maintaining the worker pool.
[16:00:30] halfak: https://github.com/wiki-ai/wikilabels/pull/211/files
[16:00:45] halfak: We could try Eventlet concurrency...
[16:01:20] wiki-ai/wikilabels#217 (user_name - a03beec : Amir Sarabadani): The build failed. https://travis-ci.org/wiki-ai/wikilabels/builds/298060232
[16:01:35] I don’t understand yet what the tradeoffs are.
[16:12:47] sample output of the fetch_page_wikiprojects script that does all the work of fetching all wps and mid-level categories for pageids - https://dpaste.de/3SvK
[16:13:28] it's interesting to see the pages belonging to multiple categories like - ["Technology", "Medicine", "Internet Culture"]
[17:08:15] Scoring-platform-team (Current): Build mid-level WikiProject category training set - https://phabricator.wikimedia.org/T172321#3737775 (Sumit) https://github.com/wiki-ai/drafttopic/pull/11
[17:18:03] Scoring-platform-team (Current): Deploy ORES early Nov 2017 - https://phabricator.wikimedia.org/T179837#3737792 (Halfak)
[17:18:10] Scoring-platform-team (Current), ORES: Deploy ORES early Nov 2017 - https://phabricator.wikimedia.org/T179837#3737804 (Halfak)
[17:19:44] Scoring-platform-team (Current), ORES: ORES 500s when model_info lookup fails due to a key error - https://phabricator.wikimedia.org/T179712#3737807 (Halfak)
[17:19:46] Scoring-platform-team (Current), ORES: ORES 500 errors on a threshold lookup request - https://phabricator.wikimedia.org/T179711#3737808 (Halfak)
[17:19:49] Scoring-platform-team (Current), ORES: Deploy ORES early Nov 2017 - https://phabricator.wikimedia.org/T179837#3737792 (Halfak)
[17:20:11] Scoring-platform-team (Current), ORES, revscoring, artificial-intelligence: Update ORES deploy wheels with revscoring 2.0.9 - https://phabricator.wikimedia.org/T179838#3737809 (Halfak)
[17:30:57] Scoring-platform-team, ORES, Google-Code-in-2017, User-Zppix: Document ORES's restAPI functions - https://phabricator.wikimedia.org/T179314#3737844 (Halfak) Open>declined We already have API documentation -- see the links in the description -- so this is under-specified at best. I'm decl...
[17:43:03] Scoring-platform-team (Current), ORES: Test to see what Extension:ORES will do when it gets this null threshold result. - https://phabricator.wikimedia.org/T179845#3737973 (awight)
[17:56:15] halfak: got 5 mins?
[17:56:25] Not right now. Sorry
[17:56:29] ok...
[17:56:46] But send me what you've got and I'll get back to you when I can.
[17:56:51] Maybe 30 minutes from now.
[17:56:55] alright
[17:57:58] halfak: so regarding the next step of hacking into revscoring for a multiclass classification model, what would be best to start with? adding tf-idf support or adding word2vec support?
[17:59:39] codezee, I think we should explore both. I'd like to start workshopping some word-frequency examples in revscoring, but I'd be happy to also start building feature vectors for word2vec.
[18:01:01] okay, i'll first work with word frequency then...
[18:01:14] and see what we get
[19:22:55] OK! I'm packing up to head out.
[19:23:00] Take care folks!
[19:23:39] o/
[19:36:51] Scoring-platform-team (Current), Wikimedia-Site-requests, Community-Tech-Sprint, Patch-For-Review, User-Ladsgroup: Enable draftquality model in ORES extension for enwiki - https://phabricator.wikimedia.org/T179596#3738432 (kaldari) Open>Resolved We are now collecting draftquality scor...
[19:38:55] Scoring-platform-team (Current), Wikimedia-Site-requests, Community-Tech-Sprint, Patch-For-Review, User-Ladsgroup: Enable draftquality model in ORES extension for enwiki - https://phabricator.wikimedia.org/T179596#3738445 (Ladsgroup) It only includes "OK" and "spam" data, not vandalism. This...
[19:40:43] Scoring-platform-team (Current), MediaWiki-extensions-ORES, draftquality-modeling, User-Ladsgroup, artificial-intelligence: Collect all data for draftquality model in enwiki - https://phabricator.wikimedia.org/T179861#3738448 (Ladsgroup)
[19:44:24] Scoring-platform-team (Current), Wikimedia-Site-requests, Community-Tech-Sprint, Patch-For-Review, User-Ladsgroup: Enable draftquality model in ORES extension for enwiki - https://phabricator.wikimedia.org/T179596#3738476 (kaldari) @Ladsgroup: I saw at least one vandalism score. Maybe it's ju...
[19:48:12] Scoring-platform-team (Current), MediaWiki-extensions-ORES, Graphite, User-Ladsgroup: Keep statistics about ores service hits for storing thresholds - https://phabricator.wikimedia.org/T179862#3738480 (Ladsgroup)
[19:51:37] Scoring-platform-team (Current), Wikimedia-Site-requests, Community-Tech-Sprint, Patch-For-Review, User-Ladsgroup: Enable draftquality model in ORES extension for enwiki - https://phabricator.wikimedia.org/T179596#3738504 (Ladsgroup) >>! In T179596#3738476, @kaldari wrote: > @Ladsgroup: I saw...
[19:57:30] awight: around?
[19:57:35] Amir1: hey!
[19:58:01] awight: regarding the deploy tonight, is this done? https://phabricator.wikimedia.org/T179838
[19:58:18] Not yet, I’ve only pushed to pypi
[19:58:30] gimme a second & I can make that patch
[19:58:35] amazing
[19:58:37] Thanks
[20:01:02] (PS1) Awight: Update to revscoring 2.0.9 [research/ores/wheels] - https://gerrit.wikimedia.org/r/389549
[20:01:07] Amir1: ^
[20:01:55] (CR) Ladsgroup: [C: 2] Update to revscoring 2.0.9 [research/ores/wheels] - https://gerrit.wikimedia.org/r/389549 (owner: Awight)
[20:02:08] thanks
[20:03:46] U got the submodule bump...
[20:05:02] awight: yeah
[20:27:24] (PS1) Ladsgroup: Bump ores and wheels to HEAD [services/ores/deploy] - https://gerrit.wikimedia.org/r/389552
[20:27:49] awight: Can you check this? https://gerrit.wikimedia.org/r/#/c/389552/1
[20:28:02] The git pull brought lots of changes
[20:32:56] hum.
[20:34:49] (CR) Ladsgroup: [V: 2 C: 2] Bump ores and wheels to HEAD [services/ores/deploy] - https://gerrit.wikimedia.org/r/389552 (owner: Ladsgroup)
[20:35:01] oops hehe I was about to check that
[20:37:13] Amir1: that bump looks perfect
[21:10:14] Awight mind checking out https://www.mediawiki.org/wiki/Topic:U1bfosekyw1e88yv halfak asked me to do something like that and I wanted opinions
[21:14:55] Zppix: It’s pretty confusing. What’s the intention here?
[21:15:07] Is this an introduction to the talk page?
[21:15:12] To gain questions to add to faq mainly
[21:15:39] awight: 4:07 PM Zppix, why not invite people to ask questions on the talk page?
[21:16:07] How would people find this invitation, I’m wondering?
[21:16:27] I was planning on posting to ai-l if halfak was okay with it
[21:16:35] With the link to topic
[21:22:02] Awight so do you think it's safe to do so?
[21:24:45] I think you should steer people to the FAQ and the Talk page, but I don’t think you need this introductory topic
[21:26:09] Ok
[21:27:13] I hid it awight, I'll work on an etherpad for the email and get suggestions before sending
[21:27:36] +1, thanks for the help!
[21:27:40] Np
[21:34:46] Awight https://etherpad.wikimedia.org/p/ORES_FAQ_Email
[21:37:50] Scoring-platform-team, ORES, Operations, Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3738870 (BBlack) Open>stalled p:High>Normal The timeout changes above will offer some insulation, and as time...
[21:38:38] Halfak if you want to take a look https://etherpad.wikimedia.org/p/ORES_FAQ_Email
[21:39:31] halfak: awight I'm leaving for the day
[21:39:36] o/
[21:39:54] Amir1: cool, I’ll get beta back up np
[21:39:54] Amir1: o/
[21:39:55] sorry couldn't finish up all stuff (btw. the PR for wikilabels is up)
[21:39:55] I'm on for 10 minutes from the airport ^_^
[21:40:07] I'll check wikilabels
[21:41:41] Scoring-platform-team (Current), Wikilabels, User-Ladsgroup: Add list of labelers to campaign stats (sort by labels submitted) - https://phabricator.wikimedia.org/T178004#3738882 (Ladsgroup) https://github.com/wiki-ai/wikilabels/pull/211
[21:42:16] halfak: http://labels.wmflabs.org/ui/enwiki/ btw. (the info_url feature)
[21:44:43] Amir1: My current theory is that the deployment got corrupted by running out of disk space, in a way that even -f couldn’t fix.
[21:45:07] But I have no idea what happened that caused us to see a revscoring 1 error on the canary.
[21:45:16] Time for a HDD upgrade?
[21:45:35] Revscoring 1 error on the canary!?
[21:45:48] How's your .whl build environment? :P
[21:46:13] That info URL is a thing of beauty
[21:46:36] Halfak or awight is that email draft (https://etherpad.wikimedia.org/p/ORES_FAQ_Email ) good 2 go?
[21:47:06] Zppix, not yet. I won't be able to take a pass for a while. Let's make the email an announcement!
[21:47:13] I want to add stuff to it :)
[21:47:17] Ok
[21:50:40] halfak: The .whl is wholesome :p
[21:51:11] The error was apparently from a revscoring 1 model getting loaded up, “import revscoring.scorer_models” not found.
[21:51:25] Interesting!
[21:51:34] Definitely a revscoring 1 model somewhere
[21:52:15] Halfak maybe the files need pruned?
[21:52:23] wat
[21:52:46] But more seriously, I think we would have seen a model file issue with previous deploys.
[21:52:58] Scap weirdness sounds more likely to me.
[21:52:59] I think we have some files that shouldn't be there
[21:53:04] Oh yeah?
[21:53:07] And need deleted, that's a fact
[21:53:25] But where it must be a scap issue could im looking at github
[21:53:38] Na. we're pretty clean WRT models. We have lots of checks on that.
[21:53:52] The normal process for rebuilding models is to delete the model dir and rebuild everything :D
[21:54:01] That's painful no?
[21:54:20] Nothing gets committed unless it's in the Makefile (assuming code review is working)
[21:54:42] Nope! Well, a little, but it helps us maintain some convenient guarantees :D
[21:54:59] Does canary cache?
[21:55:07] halfak: So, I’ve been seeing scap doing this thing where it deploys an apparently random revision unless I specify "-r HEAD"
[21:55:12] Scap has a cache
[21:55:30] Also, and most covfefely, it *logs* that it deployed the version I’ve checked out, even if it isn't.
[21:55:35] awight, could it be that a half-completed scap barfed out an old version?
[21:55:40] lol
[21:55:56] Scap is so "convenient"
[21:56:10] I’ve been feeding bugs to Tyler but I thought I would give them a few days off ;-)
[21:58:40] Scap is like all software and has its quirks. ORES has its quirks!
[21:59:08] lol scap is great, and I think its authors would say you’re being too generous
[21:59:24] software seems to be a process of chaining quirks together.
[22:00:44] OK time to go! Have a good one!
[22:01:20] O/ halfak
[22:05:28] Scoring-platform-team, ORES: KeyError on "features" - https://phabricator.wikimedia.org/T179873#3738957 (awight)
[22:30:48] Scoring-platform-team, ORES: Specter of revscoring 1 haunting ORES - https://phabricator.wikimedia.org/T179874#3739027 (awight)
[22:31:21] Amir1: fyi, I deployed a few times with the same results you were getting. Eventually rolled back and filed the bug ^
[22:32:40] awight: is the submodule updated
[22:33:13] Zppix: It seems that updating the submodule is what caused the problem.
[22:33:33] Great