[13:02:02] Hi halfak, I was reading about and thinking about Jade and its use cases. IMO it can be used in one way or another to tackle a wide spectrum of issues. Hence, I would like to work on it, to refine the existing ideas related to it and define a concrete research problem. Let me know what you think and what the next step forward could be.
[13:03:09] *halAFK
[13:24:11] o/
[13:25:27] nikhil07, great! So right now, we're maybe one month from our pilot deployment.
[13:26:59] Integration points are going to be a big theme. I.e., we want counter-vandalism tools to capture the judgments of their users in Jade.
[13:29:15] May I know which existing counter-vandalism tools you are planning to use for the deployment?
[13:33:37] Huggle is one of the big ones. https://en.wikipedia.org/wiki/Wikipedia:Huggle
[13:39:35] Okay, let me check and try to find some integration strategies.
[13:41:42] Btw, is there a Phabricator task to discuss this issue?
[13:42:09] Yeah. Let me grab it.
[13:43:11] https://phabricator.wikimedia.org/T238877
[13:43:49] thanks
[13:46:16] Scoring-platform-team, artificial-intelligence: Text complexity scoring - https://phabricator.wikimedia.org/T155843 (Chtnnh) @Halfak Should we mark this task as a duplicate?
[13:52:33] Scoring-platform-team, artificial-intelligence: Text complexity scoring - https://phabricator.wikimedia.org/T155843 (Halfak) Open→Declined I think we should decline this as it doesn't look like we want to deploy this. But we would like to do something different with T246438.
[13:53:48] Scoring-platform-team, artificial-intelligence: Text complexity scoring - https://phabricator.wikimedia.org/T155843 (Chtnnh) Noted, thank you.
[14:07:17] Scoring-platform-team (Current), articlequality-modeling, artificial-intelligence: Add text complexity scoring to article quality models - https://phabricator.wikimedia.org/T246438 (Halfak) We talked about this in IRC, but it looks like performance for this feature is pretty bad. I don't think that...
[14:17:40] Scoring-platform-team (Research), Outreach-Programs-Projects, Google-Summer-of-Code (2020), Outreachy (Round 20), artificial-intelligence: Proposal (GSoC / Outreachy 2020): Implement an NSFW image classifier with open_nsfw - https://phabricator.wikimedia.org/T247614 (Halfak) I think it makes...
[14:31:34] wikimedia/revscoring#1874 (no_textstat - af75127 : halfak): The build has errored. https://travis-ci.org/wikimedia/revscoring/builds/663089880
[15:28:20] Jade, Scoring-platform-team: Implement diff *of* Jade entity pages - https://phabricator.wikimedia.org/T247762 (Halfak)
[15:32:38] Jade, Scoring-platform-team: Implement diff *of* Jade entity pages - https://phabricator.wikimedia.org/T247762 (Halfak) For dealing with the double-diff view problem, I propose: {F31684919} In this mock, we can see that the diff on the Jade page is hidden behind a semi-transparent div. There is a butto...
[15:56:07] Jade, Scoring-platform-team: Implement diff *of* Jade entity pages - https://phabricator.wikimedia.org/T247762 (Halfak) OK, so now we need to know how the difference of Jade pages should look. I've made a mock of how all of the actions should look.
{F31685051}
[16:55:49] Jade, Scoring-platform-team, Epic: Implement secondary Jade Integrations - https://phabricator.wikimedia.org/T229974 (Halfak)
[16:55:51] Jade, Scoring-platform-team, Advanced-Search, Discovery-Search, and 2 others: Implement search indexing for Jade entity pages - https://phabricator.wikimedia.org/T206352 (Halfak)
[16:56:38] Jade, Scoring-platform-team (Current): Implement secondary schemas for joining Jade data to other tables - https://phabricator.wikimedia.org/T229977 (Halfak)
[16:56:53] Jade, Scoring-platform-team (Current): Clean up naming conflicts around writing secondary schema data for Jade - https://phabricator.wikimedia.org/T235003 (Halfak)
[17:14:45] Scoring-platform-team, articlequality-modeling, Community-Wishlist-Survey-2015, artificial-intelligence: Make quality/reliability of an article more clear to the reader - https://phabricator.wikimedia.org/T120754 (Aklapper)
[17:16:13] Scoring-platform-team, MediaWiki-extensions-General, articlequality-modeling, WorkType-NewFunctionality, artificial-intelligence: Create Quality Bias Alert for Cited Information Sources - https://phabricator.wikimedia.org/T28426 (Aklapper)
[18:38:18] Hello halfak, I spent some time trying to make it faster. I can't really say if it got faster, based on my computer's speed, though.
[18:39:45] One of the things I tried was to reduce the length of the file to 29 lines... by grouping all idioms that start with the same letter
[18:39:53] Something like:
[18:40:00] "babe (in arms|in the woods)|baby elephant in the room|back (at you|gammon player|in the day|in the game|of one's hand|to square one|to the drawing board|to the wall)
[18:42:43] In the quest to reduce the length of the list though, it gets "wider"... So I don't really feel there are any gains in performance (but then again, I can't really tell, since the time it takes to run on my machine fluctuates a lot)
[18:54:40] hey haksoat!
[18:55:17] It might be that we can't really improve performance without trimming the list substantially.
[18:55:30] It could be that we need a better strategy to scan for matches.
[18:55:57] I wonder if there are any other obvious strategies for trimming the list. What do you think?
[18:57:19] I'm currently going through the list: https://en.wiktionary.org/wiki/Appendix:Glossary_of_idioms_%E2%80%93_B
[18:57:36] I don't know how mwapi.Session works though
[18:57:58] But I'm pulling data using requests... at least to test out and see how the A-Z list works out
[18:58:20] halfak
[18:58:45] Oooh. I like it.
[18:59:36] Re. how to use mwapi, it provides some helpful structure. E.g., session = mwapi.Session("https://en.wiktionary.org")
[19:00:11] Yeah. The get method though.
[19:00:20] I tried following what you used for the other list
[19:00:21] doc = session.get(action="query", prop="revisions", rvprop="content", titles="Appendix:Glossary_of_idioms_–_B", formatversion=2)
[19:00:28] Great
[19:00:31] Thank you
[19:00:35] text = doc['query']['pages'][0]['revisions'][0]['content']
[20:28:18] Extracting idioms took 1.945356845855713 seconds
[20:28:26] Extracting features took 20.44516348838806 seconds
[20:28:36] Extracting features w/o idioms took 1.4541773796081543 seconds
[20:28:56] The values are very strange to me
[20:29:02] halfak
[20:29:30] Perhaps I should create a gist with the idioms in it
[20:29:41] So you can try it out
[20:30:22] very weird, indeed
[20:30:24] Sounds like a good plan. Thanks for digging into this.
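For reference, the mwapi snippets pasted above combine into a short standalone script. This is a minimal sketch assembled from the lines in the log; the user_agent string is illustrative.

    import mwapi

    # Minimal sketch assembled from the snippets above; the user_agent value is illustrative.
    session = mwapi.Session("https://en.wiktionary.org", user_agent="idiom-list test")

    doc = session.get(action="query", prop="revisions", rvprop="content",
                      titles="Appendix:Glossary_of_idioms_–_B", formatversion=2)
    text = doc['query']['pages'][0]['revisions'][0]['content']

    print(text[:200])  # first bit of the glossary wikitext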
[20:30:29] It didn't pick up an idiom from the Alan Turing article either, since it is a smaller list
[20:30:44] It's complicated. I appreciate your work to make sense of it.
[20:30:58] haksoat: are you using regex?
[20:31:06] is it matched using a DFA?
[20:31:16] Yes, using regex
[20:31:30] I don't know what a DFA is though
[20:31:33] DFA regex matching is slower in the common case, but asymptotically better
[20:31:45] I'll read up on it
[20:32:44] I've got to run in a moment. But let me link to the regex monstrosity we are using: https://github.com/wikimedia/revscoring/blob/master/revscoring/datasources/meta/extractors.py#L13
[20:32:59] Essentially, it turns a list of regexes into one big regex and runs finditer
[20:33:03] * halfak runs away
[20:33:07] xD
[20:33:28] :-D
[20:33:53] can this be run standalone?
[20:34:45] I guess regexes is a list and text_datasource a full article, like Alan Turing?
[20:38:49] Your guess is correct
[20:41:04] but text_datasource is a constructor parameter
[20:41:30] then what about text_or_texts in process()?
[20:42:17] is text_datasource an article name, then fetched automagically by the parent class?
[20:44:19] text_datasource is not an article name; it's the actual text we intend to run the regexes on, as a string.
[20:44:51] I commented out the last two lines of __init__
[20:44:57] and changed the parent class to object
[20:45:05] then did
[20:45:06] reg = regex(regexes, None)
[20:45:07] print(reg.process(open("Alan Turing", "r").read()))
[20:45:15] should that take a long time?
[20:45:48] I doubt that'd work
[20:45:53] doc = mwapi.Session("https://en.wikipedia.org").get(action="query", prop="revisions", titles="Alan Turing", rvprop="content", formatversion=2)
[20:45:53] text = doc['query']['pages'][0]['revisions'][0]['content']
[20:46:12] missed out import mwapi
[20:46:21] and the text=... goes on another line
[20:46:22] I placed a file named Alan Turing with the content of the enwiki article with that name :P
[20:46:31] That's how we extract the content
[20:46:37] Oh
[20:46:37] I'm probably missing your 29 lines of regexes
[20:46:48] :-D
[20:47:07] is that in another file of the repo?
[20:47:24] What's that?
[20:47:52] you mentioned a file with 29 lines of regex
[20:48:20] Oh. Okay. That's the list of idioms.
[20:48:33] I switched to another list
[20:48:54] I doubt you can use the list directly though; perhaps Halfak will clarify when he returns
[20:48:57] well, whatever list you are using
[20:49:11] https://gist.github.com/HAKSOAT/bc49d2cb84f56d1163b54cae5f5ff9ce
[20:50:28] it returns nothing for Alan?
[20:50:37] that took 1.5 secs here
[20:51:10] Yeah. It returns nothing for Alan
[20:51:20] I could create a gist for the longer list
[20:51:22] Coming
[20:52:24] https://gist.github.com/HAKSOAT/2dcd6055b8070455ff62c70d4f05a067
[20:53:35] AssertionError: sorry, but this version only supports 100 named groups
[20:54:12] Ooops. I have no idea what that is.
[20:54:59] I suspect you should be using non-capturing parentheses everywhere
[20:59:36] Instead of capturing parentheses?
[21:02:22] yes
[21:02:29] then your .group(2) fails
[21:02:40] but capturing everything should work, I think
[21:03:18] well, if your wrapping is \b, there's no difference
[21:03:24] as that's zero-length
[21:03:47] ['head and shoulders', 'head and shoulders', 'fall between two stools', 'piece of work', 'go through with', 'put the clock back', 'in the Making']
[21:04:23] about 2 seconds
[21:04:45] Yeah
[21:04:59] Takes about 2-3 seconds here too
[21:05:57] I think you have a strong point about speed improvements if we use non-capturing.
[21:06:16] I used group_pattern = r"(?:" + wrapping[0] + r")" + \
[21:06:16]     r"(" + r"|".join(regexes).replace('(', '(?:') + r")" + \
[21:06:19]     r"(?:" + wrapping[1] + r")"
[21:06:19] Buffer 1 is empty.
[21:07:39] How about the match.group(2)?
[21:08:25] I left one pair of parentheses as capturing ones
[21:08:39] thus using match.group(1)
[21:10:59] I get it now
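Pulling that exchange together, here is a minimal standalone sketch of the combined-pattern approach with non-capturing inner groups. It is illustrative rather than the actual revscoring code: the idiom regexes and the \b wrapping are examples, and the simple replace('(', '(?:') trick assumes the list only uses plain capturing parentheses.

    import re

    # Illustrative sketch, not the revscoring implementation: inner groups are made
    # non-capturing so the combined pattern stays under the group limit, and one
    # capturing group wraps the whole alternation so match.group(1) is the idiom.
    regexes = [
        r"back (at you|to square one|to the drawing board)",
        r"fall between two stools",
        r"piece of work",
    ]
    wrapping = (r"\b", r"\b")

    group_pattern = (r"(?:" + wrapping[0] + r")" +
                     r"(" + r"|".join(regexes).replace('(', '(?:') + r")" +
                     r"(?:" + wrapping[1] + r")")
    pattern = re.compile(group_pattern, re.IGNORECASE)

    text = "He went back to square one; what a piece of work."
    print([match.group(1) for match in pattern.finditer(text)])
    # ['back to square one', 'piece of work']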
[21:16:56] I guess you need all the idioms?
[21:18:51] Yes. The longer list will be preferred if speed can be optimized.
[21:23:59] do you have the longer list as one idiom per line?
[21:24:03] i.e. uncompressed
[21:24:53] Yes
[21:31:53] Here:
[21:31:56] https://gist.github.com/HAKSOAT/ae92132964cbd6d50e36462040d09bc3
[21:31:59] Platonides
[21:38:52] thanks
[21:45:14] Scoring-platform-team (Research), Outreach-Programs-Projects, Google-Summer-of-Code (2020), Outreachy (Round 20), artificial-intelligence: Proposal (GSoC / Outreachy 2020): Implement an NSFW image classifier with open_nsfw - https://phabricator.wikimedia.org/T247614 (srishakatux) @Chtnnh I co...
[22:21:44] argh, "regular expression is too large"
[22:27:46] ok, the pattern has 87732 bytes
[22:28:03] and the value it was compiled with caps it at 64K
[22:40:57] I kinda rewrote it in C
[22:41:02] Waoh
[22:41:13] so 0.217s for what was 0.931s in Python
[22:41:28] however I need to use a smaller set of idioms
[22:44:24] could recompile pcre
[22:54:23] ok, that was easy
[22:54:35] python2: 2.2s
[22:54:42] python3: 3s
[22:54:55] (using longer_enwiktionary)
[22:55:10] C implementation with longer_enwikt:
[22:55:13] 0.625s
[22:55:32] 2.118s with the uncompressed one
[22:56:12] the uncompressed one would take 8.79s in py3
[22:56:34] 5.776s in python2
[22:57:31] hmm, weird, Python is detecting one more idiom
[22:57:53] Exciting stuff
[22:58:06] Even though I'm lost already xD
[23:01:30] ah, I see now why
[23:01:38] Python is detecting "in the Making"
[23:01:52] but in C the matching is not being done case-insensitively
[23:05:40] Okay
[23:05:53] Can you share the code as a gist?
[23:06:09] I really want to see what it looks like
[23:15:37] yup
[23:17:45] $ ./compare.sh uncompressed-noquotes.txt Alan\ Turing
[23:17:45] Comparing with idiom list uncompressed-noquotes.txt:
[23:17:45] python2 0:05.83
[23:17:45] python3 0:08.60
[23:17:47] PCRE 0:02.28
[23:17:50] PCRE JIT 0:02.24
[23:17:54] $ ./compare.sh longer_enwiktionary-noquotes3.txt Alan\ Turing
[23:17:54] Comparing with idiom list longer_enwiktionary-noquotes3.txt:
[23:17:54] python2 0:02.28
[23:17:55] python3 0:03.14
[23:17:55] PCRE 0:00.62
[23:17:57] PCRE JIT 0:00.70
[23:19:19] Waoh.
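For the Python rows of a comparison like the one above, a rough timing harness might look like the following. This is a sketch under assumptions: the script name is hypothetical, and it presumes an idiom file with one regex per line plus a plain-text article file, as used in the shell comparison.

    import re
    import sys
    import time

    # Rough sketch of the Python side of such a comparison; usage (illustrative):
    #   python3 time_extraction.py uncompressed-noquotes.txt "Alan Turing"
    idiom_file, text_file = sys.argv[1], sys.argv[2]

    with open(idiom_file) as f:
        idioms = [line.strip() for line in f if line.strip()]

    with open(text_file) as f:
        text = f.read()

    # Plain "(" groups in the idiom regexes are made non-capturing, as discussed earlier.
    pattern = re.compile(r"\b(" + "|".join(idioms).replace("(", "(?:") + r")\b",
                         re.IGNORECASE)

    start = time.perf_counter()
    matches = [m.group(1) for m in pattern.finditer(text)]
    elapsed = time.perf_counter() - start

    print(matches)
    print("Extraction took {:.3f} seconds".format(elapsed))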
[23:19:46] would need to use a larger corpus to see the asymptotic behavior
[23:23:23] https://archivos.wikimedia.es/wikimedia/es/revscoring.bz2
[23:23:51] note you will need to recompile pcre with ./configure --with-link-size=3
[23:29:42] Okay
[23:30:54] I'm not very clear, though, on how I can integrate pcre with the codebase
[23:31:44] I was just comparing the code
[23:31:51] test.py uses the Python implementation
[23:32:03] the pcre program uses the C library
[23:32:41] one could use Python ctypes to call pcre, I guess
[23:33:26] or make a Python extension
[23:33:30] I'll look into that
[23:33:56] perhaps this won't even solve your problems
[23:34:21] it's quick code, after all
[23:35:30] Hehe. True. Just me being curious.
[23:35:53] that's a good property :)
[23:36:17] I initially thought Python used pcre, but now I think it uses its own implementation
[23:45:18] Oh. I was actually surprised Py2 performed better than Py3, though. I'll take another look at the whole thing later in the day and try to see the way forward. It's past midnight here; I need to catch some sleep. Thanks for the comparison code and the feedback too, I did learn a couple of things.
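On the ctypes suggestion above, a very rough sketch of calling the PCRE (PCRE1) C library from Python could look like the following. This is an assumption-laden illustration rather than anything from the discussion: it presumes libpcre is installed and findable, the pattern and subject text are examples, and only the first match is reported.

    import ctypes
    import ctypes.util

    # Very rough sketch of the ctypes idea above: calling the PCRE1 C library from
    # Python. Assumes libpcre (PCRE1) is installed and findable on this system.
    pcre = ctypes.CDLL(ctypes.util.find_library("pcre"))

    pcre.pcre_compile.restype = ctypes.c_void_p
    pcre.pcre_compile.argtypes = [
        ctypes.c_char_p, ctypes.c_int, ctypes.POINTER(ctypes.c_char_p),
        ctypes.POINTER(ctypes.c_int), ctypes.c_void_p,
    ]
    pcre.pcre_exec.restype = ctypes.c_int
    pcre.pcre_exec.argtypes = [
        ctypes.c_void_p, ctypes.c_void_p, ctypes.c_char_p, ctypes.c_int,
        ctypes.c_int, ctypes.c_int, ctypes.POINTER(ctypes.c_int), ctypes.c_int,
    ]

    PCRE_CASELESS = 0x00000001
    error = ctypes.c_char_p()
    erroffset = ctypes.c_int()
    code = pcre.pcre_compile(rb"\bback to square one\b", PCRE_CASELESS,
                             ctypes.byref(error), ctypes.byref(erroffset), None)
    assert code, error.value  # compilation failed if NULL

    subject = b"We went Back To Square One after the review."
    ovector = (ctypes.c_int * 30)()  # output vector: start/end offsets of the match
    rc = pcre.pcre_exec(code, None, subject, len(subject), 0, 0, ovector, 30)
    print(subject[ovector[0]:ovector[1]] if rc >= 0 else "no match")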