[14:23:59] Scoring-platform-team-Backlog, revscoring, artificial-intelligence: [Investigate] Non-backtracking regex parsers - https://phabricator.wikimedia.org/T173574#3533505 (Halfak)
[14:24:59] Scoring-platform-team-Backlog, revscoring, artificial-intelligence: [Investigate] Non-backtracking regex parsers - https://phabricator.wikimedia.org/T173574#3533519 (Halfak) I think our test should specifically target the badwords/informals processing. We could also test out tokenizing via the `delt...
[15:51:00] halfak: there?
[15:51:14] yup
[15:51:36] halfak: back from Wikimania?
[15:51:40] Yup :)
[15:56:42] halfak: there is currently no page on article rerouting on meta, given that i was planning to look into it, can i go ahead and create one?
[15:58:19] Yes please!
[15:58:27] Did you see the graphic I made for it yesterday?
[15:58:41] halfak: also, i was just curious, what tool do you use for something like - https://commons.wikimedia.org/wiki/File:New_article_routing.with_ORES.svg
[15:58:46] yes i saw...^
[15:58:55] I think the name should be something like "Automated article topic detection" or something like that.
[15:59:03] codezee, google draw
[15:59:08] It's a decent vector editor.
[15:59:16] I can share the drawing with you if you want to make edits.
[15:59:48] Woops. I think it should be "Automated draft topic detection"
[15:59:53] draft = new article thing.
[15:59:57] and we should focus on that :)
[16:00:22] * halfak shares the drawing
[16:01:36] got it... :)
[16:48:20] halfak: afaik, articles are manually added to wikiprojects by adding the template to the talk pages, right?
[16:48:27] currently
[16:51:22] codezee: there is also a rule based bot for it: https://en.wikipedia.org/wiki/User:AlexNewArtBot
[16:52:01] eranroz: i'll have a look, thanks!
[16:52:03] however since it is based on manual rules and not on AI, I guess it is not reaching to other languages beside english(?)
[16:52:31] TF-IDF classifier will definitely do good work in any language :)
[16:53:43] eranroz: with respect to english, how good is it? or what can be improved in it?
[16:55:31] codezee: I'm not very active in enwiki, so I don't really know.
[16:56:22] oh, ok
[16:56:32] anyway, whatever approach you take - you can use its historic results as a baseline
[16:59:21] eranroz: do you know where i can find the historic results?
[17:48:29] codezee: sorry for the late response. I mean https://en.wikipedia.org/w/index.php?title=User:InceptionBot/NewPageSearch/Medicine/log&action=history for example for medicine project.
[17:49:14] though this isn't an easy to use format
[18:33:54] Scoring-platform-team, editquality-modeling, revscoring, artificial-intelligence: Get signal from adding/removing images - https://phabricator.wikimedia.org/T172049#3534030 (Natalia) Seems like this feature is not very productive. I ended up using the following regexes: for freestanding picture...
[18:34:18] halfak: o/ ^
[19:05:51] fajne, maybe we should track when an image changes but there's not a removal or addition.
[19:05:53] hmm
[19:06:39] well, i think 90% of the image changes are benevolent
[19:08:20] Right, but we could find conditionals
[19:08:35] e.g. if anon and badword involved then changing the image is often bad.
[19:08:42] not sure.
[19:08:53] only image names and metadata changes being cached above, right?
[19:10:17] With the features I specified, a *change* to an image wouldn't be caught.
[19:10:25] Just changes in the overall count.
[19:10:47] oh, okay
[19:11:36] halfak: in the example Amir brought in phab the "conservative" user is non anon
[19:12:04] fajne, yes, I was just making something up
[19:13:58] since we're making up... in ideal world, image recognition or something like this could be of help))
[19:17:12] fajne, I'm making up in the context of the infrastructures we already have :P
[19:17:24] Example != Novel technology :P
[19:18:00] haha)) also, we can have a condition measuring how "conservative" the edit is. Like, if you're wiping out "gay couple.jpg" it's probably not a tolerant edit... on the other hand, if you add such a picture, you may be a vandal too. Tricky!
[19:18:57] depends a lot on the existing context
[19:19:08] How does one know that the string "gay couple" corresponds to a societal concept that people get their underwear twisted for?
[19:19:29] def underwear_twist_factor(string): ...
[19:19:35] :P
[19:19:49] could use a word list!
[19:19:57] badwords, informals, taboos
[19:20:16] taboos would have references to genitalia, sexuality, etc.
[19:20:37] i am sooo glad it's my last day of internship..
[19:20:54] Amir will be happy to take this task, I am sure!
[19:20:56] lool
[19:22:13] Scoring-platform-team, editquality-modeling, revscoring, artificial-intelligence: Get signal from adding/removing images - https://phabricator.wikimedia.org/T172049#3534158 (Halfak) ``` from revscoring.features import wikitext, modifiers from revscoring.features.meta import aggregators from revsc...
[19:22:18] fajne, ^
[19:22:38] It's definitely Friday :)
[19:28:01] halfak: i'm assuming getting a history of addition of articles to wikiprojects wouldn't be as easy as looking into the db, since the Wikiproject templates are essentially part of the talk page, right?
[19:34:22] codezee right.
[19:34:51] This demonstrates how to get wikiproject tagged pages by template: https://quarry.wmflabs.org/query/20169
[19:34:59] But I think you want to go the other way.
[19:35:00] Sec.
[19:36:40] i wish wikiprojects weren't such a mess
[19:37:17] yes, i was thinking if we know revisions when articles were added to/removed from wikiprojects, those could be a good dataset for reference
[19:37:30] codezee, I don't think we need the revisions
[19:37:38] wikiproject membership is stable
[19:37:44] We just need the current membership.
[19:40:53] codezee, https://quarry.wmflabs.org/query/20968
[19:43:58] this will be useful....i'll bookmark
[19:44:13] ok, so vector space similarity between the representative wikiprojects document vector and new draft vector could be a useful starting/baseline...
[19:45:55] where "representative wikiprojects document vector" -> vector of all current Wikiproject articles as a single document
[19:46:19] though that might be an overkill in terms of computation i think
[19:46:56] This gets a small sample of pages with rough wikiproject membership. It'll need some post-processing: https://quarry.wmflabs.org/query/20969
[19:48:54] ^ Just made it a little better. Refresh
[19:49:53] Anarchism gets "Alternative views", "Libertarianism", "Philosophy", "Politics", "Socialism", and "Sociology".
[19:50:29] Scoring-platform-team-Backlog, Research Ideas, artificial-intelligence: New article review routing AI - https://phabricator.wikimedia.org/T123327#3534221 (Sumit) Also from eranroz, a bot tagging new articles with wikiprojects or lists using a rule-based system - https://en.wikipedia.org/wiki/User:Ale...
[19:53:37] apparently, if we start using mysql's features to a full extent, people might not even need any postprocessing, serve the data directly on a platter from mysql :D
[19:57:05] That'd be great! At least we could have a process that dumps out updated wikiProject matchings.
[19:57:24] harej, didn't you have a proposal for making the directory machine readable and easier to maintain?
[19:57:43] No, I have a working auto-updating directory
[20:01:27] harej: this - https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Directory ?
[20:01:29] yes
[20:09:05] Scoring-platform-team-Backlog, Research Ideas: Create machine-readable version of the WikiProject Directory - https://phabricator.wikimedia.org/T172326#3494902 (Sumit) might be useful if we sync the machine readable format from here, probably using a cron script - https://en.wikipedia.org/wiki/Wikipedia...
[20:11:14] Scoring-platform-team-Backlog, ORES: starting Edit quality ORES campaign for fawiki - https://phabricator.wikimedia.org/T172629#3534264 (Halfak) In that link, I see "draft quality" and "edit quality" campaigns. So regretfully, I'm still confused :/
[20:50:53] Scoring-platform-team, ORES, Operations, Scap, Release-Engineering-Team (Watching / External): Simplify git-fat support for pulling from both production and labs - https://phabricator.wikimedia.org/T171758#3475312 (ksmith) There was a request in the 2017-08-16 Scrum of Scrums from the Scoring...
[20:53:48] Scoring-platform-team-Backlog, ORES: starting Edit quality ORES campaign for fawiki - https://phabricator.wikimedia.org/T172629#3534340 (Yamaha5) I mean fa.wikipedia now has only one campaign (Edit quality) which is finished. I asked to start the second one like enwiki which has two campaigns ("draft qu...
[20:54:44] Scoring-platform-team-Backlog, ORES: starting Draft quality ORES campaign for fawiki - https://phabricator.wikimedia.org/T172629#3534341 (Yamaha5)
[21:01:30] Scoring-platform-team-Backlog, ORES: starting Draft quality ORES campaign for fawiki - https://phabricator.wikimedia.org/T172629#3534344 (Halfak) Ahh! The "second" one is article quality. I'll make some updates to this task. :)
[21:02:57] Scoring-platform-team-Backlog, ORES: Train/test wp10 model for fawiki - https://phabricator.wikimedia.org/T172629#3534349 (Halfak)
[21:52:43] ^ that took longer than expected
[21:54:02] and that ^
[21:54:04] woops
[21:54:08] hurry up github
[21:54:14] VVV down there
[21:54:16] ha!
[21:54:33] and with that, I'm out of here for the weekend.
[21:54:40] I'll be around tomorrow for the hack session.
[21:54:55] Between 1400 and 1700 UTC
[21:55:10] wiki-ai/revscoring#1186 (label_schemas - 5ee2150 : halfak): The build failed. https://travis-ci.org/wiki-ai/revscoring/builds/266138221
[21:56:55] wiki-ai/revscoring#1188 (fix_thresh_opt_pattern - a53d92e : Aaron Halfaker): The build was fixed. https://travis-ci.org/wiki-ai/revscoring/builds/266138593
[21:59:11] Scoring-platform-team, Gerrit, ORES, Operations, and 2 others: Simplify git-fat support for pulling from both production and labs - https://phabricator.wikimedia.org/T171758#3534478 (demon) I thought about it yesterday. We should just bite the bullet and get git-lfs support for Gerrit. This is po...
[23:59:24] Scoring-platform-team, editquality-modeling, revscoring, artificial-intelligence: Get signal from adding/removing images - https://phabricator.wikimedia.org/T172049#3534706 (Natalia) the results: #DAMAGING: Top scoring configurations | model | mean(scores) | std(scores)...
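
Editor's note on the baseline floated at 19:44-19:45 (compare a new draft's vector against one "representative" vector per WikiProject, built by treating all of a project's current articles as a single document) and on the TF-IDF suggestion at 16:52. What follows is only a minimal sketch of that idea under stated assumptions, not code from the channel or from revscoring. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_projects`), the smoothed IDF formula, and the toy texts are all hypothetical and chosen for illustration; real input would presumably come from the Quarry membership queries linked above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per tokenized document."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency: count each term once per doc
    # Smoothed IDF (an assumption of this sketch) so no term gets zero weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_projects(project_texts, draft_text):
    """Rank projects by similarity between each project's single
    concatenated document and the draft, highest first."""
    names = list(project_texts)
    docs = [tokenize(project_texts[name]) for name in names]
    docs.append(tokenize(draft_text))
    # Simplification: IDF is fitted over the project documents plus the draft.
    vectors = tfidf_vectors(docs)
    draft_vec = vectors[-1]
    scores = {name: cosine(vec, draft_vec) for name, vec in zip(names, vectors)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration only.
projects = {
    "Medicine": "vaccine trial patients hospital disease treatment",
    "Politics": "election parliament party vote government policy",
}
draft = "A new vaccine for the disease was tested on hospital patients"

for name, score in rank_projects(projects, draft):
    print(f"{name}: {score:.3f}")
```

Building one vector per project keeps the comparison cheap at scoring time (one cosine per project), though concatenating every member article may be the "overkill in terms of computation" worried about at 19:46; sampling articles per project would be one way to cut that cost.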