[15:07:29] o/
[15:07:31] Good TAG folks!
[15:12:55] buonasera!
[15:17:07] Hey soupault
[15:17:51] I'm still grabbing coffee, but in the meantime it would be great if you told me a little bit more about your interests.
[15:18:22] Hmm, let me see
[15:23:33] I'd say my main passions are math and philosophy (in any order). 'Math' means both 'actual math' and some of the derivative sciences. I personally love control theory a lot, but there are popular sciences which have to be considered in any case (for example, modern """data science""" is actually 5% data science and 95% good old math stat and system identification).
[15:24:22] As for philosophy, I prefer ontology, philosophy of mind and philosophy of technics.
[15:24:28] Hmm... I'd counter that "data science" is 95% scientific practice and 5% stats :P
[15:24:42] But I might be an unusual data scientist :)
[15:25:09] Finally, I'd say that I'm a positive person (in terms of intent) and try to integrate both passions into something valuable
[15:26:19] Cool. So... there are two aspects to the project that I think you might find very interesting.
[15:26:29] E.g. https://phabricator.wikimedia.org/T120138
[15:26:32] Besides those mentioned above, I'm into several other sciences and practices (for example, aerospace instrumentation, ethics, etc.)
[15:27:00] We have a modeling issue where one of our most effective (by the numbers) features is a boolean for whether the editing user is anonymous or not.
[15:27:33] BUT this has a disparate impact on anonymous editors who, in my opinion, should be considered a protected class in Wikipedia.
[15:28:04] ^ This hits both the ethical/philosophical and stats interests -- potentially.
[15:28:31] Any experience with machine learning models and evaluation?
[15:30:08] yes, sure
[15:30:42] I'm working in Python mostly, btw
[15:31:27] Works for us. :) Most of our code is in Python and we work with sklearn substantially.
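One way to put a number on the disparate-impact concern raised above is to compare false-positive rates for anonymous and registered editors. The following is only a minimal sketch under stated assumptions: the record layout `(is_anon, predicted_damaging, actually_damaging)`, the helper name and the toy data are all hypothetical and are not taken from the project's code.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """Return {is_anon: false-positive rate} for a list of
    (is_anon, predicted_damaging, actually_damaging) tuples.

    The tuple layout is an assumption made for this sketch.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for is_anon, predicted, actual in records:
        if not actual:  # only good edits can become false positives
            negatives[is_anon] += 1
            if predicted:
                false_positives[is_anon] += 1
    return {group: false_positives[group] / negatives[group]
            for group in negatives}


# Made-up toy data: (is_anon, predicted_damaging, actually_damaging)
records = [
    (True, True, False),
    (True, False, False),
    (True, True, True),
    (False, False, False),
    (False, False, False),
    (False, True, False),
    (False, False, False),
    (False, True, True),
]

rates = false_positive_rate_by_group(records)
print(rates)
```

A large gap between the two rates would suggest that good edits by anonymous users are flagged more often than good edits by registered users, which is the kind of effect the discussion is worried about.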
[15:35:48] So, it would be great to have you look at our strategy and see if you have any suggestions for how we can improve the behavior of edit quality models around anons.
[15:36:39] We're doing basic supervised learning. Train/test sets. Linear SVM model. See the feature lists here: https://github.com/wiki-ai/editquality/blob/master/editquality/feature_lists/enwiki.py
[15:37:00] The feature list is built in a modular way, but it should be mostly intuitive.
[15:37:21] E.g. find the no_lang_damaging features here: https://github.com/wiki-ai/editquality/blob/master/editquality/feature_lists/util.py
[15:39:09] already looking at them @ https://meta.wikimedia.org/wiki/Talk:Objective_Revision_Evaluation_Service#More_Artificial_Intelligence_for_your_quality_control.2Fcuration_work.
[15:39:19] :)
[15:40:13] am I right that you don't consider article language as a feature?
[15:40:22] at least at the training stage
[15:41:23] soupault, all of the articles in a wiki should be in the same language.
[15:41:57] We can't switch up features between training, testing and use.
[15:42:01] o/ halfak
[15:42:03] they are solidified during training.
[15:42:05] o/ Amir1 !
[15:42:18] Sorry to just drop off yesterday. Timing was tight :(
[15:42:29] np at all
[15:42:34] Yes-yes
[15:42:38] I'm happy you're around
[15:42:45] Amir1, will do the push to ores-staging right now.
[15:43:04] does it have better AUC?
[15:43:55] AUC won't change since I'm just using your feature set to operate.
[15:44:07] But the server behavior might change.
[15:44:13] awesome
[15:44:30] Amir1, what is the new version of wb-vandalism?
[15:44:38] any news re edit quality campaign for wb
[15:44:46] 0.1.6
[15:45:02] Amir1, I've got the data ready. We should be able to get the campaign loaded in the next hour :)
[15:45:13] \o/
[15:45:36] awesome
[15:47:36] Also, let me introduce you to soupault, who has been interested in the project, so I'm trying to find interesting projects that need some work.
[15:47:51] Hello there!
[15:48:06] Amir1 is one of our team members who is working primarily on Wikidata-related modeling stuff.
[15:49:00] But Amir1 does a lot of other wiki-tooling work like pywikibot and his ANN Kian.
[15:49:13] Kian: https://github.com/Ladsgroup/Kian
[15:50:11] Also, Amir1's answers will be short because he broke his hand and shouldn't be typing too much.
[15:50:16] :P
[15:50:42] * halfak gets back to doing work that he promised Amir1 :)
[15:50:44] :D
[15:50:53] thanks halfak :)
[15:51:23] soupault: happy to meet you :)
[15:51:40] Amir1, same here!
[15:54:14] What do you like to do soupault ?
[15:54:36] Amir1, looks like the issues re. pywikibase are worked out, but the model behavior hasn't changed much.
[15:55:11] BUT, I have an idea. I think I should take your labeling of reverted/not-reverted and re-extract features. There could be something weird between your extraction and mine.
[15:55:13] Amir1, I like to do research ;). Talk to you a bit later, I have to finish a preliminary investigation of the proposed project
[15:55:48] Amir1, I pointed soupault to the issues around user.is_anon and our general process for supervised learning.
[15:56:01] great :)
[15:56:10] * halfak goes to copy-paste bits of soupault's discussion of interests from the scrollback
[15:56:24] [09:23:33] <soupault> I'd say my main passions are math and philosophy (in any order). 'Math' means both 'actual math' and some of the derivative sciences. I personally love control theory a lot, but there are popular sciences which have to be considered in any case (for example, modern """data science""" is actually 5% data science and 95% good old math stat and system identification).
[15:56:24] [09:24:22] <soupault> As for philosophy, I prefer ontology, philosophy of mind and philosophy of technics.
[15:56:27] halfak, thanks in advance ;)
[15:56:32] :D
[15:56:57] that's great
[15:57:34] halfak: I'll send all of them to you
[15:57:41] in one hour
[15:57:51] No worries. I think I can work with what you've already given me :)
[15:57:59] * halfak makefiles EVERYTHING
[15:59:36] And here we go :)
[15:59:38] halfak, do you use custom weights for FP, FN? I can see that the number of FN is much higher for the 'reverted' and 'damaging' models, while it is the opposite for 'goodfaith'
[16:00:05] soupault, good Q. So we train with a balanced dataset and test with a dataset that reflects the real world.
[16:00:24] We do this by re-weighting the input dataset.
[16:00:26] halfak, nice, I see
[16:00:43] So, the actual classification is not very useful since it has a lot of false positives (assuming a 50/50 split)
[16:00:50] Instead we set thresholds on the probability.
[16:01:30] Generally 0.70 is worth review, 0.80 is likely to be damage and 0.90 is definitely damage (with exceptions, of course)
[16:01:59] Amir1, we might be able to get some model tuning in for wikidata today too. That would be pretty cool.
[16:02:28] I'm seriously considering removing LinearSVC from our model tuning utility because the model behaves so badly.
[16:02:40] 1/5 chance the model will never converge.
[16:03:00] Which is the *exact opposite* of what the docs say (rbf is supposed to be harder to converge)
[16:03:38] halfak: amazing
[16:03:49] It seems to me that it might be better to extract more features and train something like a GLM
[16:04:15] soupault, oh yes. So we have LogisticRegression in our model tuner and I have been seeing surprisingly good performance from it.
[16:09:47] o/ guillom
[16:10:30] Any chance I can fill your morning with a ton of French curse words and racial slurs?
[16:10:43] * halfak realizes that came out badly :/
[16:11:03] See https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/fr
[16:11:08] halfak, could you add some comments on ruwiki support?
[16:11:22] Oh! Yeah. Let me check the status of things.
[16:11:44] Looks like the main blocker is setting up badword/informal detection for the language.
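The re-weighting and probability thresholds described at 16:00-16:01 above can be sketched roughly as follows. This is a hedged illustration, not the project's code: `balanced_weights` and `triage` are hypothetical helper names, the weighting formula is the common n_samples / (n_classes * class_count) heuristic (the one scikit-learn uses for `class_weight="balanced"`), and whether the project weights its data exactly this way is an assumption. Only the 0.70 / 0.80 / 0.90 cut-offs come from the discussion.

```python
from collections import Counter


def balanced_weights(labels):
    """Per-sample weights that give every class the same total weight.

    Uses n_samples / (n_classes * class_count); assumed here, not
    confirmed as the project's own re-weighting scheme.
    """
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return [total / (n_classes * counts[label]) for label in labels]


def triage(probability, review=0.70, likely=0.80, definite=0.90):
    """Map a predicted damage probability to the rough bands quoted above."""
    if probability >= definite:
        return "definitely damage"
    if probability >= likely:
        return "likely damage"
    if probability >= review:
        return "worth review"
    return "no action"


# Toy example: one damaging edit among four
labels = [True, False, False, False]
weights = balanced_weights(labels)
print(weights)
print([triage(p) for p in (0.50, 0.72, 0.85, 0.95)])
```

Reading the score as a probability and choosing a cut-off per use case avoids relying on the default 50/50 decision boundary, which the discussion notes produces many false positives on real-world data.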
[16:11:45] https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/ru
[16:12:04] We need someone to go through this list of automatically generated *potential* badwords and filter them for us.
[16:12:27] I'll do it with great pleasure
[16:12:35] I.e. copy-paste words from the "Generated list" into either the "badword" or "informal" list as appropriate.
[16:12:37] yay!
[16:13:04] Then I'll incorporate that into a language-based feature set. Those live here: https://github.com/wiki-ai/revscoring/tree/master/revscoring/languages
[16:13:38] what about datasets? were those labeled manually in the past? sorry, but I'm not familiar with the wiki review procedure
[16:14:02] sorry for the bad English :/
[16:14:14] soupault, so we have a few different models. "reverted" is trained on past edits, so we don't need labels for that.
[16:14:30] We also have "damaging" and "goodfaith" models that need human judgement applied to new edits.
[16:14:33] I only have a B2 certificate :)
[16:14:53] It turns out that we've already started a labeling campaign for ruwiki based on popular demand.
[16:15:19] See https://ru.wikipedia.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D1%8F:%D0%9E%D1%86%D0%B5%D0%BD%D0%BA%D0%B0
[16:15:27] OMG ENCODING
[16:15:42] it's usual :)
[16:16:01] :D
[16:19:02] sweet! much clearer now
[16:19:26] :)
[16:20:30] so, only 4.8% of the requested dataset is completed at the moment
[16:20:56] Sounds about right. We need someone local on ruwiki to help explain why this is important and organize the work.
[16:21:15] The first 4 to finish were the ones where we had native speakers doing coordinating work.
[16:21:43] English(me), Portuguese(Helder), Persian(Amir1) and Turkish(ToAruShiroiNeko)
[16:22:05] But in fairness, the ruwiki campaign hasn't been running that long.
[16:23:47] I see. Ok, I'll investigate what can be done to kick off the activity
[16:23:57] Thanks!
[16:31:33] halfak, regarding swear words: how are those being selected? I mean, do I really need to classify those 250, or might it be better to propose a more complete list composed from other sources?
[16:31:51] soupault, more complete lists are always welcome.
[16:32:23] halfak, so what is the syntax for regexps?
[16:32:31] These words were generated by applying a TF-IDF strategy -- essentially, these are words that are commonly added in *reverted* edits but not in edits generally.
[16:32:47] enough said :)
[16:32:59] soupault, no regexp on the wiki if that is OK. That way, I can just pull the word list into the unit tests.
[16:33:22] But if you would like to write regexes, putting them on the talk page or inside the language files in 'revscoring' would be a great help.
[16:33:24] halfak, you know, Russian is quite rich
[16:33:36] also, I'm looking at the /en section and there are some regexps
[16:33:58] yeah. There shouldn't be. That is because English is one of our oldest languages and we've improved our process since then.
[16:34:11] We will want regexes eventually.
[16:34:25] See our test case for English here: https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/tests/test_english.py#L10
[16:34:32] Note no regexes.
[16:34:44] It's good to be able to test our regexes against real words. :)
[16:36:35] Sure, sure. Pretty nice collection though :D
[16:39:04] :D. It's a weird hobby, isn't it?
[16:42:50] o/ halfak
[16:42:54] I wouldn't say so
[16:43:45] Hey guillom!
[16:43:57] We're having a badwords party. Want to join us?
[16:44:34] Ugh. I'll pass this time :)
[16:44:53] OK. No worries.
[16:45:01] Just to be clear, this is to feed into the edit quality stuff.
[16:45:18] If you know any other frwiki-pedians who would be interested in helping, please send them our way.
[16:45:24] got it
[16:45:28] o/ ellery
[16:45:35] gribeco might be interested
[16:45:50] He developed Salebot, the frwiki equivalent of Cluebot.
[16:46:07] haven't seen him on IRC for a while though.
[16:46:09] Oh! Yeah. I'm familiar with Salebot.
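The idea mentioned at 16:32:31, surfacing words that are added often in reverted edits but rarely in edits generally, can be sketched as a simple frequency ratio. This is not true TF-IDF and not the project's actual generator: the function name, the whitespace tokenization, the add-one smoothing and the toy data below are all assumptions made for illustration.

```python
from collections import Counter


def candidate_badwords(reverted_edits, all_edits, min_count=2, top_n=10):
    """Rank words by how over-represented they are in reverted edits.

    Score = (rate in reverted edits) / (smoothed rate in all edits).
    A simple ratio used for illustration, not a full TF-IDF.
    """
    reverted_counts = Counter(
        word for edit in reverted_edits for word in edit.lower().split())
    overall_counts = Counter(
        word for edit in all_edits for word in edit.lower().split())
    reverted_total = sum(reverted_counts.values())
    overall_total = sum(overall_counts.values())

    scores = {}
    for word, count in reverted_counts.items():
        if count < min_count:  # skip words seen too rarely to judge
            continue
        reverted_rate = count / reverted_total
        # Add-one smoothing keeps the denominator away from zero
        overall_rate = (overall_counts[word] + 1) / (overall_total + 1)
        scores[word] = reverted_rate / overall_rate
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Made-up toy data: text added by each edit
reverted = ["you are stupid", "stupid page", "the page is stupid"]
everything = reverted + [
    "the page cites the source",
    "added the source for the claim",
    "the claim is sourced",
]

candidates = candidate_badwords(reverted, everything)
print(candidates)
```

A list produced this way only contains *potential* badwords, which is why a native speaker still has to sort the candidates into the "badword" and "informal" lists by hand.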
[16:46:22] I wonder if we can just steal a bunch of regexes from salebot's code.
[16:48:20] Gribeco is at https://fr.wikipedia.org/wiki/Utilisateur:Gribeco
[16:48:31] I probably have his email address somewhere as well.
[16:48:39] (And he lives in the US.)
[16:49:14] guillom, thanks! I'll reach out on his talk page. :)
[16:49:20] * guillom is going to go afk for a bit; time to make progress on a few great books.
[16:49:32] bbl
[16:49:32] hah, Salebot
[16:49:39] is "sale" describing the bot or the stuff it cleans up?
[16:49:54] o/ guillom
[16:50:13] harej: Not sure about the etymology ;)
[16:50:52] My bot is "Bot de Sept Lieues" i.e. "Seven-League Bot". We do like our puns on frwp.
[16:51:40] Le Bot du 18 Brumaire
[17:23:50] Hello
[17:24:20] o/ White_Cat_mobile
[17:24:26] I kinda will be afk for 1-2 hours
[17:24:42] OK. I'll probably be gone when you get back.
[17:24:45] But two things.
[17:24:45] Dinner and all. :-)
[17:25:01] I just loaded the edit quality campaign for Wikidata
[17:25:12] soupault is currently working on the Russian badwords lists. :)
[17:25:32] yeah, since my birth >_<
[17:26:12] White_Cat_mobile (AKA ToAruShiroiNeko) does a lot of work around community organizing -- making sure we have native speakers to help us curate badwordlists and making sure we have people working on our quality labeling campaigns.
[17:26:29] He's responsible for a substantial amount of the breadth we have in language coverage. :)
[17:26:42] soupault is a new volunteer who might be working with us for a while :)
[17:28:20] I've got the message - should always contact White_Cat_mobile delicately ;)
[17:28:56] He is the badwordslist master ;)
[17:29:17] * halfak imagines dumping a bucket of curses in 15 different languages at someone.
[17:30:13] I'd pay for that
[17:31:55] Sure. I'll be available later today. :-)
[17:34:17] White_Cat_mobil_, I'll probably be gone when you get back. Jenny's got some people coming for a visit so I'll need to be an IRL person for the afternoon.
[17:40:39] Sure no probs
[17:41:04] If you could handle Urdu that'd be awesome
[17:42:29] White_Cat_mobil_, hopefully I can do that tomorrow. :)
[17:42:47] I'll start up some of the long-running jobs now.
[17:43:07] Sure no problem
[17:51:01] OK. Running the pre-labeler
[17:51:11] Should be able to load up the dataset tomorrow.
[17:51:43] White_Cat_mobil_, when you get back, could you go to our Urdu collaborators and ask for a translation of "Edit quality (20 random sample)" into Urdu?
[17:51:50] * halfak runs away
[17:52:21] Sure
[18:48:58] From Amit Belani's *Vandalism Detection in Wikipedia: a Bag-of-Words Classifier Approach*: "A sufficiently large sample of this [pages-meta-history.xml.bz2] dataset was used. Using the entire dataset was not feasible, as the implementation of the learning algorithm requires the training data to be held in memory."
[18:49:00] Why?
[18:55:58] Amir1: hey
[18:57:25] hey aetilley
[18:57:27] :)
[18:59:05] Thanks for the papers.
[19:00:51] yw
[19:00:55] :)
[19:01:53] If you can read some more and give us some results on what we should do, that would be amazing
[19:04:47] I'm looking at the Belani paper now. I'm optimistic.
[19:05:44] So far I only have one question, which I asked before you arrived.
[19:06:05] idk, if you have the ability to scroll back.
[19:06:17] I can paste again o.w.
[19:26:32] Amir1: Can you send me the link to the featurelist again? I keep losing it.
[19:26:48] I have too many bookmarks...
[19:27:00] Will save this time.
[19:27:49] It might be nice (although I'm not volunteering) if we had a nice summary of all current features somewhere linkable from the revscoring wiki page.
[19:30:14] https://github.com/wiki-ai/wb-vandalism/blob/master/wb_vandalism/feature_lists/wikidata.py
[19:30:19] aetilley: ^
[19:31:02] Thanks!
[19:31:09] Also google just produced this:
[19:31:10] http://pythonhosted.org/revscoring/revscoring.features.html
[22:21:47] Hello. Let's deal with this tomorrow please. I am a bit too tired
[22:22:01] Sorry about this
[23:44:54] halfak: https://www.wikidata.org/wiki/Wikidata:Edit_labels
[23:44:57] around?