[15:20:51] o/
[15:36:15] o. Amir1
[15:36:27] o/ halfak :)
[15:36:32] How's the work on GlobalUser stuff going?
[15:36:41] I'm working on the global user thing
[15:36:47] there are two obstacles
[15:37:00] 1- it's hugely expensive, there is no batching
[15:37:32] 2- Getting user registration is not as easy as it sounds
[15:37:50] e.g. https://en.wikipedia.org/w/api.php?action=query&meta=globaluserinfo&guiuser=Ladsgroup&guiprop=groups|rights|editcount
[15:38:19] it says I registered in 2008, but that's when I merged my accounts
[15:38:25] I started in 2006
[15:38:35] (or 2005, I lost count)
[15:38:59] halfak: ^
[15:39:14] (I also made my last PR re. flake8 in sigclust)
[15:40:10] Amir1, understood.
[15:40:13] This might not be a problem
[15:40:19] Since 2008 was sufficiently long ago
[15:40:38] We might also just take the max(user.age, globaluser.age)
[15:40:55] * halfak considers the length of 7 years
[15:41:17] agreed
[15:41:26] but about performance
[15:41:29] egh. "editcount" is a string!
[15:41:36] what do you think?
[15:41:37] WHY
[15:41:47] Oh yeah. Performance. I'm not too worried.
[15:41:58] Our pre-cacher doesn't do batching.
[15:42:10] (I thought of using more global variables, like global edit count)
[15:42:19] And most of the ScoredRevisions use-case should be hitting cached scores so batching won't play out that often there.
[15:42:29] It seems like historical analyses will be the hardest hit.
[15:42:37] Also we should file a bug.
[15:42:43] You should be able to batch this.
[15:42:49] good point
[15:42:54] I know that anomie is usually fast at getting to these types of bugs.
[15:42:56] let me talk to Brad
[15:43:50] :)
[15:44:01] I'm not sure if he thinks it's feasible but worth a shot
[15:44:12] I'll file a bug about editcount being a string.
[15:44:12] in the meantime, I use the no-caching way?
[15:44:25] no-caching?
[15:44:32] no-batching
[15:44:34] sorry
[15:44:36] Yeah.
[15:44:39] That's just fine.
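Note: the two workarounds discussed above (taking the larger of the local and global account ages, and coping with "editcount" arriving as a string) can be sketched roughly as follows. This is a minimal sketch, not the code that went into revscoring. The query parameters are copied from the URL pasted in the chat; the helper names, the ISO 8601 timestamp format, and the example dates in the usage note are assumptions made for illustration.

```python
from datetime import datetime, timezone


def globaluserinfo_params(username):
    """Build query parameters for one user (one request per user, no batching)."""
    return {
        "action": "query",
        "meta": "globaluserinfo",
        "guiuser": username,
        "guiprop": "groups|rights|editcount",
        "format": "json",
    }


def parse_timestamp(value):
    # Assumption: timestamps look like "2008-05-26T00:00:00Z" (UTC).
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def account_age_seconds(local_registration, global_registration, now):
    """Return max(user.age, globaluser.age) in seconds, or None if both are missing."""
    ages = [
        (now - parse_timestamp(ts)).total_seconds()
        for ts in (local_registration, global_registration)
        if ts is not None
    ]
    return max(ages) if ages else None


def coerce_editcount(value):
    # Accept both the string form seen in the chat and a plain integer.
    return int(value)
```

For example, with a made-up local registration in 2006 and a global (merge) date in 2008, account_age_seconds returns the age counted from 2006, which is the behaviour suggested in the chat.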
[15:44:58] If you want to keep hacking, I'll file the bugs.
[15:44:59] (I think edit count as string is so easy, I can make a patch for it)
[15:45:22] (unless you've already started)
[15:45:34] I'll do it and subscribe you too :)
[15:45:46] OK :)
[15:46:32] wait a sec
[15:46:37] this is interesting
[15:46:46] https://en.wikipedia.org/w/api.php?action=query&&list=users&ususers=Ladsgroup&meta=globaluserinfo&guiuser=Dexbot&guiprop=groups|editcount
[15:46:57] halfak: ^ this one is not a string
[15:47:26] it's not a string at all
[15:47:44] https://en.wikipedia.org/w/api.php?action=query&meta=globaluserinfo&guiuser=Ladsgroup&guiprop=groups|rights|editcount|merged
[15:47:51] Look at the merged accounts
[15:48:02] Sorry. I wasn't specific earlier
[15:48:37] reproduced
[15:48:38] thanks
[15:48:40] :)
[15:51:16] I'm looking at the code so maybe I can find the bug and fix it myself
[15:52:44] \o/ open source & SOFIXIT :)
[15:52:57] * halfak should really do more work in MediaWiki
[15:53:06] *inside of* mediawiki
[15:55:22] I checked; it's more a matter of Extension:CentralAuth
[15:55:37] and the second bug (batching) is also a part of this extension
[15:55:48] so legoktm can help us
[15:58:05] Why is this extension so huge
[15:58:31] god bless shallow cloning
[16:01:21] writer is Roan
[16:10:59] o/ aetilley
[16:11:46] hi halfak
[16:12:09] I'm reading through the blog post. Good stuff.
[16:13:27] \o/
[16:14:00] We're working on getting a fancy graphic from one of the designers at the WMF. :)
[16:14:06] :)
[16:14:28] Yeah, I was going to ask how you decided on the revscoring project logo.
[16:14:45] pretty badass, but also looks like some kind of secret society. :)
[16:15:06] lol yeah. That was ToAruShiroiNeko and some volunteer work he got for us.
[16:15:18] You know you could combine the recursive gear GIF for ORES with the triangle of revscoring
[16:15:36] a triangle of spinning gears...
[16:15:43] anyway....
[16:15:48] :D
[16:22:20] I was thinking about exactly what parameters the user has under their control if they want to use ORES....
[16:22:27] This would be:
[16:22:31] 1 the context
[16:22:34] 2 the model
[16:22:45] 3 a (batch of) rev_ids
[16:23:02] but that's about it, right?
[16:24:05] By the time someone is in a position to use ores, the model they are going to use is already trained, so we're limited in what additional parameters we can include...
[16:24:12] halfak: https://gerrit.wikimedia.org/r/#/c/254650/
[16:24:20] aetilley, yeah. That's right.
[16:24:36] But there's a feature request that I'd like to address sometime soon that will change that.
[16:24:54] \o/ Amir :)
[16:25:16] aetilley, https://github.com/wiki-ai/ores/issues/101
[16:25:25] See also https://github.com/wiki-ai/ores/issues/100
[16:25:42] So, the idea is that a user would also be able to re-write certain features that are used in prediction.
[16:25:56] This would allow them to ask "what if" questions with the model.
[16:26:02] ok
[16:26:03] E.g. "What if this user was anonymous."
[16:26:11] Or "What if this article had an image?"
[16:26:13] +2d
[16:26:17] by reedy
[16:26:21] woot
[16:27:19] aetilley, I've already started some work in that direction and I think it would not be too difficult to complete.
[16:28:59] I haven't worked out exactly how ORES will handle a new request structure,
[16:29:29] But it seems like you could do something like /enwiki/damaging/3423423?user.age=3472642
[16:30:10] or /enwiki/wp10/2342349?revision.level_1_headers=5
[16:30:51] yes
[16:33:21] What I was thinking, although this would be difficult, would be that, at the lowest level, we just provide a huge set of potential features and somehow build a model that will allow the user to score a revision based on the subset of features that they find relevant.
[16:33:22] ...
[16:34:03] but I think this might be challenging. We'd be effectively training one model to deal with partial vectors of features.
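Note: the "what if" request floated above could be put together along these lines. The chat says the request structure had not been worked out yet, so this is only a sketch of the idea. The path layout follows the two example URLs in the chat; the function names and the notion of merging overrides into a dictionary of extracted feature values are hypothetical.

```python
from urllib.parse import urlencode


def build_score_path(context, model, rev_id, overrides=None):
    """Build a path such as /enwiki/damaging/3423423?user.age=3472642."""
    path = "/{0}/{1}/{2}".format(context, model, rev_id)
    if overrides:
        # Feature overrides travel as ordinary query parameters.
        path += "?" + urlencode(overrides)
    return path


def apply_overrides(extracted, overrides):
    """Return a copy of the extracted feature values with overrides applied."""
    merged = dict(extracted)
    merged.update(overrides)
    return merged
```

Keeping the override step separate from feature extraction means the model only ever sees a complete feature vector, which sidesteps the partial-vector problem raised at the end of the exchange.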
[16:34:37] * aetilley is changing location.
[16:36:22] aetilley, yeah. I don't see a good way to do partial models.
[16:36:27] Regretfully.
[16:42:33] * aetilley is back
[16:45:31] halfak, have you seen this report from hewiki? https://phabricator.wikimedia.org/T118982 (Adding link to existing text shouldn't get high "reverted" score)
[16:45:40] anything to add about that?
[16:47:33] Man. Eranros is really damning of the model. Does he see us as competition or something?
[16:52:45] dunno
[16:54:09] What was done for the first models about the removal of interwikis migrated to wikidata? just chose another random sample from a different period?
[16:54:39] Helder, I think Amir filtered out all language pairs during TF-IDF
[17:00:02] Helder, generally, mass removals of links getting flagged as potentially damaging seems like a good thing.
[17:00:12] So I'm not too worried about that one.
[17:00:21] I'm more worried about anons getting overweighted.
[17:00:28] even removal by bots?
[17:00:28] That can have some substantial social consequences.
[17:00:43] yah
[17:00:45] Helder, good point. user.is_bot should get that, but maybe bots get reverted a lot.
[17:01:12] Helder, maybe bots get reverted by vandals a lot in hewiki?
[17:01:26] There might be a good way to address that in our revert detection.
[17:01:52] e.g. we only count reverts of revisions that were not eventually reverted back to.
[17:02:05] I think we're already detecting that -- just not paying attention to it.
[17:02:25] https://github.com/wiki-ai/editquality/blob/master/editquality/utilities/label_reverted.py#L89
[17:03:14] Those three values that come out of mwreverts.api.check are (reverting info, reverted info, reverted_to info)
[17:03:23] Right now, we only care about the (reverted info)
[17:03:34] hmm
[17:03:39] but we can use the (reverted_to info) to find out if there's an edit war going on.
[17:04:20] Maybe we could only count "revert_to" events if they are done by someone other than the user who was originally reverted.
[17:04:47] 1. vandal edits
[17:04:50] 2. someone reverts
[17:04:56] 3. vandal reverts someone
[17:05:07] 4. someone else reverts back to 2.
[17:05:19] Then 2. would count as not-reverted.
[17:06:31] * halfak tried to implement that
[17:09:15] there will always be a symmetrical situation like
[17:09:15] 1. good user edits (e.g. to make it more NPOV)
[17:09:15] 2. someone reverts (to restore his POV)
[17:09:15] 3. good user reverts to NPOV
[17:09:17] 4. someone else supporting the POV reverts back to 2.
[17:09:56] Helder, yeah. I'm just guessing that content disputes like that are things we don't want to learn anyway.
[17:10:05] Our model isn't going to be good at picking up point of view.
[17:10:18] NLP isn't quite there yet.
[17:10:38] Then again, we might be able to add some features to help us to detect opinionated content.
[17:13:49] halfak: Question, suppose we train the "reverted" classifier on some training set S
[17:14:05] Helder, https://github.com/wiki-ai/editquality/pull/1/files
[17:14:07] What do you think?
[17:14:32] The ORES description says that this classifier is supposed to predict whether the revision will "eventually" be reverted
[17:14:40] aetilley, indeed.
[17:16:09] * halfak loves having tools like 'mwreverts' :)
[17:17:49] So if a revision was edited heavily back to a previous version this would not count as a revert.
[17:18:05] Also...
[17:19:05] aetilley, that's right. But in practice that is rare.
[17:19:17] ok,
[17:19:58] and never mind my other question. I guess I was forgetting that the only chance someone gets to revert an edit is to be the first editor on that page after the edit in question.
[17:20:28] aetilley, actually, it's the next two edits.
[17:20:29] So "eventually gets reverted" is equivalent to "next page edit is a revert"
[17:20:32] oh
[17:20:32] And within 48 hours.
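Note: the labeling rule discussed above can be illustrated with a toy history. This is a simplified sketch, not the code in label_reverted.py and not a call into mwreverts: it detects identity reverts by comparing checksums, uses the "next two edits" and "within 48 hours" limits mentioned in the chat, and ignores a revert if someone other than the reverted editor later restores the revision. The Revision tuple, the function names, and the sample data are all made up for illustration.

```python
from collections import namedtuple

Revision = namedtuple("Revision", ["rev_id", "user", "timestamp", "sha1"])

RADIUS = 2                 # a revert must come within the next two edits
WINDOW = 48 * 60 * 60      # ... and within 48 hours (in seconds)


def find_reverts(history, radius=RADIUS):
    """Return (reverting, reverted, reverted_to) triples for identity reverts."""
    reverts = []
    for j, rev in enumerate(history):
        # Look back for an identical checksum with 1..radius revisions between.
        for i in range(j - 2, max(j - radius - 1, 0) - 1, -1):
            if history[i].sha1 == rev.sha1:
                reverts.append((rev, history[i + 1:j], history[i]))
                break
    return reverts


def label_reverted(history, radius=RADIUS, window=WINDOW):
    """Map rev_id to True when the revision counts as reverted."""
    reverts = find_reverts(history, radius)
    labels = {}
    for rev in history:
        was_reverted = any(
            rev in reverted and reverting.timestamp - rev.timestamp <= window
            for reverting, reverted, _ in reverts
        )
        # A restore only counts when done by someone other than the author.
        restored_by_other = any(
            reverted_to == rev and reverting.user != rev.user
            for reverting, _, reverted_to in reverts
        )
        labels[rev.rev_id] = was_reverted and not restored_by_other
    return labels


# The four-step scenario from the chat, preceded by a baseline revision.
history = [
    Revision(0, "alice", 0, "A"),
    Revision(1, "vandal", 60, "B"),    # 1. vandal edits
    Revision(2, "bob", 120, "A"),      # 2. someone reverts
    Revision(3, "vandal", 180, "B"),   # 3. vandal reverts someone
    Revision(4, "carol", 240, "A"),    # 4. someone else reverts back to 2.
]
labels = label_reverted(history)
```

In this toy history the vandal's edits (1 and 3) stay labeled as reverted, while revision 2 counts as not-reverted because a different editor restored it. Revision 1 is also "reverted back to" in step 3, but by the vandal, so that restore is ignored.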
[17:20:56] This is based on work I did here: https://www.researchgate.net/profile/Rstuart_Geiger/publication/262177230_When_the_levee_breaks_without_bots_what_happens_to_Wikipedia%27s_quality_control_processes/links/00b7d539f007fef6fd000000.pdf
[17:21:08] And assumptions that other wikis look like enwiki
[17:21:58] ok
[17:22:56] Helder, that pr I had you look at is *so* wrong. I dunno what I was thinking :S
[17:23:01] * halfak fixes
[17:26:10] OK. Looks like it is working now.
[17:26:19] I should run this on enwiki so I can read the edits :S
[17:30:18] IT WORKS
[17:30:32] I caught an edit that was reverted back to by the same editor.
[17:30:35] Who is probably a sock
[17:30:44] \o/
[17:30:50] OK. pushing changes.
[17:30:58] We can try this against all of the revert models.
[17:31:17] If I'm right in my thinking, we'll see a small boost in AUC and also some better behavior.
[17:31:36] It'll take a few hours to get to that point though :/
[17:31:52] Reverted labeling and feature extraction take a long time
[17:33:06] * halfak self merges.
[17:33:13] Now to increment the version #
[17:55:20] OK. New revert models being generated.
[17:55:39] It'll be interesting to see if the predictions change for Eranros' false positives.
[18:08:29] * halfak sighs about Eranros' misunderstanding of kernels in svms
[18:53:32] sorry for the delay...
[18:54:55] So, this reverted_to stuff is working really well with the sample I'm playing around with
[18:55:08] All of the "reverts" that get removed because of it really ought to.
[18:55:22] They are almost all re-reverts of vandalism.
[18:55:46] We have a problem with flagging vandal-reverting edits as damaging -- this will probably help us avoid that :)
[18:55:51] Helder, ^
[18:56:23] Helder, one other thing that I could really use your input on: https://phabricator.wikimedia.org/T107196
[18:56:40] We're moving ORES to production. The RESTBase team really wants to bury ORES in their API spec.
[18:56:56] YuviPanda, and my ops support disagree.
[18:57:36] But from your perspective as a consumer of the service, you probably have a good idea of where you'd like the documentation to live.
[18:57:39] I was reading that today, but I don't have any stronger preferences
[18:57:44] Gotcha.
[18:58:23] any way will probably be fine for me
[19:00:59] ah fun
[19:03:47] halfak: have I shown you tmpnb.org
[19:04:09] Oh yeah.
[19:05:49] * halfak is so tired of thinking about this debate about RESTBase.
[19:08:24] halfak: haha :D
[19:09:07] Everyone just seems to assert "Unified APIs" and "discover-ability" without realizing that RESTBase was the first major split from the unified API for MediaWiki
[19:09:39] See, I think it would be great if ORES was part of the MediaWiki API.
[19:13:48] halfak: so legoktm's proposed that too
[19:14:20] I'm not sure what the considerations would be, but if we're talking about discover-ability, that'd be it.
[19:14:45] Honestly, I don't think that discover-ability is a big problem.
[19:14:48] I'd personally find it confusing since mw api's structure is totally different from ours
[19:14:52] oh yeah, I totally agree :)
[19:15:36] Maybe we just set up our endpoint at /api/ores/ and then tell the Services team that they are welcome to provide routes to that.
[19:16:01] And they'll be a user of ORES rather than making me maintain parts of RESTBase.
[19:16:21] that won't be confusing at all :D
[19:16:41] I agree we should just set it up under /api/ores
[19:16:43] Meh. We have pages and revisions showing up in multiple places.
[19:18:06] * halfak ignores it all and works on increasing our language support
[19:18:09] :D
[19:18:51] halfak: +1
[19:19:04] halfak: the swagger stuff itself might be useful - see http://kubernetes.io/third_party/swagger-ui/#!/api%2Fv1/createNamespacedPod
[19:19:11] kubernetes has a swagger spec and I find it useful
[19:19:15] I agree. I set up a task for it.
[19:21:35] +1
[19:22:48] halfak: I'm setting up a tmpnb.org type setup on tools now
[19:22:55] with easy access to dbs, dumps, etc :D
[19:23:05] I was selling J-Mo on it yesterday at dinner
[19:23:45] +1 for easy access to dumps.
[19:23:59] I'd love to have some dataset generating scripts look like ipython notebooks.
[19:24:11] yeah
[19:24:16] E.g. mwreverts, mwsessions, mwpersistence, mwdiffs, mwrefs, etc.
[19:24:20] mw mw mw mw
[19:24:24] we can make them all be there by default
[19:25:28] * YuviPanda laughhsssssss
[19:27:14] I really need to get this moved to the main namespace: https://www.mediawiki.org/wiki/User:Halfak_%28WMF%29/mediawiki-utilities
[19:31:05] halfak: more AI IEGs! https://meta.wikimedia.org/wiki/Grants:IEG/Semi-automatically_generate_Categories_for_Vietnamese_Wikipedia
[19:35:22] YuviPanda, will try to read and comment when my brain isn't so full of frustrated thoughts :(
[19:35:36] * YuviPanda hugs halfak
[19:35:39] 'tilll alll beee ok
[19:35:41] !
[19:35:49] :) Thanks YuviPanda
[19:35:57] One thing I'm going to be looking for is the plan to release datasets for training and evaluating against.
[19:36:22] That way, others can critique and build on the work in this IEG.
[19:36:49] ToAruShiroiNeko, would be a good reviewer too.
[19:39:13] * halfak consoles himself by implementing Ukrainian curse words
[19:44:00] Crap.
[19:44:16] I can't actually use the awesome work that was done for our ukrainian list.
[19:44:34] The user made the regular expressions for us, but I actually need the plain words as examples.
[19:47:52] awww
[19:48:20] Too awesome for the task ;)
[19:53:11] So, I just ran our list against the badwords regexes and picked the matches out.
[19:53:13] Good enough.
[19:53:21] At least we'll know if we break this in the future.
[19:53:35] Can't confirm that the regexes are working as expected now.
[19:54:31] і != i
[19:54:38] ^ Those are two different characters!
[19:55:07] OK. We have a working ukranian
[19:55:12] * ukrainian
[19:55:55] Now for estonian
[19:56:09] Nooo! More regexes!
[19:56:37] ToAruShiroiNeko, !
[19:56:49] Are you telling people to filter the word lists into regexes?
[20:12:28] OK. I did the same with estonian
[20:17:35] https://github.com/wiki-ai/revscoring/pull/214
[20:17:39] * halfak runs away
[20:17:42] have a good saturday!
[20:20:25] wiki-ai/revscoring#320 (new_languages - e1e7dd5 : halfak): The build failed. https://travis-ci.org/wiki-ai/revscoring/builds/92474950
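Note: the workaround described above (running a plain word list against the badwords regexes and keeping the matches as examples) and the "і != i" pitfall can be sketched as follows. This is an illustration only, not the revscoring language code. The single pattern is a harmless stand-in built around the Ukrainian word for "cat", and the names are hypothetical.

```python
import re

CYRILLIC_I = "\u0456"  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
LATIN_I = "\u0069"     # LATIN SMALL LETTER I

# Hypothetical stand-in for a list of badword regexes.
PATTERNS = [
    re.compile("\u043a" + CYRILLIC_I + "\u0442" + r"\w*", re.IGNORECASE),
]


def pick_matches(words, patterns=PATTERNS):
    """Keep only the words that fully match at least one pattern."""
    return [
        word for word in words
        if any(pattern.fullmatch(word) for pattern in patterns)
    ]


candidates = [
    "\u043a" + CYRILLIC_I + "\u0442",            # spelled with the Cyrillic letter
    "\u043a" + LATIN_I + "\u0442",               # looks the same, Latin letter inside
    "\u043a" + CYRILLIC_I + "\u0442\u0438",      # inflected form
]
examples = pick_matches(candidates)
```

The second candidate renders identically to the first but contains a Latin "i", so it does not match. Keeping the recovered matches as test examples is what makes a future regression in the regexes visible.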