[14:02:29] halfak: o/
[14:03:32] a few days ago, you gave me https://github.com/wiki-ai/revscoring/blob/master/ipython/feature_engineering.ipynb. I wonder, in order to follow this book, should I create a Python script consisting of these chunks of code?
[14:06:44] glorian_wd, these chunks of code show you how some of revscoring works.
[14:06:58] You will need to create your own chunks of code.
[14:07:12] It's a textbook, not an instruction manual.
[14:07:19] Or a recipe.
[14:10:08] oh okay
[14:10:43] halfak: am I supposed to modify revscoring?
[14:11:12] You can submit a pull request to revscoring, yeah.
[14:20:21] halfak: does revscoring work with Python 2.x?
[14:20:26] nope
[14:20:42] 2.x has been deprecated for more than 10 years :P
[14:21:05] oh ok
[14:21:15] Oh wait. Not quite. 9.5 years.
[14:30:57] halfak: kk
[14:31:55] I tried the example in the README at https://github.com/wiki-ai/revscoring. It complained about a missing models/enwiki.damaging.linear_svc.model. I wonder if I can get that somewhere, since that model doesn't exist in this repo.
[14:34:39] glorian_wd, you don't need that model to do what you need to do.
[14:35:07] That's an example. It would load a model file if there were one to load.
[14:35:14] It shows how to load a model file.
[14:40:37] oh ok
[15:00:40] halfak: I have read the book. I assume revscoring works not only for Wikipedia, but also for Wikidata. Correct?
[15:00:57] yes
[15:01:04] *wikibase
[15:01:07] ok
[15:01:25] halfak: then, is there any book about revscoring for Wikibase?
[15:01:47] I believe the methods/features for Wikipedia are different from those for Wikibase.
[15:02:07] no
[15:02:16] no conceptual difference
[15:04:50] hmm. is there anything which explains the features that already exist for Wikibase/Wikipedia?
[15:13:01] halfak: maybe you missed my last message :P
[15:13:20] http://pythonhosted.org/revscoring
[15:13:24] Also, the code :)
[15:13:37] http://pythonhosted.org/revscoring/revscoring.features.wikibase.html#module-revscoring.features.wikibase
[15:19:18] halfak: thanks. I am looking at them.
[15:30:36] halfak: OK, as far as I can understand from these two links, I now have to engineer some new features for Wikibase which correspond to the quality criteria (https://www.wikidata.org/wiki/Wikidata:Item_quality). For example, I have to engineer a feature for the 8 most important languages described in the quality criteria. This specific feature doesn't exist yet in the current revscoring Wikibase features.
[15:30:38] am I right?
[15:38:12] right.
[17:28:06] halfak: do you think it makes sense if we have a feature which comes from the division between the number of external sources (references) and the number of statements?
[17:28:06] so, number of external sources (references) / number of statements. The notion is, high-quality items should have multiple statements with multiple external references.
[17:28:06] Suppose there is an item "A" which has 2 statements; of these 2, one statement has 3 external references and the other statement has 0 external references.
[17:28:06] On the other hand, there is an item "B" which has 2 statements, and each of these statements has 3 external references.
[17:28:06] Hence, using the formula that I mentioned, item A would be weighted as 3/2 = 1.5 and item B would be weighted as 6/2 = 3. What do you think?
[17:33:11] o/ glorian_wd just saw your messages. +1 for that ratio. Might want to have a set of properties that are excluded (e.g. ones relating to identifiers)
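A rough sketch of how that ratio could be computed from an item's entity JSON (as returned by Special:EntityData or wbgetentities). The exclusion of identifier properties via the "external-id" datatype and the function name are illustrative assumptions, not an existing revscoring feature:

def reference_statement_ratio(item_doc, exclude_identifiers=True):
    """References per statement for a Wikidata item document (sketch)."""
    statements = 0
    references = 0
    for statement_list in item_doc.get("claims", {}).values():
        for statement in statement_list:
            # Assumption: skip external-identifier statements, which rarely
            # carry meaningful references of their own.
            datatype = statement.get("mainsnak", {}).get("datatype")
            if exclude_identifiers and datatype == "external-id":
                continue
            statements += 1
            references += len(statement.get("references", []))
    return references / statements if statements else 0.0

# Item "A": 2 statements, one with 3 references -> 3/2 = 1.5
# Item "B": 2 statements, each with 3 references -> 6/2 = 3.0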
[17:33:19] o/ codezee
[17:33:28] o/
[17:34:23] glorian_wd: I'm speaking out of context, but in my opinion more references on a single statement only help if they add more information. What if they are redundant?
[17:34:48] codezee, yeah, that's what I was thinking WRT identifiers.
[17:35:12] But we'll also want to measure "imported from" references differently from references that include actual external support.
[17:35:13] I mean, at least one reference per statement is definitely a positive point, but 2-3 or more references might become redundant.
[17:35:28] so a kind of logarithmic scale could be helpful
[17:35:41] like in tf-idf
[17:36:23] Oh good point.
[17:36:34] Maybe we really want "statements with references"/"statements"
[17:37:58] OK, now thinking about draftquality, I think that we could maybe build a "promotional adjectives" list like the badwords and informals lists we have.
[17:38:01] codezee, ^
[17:38:16] "industry lead(er|ing)"
[17:38:22] "top class"
[17:38:24] etc.
[17:39:15] from the above example, you mean if an article has "industry leading" it should be top class?
[17:40:06] https://en.wikipedia.org/wiki/Kenneth_Fryer
[17:40:15] Nah. Those are bits of language that are spammy.
[17:40:23] "set the standard"
[17:40:35] "famously"
[17:40:48] "extremely influential"
[17:41:01] "smash success"
[17:41:18] That kind of stuff.
[17:41:37] In the short term, we can build a word list. Once Amir1 finishes some work on bag-of-words strategies, we can include that too.
[17:42:24] halfak: ok, so there can be thousands of such adjectives. For badwords we had a few possible cases, so listing was good; how many adjectives do we keep listing like this?
[17:43:09] halfak: and a different adjective for each type of article, I think
[17:44:00] halfak: yes, a kind of clustering around good-quality articles could reveal common indicative words
[17:45:44] halfak: regarding testing, do you test it on your own laptop or on a server? Asking because I think it'll pull in a lot of article data.
[17:47:15] codezee, the word lists can be of varying sizes. We might be able to borrow from an abuse filter rule.
[17:47:50] I think that a tf-idf strategy for detecting words that are common in spam but not in non-spam would be a good start. One problem is most spam ends up deleted.
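A minimal sketch of what such a regex-based "promotional language" feature could look like, in the spirit of the badwords and informals phrase lists. The phrase list and function names here are illustrative assumptions, not an existing revscoring or draftquality API:

import re

# Illustrative starter list of promotional/spammy phrases from the discussion
# above; a real list would be curated, and could later be supplemented by a
# bag-of-words model.
PROMOTIONAL_PHRASES = [
    r"industry[- ]lead(?:er|ing)",
    r"top[- ]class",
    r"set the standard",
    r"famously",
    r"extremely influential",
    r"smash success",
]

PROMOTIONAL_RE = re.compile(
    r"\b(?:" + "|".join(PROMOTIONAL_PHRASES) + r")\b",
    re.IGNORECASE,
)

def promotional_phrase_count(text):
    """Count promotional-phrase matches in an article's text."""
    return len(PROMOTIONAL_RE.findall(text))

def promotional_phrases_per_word(text):
    """Normalize by length so long drafts aren't unfairly flagged."""
    words = len(text.split())
    return promotional_phrase_count(text) / words if words else 0.0

Counts like these could then be exposed as features alongside the existing word-list features, as discussed below for draftquality/feature_lists/enwiki.py.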
[17:48:08] I usually test on my desktop, but I'll build models on one of our computer servers in labs.
[17:50:54] o/ codezee
[17:51:41] halfak: where can I find the abuse filter rules for Wikipedia?
[17:52:26] I guess here - https://en.wikipedia.org/wiki/Special:AbuseFilter
[17:52:36] yeah, I actually also thought that high-quality items should have different external sources on each statement. For instance, item A has 2 statements: one statement's external source comes from www.forbes.com, the other statement's external source comes from www.nytimes.com.
[17:52:36] But I don't know if I can manage to do that in literally 1 month.
[17:53:21] codezee: you are right about redundant items. A good statement should have different external sources.
[17:56:12] codezee, I think we might have to ask someone with the rights.
[17:56:37] Oh! Looks like I have the rights.
[17:57:13] halfak: I can already see the top filters through this query and regexes - https://en.wikipedia.org/w/index.php?title=Special:AbuseFilter/&sort=af_hit_count&limit=100&asc=&desc=1&deletedfilters=hide&hidedisabled=1
[17:57:19] if that is the right thing
[17:59:03] Looks like some are "private"
[17:59:07] and others are public.
[17:59:37] yes, I could only use the public ones :/
[18:00:17] halfak: so after creating this list of adjectives, they should be added as a feature to draftquality/feature_lists/enwiki.py, right?
[18:01:00] codezee, yes. That's right. We might eventually add them to revscoring directly. After all, these words might be useful for other types of prediction too.
[18:26:06] halfak: tsv2json does not accept the arguments "str int int int". Am I using the right one? - https://github.com/btbytes/tsv2json
[18:26:46] Use the tsv2json from the json2tsv package :)
[18:26:55] pip install json2tsv
[18:27:34] oh ok
[18:27:53] I'll add that to the readme too
[18:29:28] Maybe we should add it to the requirements.txt
[18:29:41] mwapi.errors.APIError: permissiondenied: You don't have permission to view a page's deleted history.
[18:29:43] :/
[18:30:04] while I was pulling the pages using revscoring fetch_text
[18:31:27] right. That's a difficult part of draft quality. You'll need to have me extract features for you.
[18:31:36] Or! Get advanced user rights for the project.
[18:31:41] Which wouldn't be out of the question.
[18:32:10] Also, that error message is phenomenal. :))))
[18:33:01] halfak: so what do we do, get advanced user rights?
[18:33:29] halfak: I've also updated the requirements; should I push directly to draftquality, or create a PR from my fork?
[18:33:46] codezee, create a PR
[18:39:22] halfak: PR created, let me know what can be done regarding permissions, as I'll be blocked till then
[18:40:46] codezee, got an enwiki account?
[18:41:33] halfak: yes, it's "Sumit.iitp"
[18:44:28] https://en.wikipedia.org/wiki/Special:ListGroupRights
[18:44:47] I think I'd try to see if you can get the "oversighter" right for a temporary period while working on this project
[18:45:12] That will give you access to deleted text and has the least scary permission set of the other options (checkuser, sysop)
[18:45:42] ok :)
[18:46:47] fair enough
[18:48:48] Not sure where to request that, but I bet there's a process you can read about.
[18:48:57] I'm happy to endorse your request.
[18:49:27] I'll have a look
[19:16:19] halfak: came across https://meta.wikimedia.org/wiki/WMF_Researcher - I think this is relevant
[19:16:44] codezee, ahh yeah. Regretfully, this doesn't get you access to deleted text.
[19:16:52] Oh wait.
[19:16:55] Yes it does.
[19:16:57] I forgot.
[19:17:13] So.. the problem with this is that you'd need to do a legal dance with the WMF.
[19:17:18] I'd like to avoid that.
[19:17:21] More work for the both of us.
[19:17:55] then the other way I found is to write to the Arbitration Committee - https://en.wikipedia.org/wiki/Wikipedia:Arbitration_Committee/CheckUser_and_Oversight#Appointments
[19:18:19] although the Wikipedia channel mentioned it might be difficult this way, I'll give it a try
[19:19:11] If going through ArbCom fails, that'll be an additional note of support for doing the paperwork-heavy WMF way
[21:41:00] halfak: still around?