[05:47:01] Revision-Scoring-As-A-Service, rsaas-articlequality, Research-and-Data-2017-Q1, User-Ladsgroup: Generate recent article quality scores for English Wikipedia - https://phabricator.wikimedia.org/T135684#2604222 (Ladsgroup) https://github.com/wiki-ai/wikiclass/pull/25/files I want to run it on ore...
[13:11:49] Revision-Scoring-As-A-Service, revscoring, Spike: [Spike] Investigate HashingVectorizer - https://phabricator.wikimedia.org/T128087#2604743 (Sabya) Will try it
[13:17:35] o/ sabya
[13:17:46] o/ halfak
[13:17:50] (just saw you commenting in phab and wanted to say "hi") :)
[13:19:06] went through grid search. will start on it.
[13:19:36] Cool. I'm interested to see what you learn.
[13:20:08] I'm guessing there will be benefits, but diminishing returns as you add estimators and depth.
[13:23:04] Revision-Scoring-As-A-Service, rsaas-articlequality, Research-and-Data-2017-Q1, User-Ladsgroup: Generate recent article quality scores for English Wikipedia - https://phabricator.wikimedia.org/T135684#2604748 (Halfak) enwiki-20160901-pages-articles.xml.bz2 is 6.0 GB Use the `/srv` mount. ```...
[13:24:30] what is the roc auc score we get in production model?
[13:24:51] https://ores.wmflabs.org/v2/scores/enwiki/damaging/?model_info=test_stats
[13:24:58] 0.908
[13:25:10] Revision-Scoring-As-A-Service, rsaas-articlequality, Research-and-Data-2017-Q1, User-Ladsgroup: Generate recent article quality scores for English Wikipedia - https://phabricator.wikimedia.org/T135684#2604753 (Ladsgroup) With pleasure!
[13:25:19] But there's some variability based on how the test set is randomly sampled.
[13:25:46] ok
[13:26:14] o/ Amir1
[13:26:16] you are probably right that signals are already present in those 77 fetures
[13:26:17] I see you around too :)
[13:26:19] features
[13:26:25] Hey!
[13:26:32] sabya, will be good for us to conclusively demonstrate this.
[13:26:42] sure
[13:26:45] Then maybe we can experiment with the article quality model ^_^
[13:26:48] halfak: I tested the scores against our real service and they are the same
[13:26:56] Nice
[13:27:01] do I need to make a huge tsv file?
[13:27:02] Good test there
[13:27:10] Amir1, yeah, it'll be big, but not that big
[13:27:18] you can write a compressed file
[13:27:47] e.g. ./extract_scores ... | bzip2 -c > output.tsv.bz2
[13:27:59] okay
[13:29:58] halfak: btw. I'm implementing this:
[13:29:58] https://usercontent.irccloud-cdn.com/file/0o8mGXON/
[13:30:08] See the last row
[13:30:45] Nice.
[13:31:07] I was actually just looking at a question re entity usage from kjshiroo
[13:31:27] He's one of my minions from the GroupLens lab. Mind if I direct him your way?
[13:31:35] (the api is merged now! https://en.wikipedia.beta.wmflabs.org/w/api.php?action=query&prop=wbentityusage&titles=Main%20Page|Malayan%20civet)
[13:31:41] I was generally directing him towards #wikidata
[13:31:49] Nice!
[13:31:54] yeah, Of course
[13:31:59] I wonder if we should get that added to the XML dumps
[13:36:36] halfak: I really like the idea of adding wp10 data to a labsdb. It would be great if we implement that too
[13:36:46] Smashing the quarterly goal ;)
[13:37:27] halfak: https://github.com/wiki-ai/wikiclass/pull/25/files
[13:37:35] review please ;)
[13:37:58] Amir1, just posted a note
[13:38:38] okay
[13:41:18] OK notes complete
[13:41:33] +1 for getting data on labsDB
[13:42:30] Once we know how long it takes to run on a current article dump, we should decide how we want to produce a historical dataset
[13:42:33] I have some ideas
[13:47:16] Amir1, FYI https://github.com/wiki-ai/wikiclass/pull/26
[13:47:33] oh, sorry
[13:47:39] No worries :)
[13:51:48] halfak: https://github.com/wiki-ai/wikiclass/pull/25/files
[13:52:04] I'm done.
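For reference, the compressed-output idea from 13:27:47 (piping through `bzip2 -c`) can also be done from inside a Python script with the standard `bz2` module. This is only a minimal sketch: the function names and the column layout are hypothetical and are not taken from wikiclass or its `extract_scores` utility.

```python
import bz2
import csv


def write_scores_tsv_bz2(rows, path):
    """Write an iterable of rows (sequences of strings) as bzip2-compressed TSV.

    Hypothetical helper: the real score-extraction script may lay out its
    columns differently.
    """
    # "wt" opens the compressed file in text mode, so csv can write to it.
    with bz2.open(path, "wt", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for row in rows:
            writer.writerow(row)


def read_scores_tsv_bz2(path):
    """Read the compressed TSV back into a list of rows (lists of strings)."""
    with bz2.open(path, "rt", encoding="utf-8", newline="") as f:
        return list(csv.reader(f, delimiter="\t"))
```

Rows are streamed one at a time, so the full uncompressed TSV never has to exist on disk, which is the same benefit the shell pipe gives.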
[13:53:20] https://github.com/wiki-ai/wikiclass/pull/25/files#r77349255
[13:53:26] Looks like you have some typos
[13:55:21] fixed now
[13:55:31] thanks
[13:58:58] (I'm downloading the dump atm)
[14:02:54] ETA 82 min.
[14:03:27] halfak: can you check again? https://github.com/wiki-ai/wikiclass/pull/25/commits/511e049a6e937a92527552f9cc88a0c190479597
[14:04:45] Amir1, can you do a test run?
[14:05:03] I did with the early version but yeah
[14:05:04] let me
[14:08:07] halfak: done, there was a redundant "]" somewhere. Fixed now
[14:08:36] kk merging
[14:08:54] {{done}}
[14:08:55] How cool, halfak!
[14:08:58] Thanks AsimovBot
[14:09:33] \o/
[14:11:49] Once the download is done, I will run the script. We can start it now too but let's just be sure
[14:13:45] Sounds good
[14:15:11] wiki-ai/revscoring#803 (feature_vector_real - 4c89759 : halfak): The build passed. https://travis-ci.org/wiki-ai/revscoring/builds/157105847
[14:16:10] Revision-Scoring-As-A-Service, revscoring: Implement abstraction for Sparse Feature Vectors - https://phabricator.wikimedia.org/T132580#2604934 (Halfak) OK. All ready to go! See https://github.com/wiki-ai/revscoring/pull/287
[14:16:19] Amir1, I have a monster for you: https://github.com/wiki-ai/revscoring/pull/287
[14:18:50] +1,151 −401
[14:18:56] nightmares tonight
[14:19:57] lol
[14:20:04] * halfak only submits monster pull requests
[14:20:08] This is not a great thing
[14:20:20] But I do a lot of refactoring and generalizations.
[14:21:34] I have gift for you
[14:21:34] https://incubator.wikimedia.org/wiki/Wp/ase/AS18517S20500S2ff00M529x544S2ff00482x483S20500519x504S18517503x517
[14:21:39] Wikipedia in ASL
[14:22:08] lol wat
[14:22:16] https://incubator.wikimedia.org/wiki/Wp/ase/AS10e20S15a06S29a0bM514x538S15a06487x462S10e20498x480S29a0b487x505
[14:22:19] My screen is all turned
[14:22:44] Ohhhh It's American Sign Language?
[14:23:06] yes
[14:23:14] I think I have a font-fail
[14:23:43] Revision-Scoring-As-A-Service, revscoring, Spike: [Spike] Investigate HashingVectorizer - https://phabricator.wikimedia.org/T128087#2604944 (Sabya) Here is the link to compare the results against: https://ores.wmflabs.org/v2/scores/enwiki/damaging/?model_info=test_stats
[14:23:48] For me it's like this:
[14:23:54] https://usercontent.irccloud-cdn.com/file/qFWdOtRc/
[14:26:19] Yeah. That's what it looks like to me too. So I guess that's intended.
[14:26:55] Revision-Scoring-As-A-Service, revscoring: Implement abstraction for Sparse Feature Vectors - https://phabricator.wikimedia.org/T132580#2604965 (Halfak) Here's my timing script and output: https://gist.github.com/halfak/565d1c2153da57c5c6600cb175f20236
[14:37:16] Amir1 lives dangerously
[14:37:40] :))
[14:37:56] * Amir1 listens to Livin' La vida loca
[14:37:58] So, one thing that'll be a pain is re-generating the feature files in editquality and wikiclass
[14:38:12] But fun story, I think wikiclass will simplify a bit
[14:38:21] since it already uses JSON
[14:38:27] and has to convert to TSV for revscoring
[14:39:45] * halfak checks on the results of the extended user_groups
[14:41:45] halfak: btw. I found this bug when I was running in verbose mode, pushed a minor fix directly to master
[14:41:45] https://github.com/wiki-ai/wikiclass/commit/622497b5840647ff6311e97a9354282ccf18e804
[14:41:50] I hope you don't mind
[14:42:03] Amir1, no worries
[14:49:32] * halfak re-generates enwiktionary model
[14:49:42] Otherwise, the models are pretty much good for the user_group refactor
[14:49:48] It'll be nice to call this done
[14:49:54] halfak: Are you using ores-compute-01?
[14:49:58] Yeah
[14:50:03] But I should be done soon
[14:50:14] I'm using that machine to generate scores for wp10
[14:50:24] If you want all CPU right now, you could use ores-staging-01
[14:50:29] *02
[14:50:46] It's okay for now but I guess I need that machine for a day or two
[14:51:02] It's downloading the dump in that machine right now
[14:51:39] What do you think if we use stats1003 machines :D
[14:54:56] Amir1, +1
[14:55:01] That would be most suitable
[14:55:07] But dictionaries might be different :/
[14:56:46] do we need dictionaries in wp10 modles?
[14:56:51] *models
[14:59:15] Amir1, yeah. I think so
[14:59:23] :|
[15:00:54] Amir1, I guess not
[15:01:05] https://github.com/wiki-ai/wikiclass/blob/master/wikiclass/feature_lists/enwiki.py#L42
[15:01:10] So we're probably fine :)
[15:01:47] okay
[15:08:59] Amir1, it seems like there's something weird going on with the enwiktionary rev_reverted file
[15:09:05] I see that it's been checked in.
[15:09:10] Do you know what's up with that?
[15:09:22] I can't recall
[15:09:26] let me think
[15:09:43] When I run the make command to generate it, I don't find nearly as many reverted edits.
[15:10:32] Because we had the same situation with wikidata, I sampled in another way
[15:10:42] to make balanced dataset
[15:10:59] Revision-Scoring-As-A-Service, rsaas-editquality: Extend user group features - https://phabricator.wikimedia.org/T143909#2582897 (Halfak) This is now ready for review. https://github.com/wiki-ai/editquality/pull/45
[15:11:18] Amir1, gotcha. We should capture this better sampling in the Makefile somehow.
[15:11:23] I'll make a task for that
[15:26:07] Revision-Scoring-As-A-Service, rsaas-editquality: Fix makefile entry for enwiktionary.rev_reverted.20k_2016.tsv - https://phabricator.wikimedia.org/T144605#2605083 (Halfak)
[15:35:50] OK. I'm going to hop on my bike and head to the university.
[15:35:53] AFK for about an hour
[15:36:03] FYI: Amir1 ^
[15:36:14] have fun!
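On the balanced-dataset exchange at 15:10: the log does not say how the enwiktionary file was actually sampled, so the following is only an illustration of one common approach (downsampling to an equal number of rows per class), not the method used for the checked-in dataset. The function name, the `reverted` field, and the row layout are all assumptions made for the example.

```python
import random


def balanced_sample(rows, is_positive, n_per_class, seed=0):
    """Return a shuffled sample with equal numbers of positive and negative rows.

    Illustrative only: `is_positive` is a caller-supplied predicate, and the
    per-class count is capped by the size of the smaller class.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    positives = [r for r in rows if is_positive(r)]
    negatives = [r for r in rows if not is_positive(r)]
    k = min(n_per_class, len(positives), len(negatives))
    sample = rng.sample(positives, k) + rng.sample(negatives, k)
    rng.shuffle(sample)
    return sample
```

Because reverted edits are rare, a plain random sample of revisions contains few of them; capping both classes at the same size trades dataset size for class balance.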
[17:06:08] Revision-Scoring-As-A-Service, rsaas-articlequality, Research-and-Data-2017-Q1, User-Ladsgroup: Generate recent article quality scores for English Wikipedia - https://phabricator.wikimedia.org/T135684#2605408 (Ladsgroup) The generating scores is being ran on stat1003. Results are in /home/ladsgr...
[18:05:25] o/
[18:10:07] I'm pushing new models and ORES to staging
[18:15:36] OK looks like the advanced rights are getting prediction weight appropriately.
[18:15:50] Compare: https://ores-staging.wmflabs.org/v2/scores/enwiki/damaging/642215410/
[18:16:00] With: https://ores-staging.wmflabs.org/v2/scores/enwiki/damaging/642215410/?feature.revision.user.is_trusted=true&feature.revision.user.is_patroller=true&feature.revision.user.is_curator=true&feature.temporal.revision.user.seconds_since_registration=2323553
[18:16:29] Rather https://ores-staging.wmflabs.org/v2/scores/enwiki/damaging/642215410/?feature.revision.user.is_trusted=true&feature.revision.user.is_patroller=true&feature.revision.user.is_curator=true
[18:33:32] Revision-Scoring-As-A-Service, rsaas-editquality: Extend user group features - https://phabricator.wikimedia.org/T143909#2605593 (Halfak) @Iniquity this is now deployed on ores.wmflabs.org. We'll deploy it in production (ores.wikimedia.org) after a bit of review. In the meantime, I want to show you t...
[19:55:15] Tuxedo: you could also edit [MobileFrontendInstallDir]/resources/skins.minerva.base.styles/common.css
[19:55:25] wrong channel
[20:19:45] Revision-Scoring-As-A-Service-Backlog, revscoring: Implement PCFG features - https://phabricator.wikimedia.org/T144636#2605842 (Halfak)
[21:38:02] Revision-Scoring-As-A-Service, rsaas-editquality: Fix makefile entry for enwiktionary.rev_reverted.20k_2016.tsv - https://phabricator.wikimedia.org/T144605#2606027 (Halfak) See https://github.com/wiki-ai/editquality/blob/master/datasets/enwiktionary.rev_reverted.20k_2016.tsv It looks like maybe this fil...
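The comparison URLs pasted at 18:15 to 18:16 follow a simple pattern: the plain score URL, plus `feature.<name>=<value>` query parameters that override feature values for that request. A small sketch of building such URLs is below. The helper names are hypothetical, and the lowercase `false` rendering is an assumption, since only `true` and an integer value appear in the log.

```python
from urllib.parse import urlencode

# Base URL copied from the links pasted in the channel.
ORES_STAGING = "https://ores-staging.wmflabs.org/v2/scores"


def format_value(value):
    """Render a Python value as a query-string value.

    Booleans become lowercase "true"/"false" ("false" is an assumption;
    only "true" is shown in the log). Everything else uses str().
    """
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def score_url(wiki, model, rev_id, injected=None):
    """Build a score URL, optionally with injected feature values."""
    url = "{0}/{1}/{2}/{3}/".format(ORES_STAGING, wiki, model, rev_id)
    if injected:
        pairs = [("feature." + name, format_value(value))
                 for name, value in injected.items()]
        url += "?" + urlencode(pairs)
    return url
```

Calling `score_url("enwiki", "damaging", 642215410)` with and without the three user-group features gives the two URLs being compared, so the same revision can be scored as if the editor held those rights.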
[21:38:05] 39
[22:14:47] Revision-Scoring-As-A-Service, rsaas-editquality: Fix makefile entry for enwiktionary.rev_reverted.20k_2016.tsv - https://phabricator.wikimedia.org/T144605#2606104 (Halfak) ``` $ cat datasets/enwiktionary.prelabeled_revisions.200k_2016.tsv | grep "reverted" | wc 821 3284 22988 $ cat datasets/enw...
[22:18:21] Revision-Scoring-As-A-Service, rsaas-editquality: Fix makefile entry for enwiktionary.rev_reverted.20k_2016.tsv - https://phabricator.wikimedia.org/T144605#2606111 (Halfak) OK. My plan is to run `label_reverted` on the 200k dataset and then do this: ``` (head -n1 datasets/enwiktionary.rev_reverted.200k...
[22:44:43] Revision-Scoring-As-A-Service, rsaas-editquality: Fix makefile entry for enwiktionary.rev_reverted.20k_2016.tsv - https://phabricator.wikimedia.org/T144605#2606158 (Halfak) https://github.com/wiki-ai/editquality/pull/46
[22:44:58] Revision-Scoring-As-A-Service, rsaas-editquality: Fix makefile entry for enwiktionary.rev_reverted.20k_2016.tsv - https://phabricator.wikimedia.org/T144605#2605083 (Halfak) a:Halfak