[00:23:43] Back! [00:23:55] Had to type up notes afterward. Took longer than expected. [00:24:21] Regretfully (or happily) I must run off to go have dinner with Jenny. [00:24:28] See y'all at the hack session tomorrow [00:24:47] I'm gonna just leave this here: https://en.wikipedia.org/wiki/Florence_Y%27all_Water_Tower [00:54:25] https://github.com/aetilley/pcfg_scorer [15:06:50] o/ [15:49:42] o/ Helder [15:49:44] :) [15:50:01] oi! :) [16:49:43] Hi all [16:50:01] o/ aetilley [16:50:21] Saw your email. I'm not sure how to interpret it though. [16:51:02] You mean the one with the link to the repo? [16:51:18] * halfak pulls it up again [16:51:24] Yeah. [16:51:27] "I got it scoring" [16:51:31] Ha ha [16:51:36] :) [16:51:45] I got it not giving an error each time. [16:51:54] Ahh! Cool :) [16:52:04] BTW, we should talk about tokenization. [16:52:09] Sure [16:52:12] We'll likely need to make improvements to support your work. [16:52:23] * halfak works on an example of what we have. [16:53:19] https://gist.github.com/halfak/496c27cccdeef087bb98 [16:53:33] I figure this will be useful for you. [16:53:45] Since we already have a notion of "word" and punctuation types. [16:53:56] But it may be incomplete so we might need to extend it. :) [16:54:10] * aetilley looks [16:56:57] Cool. So the package is delta? [16:57:03] or wikitext split? [16:59:44] (CR) He7d3r: Switch from the "reverted" to "damaging" model (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/257851 (https://phabricator.wikimedia.org/T112856) (owner: Awight) [16:59:47] The package is "deltas". It's a dependency of revscoring. [17:00:22] We provide a datasource of these "tokens" for revision and parent_revision. See https://github.com/wiki-ai/revscoring/blob/master/revscoring/datasources/revision.py#L30 [17:05:27] * halfak finally finishes his comms work and starts programming. [17:05:35] Been writing emails and talk page posts for the last 3 hours :/ [17:08:49] halfak: Ok, I can look at that. 
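The typed tokenization being discussed (deltas' "word"/punctuation token types) can be sketched with a stdlib regex lexer. This is an illustration, not deltas' actual implementation, and the token-type inventory here is an assumption:

```python
import re

# Ordered (type, pattern) pairs; first match wins, in the spirit of
# deltas' wikitext_split tokenizer. The types below are illustrative.
LEXICON = [
    ("word", r"[A-Za-z][A-Za-z'-]*"),
    ("number", r"\d+"),
    ("period", r"\."),
    ("qmark", r"\?"),
    ("whitespace", r"\s+"),
    ("etc", r"."),
]
TOKENIZER = re.compile(
    "|".join("(?P<%s>%s)" % (name, pattern) for name, pattern in LEXICON)
)

def tokenize(text):
    """Return a list of (token, type) pairs for a chunk of text."""
    return [(m.group(), m.lastgroup) for m in TOKENIZER.finditer(text)]

print(tokenize("Is this damaging?"))
```

Each named group in the alternation doubles as the token's type label, which is roughly how a downstream consumer can filter for "word" or "period"/"qmark" tokens.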
Do you have any thoughts on a training corpus for this scorer? [17:09:57] Still not sure what would constitute an adequate size. [17:10:37] Anyway, I'll have to write a script for constructing a counts_file from the corpus. [17:11:15] Or find some sort of preexisting data that can be easily translated to a counts file. [17:11:34] I'll need to do this for a "regular" scorer and a "vandalous" scorer. [17:20:06] aetilley, I think that it would make sense to start with the rev_ids in our training sets. [17:20:36] So, you would have ~19000 obs of good edits and 10000 obs of bad. [17:20:42] *1000 bad [17:21:42] So I imagine that you could take our current list of features present in the files I sent you for clustering and add two columns to them. [17:21:49] one for each scorer. [17:21:58] Or maybe a scorer delta [17:22:01] Not quite sure. [17:22:20] Let's just say N columns where N is the number of new values you imagine when comparing parent_revision and revision. [17:22:58] So the main task right now is constructing a file like this [17:23:00] https://github.com/aetilley/pcfg_scorer/blob/master/counts_file.txt [17:23:29] It doesn't have to be from wikipedia edits, but one day we might want one. [17:23:42] +1. [17:24:01] So I think you'll process enwiki.features_damaging.20k_sample.tsv and generate two files. [17:24:13] non-damaging_counts.tsv [17:24:22] and damaging_counts.tsv [17:24:43] That right? [17:26:13] ok, but I thought that file was just feature values [17:26:31] Is it text? [17:26:38] halfak: [17:27:37] It contains a rev_id. [17:28:02] I don't understand what you want to use the rev_id for in this case. [17:28:10] At least for generating counts. [17:30:06] Oh, I see [17:30:21] you mean if I have the rev_ids then I can fetch the text for myself. 
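The counts-file construction script discussed here could start from something like this sketch. The two-column word/count TSV layout is an assumption based on the linked `counts_file.txt`; the real format may differ:

```python
import collections
import re

def build_counts(texts):
    """Tally lowercase word counts across a corpus of revision texts."""
    counts = collections.Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

def write_counts_tsv(counts, path):
    """Write one '<word>\t<count>' line per word, most frequent first."""
    with open(path, "w") as f:
        for word, count in counts.most_common():
            f.write("%s\t%d\n" % (word, count))

# One counts file per class, e.g. damaging vs. non-damaging revisions:
good = build_counts(["The quick brown fox.", "The lazy dog."])
print(good["the"])  # 2
```

Running `build_counts` once over the ~19k good-edit texts and once over the ~1k bad-edit texts would yield the `non-damaging_counts.tsv` / `damaging_counts.tsv` pair mentioned above.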
:) [17:31:28] https://gist.github.com/halfak/1620beae124716504cba [17:31:33] Yup :) [17:33:13] ello [17:33:17] o/ ToAruShiroiNeko [17:33:28] hi ToAruShiroiNeko [17:33:53] halfak: So I might need a little help with that much. [17:35:10] * aetilley looks at APIExtractor [17:38:05] Ah, ok this looks promising. [17:38:22] I might even be able to use NLTK to do the part-of-speech tagging [17:38:47] :D [17:38:53] Let me know if you need a hand. [17:39:07] I kinda wish I could avoid dealing with wiki-markup right now though [17:39:12] halfak: o/ [17:39:27] aetilley, we should be able to do that with mwparserfromhell. [17:39:30] Hey Amir1! [17:39:35] * halfak looks. [17:40:05] what should I do? :) [17:40:50] Looks like mwparserfromhell.Wikicode.strip_code() doesn't do what I thought it did. [17:40:52] Hey Amir1. WMDE blog stuff? [17:40:55] yeah [17:40:56] Or testing of the ORES extension. [17:41:19] for the ORES extension we should wait until awight comes here [17:41:23] :( [17:42:35] kk [17:42:38] Have you noted on the card that you are blocked on some help from awight? [17:42:59] the card is closed [17:43:06] I think I should make a new card [17:43:17] "Test ORES extension in a live system" [17:43:18] the ORES extension review? [17:43:27] Oh! Gotcha. Re-open, I think [17:43:38] Since we're still struggling to test the extension. [17:47:21] okay [17:47:25] let me re-open it [17:50:44] aetilley, OK. So I have figured a few things out. [17:51:00] In the short term, use "content_tokens" rather than "tokens" and that will filter out most of the markup. [17:51:10] ok [17:51:21] * aetilley looks at prev gist [17:51:38] extractor.extract(637398301, revision.tokens)[:10] [17:51:38] Regretfully, this is still imperfect. I don't have time now, but we should file some bug reports against mwparserfromhell. 
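Since `strip_code()` wasn't behaving as expected, here is a very rough stdlib approximation of stripping common wiki markup before tagging. This is a toy: a real parser like mwparserfromhell handles nesting and edge cases that a regex pass will get wrong:

```python
import re

def rough_strip_markup(wikitext):
    """Crudely remove templates, refs, link syntax, and quote markup.

    A toy approximation for illustration only; it does not handle
    nested templates, tables, or many other wikitext constructs.
    """
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # {{templates}}
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)    # <ref>...</ref>
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]]
    text = re.sub(r"'{2,}", "", text)                              # ''italic''/'''bold'''
    return text

print(rough_strip_markup("[[dog|Dogs]] are ''great''.{{cn}}"))  # Dogs are great.
```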
[17:52:13] I've updated the gist to demo "content tokens" [17:52:14] https://gist.github.com/halfak/1620beae124716504cba [17:53:19] Ok [17:53:44] And you're fairly confident that mwpfhell will let us do part-of-speech tagging [17:53:46] ? [17:55:36] It appears that the original authors had access to some sort of WM tree-bank [17:56:03] ? [17:56:14] I don't think mwparserfromhell has any tagging [17:56:33] Other than wikitext vs. "content" [17:56:35] oh, then I misread your previous comment. [17:56:40] gotcha [17:56:52] then NLTK it is. [17:56:53] Ahh. It's my tokenizer in "deltas" that calls some tokens "word" [17:57:19] Right, but I'll have to break apart the "word" group. [17:57:20] I figure you can look for tokens that are "word" or "period"/"qmark"/etc. for sentency things. [17:57:35] :D [17:57:36] :) [17:57:37] ok [18:31:12] halfak, what is the current way to test if "x" is a badword in a given language? [18:31:38] Helder, we don't test a single word at a time anymore. [18:31:40] It does not seem to be portuguese.is_badword(x) [18:31:47] This is due to multi-word badword detection. [18:32:02] So instead, we apply a set of regex to a chunk of text. [18:32:27] So, you would access "portuguese.revision.badwords" [18:32:37] or "portuguese.diff.badwords_added" [18:33:14] See https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/meta/regex_extractors.py [18:33:55] Text... and SegmentRegexExtractor are used on full text/diffs respectively. [18:37:19] halfak, is there anything I can use along the lines of "if x in portuguese.revision.badwords print(x)"? [18:40:01] Helder, it's a little awkward, but here's how I'd do it. [18:40:02] https://gist.github.com/halfak/0d89ff662f23e71445e4 [18:41:14] * Helder looks [18:41:43] Amir1, I'm considering wrapping wb-vandalisms wikibase item datasources and features into revscoring. [18:42:09] that would be great [18:42:10] eew! 
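Applying a set of regexes to a chunk of text, as described above, looks roughly like this sketch. The word list is a stand-in; the real lists and extractors live in `revscoring.languages`:

```python
import re

# Stand-in badword patterns; the real per-language lists are maintained
# in revscoring and can include multi-word patterns.
BADWORD_PATTERNS = [r"caca", r"bosta", r"burro?s?"]

BADWORD_RE = re.compile(
    r"\b(?:%s)\b" % "|".join(BADWORD_PATTERNS), re.IGNORECASE
)

def extract_badwords(text):
    """Return every badword match found in a chunk of text."""
    return BADWORD_RE.findall(text)

print(extract_badwords("Que bosta, seu burro!"))  # ['bosta', 'burro']
```

Matching whole chunks rather than single words is what makes multi-word badword detection possible, which is why the old `is_badword(x)` style went away.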
[18:42:14] I want to split general datasources from those specific to wikitext. [18:42:29] Helder, we could add a simple helper function to languages [18:42:51] It would look like portuguese.contains_badword(some_string) [18:43:24] halfak: ok [18:43:26] Or maybe portuguese.extract_badwords(some_string) [18:43:38] can we start writing the blog post? [18:43:44] how should we do it? [18:43:51] google docs? [18:43:54] Amir1, I'm thinking that we want to refactor datasources/features so that wt- and wb-based features are peers. [18:44:03] Amir1, I'm thinking an etherpad for now. [18:44:37] If you start one I'll suggest a basic structure. [18:44:56] sure [18:45:17] that's reasonable to consider making them peers [18:45:34] So, re. wt/wb datasources, I think it'll look like "from revscoring.datasources.wikitext import revision as wt_revision" [18:45:39] because wikitext-related features are not higher level than wikibase level features [18:45:47] shouldn't be [18:45:51] and "from revscoring.datasources.wikibase import revision as wb_revision" [18:45:55] but currently are [18:45:56] Yeah [18:46:04] halfak, looks like I can also use something like "portuguese.revision.badwords_list.process(x)", right? [18:46:19] so wt_revision will have features that come from mwparserfromhell [18:46:32] And wb_revision will have features that come from pywikibase [18:46:46] Helder, yeah. that's a good point. [18:46:54] halfak: https://etherpad.wikimedia.org/p/wmde_ores_blogpost [18:46:59] great! [18:47:01] I forgot that the "process()" method was exposed. [18:50:20] Amir1, first thing we need to do is decide on what substantial arguments to make. [18:50:30] Note that I am using the term "argument" loosely. [18:51:31] ok :) [18:51:40] I filled in a few potentials. [18:52:02] Think broadly about anything you might want to discuss and we can come back to choose favorites later. 
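The proposed `contains_badword` / `extract_badwords` helpers could be as thin as this sketch. The method names come straight from the discussion above; the class shape and the pattern list are assumptions for illustration:

```python
import re

class Language:
    """Sketch of a per-language helper wrapping precompiled badword regexes."""

    def __init__(self, badword_patterns):
        self.badwords_re = re.compile(
            r"\b(?:%s)\b" % "|".join(badword_patterns), re.IGNORECASE
        )

    def extract_badwords(self, text):
        """Return all badword matches in the string."""
        return self.badwords_re.findall(text)

    def contains_badword(self, text):
        """True if the string contains at least one badword."""
        return self.badwords_re.search(text) is not None

# Hypothetical instance; the real ptwiki list is much longer.
portuguese = Language([r"bosta"])
print(portuguese.contains_badword("isso é uma bosta"))  # True
```

This keeps the single-string convenience API as a thin wrapper over the same compiled regexes the revision/diff datasources use, so the two can't drift apart.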
:) [18:55:49] sure [19:14:19] here is an easy PR for you guys: https://github.com/wiki-ai/revscoring/pull/227 [19:15:04] Awesome. thank you! [19:15:18] :) [19:23:33] halfak, how long does it take to train a damaging or reverted model nowadays? [19:24:41] Helder, depends on the modeling strategy, but it's pretty quick for most: 1-2 minutes. [19:25:01] Strangely we are having issues with convergence in linearSVCs that make them sometimes never stop training. [19:25:14] But the good news is that we get better performance all around out of other modeling strategies :D [19:25:27] e.g. suppose I want to compare the AUC of the live models with a new one, trained after doing small changes to the list of badwords for ptwiki [19:25:55] Helder, the hard part there is extracting new features based on the new badwords detection. [19:26:08] That can take a long time. [19:26:17] like 5 hours? [19:26:27] We now do parallelization in feature extraction, so if you can run it on a beefy server, it can go really fast. [19:26:39] I think 30 minutes would be likely if you have 4 cores to work with [19:26:42] 15 minutes with 8 [19:26:48] 7.5 minutes with 16 [19:26:51] great [19:27:15] Helder, did you ever log into ores-compute.labs? [19:27:26] I don't think so... [19:27:26] That's got 8 cores and we usually use it for this kind of work. [19:27:36] last time I tried I was having some problem with ssh... [19:27:49] Gotcha. Maybe worth trying again? [19:27:50] (I don't even remember what kind of problems) [19:28:02] I seem to remember making a phab card. [19:29:27] hum... it seems to work... [19:29:27] Last login: Wed Aug 12 23:35:03 2015 from bastion-01.bastion.eqiad.wmflabs [19:29:31] o.O [19:30:26] :P [19:31:11] halfak, what should I look for there? [19:31:55] Really, the machine is open to your use. I think that what you'll want to do is start by setting up your virtualenv. 
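The quoted timings (30 min on 4 cores, 15 on 8, 7.5 on 16) imply roughly linear speedup, i.e. about 120 core-minutes of work total. The parallel extraction pattern can be sketched with a worker pool; `extract_features` here is a stand-in for the real per-revision extraction, which is partly API-bound:

```python
import concurrent.futures

def extract_features(rev_id):
    """Stand-in for real per-revision feature extraction (API + CPU work)."""
    return (rev_id, rev_id % 2)  # hypothetical feature vector

def extract_all(rev_ids, workers=4):
    """Extract features for many revisions concurrently, preserving order."""
    # A thread pool suits I/O-heavy extraction; CPU-bound work would use
    # a process pool instead. halfak's numbers suggest ~120 core-minutes:
    # 120/4 = 30 min, 120/8 = 15 min, 120/16 = 7.5 min.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_features, rev_ids))
```

`pool.map` keeps results in input order, so the output lines up with the rev_id sample file regardless of which worker finished first.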
[19:32:10] I have a good gist for that [19:32:11] one sec [19:32:49] https://gist.github.com/halfak/9f4830895496af9e9731 [19:33:23] If you like copy-pasting the repo names: https://gist.github.com/halfak/5146e66178fadd8d3ac8 [19:33:53] You'll also probably want https://github.com/wiki-ai/editquality [19:34:11] You can see how we gather obs. and extract features in the Makefile in 'editquality' [19:35:01] You should be able to just run the makefile and it will automatically download the files and start labeling reverts and extracting features (assuming you `pip install revscoring editquality`) [19:35:05] :) [19:36:51] halfak, and those tsv files? are they available anywhere? [19:37:09] They should be. [19:37:25] We don't have a great way of listing them yet, but the URLs in the makefile should all work. [19:37:30] Some are pulling from wikilabels. [19:37:34] Others from quarry. [19:42:33] halfak, shouldn't "make" take care of creating the "datasets" folder? [19:42:38] instead of "/bin/sh: 1: cannot create datasets/dewiki.sampled_revisions.20k_2015.tsv: Directory nonexistent" [19:43:12] Helder, yeah... or we could have a placeholder in that folder so that a git clone will create it. [19:43:30] Actually, we should really start checking the revision_samples in. [19:43:35] That would solve the problem. [19:46:47] halfak, http://dpaste.com/2E99WS4 [19:46:57] isn't this missing at https://github.com/wiki-ai/revscoring/blob/master/requirements.txt ? [19:47:50] Ahh! Yes. That's new. [19:47:56] I just switched to using yamlconf. [19:48:02] Thanks for flagging it. [19:48:18] * halfak feels bad for self-merging a bunch recently. :\ [19:48:34] I'll add it quick. [19:48:38] also, ImportError: No module named 'mwparserfromhell' [19:49:12] although this one is in the requirements file o.O [19:49:24] is the pip version of revscoring outdated? [19:49:43] Shouldn't be. Do you have 0.7.8? 
[19:50:31] I don't know, I just "pip install revscoring editquality" [19:51:13] yep, it is revscoring (0.7.8) [19:52:54] Weird. I'll check it out quick with a fresh install. [19:54:07] * halfak watches the install picking up mwparserfromhell [19:54:09] Weird [19:57:54] Helder, I'm going to blame a hiccup in pip for missing mwparserfromhell. [19:58:23] oh, btw some dependencies are missing from requirements.txt in ORES [19:58:34] pylru [19:58:35] That might be expected. [19:58:36] and redis [19:58:40] There are optional dependencies. [19:58:47] Including pylru and redis. [19:59:06] but ores dev_server doesn't work without them [19:59:16] Yeah... that's a fair point. [19:59:16] halfak, is there a way to force pip to reinstall all dependencies in the file? [19:59:36] Helder "pip install -r requirements.txt"? [20:00:05] Amir1, good point. We should make the dev server do no caching so that it doesn't require pylru. [20:01:11] no, I mean, the ones it should have installed when I did "pip install revscoring editquality" [20:01:22] (since I didn't clone revscoring for example) [20:01:47] Oh. Not sure. [20:01:52] I'd look through pip docs. [20:02:16] okay [20:03:53] Helder, I don't see anything obvious in the pip docs. [20:04:19] I uninstalled revscoring and I'm waiting for it to install again... let us see [20:05:24] kk [20:11:03] Hmmm... apparently, I have to install "numpy" separately before it attempts to install "scipy"... [20:13:35] Yes. Bug has been filed against scipy for at least a year :( [20:25:03] halfak, ok, that worked, but now I have this one: http://dpaste.com/22Z2NE6 [20:25:43] curiously, it didn't happen when I executed make again... and it is now processing something... [20:25:50] The feature file is empty. [20:26:04] So it's probably extracting features? [20:26:14] Or maybe labeling edits as reverted/damaging [20:26:21] probably, some dots and some question marks are appearing... [20:26:31] That sounds like feature extraction. 
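Given the confusion above about whether `mwparserfromhell`, `pylru`, and `redis` actually made it into the environment, a quick importability check helps. This is a generic sketch, not part of revscoring or ORES:

```python
import importlib

def check_deps(names):
    """Return {module_name: importable?} for the given dependency names."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status

# Example: "re" is stdlib and always present; the others depend on the env.
print(check_deps(["re", "mwparserfromhell", "pylru", "redis"]))
```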
[20:26:32] :) [20:26:43] Yeah. Check out top. [20:26:44] are the question marks "deleted" revisions? [20:26:58] You are using all available resources to extract features. :) [20:27:08] Yes. ? == deleted revision or revision of a deleted page. [20:27:23] Actually I suppose it means we couldn't find the revision for *some* reason. [20:27:25] ouch! should I stop? [20:27:31] With those two being the dominant ones. [20:27:34] No, rock on :) [20:27:41] Extract those features :D [20:28:01] If you are ever worried about sharing resources just run "nice " [20:28:14] BTW: is it possible to do these kinds of tasks without being connected to labs all the time? [20:28:14] e.g. "nice make models/ptwiki.damaging.linear_svc.model" [20:28:33] That will make it lower the priority of your jobs. You'll just use all the resources that no one else wants. :) [20:28:45] Helder, are you familiar with "screen"? [20:28:52] nope [20:28:58] man screen [20:29:22] "No manual entry for screen" [20:29:26] WAT [20:29:44] Oh! You're doing that on your local machine [20:29:45] * Helder sudo apt-get install screen [20:29:51] You might not have it installed. [20:29:54] Yeah :) [20:30:14] So screen lets you start up a terminal window in the server you are working with. [20:30:23] You can then log out and log back in without interrupting it. [20:30:31] You can detach and re-attach screens. [20:30:38] This is how I usually manage long-running processes. [20:31:00] Looks like this is a good tutorial: https://www.rackaid.com/blog/linux-screen-tutorial-and-how-to/ [20:31:38] thanks! [20:31:50] No problem. :) [20:32:01] * halfak is stoked that Helder's running some tests :) [20:32:35] not much for now, just trying to catch up a little of what I missed in the last few months [20:33:00] (as you can see by the timestamp of my last ssh session above) [20:36:51] * Helder thinks feature extraction progress looks like a starry night sky [20:37:37] :D It's going to get a bit better too. 
I'm working on a nice refactoring right now. [20:38:07] Really, it's not going to change much on the outside, but it will make it easier to mix and match language features for non-latin languages later :) [20:38:27] I figured I'd do that while I was adding term frequency features. [20:38:48] Helder, scope this out: https://phabricator.wikimedia.org/T121003 [20:39:16] I think this will address your concerns raised here: https://github.com/wiki-ai/revscoring/issues/213 [20:40:09] So, if you add a new instance of "shit" to the article https://en.wikipedia.org/wiki/Shit, you'll see a very minor proportional increase in the badwords term frequency on the article. [20:40:21] But if you add another curse word, you'll see a much higher proportional diff. :) [20:40:50] Once I started working on this, I realized that it could be quite powerful for vandalism and edit type detection. [20:41:05] E.g. edits that add a lot of proportionally new words to an article are probably adding information. [20:41:29] So we can apply this to non-stopwords, dictionary and non-dictionary words, badwords and informals. [20:41:35] And I think we'll get all sorts of new signal :) [20:43:02] interesting [20:44:11] what kind of feature would you create with that? [20:44:19] an average of the additions/removals? [20:44:28] or a sum? [20:44:34] something else? [20:44:49] Yeah. A sum of additions and removals individually. [20:44:55] So a proportion sum. [20:45:25] the sum would be inside groups like "badwords", misspellings, etc? [20:45:31] Yup [20:45:32] :) [20:48:42] BTW: halfak wasn't there a copy of the models I'm generating already available on labs? so I could just copy instead of regenerating identical models again? [20:49:08] Helder, probably, yes. [20:49:18] ...but I'm worried that it would be an unfair comparison. [20:49:33] Because there could be something different. [20:49:47] Either way, you need the whole feature set if you are training a new model. 
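The proportional term-frequency idea above can be made concrete with a small sketch: sum the change in relative frequency of a term group (e.g. badwords) between the parent and current revision. Function names here are illustrative, not revscoring's actual API:

```python
import collections
import re

def term_frequencies(text):
    """Map each lowercase word to its relative frequency in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = collections.Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def proportional_delta(parent_text, current_text, terms):
    """Summed change in relative frequency for a group of terms
    (badwords, informals, non-dictionary words, ...) across an edit."""
    old = term_frequencies(parent_text)
    new = term_frequencies(current_text)
    return sum(new.get(t, 0.0) - old.get(t, 0.0) for t in terms)
```

Adding one more instance of a word the article is already full of barely moves its proportion, while introducing a brand-new curse word moves the group's proportion a lot, which is exactly the signal described above.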
[20:50:06] Actually, I forgot that the models are checked into the repo. [20:50:23] The 'editquality' repo [20:50:41] Either way, you need that features file for the old and new comparison. [20:55:16] halfak: Can you make a test server for me? [20:55:28] Amir1, for constructing models? [20:55:34] no [20:55:38] for the ores extension [20:55:40] http://ores.wmflabs.org/scores/enwiki/damaging/?revids=10|9|8|7|6|5|4|3|2|1 [20:55:57] they're all not damaging [20:55:58] As in deploy the testwiki model? [20:56:04] yeah [20:56:26] Hmm... Task switching at the moment would be expensive. Could I do it tomorrow? [20:56:53] I'd like to get a new wikidata model built so that we can deploy that with the testwiki model. [20:56:55] sure thing [20:57:01] With 0.97 AUC :) [20:57:04] no rush [20:57:06] awesome [20:57:21] Seriously want to run some tests with that. [20:57:30] Sounds too good to be true. [21:01:54] hmm [21:01:58] let's check [21:13:28] halfak, is it normal to have things like "mwapi.errors.APIError: internal_api_error_DBQueryError: [7984dda5] Database query error -- None" during the process? [21:14:10] Helder, sort of. For some reason the Wikipedia API has been sketchy recently. [21:14:19] We should really have a retry in there. [21:14:27] We shouldn't be seeing DBQueryError.
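The retry halfak mentions for transient API errors could look like this generic sketch. The callable and the error classes are stand-ins; this is not mwapi's actual interface:

```python
import time

def with_retries(func, attempts=3, base_delay=0.5, retry_on=(Exception,)):
    """Call func(), retrying with exponential backoff on transient errors.

    Re-raises the last error once the attempt budget is exhausted.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Hypothetical usage: retry a flaky API request up to 3 times.
# result = with_retries(lambda: session.get(action="query", revids=rev_id))
```

Restricting `retry_on` to the API's transient error types (rather than every exception) keeps genuine bugs from being silently retried.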