[15:21:31] o/ awight.
[15:21:35] * halfak reads scrollback
[15:43:24] o/ aetilley
[15:58:25] hi halfak
[15:59:43] Want to video or irc?
[16:02:00] I think IRC.
[16:02:15] So, short term, I can get you datasets that will let you cluster.
[16:02:26] ok
[16:02:44] Long term, I think that I should implement a flag on the extract_features script that will allow us to include the rev_id with the extracted features.
[16:03:05] That way, you'd be able to produce a dataset of (rev_id, features) pairs.
[16:03:20] And we can review those edits to figure out what the clusters *mean*
[16:03:23] word
[16:03:35] yeah, about that
[16:03:37] So. Let me help you get data and then I'll start hacking on that.
[16:03:39] Oh?
[16:04:00] I'm just curious what you had in mind. Suppose I give you two sets of rev_ids.
[16:04:33] Do we start looking at features that are over-represented in either set?
[16:04:39] I guess that's my first intuition.
[16:05:00] The first thing I want to do is figure out what the AUC is on each cluster.
[16:05:16] Ok
[16:05:20] If there's a cluster we don't get good signal on when it's looked at individually, there's something wrong there.
[16:05:58] We might need to cluster over all edits -- not just the reverted ones. We'll find out, I guess.
[16:06:11] ok
[16:06:21] We're developing a methodology, so what we're actually going to do is something we need to figure out.
[16:06:23] OK. Data.
[16:08:10] I would suggest a combination of supervised and unsupervised learning.
[16:08:50] That's not terribly specific
[16:09:34] aetilley, email is sending.
[16:09:42] It has the big 20k feature set attached.
[16:10:02] The last column is a boolean for whether hand-coders marked it damaging or not. You'll probably want to exclude that when clustering.
[16:10:36] OK, just to be clear: if I'm going to compute an AUC, I'll need both the reverted_label and the wiki_labels label, correct?
[16:12:07] received.
[16:13:11] I think we'll need to (1) identify the cluster for each rev_id, (2) train the model with some data held back, (3) test the model with the held-back data and compare fitness across clusters.
[16:13:43] So, right now, I'm hoping you'll work with the data to make sure the clustering strategy works while I get you a dataset that *has* rev_ids. Then we can finish step #1.
[16:15:10] aetilley, if we eventually bake this process into train_test, then we'll do all that in one step.
[16:16:29] BRB
[16:18:55] ok
[16:24:39] halfak it isn't supposed to be specific. A specific example is a card in Phabricator :p
[16:25:17] https://phabricator.wikimedia.org/T113919
[16:30:03] ToAruShiroiNeko_, OK. This card is also non-specific.
[16:30:23] It reads like a vague proposal rather than a specific task.
[16:37:07] halfak: Hey, I'm struggling with dependency injection in revscoring
[16:37:28] Got an error?
[16:37:43] Or figuring out how to design a thing?
[16:38:14] extractor.extract(entry.revisionid, features, cache={entry.revisionid: {revsc_rev_d.text: text, revsc_parent_rev_d.text: parent}}) doesn't work for me
[16:38:27] I want to know what the problem is
[16:38:39] (I tried context instead of cache)
[16:40:07] What is "entry.revisionid"?
[16:40:42] Oh! Don't use the extractor directly for that.
[16:40:46] https://www.irccloud.com/pastebin/qam6uoY6/
[16:40:49] Just use revscoring.dependencies.solve
[16:41:06] If it doesn't need to hit the API, you don't need an extractor.
[16:41:14] can you show me how?
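A minimal sketch of the solve() approach suggested above (halfak's actual answer is the gist linked just below). The datasource import paths, the placeholder texts, and the enwiki feature-list name are assumptions based on the 2015-era revscoring/editquality APIs, not taken from the gist:

    from revscoring.dependencies import solve
    from revscoring.datasources import revision, parent_revision
    from editquality.feature_lists import enwiki  # assumed feature list module

    text = "new revision text ..."         # fetched elsewhere (dump, DB, etc.)
    parent_text = "old revision text ..."

    # Seed the cache with the root datasources. solve() then walks the
    # dependency graph locally, so nothing hits the API -- assuming every
    # root datasource the features depend on is already in the cache.
    cache = {revision.text: text,
             parent_revision.text: parent_text}
    feature_values = list(solve(enwiki.damaging, cache=cache))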
[16:43:17] Amir1, https://gist.github.com/halfak/8308eb1529fe80c566ff
[16:43:53] awesome
[16:43:56] You saved me
[16:43:59] \o/
[16:47:41] So it would seem there should be functionality like str.split(), but where I can give it more than one delimiter corresponding to rows, columns, (etc.) for a target list/array
[16:47:44] anyone?
[16:47:57] Yeah. In python?
[16:48:01] aetilley, ^
[16:48:02] yeah
[16:48:17] Are you processing the file I sent?
[16:48:23] let \t separate the entries within a row and \n separate the rows
[16:48:27] yes
[16:48:36] OK. I'll write you a quick gist.
[16:48:41] One sec.
[16:48:43] In the past I've counted the elements and used reshape
[16:48:51] but this requires one to know the length of rows
[16:49:04] ok, brb
[16:55:55] aetilley, https://gist.github.com/halfak/c7d7f923c58868cfac48
[16:56:25] So this function requires that you import the python feature list so that you know how to convert the string values that appear in the file to the right datatype.
[16:57:22] you'll need to 'pip install editquality' to gain access to that feature_list.
[17:02:23] Thanks. Trying it now.
[17:06:33] aetilley, ^ almost ready.
[17:08:32] 200th pull request!
[17:09:01] * YuviPanda groggily waves at halfak
[17:10:26] o/ YuviPanda
[17:10:49] halfak: "no module named 'editquality.feature_lists'"
[17:11:01] aetilley, 'pip install editquality'
[17:11:02] but editquality is installed
[17:11:05] oh
[17:11:07] hmm
[17:11:51] halfak: extracted features for 1K edits in 17.54 seconds
[17:13:32] Amir1, sounds pretty good. We might get even more of a speedup by preserving the revision.parse_tree datasource and re-assigning it as parent_revision.parse_tree for the next revision.
[17:13:46] Or wait. Whatever the pywikibase types equiv. is.
[17:14:17] aetilley, confirmed the problem. looking into it.
[17:16:00] halfak: should I work on doing more bits for revscoring in production today or for quarry data import?
[17:16:19] YuviPanda, revscoring production, I think.
[17:16:26] If all things were equal.
[17:16:27] ok!
[17:16:36] debiannnnn paaaccckkkages
[17:16:43] * YuviPanda does things
[17:16:51] let's see if I can move us off pip completely in staging today
[17:17:26] I got to go, I'll come back pretty soon and finish this off
[17:17:54] aetilley, got a weird issue here. I have a fix you can do in the meantime.
[17:18:07] Clone this: https://github.com/wiki-ai/editquality
[17:18:18] Run 'python setup.py install' inside of it.
[17:18:21] It should work.
[17:18:24] ok. one sec
[17:20:47] Amir1: look at https://etherpad.wikimedia.org/p/quarry-for-pwb-scripts when you're back maybe
[17:23:06] halfak: that seems to have worked
[17:23:18] aetilley, great.
[17:23:27] This is a dirty rabbit hole I'm working my way down :(
[17:28:32] aetilley, nothing you need to do, but FYI: the issue is fixed, so 'pip install editquality' will get you the feature_lists now.
[17:28:57] noted
[17:30:05] although I might modify this script to return a numpy array
[17:30:42] aetilley, makes sense.
[17:31:02] Should be trivial to go between a python list (actually an array) and a numpy array.
[17:31:42] it will be straightforward, but np.array(feature_rows).shape is (19884, 2), which is not quite what we want.
[17:31:50] It's ok, I'll change it.
[17:32:05] I'm going to go ahead and self-merge https://github.com/wiki-ai/revscoring/pull/200 so that I can start generating feature value files with rev_ids in them.
[17:44:13] OK. New features are being extracted with rev_id included.
[17:44:16] I've got to run.
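A sketch of the kind of reader discussed above: split on \n for rows and \t for the entries within a row, then use the feature list to cast each string to its feature's type. The layout (feature values first, "True"/"False" damaging label last) follows the chat; the file path and the use of each feature's `returns` attribute for casting are assumptions, not the contents of halfak's gist:

    import numpy as np
    from editquality.feature_lists import enwiki  # assumed feature list module

    def cast(type_, value):
        # bool("False") is True in Python, so booleans need special handling.
        if type_ is bool:
            return value == "True"
        return type_(value)

    def read_feature_file(path, features):
        """Parse one tab-separated observation per line into typed rows."""
        rows = []
        with open(path) as f:
            for line in f:
                parts = line.strip().split("\t")
                values = [cast(feat.returns, v)
                          for feat, v in zip(features, parts)]
                rows.append((values, parts[-1] == "True"))  # last column = label
        return rows

    rows = read_feature_file("features.tsv", enwiki.damaging)  # hypothetical path

    # np.array(rows) gives the (n, 2) shape of (values, label) pairs that
    # aetilley mentions above; pull the values out for a proper matrix.
    X = np.array([values for values, label in rows])  # (n_edits, n_features)
    y = np.array([label for values, label in rows])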
[17:44:32] aetilley, there'll be new datasets for you tomorrow morning. I'll email 'em when I got 'em.
[17:44:33] o/
[17:46:03] ok
[17:46:04] thanks
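Finally, a sketch of the methodology from steps (1)-(3) above, using scikit-learn and the X and y arrays built in the reader sketch. The chat doesn't name a clustering algorithm or a model, so KMeans and RandomForestClassifier are stand-ins, and the modern sklearn import paths are an assumption:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Step 1: assign each edit to a cluster. The label column is already
    # excluded from X, per the note above about clustering.
    clusters = KMeans(n_clusters=5, random_state=0).fit_predict(X)

    # Step 2: train the model with some data held back.
    X_train, X_test, y_train, y_test, _, c_test = train_test_split(
        X, y, clusters, test_size=0.25, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Step 3: compare fitness (AUC) across clusters on the held-back data.
    # A cluster with notably low AUC is one the model gets no signal on.
    scores = model.predict_proba(X_test)[:, 1]
    for c in np.unique(c_test):
        mask = c_test == c
        if len(np.unique(y_test[mask])) == 2:  # AUC needs both classes present
            print("cluster %d: AUC = %.3f"
                  % (c, roc_auc_score(y_test[mask], scores[mask])))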