[00:26:15] Hi halfak
[00:26:23] o/ pipivoj
[00:26:49] So, re. modeling, most of the work we are doing right now is feature engineering.
[00:27:07] For example, aetilley is working on PCFGs
[00:27:11] * halfak gets link
[00:27:39] https://en.wikipedia.org/wiki/Stochastic_context-free_grammar
[00:27:54] I've been working on term frequency-based measures.
[00:28:12] E.g. https://phabricator.wikimedia.org/T121003
[00:29:05] We're also hoping to include bag-of-words features, but we haven't worked that out yet.
[00:29:59] Are you familiar with bag-of-words generally?
[00:30:04] pipivoj, ^
[00:30:26] I'm currently checking what it is. :/
[00:32:00] Have you considered neural networks as models?
[00:32:46] I suppose it would be hard deciding the number of layers and neurons, but maybe it's not a bad idea.
[00:34:27] pipivoj, +1
[00:34:57] We've been abusing (using?) sklearn -- a collection of machine learning strategies with nice interfaces in python.
[00:35:14] Yeah, I've seen you have
[00:35:15] Amir1 has a neural network that he has been using on Wikipedia though.
[00:35:25] Seems they don't have NNs
[00:36:04] they=scikit-learn
[00:36:25] pipivoj, exactly, BUT it would be cool if we decided we wanted to and (1) found another library that will help us implement NNs or (2) built the general purpose library we want and then use it.
[00:36:36] It seems like we have some options in python.
[00:36:45] I haven't looked at them carefully though.
[00:38:11] E.g. http://pybrain.org/
[00:38:28] Just now I'm reading it.
[00:39:15] Or https://github.com/tensorflow/tensorflow
[00:40:27] Looks like this might be nice: https://github.com/aigamedev/scikit-neuralnetwork
[00:42:08] Anyway, if you wanted to look into that, I'd be happy to help you get our labeled data and start experimenting. :)
[00:42:58] There's a nice web-based python notebook system (jupyter) installed in labs that would be good for running experiments and sharing your results.
[00:43:06] Yes, I am interested. Anyway, what about the performance of Python libraries? I mean, it's an interpreted language, shouldn't C(++) perform faster?
[00:43:53] It probably will in almost all cases. However, a well engineered system in just about any modern/well-adopted language performs and scales well.
[00:44:36] A large portion of the python community is into scientific programming, so they put a lot of effort into performance and parallelization.
[00:44:50] Still, we can never beat raw C++.
[00:45:04] well, and numpy is written in C/Fortran
[00:45:15] so most of your heavy lifting is already handled outside of python
[00:45:17] But I can program a lot faster in python than C++, so I focus on good systems engineering.
[00:45:25] Yeah. Right.
[00:45:35] Sorry. I was going to bring that up too, but you already know.
[00:45:46] libsvm is the example I was going to bring up.
[00:45:51] Totally C.
[00:46:15] But I access it through sklearn.svm.SVC.
[00:46:19] In python
[00:46:27] It's fast and it works great.
[00:47:25] We're using a lot of GradientBoosting and RandomForestClassifier too and they seem to be *way* faster than we can even get the data from the feed in order to make classifications.
[00:47:26] Ok then, my noobish curiosity is satisfied.
[00:47:42] Our biggest bottleneck WRT CPU is the diff algorithm we use.
[00:48:09] If you'd be interested, I'd really like to talk to you about converting that to C.
[00:48:12] :)
[00:48:32] That would speed up a lot of routine data science work that I do -- not just ORES.
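
A minimal sketch of the kind of sklearn usage described above. The feature values, labels, and classifier settings below are invented for illustration; they are not the features or parameters the revscoring models actually use.

    # Sketch only: train a classifier on hand-made feature vectors and score a new edit.
    # (sklearn.svm.SVC would work the same way and wraps libsvm, which is written in C.)
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical training data: one row of feature values per labeled edit,
    # e.g. (chars_added, badword_ratio, is_anon).
    X_train = [
        [12, 0.30, 1],
        [240, 0.00, 0],
        [3, 0.90, 1],
    ]
    y_train = [True, False, True]  # True = damaging edit

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    # Probability estimates for a new, unseen edit.
    print(model.predict_proba([[50, 0.10, 0]]))
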
[00:50:38] pipivoj, one more thing I want to tell you about that you might like -- our feature extraction garden. :)
[00:51:12] This is a branch that I am just about to merge: https://github.com/wiki-ai/revscoring/tree/features_commons/revscoring/features
[00:51:22] Hm, I'm not sure if I'm that well versed in C to take a bite at porting the diff algorithm, since I gather it might be a lot of system stuff to work on. Registers and similar
[00:51:40] pipivoj, exact reason I don't write much C!
[00:52:05] :)
[00:52:16] :/ :)
[00:52:41] So the feature garden contains a definition of every feature that we have ever considered using.
[00:52:49] It's a dependency injection framework.
[00:53:13] So you can just pick and choose the features you want to use in your classifier and tell the "extractor" to figure it out.
[00:53:34] The dependency injection framework figures out how to connect to the API and get exactly what it needs so that you don't have to.
[00:54:56] So, take for example the number of "aliases_added" to a Wikibase item (e.g. in Wikidata)
[00:55:06] https://github.com/wiki-ai/revscoring/blob/features_commons/revscoring/features/wikibase/features/diff.py#L34
[00:55:44] from revscoring.features.wikibase import revision
[00:56:07] revision.diff.aliases_added <--- Your "feature"
[00:56:21] from revscoring.extractors import api
[00:56:30] import mwapi
[00:56:53] extractor = api.Extractor(mwapi.Session("https://en.wikipedia.org"))
[00:57:31] ^ crap. that's english wikipedia. I meant this:
[00:57:44] extractor = api.Extractor(mwapi.Session("https://wikidata.org"))
[00:58:24] extractor.extract(1234567, [revision.diff.aliases_added]) --> 4
[00:58:42] 1234567 is the revision.
[00:59:01] You can just pass a list of features you want to the extractor and the extractor will get them.
[00:59:16] We have (I'm really guessing) 250?
[01:00:02] We have lots of categories of feature types: bytes, temporal, wikitext, wikibase.
[01:00:41] There's the base feature type that works on any revision no matter what weird extensions you have installed in your MediaWiki: "revision_oriented"
[01:01:06] Anyway, when we set up a classifier for a new problem, we mix and match features and run tests until we get a good fitness.
[01:01:23] There are some utilities in that revscoring library that will allow you to do most of that from the command line.
[01:03:44] Here's an example of the feature_left for English Wikipedia: https://github.com/wiki-ai/editquality/blob/rs_v1/editquality/feature_lists/enwiki.py
[01:04:59] pipivoj, I've got to run, but I want to show you one more thing. Our tuning reports.
[01:05:03] https://github.com/wiki-ai/editquality/blob/rs_v1/tuning_reports/enwiki.damaging.md
[01:05:15] Wait a sec
[01:05:17] ^ This is a tuning report for our damage detection model for English Wikipedia.
[01:05:21] Oh sure.
[01:05:31] * halfak gets excited talking about this stuff. :)
[01:05:33] When you say jupyter in labs, you mean wmflabs?
[01:05:37] Yes
[01:05:39] :)
[01:06:04] For me to use over web IUC?
[01:06:25] IIUC if i understand correctly
[01:06:31] phew
[01:06:35] Yes
[01:06:36] :)
[01:06:43] That would be great
[01:06:55] YuviPanda, ^
[01:07:00] I've got a convert!
[01:07:28] Is there a ritual of sorts, or have we just done it?
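
Pulling the extraction snippets above together into one place, a rough self-contained version of that example. The module paths follow the features_commons branch linked above and the revision ID is just a placeholder, so treat this as a sketch rather than a guaranteed-working recipe.

    # Sketch of revscoring feature extraction via the API, consolidated from the
    # discussion above. The revision ID is a placeholder, not a real Wikidata revision.
    import mwapi
    from revscoring.extractors import api
    from revscoring.features.wikibase import revision

    # The dependency injection framework works out which API calls are needed
    # to compute the requested features.
    extractor = api.Extractor(mwapi.Session("https://wikidata.org"))

    features = [revision.diff.aliases_added]
    values = list(extractor.extract(1234567, features))
    print(values)  # e.g. [4]
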
[01:07:36] ;)
[01:07:39] tools.wmflabs.org/paws
[01:07:44] sign in with your wiki account
[01:07:46] and tada
[01:07:52] if your username is ascii characters only
[01:08:05] you can pip install stuff there to play with
[01:08:44] YuviPanda, is that paren bug really bad, or no time to look at it?
[01:10:49] latter
[01:11:33] Gotcha.
[01:11:45] * halfak logs into PAWS and adds some docs to his mw_ examples.
[01:16:09] * halfak just realized he said "feature_left" above and face palms.
[01:16:14] "feature_lists"
[01:41:16] YuviPanda, https://wikitech.wikimedia.org/wiki/PAWS
[01:41:22] And https://tools.wmflabs.org/paws/user/EpochFail/notebooks/examples/mwxml.py.ipynb
[01:41:54] Err. https://tools.wmflabs.org/paws/public/EpochFail/examples/mwxml.py.ipynb
[01:42:00] Man. Those long lists look bad.
[01:42:08] In the editor, they turn into scrolley divs.
[01:48:51] OK. I'm out of here. Have a good one!
[01:48:51] o/
[02:05:34] Have to go.
[02:05:40] Bye.
[12:44:16] halfak
[12:44:20] what would be our topic?
[12:44:20] Technical, Policy, Outreach, Projects, Research, Governance, other
[12:44:35] We are technical, outreach and research
[12:44:39] probably also a project
[16:36:01] hi all
[16:37:26] ello
[16:48:31] o/
[16:53:55] ToAruShiroiNeko, what is the status of the proposal?
[16:53:58] o/ aetilley
[16:54:22] working on it
[16:54:38] what would be our topic?
[16:54:43] Technical, Policy, Outreach, Projects, Research, Governance, other
[16:54:49] We are technical, outreach and research
[16:54:49] probably also a project
[16:55:20] I need to write between 300 to 600 words
[16:56:01] Depends on what we want to talk about in the few minutes that we have.
[16:58:37] indeed
[16:58:43] we can talk about what we did
[17:00:07] issues (positive or negative) which have emerged from projects
[17:00:12] things we did that worked
[17:00:16] things we did that did not work
[17:00:18] etc etc
[17:00:22] we have a good narritive
[17:01:57] *narrative
[17:04:09] ToAruShiroiNeko, maybe it would help if you pointed to some specifics of that narrative you have in mind.
[17:04:27] Regretfully, I can't see how we'll cover *all* of the project in 18 minutes or less.
[17:04:50] * aetilley is switching routers
[17:09:18] ToAruShiroiNeko, ^
[17:10:14] well yes
[17:10:18] we ran into difficulties
[17:10:21] first was rtl
[17:10:37] then non-delimited languages like japanese and korean
[17:10:50] chinese variants were a problem
[17:13:53] Aha! So language support for emerging projects.
[17:14:07] That sounds like it could be a critical issue with ORES as a case study.
[17:14:24] Seems like we could partner with someone from language engineering at the WMF.
[17:16:26] sure
[17:16:30] we also had other issues
[17:16:34] overfitting problems for instance
[17:16:43] the issues with wikidata
[17:17:01] and the latest issue: we are having difficulty with botpedias
[17:17:14] where generating a training revert set isn't as straightforward
[17:17:30] all these difficulties we solved, left for later or are working on now would be our topic
[17:17:39] our work isn't trivial
[17:17:43] it's not straightforward
[17:17:55] we do many tweaks etc which people wouldn't normally see
[17:19:40] ToAruShiroiNeko, maybe we could make efficient quality control generally be the "critical issue"
[17:21:10] halfak: Thanks for the update btw.
[17:21:14] (Over gchat)
[17:21:49] we could go that direction too
[17:26:17] aetilley, no prob. Y'all should be in the loop
[17:28:54] ToAruShiroiNeko, how about we target efficient quality control as a "critical issue"
[17:29:12] We could pull in Krinkle to help us talk about how it works for small wikis in CVN
[17:31:59] And discuss the thresholds we place on counter-vandalism scores -- how much effort we can hope to reduce on major and minor wikis with machine learning.
[17:32:25] We still need that small-wiki CVN model. I haven't had any time to experiment with that.
[17:40:03] sure
[17:40:10] the other issue can be a poster presentation
[17:40:34] Languages ^ ?
[17:41:03] problem with them
[17:41:04] sure
[17:41:50] * halfak copy pastes all the documentation for his workaround
[17:41:54] Curse you sphinx!
[17:42:09] You don't want to curse Egyptian gods
[17:43:40] hmm it doesn't seem to be a god, curse on. :p
[17:43:52] Curse you sphinx developers!
[17:44:01] we can also expand on that angle
[17:44:02] one sec
[17:44:05] this reminded me of something
[17:44:12] * halfak shoots a cannon at sphinx's face
[17:44:14] too soon?
[17:44:27] it doesn't have a nose already :p
[17:44:41] 2015 Community Wishlist Survey
[17:44:42] yes
[17:44:53] issues the community thinks are critical
[17:44:58] we are mentioned there a fair number of times
[17:45:16] :D!
[17:45:18] https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Moderation_and_admin_tools#Suggesting_AbuseFilter_by_machine_learning
[17:45:49] ToAruShiroiNeko, forgot to relay another thing. So, Lila and Wes suggested that maybe we can work out a partnership with grants where we run an ORES campaign.
[17:46:11] We could encourage people to propose new UIs for counter-vandalism/newcomer socialization or new modeling problems.
[17:46:19] E.g. detecting spammy new article creations.
[17:46:29] Or automatically categorizing articles (new or old)
[17:46:54] I'd like to get someone funded to work on the copy-vio detection seriously.
[17:47:15] It would be great if we could partner with the WMF's business development folks to get a good deal on a copy-paste lookup service.
[17:47:35] ^ things halfak doesn't have time for, that ORES/Revision scoring could do.
[17:57:32] dinnering
[17:57:33] brb
[18:00:08] Is that like the Plaigarism detection software that my sisters use? :)
[18:00:42] *"Plagiarism"
[18:05:31] ^ yes
[18:08:01] Interesting.
[18:21:30] so
[18:21:45] we will have grants per model?
[18:22:00] or problem?
[18:22:15] Assuming we can work that out. It's just an idea right now.
[18:22:21] sure
[18:22:26] GSOC might be another opportunity to get funding.
[18:22:33] On a per-model basis.
[18:22:39] a shorter paper trail would be nice, an executive report on progress rather than a weekly report maybe
[18:22:59] it may be better if you have a budget that grants people review
[18:23:15] My impression was that the weekly reports were easy because we really just need to list out what's on the phab board.
[18:23:21] yes
[18:23:25] I am not saying it's hard
[18:23:30] I am not saying we stop them either
[18:23:39] but they are too much for the executive types
[18:23:43] And that the big reports were a burden because they require some story telling.
[18:23:49] they only care about numbers and graphs probably
[18:24:03] In the end, yeah.
[18:24:09] No points for good engineering work.
[18:24:28] halfak: Did you get a chance to look at that new branch?
[18:24:39] Nope. Can you link me quick?
[18:24:49] I'm just finishing up the hard-coded documentation for revscoring.
[18:25:58] Sure. Heck, might as well do one more push first. Hold up.
[18:27:55] k
[18:31:29] https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Bots_and_gadgets#Improve_the_.22copy_and_paste_detection.22_bot
[18:31:38] ranks #9
[18:31:51] i.e. plagiarism detection essentially
[18:32:19] https://phabricator.wikimedia.org/T120435
[18:32:24] seems to be some work there already
[18:38:50] Indeed.
[18:40:07] so then this is the critical issue I will put in the submission
[18:45:01] Cool :)
[18:45:11] halfak: I'm making a couple more changes, but here's the basic idea
[18:45:12] https://github.com/aetilley/pcfg/tree/OOP
[18:46:50] aetilley, one thought re. "read_lines"
[18:47:03] Is it common to use a space as a delimiter in the data files you are working with?
[18:47:13] It seems that a [TAB] would be a better delimiter.
[18:54:28] halfak: noted
[18:54:42] Otherwise, I really like the example you have in your README.
[18:55:31] * aetilley just pushed again, but perhaps not huge changes.
[18:55:56] Well the idea is that the mapping from various string literals to
[18:55:58] say
[18:56:08] their symbol wrt the PCFG
[18:56:18] +1 makes sense.
[18:56:21] can be handled inside the Symbol class
[18:56:42] still?
[18:57:37] My immediate goal is to augment the .train_from_file method to allow users to train from non-Chomsky Normal Form data and do the conversion automatically.
[18:58:10] (CNF means every rule is either unary: X --> terminal, or
[18:58:11] Do you think we might define "readers" that are capable of processing and normalizing a few different formats?
[18:58:33] binary: X --> (Y, Z))
[18:59:18] Yes, that should also be handled by the self.train_from_file() method.
[19:05:16] aetilley, how will we detect the file type?
[19:14:26] You don't. You specify it when you use the training method.
[19:14:32] For now anyway.
[19:14:34] halfak: ^
[19:15:12] aetilley, gotcha. It seems like we might want to make this more explicit and let the train() method accept normalized data.
[19:16:14] So one file type that's begging to be implemented (for this method) is one that just has the q values and skips the counts entirely.
[19:16:27] (That would actually be the equivalent of "normalizing" in this context)
[19:17:07] The PCFG object does not keep track of counts. It only keeps track of a CFG (language and transitions) and parameters q for each transition.
[19:19:56] q = some probability?
[19:21:19] Yes. Remember a PCFG is just a CFG with the addition of a function q from rules to real numbers such that
[19:21:59] sum_{string} q(X -> string) = 1
[19:22:03] for every fixed X
[19:22:25] Gotcha. Is it a conditional probability?
[19:22:35] Think of q as the conditional probability of an X splitting into the symbols in String
[19:22:37] yes
[19:23:10] So, what types of file types/data formats do you think we might eventually support?
[19:25:12] Well, an easy one would be what I call CNF_PARAMS where it's exactly like CNF_COUNTS (currently implemented) except that the q values are given directly.
[19:25:43] Harder would be what I'm calling NCNF_COUNT/PARAMS where there are non-CNF transition rules.
[19:26:01] (e.g. non-binary)
[19:26:23] There are algorithms for converting a non-CNF grammar into a CNF one. I just need to write them down.
[19:26:41] * halfak starts extracting features for dewiki.
[19:27:00] So it looks like we spend a lot of CPU time matching badwords/informals in a large chunk of text.
[19:27:25] aetilley, non-CNF transition rules?
[19:27:40] Oh! I see. So they might be conditional on more history?
[19:28:06] So, like a slightly less hidden markov model.
[19:28:16] No. E.g. any non-binary transition rule is non-CNF
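
(A toy illustration of the CNF_COUNTS idea being discussed above: rule counts get normalized into q values so that, for every fixed left-hand symbol X, the q values of X's rules sum to 1. The rules, counts, and layout here are invented; they are not aetilley's actual file format.)

    # Toy example only: convert rule counts into q parameters and check that
    # q sums to 1 for every fixed left-hand symbol. Rules/counts are invented;
    # the real CNF_COUNTS format in the pcfg repo may look different.
    from collections import defaultdict

    counts = {
        ("S", ("NP", "VP")): 10,   # binary rule
        ("NP", ("DET", "NN")): 6,  # binary rule
        ("NP", ("Alice",)): 4,     # unary (terminal) rule
        ("VP", ("ran",)): 10,
    }

    totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        totals[lhs] += n

    q = {rule: n / totals[rule[0]] for rule, n in counts.items()}

    for lhs in totals:
        total_q = sum(p for (l, _), p in q.items() if l == lhs)
        assert abs(total_q - 1.0) < 1e-9
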
[19:28:40] * halfak tries to remember what CNF stands for
[19:28:49] C normal form?
[19:28:49] Chomsky Normal Form. I define it above.
[19:28:52] Ahh
[19:29:23] Well, a non-binary transition rule might be tertiary.
[19:29:47] And that would mean that more history is taken into account when considering the transition probability at the current state.
[19:29:49] Right. If there are tertiary transition rules, then the grammar is not CNF
[19:29:52] get it?
[19:29:53] E.g. the state before that!
[19:30:00] Yeah.
[19:30:06] Oh
[19:30:08] A hidden markov model has binary rules only.
[19:30:23] A slightly less hidden markov model might have > binary rules.
[19:30:23] I see what you're saying.
[19:30:58] So, can you account for these tertiary rules in your code?
[19:31:14] But the thing is, in this case the ternary rules from say symbol X and the binary rules from that symbol X are mutually exclusive
[19:31:22] the latter are not seen as being a part of the former.
[19:31:35] (until you force this, which is the punch line)
[19:31:41] If not, it seems like we could write an algorithm to collapse > binary to binary conditional.
[19:31:56] Yes. That's what I'm saying.
[19:32:00] Which would lose information, but make other data work in the binary scheme.
[19:32:02] Gotcha.
[19:32:12] Should definitely log a warning in that case.
[19:32:26] Maybe leave a feature request open to handle that better.
[19:34:01] I want to point out though that there's an important sense of "markov" in which arity doesn't matter:
[19:34:04] Yeah... so. Fun story. Term frequency diffing takes a lot of CPU :/
[19:34:17] aetilley, +1
[19:34:36] So the feature request would be to generalize the system so that it could work with N-arity.
[19:34:38] What makes a process Markov is just that the probability of transitioning from state S_t to S_{t+1} is independent of how we got to S_t
[19:35:30] Woops. you're right. I was misrepresenting the hidden in "hidden markov models"
[19:35:45] In this case, regardless of the arity of the transition rules, the computations in these algorithms assume, say, that the current parse is (X1, X2, ldots), and consider parses of these pieces with no care about how we got to this point in the parse.
[19:35:49] halfak:
[19:35:50] ^
[19:36:37] lol: "ldots" = ...
[19:37:20] It just so happens that the algorithms that we have handy ALSO insist that all transition rules be of arity <= 2.
[19:37:29] But a rule could be p(--> X2|X1)
[19:37:31] and that they take the CNF structure. But that's ok.
[19:37:53] or p(--> X3 | [X1, X2] )
[19:37:58] Those have no source
[19:38:05] something needs to be on the left of the arrow. :)
[19:38:15] X1 I guess in the first case
[19:38:25] also p(rule) isn't a rule, it's a number.
[19:38:42] OK. I'm going to get the notation wrong. Can we not get stuck on that?
[19:38:56] sure. X1 --> X1 X2
[19:38:59] is a rule
[19:39:06] (binary)
[19:39:11] Can you convert that rule to English?
[19:39:19] X1 --> X1 | X2 is actually unary. :)
[19:39:41] to english: "X1 rewrites as an X1 followed by an X2"
[19:39:54] e.g. S --> NP VP
[19:40:06] Sentence rewrites as a noun-phrase followed by a verb phrase.
[19:40:15] Ahh yeah. So I am imagining more historical information looking like this:
[19:40:29] X1 X2 --> X1 X2 X3
[19:40:52] Given that we saw X1 and X2 in that order, what's the probability that we'll now see X3
[19:41:31] Remember that there's no time axis here. ...-> X1 X2 X3 means that all three are present after the rewrite
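
(To make the rule notation above concrete with a toy grammar -- the symbols and numbers here are invented: the score of a parse is just the product of the q values of the rules it uses, and no rule's probability depends on how its left-hand symbol was reached.)

    # Toy example: score a parse tree under a PCFG by multiplying rule q values.
    # Grammar and probabilities are invented for illustration.
    q = {
        ("S", ("NP", "VP")): 1.0,
        ("NP", ("Alice",)): 0.4,
        ("NP", ("DET", "NN")): 0.6,
        ("VP", ("ran",)): 1.0,
    }

    # Parse of "Alice ran": S --> NP VP, NP --> Alice, VP --> ran.
    # A node is (symbol, children); a leaf child is a plain string.
    tree = ("S", [("NP", ["Alice"]), ("VP", ["ran"])])

    def score(node):
        symbol, children = node
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = q[(symbol, rhs)]
        for child in children:
            if not isinstance(child, str):
                p *= score(child)
        return p

    print(score(tree))  # 1.0 * 0.4 * 1.0 = 0.4
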
[19:41:43] keep in mind:
[19:41:48] S --> NP VP
[19:41:56] I was assuming that this was a state transition graph.
[19:42:05] In this case, the word "rewrite" doesn't mean anything to me.
[19:42:34] It seems like we might be talking about a tree.
[19:42:34] It is, but the left/right hand sides are descriptions of the current state. There's no history.
[19:42:49] There is certainly a tree structure present.
[19:43:18] "rewrite" means going down one level in the tree?
[19:43:21] rather, left hand side is "Susan Ran"
[19:43:35] right hand side NP: Susan , VP: ran
[19:44:09] But notice that instead of "Susan" we may have "The milkman"
[19:44:26] in which case we better hope our grammar has the rule NP --> DET NN
[19:44:51] where DET stands for determiner and NN stands for Nominal (e.g. "milkman")
[19:46:42] So, this means we have a markov-like process at multiple levels.
[19:47:03] E.g. we can talk about transition probabilities at the bottom of the tree as well as back towards the top.
[19:47:31] Does this give us more information over having transition probabilities at the bottom only?
[19:48:48] What's Markov about all this is that the score of any given parse tree is computed in a Markov-like fashion: first the score of the transition from S to (whatever_string_of_symbols), then the conditional probabilities of splitting the symbols in (whatever_string_of_symbols)
[19:49:34] This will all terminate when we compute the leaves of the tree, which, in CNF, are unary rules from nonterminals to terminals.
[19:50:35] But, say, in computing the probability that an NP is parsed into "The Milkman" we do NOT use information about how we got to NP from our initial S. THAT'S what makes it "markov"
[19:54:42] What is S in this case?
[19:54:45] Sentence?
[19:56:06] Yes, which is customarily the start symbol in each computatin
[19:56:10] computation*
[19:56:29] halfak: o/
[19:56:35] aetilley: o/
[19:57:39] o/ Amir1
[19:57:50] hi Amir1
[19:58:15] hi, I'm talking to the Arab guy who is working on the lists
[19:58:52] \o/ We just heard back from a polish speaker too :)
[19:59:08] We're still waiting on urdu, right?
[19:59:27] BTW, I found a performance issue in our new code.
[19:59:49] halfak: about urdu, I don't know, we got the list
[19:59:52] So, generating token frequency deltas for big revisions is faster than diffs (win)
[20:00:09] But we end up doing both a diff and a token delta so we're almost twice as slow :(
[20:00:26] Amir1, we should ping bmansurov when he is online about taking a look at that list.
[20:01:25] halfak: he works for Uzbek (uz), not Urdu (ur)
[20:01:35] Woops!
[20:01:46] So, do we have lists for Uzbek?
[20:02:18] Urdu is mostly spoken in Pakistan, Uzbek is spoken in Uzbekistan
[20:02:29] Yeah. Realized my mistake :/
[20:02:37] yeah, a very long time ago, bad words are there
[20:02:56] but the list of edits (extracted from the dump), I don't know if it's needed
[20:06:35] halfak: https://github.com/wiki-ai/ORES-GUI
[20:06:45] we should put it in ores.wmflabs.org
[20:06:51] pure js
[20:06:52] :)
[20:07:45] Awesome. +1
[20:07:51] We should merge this into ORES :D
[20:07:56] wiki-ai/ores
[20:08:08] So that it is loaded at ores.wmflabs.org/ui/
[20:08:22] That should be pretty easy.
[20:08:34] Where does CSS get pulled in?
[20:11:10] CSS is not being loaded directly
[20:11:40] it loads a library called semantic ui (it's like bootstrap but prettier)
[20:13:41] * halfak looks more for that
[20:14:28] OH! I see the scripts being loaded at the top of the HTML, but it looks like it is referencing files that may not be there.
[20:14:34] Can we use the tools CDN for this?
[20:15:09] CDN is okay, just downloading them is okay
[20:15:57] I've no preferences
[20:16:16] CDN is preferred. ORES isn't nearly as good at providing assets fast as the CDN
[20:16:26] We use the CDN heavily in Wikilabels :)
[20:16:43] * halfak implements the weird features that we used to need the diff for.
[20:16:49] And tests are passing!
[20:18:31] about the performance
[20:19:20] is there a way to omit the diff?
[20:19:32] if not possible, omit the delta?
[20:19:37] Yeah. that's what I'm working on right now.
[20:19:50] Delta does a better job and is a little faster, so I want to try to get that working as expected.
[20:33:01] (CR) Ladsgroup: [C: 2] Add onOldChangesListRecentChangesLine hook [extensions/ORES] - https://gerrit.wikimedia.org/r/263184 (https://phabricator.wikimedia.org/T122535) (owner: Ladsgroup)
[20:33:34] OK. Running new feature extraction for dewiki.
[20:33:38] self-merging, inevitable devil...
[20:34:05] I'm going to run away and let this process. May be back on in a few hours. May just take the day to relax :)
[20:34:07] o/
[20:34:27] (Merged) jenkins-bot: Add onOldChangesListRecentChangesLine hook [extensions/ORES] - https://gerrit.wikimedia.org/r/263184 (https://phabricator.wikimedia.org/T122535) (owner: Ladsgroup)
[20:35:01] I just saw this: https://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletion/Wikipedia%27s_article_on_George_W._Bush
[20:35:31] apparently there was a wikipedia article about a wikipedia article
[20:45:21] o/
[20:55:42] halfak: for when you're back, CDN is done and the patch is pushed :)
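
(For reference, a rough sketch of the token frequency "delta" idea discussed above -- the cheaper alternative to a full diff. This is not revscoring's actual implementation; the tokenization here is deliberately naive.)

    # Sketch only: a token-frequency delta between two revisions, i.e. how the
    # count of each token changed, without computing a positional diff.
    import re
    from collections import Counter

    def token_delta(old_text, new_text):
        """Return {token: change in count} between two revision texts."""
        old_counts = Counter(re.findall(r"\w+", old_text.lower()))
        new_counts = Counter(re.findall(r"\w+", new_text.lower()))
        return {t: new_counts[t] - old_counts[t]
                for t in set(old_counts) | set(new_counts)
                if new_counts[t] != old_counts[t]}

    print(token_delta("the quick brown fox", "the quick red fox jumps"))
    # e.g. {'brown': -1, 'red': 1, 'jumps': 1} (key order may vary)
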