[09:37:07] halfak so do we regenerate scores where we have older ones with the new model?
[14:24:13] ToAruShiroiNeko_, yes
[14:24:26] This was the "cache invalidation" stuff I was talking about yesterday.
[14:33:26] * halfak reads about git submodules
[14:37:13] Would one want to develop revscoring code while in submodule mode?
[14:37:19] * halfak considers ^
[14:40:39] Oh god. You need to learn a whole new set of commands
[14:40:44] WHY?
[14:42:47] So, it looks like we might be better off just using the submodule to deploy our code.
[14:45:25] We're in an OK state, but something is weird with worker-03.
[14:45:30] In flower, I get "Unknown worker 'celery@ores-worker-03'"
[14:45:49] And it doesn't seem to be taking on new tasks.
[14:46:21] Celery is pegged at 100%
[14:47:27] Looks like the last log line is from 4 hours ago
[14:48:10] No error. Just the weird stderr output that myspell gives when it can't deal with a character that isn't utf-16.
[14:48:38] Let's see if I can get rid of those
[15:20:20] o/ halfak
[15:21:04] o/ Amir1
[15:21:16] https://github.com/rfk/pyenchant/issues/58
[15:21:25] ^ UTF-16 and pyenchant stuff
[15:21:33] I met some Googlers a few days ago (one of them is a research scientist); he suggested I send a paper to KDD. Next year it's in SF. You can be there :)
[15:22:05] Yeah. That sounds good. I think that we could publish about your work on WikiBase|Data
[15:22:30] We could pull in the clustering stuff too.
[15:22:31] we are catching upstream bugs, I call this work cutting edge
[15:22:32] When is the deadline?
[15:22:38] :D
[15:23:00] http://www.kdd.org/kdd2016/
[15:23:26] I'm trying to find the deadline
[15:23:44] It was Feb. for the conf this year
[15:23:49] Probably similar this time.
[15:23:53] *next time
[15:23:54] yeah
[15:24:03] That seems like good timing for us.
[15:24:19] So, I have a little bit of homework for you and I'll take some on too.
[15:24:45] Your homework: What's most interesting about working in this space?
What do you think others should know about before they get started?
[15:24:59] My homework: how do people do papers like this at KDD?
[15:25:03] Gotta follow the norms.
[15:25:17] sounds fair :D
[15:25:39] I'm aiming to get something into CSCW too.
[15:25:40] I'll think about this and let you know soon :)
[15:26:01] I'd like to be 1st author on the CSCW one, but I think it would be a great idea to have you be 1st author on a KDD submission.
[15:26:20] deadline is pretty close
[15:26:27] October first
[15:27:00] ?
[15:27:49] For CSCW, the deadline will likely be in May
[15:27:57] Are you sure?
[15:28:10] Yeah. I'm on the program committee :P
[15:28:20] We just finished reviewing papers for the conf. this year.
[15:28:22] February 27–March 2, 2016
[15:28:36] Yeah... 2016 is "this year"
[15:28:48] So I'd be aiming for CSCW 2017
[15:28:48] oh I see
[15:28:52] Yeah.
[15:29:06] But I am aiming for something in 2016 as a workshop
[15:29:11] cool
[15:29:18] https://meta.wikimedia.org/wiki/Research:Infrastructure_for_open_community_science
[15:29:41] okay, for KDD I probably can't participate
[15:29:55] visa...
[15:30:12] Yeah... That'd be a bummer, but I still think you should consider first-author-ing it.
[15:30:37] and I'd like to be the first person, but I'm inexperienced in writing papers
[15:30:44] I can present in your absence and the paper on your CV will mean a lot if you want to do something in research & technology later.
[15:30:44] Can you help with that?
[15:30:50] Totally.
[15:30:58] Great
[15:31:05] so let's work!
[15:31:09] \o/
[15:31:28] What part do we want to write up for KDD?
[15:31:42] wb-vandalism?
[15:31:48] or Kian
[15:31:53] or rev scoring
[15:40:39] halfak: oh btw I'm writing tests for wb-vandalism and pywikibase. I added tox.ini and things like that
[15:46:36] Amir1, good Q. I think that wb-vandalism and Kian are interesting for KDD.
A lot of revscoring is old news to the lit -- we're roughly building off of Adler and West's work around damage detection in text documents.
[15:46:56] The thing that I think makes revscoring interesting is more of a CSCW-y angle.
[15:47:38] So, back to Kian and wb-vandalism. With Kian, we'll likely need to do some work to try to generalize it. We'll also need to see if anyone has published about this strategy for extracting structured data before.
[15:48:02] For wb-vandalism, I think that we've already seen a bit of prior work in our collab with Martin and Stefan.
[15:48:27] We can essentially publish the "here are the difficulties and opportunities with detecting damage in structured data wikis"
[15:56:36] halfak: What do you think of generalizing Kian (I want to add NLP-based, content-based parts to it too)
[15:57:29] +1 for that. What do you think is needed in order to generalize it?
[15:58:09] some tf-idf based scripts
[15:58:21] I think I can do it in the next two months
[15:59:13] What, exactly, would be generalized?
[16:06:58] Adding statements would be possible through content
[16:07:09] (now it's possible to add statements in general)
[16:07:46] I'm not sure I understand what you mean by that.
[16:08:01] by "possible through content"
[16:10:06] now Kian adds statements based on categories; it will be possible to add statements based on the content of articles (e.g. frequency of words like "he", "she", etc. to determine whether the article is about a human or not)
[16:13:00] Oh!
[16:13:16] I thought you were reading in words. But categories, of course, make sense too.
[16:13:36] halfak: when you have the chance, tell me how you want to prime librarybase as a datastore.
[16:13:41] So you'd do TF-IDF on the content of the article and the rest of the wiki and then use the high-weighted words as features?
[16:14:08] halfak: exactly
[16:14:17] harej, so, I'm not sure how you'd like to input the metadata, but I'd like to gather it.
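The feature idea Amir describes -- rank an article's words by TF-IDF against the rest of the wiki and keep the top-weighted terms as classifier features -- can be sketched in pure Python. The function name and toy corpus below are illustrative, not Kian's actual code:

```python
import math
from collections import Counter

def tfidf_top_terms(doc_tokens, corpus, k=3):
    """Rank a document's terms by TF-IDF against a background corpus
    and return the k highest-weighted terms (candidate features)."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        # Document frequency: how many corpus docs contain this term.
        df = sum(1 for other in corpus if term in other)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        scores[term] = (count / len(doc_tokens)) * idf
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

# Toy corpus: token lists standing in for article texts.
corpus = [
    "he was born in 1950 he studied physics".split(),
    "she wrote novels and she won awards".split(),
    "the river flows through the valley".split(),
]
article = "he was born in 1950 he studied physics".split()
print(tfidf_top_terms(article, corpus))
```

In the biography example, a pronoun like "he" gets a high weight because it is frequent in the article, which is the signal Kian would use to guess "this item is a human".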
There are API endpoints that, given a DOI, will give us a JSON document.
[16:14:46] Amir1, cool. Sounds very interesting to me. If we write about it, we'll want to be able to speak to the class of strategies.
[16:14:56] E.g. machine learning as input to structured databases.
[16:15:19] We'll need to be able to speak to the decisions you made in Kian and how other contexts might require different decisions.
[16:15:53] exactly
[16:16:02] I really want to talk to you about this
[16:16:04] I'd like to gather it on the basis of (a) whatever comes between <ref> and </ref> and (b) whatever is not between ref tags but uses a template like {{cite book}}. The reason for the latter is that on some articles, a more conventional citation style is used, with brief footnotes and lengthier endnotes.
[16:16:07] harej, I'd like to be responsible for getting you the relationship between this JSON document and where the <ref> that contained the DOI was in an article.
[16:16:24] Note that I emphatically want more than documents that have DOIs.
[16:16:37] harej, no worries. I actually don't pay attention to tags anyway :)
[16:16:51] harej, I hear you. I think I want to do DOIs first though
[16:16:55] Sure.
[16:17:04] If you can find me an API endpoint for ISBN then I'll do that next
[16:17:19] I know the API for PubMed, but they represent a smaller fraction of cites.
[16:17:29] And I bet most refs that have a PubMed ID have a DOI as well.
[16:18:00] Anyways, it is a wiki. If you have an account I can give you All The Privileges and you can set up the data model as you wish.
[16:19:26] harej, na. I don't want to touch the wiki.
[16:19:39] I mean, I will if it comes to that, but I'd rather just get you raw input metadata
[16:19:47] Sure.
[16:19:48] And let you figure out how to organize properties.
[16:19:54] So what do you need me to do, then?
[16:20:27] Not much at this point -- except for looking at examples of metadata that I'll give you soon and trying to figure out how to import them.
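The chat doesn't name which DOI-to-JSON endpoint halfak has in mind; CrossRef's public works API is one such service. A sketch that builds the request URL and pulls a few citation fields from a response in CrossRef's shape -- the helper names and the trimmed sample body are mine, so the example runs offline:

```python
import json
import urllib.parse

CROSSREF_WORKS = "https://api.crossref.org/works/"

def doi_metadata_url(doi):
    """Build the CrossRef works URL for a DOI (percent-encoding the DOI)."""
    return CROSSREF_WORKS + urllib.parse.quote(doi, safe="")

def extract_fields(response_text):
    """Pull a few citation fields out of a CrossRef JSON response."""
    message = json.loads(response_text)["message"]
    return {
        "title": (message.get("title") or [None])[0],
        "type": message.get("type"),
        "container": (message.get("container-title") or [None])[0],
    }

# A trimmed response body in CrossRef's shape, so this runs without network.
sample = json.dumps({"message": {
    "title": ["A study of wikis"],
    "type": "journal-article",
    "container-title": ["Journal of Examples"],
}})
print(extract_fields(sample))
# In real use you would fetch the body with something like:
#   urllib.request.urlopen(doi_metadata_url("10.1000/xyz123")).read()
```

The percent-encoding matters because DOIs contain `/`, which would otherwise be read as a path separator.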
[16:25:58] ello
[16:26:09] wasn't aware IRC was this active :D
[16:39:37] https://github.com/rfk/pyenchant/pull/59
[16:39:46] Upstream fix in place.
[16:39:56] Or rather, pull requested
[16:40:21] * halfak goes to note all of the server work he has been doing in phab
[16:46:25] hi peeps
[16:46:32] o/ Oscar_
[16:46:35] Welcome :)
[16:49:17] Oscar_, what's up?
[16:53:09] everything fine halfak
[16:53:16] How are things here?
[16:53:55] Good! Life was very stressful this last week as YuviPanda and I were debugging some stability issues with the ORES service, but now it seems to be in a great state.
[16:54:10] So we get to press forward with new development now :)
[16:55:01] great :)
[16:57:31] Oscar_, what brings you to #wikimedia-ai?
[16:59:08] halfak: probably a message from ToAruShiroiNeko_ in the café (our village pump)
[17:00:23] I'm interested in the classification system for wikiprojects, or whatever it's called :)
[17:00:48] The article quality predictor?
[17:01:02] Or the vandalism predictor?
[17:02:10] the article quality predictor
[17:02:23] Hi halfak
[17:02:39] Hi YuviPanda!
[17:03:02] Oscar_, we should be able to stand this new type of model up pretty quickly.
[17:03:28] The major problem is finding a good amount of labeled data to train the classifier on.
[17:03:28] halfak: I might be able to take pip out of the deploy process today
[17:03:39] \o/
[17:03:40] Cool!
[17:03:59] It looks like it's a pain to do development in submodules
[17:04:04] A lot of new git commands to learn.
[17:04:54] halfak: oh?
[17:05:02] What exactly do you mean
[17:05:50] "git submodule update --remote revscoring"
[17:06:03] Rather than just "git pull" or fetch/merge
[17:07:19] YuviPanda, I'm working on making our logs be *silent* unless something goes wrong.
[17:07:27] See my latest https://github.com/wiki-ai/revscoring/pull/191
[17:07:33] halfak: you can cd into the submodules and use normal commands
[17:07:43] I fixed this one upstream, but I think we'll want to deploy our stopgap anyway.
[17:07:55] The docs lie -- or overcomplicate things, then.
[17:08:02] Regardless, if that's true, I'm happy :)
[17:08:05] Yes
[17:08:11] So it is basically
[17:08:14] A pointer
[17:08:18] From the parent repo
[17:08:23] To the submodules
[17:08:28] Yeah... That's really what I wanted.
[17:08:37] And a particular commit hash
[17:08:48] halfak: so it is tied to a particular commit
[17:09:15] halfak: the only submodule-related thing is when you want to update the commit the parent points to
[17:09:27] Yeah. That makes sense.
[17:09:33] Then you just cd to the submodule and update it to whatever commit you want
[17:09:46] Then come out and git add / commit it
[17:09:50] Like it was a normal file
[17:09:57] That's it
[17:10:03] Do try it out :)
[17:11:19] I made sure I could go through the deploy flow before I merged, but I didn't check the dev flow -- just read some docs.
[17:11:26] So yeah. I'll do that for the next PR.
[17:11:41] This'll help aetilley work
[17:11:51] And make dev from the vagrant make sense.
[17:12:20] halfak: yeah since ores and revscoring are both there
[17:12:28] halfak: the vagrant script needs some updating
[17:12:52] And the fabfile
[17:13:08] Do you think that sklearn will be ready today?
[17:14:24] YuviPanda, when you look at the deployment process, check out this: https://github.com/wiki-ai/ores-wikimedia-config/issues/29
[17:14:35] I think that there's something weird with ores-staging-01
[17:14:41] halfak: not sure, but I am sure we can get rid of pip anyway and move it to a separate step
[17:14:56] Which we can run only if we know any of the dependencies have changed
[17:14:57] Oh! So we'll maybe just manually install sklearn or something?
[17:15:07] Oh yeah. That makes sense.
[17:15:18] We couldn't do this before since we needed pip for ores and revscoring
[17:15:20] Not anymore
[17:15:27] So that is step 1
[17:15:42] Step 2 is to use packages for as many things as possible
[17:16:04] (Debian ones)
[17:16:57] Makes sense
[17:17:06] halfak: hear you, not in a hurry though.
[17:17:16] I'm here more to help in building some bridges with the w:es community :)
[17:17:29] :)
[17:17:51] So Oscar_, do you use any article quality templates on eswiki?
[17:18:16] halfak: I'm going to freshen up, I'll be back in a few. How long are you gonna be online for?
[17:18:28] YuviPanda, another hour or two
[17:19:01] halfak: OK then I'll be back in 5-10 min for full overlap
[17:19:08] :)
[17:21:27] hehe
[17:24:08] halfak: yes, it's based on the en:wiki wikiproject template, with the classification and all (stub, start, good, etc)
[17:24:16] halfak btw the storyteller is catching up on their backlog
[17:24:21] we may be featured at some point
[17:26:32] Oscar_, if I had a good description of the types of templates that are used, with some examples, I could start building an extractor. Is that something you could do for me?
[17:26:57] ToAruShiroiNeko_, \o/ that's great news :)
[17:38:37] halfak: sure! where do I place it? :)
[17:38:53] How about on the talk page here: https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service
[17:39:27] Ok
[17:48:29] halfak: ok, here for real
[17:48:30] (sorry)
[17:48:49] checking out ores-staging first
[18:25:07] halfak: am modifying the fabfile to not use pip now
[18:25:16] halfak: are you still against not specifying ranges in the requirements.txt file?
[18:25:30] I still think it's going to cause nothing but problems, and we can work around problems when people actually encounter them
[18:26:03] YuviPanda, if we don't have ranges, it's easy to get version conflicts in one's environment.
[18:26:12] err
[18:26:21] E.g. if we specify a different version of 'requests' than 'mwoauth' specifies.
[18:26:28] well
[18:26:32] don't specify transitive dependencies
[18:26:44] Huh?
[18:26:56] actually
[18:26:59] you're already screwed there
[18:27:04] We need requests for our thing and mwoauth needs requests for its thing.
[18:27:08] since if mwoauth wants a different version and you want a different version
[18:27:15] you are in 'hope' land already.
[18:27:18] if it wants a specific version
[18:27:19] Indeed we are, but with ranges, it's easier to not have the problem.
[18:27:20] you should too
[18:27:22] well
[18:27:27] I'm going to stop arguing this now
[18:27:42] but I think it's a terrible idea because you're claiming to support a large number of version combinations
[18:27:50] Not arguing. Just explaining my thoughts.
[18:27:52] for use cases that don't exist atm
[18:28:05] we've already hit nightmares with pip and scipy / numpy because of it
[18:28:07] We've already solved this problem with ranges.
[18:28:16] ok, I disagree but I'm going to move on
[18:28:17] We used to specify the versions exactly and we had clashes.
[18:28:34] YuviPanda, that's because a commit broke the ranges.
[18:28:39] err?
[18:28:42] pip was doing what it was supposed to do.
[18:28:46] it was a problem because it was using ranges in the first place
[18:28:56] Someone changed our scipy range outside of what Jessie had installed.
[18:29:05] but again, let me move on. I don't think I'm going to convince you of this and it's going to stop being my problem soon
[18:29:06] Well, we used to specify versions exactly.
[18:29:09] no, it was upgrading with --upgrade
[18:29:18] That too
[18:29:23] indeed, and there was a conflict and we should've fixed it by pinning our version of requests to what our underlying libraries are using
[18:29:34] instead of just using a range. and fixing it upstream, etc
[18:29:46] YuviPanda, that's cool if we do it in ores-wikimedia-config
[18:29:50] But not if we do it in revscoring
[18:29:53] * YuviPanda moves on
[18:30:21] E.g.
we specify what versions we want in ores-wikimedia-config and let revscoring have a range that includes those versions.
[18:30:21] halfak: for debian all versions will be pinned though. I wonder how we should track that
[18:30:35] halfak: perhaps have a requirements-debian.txt that just specifies the versions? won't actually be used anywhere
[18:30:45] but that's what we'll be running in production, and that's what you should be testing against...
[18:30:53] Yes.
[18:30:55] Indeed.
[18:30:58] ok
[18:31:14] I'll be responsible for testing the prod versions in my local revscoring.
[18:31:29] But our config may not match someone else's
[18:31:32] halfak: I will say that I'll insist on moving to pinned versions the first time we hit a problem because of version differences :)
[18:31:38] So that's why I think it belongs in ores-wikimedia-config
[18:31:45] well, there's no someone else atm, so you're working for an imaginary audience :)
[18:31:48] YuviPanda, that's not fair
[18:31:55] There are many someone elses
[18:32:00] I am working with them :)
[18:32:00] who?
[18:32:09] Bluma Gelley and the IEG for notability
[18:32:24] are they having problems with version compatibility that revscoring's ranges are fixing?
[18:32:29] There's Diyi Yang, who is working on edit type classification
[18:32:39] Yes
[18:33:00] halfak: I think it's fair, fwiw - if it goes down I feel responsible for bringing it back up as well, no matter the cause.
[18:33:18] I don't fully feel comfortable supporting something that could possibly be tested against different versions
[18:33:28] but you say that won't be a problem so I'll trust you until proven otherwise
[18:33:32] YuviPanda, no that's true. But I don't think we need to inflict this on everyone else. Let's just inflict exact versions on our installation with ores-wikimedia-config.
[18:33:59] I don't think 'exact' versions are 'inflict'ing anything on anyone
[18:34:02] YuviPanda, well, that's a weird thing to say about a general library.
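The split halfak is arguing for -- ranges in the library, pins in the deployment repo -- might look like this; the version numbers below are illustrative, not the project's actual pins (the real ranges quoted later in the log use the same `>= x, < y` style):

```text
# revscoring/requirements.txt -- a general library, so ranges:
requests >= 2.4, < 3.0
mwoauth >= 0.2.4, < 0.2.999

# ores-wikimedia-config/requirements.txt -- one deployment, so pins:
requests == 2.7.0
mwoauth == 0.2.4
```

Any pin the deployment chooses must fall inside the library's range, which is exactly the compatibility test halfak volunteers to run locally.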
[18:34:09] I still don't understand the problem you're trying to solve with them
[18:34:14] revscoring is intended to *be* a general library
[18:34:19] Most general libraries get away with just using exact versions
[18:34:42] No they don't :) Or mwoauth would refuse to run all the time because I would have locked it on requests 2.5
[18:34:47] a couple years ago
[18:34:49] or was it 2.0
[18:35:12] ok.
[18:35:45] it's interesting that npm solves this in a different way
[18:35:54] by not allowing any transitive dependencies into the global scope
[18:36:00] so if mwoauth required requests 2.5
[18:36:03] it'll get that
[18:36:09] and if your app got requests 2.6
[18:36:10] it'll get that
[18:36:12] Yeah. I kinda thought that would be what pip did.
[18:36:15] It would have been great
[18:36:16] this of course leads to other hilarious problems
[18:36:21] And we wouldn't have this conversation
[18:36:29] Oh yeah. new conversation, I guess.
[18:36:34] like when you pass back a wrapped thing from version 5 and the calling code thinks it is version 6
[18:36:35] yeah
[18:37:53] halfak: but yeah, do remember to make sure to test against the same versions we'll be running in production.
[18:38:00] Anyway, I'm down for specifying exact versions and testing that there's no conflict with installing revscoring with those exact versions.
[18:38:03] halfak: ^ merge? submodule bump
[18:38:31] halfak: to do this I 1. cd 'ores', 2. 'git fetch' 3. 'git reset --hard origin/master' 4. cd .. 5. git add ores 6. git commit
[18:38:36] 2 / 3 can be a 'pull' as well
[18:39:12] YuviPanda, no pull request yet?
[18:39:26] Or just pushed to master?
[18:39:36] oh lol forgot, assumed it was a PR since it showed up here
[18:39:37] let me do
[18:39:44] done
[18:39:59] halfak: I'm going to be futzing around with staging now
[18:40:02] ^ SPAM TIME
[18:40:03] once this gets merged
[18:40:07] SMAP!
[18:41:27] ^ Maybe we could turn that off somehow.
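YuviPanda's six-step submodule bump can be exercised end-to-end against two throwaway local repos. The repo names here are stand-ins for ores and its parent config repo, and the bump targets an explicit sha rather than `origin/master`, but the flow is the same:

```shell
#!/bin/sh
# Demonstrate the submodule-bump flow from the chat with two scratch repos.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
tmp=$(mktemp -d) && cd "$tmp"

# Stand-in for the upstream "ores" repo.
git init -q ores-upstream
git -C ores-upstream commit -q --allow-empty -m "initial"

# Stand-in for the parent config repo that embeds ores as a submodule.
git init -q config && cd config
git -c protocol.file.allow=always submodule --quiet add "$tmp/ores-upstream" ores
git commit -q -m "add ores submodule"
old=$(git rev-parse HEAD:ores)   # the sha the parent currently points to

# Upstream moves forward (e.g. a PR gets merged).
git -C ../ores-upstream commit -q --allow-empty -m "merged PR"
new=$(git -C ../ores-upstream rev-parse HEAD)

# Steps 1-6: enter the submodule, move it, then record the new sha.
cd ores
git fetch -q
git reset -q --hard "$new"       # the chat uses origin/master; same idea
cd ..
git add ores
git commit -q -m "bump ores submodule"
echo "bumped $old -> $(git rev-parse HEAD:ores)"
```

The final `git add ores` stages nothing but the new gitlink sha, which is why YuviPanda can say the submodule behaves "like it was a normal file".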
[18:43:23] FYI: http://socio-technologist.blogspot.com/2015/09/mediawiki-utilities-unix-style.html
[18:43:52] wah, didn't know you had a blog
[18:43:52] nice
[18:44:39] halfak: heh, so uwsgi restart on staging takes a while too
[18:44:45] halfak: i need to investigate wtf is up with that
[18:44:56] I still think it's graceful restart behavior, but let's see
[18:46:07] halfak: yay for individual tools you can pick and choose
[18:47:08] :)
[18:51:03] halfak: mwapi is a recent addition, isn't it?
[18:51:08] did that patch get merged?
[18:51:11] (to dependencies)
[18:51:19] Good Q
[18:51:30] mwapi >= 0.3.0, < 0.3.999
[18:51:30] mwtypes >= 0.1.3, < 0.1.999
[18:51:31] Yup
[18:51:46] oh, mwtypes as well
[18:51:47] ok
[18:52:01] Yeah. It's a trivial lib. But it's got 'jsonable' as a dependency.
[18:52:03] halfak: have they been deployed already?
[18:52:19] https://github.com/halfak/python-jsonable
[18:52:19] Yes
[18:52:23] Deployed
[18:52:27] :\
[18:52:31] Sorry to pull the rug out
[18:52:32] ok!
[18:52:37] no that's ok
[18:52:45] I was just making sure that a next deploy doesn't 'need' pip
[18:52:48] jsonable has no dependencies though :)
[18:52:56] Oh yeah.
[18:53:03] so that's 3 new packages. shouldn't be too hard...
[18:53:04] also
[18:54:48] halfak: ^
[18:55:53] halfak: I want to run you through deploying a change to ores/revscoring with this setup.
[18:55:54] YuviPanda, tested against staging?
[18:56:05] You should be able to run the fabfile locally
[18:56:07] halfak: I have, but I need an actual change in revscoring
[18:56:13] halfak: or ores to 'really' test it
[18:56:23] Sure. Let's merge something :)
[18:56:32] https://github.com/wiki-ai/ores/pull/87
[18:56:34] That one
[18:56:36] merge that one
[18:56:43] https://github.com/wiki-ai/revscoring/pull/190/files
[18:57:01] That'd work too
[18:57:14] halfak: let's do both!
[18:57:22] Oh!
And this one: https://github.com/wiki-ai/revscoring/pull/191
[18:57:28] It will quiet down some error messages
[18:57:36] That'll make our logs better
[18:57:42] yeah, I'll do that after testing these?
[18:57:46] Sure
[18:58:09] halfak: ok, so now what we need to do is: 1. update the revscoring submodule in ores to point to the new sha, 2. update the ores submodule in wikimedia-config to point to the new sha
[18:58:18] halfak: do you want to try to do that?
[18:58:29] Sure.
[18:58:44] halfak: ok! I'll be here to help. can also just hop on an audio / video call if you think that'll help
[19:00:22] Na. I think I got this
[19:00:25] * halfak did his reading.
[19:00:30] halfak: :D cool
[19:02:03] https://github.com/wiki-ai/ores/pull/90
[19:04:04] YuviPanda, should I wait for merge?
[19:04:11] I suppose the checksum stays the same after merge.
[19:04:11] halfak: yes
[19:04:15] well
[19:04:15] oh
[19:04:15] kk
[19:04:15] yes
[19:04:21] but I am not sure what github will think
[19:04:24] yeah
[19:04:26] :)
[19:04:27] halfak: actually yeah, don't think you need to wait till you merge
[19:04:48] halfak: wanna try to make sure?
[19:04:54] Sure.
[19:05:23] Previous HEAD position was 3c631f3... Merge pull request #89 from wiki-ai/submoduleing
[19:05:32] right
[19:05:32] Oh wait.. no I think it is working
[19:06:20] ?
[19:06:22] Yeah... No, it's warning me about tracking a different branch
[19:07:08] paste?
[19:08:13] https://github.com/wiki-ai/ores-wikimedia-config/pull/33
[19:08:19] Let's see if it works :)
[19:08:58] seems to have
[19:09:19] halfak: hmm, assuming that the sha doesn't change after merging, which it might if there's a merge commit
[19:09:30] Oh yeah.
[19:09:33] That's a good point
[19:09:39] We probably should not do this
[19:09:51] yeah
[19:10:00] halfak: so 1. merge in revscoring and then 2. update in ores?
[19:10:08] halfak: i'll merge the revscoring pr now?
[19:10:23] yes
[19:11:21] halfak: done
[19:11:21] ok
[19:11:24] I merged the wrong PR....
[19:11:25] ok
[19:11:34] You did?
[19:11:40] oh
[19:11:40] no
[19:11:42] I didn't
[19:11:43] Looks right to me
[19:11:45] k
[19:12:21] halfak: yes, it did change the sha
[19:12:27] ha
[19:12:28] k
[19:12:28] because merge commit
[19:12:32] halfak: wanna update again?
[19:13:32] Done
[19:14:52] halfak: ok, should I try to fab stage now?
[19:14:56] and see how that goes?
[19:15:05] YuviPanda, yeah
[19:15:17] doing
[19:16:41] man, restarts take forever
[19:17:38] halfak: I see new code is there now! \o/
[19:18:04] Is it done restarting?
[19:18:12] halfak: yes
[19:18:16] Woot!
[19:18:55] halfak: ok, so that's nice :) flip side of this is that we *must* remember to run fab update_virtualenv if we change requirements.txt or shit will blow up
[19:19:11] but we need a protocol of sorts for dependency changes anyway
[19:19:15] since they'll be debs soon
[19:19:24] yeah
[19:19:37] halfak: how much longer are you going to be here for? if you're going to be here for a while more I'll work on debianizing; if not I'll go run / do things
[19:20:11] I better get off the computer. I've been ignoring real life for like 3 weeks straight.
[19:20:20] Thanks for hacking today YuviPanda :)
[19:20:26] halfak: hah! enjoy real life ;)
[19:20:29] o/
[19:20:34] halfak: I think I'll do some of it but not move anything actually
[19:20:42] halfak: staging might go down at some point as I mess with it
[19:20:44] Sounds good.
[19:20:46] I'll try to leave it in a working state
[19:20:55] halfak: hey, merge the fab before you go? :)
[19:20:56] Feel free to give me tasks if you find something you'd rather not do :)
[19:20:59] Oh yeah
[19:21:31] halfak: sure :) I'd like you to know something about debs anyway at some point :)
[19:21:43] Give me reading homework :)
[19:21:50] Or a task that I could reasonably do.
[19:21:58] Where is all this debianizing taking place anyway?
[19:22:01] halfak: yeah, I'll think about it and leave messages here.
[19:22:03] halfak: ores-misc-01
[19:22:04] hh
[19:22:06] o/
[19:22:09] *kk
[19:22:11] halfak: using wikitech.wikimedia.org/wiki/Aptly
[19:22:14] halfak: cya! enjoy real life
[19:24:29] halfak: are we still using mediawiki-utilities?