[00:39:57] 90%! [00:40:03] And I found some bugs :) [01:18:18] Fun story. Just ran into a case where my local enchant dict differed from the one for travis. [01:18:36] We'll need to be careful to make sure that we generate features on a similar install as ORES. [01:19:51] wiki-ai/revscoring#455 (features_commons - 3aa04cf : halfak): The build was fixed. https://travis-ci.org/wiki-ai/revscoring/builds/99363928 [01:19:55] \o/ [01:20:01] Take that travis [01:20:23] Hmm... Travis only reports 88% coverage. [11:32:23] halfak: fyi, wikilabels was down apparently, uwsgi wasn't responding. someone reported on -labs, I restarted (couldn't investigate, it's 3:30 AM) [14:27:27] Thanks YuviPanda! [14:31:49] YuviPanda: hey, can you delete the ores service group in tools? [14:32:32] Hey Amir1! [14:32:42] Working on the blog while I sip my coffee [14:32:46] hey halfak :) [14:33:01] I'm working on it [14:33:16] Adding things etc. [14:33:30] I'll try to get something together re. limitations. [14:34:19] awesome [14:35:27] I hope we finish this soon [14:35:57] +1 We want to get it out there. [14:36:45] Oh! Also, I want to re-think the balanced extractor. [14:36:58] Right now, it is primarily IO blocked on looking up information about users. [14:37:16] I think processing the XML dump is the wrong strategy. [14:37:27] We cache user data [14:37:38] Yeah. Even so. [14:37:39] so for every user we can have them [14:37:44] but [14:37:46] I agree [14:37:59] I think that what we want to do is run label_reverted on an extra large sample of non-bot edits and then subsample until balanced. [14:38:17] I think we should only use xml dumps for non-user related stuff [14:38:26] +1 [14:38:31] like --only-reverted [14:40:16] We'll get a big boost by just ignoring bot edits, and bot user_ids is a short list that we can put into a set() to get constant lookup speed. [14:41:01] I think that, in the future, we'll want to exclude bot edits from our training entirely.
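The set()-based bot filter halfak describes can be sketched in a few lines. The user_ids below are made-up placeholders, not real bot accounts:

```python
# Sketch of the constant-time bot filter discussed above.
# BOT_USER_IDS values are hypothetical, not real Wikidata bot ids.
BOT_USER_IDS = {4115189, 104981, 232123}  # short list -> O(1) membership test

def non_bot_edits(edits):
    """Yield only edits made by non-bot users.

    `edits` is an iterable of (rev_id, user_id) pairs.
    """
    for rev_id, user_id in edits:
        if user_id not in BOT_USER_IDS:
            yield rev_id, user_id

edits = [(1, 4115189), (2, 999), (3, 232123), (4, 1000)]
assert list(non_bot_edits(edits)) == [(2, 999), (4, 1000)]
```

Because `set` membership is a hash lookup, the cost doesn't grow with the number of bots, which is the "big boost" over repeatedly querying user info.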
[14:41:12] We might want to have different classifiers specifically for bot edits. [14:43:27] I <3 this page: https://www.wikidata.org/wiki/Wikidata:ORES/Report_mistakes [14:43:42] I want to do one for the anon false-positives across the set of wikis. [15:11:00] halfak: I did that in my old script [15:11:09] we can do it in this code too [15:11:37] do you want too? [15:11:40] +1 [15:11:58] I might need to figure out how to get old models up and running so that we can show our progress for models we replaced 6 months ago. [15:12:07] That should be pretty easy actually. [15:12:15] We have a history of everything! [15:19:25] halfak: re. the bug in the qualifier test, can you tell me from which revision you've got the old and new Alan Turing items? [15:20:10] Good Q. Let me try to find out. [15:21:22] halfak: btw. It's Alan Turing [15:21:36] Can you change file names? [15:21:43] Did I typo? [15:21:48] yup [15:21:51] lol [15:21:52] alan touring [15:22:02] derp [15:22:08] :D [15:38:20] OK. Can't figure out what revisions they are :( [15:38:32] I have an idea though. [15:38:39] Can you edit the JSON directly in wikidata? [15:39:11] We could save the two JSON blobs as revisions on a Sandbox and then use the visual diff to inspect them. [15:39:14] Amir1, ^ [15:39:39] no, it's not possible unfortunately [15:39:47] we can send it through API though [15:39:57] but I'm positive we will get an error [15:40:14] since there are some tools that prevent creating duplicate items [15:40:24] Arg! [15:40:39] I'm trying to figure out what's wrong [15:40:50] Can you make a Sandbox that will expect the right content format? [15:41:10] I'm writing some code that inspects values [15:41:16] I don't think so [15:41:33] don't worry about it [15:41:48] I'm working on it [15:42:09] OK. Will leave it to you. [15:43:52] halfak: it seems the number of qualifiers hasn't changed [15:44:22] OK. So I guess the code is working right, but the test case is insufficient.
[15:44:39] or there is a problem upstream [15:44:44] which I highly doubt [15:44:57] (by upstream I mean pywikibase) [15:44:57] What are qualifiers usually used for? [15:45:24] The year a population was estimated? [15:45:31] that's one of them [15:45:39] e.g. for Chelsea Manning sex. We say it was male once [15:45:43] *she [15:45:51] I wonder if we could just put together a custom test. [15:46:05] How would you write a simple test to detect a qualifier change? [15:46:07] we can just remove one of the qualifiers [15:46:12] It won't do any harm [15:46:24] let me paste it for you [15:46:37] great [15:48:33] halfak: https://gist.github.com/Ladsgroup/b0738ec40188cfc10241 [15:48:41] It's an old version of pywikibase though [15:48:48] but AFAIK it's no different [15:49:11] but you can see the algorithm [15:49:47] So, but in this case, the qualifiers didn't change. I want to fabricate some data that *does* have a qualifier changed. [15:50:17] E.g. we could put the revision.item_doc and revision.parent.item_doc in the cache that have nothing but a difference in qualifiers. [15:50:47] we can change it two ways [15:50:48] Would you write a custom JSON doc or just build up a pywikibase.ItemPage? [15:51:03] 1- Remove a qualifier in the json file [15:51:20] 2- make the item by pywikibase.Item() [15:51:33] and then find a claim that does have a qualifier [15:52:00] and claim.removeQualifier(claim.qualifiers[0]) [15:52:10] How would you construct a pywikibase.Item() for testing? [15:52:12] I can find the exact names of these methods [15:52:39] new version: pywikibase.Item(json_content) [15:52:58] old version: item = pywikibase.Item [15:53:04] item.get(json_content) [15:53:28] OK. what does json_content look like though? [15:53:43] I'm not familiar enough with the format to make a fake-document with the right properties. [15:54:13] Could we do pywikibase.Item().addClaim(pywikibase.Claim(...))? [15:54:24] That would make an item with just one claim and nothing else?
[15:54:29] yes we can do that too [15:54:38] I think that'll be good for unit testing. [15:54:43] but that's the wrong syntax [15:54:44] we should do [15:54:45] I'm cheating by loading in a whole document. [15:54:55] item = pywikibase.Item() [15:55:03] item.addClaim(claim) [15:55:16] (like list.sort()) [15:55:38] Gotcha. Makes sense. [15:55:50] halfak: also you can see what the json content looks like by checking out the json files [15:55:53] it's messy I know [15:56:00] but use pprint [15:56:04] or json.dumps [15:56:19] yeah. I gotcha. I was trying to avoid reverse engineering that :) [15:56:34] +1 [16:00:03] halfak: oh also we are in the blog again [16:00:05] http://blog.wikimedia.org/2015/12/29/blog-popular-posts/ [16:01:18] Ha! Cool. I wonder if we'll get a pageviews bump again on [[:m:ORES]] [16:01:19] [1] https://meta.wikimedia.org/wiki/:m:ORES [16:01:24] Thanks AsimovBot [16:02:32] Oooh! We can see when major news stories came out in the page view data. [16:02:40] 12/19 and 12/23 [16:03:47] * halfak loads up a new graph for ORES [16:03:57] o/ ellery [16:06:53] o/ aetilley [16:07:01] Amir1, https://upload.wikimedia.org/wikipedia/commons/b/b8/Pageviews_ORES_and_Revscoring.svg [16:07:28] We were getting 1000 views/day at one point! [16:07:42] OMG [16:10:29] win [16:11:37] halfak: I just made some edits in the blog post draft in the getting involved part [16:12:00] Nice. Saw the edit to the project page too. [16:12:54] oh yeah [16:13:07] halfak: how do you generate those graphs? [16:13:30] I have a hacked version of the webapp that they released with the pageviews API. [16:13:48] The javascript only understood wikipedias, so I hacked it to work for metawiki too. [16:14:11] It's all wrapped up in one JS file. Do you want me to send? [16:14:18] *HTML file [16:14:21] The JS is in the HTML [16:15:09] We need something like that for ORES. [16:15:37] A simple single-page script that provides a simple UI to just score edits.
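Picking the qualifier-test thread back up: one way to fabricate the `revision.item_doc` / `revision.parent.item_doc` pair without pywikibase at all is to diff qualifier counts on Wikidata-style JSON directly. The documents below are minimal fabrications that only approximate the real item format (`P1082`/`P585` are just illustrative property ids):

```python
def count_qualifiers(item_doc):
    """Count qualifier snaks across all claims of a Wikidata-style item doc."""
    total = 0
    for statements in item_doc.get("claims", {}).values():
        for statement in statements:
            for snaks in statement.get("qualifiers", {}).values():
                total += len(snaks)
    return total

# Two fabricated revisions of an item: the newer one drops a qualifier.
parent_doc = {"claims": {"P1082": [
    {"mainsnak": {"property": "P1082"},
     "qualifiers": {"P585": [{"property": "P585"}]}}]}}
current_doc = {"claims": {"P1082": [
    {"mainsnak": {"property": "P1082"}}]}}

assert count_qualifiers(parent_doc) - count_qualifiers(current_doc) == 1
```

A test case built from fixtures like these would exercise the qualifier-change feature even though the real Alan Turing revisions couldn't be identified.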
[16:15:40] I want to see the original one [16:15:55] because I'm interested in getting these graphs for fa.wp [16:16:11] Here's the code: https://gist.github.com/marcelrf/49738d14116fd547fe6d#file-article-comparison-html [16:16:21] "A simple single-page script that provides a simple UI to just score edits. " I can do something for this [16:16:31] let me find some time [16:16:46] Is it okay for you? [16:16:49] Sure! [16:16:54] We've got a lot to do first though [16:17:14] Oh yeah! Was meaning to ask you to look into the filter-rate of the Wikidata model. [16:17:39] I want to work out how many edits one needs to review in order to catch 95% of edits that will need to be reverted. [16:17:50] exactly, I'll probably do it from Jan 15 [16:18:29] halfak: I don't know how to run things on the test set [16:18:33] at all [16:18:52] you can use the `revscoring score` script. [16:18:57] Or write your own script. [16:19:12] I think that what you'd need to do is 1, train a new model with a test set withheld. Find out the maximum threshold that you can set that catches 95% of reverted edits. [16:19:51] Essentially, all you need is \t pairs. [16:20:19] Once you find the threshold, then you'll want to find out what % of scores cross that threshold. [16:20:24] That's our filter rate. [16:20:51] hmm [16:21:04] I can run some of my scripts to get f1-score too [16:21:16] f1-score depends on where we set the threshold. [16:22:07] We need this for the blog post, I think. [16:22:30] Right now, we're just guess-estimating our filter rate. [16:22:57] You know what. Don't worry about this today. I'll take a first pass to see what I can learn. [16:23:08] I might just use our current model and label some new edits. [16:23:14] Since data is free :) [16:23:17] f1 score will determine the threshold [16:23:45] we can get optimum ones [16:23:47] Amir1, f1-score will allow us to optimize for balance, but I think we want to optimize at a fixed recall.
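The fixed-recall procedure halfak outlines (find the highest threshold that still catches 95% of reverted edits, then measure what fraction of all edits cross it) can be sketched as follows; `scored` stands in for the (score, was_reverted) pairs from the withheld test set:

```python
import math

def threshold_for_recall(scored, target_recall=0.95):
    """Given (score, was_reverted) pairs, return (threshold, filter_rate).

    `threshold` is the highest score cutoff that still catches
    `target_recall` of the reverted edits; `filter_rate` is the fraction
    of all edits at or above it (i.e. the share needing human review).
    """
    reverted_scores = sorted((s for s, r in scored if r), reverse=True)
    if not reverted_scores:
        return None, 0.0
    needed = math.ceil(target_recall * len(reverted_scores))
    # Threshold = score of the `needed`-th highest-scoring reverted edit.
    threshold = reverted_scores[needed - 1]
    flagged = sum(1 for s, _ in scored if s >= threshold)
    return threshold, flagged / len(scored)

pairs = [(0.9, True), (0.8, False), (0.6, True), (0.3, False), (0.2, False)]
assert threshold_for_recall(pairs) == (0.6, 0.6)
```

On the toy data, catching both reverted edits forces the threshold down to 0.6, which flags 3 of 5 edits for review — the same trade-off the real analysis later in the log measures at scale.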
[16:23:55] yeah [16:24:13] I was talking about another thing, just an experiment [16:24:35] We can use the f1 score in our hyperparameter optimization. [16:25:08] +1 [16:25:14] I use it in Kian [16:25:27] I use f-0.25 to add [16:25:44] and anything between f-1 and f-0.25 for human review [16:26:13] OK. I'll work on this in the next 4 hours and let you know how far I get. [16:26:25] I've got a meeting in 30 minutes, but then my day is wide-open. [16:26:32] okay [16:26:41] I need to study for a while [16:26:54] +1. [16:27:03] but my only thing right now is getting data [16:27:13] to get that number for the blog post [16:28:18] last thing halfak, How can I get numbers of the test set? I don't want to randomly sample [16:28:21] it's okay to sample [16:28:22] That and code review for the massive PR [16:28:44] oh about code review, I'm okay with merging it now [16:28:54] but let's run some tests for performance [16:29:03] Amir1, my plan is to sample edits from outside of our train/test set, label them using the same labeling script as reverted or not and then apply our live model (via ORES) to the new edits. [16:29:04] and ask Helder and YuviPanda to take a look [16:29:08] since it's big [16:29:19] Amir1, let's not merge yet then. [16:29:26] I'll ask Helder to take a pass over it. [16:29:43] And I'll make sure that the docs are ready to go. [16:29:58] okay [16:30:10] I'll do performance tests by rebuilding all of the edit quality models and then we should merge. [16:30:11] about the test, okay [16:30:32] what do you think of sample size? [16:30:48] I think we'll be able to do OK with 1-2k edits, but I'll do 10k to be safe. [16:30:53] I'll only sample non-bot edits. [16:31:09] okay [16:31:21] I'll sample both and see what happens [16:40:17] halfak: I'm sampling using this: sql wikidatawiki "select rc_cur_id from recentchanges where rc_bot = 0 and rc_type in (0, 1) order by rand() limit 10000;" [16:40:31] Looks good to me. [16:40:34] it's working now [16:40:49] Cool.
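The f-1 and f-0.25 measures Amir mentions are instances of the general F-beta score, where beta < 1 weights precision more heavily than recall. A minimal sketch:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: (1 + b^2) * P * R / (b^2 * P + R).

    beta < 1 favors precision (e.g. Kian's f-0.25 for auto-adding),
    beta = 1 is the familiar f1 balance of precision and recall.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

assert f_beta(1.0, 1.0) == 1.0
assert f_beta(0.5, 0.5, beta=0.25) == 0.5
# At beta=0.25, a precision-heavy classifier beats a recall-heavy one:
assert f_beta(0.9, 0.3, beta=0.25) > f_beta(0.3, 0.9, beta=0.25)
```

This matches the workflow described above: a strict precision-weighted score for automatic additions, and the band between f-0.25 and f-1 routed to human review.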
BTW, if you want to drop this and get back to your studies, that's OK with me. [16:41:13] I'm about to finish this [16:41:17] But if you want to keep going, I want (rev_id, score) pairs. [16:41:35] I want to leave it working (extracting scores and reverted status) [16:41:45] kk [16:41:46] so I don't waste any time [16:41:49] :) [16:49:32] halfak: Is the new version deployed to ores? [16:49:50] the one I was working on that improved AUC by 1% [16:51:39] Amir1, good Q. You should be able to check the version [16:52:39] halfak: it's not deployed yet [16:53:16] can you deploy it? at least to ores-staging [16:53:48] I think I made the PR [17:03:52] I've got to go [17:03:54] be back soon [17:04:40] Will deploy before generating these scores. [17:11:08] huh... wikidatawiki model isn't working right. [17:11:14] It's predicting false for everything. [17:11:17] 100% [17:15:06] Hmm.... I am using a slightly more recent version of wb-vandalism. Let me revert to the version that the model was generated against. [17:15:50] Yup. Still predicting false for everything. [17:15:51] Weird. [17:16:05] * halfak prepares to re-train [17:21:46] Hmm... no change to requirements. [17:21:53] I'm re-extracting features now. [17:22:00] I should have a new model in a couple hours. [17:23:38] o/ Amir1 [17:23:56] For some reason, the model that is in wb-vandalism always predicts False @ 100% now. [17:23:57] o/ halfak [17:23:58] I wrote the code to get scores [17:24:06] I did some troubleshooting, but I couldn't figure out what was up. [17:24:20] So I am re-extracting features and training now. [17:24:38] oh boy [17:24:46] yeah. Glad I checked :) [17:25:09] features looked okay as far as I remember [17:25:37] Yeah. I don't see an obvious problem.
[17:26:08] in ores-compute [17:26:23] I got reasonable scores for all of the revisions [17:26:33] that's how I updated the table [17:26:43] that's strange [17:26:50] maybe I missed something in my PR [17:27:14] OK if I make your homedir on ores-compute readable by anyone? [17:27:46] I can make it [17:27:51] I don't know if it is [17:28:09] Woops. Accidentally hit enter. So {{done}} [17:28:09] How efficient, halfak! [17:28:14] lol [17:28:29] :)))) [17:28:59] it's okay [17:29:54] Amir1, so, when I run the code from your directory on ores-compute, I get the same issue. [17:29:57] WEIRD! [17:30:16] wut [17:30:28] I even used your virtualenv! [17:30:35] I just got it from there [17:30:48] I don't know [17:30:55] that's super strange [17:30:55] Strange stuff. [17:31:28] We're almost ready to train a new model. :) [17:31:44] 20/24k features extracted. [17:32:19] okay [17:32:23] I go to study [17:33:13] Godspeed! [17:33:23] Oh hey! Scoring script. Is it running? [17:33:28] Amir1, ^ [17:33:44] I'll pick that up again as soon as I can get this deployed. [17:39:20] Tests show 0.977 AUC and 0.922 accuracy. [17:46:40] halfak: not yet [17:46:55] that's what I got before [17:47:16] Is it deployed? [17:47:26] Not yet [17:48:12] what should we do? [17:49:15] If you pass me the script, I'll run it and you can study/sleep. [17:49:35] \o/ works [17:51:16] Going to staging now [17:52:36] halfak: [17:52:43] https://gist.github.com/Ladsgroup/d5e843d063597e82f70a [17:52:44] http://tools.wmflabs.org/dexbot/res_aaron.txt [17:52:58] I should put some sleep time between requests [17:53:11] Na. It's good. [17:53:25] Ooh. I'm gonna islice this. [17:53:25] :) [17:53:35] <3 itertools [17:54:26] hmm [17:54:34] I'll use itertools next time [17:54:46] I didn't know something like this existed [17:54:55] No worries. I'll give you a nice example :D [17:55:09] awesome [17:57:07] Back to study [17:57:15] o/ [17:58:05] * halfak hammers staging with precached [17:58:48] Aaaaand it all looks good. [18:00:27] here we go!
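Re. the promised itertools example: `islice` makes it easy to pull fixed-size batches of rev_ids off any iterable, which is the usual shape for a scoring script that sleeps between request batches. A sketch (the `score()` call in the comment is hypothetical):

```python
import itertools

def batched(iterable, size):
    """Yield lists of up to `size` items, using itertools.islice."""
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch

assert list(batched(range(10), 4)) == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# In a scoring script, one might rate-limit between batches:
# for batch in batched(rev_ids, 50):
#     results.extend(score(batch))  # hypothetical ORES request
#     time.sleep(1)
```

Because `islice` consumes the shared iterator, each call picks up exactly where the previous batch stopped, so nothing is skipped or re-fetched.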
[18:02:19] "MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error." [18:02:37] So... that's new [18:03:22] Arg! [18:03:25] YuviPanda, around? [18:14:47] Restarting cache redis [18:16:45] OK. That seems to have solved the problem temporarily. [18:20:16] Well that was fun. [18:20:24] Best I can tell, we're on a time limit. [18:20:44] For some reason we're not evicting keys correctly in redis. [18:24:28] And we're having the issue again. [18:24:43] oh boy [18:24:56] is it a staging issue or the main server? [18:26:52] main server [18:27:01] OK. So I have a temp fix that I think will work. [18:27:05] "config set stop-writes-on-bgsave-error no" [18:27:18] This tells redis to not fail on writes when the rdb file can't be changed. [18:27:31] So that means we're not persisting our cache to disk anymore. [18:27:49] If the machine restarts, we'll lose all of our precached. [18:33:07] Everyone on the internet just turns off persistence when they run into this problem! ARG! [18:41:13] So, I'm starting to think this is a memory issue. [18:49:50] https://gerrit.wikimedia.org/r/261642 [18:50:00] https://phabricator.wikimedia.org/T122666 [19:02:52] o/ awight [19:05:24] Well... that issue seems to be in a stable state. I'm going to get back to working on wikidata scores. [19:53:33] halfak: You around? [19:53:39] yup [19:54:28] So I think I might need to talk to someone who is familiar with MW parsing. [19:54:38] Here's my current situation: [19:55:13] I need to extract a bunch of sentences from revisions. [19:55:25] Not markup, just sentences. [19:55:31] The tool you suggested [19:55:42] (context_tokens) [19:56:07] works to a degree, but actually seems to let a lot of markup through.
[19:56:24] I'm wondering if there's anyone at the foundation who might [19:56:52] 1) Know where I can get a bunch of text of WM pages, but just the main content or [19:57:17] 2) Have suggestions on how I can use the parser-from-hell to get this on my own. [19:58:41] *sorry, that tool was [19:58:48] revision.content_tokens [19:59:26] aetilley, so, yeah, it looks like mwparserfromhell doesn't necessarily remove all markup [19:59:56] But you could do OK, I think, by supplementing it. [20:00:03] Right. Now if there's a pattern to which markup gets through, I could work around it. [20:00:05] (regretfully, I don't know of a better source of data) [20:00:18] aetilley, can you give me some examples of where markup does get through? [20:00:26] yeah, hold on [20:01:52] so for instance [20:02:00] extractor.extract(686575075, revision.content_tokens)[:10] [20:02:13] sorry, [:40] not [:10] [20:02:29] where extractor is some APIExtractor on an english wp session. [20:02:47] Notice the "top" and "bottom" [20:02:54] What do those first 10 tokens look like? [20:03:07] (oh and also look at the first three tokens) [20:03:17] * halfak is in the middle of a major revscoring refactor and doesn't want to have to write a script to get that [20:03:18] Want me to send you a paste? [20:03:21] Yes please [20:03:31] k, one sec [20:08:15] Sent. But take your time in getting back to me. I understand you're hella busy. [20:09:19] Aha! I see. It looks like likes to file namespace are treated strangely. [20:09:23] *links [20:13:13] Sending you another one from the end of the article [20:15:25] aetilley, https://github.com/earwig/mwparserfromhell/issues/136 [20:15:36] * aetilley looks [20:15:43] I just filed a bug [20:17:28] Ok, now a noob question. If I'm looking at an article, what is the quickest way to get it's revision number? [20:17:32] its [20:17:49] Go to the history page and click on the date of the last version.
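One way to "supplement" mwparserfromhell's output, as suggested above, is a regex post-filter over the residue. The patterns here are illustrative guesses at the kinds of markup that slip through (file links, simple templates, tables), not an exhaustive rule set:

```python
import re

# Rough post-filter for markup that can survive strip_code().
# The patterns are assumptions for illustration, not a complete list.
LEFTOVER_RE = re.compile(
    r"\[\[(?:File|Image):[^\]]*\]\]"   # file/image links
    r"|\{\{[^{}]*\}\}"                 # simple (non-nested) templates
    r"|\{\|.*?\|\}",                   # tables
    re.DOTALL)

def supplement_strip(text):
    """Remove some common markup residue from already-stripped wikitext."""
    return LEFTOVER_RE.sub("", text)

cleaned = supplement_strip(
    "Turing [[File:Alan Turing.jpg|thumb]] was {{cn}} a pioneer.")
assert cleaned == "Turing  was  a pioneer."
```

This won't handle nested templates or every namespace alias, but as a second pass after the parser it can knock out the predictable leftovers before sentence extraction.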
[20:18:59] omg halfak [20:19:01] so sorry [20:19:10] I didn't respond to the page [20:19:38] It's OK. Can't always expect you to be available. :) [20:19:51] We're online now, so no harm done. [20:20:02] Did you already read my notes and patchset? [20:20:22] I was unfortunately up till 5AM (we had 3 vaguely related outages yesterday! woo!) and then forgot to turn my phone from 'alarms only' to 'priority + alarms' [20:21:31] halfak: so let's just switch the cache to a different machine [20:21:35] and make it big [20:22:26] halfak: (And then look at the end of the URL?) [20:22:42] Hmm... That doesn't seem like a good solution. Shouldn't we make it so that we don't run into issues no matter what the cache size is? [20:23:08] halfak: yes, so we'll adjust maxmem with the different machine as well [20:23:19] halfak: this was my mistake in forgetting to adjust it when we switched from AOF to RDB [20:23:28] let me find that link... [20:23:52] oooh and vm_overcommit is missing [20:25:11] Yeah. Was looking at that too. Goddamn internet said "just turn off RDB" [20:25:16] halfak: I think the problem was just vm_overcommit [20:25:31] it was in our old redis puppet code and missing in the new redis puppet code we switched to [20:25:33] Would we get "cannot allocate" errors with that? [20:25:52] Also, just to make sure, we're still evicting keys LRU style, right? [20:26:02] It wasn't clear to me from the config. [20:27:23] halfak: yup [20:27:37] halfak: let me make sure (re: config) [20:28:16] halfak: specifically, it'll cause it to try to fork and have that fork fail [20:29:56] What exactly is forking here? [20:32:00] halfak: redis [20:32:10] halfak: so it forks, and then just reads from the data [20:32:18] and writes it as an rdb snapshot [20:32:30] halfak: this doesn't actually produce two copies of the data [20:32:40] but prevents concurrency issues when snapshotting [20:32:43] this is 'bgsave' [20:33:25] OK. So the fork is just the async job for writing the RDB?
[20:33:30] yeah [20:37:58] * aetilley starts making a list [20:38:08] Also the parser ignores {math: ...} entirely [20:38:26] ^ file a bug! [20:38:37] Well it's not exactly a bug. [20:39:45] aetilley, the last email you sent looks exactly right [20:39:47] There's probably no sensible way for the parser to deal with math because a mathematical expression may be one of many syntactic categories. [20:40:05] aetilley, well, it could have a node for match [20:40:08] *math [20:41:22] hm [20:43:15] halfak: ok, so I merged a patch fixing the overcommit thing (for all redises) [20:43:34] halfak: I'm going to restart cache redis and see what happens. that ok with you? [20:43:39] They'll take down the web nodes is we restart redis. [20:43:51] They don't auto-reconnect fast enough. [20:43:59] *if [20:44:03] I see [20:44:14] we need to fix that, redis has become a major SPOF [20:44:18] But if we're ready, we can do the restart at the same time. [20:45:02] halfak: ok! [20:46:38] halfak: should we do it now? [20:47:37] Yes. [20:47:44] Oh wait. One more thing. [20:48:03] I ran that command on the cache redis to turn off write failures when persistence fails. [20:48:09] Anything you want to do about that? [20:48:13] Turn it back on? [20:48:20] halfak: restarting clears it [20:48:26] Gotcha. [20:48:29] since it was just a config command and not on the config file [20:48:40] halfak: ok, say when you're ready :) [20:49:02] Ready [20:49:19] Tell me when your restart finishes. [20:50:15] ORES starting to timeout [20:50:16] 500 [20:50:37] halfak: done\ [20:50:39] *done [20:50:50] still 500ing [20:51:05] waiting on uwsgi restarts [20:51:12] And we're back [20:51:17] ok! [20:52:07] YuviPanda, did you do the queue redis too? [20:52:12] so two things: cache failures should be soft - if we can't connect to cache redis we should just treat it like a cache miss. 
2, we need a better solution for failovering our cache redis, so I'm going to look at https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md [20:52:49] halfak: no, since that didn't hit limits before the vm_overcommit took effect [20:53:04] OK. [20:53:21] * halfak files bug for ORES redis cache to throw warnings instead of failing. [20:53:32] *log warnings [20:53:35] +1 [20:56:24] * aetilley just realized that {{math|...}} is what parserfromhell calls a template. [20:56:25] [2] https://meta.wikimedia.org/wiki/Template:math [20:57:07] YuviPanda, BTW, just finished a major refactor of revscoring. test coverage is up and all that. Would you be willing to do a quick review? [20:57:20] https://github.com/wiki-ai/revscoring/pull/233 [20:57:33] Regretfully, this ended up being massive. [20:57:39] RIP your computer when you click on files. [20:57:56] unfortunately probably not :( [20:58:02] still need to clean up from yesterday's tools outage [20:58:04] OK. [20:58:10] it's like everything's on fire [20:58:13] some of the time [20:58:43] In that case, anything I can do to help you? [20:59:14] nah, everything went to plan except me not waking up on the page. thanks a lot for disabling bgsave :D [21:00:08] * halfak flexes his log reading and googling muscles. [21:00:23] The trick to solving most server problems on linux. [21:00:26] halfak: I'm thinking we'll use https://github.com/twitter/twemproxy to make redis better in this regard. [21:00:38] spread the load over a couple of servers with consistent hashing [21:00:50] and auto_eject, so if one goes down we only lose that part of the cache [21:00:57] I thought that redis would do this out of the box. [21:01:33] ? [21:01:42] so there's 'redis cluster' that does that [21:01:47] Yeah. That [21:01:54] except it's still a bit experimental + we don't have an easy way to set that up yet [21:02:15] while twemproxy is client side consistent hashing and already used in prod [21:03:41] client-side... hmmm.
How will that work with the python client? [21:05:51] halfak: the clients themselves don't know [21:06:07] halfak: it's a proxy, basically. you set it up on localhost, your python client just talks to localhost and it does the thing [21:06:45] Gotcha. So it pretends to be a redis-server [21:07:02] I guess that makes sense since it is called a "proxy" [21:07:08] yeah [21:08:32] * halfak curses at our stupid pagecounts files [21:08:42] Seriously -- some of them have corrupted gzip. [21:08:43] WTF [21:11:39] :D [21:11:41] fun [21:11:44] halfak: can you file a bug for twemproxy too? [21:12:09] This one's going on phab [21:12:27] cool [21:14:05] When the fires are out, I could use an update of what was done here too: https://phabricator.wikimedia.org/T122666 [21:14:26] Here's the twemproxy task: https://phabricator.wikimedia.org/T122676 [21:14:44] awesome [21:16:04] Man. Downtime really cuts into other work. [21:16:14] * halfak tries to remember where he was. [21:20:24] Oh! aetilley, how's removing markup working for you? [21:23:16] halfak: hey again. Can you make a project for the ores extension at phab? [21:23:39] https://phabricator.wikimedia.org/tag/mediawiki-extensions-ores/ [21:23:43] Already exists [21:24:09] awesome [21:24:11] thanks [21:25:01] and were you able to get scores and the number for precision? [21:25:19] Amir1, still detecting reverts. [21:25:29] ok [21:25:29] Even with human edits, the rate of reverted edits is pretty low [21:25:31] thanks [21:25:43] yeah, I know [21:25:56] Amir1, BTW, did you limit these edits to the main namespace? [21:26:08] I don't think so [21:26:11] I forgot [21:26:13] shit [21:26:19] Should we do it again? [21:26:34] but it only returns errors [21:26:35] Na. [21:26:40] Let's leave it be. [21:26:47] These are 99% main namespace. [21:27:01] We won't be able to generate a score for the non-main namespace, so those will have nulls. [21:27:04] We can filter them out later. [21:27:13] So, we'll have slightly less than 10k edits.
[21:27:14] the list I gave you is 100% main namespace [21:27:22] Oh! Good! [21:27:40] Wait. Then where did you forget? [21:28:17] the res_aaron.txt is mixed [21:28:18] but [21:28:41] the scorer.py I showed to you filters out all of the non-main-ns edits [21:29:00] Oh! Well... I didn't use that :P [21:29:05] because when getting scores returns an error, the code skips the edit [21:29:10] * halfak cleans up his code for checking in [21:31:51] https://github.com/halfak/recall-studies [21:32:12] Amir1, see https://github.com/halfak/recall-studies/blob/master/rs/score_revisions.py [21:32:24] oh thanks [21:32:38] Looks like we're a little more than half-way done with revert detection [21:33:30] we can simply discard errors :) [21:35:36] halfak: https://phabricator.wikimedia.org/project/board/1662/ [21:35:40] What do you think? [21:39:27] halfak: Reading the parserfromhell docs. [21:39:57] Great. [21:48:01] Amir1, I'm finding bugs while I work! https://phabricator.wikimedia.org/T122679 [21:50:53] Got another one: https://phabricator.wikimedia.org/T122680 [22:00:33] I think I need to just score a shit ton of edits and look for errors more often! [22:17:20] awesome halfak [22:17:21] :) [22:43:38] Amir1, just finished labeling reverted edits. [22:43:48] out of 10k human edits, we have 18 reverted edits [22:45:47] wut [22:47:44] * YuviPanda is spending the day reading the berkeleydb manual [22:48:08] Amir1, so, we can run with this, but the stats are going to be a bit noisy. [22:48:43] what should we do :( [22:48:56] we can sample from our dump parser [22:49:50] Na. I think this is OK. [22:49:58] Let me try generating the stats. [22:50:23] Given 10,000 edits coming in, we want to estimate how many edits will need to be reviewed and this should be a good way to do that. [23:11:06] Halfway done getting scores! [23:16:01] \o/ [23:17:09] I just made this halfak. do you think it should be essential for the MVP or later? https://phabricator.wikimedia.org/T122684 [23:17:31] Amir1, good Q.
Might need to be wiki-specific at least. [23:17:36] If not user-specific. [23:18:01] Let's keep it in the list and make decisions on what gets in the MVP release when we do estimates. [23:18:35] Is this damaging? https://www.wikidata.org/wiki/?diff=284477427 [23:18:39] Amir1, ^ [23:19:50] it's hard to tell [23:20:01] the user is trusted [23:20:06] These are examples of low-score edits that are reverted. [23:21:06] he did the wrong thing [23:21:16] he or she [23:21:29] the edit is bad, the user is trusted [23:21:37] Gotcha. Here's another that looks OK: https://www.wikidata.org/w/index.php?title=Q20015860&diff=286211724 [23:21:39] I don't know what to say [23:22:23] it's a client move [23:22:33] we do have a feature to catch those [23:22:40] but data is corrupted [23:22:47] Yeah. That's probably not a real "reverted edit". [23:23:12] Here's another: https://www.wikidata.org/w/index.php?diff=285223237 [23:23:51] completely okay on wikidata's side [23:23:58] client side maybe not [23:24:12] (and none of our concerns) [23:24:40] yeah. Looks like we might want to *not* use reverted edits to judge the fitness of this model. [23:26:57] we do have a huge task flow, the vandalism rate is low but still big at this scale [23:27:16] the problem is we can't get that without getting 99% of crap [23:29:06] what do you suggest [23:29:08] ? [23:29:26] do you have a number? [23:35:38] halfak: ^ [23:40:42] Amir1, sorry. Was wrapped up in reading wikimedia-l [23:41:01] it's okay [23:41:03] :) [23:41:07] Doing a poor-man's analysis on the command-line [23:41:39] So, we have 18 reverted edits out of 9869 that we could score. [23:42:18] if we set the threshold at 0.8, we get 11/18 edits [23:42:37] if we set the threshold at 0.7, we get 14/18 edits [23:43:44] We need to bring the threshold down to 0.5 to get >= 95% of the reverted edits. [23:44:17] okay [23:44:31] but based on our analysis of the 18 reverted edits [23:44:41] Yeah. Which could be sketchy.
[23:45:09] But if we set the threshold at 0.5, that means we only need to review 2500/10000 edits [23:45:37] Which still brings the workload down to 25% [23:45:50] Not bad, but I bet we could do better with more data. [23:46:11] Also, I'd like to manually review these reverted edits for "vandalism" and try again. [23:46:46] halfak: did you determine a radius and windows for reverted labeling? [23:47:08] Yup [23:47:12] Same as we use for training. [23:47:19] ok [23:47:28] But of course, our alg will learn more about commonly reverted edits than unusual cases. [23:47:46] commonly reverted tends to be vandalism and unusual cases may be page moves. [23:48:38] https://etherpad.wikimedia.org/p/revscoring_wikidata_reverted_set