[08:14:53] Would it make sense to have a collection on Wikipedia research here? http://about.scienceopen.com/collections/
[08:34:40] HaeB: I'm done expanding https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/Next_issue/Recent_research
[09:30:19] Nemo_bis: great!
[09:30:40] BTW did anyone find examples of the actual articles that were edited?
[09:44:25] HaeB: I didn't look for them, but the authors are responsive so you could ask
[09:44:45] (I just wrote them mentioning the articles, since we were already in contact a few months ago)
[09:46:48] Ainali: it's not clear to me what advantages it has over Zenodo, but if there's a way to automate sync and avoid redundant work, why not
[09:47:51] Ainali: but if you meant for *publication*, rather than indexing/citing, then personally I prefer posting stuff on https://zenodo.org/communities/wikimedia?page=1&size=20
[09:49:56] Sorry, I meant over Zotero
[09:50:22] https://www.zotero.org/wikiresearch
[11:13:37] Nemo_bis: Well, neither of them seems to be about open science; they're general collection platforms.
[12:46:17] Ainali: I don't understand if you want to collect citations/references or actual publication content
[12:47:00] Zenodo is a platform for open access; how is it not an open science thing?
[13:28:45] Nemo_bis: I just thought it would be a good idea if the publication content were collected in one place.
[13:30:11] Regarding Zenodo, it seems like they allow non-open licenses
[14:01:37] Ainali: do you mean, also mirroring content published elsewhere?
[14:01:52] As for licenses, either you allow unfree licenses or you cannot archive everything
[16:39:09] o/ halfak.
[16:39:28] halfak: are the models in https://github.com/wiki-ai/wikiclass/tree/master/models the latest models if we want to get quality scores for articles?
[17:28:52] o/ leila
[17:28:58] yeah, those are the most recent models.
[17:29:18] They use some language assets that change between OS versions. We trained them on Debian jessie.
[17:29:29] I can't imagine they'd be weird on Debian stretch
[17:29:40] Safest bet might be to just use ORES for your analysis.
[17:29:43] What do you have in mind?
[17:30:05] we need quality scores for all articles on English Wikipedia, halfak. should we hit ORES? :)
[17:30:22] As of today (or most recent dump)?
[17:30:29] or is an old score OK?
[17:31:21] let me ask you this. where is ORES' model? I think the best thing would be to get the model and run it ourselves, as Tiziano may need to run it at different time intervals.
[17:31:29] (how old is old, btw?)
[17:32:25] 6 months or so. I'm due to update the monthly article quality score dataset
[17:32:35] So I was thinking that it might be easier if I just do that.
[17:32:37] yeah. /me checks
[17:32:59] hey tizianop.
[17:33:20] Generally, I don't think it would be a bad idea to score 5M pages using ORES.
[17:33:25] hi leila :)
[17:33:36] I have a utility that is designed to hit ORES quickly and efficiently for this kind of stuff.
[17:33:40] ok, halfak. Tiziano: do you need the model, or can you hit ORES?
[17:33:41] o/ tizianop
[17:34:57] maybe the model would be better
[17:35:02] Either way, you're getting the same prediction -- it's just that by using ORES, you're guaranteed to be running the model in the same environment in which it was trained and tested
[17:35:06] Why would the model be better?
[17:37:03] tizianop: ^
[17:37:14] tizianop: shall we go with hitting ORES?
[17:37:27] it may save you time, too.
[17:37:55] We've got two clusters. One is for batch jobs like this. So you wouldn't have to worry much about potentially taking ORES down.
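(A minimal sketch of what "hitting ORES" for quality scores might look like, assuming the public ORES v3 endpoint and the enwiki "wp10" article quality model name of the period; the batch size and User-Agent string are also assumptions. halfak's dedicated batch utility is presumably the more efficient route for all ~5M articles.)

    # Sketch: batch-score revisions against the public ORES API.
    import requests

    ORES_URL = "https://ores.wikimedia.org/v3/scores/enwiki"

    def score_revisions(rev_ids, batch_size=50):
        """Yield (rev_id, predicted quality class) pairs."""
        for i in range(0, len(rev_ids), batch_size):
            batch = rev_ids[i:i + batch_size]
            resp = requests.get(
                ORES_URL,
                params={"models": "wp10",
                        "revids": "|".join(str(r) for r in batch)},
                headers={"User-Agent": "quality-analysis example"},
            )
            resp.raise_for_status()
            scores = resp.json()["enwiki"]["scores"]
            for rev_id in batch:
                result = scores[str(rev_id)]["wp10"]
                # Deleted or missing revisions come back under "error".
                if "score" in result:
                    yield rev_id, result["score"]["prediction"]

(Predictions are the usual enwiki assessment classes, e.g. Stub, Start, C, B, GA, FA.)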
[17:37:59] ah, ok! It was only because I was already playing with the code, but if the other option is better, let's go for this :)
[17:38:17] So, what exactly do you need? It could be we already have data ready
[17:38:27] It could also be that I'm already generating the data you need.
[17:40:04] (I'm checking the data)
[17:41:26] we are using a dump from March, and we need to generate a quality score for each article in this dataset
[17:42:54] leila, I think it would be better to be consistent with all the analysis and get the scores from this dataset, right?
[17:43:12] What date?
[17:43:14] yes, tizianop, agreed.
[17:43:29] tizianop: in that case, please give specific dates to halfak
[17:43:56] halfak: publish ORES somewhere so we can cite it.
[17:43:59] I can easily get you something as of the first day of a month as part of the normal processing work.
[17:44:03] https://figshare.com/articles/Monthly_Wikipedia_article_quality_predictions/3859800
[17:44:15] yeah, I still need to write the ORES system paper :S
[17:44:30] do that, people will need to cite it when they use it. :)
[17:44:32] As you can see these predictions are very old
[17:44:38] But I'm working on updating them.
[17:45:54] tizianop: do you have the exact dates? :D
[17:46:47] I'm searching for the original file, because we have converted everything into Parquet format
[17:52:01] leila, halfak: we generated the dataset on the 17th of February 2017 using /latest... so I'm trying to understand what the last dump before that date was :)
[17:52:36] tizianop, you'll need to figure out not just what the latest was but when it was available, as it takes a long time to compress those dumps.
[17:52:42] So check the last modified date on the files.
[17:53:13] * halfak learned this lesson the hard way and now annotates all output datasets with the date of the dump.
[17:58:03] Uh oh.
[17:58:13] I went looking for files and it looks like they got deleted :S
[17:58:27] So you might have a snapshot from the 20th of some month or from the 1st of some month.
[17:58:34] If it's the 1st, then \o/
[17:58:59] If it's the 20th, I'll help you set up a job to run against ORES.
[18:07:40] I think we deleted the original XML file, but I extracted the last revision in the dataset and it is the 1st of Feb :)
[18:08:12] Awesome! I'll have data for you by Monday :)
[18:08:14] datetime.datetime(2017, 2, 1, 20, 51, 1)
[18:08:25] thank you!!!
[18:08:29] And you'll have data on article quality for every article-month in Wikipedia's history :D
[18:08:38] As of the 1st of the month
[18:08:49] great!
[18:09:07] thank you very much!
[18:09:16] \o/ glad to work that out. Thanks for the kick in the butt to update that dataset ^_^
[18:09:25] leila, ^
[18:10:56] tizianop: just to make sure: can you wait until Monday? ;)
[18:12:52] and halfak: do let us know if you need motivation in the future. we're always here for you. ;)
[18:14:37] :P
[18:17:49] leila: yes, there is still a lot to do before the evaluation
[18:20:44] ok, sounds good tizianop. and halfak told me secretly that he /may/ be able to have the data by tomorrow morning (let's say PST)
[18:21:10] I'm careful not to promise things that may not happen, but with 40 brand new CPUs working on this, I think it'll update overnight. :)
[18:21:36] In the meantime, it might be valuable to use the old data from the figshare link to build your analysis pipeline.
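(A sketch of the check tizianop describes above: with the original XML dump deleted, recover the effective dump date by taking the maximum revision timestamp in the converted dataset. The file name and "timestamp" column are hypothetical; adjust to the actual Parquet schema.)

    # Sketch: find the latest revision timestamp in the Parquet dataset.
    import pandas as pd

    # Load only the (hypothetical) timestamp column to keep memory low.
    df = pd.read_parquet("enwiki_revisions.parquet", columns=["timestamp"])
    last_revision = pd.to_datetime(df["timestamp"]).max()
    print(last_revision)  # e.g. 2017-02-01 20:51:01 -> a 2017-02-01 snapshot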
[18:21:38] yeah, so count on Monday, tizianop, but be even happier if it arrives tomorrow. ;)
[18:21:55] If you're really worried about timing.
[18:22:04] I'll do my best to make sure I don't block you.
[18:22:05] perfect!
[18:26:48] <3 halfak. thanks.
[19:32:12] I'm just going to leave this here: https://grouplens.org/blog/friends-with-benefits/
[19:32:15] No comment on the title.
[20:50:37] halfak: Love the photo at the bottom!
[20:57:24] \o/ me too
[20:57:28] jdfoote[m], ^
[20:57:30] Also hi!
[21:00:51] halfak: Hi to you, too! :)