[18:27:21] SMalyshev: is your comment at T143819#3876607 meant to be for that task or for T183020?
[18:27:21] T183020: Investigate the possibility to release Wikidata queries - https://phabricator.wikimedia.org/T183020
[18:27:22] T143819: Data request for logs from SparQL interface at query.wikidata.org - https://phabricator.wikimedia.org/T143819
[18:29:10] I think the two tasks are mixed, but maybe they need to be merged, SMalyshev? I'll wait for your response. nuria_ FYI ^
[18:33:37] lzia: I think both tasks can be merged in the absence of strong differences between them, right (cc SMalyshev)? The dataset that SMalyshev is working on: https://www.wikidata.org/wiki/User:Smalyshev_(WMF)/Publishing_query_data#User_agent_sanitization
[18:34:08] lzia: will satisfy both tickets
[18:36:03] milimetric: we shall wait for you.
[18:36:09] nuria_: in a meeting, will reply once out. :)
[18:37:49] I think T183020 is for the specific dataset Markus has been working with, while T143819 is for setting up a mechanism for persistently producing the dataset
[18:39:25] SMalyshev: and so I understand... why does your proposal not satisfy his use case?
[18:52:21] nuria_: it probably would, but my proposal is for processing raw data, which Markus has already done as I understand it, so his is much smaller in scale
[18:52:30] for him it's just releasing what already exists
[18:52:47] for mine, it's building a pipeline to create new ones... much bigger project
[18:53:24] basically doing what Markus has already done manually, but on an ongoing, automated basis
[18:53:42] SMalyshev: right, his seems to be a one-off but very similar, right? https://meta.wikimedia.org/wiki/User:Markus_Kr%C3%B6tzsch/Wikidata_queries
[18:54:05] SMalyshev: so when you do the hard work of making this recurrent, it will contain this other work, right?
[18:54:07] yes exactly
[18:54:08] I based mine on that one
[18:54:32] Markus' one was manual work (which is mostly done now), but mine is doing the same thing automatically on an ongoing basis (so next time he won't have to do it)
[18:54:48] nuria_: right
[18:55:06] I hope to reuse some code from Markus if I can get it
[18:55:43] SMalyshev: I do not think they had any code, just a way to send wikidata queries to a different table, dump that data, and parse it with python
[18:56:02] SMalyshev: and the work of preselecting wikidata queries you did already with tags
[18:56:07] well, they must have had the code for sanitizing sparql and user agents
[19:06:17] SMalyshev: user agents are processed with ua parser, so extracting categories from a hive map is quite trivial
[19:06:34] SMalyshev: but let me look a sec
[19:10:13] SMalyshev: some code here: @stat1005:/home/mkroetzsch
[19:11:16] SMalyshev: but not much there
[19:11:20] SMalyshev: one more look
[19:12:21] SMalyshev: also https://github.com/wikimedia/analytics-refinery/blob/master/hive/wikidata/create_wdqs_extract_table.hql
[19:12:49] lzia: probably some cleanup needs to happen if this collaboration is done, right?
[19:24:09] yes, probably
[19:38:10] * lzia reads
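[Editor's aside: a minimal sketch of the ua-parser step mentioned at 19:06:17. In the Hive pipeline the parsing is done server-side and exposed as a map column; the snippet below just shows, with the public ua-parser Python library, the kind of coarse categories that map holds. The field names chosen here follow ua-parser's output, not refinery code.]

```python
# Sketch: reducing a raw user agent to coarse categories with ua-parser,
# the same parser family behind the pre-parsed UA map in Hive.
# (Illustrative only; the production jobs do this parsing upstream.)
from ua_parser import user_agent_parser

def ua_categories(ua_string):
    parsed = user_agent_parser.Parse(ua_string)
    return {
        "browser_family": parsed["user_agent"]["family"],
        "os_family": parsed["os"]["family"],
        "device_family": parsed["device"]["family"],
    }

print(ua_categories(
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"
))
# e.g. {'browser_family': 'Chrome', 'os_family': 'Linux', 'device_family': 'Other'}
```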
[19:46:46] SMalyshev: if the only difference between the two datasets will be that one of them gets continuously updated and the other is only a one-time release, we should consider making Markus' release a repeated release. This is better for the discoverability of the data. Basically, the data will be published with the paper and code, and the three will go hand-in-hand forever. :)
[19:47:21] lzia: well, I suspect there's a lot of manual work in what Markus' team has done
[19:47:42] SMalyshev: now, continuously publishing the data may create the need for further privacy discussions. Is this the case? (we need to make sure users are not, for example, deceived into taking certain actions that reveal more information about them than they want to via the calls.)
[19:47:48] so ideally I'd like to make that automatic
[19:47:58] SMalyshev: +1 for automatic, for sure.
[19:48:17] SMalyshev: would it be helpful if we hop on a 15-min call with Markus to get to the bottom of this?
[19:48:45] lzia: so I wrote the dataset description on wiki; I think as described it should be pretty robust, but if you see any concern there, please tell me
[19:49:48] lzia: I think the main thing we could use is the code and maybe a walkthrough, if it's reusable - and whether there are any licensing concerns, etc. with regard to the code
[19:52:15] ah, I see the code is published at https://github.com/Wikidata/QueryAnalysis
[19:52:56] so I'll probably have to take a look and see how far this is from what I want and how it can be tweaked
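[Editor's aside: a hypothetical sketch of the sanitization step under discussion, loosely following the user-agent categorization described on the Publishing_query_data page. The bucket rules and the `sanitize_record` helper are invented for illustration; the real rules live in https://github.com/Wikidata/QueryAnalysis and the on-wiki dataset description.]

```python
# Hypothetical sketch: before release, raw user agents are reduced to a
# coarse, non-identifying category, so published rows carry the SPARQL
# query text plus a category rather than the raw UA string.
import re

# Invented rules for illustration; not the QueryAnalysis rules.
BOT_PATTERN = re.compile(r"bot|crawler|spider|curl|wget|python-requests|java",
                         re.IGNORECASE)
BROWSER_PATTERN = re.compile(r"Mozilla/\d")

def categorize_user_agent(ua: str) -> str:
    """Map a raw user agent onto a coarse category."""
    # Check for bots first: many automated clients also claim "Mozilla/...".
    if BOT_PATTERN.search(ua):
        return "bot"
    if BROWSER_PATTERN.search(ua):
        return "browser"
    return "other"

def sanitize_record(sparql: str, ua: str) -> dict:
    """Keep the query text, replace the raw UA with its category."""
    return {"query": sparql, "agent_category": categorize_user_agent(ua)}

print(sanitize_record(
    "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10",
    "Wikidata Query Java/1.8",
))
# {'query': 'SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10',
#  'agent_category': 'bot'}
```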