[15:12:29] hey bmansurov!
[15:21:03] bmansurov: talk to you in 45 mins. If you want, take a look at some more concrete ideas for the WikiLabels experiment here: https://meta.wikimedia.org/wiki/Research:Identification_of_Unsourced_Statements/Citation_Reason_Pilot#Data
[15:41:05] miriam: o/
[15:41:06] cool
[15:50:58] channel, bmansurov, halfak: do you know any tool to extract references from the XML dumps? I have a regular expression that might work, but I was wondering if there is some tool that already solves possible corner cases
[15:51:10] mwrefs
[15:51:11] :)
[15:51:30] https://github.com/mediawiki-utilities/python-mwrefs
[15:51:41] Complete with utilities
[15:51:45] And unit tests
[15:52:10] amazing! will check! thanks halfak
[15:53:14] good morning lzia
[15:57:41] dsaez, I'll be interested in having you contribute to this repo and then we can share some code :)
[15:58:10] dsaez, what's your GitHub username?
[15:58:32] digitalTranshumant
[15:58:56] https://github.com/digitalTranshumant
[15:59:53] * halfak invites dsaez to all the github orgs
[16:01:01] great
[16:05:19] I just noticed that I have some stalled PRs in mwrefs from awight.
[16:05:28] I think I might use my lunch to revive those today ^_^
[16:05:33] :D
[16:05:37] lunch??
[16:05:41] That sounds like work.
[16:05:56] Yeah. But I want to and I'm impatient
[16:06:35] uh-oh https://github.com/digitalTranshumant/CopyrightEvidence
[16:06:37] ciao dsaez
[16:07:33] ciao Nemo_bis
[16:13:24] Nemo_bis, that was a hackathon project
[16:16:02] ow hi dsaez et al.
[18:11:49] bmansurov, dsaez: is it safe to say that the current state of T186519 satisfies the needs discussed in the off-site and we don't need to ask for budget for a separate machine?
[18:11:50] T186519: Request creation of "research" VPS project - https://phabricator.wikimedia.org/T186519
[18:13:04] dsaez: The solution is not 100% what you wanted. Basically, sensitive data cannot go to Cloud VPS, which prevents it from being everything you wanted. However, it seems everything you wanted cannot really happen, cuz Andrew is also saying that where sensitive data is, we cannot have too many tests, packages, etc.
[18:15:12] dsaez: I /think/ it makes sense to accept that this is the best solution in the short run. In the longer run, if the needs are clear for you, we should do things differently. For example, we can set up a system where imitated webrequest log data is generated and can be consumed/tested in a more open environment such as Cloud VPS. I don't think you need /real/ sensitive data for anything you want to do at the testing level.
[18:40:55] leila: sorry, I was in a meeting. I also think the current proposed solution of using Cloud VPS is good for our current use cases. I also heard that we can use stats1007 in the next year.
[19:10:04] bmansurov: then let's go with Cloud VPS for now. stats1007 would be great, but that would be inside of prod/analytics, so experimenting on it won't work.
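
Note on the 15:50:58 question about extracting references: the regular-expression approach mentioned there could look roughly like the sketch below. This is not the mwrefs implementation and not the regex from the chat; the pattern and the function name extract_refs are hypothetical, and the sketch assumes the wikitext has already been pulled out of the dump and XML-unescaped. It ignores corner cases such as refs inside comments or nowiki blocks, which is the reason a tested tool was suggested.

```python
import re

# Hypothetical pattern: self-closing <ref ... /> tags or paired <ref ...>...</ref>.
# The \b keeps "<references />" from matching.
REF_RE = re.compile(
    r"<ref\b[^>]*?/>|<ref\b[^>]*>.*?</ref\s*>",
    re.IGNORECASE | re.DOTALL,
)


def extract_refs(wikitext):
    """Return every <ref> tag found in a string of unescaped wikitext."""
    return [match.group(0) for match in REF_RE.finditer(wikitext)]


if __name__ == "__main__":
    sample = 'A.<ref name="a">Book, p. 1</ref> B.<ref name="a" /> C.<references />'
    for ref in extract_refs(sample):
        print(ref)
```

The self-closing alternative is listed first so that a reused named reference is not mistaken for the opening tag of a paired reference.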