[00:01:10] ewulczyn: awesome (for the deck), I also have a last minute request to change the agenda for RG tomorrow [00:01:16] see mail I just sent out [00:02:03] I'll do it next week [00:02:20] DarTar: can we start the meeting 15 min earlier to go over the deck? [00:03:23] works for me [00:03:24] leila: we could, but the other presentations don’t look urgent and I’d rather make time next week with an ad hoc meeting [00:03:41] I can definitely come in earlier [00:03:52] humm. it's a slippery slope DarTar. research is almost never urgent. [00:04:11] we also used 30 min last week to discuss quarterly goals. [00:04:19] look, we can skip the review of the goals but that’s going to affect what you can work on in the next couple of weeks [00:04:31] and then we had to come in on a holiday to listen to Ellery's presentation. ;-) [00:04:32] leila: do you have a better proposal? [00:04:46] starting 15 min earlier is my proposal [00:05:12] fine, organize it [00:05:19] k [00:07:22] I actually think people need to do a bit of reading (I will provide resources) to follow my presentation. So going next week is not bad. It will give people time to prep. :) [00:08:43] Starting 15 minutes early is probably still a pretty good idea. I want to hear about Aaron's work and ask many questions [00:09:18] DarTar, leila ^ [00:09:27] ewulczyn: I am not sure about the rest of the team but I won’t have much time to do any reading tonight [00:09:35] ewulczyn: I just sent an invite to R&D [00:25:11] Quarry: Quarry does not respect ORDER BY sort order in result set - https://phabricator.wikimedia.org/T87829#1000128 (MahmoudHashemi) NEW a: yuvipanda [00:32:31] what dartar said [00:32:49] Some day I hope to have enough energy to watch TV after I finish my work, let alone read background papers. [01:01:04] leila, wanna talk? [01:01:09] sure [15:51:01] morning halfak :) [15:51:15] Hey Ironholds [15:57:12] Hello, halfak. Can we speak now, please? [15:57:23] Hi khitron! [15:57:47] Hi. Was that a yes or a no?
[15:57:47] Yes. So leila is my teammate in R&D and she has the most experience with Wikidata querying. [15:57:59] Anyone else interested in chatting about Wikidata stuff? [15:58:14] Ironholds, any experience looking up pages cross-language with wikidata? [15:58:41] khitron, can you remind me what you were looking for? [15:59:50] halfak, not really. Frankly, every time I've tried to use wikidata just getting the API to work has served as sufficient evidence to disprove the idea of a just god. [16:00:22] lol [16:00:36] and that's why I'm hoping to give khitron a hand. [16:00:56] * Ironholds nods [16:01:41] Worst case, we'll learn together. [16:01:46] I mean, I can point to, like, the API docs. Beyond that I don't have the expertise [16:01:47] yep! [16:04:35] * Ironholds is producing anonymised sample logfiles. Wheeee. [16:04:54] take 8m rows. Randomly sort each column using distinct values of "random". Randomly sample 1k rows from the resulting jumbled mass. [16:05:11] if you can pick out an actual, honest-to-god request from that you deserve it. [16:11:29] Sorry, there was no network in the last 10 minutes [16:11:59] khitron, my last question was "can you remind me what you were looking for?" [16:12:24] So, you did not see my answer. Again: [16:13:41] I'd like to find any documentation about Wikidata tables. There is a good one about wikipedia tables ([[wikidatawiki:template:databases]]), but nothing about Wikidata I can find. Thank you. [16:14:58] khitron, https://wdq.wmflabs.org/ [16:15:05] I think that's where you'll want to start. [16:16:45] Thanks, but no. I already saw this. It's URL querying. I need a database API - tables, columns, and so on [16:18:25] khitron, there's no "database" in the traditional sense. [16:19:08] Give me a minute [16:19:52] Here you are: http://quarry.wmflabs.org/query/1248 [16:20:22] Oh. So you already have what you need. [16:20:49] No. I know that the database exists. I have no idea about the API [16:21:12] I just showed you an API.
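Ironholds' anonymisation recipe above (shuffle each column independently over the full row set, then sample a small number of rows) can be sketched in Python. The column layout and row counts here are illustrative, not the actual logfile schema:

```python
import random

def anonymise(rows, sample_size=1000, seed=None):
    """Shuffle each column independently, then sample rows.

    After the per-column shuffle, each output row combines fields drawn
    from different original requests, so no real request is likely to
    survive intact in the sample.
    """
    rng = random.Random(seed)
    if not rows:
        return []
    columns = [list(col) for col in zip(*rows)]  # transpose: rows -> columns
    for col in columns:
        rng.shuffle(col)                         # independent shuffle per column
    jumbled = list(zip(*columns))                # transpose back to rows
    return rng.sample(jumbled, min(sample_size, len(jumbled)))

# toy stand-in for 8m request rows with (ip, url, agent) fields
rows = [("ip%d" % i, "url%d" % i, "ua%d" % i) for i in range(100)]
sample = anonymise(rows, sample_size=10, seed=42)
```

Sampling after the shuffle (rather than before) is what makes reassembly hard: any given output row's fields came from up to three different original rows.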
[16:22:25] khitron, the database contains basic abstract classes - "tags", for example, and "pages" [16:22:44] it does not contain mappings of pages to tags, or mappings of localised IDs to pages. [16:22:53] the API is the only access method for that kind of data. [16:23:01] so, you're both right ;p [16:23:14] Thanks Ironholds. [16:24:00] Thanks. But how can I find an interwiki for some article and be sure it comes from wikidata and not an old [[:en:article]] interwiki? [16:24:56] that data would be the sort of thing that is only really API accessible. [16:25:12] As halfak said, leila is the one with the wikidata querying expertise, so you may have to wait for them :/. I don't have a query to hand. [16:27:50] Thank you. Unfortunately, I can't use URL queries in this particular problem. I need to join interwiki tables with wikipedia tables. If it's impossible, I'm lost. [16:29:23] khitron, you might have to join manually. If I were you, I might do this in python. [16:29:41] e.g. make a query to the API and then follow that up with a query to the DB for the matching page_ids. [16:31:32] I do not think it's possible, for this problem, but I'll try. Has no one ever thought about adding a boolean "wikidata/interwiki" column to enwiki_p.langlinks? [16:32:29] that's a question for the wikidata team, I think [16:32:36] Indeed. [16:32:48] and, it should be possible. Dump the pages table to a staging table, dump the results of the wikidata query to a staging table, INNER JOIN. [16:34:58] We'll see. Thank you very much to both of you for your help. [16:36:46] No problem. It might be worthwhile at some point to provide a dump of what you are looking for. I'm curious how you end up solving the problem. [16:36:53] We could probably automate that. [16:38:31] If I succeed I'll tell you [16:42:58] Thank you again, halfak [16:43:29] khitron, if you stick around a bit, I'll introduce you to Leila. [16:43:38] She might have some more advice.
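halfak's suggestion above (query the API, then query the DB for the matching page_ids, and join the two in Python) might look roughly like this. The dict shapes are hypothetical stand-ins for real API and replica-DB responses:

```python
def join_api_with_db(api_items, db_rows):
    """Manually join Wikidata API results with wiki DB rows on page_id.

    api_items: dicts from a (hypothetical) Wikidata API call, each with
               a 'page_id' and a 'sitelinks' mapping.
    db_rows:   dicts from a SQL query against the replica DB, each with
               'page_id' and 'page_title'.
    """
    by_id = {row["page_id"]: row for row in db_rows}
    joined = []
    for item in api_items:
        row = by_id.get(item["page_id"])
        if row is not None:
            merged = dict(row)
            merged.update(item)          # DB fields + API fields
            joined.append(merged)
    return joined

# toy data standing in for real API / DB responses
api_items = [{"page_id": 1, "sitelinks": {"enwiki": "Foo"}},
             {"page_id": 3, "sitelinks": {"enwiki": "Baz"}}]
db_rows = [{"page_id": 1, "page_title": "Foo"},
           {"page_id": 2, "page_title": "Bar"}]
result = join_api_with_db(api_items, db_rows)
```

This is effectively the INNER JOIN Ironholds describes, done client-side: rows with no counterpart on the other side are dropped.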
[16:43:52] Of course [16:44:01] I think that the folks in UTC-8 should be getting online within the next 30 minutes. [16:44:15] hello - I have come to ask silly research questions [16:44:25] I have to compute edit rate for visual editor, per wiki [16:44:34] our thoughts so far: [16:44:47] use buckets with no fewer than 1000 events [16:45:11] divide the # of people who saved by the number who clicked edit [16:45:19] milimetric, by edit rate, you mean the proportion of all edits that are saved via VE? [16:45:29] yes [16:45:31] no [16:45:33] Oh! This is edit completion rate? [16:45:35] not VE / wikitext [16:45:38] completion, yes [16:45:43] Gotcha. [16:45:45] * halfak thinks [16:46:06] This sounds like a reasonable approach to me. [16:46:06] so, no fewer than 1000, bucket weekly and daily to start (looked at the data, some 33 wikis have enough data to bucket daily) [16:46:32] milimetric, might it make more sense to do weekly? [16:46:46] Since weekly periods have a regular shape? [16:46:48] yes, but then it's harder to tell if you're "moving the needle" after a release [16:46:54] No [16:46:56] It will be easier [16:46:59] that's why we were going to do it weekly, definitely [16:47:09] +1 then. [16:47:09] oh - interesting, how's it easier? [16:47:25] well, we were going to do weekly for that reason but figured we also needed daily [16:47:25] because the period of weekly activity muddies the visualization. [16:47:51] but then don't you have to wait a week after a deployment to see if something got messed up? [16:47:54] Then again, maybe success rates don't have a weekly period. [16:48:06] milimetric, I see what you mean. [16:48:31] ok, so we'll do both, it's easy enough [16:48:39] I think you're right and that I'd try to do it daily to see how much fluctuation exists. [16:48:54] We're not doing a raw count here, so the weekends wouldn't have their usual dip. [16:49:01] right [16:49:02] thanks!
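The completion-rate scheme milimetric and halfak settle on above (saves divided by edit clicks, bucketed by day or week, with a 1000-event floor) can be sketched like this. Whether the 1000-event minimum counts clicks, saves, or both is an assumption here, as is the event format:

```python
from collections import defaultdict

MIN_EVENTS = 1000  # minimum events per bucket, per the discussion above

def completion_rates(events, min_events=MIN_EVENTS):
    """Compute edit completion rate (saves / edit clicks) per time bucket.

    events: iterable of (bucket_key, action) pairs, where action is
            'click' or 'save' and bucket_key names a day or week.
    Buckets with fewer than min_events total events are suppressed
    rather than reported with noisy rates.
    """
    clicks = defaultdict(int)
    saves = defaultdict(int)
    for bucket, action in events:
        if action == "click":
            clicks[bucket] += 1
        elif action == "save":
            saves[bucket] += 1
    rates = {}
    for bucket in clicks:
        total = clicks[bucket] + saves[bucket]
        if total >= min_events and clicks[bucket] > 0:
            rates[bucket] = saves[bucket] / clicks[bucket]
    return rates

# synthetic example: week 1 has enough volume, week 2 is suppressed
events = ([("w1", "click")] * 800 + [("w1", "save")] * 400 +
          [("w2", "click")] * 100 + [("w2", "save")] * 50)
rates = completion_rates(events)
```

As halfak notes, a rate like this should not show the usual weekend dip that raw counts do, which is part of why daily bucketing is worth trying.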
:) [16:49:17] godspeed :) [16:49:40] ok, I promise to ask questions like this semi-regularly, because it seems like a good idea to inform y'all of what I'm doing and double-check for sanity [16:49:52] the queries I write will be in the limn-edit-data repo [16:50:07] and as always feel free to ping me with any concerns / ideas [16:50:26] +1 I'm hoping we can talk when you are done with this too. I'd like to look at the plot. :) [17:17:04] halfak, is the meeting still on? Nobody is here [17:17:10] not sure if I messed up my calendar somehow [17:17:17] yes [17:17:19] I was derp [17:19:17] hey folks [17:19:17] hangout not responding [17:34:42] FYI: We're all late for research group. [17:34:50] Nettrom, ^ [17:35:01] We have some QR stuff to review, but we're coming soon. [17:35:22] halfak: yep, just me and J. Marlow here so far, we'll wait [17:35:42] passed on the message [17:49:32] Deskana|Away, JFYI: query speed makes it unlikely, I think, that we'll have the PV count by 1pm. I can try to calculate a count from the sampled logs, if that's okay? [17:49:56] but that may not happen either, solely because I have pretty much continuous meetings until midday your time :/ [18:02:49] Ironholds: No worries. Thanks for the update. [18:03:02] np [18:16:08] halfak? [18:21:56] khitron, he's in a meeting [18:22:00] and will be for the next hour [18:22:21] I see. Thank you. [18:22:40] I saw his nick and thought he was here [18:23:09] on IRC != not doing anything except IRC ;P [18:23:14] halfak: I added a comment for you on the deck, please take a look when you’re done with the preso [18:24:23] "/nick halfak|away" ? :-) [18:29:39] in a meeting, guys [18:44:11] halfak: see above, I need 5 minutes of your time for slide 30 [18:44:33] right now revscoring is both a Q2 achievement and a Q3 goal and that’s very confusing [18:45:08] It is. Revscoring classifier {{done}}, Revscoring service {{todo}} [18:45:27] alright [18:45:51] I'm cool with your judgement though.
[18:46:42] that’s fine, I’ll ditch the plot though. I want to remove any hook for potential distraction [18:47:19] Do we want to say anything about fulfilling ad-hoc requests? [18:49:38] there’s one line on this [18:49:48] kk [18:49:56] slide 29, last item [18:50:21] DarTar, can we get on a call right now? [18:50:48] sure, but I’ll have 2 mins only [18:50:51] kk [18:50:53] link? [18:51:26] PM'd [19:42:51] hey Nettrom, did you see my email re: workshop registration? [19:44:12] J-Mo: I met Aaron. His work is fascinating. [19:44:46] hi harej. I agree. halfak is indeed a fascinating individual who does fascinating things. [19:45:14] <3 you guys [19:45:22] * halfak blushes [19:45:30] but you can't tell because of man-beard [19:47:06] lol [19:47:10] halfak: I would like to integrate your auto-heuristics into the WikiProject evaluation scheme. The only problem with this is that imposing it in place of human review would cause civil war. [19:47:16] So perhaps a parallel system could be developed. [19:47:37] We can certainly do both. That would help my research too. :) [19:48:08] Also, I wanted to do a referer log study but I am told that HTTPS-by-default means no referer logs :( [19:48:26] Well... not necessarily. [19:48:38] Oh? [19:48:42] We'll have within-site referrers and most incoming referrers. [19:48:50] The problem is really when you go https --> http [19:48:56] http --> https is fine [19:49:00] https --> https is fine [19:49:28] So basically, we'll have extensive information on referrers within the Wikipedia space, and even some non-Wikipedia space, and then a big "somewhere else" category. [19:50:24] I imagine much WikiProject traffic is driven by links on Wikipedia itself, but we don't know that yet, do we! [19:50:43] The real problem will be when people navigate from Wikipedia to somewhere else, and that won't be in our logs anyway. [19:51:17] harej, Ellery from the Research team is working on a public referrer dataset.
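halfak's scheme-transition rules above (the Referer header survives every hop except HTTPS to plain HTTP) reduce to a one-line predicate; this is a minimal sketch of that rule, not anything from the actual log pipeline:

```python
def referrer_retained(from_scheme, to_scheme):
    """Per the rules in the chat: browsers strip the Referer header
    only when navigating from an HTTPS page to a plain-HTTP page."""
    return not (from_scheme == "https" and to_scheme == "http")

assert referrer_retained("http", "https")       # http -> https: kept
assert referrer_retained("https", "https")      # https -> https: kept
assert referrer_retained("http", "http")        # http -> http: kept
assert not referrer_retained("https", "http")   # https -> http: stripped
```

This is why an HTTPS-by-default Wikipedia keeps within-site and most incoming referrers, but outbound navigation to plain-HTTP sites ends up in the "somewhere else" bucket of the receiving site's logs.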
[19:51:25] :D [19:51:33] I don't see him in the channel now, but when he comes back, we should ask him for a status update. [19:51:55] Will it be ready in the next, uh, two weeks or so? [19:52:09] Hmm.. probably not, but we should still check. [19:52:34] There's a privacy & legal review that we need completed before we can release it. [19:53:04] Okay. So you couldn't just spin out a custom report for me before that solution is rolled out. [19:59:29] harej, no worries. I'll be waiting on your pings to get anything meaningful done for WikiProject X. [21:11:26] J-Mo, did you send out the workshop email yet? [21:11:52] no, probably tomorrow morning [21:14:55] kk. Do you want to send 'em all out together? [21:15:05] I think the academic one is ready to go. [22:11:31] halfak: yt? [22:11:40] Yeah. [22:11:41] What's up? [22:12:51] ottomata, ^ [22:12:52] so [22:12:58] when you've used wikihadoop before [22:13:06] you ran on the .bz2 files, ja? [22:13:13] yup [22:13:20] It's made for bz2, right? [22:13:24] and, you got more than one mapper per file, right? [22:13:36] Hmm... Not positive [22:13:51] bob is asking me about this, and I had always thought there was more than one mapper [22:13:54] but maybe not [22:14:02] Hmmm there should be [22:14:05] i just ran it with a test one (that was large enough to need splitting) and it only launched one mapper [22:14:05] Isn't that the point? [22:15:44] thought so. [22:38:09] halfak: I'm getting some errors trying to install deltas and use mediawiki-utilities. Are these packages for python 3.*? [22:38:20] Yes [22:38:31] What are the errors? [22:38:57] This is a syntax error: def __init__(self, uri, *args, user_agent=DEFAULT_USER_AGENT, **kwargs): [22:39:15] Not in python 3.x it isn't. [22:39:41] yeah, I'm using 2.7.9 [22:39:51] ok, that's what I suspected [22:40:35] I'd feel bad for only supporting python 3, but it has been out for six years now.
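The "syntax error" ewulczyn hits above is Python 3's keyword-only argument syntax (a default-valued parameter after `*args`), which Python 2.7 cannot parse. A sketch of the two forms; the class name and the `DEFAULT_USER_AGENT` value are hypothetical, not mediawiki-utilities' actual code:

```python
DEFAULT_USER_AGENT = "example-agent/0.1"  # placeholder value

# Python 3 only -- keyword-only argument after *args; a SyntaxError in 2.7:
# def __init__(self, uri, *args, user_agent=DEFAULT_USER_AGENT, **kwargs): ...

# A Python-2-compatible equivalent pops the keyword out of **kwargs instead:
class Session(object):  # hypothetical class for illustration
    def __init__(self, uri, *args, **kwargs):
        self.user_agent = kwargs.pop("user_agent", DEFAULT_USER_AGENT)
        self.uri = uri

s = Session("https://en.wikipedia.org/w/api.php")
t = Session("https://example.org", user_agent="custom/1.0")
```

The keyword-only form is stricter (it rejects `user_agent` passed positionally), which is presumably why the library uses it; the `kwargs.pop` workaround trades that safety for 2.7 compatibility.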
;) [23:06:41] halfak: i think you need to add the "statistics" package as a requirement for "Revision-Scoring" [23:06:55] That's built into python3 [23:07:23] huh, I had to use pip to get it, although I'm using anaconda [23:09:03] * halfak is checking it out [23:09:33] ewulczyn, it looks like it is standard as of 3.4 [23:09:38] https://docs.python.org/3/library/statistics.html [23:09:59] ok, I'm using 3.3.5 [23:10:13] should I be using 3.4? [23:10:23] Looks like it causes no harm to just add it as a requirement. [23:10:32] 3.4 will install it but won't use it. [23:10:40] 3.3 should be fine otherwise. [23:13:17] ewulczyn, https://github.com/halfak/Revision-Scoring/pull/35 [23:26:57] halfak: btw, i do get multiple mappers, but we have only tried history files [23:27:07] apparently multi-streaming files (whatever they are) only do 1 mapper [23:27:17] not sure why, maybe it has something to do with how they are compressed [23:27:29] Hmm. multi-streaming? [23:27:35] Is that some sub-option of bz2? [23:28:12] uhh, no, it is a file name [23:28:12] um [23:28:35] ah [23:28:36] simplewiki-latest-pages-articles-multistream-index.txt.bz2 [23:28:37] huh? a filename is affecting mapping? [23:28:37] etc. [23:28:42] oh. [23:28:43] Yeah. [23:28:46] No idea what that is. [23:28:53] ha, me neither, bob is using it apparently [23:32:21] OH! That's not XML [23:32:33] ottomata, ^ [23:32:54] no? [23:33:28] oh [23:33:29] halfak: sorry [23:33:30] not the index [23:33:31] ptwiki-20150117-pages-articles-multistream.xml.bz2 [23:33:32] that one [23:34:13] i also just tried [23:34:29] ptwiki-20150117-pages-articles.xml.bz2 [23:34:32] it only gives one mapper too [23:42:44] * halfak shrugs [23:43:05] yeah, dunno. i told bob to use the files that are already split if he hasta :/
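The `statistics` situation discussed above (stdlib from Python 3.4, a PyPI backport needed on 3.3) is why declaring it as a requirement is harmless: on 3.4+ pip installs the backport but the stdlib copy wins at import time. A minimal sketch of a guarded import along those lines:

```python
try:
    # Standard library from Python 3.4 onward; on 3.3 this resolves to
    # the PyPI "statistics" backport if it has been pip-installed.
    import statistics
except ImportError:
    statistics = None  # neither stdlib nor backport available

if statistics is not None:
    mean_score = statistics.mean([2, 4, 6])
```

On any Python 3.4+ interpreter the `except` branch never runs; it only matters on the 3.3 setups like ewulczyn's.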