[13:54:54] o/
[16:08:00] Do you guys have any interest in any of the article feedback tables that are still around?
[16:08:27] Wondering if https://phabricator.wikimedia.org/T59185 can be upgraded to just archive and delete the lot
[16:22:00] Hey Reedy, I'm guessing the answer is "yes", but we'll need DarTar to confirm.
[16:22:20] He's in PST, so he should be around soon.
[16:24:28] Be good to find out, whether for the couple of versions of AFT, for other wikis etc
[16:24:39] The answer might be to move the data elsewhere, rather than the production cluster etc
[16:29:06] o/
[16:37:32] halfak: any idea if someone already has (or keeps up to date) the output of your mwcites utility on enwiki? I have an output of just the current revisions (from last month's dump) but not the whole of history. I hope this is the right place to ask...
[16:44:45] Reedy, +1 I will help you find out what's up.
[16:44:52] :)
[16:45:01] tarrow, hey! I haven't updated the dataset in a while, but I could.
[16:45:10] Any idea about EmailCapture too? https://phabricator.wikimedia.org/T57676
[16:45:25] tarrow, It'll take about 15 hours. I should be able to provide it by tomorrow. Would that work?
[16:45:37] Reedy, sorry. no idea there.
[16:46:19] I don't think there's any value there... It just looks to be a log of ratings, and the emails sent
[16:46:39] that would be great for me if it is easy for you to do. Only 15 hours though? I think that is pretty much how long it took for me just to do current revisions.
[16:47:32] Otherwise I was about to set the job going on tool labs; just working out how to submit jobs in chunks to try and speed it up
[16:47:34] tarrow, I'll use 16 CPUs at a time :)
[16:48:17] ah cool, that would be really awesome if you're happy to!
[16:48:25] Not a problem. Due for an update.
[16:48:42] cheers, that's very kind :)
[16:54:21] tarrow, ping me tomorrow if you don't see an update somewhere. :)
[16:55:13] ok cool, where should I see the update?
I must admit I couldn't even find an outdated version using google
[16:58:21] http://figshare.com/articles/Wikipedia_Scholarly_Article_Citations/1299540
[16:58:25] tarrow, ^
[16:59:18] Wonderful! Great!
[17:03:34] it's tarrow!
[17:08:34] hey!
[19:05:22] Hey, I want to find out the number of new registered users in a given wiki in X month. I am not able to figure out how to use the user_registration value in conjunction with the timestamp stuff. Help?
[19:06:10] Which timestamp stuff?
[19:06:21] SELECT * FROM user WHERE user_registration >= DATE_SUB(CURDATE(), INTERVAL DAYOFMONTH(CURDATE()) -1 DAY); <-- This gives me data from last month
[19:06:43] I want to pick a month, say, two months ago, or three months ago.
[19:07:01] I picked that off StackOverflow.
[19:07:03] I wouldn't generally use the dynamic stuff
[19:07:24] Umm, it's for a one-time query. I'm not using this anywhere.
[19:07:31] Still
[19:07:35] It just increases complexity
[19:07:46] How do you suggest I do this then?
[19:08:05] just give me a minute :P
[19:08:16] I was getting an example timestamp from the db to make sure I put the correct number of 0s ;)
[19:08:17] 20050925093632
[19:09:06] :) Is this yyyymmddhhmmss?
[19:09:10] yus
[19:09:20] where user_registration >= '20151001000000' AND user_registration < '20151101000000';
[19:09:37] that's for October
[19:09:40] Okay! Makes a ton of sense, thanks!
[19:09:57] I think you can use..
[19:10:12] user_registration BETWEEN '20151001000000' AND '20151101000000'
[19:10:18] depending on your preference :)
[19:10:41] Right.
[19:10:57] Niharika, DATE_FORMAT
[19:11:20] * Niharika looks that up
[19:11:38] SELECT DATE_FORMAT(NOW(), "%Y%m%d%H%i%S"); --> 20151116191125
[19:11:54] :D
[19:11:58] Aha. Okay. Thanks halfak!
[19:12:20] ok, so etherpad is self-restarting now
[19:12:47] Really? Yay, panda.
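The fixed-timestamp approach Reedy suggests above can be sketched locally. The snippet below is an illustration only: it builds a tiny mock of MediaWiki's `user` table in SQLite (the user names and rows are invented), rather than querying the real replicas, and shows why a half-open range (`>=` lower bound, `<` upper bound) is slightly safer than `BETWEEN`, whose upper bound is inclusive.

```python
# Sketch of the registration-window query discussed above, run against a
# tiny invented mock of MediaWiki's `user` table in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (user_name TEXT, user_registration TEXT)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?)",
    [
        ("Alice", "20150925093632"),  # September 2015 -- outside the window
        ("Bob",   "20151001000000"),  # first second of October -- inside
        ("Carol", "20151031235959"),  # last second of October -- inside
        ("Dave",  "20151101000000"),  # first second of November -- outside
    ],
)

# MediaWiki stores timestamps as yyyymmddhhmmss strings, so plain string
# comparison sorts chronologically.  The half-open range excludes Dave,
# whereas BETWEEN '20151001000000' AND '20151101000000' would include him.
october = conn.execute(
    "SELECT user_name FROM user "
    "WHERE user_registration >= '20151001000000' "
    "AND user_registration < '20151101000000'"
).fetchall()
print([name for (name,) in october])  # ['Bob', 'Carol']
```

In practice the one-second overlap from `BETWEEN`'s inclusive upper bound rarely matters, which is why both forms come up in the discussion.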
[19:14:25] yeah it's the php school of fixing things
[19:14:27] 'if it is down restart it'
[19:16:38] When your software's going down / You must restart it♪
[19:16:40] I don't see any pandas ;)
[19:18:05] * PlasmaFury used to watch a TV show called 'Get Ed' as a kid
[19:18:17] it had a whacko character who was clearly not right in the head but super fun
[19:18:28] and plasmafury was something that character said in one episode for a random reason
[19:18:37] https://www.youtube.com/watch?v=RGKMWwQiq48
[19:19:14] lol @ rocket skateboard
[19:21:19] halfak: in hindsight it seems obvious that they're doing drug delivery
[19:21:58] lol
[19:37:59] Anyone willing to recheck my query? http://quarry.wmflabs.org/query/6135 (List of people who registered in the month of October and made less than 10 edits in the same month)
[19:38:09] On nlwiki.
[19:39:35] Niharika: That'll include anyone registered in November etc too
[19:40:00] Reedy: But I did filter by user_registration, right?
[19:40:08] yeah, you only put a lower bound on it
[19:40:28] WHERE u.user_registration >= '20151001000000' AND u.user_registration <= '20151101000000'
[19:40:35] Both bounds?
[19:40:36] yup
[19:40:56] also, the editcount will be total edits, so any edits after October too
[19:41:01] Depending on how much you care about that
[19:41:40] Hang on, I put both a lower and upper bound on the rev_timestamp and user_registration.
[19:41:50] Something wrong there?
[19:42:15] Reedy: ^
[19:42:51] the query I'm looking at is
[19:42:52] also, the editcount will be total edits, so any edits after october too
[19:42:53] ffs
[19:43:00] SELECT * FROM user WHERE user_registration >= '20151001000000' AND user_editcount < 10;
[19:43:06] Oh.
[19:43:12] Gotta run it to save it
[19:43:16] lol
[19:43:21] Wait. This is a draft. It won't show you the right thing.
[19:43:27] Computers suck
[19:44:16] http://quarry.wmflabs.org/query/6135
[19:44:19] yes running it saves it
[19:44:20] Does this work?
[19:44:38] still shows the same sql I posted above
[19:44:55] One more try? :P
[19:45:01] I ran it a couple of times now.
[19:45:41] I can't say it's 100% correct, but it looks about right, taking in the various variables :)
[19:46:36] Reedy: But it's wrong. :( SELECT * FROM user WHERE user_registration >= '20151001000000' AND user_editcount < 10; <--- This gives me an answer in a few thousands. The number should be quite similar.
[19:46:57] do it incrementally
[19:47:07] start with that query, add both bounds
[19:47:13] See what the number looks like
[19:47:22] Reedy: Okay.
[19:47:31] Remember, that one includes registrations for half of November too
[19:48:21] Reedy: Right. But 800 to 3800 is a way big difference for an additional half month.
[20:02:42] Eureka! :P
[20:03:14] It wasn't counting people who made 0 edits in the first query because I was joining on the revision table.
[20:03:19] *I think*
[20:05:17] Niharika, makes sense.
[20:05:25] You can left-join to avoid that problem.
[20:05:44] bearloga, were you working with Ironholds on the discussion toxicity analysis?
[20:05:54] halfak: Doesn't make sense. :( It was a left join in the first place. But I am pretty sure that's what's happening.
[20:06:09] I am not sure why it's happening though.
[20:06:10] halfak: first time hearing about it
[20:06:18] * Niharika will sleep on it
[20:06:29] Niharika, can you link me to the query again?
[20:06:48] halfak: http://quarry.wmflabs.org/query/6135
[20:07:04] bearloga, OK. No worries.
[20:07:27] fhocutt! Were you working with Ironholds on talk discussion analysis stuff?
[20:07:52] Niharika, can help. Going to be around for 10 more minutes?
[20:08:04] Also, is it OK to ignore edits to deleted pages?
[20:08:18] halfak: Yes and yes!
[20:08:22] Cool.
[20:09:17] halfak, yes!
[20:09:46] fhocutt, got in contact with a researcher who wants to build up a "troll detection" system.
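The zero-edit problem diagnosed above (users who registered but never edited disappear when you join through `revision`) can be made concrete in miniature. The snippet below is a hypothetical stand-in, not Quarry query 6136 itself: it uses an invented two-user dataset in SQLite and a much-simplified version of the MediaWiki schema, with the October bounds applied in the join condition so that zero-edit newcomers survive with a count of 0.

```python
# Sketch of the left-join fix for counting newcomers with < 10 October edits,
# against an invented miniature of the `user` and `revision` tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (user_id INTEGER, user_name TEXT, user_registration TEXT);
CREATE TABLE revision (rev_user INTEGER, rev_timestamp TEXT);
INSERT INTO user VALUES (1, 'Eve',     '20151005120000');
INSERT INTO user VALUES (2, 'Mallory', '20151010120000');
INSERT INTO revision VALUES (1, '20151006000000');
INSERT INTO revision VALUES (1, '20151007000000');
-- Mallory registered in October but never edited.
""")

# An INNER JOIN on revision would drop Mallory entirely; the LEFT JOIN keeps
# her with an edit count of 0.  Note the timestamp bounds live in the ON
# clause -- putting them in WHERE would turn the left join back into an
# inner join by filtering out the NULL rows.
rows = conn.execute("""
    SELECT u.user_name, COUNT(r.rev_user) AS october_edits
    FROM user u
    LEFT JOIN revision r
      ON r.rev_user = u.user_id
     AND r.rev_timestamp >= '20151001000000'
     AND r.rev_timestamp <  '20151101000000'
    WHERE u.user_registration >= '20151001000000'
      AND u.user_registration <  '20151101000000'
    GROUP BY u.user_id
    HAVING october_edits < 10
""").fetchall()
print(sorted(rows))  # [('Eve', 2), ('Mallory', 0)]
```

The `ON`-clause placement of the revision filters is the detail that most often trips up queries like the one being debugged: a left join whose joined columns are filtered in `WHERE` behaves like an inner join.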
[20:09:56] return isDomas();
[20:09:56] I'm looking for relevant literature
[20:10:00] lol
[20:12:56] Niharika, http://quarry.wmflabs.org/query/6136
[20:13:02] I think that'll work.
[20:13:06] I'm running some tests now.
[20:13:13] * PlasmaFury should have some form of collaborative editing in quarry
[20:13:18] Yup. It looks like it.
[20:13:29] PlasmaFury, forking first <3
[20:13:45] yeah
[20:13:45] halfak: Looks good. Thanks a lot! :D
[20:13:48] forking..
[20:13:48] :D
[20:13:50] No problemo
[20:14:31] halfak: ooh, interesting
[20:14:36] ML fun stuff?
[20:14:48] fhocutt, yeah. Probably. :)
[20:15:00] I haven't done any of that yet, sounds fun
[20:15:08] At least working out some datasets so that people can do ML, offline analyses, qualitative work, etc.
[20:15:27] yeah, we can probably help with that
[20:15:44] OK if I get you CC'd on the conversation?
[20:16:51] sure thing. We have a dataset, but we haven't started coding the messages yet
[20:17:13] Cool. Will do.
[20:19:57] thanks!
[22:29:21] hi people. :-) We still have 38 min of the meeting last Monday untranscribed. If you would like to contribute to the document, please ping me.
[23:26:29] halfak: your desirability ratio looks more like a relative risk than an odds ratio to me. odds ratio is the ratio of the odds, so it's more like (p(desirable|score)/(1-p(desirable|score))) / (p(undesirable|score)/(1-p(undesirable|score)))
[23:28:22] * halfak reads and thinks
[23:29:02] so many parens
[23:29:36] is there an IRC client that supports MathJax? because I would love that
[23:29:46] bearloga, yeah. I suppose you're right.
[23:30:07] bearloga, that odds ratio does not account for the prior?
[23:30:28] halfak: oh it does. I'm just using the posterior form
[23:30:48] halfak: I could substitute the likelihood x prior into that but nobody here wants/needs to see that mess
[23:31:13] Gotcha.
[23:31:22] Do you think it would be more desirable to use the odds ratio?
* bearloga thinks
[23:31:46] We do have a problem with extreme values. I'm wondering if that would help make the output more reasonable.
[23:31:50] * halfak wants to run a simulation
[23:33:42] I just assumed that bayesian strategies like this one would trend toward extreme values.
[23:34:10] halfak: why?
[23:35:04] Well, as more data is added -- assuming that the observations were really drawn from one of the two distributions -- the ratio should get more and more extreme.
[23:35:58] halfak: also, to answer your question: it depends on the interpretation you want. relative risk lets you say "this newcomer is 4 (for example) times more likely to be desirable than undesirable" while odds ratio lets you say "the odds that this newcomer is desirable is 3 (for example) times the odds that this newcomer is undesirable"
[23:36:43] Yeah. looks like I was good on the former.
[23:36:47] I just used the wrong term.
[23:37:09] The latter is way harder to interpret.
[23:37:34] halfak: learn something new every day :P harder to interpret but actually used way more often :)
[23:37:59] * bearloga still thinking about the extreme question
[23:48:58] halfak: the ratio would become extremely large as p(desirable|score) goes up and p(undesirable|score) goes down, but neither is dependent on the size of the data. more data points would influence the ratio to the extent that they make the estimate more reflective of the truth, but more data wouldn't affect the truth
[23:49:17] if that makes sense
[23:50:29] Yeah. It still seems like we get to very extreme values fast, but it might just be that this is how confidence grows.
[23:51:02] what do you mean "get to very extreme values fast"?
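The distinction bearloga draws between the two ratios can be written out as a couple of one-line functions. This is only an illustration of the two definitions, assuming the strictly binary case where p(undesirable|score) = 1 − p(desirable|score); under that assumption the odds ratio works out to be the square of the relative risk, which is one reason it takes on more extreme values.

```python
# Illustration of relative risk vs. odds ratio for a binary outcome,
# assuming p(undesirable|score) = 1 - p(desirable|score).

def relative_risk(p_desirable):
    """p(desirable|score) / p(undesirable|score) -- halfak's 'desirability ratio'."""
    return p_desirable / (1 - p_desirable)

def odds_ratio(p_desirable):
    """Odds of being desirable divided by odds of being undesirable."""
    odds_desirable = p_desirable / (1 - p_desirable)
    odds_undesirable = (1 - p_desirable) / p_desirable
    return odds_desirable / odds_undesirable

p = 0.8
print(relative_risk(p))  # about 4: "4 times more likely to be desirable"
print(odds_ratio(p))     # about 16: the square of the relative risk here
```

The squaring also illustrates the "extreme values" worry in the later messages: as p(desirable|score) approaches 1, the relative risk grows without bound and the odds ratio grows quadratically faster, regardless of how much data produced the estimate.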