[00:11:22] MatmaRex, this looks reasonable to me. Can you remind me how you'll do evaluation? Looking for the proportion of deleted media files? [00:11:47] If so, we'll need to set a time horizon, but I like that measure. [00:12:08] halfak: yes. each upload using the tool will be tagged with the appropriate change tag, identifying the bucket [00:12:27] halfak: and the plan is to count the total uploads with each tag, and count how many of them are deleted by Commons admins [00:13:24] halfak: ideally, i'd want to run this for a week (starting… tomorrow), but if you say that's not enough (or if we get so few uploads that the result isn't clear), we'll extend it to run over christmas [00:14:10] +1 [00:14:25] Let's keep it running while we do the analysis unless it seems like it is causing problems. [00:14:53] We'll want to do an analysis after day 1 to make sure that one of the conditions isn't super bad or there's a bug preventing it from working. [00:15:06] How does bucketing work? [00:15:14] Will a user be in the same bucket for all uploads? [00:16:03] halfak: yes. it's based on user id [00:16:10] (only logged in users can upload anyway) [00:16:16] Cool. [00:16:19] Perfect [00:16:29] We'll need to control our analysis by user [00:16:42] Because a single user could upload a bunch of bad images and ruin a condition. [00:17:42] So, I think that a https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test of problematic uploads per user will work nicely :) [00:26:55] :D [00:31:34] halfak: i think we'll end up gradually enabling the test for more users over the week, since i probably won't get it backported to wmf.8 (so it'll only hit wikipedias on thursday with wmf.9), and since there are no translations right now (and i'd like to test with more than just en.wp when we get it translated) [00:43:31] MatmaRex, I was assuming that this would run on commons. 
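[editor's sketch] The evaluation plan discussed between 00:12 and 00:17 (bucket by user id, count uploads and deletions per bucket, control the analysis by user) might look roughly like the following. This is a minimal sketch under stated assumptions, not the tool's actual code: the log only says bucketing is "based on user id", so the modulo rule in `bucket_for` is hypothetical, as are the function names and the `(user_id, was_deleted)` input shape. It stops at per-user deletion rates, which a rank-based test could then compare across buckets.

```python
from collections import defaultdict


def bucket_for(user_id, n_buckets=2):
    # Hypothetical rule: the log only says bucketing is "based on user id".
    # Any deterministic function of the id keeps a user in one bucket.
    return user_id % n_buckets


def per_user_deletion_rates(uploads, n_buckets=2):
    """uploads: iterable of (user_id, was_deleted) pairs.

    Returns {bucket: [deletion rate of each user in that bucket]}, so that
    one prolific bad uploader counts as a single data point per bucket.
    """
    totals = defaultdict(int)
    deleted = defaultdict(int)
    for user_id, was_deleted in uploads:
        totals[user_id] += 1
        deleted[user_id] += int(was_deleted)

    rates = defaultdict(list)
    for user_id in sorted(totals):
        bucket = bucket_for(user_id, n_buckets)
        rates[bucket].append(deleted[user_id] / totals[user_id])
    return dict(rates)


# Made-up example data: user 1 uploads three files that all get deleted.
uploads = [(1, True), (1, True), (1, True), (2, False),
           (3, False), (3, True), (4, False)]
rates = per_user_deletion_rates(uploads)
```

Aggregating per user first is what keeps the three deleted uploads from user 1 from dominating the bucket: they collapse into a single rate of 1.0.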
[00:43:54] halfak: no, it's cross-wiki [00:44:11] halfak: the upload dialog can be launched from VisualEditor or wikitext editor on any Wikimedia wiki right now, and the uploads go to Commons. [00:44:51] OMG <3 [00:45:00] That's pretty cool. [00:45:03] :o [00:45:11] So will the old upload wizard stay on commons during the test? [00:45:21] yep, we're not touching it [00:45:29] the cross-wiki upload tool is fairly limited [00:45:56] (the interface only supports uploading a single file at once, and only with cc-by-sa license) [00:46:35] it apparently almost doubled the number of uploads to Commons by new users. pity that so many of them were not the kind of uploads we want, heh [00:47:05] MatmaRex, indeed. This is the problem with removing barriers to contribution [00:47:23] You get a bunch of crap with the good stuff. So, we need to do quality control better. [00:47:41] MatmaRex, we could explore setting up an ORES model to help people curate. [00:48:08] That would allow people to sort the hour/day/week's uploads by the most likely to be problematic. [00:48:15] We haven't done any image processing yet. [00:48:22] But we could dig into the description [00:48:48] How much more engineering time do you have devoted to this project? [00:49:00] halfak: yeah, i know that you did revision scoring, but i don't expect that it handles images yet ;) [00:49:18] halfak: there's very little by way of description in these uploads, though. (and in uploads in general, i guess) [00:50:10] MatmaRex, you think that could be a good threshold? [00:50:19] We could require a description and operate on that. [00:50:33] halfak: as much as it's needed? this tool is technically one of our quarterly goals for the current quarter [00:50:37] Also, I'd like to pull in an intern or something to work on NeuralNets for image processing. [00:50:41] we do require a description, but they're short. 
[00:51:03] (typically, at a glance; i have no statistics right now) [00:51:44] "MatmaRex, you think that could be a good threshold?" i don't understand? [00:52:13] Sorry. I was thinking that requiring a description would reduce the number of bad uploads. [00:52:19] But if you already do. [00:54:28] halfak: for curation, people generally check suspected copyvios by uploading it to google images ;) i was thinking about making a bot/tool to automate this, but i figured getting the test running had higher priority [00:54:57] and i figure that google will probably quickly block me (or start sending captchas, or something) if i search for a thousand images a day [00:55:30] (there aren't many alternatives to google image search, and they're all worse) [00:56:32] halfak: i was also thinking that it would be an interesting project to detect inexact duplicate images across Wikimedia projects (scaled, cropped down, watermarked, etc.). we can only detect exact duplicates right now (by sha1 hash) [00:57:26] MatmaRex, yeah. That seems like a trick. I bet the research lit would have something for us though. [00:58:26] yeah, it's definitely possible (after all, google images or tineye do this, and probably on larger corpuses), and it's probably a difficult research area :P [00:59:11] Maybe. Need to either set aside time or find a researcher working in this space to find out. [00:59:37] Then again getting a service for this would be nice. [00:59:59] and there are tools like http://www.visipics.info/ [01:01:14] Oooh! [01:01:45] "The neat functions" [01:01:50] I like this documentation writer [01:24:47] o/ J-Mo still around?
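[editor's sketch] The 00:56 remark that only exact duplicates can be detected "by sha1 hash" can be illustrated with a small sketch. It assumes file contents are available as bytes; the function names and the in-memory `files` dict are hypothetical and are not how MediaWiki stores or looks up hashes.

```python
import hashlib
from collections import defaultdict


def sha1_hex(data: bytes) -> str:
    # SHA-1 digest of the raw file bytes, as a hex string.
    return hashlib.sha1(data).hexdigest()


def find_exact_duplicates(files):
    """files: dict mapping file name -> file content (bytes).

    Returns a sorted list of name groups whose contents are byte-identical.
    """
    by_hash = defaultdict(list)
    for name, data in files.items():
        by_hash[sha1_hex(data)].append(name)
    return sorted(sorted(names) for names in by_hash.values() if len(names) > 1)


# Made-up contents: c.jpg differs from the others by a single trailing byte.
files = {"a.jpg": b"pixels", "b.jpg": b"pixels", "c.jpg": b"pixels "}
dupes = find_exact_duplicates(files)
```

Because any change to the bytes yields an unrelated digest, a scaled, cropped or watermarked copy never lands in the same group, which is why the inexact case is described in the log as a separate research problem.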
[01:24:56] yep [01:25:37] I had a 1:1 with abbey today where we talked about the potential of turning a project about newcomers, patrollers and socializers into a proper Wikimedia Research collaboration [01:25:52] that sounds promising [01:25:53] * YuviPanda provides halfak with 50% more time [01:26:16] * halfak consumes time immediately with meetings [01:26:18] ;) [01:26:28] * YuviPanda continues keeping meetings down to <2h a week [01:26:47] J-Mo: btw, remember the 'notebook' stuff I told you about over dinner when you were here? halfak is going to be using that to do the teahouse analysis :D [01:26:53] J-Mo, I don't have a bunch of thoughts right now, but I was hoping to be able to talk to you soon about what you'd like to see us do in such a collab. [01:28:29] I'd love to talk through ideas with you, halfak. I'll be on vacation Wed-Fri, but I'm in-office Mon-Wed next week, and Mon-Wed the following week. Or we can talk face to face during all hands week :) [01:28:40] * J-Mo prefers the last option, because it's most likely to involve beer [01:29:12] YuviPanda: yes, I've heard. I'm looking forward to seeing an iPython notebook demo with data I can wrap my head around! [01:29:18] :D [01:29:24] he's going to do it in R! [01:29:24] and/or sink my teeth into. Pick your cranial metaphor [01:29:35] an 'R' notebook? [01:29:38] yes [01:29:51] IPython was renamed to 'Ju''Py'te'R' [01:29:54] (Jupyter) [01:30:01] damn. guess I have to go learn a whole new language. [01:30:09] ah, I see [01:30:30] J-Mo: yeah, it can do python too (and is the default). halfak just wants to use R this time [01:31:26] J-Mo, was thinking the last option [01:31:40] J-Mo: I hope to first gather some momentum from around the place and then use that to setup one of these in prod, so normal people have a better chance of doing internal data analysis without having to learn commandline stuff [01:32:17] Also, awesomer work-logs [01:32:18] awesome! I sincerely hope to be one of your alpha testers. 
[01:32:31] :D [01:32:38] * YuviPanda continues slow long war against commandline bullshitery [01:32:59] YuviPanda, just don't take away my clean little unix utilities. [01:33:16] separation of ideas from implementations! :D [01:33:42] hmm, I guess I need to express that idea more clearly [01:34:04] unix commandline is incredibly powerful, but having that be the only option - where you need to clear a bar 'this' high before you can do anything is bad. [01:34:14] so I guess I want people to have a better way to get there [01:35:52] Yes. We should be able to work in the same env. with a lower bad, but similar power [01:35:58] *lower bar [01:36:00] bad bar [01:36:05] yeah [01:36:31] going from 'I do not know what this is, I just wanted to do X!' to 'oooh, X' should be smaller [01:36:33] Really, I think it is the setup that is the worst [01:36:44] yeah [01:36:51] so this solves all that too [01:37:00] 'what do you mean, I have the wrong version of make?!' [01:37:07] How do I run my query? If the answer was make a file and pipe it to "mysql", then we wouldn't have nearly as much trouble. [01:37:16] * YuviPanda nods [01:37:28] Though, I'm a fan of not explaining "pipe" to those who don't want to use it. [01:37:34] 'no you gotta set this up and then that and compile this' [01:37:37] that too [01:37:40] 'how do I run a query?' [01:37:43] %%sql [01:37:50] results = %%sql [01:37:52] done [02:12:50] halfak: wow, R just seems such a clusterfuck to install :| [02:12:57] halfak: is there a virtualenv equivalent for R? [02:13:13] Not that I know of :( [02:13:35] I only use R for the one thing it's really good at. [02:13:45] So I have never had a need for that. [02:15:06] right [02:15:13] fun fun fun [02:18:40] halfak: nvm, got it all working :) [02:18:47] \o/ [02:18:48] Woot [02:18:54] halfak: you just need a mysql connection to labsdb on top of it right? [02:19:00] or do you want it to analytics-store?
[02:19:03] either is the same [02:19:39] YuviPanda, labsdb will be fine for the text analysis, but analytics-store will be desirable long-term [02:20:07] halfak: so for *right now*, I'm thinking of just setting up a super simple thing that runs on either stat1003 or on your own machine [02:21:12] Hmm. Then analytics-store [02:21:46] do you want to run this on stat1003 or on your own machine? [02:23:53] halfak: um, is there no working 'pip' on stat1003? [02:24:16] YuviPanda, had to set it up in my virtualenv. [02:24:27] There should be a working virtualenv installed. [02:24:37] yes but it can't seem to call onto the outside world? [02:24:39] does it need a proxy? [02:24:49] Yes [02:25:11] https://wikitech.wikimedia.org/wiki/Http_proxy [02:25:15] aha [02:25:17] ok [02:25:21] let's hope R respects that [02:25:32] It seems to :) [02:43:01] halfak: got about 5 minutes to help me test something? :) [02:46:33] Sure. As soon as I am done getting Amir1 set up [02:47:02] halfak: :D ok [02:51:38] halfak: assuming I have you here now :D can you clone https://github.com/yuvipanda/halfhub on your local machine? [02:52:10] * halfak does [02:52:20] then do [02:52:25] sudo bash install-packages.bash [02:52:31] bash install-jupyter.bash [02:52:49] (do examine the super short scripts before you do for security :) ) [02:53:09] and let me know what happens [02:53:23] * halfak just gets done looking at the scripts [02:53:46] I would like to file for the record that I wasn't just going to sudo run without looking ;) [02:54:11] * halfak sees "rm -rf / [02:54:33] :D [02:54:56] I too wanted to file for the record I do tell people to look at things before sudo-ing scripts [02:54:56] YuviPanda, should I be sudo-ing the second one? [02:54:59] halfak: nope [02:55:05] failed [02:55:06] halfak: actually, second one [02:55:08] run [02:55:12] bash install-jupyter.bash . [02:55:15] add a . at the end [02:55:36] aha. 
Worked that time [02:55:45] what's it doin [02:56:06] j/k no it didn't [02:56:17] * halfak gets copy-pasta [02:56:30] http://dpaste.com/041E4YJ [02:56:52] Oh hey! It looks like it is using my other venv. [02:57:11] halfak: pull? [02:57:15] halfak: stupid typo, I just fixed it [02:57:41] * halfak examines script again [02:57:54] weeeeee [02:58:18] what's it doing [02:58:59] New errors [02:59:25] http://dpaste.com/326GG4Q [03:01:05] halfak: try: [03:01:12] CXX_STD=CXX11 bash install-jupyter.bash . [03:02:31] halfak: also what version of R are you using [03:02:39] 3.0.2 [03:02:48] bah [03:02:51] this requires at least 3.1 [03:02:55] OK. [03:03:37] getting [03:03:40] oh [03:03:44] you can just upgrade your R? [03:03:46] nice [03:03:48] :D [03:06:18] * halfak is still upgrading [03:06:27] Had to have fun with repos [03:06:50] * halfak feels the sleep coming for him [03:07:05] OK. R is new. [03:07:13] Trying to install jupyter again [03:07:41] * YuviPanda hopes [03:09:10] New error [03:09:27] http://dpaste.com/3W8S9YV [03:09:56] I'm now on 3.2.3 [03:09:58] too new? [03:10:07] halfak: no, that should work [03:10:14] halfak: as an experiment, just try running the script again? [03:11:39] Same error [03:11:58] ugh ok [03:12:04] you should let sleep claim you [03:12:10] I'll mess around to see wtf is going on [03:12:33] OK. Have a good one! [08:01:13] halfak [08:01:53] hello everybody... [10:25:32] Quarry, Labs: 502 Bad Gateway on HTTP-requests to quarry.wmflabs.org - https://phabricator.wikimedia.org/T121502#1880330 (Stigmj) NEW [13:58:45] o/ [16:17:58] \o [16:18:19] o/ [16:52:54] Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1880983 (Edgars2007) OK, will get serious. I think the problem is still there, @yuvipanda. Or at least is related to this one. In last few days some 15 blank queries (those, which you get after pressing "New query") have appeare...
[19:58:03] halfak: I sent you notebook info [19:58:20] ShiveringPanda, saw that. [19:58:31] did it work [19:58:34] I'm blocking myself into revscoring feature hacking until 3PM PT [19:58:36] ok [19:58:40] that's fair [19:58:42] enough [19:58:48] * halfak focuses hard [20:01:33] For those here interested in data visualization, the "Historiography" tech talk is starting. Details at https://lists.wikimedia.org/pipermail/wikitech-l/2015-December/084304.html [20:02:37] Histography* [22:59:15] halfak: still focusing? [22:59:32] Just about to come out of meeting and get started. [22:59:35] cooool [22:59:45] Will need to throw the ball for the dog for 10 mins. [22:59:53] so, etc 3:30 PT [22:59:57] *3:10 [23:00:20] ok [23:16:37] o/ ShiveringPanda [23:16:44] Starting instructions [23:16:46] hi [23:16:49] :D [23:16:51] ok [23:18:54] OK. Looks like the first thing I need to do is get my repo on here. [23:18:59] Can I run a term on this? [23:19:18] Aha! Found it [23:19:19] halfak: yes, in 'new' you can click 'terminal' [23:19:26] halfak: let me know if you want any packages installed [23:19:37] Will do. [23:19:43] Am I already *in* a venv? [23:19:46] yes [23:19:48] you are [23:19:55] cool [23:19:57] it's in /srv/notebook [23:20:02] and your user has write perms [23:20:18] * halfak installs pymysql [23:20:21] awesome [23:20:48] :D [23:21:00] halfak: I thought you were going to use R? :D [23:21:16] Both. Remember? [23:21:28] I only use R for stats and plotting. [23:21:32] aaah I see. [23:21:34] Two things it does exceptionally well. [23:21:38] so you do your data munging in python [23:21:43] and then load that up into R [23:21:51] is that correct? [23:21:56] +1 [23:21:58] cool cool [23:22:08] yeah, eventually there'll be support for multi-lingual notebooks [23:22:13] where you can do ^ in one notebook [23:22:16] but not yet I guess [23:22:31] halfak: are you gonna do data munging on a notebook or terminal? [23:23:21] So... I can't paste into the terminal. 
[23:23:36] I have some python scripts and a Makefile. [23:23:40] halfak: yup, works only in chrome unfortunately right now. [23:23:43] paste and copy [23:23:46] chromium / chrome [23:23:50] But then I explore the data in R [23:23:50] fixed in an upcoming release [23:24:10] No prob switching [23:24:13] thanks [23:24:26] halfak: you can find your running terminal under the 'running' tab [23:26:32] Hmm.. Looks like I'm going to need to change how I work with R. [23:26:36] Not that much. [23:27:39] So, back when I was playing around with this in the past, I could mix R and Python in the same notebook. Is that still possible? [23:27:55] ShiveringPanda, ^ [23:28:23] halfak: not sure, checking [23:31:30] halfak: not that I can see immediately [23:31:37] halfak: oh wait [23:31:44] halfak: you mean it was possible before? then it should still be [23:32:25] Yeah... trying to remember. I think I could start up an R block and work from that. Anyway, that's fine and I can work from python for today. Maybe I'll discover a few things again while I work. [23:32:36] halfak: yes, you're looking for the 'R magic' [23:32:41] I can set that up [23:33:41] That would be cool :) [23:33:52] I imagine the SQL bits I get are "SQL magic"? [23:34:24] yeah [23:35:44] OK. Task #1. I need to move data from the prod slaves to labsDB. [23:36:08] I forgot how user databases work in labs. Can you direct me to the right doc? [23:37:40] halfak: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database [23:37:59] halfak: basically, find username in .my.cnf and create db of form username__ [23:38:04] NOOO. I can't edit SQL files in ipython notebook! [23:38:25] It just downloads them when I click on them. [23:38:27] hmmm [23:39:33] halfak: what do you mean? [23:39:51] When I click on a file that ends in .sql, it just downloads the file. [23:39:58] oh that's strange [23:40:03] I jupyter [23:40:09] Can you get to my notebook? 
[23:40:22] I see it [23:40:27] (in my own instance) [23:40:37] halfak: can you just rename it to not be '.sql'? I'll file a bug upstream [23:41:16] guys guys guys. https://www.mediawiki.org/wiki/Multimedia/December_2015_cross-wiki_upload_A/B_test [23:41:32] halfak: you can also just use vim / nano in the terminal :D [23:41:37] Boo [23:41:45] Can do that on my own term. [23:41:46] :P [23:41:48] :P [23:41:49] i know [23:41:56] I'll file a bug about this I guess [23:42:10] MatmaRex, you should move the research report to meta:Research :) [23:42:25] MatmaRex, https://meta.wikimedia.org/wiki/Research:New_project [23:44:05] see, i should've asked if there's a template for this somewhere. [23:44:47] MatmaRex, a lot of those instructions are for external academics. Delete any sections that don [23:44:48] maybe later, i've already done way too much writing today (and too little coding!), and i still need to document the feature itself too. [23:44:49] t make sense [23:45:14] MatmaRex, no worries. Mind if I do a copy-paste move? [23:47:07] ShiveringPanda, I have a problem with using the file browser in ipython. I'd like to git mv things, but the ipython system doesn't know git. [23:47:35] hmm, right. so the ipython ecosystem doesn't really integrate with people using git for notebooks very well. [23:47:46] but to be fair, I haven't found any GUI for git usable... [23:48:38] Yeah. me either. [23:48:44] Versioning is important to me. [23:48:55] indeed [23:49:06] Maybe I should just focus on using the notebook for note taking and not try to write any scripts in it. [23:49:18] so there's ongoing work [23:49:21] on diffing notebooks [23:49:39] which is step 1 for allowing versioning properly [23:49:56] halfak: what's your git workflow now? [23:50:07] halfak: do you use a commandline git or an integration into an editor?
[23:51:23] commandline [23:51:37] But I would be fine with some integration for renaming files and all that [23:52:51] halfak: I agree [23:53:01] halfak: for now can you keep using the commandline on the web terminal? [23:53:37] Yeah. Making do. [23:53:47] thank you [23:53:48] * halfak is happy to have new toys and is willing to work on the corners :) [23:56:01] ShiveringPanda, any fancy command I need to run to get a mysql client? [23:56:10] I'll need mysqlimport too for loading datasets. [23:56:20] halfak: let me install it to you moment [23:56:21] *for [23:56:24] Thanks [23:59:19] halfak: done
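[editor's sketch] The 23:37 instruction ("find username in .my.cnf and create db of form username__") could be scripted along these lines. This is a sketch under assumptions: it takes the file contents as text, expects a `[client]` section with a `user` key, and the suffix after the double underscore is left elided in the log, so the `teahouse` suffix, the `u1234` user and the function name here are purely hypothetical.

```python
import configparser


def user_database_name(my_cnf_text, suffix):
    """Build a '<user>__<suffix>' database name from .my.cnf contents.

    Assumes a [client] section with a 'user' key; values may be quoted.
    """
    # interpolation=None so a '%' in the password is not treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(my_cnf_text)
    user = parser["client"]["user"].strip("'\"")
    return f"{user}__{suffix}"


# Made-up credentials file; not a real user or password.
example_cnf = "[client]\nuser = 'u1234'\npassword = not-a-real-password\n"
db_name = user_database_name(example_cnf, "teahouse")
```

The resulting name would then be used in a `CREATE DATABASE` statement issued through whichever client is at hand, for example the `pymysql` package or the `mysql` command-line client mentioned in the log.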