[00:02:57] halfak: yeah! it's from talking to you - the idea of picking what you wanted to do in the morning
[06:52:43] joal, I have a working script for restarting dump downloads using md5s. It is at https://github.com/kjschiroo/research-cluster if you are interested in it. I took a look at some of the work you had and some of the work that I already had to get the dumps I was working on restarted.
[15:56:19] joal: I was thinking of setting up the download_dumps script to do md5sums in parallel, since they take a painful amount of time with files this large. Probably going to have to use multiprocessing rather than threading. Any thoughts?
[15:57:07] kjschiroo: I have thought about that, threading should be enough
[15:57:24] kjschiroo: I have taken your beautiful code as an example to modify my various scripts
[15:57:59] kjschiroo: https://github.com/jobar/research-cluster/tree/etl
[15:59:13] joal: :) Thanks! Will threading actually lead to runtime improvements in this case, given the global interpreter lock? I thought it worked for network stuff because of the waiting involved.
[16:00:25] kjschiroo: I don't know enough about python threading to answer - I would have thought that using multithreading to compute across multiple CPUs would have done the trick!
[16:01:29] kjschiroo: in standup, will be back after
[16:02:05] one issue though, kjschiroo, when using those dumps: we can't test in parallel :)
[16:06:49] Re threading: Yeah, one would have hoped. I'm pretty sure this is one of those things that Python is lacking. Unless you are using multiprocessing, Python only uses a single core. If you are doing network operations this works just fine, but I don't think it does the trick for computational tasks. This is a relevant post about it:
[16:06:49] http://stackoverflow.com/questions/4496680/python-threads-all-executing-on-a-single-core
[16:09:16] Re dumps: Yeah, I'd realized that also. Where is that limit imposed? I was wondering if we could use different wikidbs and get away with it. It would also be helpful to have a small wiki to test against.
[16:09:51] kjschiroo: dumps.wikimedia.org accepts 2 connections per ip
[16:10:08] :(
[18:29:49] I have a research project proposal here to get a data set enabling us to calculate the labor hours being contributed to Wikipedia. If anyone has feedback to offer, it would be appreciated. https://meta.wikimedia.org/wiki/Research:Measuring_editor_time_commitment_and_workflow
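
[Editor's note: a minimal sketch of the multiprocessing approach the thread converges on - hashing dump files with worker processes so the GIL does not serialize the CPU-bound md5 work the way threads would. The file names and pool size below are hypothetical, not taken from the linked repos.]

    import hashlib
    import multiprocessing

    def md5sum(path, chunk_size=1 << 20):
        """Compute the md5 of a file, reading 1 MiB chunks so a large
        dump file never has to fit in memory."""
        digest = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                digest.update(chunk)
        return path, digest.hexdigest()

    if __name__ == '__main__':
        # Hypothetical dump file names, for illustration only.
        paths = ['enwiki-dump-part1.xml.bz2',
                 'enwiki-dump-part2.xml.bz2']
        # Each worker is a separate process with its own interpreter,
        # so the hashing actually runs on multiple cores.
        with multiprocessing.Pool(processes=4) as pool:
            for path, checksum in pool.imap_unordered(md5sum, paths):
                print(path, checksum)

[The computed checksums could then be compared against the published md5 list to decide which downloads need restarting, per the script kjschiroo describes above.]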