[18:36:27] Can anyone advise on an acceptable approach to mirror images from Wikimedia? The image dumps @ http://dumps.wikimedia.org/backup-index.html are not available and there don't seem to be any alternatives
[18:36:47] I have checked out the Xowa project @ sourceforge and have been able to obtain images that way, but I would like to go to the primary source, especially wrt updates
[18:41:17] raidex, "don't"
[18:42:00] well, it depends how you do it
[18:42:05] if you want to set up a live mirror: don't.
[18:42:43] otherwise, polling commons.wikimedia.org as and when they're needed, I guess? If you're using MediaWiki as the mirror platform it can automatically cache from commons as images are asked for
[18:49:10] Ironholds: ok, thanks. If I wanted to build a random sample of, say, 1000 images, should I then stand up a MediaWiki instance and crawl that with caching enabled?
[18:49:32] I want to make sure I don't abuse bandwidth and get banned
[18:50:04] raidex, oh, 1,000?
[18:50:04] I need them for supervised learning, so I need as many images as I can get in the sample
[18:50:06] no, just do that live
[18:50:12] 1,000 is fine
[18:50:33] we get more peeved when people set up LITERAL live mirrors of our entire site :D
[18:52:04] So it's ok if I set up a slow crawler? I tried using a Node.js library named osmosis and I noticed that even with only 1 thread I could only pull a couple of images before it got stalled for minutes
[18:53:10] I thought the rate couldn't be the problem since I was very conservative. Any filtering going on for user agents?
[18:54:11] raidex: user agents aren't typically filtered unless they're doing something bad AFAIK.
[18:55:10] guillom, Ironholds: so is there any reason why a single request to the uploads server would seem to stall for such a long time?
[18:55:21] I guess you could simply use https://commons.wikimedia.org/wiki/Special:Random/File 1,000 times with wget.
[18:55:44] raidex: There's probably a reason; I just don't know what it is :)
[18:55:46] can't you just use the API to get 1,000 random pages?
[18:55:59] Probably
[18:56:41] I'm currently using this module as an API client for Node.js: https://github.com/macbre/nodemw
[18:56:53] Can't be totally random. I need the images to correspond to specific articles people will be tagging
[18:57:00] It's incomplete but you can call your own custom queries as well.
[18:57:19] raidex: Do you have a list of those topics?
[18:57:20] thanks guillom -- will check it out
[18:58:18] raidex, not really
[18:58:36] the only filtering I'm aware of is "the user agent should be /something/"
[18:58:37] Pretty varied: countries, people bios, animals for now, but it's not really aligned to themes yet
[18:59:17] Ironholds, guillom: thanks to you both. Very useful. Off for now. Peace.
[19:01:29] np!
[19:01:53] raidex: The API provides a "page image" for many pages. See https://en.wikipedia.org/w/index.php?title=Apple&action=info for an example, and https://en.wikipedia.org//w/api.php?action=query&prop=pageprops&format=json&titles=Apple for the query. Once you identify your topics, this could be an easy way to find and download the corresponding images.
[19:03:14] https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&format=json&titles=Apple is the correct API query, although strangely the previous one also works
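For reference, the pageprops lookup suggested above can be scripted directly against the API. Below is a minimal Node.js sketch (Node 18+ for global fetch); the titles, the User-Agent string, and the one-request-per-second delay are illustrative placeholders rather than anything specified in the channel, and the exact pageprops key that carries the page image can vary between wikis.

```javascript
// Minimal sketch: query prop=pageprops for each article title and print
// whatever "page image" property comes back. Assumes Node 18+ (global fetch).
// TITLES and the User-Agent string are placeholders -- use your own.
const API = 'https://en.wikipedia.org/w/api.php';
const UA = 'image-sample-research/0.1 (contact: you@example.org)'; // identify your client
const TITLES = ['Apple', 'Banana']; // hypothetical sample of article titles

async function pageProps(title) {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'pageprops',
    titles: title,
    format: 'json',
  });
  const res = await fetch(`${API}?${params}`, { headers: { 'User-Agent': UA } });
  const data = await res.json();
  const page = Object.values(data.query.pages)[0];
  // The image filename tends to appear under a key like "page_image" or
  // "page_image_free", depending on the wiki's configuration.
  return page.pageprops || {};
}

(async () => {
  for (const title of TITLES) {
    console.log(title, await pageProps(title));
    await new Promise((r) => setTimeout(r, 1000)); // crawl slowly: one request per second
  }
})();
```

Once a filename is known, the file URL itself can typically be resolved with a follow-up prop=imageinfo&iiprop=url query against Commons.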
[19:38:02] Has anyone played with similarity measures to compare revisions of Wikipedia articles? I'm looking at several metrics (Damerau–Levenshtein, Hamming, Lee, Jaccard) but either they require sets of equal length, or they were designed for small sets/strings and don't apply well to longer texts.
[19:44:29] guillom: are you looking for research papers, or code to figure out changes?
[19:44:44] guillom, didn't halfak?
[19:45:22] there's also http://files.grouplens.org/papers/ekstrand-wikisym09.pdf
[19:45:35] and then there's halfak's work on diffing and stuff
[19:45:47] Nettrom: I'm looking for a methodology / metric that I could apply. So a research paper about the metric would be good, or just references to a specific metric. Then I can figure out if there's an existing implementation.
[19:45:59] Ironholds: How dare he go on vacation!
[19:46:13] thanks Nettrom; looking now :)
[19:46:28] I thought Ironholds was substitute-Halfak while he's away ;)
[19:46:28] guillom, inorite
[19:46:32] guillom: yw!
[19:46:33] Nettrom, gee!
[19:46:48] I'd love to help but I've gotta bike to the lab
[19:46:53] I know I can identify reverts using the sha1, but I want something that's more granular.
[19:47:33] guillom: I'm browsing through the Ekstrand paper now, chapter 6 seems like something you want to look at
[19:47:53] Thanks :)
[19:48:21] Cosine similarity; interesting.
[19:49:17] yeah, for document comparison, cosine similarity isn't uncommon
[19:49:33] AFAIK
[19:50:08] And even better, npm has lots of modules for cosine distance / similarity. Woot!
[19:51:25] * guillom --> annual review.
[19:51:33] Thanks again Nettrom!
[19:53:44] yw, glad I could help
[22:06:51] Hmm... I'm trying to install scipy in my virtual environment on Tool Labs.
[22:06:57] It's not working :(
[22:22:12] harej: hi
[22:22:16] harej: set up your virtualenv with
[22:22:22] --include-site-packages
[22:22:24] or equivalent
[22:22:33] harej: you'll get scipy by default
[22:22:37] not python3 tho
[22:23:04] I think it's a dependency thing. I just need to make sure to set up all the dependencies first.
[22:40:43] downside: you'll have to use scipy
[22:40:49] why do you need scipy, when you have R? :)
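As a pointer back to the cosine-similarity discussion above: a dependency-free sketch of what those npm modules compute, i.e. bag-of-words cosine similarity between two revision texts. The whitespace tokenizer and the example strings are deliberately naive placeholders; real revisions would want wikitext stripped and possibly tf-idf weighting first.

```javascript
// Minimal sketch of bag-of-words cosine similarity between two revisions.
function termCounts(text) {
  const counts = new Map();
  for (const token of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  return counts;
}

function cosineSimilarity(a, b) {
  const ca = termCounts(a);
  const cb = termCounts(b);
  let dot = 0;
  for (const [term, n] of ca) dot += n * (cb.get(term) || 0);
  const norm = (c) => Math.sqrt([...c.values()].reduce((s, n) => s + n * n, 0));
  return dot / (norm(ca) * norm(cb) || 1); // 1 for identical term distributions, 0 for disjoint ones
}

// Example: compare two (toy) revisions of the same text
console.log(cosineSimilarity('the quick brown fox', 'the quick brown dog')); // 0.75
```

Unlike sha1-based revert detection, the score is graded between 0 and 1, which gives the more granular signal asked about above.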