[17:54:17] * halfak travels from University to Home
[18:04:02] Ohhhh, we're halfak there
[18:55:46] o/ apergos
[18:55:52] Do we have any dumps of commons images?
[18:55:52] hey
[18:55:55] no
[18:55:59] we did once upon a time
[18:56:05] OK. Figured it would be HUGE
[18:56:06] but we don't right now
[18:56:08] yeah
[18:56:10] actually
[18:56:11] yuuuge
[18:56:12] I want to download 2m images.
[18:56:14] tbh we never dumped commons
[18:56:19] So I guess the API it is
[18:56:22] what we did was dump all images in use per wiki
[18:56:25] except commons
[18:56:30] that's just too frickin huge
[18:56:35] as in hyuge
[18:56:58] is wikiscan on labs servers?
[18:57:05] ello apergos
[18:57:19] still thinking of getting you two boxes of belgian chocolates ;)
[18:57:25] hahaha
[18:57:39] well we have a local belgian chocolate supplier interestingly enough
[18:57:48] I discovered them some months ago
[18:57:51] mmmm!
[18:57:51] pft exports :p
[18:57:56] yeah well
[18:57:59] the real thing is hand delivered
[18:58:04] I am sure it is
[18:58:07] :D
[18:58:30] It is strange to live in a country where Pringles are not an import. :/
[19:00:02] or the origin of ISIS chocolates -_-
[19:01:00] I wonder how fast I can query for files from the API. And whether en.wikipedia.org/w/api.php will resolve to commons DB.
[19:01:31] halfak: what are you trying to do?
[19:02:31] Get 2m images included on Wikipedia pages.
[19:02:41] full res ones?
[19:02:43] Some researchers I am working with want to analyze them
[19:02:43] random ones?
[19:02:46] I guess so
[19:02:56] They'll have a list generated from random walks of Wikipedia.
[19:03:10] ok, so given a list of image names you want to download them all
[19:04:24] halfak: if you throttle that to one per second without concurrency it'll take about 24d
[20:44:27] YuviPanda, yeah. Was thinking that was way too long.
[20:44:34] But maybe there's not much we can do about that.
[20:45:34] halfak: yeah
[20:45:56] Is 1 request per second a hard limit?
[20:45:59] halfak: at best it'll take a couple of weeks
[20:46:18] halfak: I think that's my vague recommendation. As long as you don't hit it in parallel it'll be ok
[20:46:28] halfak: if you co-ordinate with ops you can probably get away with a larger limit
[20:46:57] Yeah. Was thinking I'd hit with 4 processes in parallel to cut the job down to ~ a week.
[20:47:09] probably, yeah
[20:47:16] could do with a notification + ok from ops
[20:47:26] also remember you don't need to hit the API to find full path to original image source!
[20:47:36] you can calculate the full path to the image from just the image name
[20:47:41] Yeah. I'm trying to figure out the best way to do that
[20:47:58] I wrote this code in Java for the android app but that's easily translatable
[20:48:04] halfak: let me find it
[20:51:39] halfak: for a file named 'File:Coesfeld,_Lette,_Windmühle_--_2015_--_5768.jpg', you first make a md5 hash of the name sans namespace ('Coesfeld,_Lette,_Windmühle_--_2015_--_5768.jpg'), which is 4b497d8c8e7bd591ce1679ec8db167a7
[20:52:00] halfak: the full URL is https://upload.wikimedia.org/wikipedia/commons/4/4b/Coesfeld%2C_Lette%2C_Windm%C3%BChle_--_2015_--_5768.jpg
[20:52:13] the '4' and '4b' are the first and first two chars of the md5 hash
[20:52:24] Cool!
[20:52:24] and followed by the name of the file
[20:52:31] halfak: the prefix is static
[20:52:43] halfak: so you can use this to determine the full path of the original image
[20:53:18] Awesome.
[20:53:20] * halfak copy-pastes
[20:53:25] For future reference
[20:53:34] :)
[20:53:58] halfak: might change within the next few years, btw. but there'll be announcements etc
[20:54:56] If this research code is running in a few years, I accept the consequences ;)
[20:55:54] halfak: :) https://www.mediawiki.org/wiki/API:Etiquette#Request_limit basically says 'if you are making serial requests it is all fine, no need to ask'
[20:56:27] halfak: remember to ask them to set a very easily identifiable User Agent, including a contact email address, perhaps mention your name in it :)
[20:56:45] am going to go afk now to try to fix my phone
[20:56:46] +1
[20:56:47] \o/
[20:56:49] ttyl
[20:56:49] o/
[21:45:33] hey YuviPanda (or halfak, probably). What param do you pass on a Jupyter notebook URL to get the raw file for download?
[21:45:53] Sorry. No idea there
[21:46:35] Yeah, thanks anyway halfak. I swear Yuvi showed me. ?format=raw doesn't work though.
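
For future reference, the path calculation described at 20:51-20:52 and the job-length arithmetic from 19:04 and 20:46 might be sketched in Python roughly as follows. This is not the Java code mentioned in the log. The function names commons_original_url and estimated_days are made up for illustration, and replacing spaces with underscores before hashing is an assumption rather than something stated above. Only the static prefix, the "name sans namespace" rule and the first-one/first-two hash characters come from the conversation.

```python
import hashlib
from urllib.parse import quote

# Static prefix, copied from the URL quoted in the log.
UPLOAD_PREFIX = "https://upload.wikimedia.org/wikipedia/commons"


def commons_original_url(title: str) -> str:
    """Build the original-file URL for a title such as 'File:Foo.jpg'."""
    # Drop the 'File:' namespace if present ("name sans namespace").
    name = title.split(":", 1)[1] if title.startswith("File:") else title
    # Assumption (not from the log): spaces are written as underscores.
    name = name.replace(" ", "_")
    # MD5 of the UTF-8 encoded name, as a lowercase hex string.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # Directory levels are the first one and first two hex characters.
    return f"{UPLOAD_PREFIX}/{digest[0]}/{digest[:2]}/{quote(name, safe='')}"


def estimated_days(n_files: int, per_second: float = 1.0, workers: int = 1) -> float:
    """Rough job length in days at per_second requests for each worker."""
    return n_files / (per_second * workers) / 86_400


if __name__ == "__main__":
    print(commons_original_url("File:Coesfeld,_Lette,_Windmühle_--_2015_--_5768.jpg"))
    print(round(estimated_days(2_000_000), 1))             # one serial client
    print(round(estimated_days(2_000_000, workers=4), 1))  # 4 parallel processes
```

At one request per second, 2m files works out to a little over 23 days, in line with the "about 24d" figure above, and four parallel processes bring that to just under 6 days. Any real download loop would still want the identifiable User Agent and the heads-up to ops mentioned in the log.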