[12:21:44] hey ;-)
[12:22:55] have you looked at the video copyright code yet?
[12:28:16] Hi DrTrigon DrTrigon_ jayvdb
[12:28:30] perfect timing ;-)
[12:28:34] Hi!
[12:28:46] :D
[12:28:54] I didn't have time today to do it on the recent files though :(
[12:29:23] Had to go out suddenly and just came back. I will do it in another 2-3 hrs and post results for that in Conpherence
[12:30:24] regarding the time: the other meeting today is at a bad time for me - any chance that we find a replacement?
[12:30:24] ok
[12:31:06] I haven't heard back from Ty. If he can make it, I'll talk with him, and then organise another meeting with us all
[12:31:26] Tomorrow same time or the day after same time is good for me.
[12:31:39] that would be perfect - if not possible, ping me and I'll try to follow...
[12:31:41] I'm not available on Mon-Wed at that time though
[12:32:18] ok. AbdealiJK is that time today good for you?
[12:32:24] Yes
[12:32:24] can we share a Google calendar with availability? or a Doodle?
[12:32:26] ok
[12:32:59] if this is of any use?
[12:33:02] If Ty can't make it in a few hours, a Doodle would be good
[12:33:19] perfect - so let's see what happens
[12:33:32] are you guys fine? do we want to start?
[12:33:40] ok, do we intend to Skype now?
[12:34:06] I think IRC is fine - not much to discuss this week
[12:34:14] But anything's good for me
[12:34:15] no opinion
[12:34:41] jayvdb, which would you prefer?
[12:35:56] maybe try with IRC first and then a quick Skype for anything that we want to chat about
[12:36:08] Alrighto
[12:36:15] alrighty
[12:36:32] So, regarding updates - I looked at EXIF data a bit more to see if we can use any more of it
[12:37:14] I found that we can detect whether an image is a screenshot using one bit of information (if "gnome-screenshot" was in some EXIF data, it was a screenshot from GNOME's default tool)
[12:37:40] I tried checking other tools too - cheese, digikam, imagemagick, etc.
But none of them left any trace in the EXIF data
[12:38:13] very useful
[12:38:28] screenshots can be copyrighted, and thus need to be deleted
[12:38:29] I then also tried to add more software but found popular things like matplotlib, octave, matlab, etc. don't touch the EXIF data :(
[12:38:54] jayvdb, nod. Plus there's a Category:Screenshot where we can move it to get the right people looking at it
[12:39:02] yup
[12:39:08] jayvdb: > 3000 files in cat screenshots
[12:40:17] AbdealiJK: but you have some detection for matlab, right?
[12:40:23] I see "Number of grey shades used: 246" - I guess that can be used to determine if an image is b&w only
[12:41:04] DrTrigon_, right you are. Sorry - matlab was written wrongly in that message
[12:41:16] :)
[12:41:26] jayvdb, is there a reason to detect BnW images? I found that the category didn't have all sorts of BnW images
[12:41:37] there is a separate category for browns only, like https://commons.wikimedia.org/w/index.php?oldid=198123175
[12:41:38] yeah, it's a pity not all software uses the EXIF tags ...
[12:42:11] https://commons.wikimedia.org/wiki/Category:Black_and_white_photographs
[12:42:28] and another: https://commons.wikimedia.org/wiki/Category:Monochrome_photographs
[12:43:03] Ahh, ok. I was just looking at Category:Black and white
[12:43:12] your tool should be able to process all of the files in https://commons.wikimedia.org/wiki/Category:Monochrome_photographs and move them into a subcategory
[12:43:49] (straightforward...)
[12:44:08] Not very sure if it is straightforward - but yes, will look into it
[12:44:14] hehe
[12:44:18] ;))
[12:44:18] Especially the sepia color checking
[12:44:41] You don't have to do ALL subcats PERFECTLY
[12:44:47] Cool, will do that :+1:
[12:44:59] DrTrigon_, nod
[12:45:24] I think converting RGB -> HSV and then thresholding the hue should work ...
[12:45:44] or peak detection?
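The "RGB -> HSV, then threshold the hue" idea raised above could be sketched roughly as follows. This is only an illustrative stand-in, not the bot's actual code: the function name and all thresholds are assumptions, and real tuning against Commons images would be needed.

```python
import colorsys

# Sketch of the "RGB -> HSV, threshold hue/saturation" idea from the chat.
# Thresholds below are illustrative guesses, not tuned values.
def classify_monochrome(pixels, sat_threshold=0.15, hue_spread=0.05):
    """pixels: iterable of (r, g, b) tuples in 0-255.

    Near-zero saturation everywhere suggests plain black & white;
    otherwise a very narrow hue spread suggests a toned image (e.g. sepia).
    """
    hues, sats = [], []
    for r, g, b in pixels:
        h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hues.append(h)
        sats.append(s)
    if sum(sats) / len(sats) < sat_threshold:
        return "black and white"
    # Crude stand-in for "peak detection": all hues in one narrow band.
    if max(hues) - min(hues) < hue_spread:
        return "toned (possibly sepia)"
    return "color"
```

A histogram-based peak detector over the hue channel would be the more robust version of the last check; the min/max spread is just the simplest form of the same idea.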
[12:45:47] * AbdealiJK will try it out later
[12:46:03] Peak detection to find whether an image is sepia colored?
[12:46:03] maybe also with summing/integrating?
[12:46:35] regarding models, it would be useful to publish a list of makes/models that you can't categorise, because the category doesn't exist
[12:46:52] someone probably needs to create a category, or create a redirect to an existing category
[12:47:50] What do you mean by publish a list?
[12:47:51] jayvdb: yes, important point! good note!
[12:48:03] e.g. if there is a color cat missing...
[12:48:19] ...like you have red, green but blue is missing
[12:48:55] jayvdb: can we just use/add these cats?
[12:49:12] not too extensively of course...
[12:49:34] yes, possibly, but I am worried about junk EXIF data resulting in junk categories.
[12:50:07] jayvdb, I think DrTrigon_ meant can we add it manually when we find it ourselves, rather than making a list
[12:50:14] yes, we should only do that in very restricted cases
[12:50:14] maybe have a threshold - if the bot has seen the same make/model 3 times, add the category to the file page even if the category does not exist
[12:50:49] Nooo, the bot should never add it.
[12:50:54] AbdealiJK: some of them are not very disputed - but some will be - thus a list beforehand might be useful
[12:50:57] then it is easy for a human reviewer to create the category
[12:51:24] Ah - right. If there's a human reviewer then it's fine.
[12:51:26] AbdealiJK, the bot shouldn't create the category, but it can add `[[Category:Foo]]` even if the category doesn't exist
[12:51:27] the bot checks whether it exists, and if so, starts to populate
[12:51:51] jayvdb: that's what I consider "creating"
[12:51:58] ...
;))
[12:52:15] Just a note - even if the category doesn't exist, the bot does print it (just without the [[ ]] for the link)
[12:52:20] ya, it does implicitly create a category
[12:52:48] ok, got it
[12:52:54] AbdealiJK, but to reach 5% (or another goal %), you need to modify the file pages
[12:53:19] so linking to categories, even if they do not exist, is still categorisation
[12:53:38] ok
[12:53:54] jayvdb: which is good?!?
[12:53:59] yup
[12:54:29] AbdealiJK: about "Camera/Scanner" ...
[12:54:38] DrTrigon_, yes?
[12:54:53] ... I think you can try to suggest categories ...
[12:55:11] ... again, be a bit careful and do only obvious cases ...
[12:55:21] ... that is better than nothing, right?
[12:55:31] I am suggesting categories, right?
[12:55:34] DrTrigon, that is what I was talking about above. The bot can add categories to the page, which is a suggestion
[12:55:35] I don't think I understand
[12:55:48] "Because of this, it cannot "suggest" categories that do not exist yet"
[12:55:55] Ahhh
[12:55:56] (maybe I got it wrong?)
[12:56:13] ok. Got what you meant
[12:56:21] we call them 'red categories'
[12:56:32] Based on the above discussion that it's ok to add red categories, I will add it.
[12:56:42] ok
[12:56:42] cool! ;)
[12:56:48] BTW: one issue here is that
[12:57:07] I cannot know whether it should be "Taken with " or "Scanned with "
[12:57:27] in case of doubt just " Hence we still cannot suggest whether it is "scanned" or "taken"
[12:57:58] nice info, but not the main part...
[12:58:14] ...as soon as you look up the software you should know it anyway.
[12:58:15] "Created with ..."?
[12:58:16] I think I'll just make "... with " so that the human who is checking can easily move that category and decide what "..." is
[12:58:33] "Created with ..." is for software, jayvdb - like ImageMagick, Matlab, etc.
[12:58:57] "Taken with ..." is for camera photographs
[12:58:57] a camera is 80% software these days ;-)
[12:59:13] yes, but a scanner or a generator?
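The "Taken with / Created with / Screenshots" suggestion logic being discussed could look something like the sketch below. Everything here is an assumption for illustration: the lookup sets are just examples named in the chat, the function name is made up, and the real bot's tables and category wording will differ.

```python
# Illustrative sketch only: map EXIF Make/Software values to a suggested
# Commons category. The sets below are examples from the discussion,
# not the bot's real lookup tables.
CAMERA_MAKES = {"canon", "nikon", "fujifilm"}
CREATION_SOFTWARE = {"imagemagick", "matlab", "gimp"}
SCREENSHOT_SOFTWARE = {"gnome-screenshot"}

def suggest_category(make=None, software=None):
    """Return a suggested category name, or None if nothing is known.

    The ambiguous camera-vs-scanner case from the chat falls through to
    None here; a red category like "Taken or scanned with ..." could be
    returned instead once the wording is decided.
    """
    sw = (software or "").lower()
    if any(tool in sw for tool in SCREENSHOT_SOFTWARE):
        return "Screenshots"
    if make and make.lower() in CAMERA_MAKES:
        return "Taken with %s" % make
    if sw in CREATION_SOFTWARE:
        return "Created with %s" % software
    return None
```

The split mirrors the chat's three buckets: screenshot tools, camera makes ("Taken with"), and image-creating software ("Created with"), with the undecided "Taken/Scanned with" case deliberately left open.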
[12:59:16] ;))
[12:59:19] I don't think the commons community would agree though ^
[12:59:38] photographers or coders?
[13:00:11] Ok, 1 moment. I'd like to pause as I find this is a very small issue
[13:00:25] I've already tested it for most of the big brands - Fujifilm, Nikon, Canon, etc.
[13:00:27] sure, go on!
[13:00:46] And so 80% of images should be categorized correctly
[13:01:08] ok. so not worth worrying about?
[13:01:20] We can increase it later by adding *some sort of* red links. But it's a minor detail and we can just do it later
[13:01:36] nod.
[13:01:47] agree, the code needs to be ready for that ...
[13:01:57] agree
[13:02:00] ... but adding it is some "minor" task for later.
[13:02:11] Maybe the red category can be "Taken/Scanned with "
[13:02:37] ^ interesting :)
[13:02:50] or "Taken or scanned with "
[13:02:54] or we ask on the village pump? ;)
[13:03:08] AbdealiJK: do you want to continue with the next point?
[13:03:09] I think we can just choose one. And if someone comments with a better option, we can just use cat-a-lot to move stuff easily
[13:03:20] DrTrigon_, yes
[13:03:21] nod
[13:03:22] nod
[13:03:49] now regarding OpenStreetMap vs Google Maps, we really should be using OpenStreetMap if we can
[13:03:51] So, I tried creating an "official" docker file based on last week's conversation
[13:04:10] nod
[13:04:14] AH, ok - skipping the docker point then
[13:04:18] how much worse is the OSM data?
[13:04:24] I tried using OSM. I prefer OSM compared to Google too
[13:04:25] ???
[13:04:59] So the issue is that OSM was giving "New No 4, Street ABC, Town, City, Country, PIN" as a string
[13:05:14] and you need to parse it?
[13:05:26] Can't parse it - it has no standard format
[13:05:45] hmm. sounds like the geopy package is not helpful
[13:05:58] * AbdealiJK checks the OSM API
[13:06:30] looking for the python package I switched to recently ...
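Since the discussion settled on the bot adding red category links (while never creating the category page itself), the page-side edit could be as small as the sketch below. The function name is made up; the real bot would do the page write through pywikibot, and a human reviewer later creates any category that does not exist yet.

```python
# Minimal sketch of appending a (possibly red) category link to a file
# page's wikitext. The bot only links the category; creating the
# category page itself is left to a human reviewer.
def add_category(wikitext, category):
    link = "[[Category:%s]]" % category
    if link in wikitext:
        return wikitext  # already categorised, nothing to do
    return wikitext.rstrip("\n") + "\n" + link + "\n"
```

The duplicate check matters because, as noted above, linking a category is itself the categorisation that counts toward the 5% goal, and re-adding an existing link would just produce a null edit.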
[13:06:44] AH, you've tried it already - nice
[13:07:18] jayvdb, but even with that, OSM will still have a limit
[13:07:47] the limit isn't the problem
[13:08:18] wikimedia has a copy of the OSM data, so we can host our own instance of any query service if we need it
[13:08:33] :o nice
[13:08:34] ahhhh!
[13:08:41] (and may already have a copy of the Nominatim service running somewhere on tools)
[13:08:56] Wait! so is OSM a part of wikimedia?
[13:09:00] Or is this a collaboration?
[13:09:40] not part ..., right?
[13:09:47] https://github.com/damianbraun/nominatim
[13:09:56] https://pypi.python.org/pypi/nominatim
[13:10:10] this is a collaboration
[13:10:39] Wikimedia uses OSM data a lot, so we set up our own database and tile server
[13:10:48] I can use `nominatim` +1
[13:10:52] I knew once that wiki draws a lot of data from OSM... :)
[13:10:54] I see jayvdb
[13:11:00] cool!
[13:11:09] and replicate the data into our local clone fairly regularly
[13:12:47] Wikimedia also has a license for MaxMind GeoIP, but I have never used that, so I don't know how useful it is. And I am not sure whether Wikimedia Foundation will be able to share their license for this purpose (probably, but not 100%)
[13:13:20] AbdealiJK: briefly coming back to docker - sorry for the time! - do you have anything you can give me so that I can experiment around a bit?
[13:13:37] jayvdb, is there any IRC channel I can ask about the OSM + wikimedia stuff?
[13:13:50] I know the MaxMind license is basically 'unlimited'
[13:14:22] jayvdb: IF we can avoid using commercial services it would be nice ...
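On the "can't parse the address string" problem: Nominatim's reverse-geocoding JSON also carries a structured `address` object (when address details are requested), so the display string never needs parsing. A sketch of picking the most specific usable place name from such a dict — the key-preference order and function name are assumptions for illustration:

```python
# Sketch: pick the most specific place name from the structured `address`
# dict in Nominatim's reverse-geocoding response, instead of parsing the
# free-form display string. The preference order is an illustrative guess.
ADDRESS_KEYS = ("attraction", "building", "suburb", "city", "town",
                "village", "state", "country")

def finest_place(address):
    """address: dict of Nominatim address components, finest key wins."""
    for key in ADDRESS_KEYS:
        if key in address:
            return address[key]
    return None
```

This also matches the granularity point raised later in the log: the same code returns "Chennai" when a city is known but degrades gracefully to "India" when only the country is.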
[13:14:35] ...of course only if we don't lose quality
[13:14:40] https://wikitech.wikimedia.org/wiki/OSM_Tileserver
[13:15:08] https://lists.wikimedia.org/pipermail/maps-l/
[13:15:26] DrTrigon_ I'm updating my latest dockerfiles
[13:15:48] https://www.mediawiki.org/wiki/Maps
[13:15:55] just put it somewhere and drop me a link
[13:16:02] https://maps.wikimedia.org/#4/40.75/-73.96
[13:16:41] DrTrigon_, in https://github.com/AbdealiJK/file-metadata/tree/ajk/docker you can see Dockerfile.ubuntu - that would be what you'd be interested in
[13:16:55] There's also Dockerfile.centos if you prefer that
[13:17:14] whatever you think works better - can use anything inside the VM
[13:17:56] https://tools.wmflabs.org/locator-tool/index.html#/
[13:18:00] I think Ubuntu would be easier.
[13:18:08] ok
[13:18:16] maps.wikimedia.org is really cool jayvdb
[13:20:06] agree - thanks for that!
[13:20:23] open questions here? do we want to go through the meeting agenda?
[13:21:28] AbdealiJK, jayvdb: ^^^
[13:21:30] sure
[13:21:47] yes please
[13:22:36] so AbdealiJK, as I understand we get "categorization hints" (could not check the code yesterday...), right?
[13:23:28] Yes, that was done last week if I'm not mistaken
[13:23:51] I thought there were minor things open - even better! :)
[13:24:14] then I was thinking about how to define the goal ...
[13:24:33] ... would it be possible to start gathering and outputting stats now?
[13:24:48] in a nice table / overview
[13:25:05] DrTrigon_ what sort of stats are you thinking about?
[13:25:35] e.g. when you do a run for https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/Category_Male_faces ...
[13:26:00] I find automatically adding it is irritating - because the script gets a bit messy
[13:26:25] ... can you output a table at the end summarizing e.g. 1000 processed, 100 categorized, 10 issues, like ...
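The "1000 processed, 100 categorized" per-run summary requested above could be collected with a plain counter. A minimal sketch: the record shape (filename, list of categories added) is an assumed stand-in for whatever the bot actually logs per file.

```python
import collections

# Sketch of the per-run summary table discussed above; the record format
# is an assumption, not the bot's real log structure.
def summarise_run(records):
    """records: iterable of (filename, categories_added) pairs."""
    stats = collections.Counter()
    for _filename, cats in records:
        stats["processed"] += 1
        if cats:
            stats["categorized"] += 1
    return stats
```

The "10 issues" column (uncertain categorizations) would need the bot to tag uncertain results explicitly, which is exactly the sure-vs-experimental distinction deferred later in this log.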
[13:26:27] Stats need structured data, and right now that is just dumping wiki lines
[13:26:38] something you did in the beginning, but as a table
[13:27:01] I can do "1000 processed" and "100 categorized"
[13:27:05] it does not need to be automatic ...
[13:27:09] "10 issues" cannot be done easily, right?
[13:27:21] ... maybe in a separate script ...?
[13:28:02] DrTrigon_ I tried that in the beginning too. But it seemed unnecessary ... I would need to parse the wikicode that the first script writes and again process data, etc.
[13:28:04] by "issues" I mean "uncertain categorization"
[13:28:12] I found it was just easier to do manually
[13:28:42] it's not about needing that in the bot - it's about getting an idea of how it performs overall
[13:29:42] or maybe a table, with the number of categories added per file, i.e. 5+ cats added: 10 images; 4+ cats added: 50 images; 3+ cats added: 100 images ...
[13:30:20] that can be done
[13:30:30] we need a tool to keep control of the bot quality
[13:30:41] I think uncertain categories can be done too
[13:30:50] and that has 2 parts; how well the bot finds data from files ...
[13:31:07] ... and how well it can relate this info to categories on commons
[13:31:56] right. we don't want a bot adding "Black and white" and "photograph" when it can be adding only "Black and white photographs"
[13:32:16] the most valuable categorisation is leaf categories in the category tree
[13:33:55] perhaps that is a way to decide on quality categories: the bot must add three categories that are leaf categories
[13:34:01] ^ just an example
[13:34:07] so we need to find a way to measure performance of (A) algorithms like face detection, etc. and (B) relating that info to commons categorization
[13:34:35] if it does that, for 5% of the images, the tool is very useful
[13:35:28] of categorized images, right?
[13:35:47] (that could be a goal... correct)
[13:36:03] 'three leaf categories' is probably too hard.
maybe only 'one leaf category, and two non-leaf categories'
[13:36:48] ok, I need to go soon
[13:37:08] I might go for 1 leaf cat in 5% as MVP and 3 leaf cats in 5% as bulls-eye
[13:37:16] :)
[13:37:29] nod
[13:37:43] AbdealiJK: what do you think about measuring performance?
[13:37:55] 3 leaf is probably not very feasible. Currently, the only analysis which gives us a leaf is Scanned with/Created with/Taken with
[13:38:17] and the black and white photographs, right?
[13:38:20] geo should give you a leaf cat
[13:38:23] Nope
[13:38:33] jayvdb, not really, the place name can be of any granularity ..
[13:39:00] For example, I can detect that it was in India, Chennai. But maybe not that it was in IIT Madras (my college)
[13:39:03] AbdealiJK: sorry, the monochromes
[13:39:08] geo should give you a leaf cat, or a red leaf cat
[13:39:29] ok. going now. I'll be around in 15 mins
[13:40:17] DrTrigon_, https://commons.wikimedia.org/wiki/Category:Black_and_white_photographs has lots of subcats
[13:40:42] ;) the "Nope" was not for me...
[13:40:57] so do you think measuring performance is feasible?
[13:41:44] jayvdb: bye - see ya!
[13:42:04] AbdealiJK: so do you think measuring performance is feasible?
[13:42:04] DrTrigon_ I am not sure how it can be done actually ...
[13:42:39] ^ The measuring performance
[13:43:40] maybe the first step is to summarize info per bot run - then we can think about merging all those together
[13:44:29] nod. I will try creating a summary report based on what jayvdb mentioned (number of images with 3 cats, etc.)
[13:44:35] Is there any other stat?
[13:45:17] I would include anything you want to learn about or that does not work as you expect ...
[13:45:34] ... you mentioned face detection to be worse than you expected, etc.
[13:46:06] In the stats?
[13:47:04] they are mostly for us developers, right? we want to know what parts of the code can or need to be improved...
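The "5+ cats added: 10 images; 4+ cats added: 50 images" table jayvdb proposed is a cumulative histogram over the number of categories each file received. A sketch, with an assumed function name and input shape:

```python
import collections

# Sketch of the cumulative "N+ cats added: M images" table suggested in
# the chat: for each threshold N, count files that got at least N
# categories in the run.
def cats_added_histogram(cats_per_file, max_n=5):
    """cats_per_file: list of ints (categories added to each file)."""
    counts = collections.Counter(cats_per_file)
    table = {}
    for n in range(max_n, 0, -1):
        table["%d+" % n] = sum(m for c, m in counts.items() if c >= n)
    return table
```

Per-run tables like this could later be merged across runs, which is the "summarize per bot run, then merge" plan from the end of this discussion.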
[13:47:33] also I would output info like 5 sure categorizations, 3 experimental ones (relates to the face detection etc.)
[13:47:47] Ok, so I think "Number of distinct categories"
[13:47:52] ...with respect to the 2 run modes
[13:47:58] "Number of files for each category"
[13:48:21] But we currently don't have a distinction for experimental vs sure - so, I'd like to procrastinate on that if possible
[13:48:52] that is ok - but keep it in mind, for design etc.
[13:49:02] yep
[13:49:25] also it should be clear why they are distinct or categorized etc.
[13:49:57] I didn't follow your last message, could you elaborate?
[13:50:30] if you do stats like "Number of distinct categories" we also need to know WHY they are distinct, e.g.
[13:50:48] what result / algorithm made the bot decide like that
[13:51:09] (was it due to face detection or anything else etc.)
[13:51:32] Apologies. Let me clarify what I meant by "number of distinct categories"
[13:51:55] please :)
[13:52:04] I meant overall in the whole batch, how many different categories were used
[13:53:20] I see.
[13:54:55] DrTrigon_ ok. Think we can call this to a close? Let me begin with the current stats we've mentioned and see if we need more over time?
[13:55:14] And add appropriate stats as needed/feasible
[13:55:59] Yes, let's do that. Once we see what we are talking of, it will get more clear anyway. :)
[13:56:08] nod
[13:56:13] (clearer?)
[13:56:14] back
[13:56:24] nice!
[13:56:35] We were just planning on ending :)
[13:56:37] so AbdealiJK: do you want to add anything?
[13:56:44] Nothing from me - no
[13:57:10] jayvdb: anything to add? questions?
[13:57:29] a list of distinct categories suggested (and then 'added' when the bot writes) would be useful
[13:57:35] Minor update: I've begun running the script on new images.
Will update with results when done
[13:57:55] nice
[13:58:23] I'll be around until late tonight, so post an update when you have some useful % stats
[13:58:27] AbdealiJK: wanted to add: the Bulk test, Upstream and Software results are nice!
[13:58:56] ...and I got a very strange error: http://dpaste.com/3JA5RK3
[13:59:05] that's all from my side!
[13:59:22] DrTrigon_ I answered that in the Conpherence, I believe
[13:59:25] that's all from me
[13:59:46] * DrTrigon_ did not refresh Conpherence yet...
[14:00:17] you wrote it as pwb.py already... I see... will check! thanks!
[14:00:42] so thanks and have a nice evening folks!
[15:02:35] Is everyone up?
[15:09:46] hi
[15:10:01] hey
[15:10:30] ping DrTrigon DrTrigon2 ;-)
[15:10:47] jayvdb: Yeessss? :)
[15:10:55] AbdealiJK just left this room; I've pinged him on Skype
[15:11:07] You wanna meet now?
[15:11:22] if you are free, TyLandercasper is here and on Skype
[15:11:44] I'm kind of about to leave... let me check...
[15:12:25] ok, have about 1.5 hrs - let's go... ;)
[15:16:59] TyLandercasper, Hi!
[15:17:15] hello!
[15:17:30] Is there a meeting going on? Or is it postponed?
[15:17:43] they're trying to add you right now
[15:34:47] I fell out of the call...
[15:34:56] me too
[16:05:33] i think I've lost everyone
[16:33:17] AbdealiJK: https://phabricator.wikimedia.org/Z441 running `$ python bulk.py -search:'eth-bib' -logname:ETH-Bib` currently ;)