[12:19:00] Hello
[12:48:29] @AbdealiJK: When does the meeting start?
[12:49:03] DrTrigon: Normally, a while back. But jayvdb mentioned on Phabricator that he'd be late
[12:49:15] Would you like to wait for him, or begin?
[12:51:00] @AbdealiJK: I mean, how late will he be? Any idea? ;) We can start whenever you want!
[12:51:18] I have no clue, he did not mention that!
[12:51:34] I assumed a few minutes >_<
[12:52:25] Let's begin - he can join?
[12:53:17] DrTrigon: The major updates for this week are the new logs I mentioned on Monday. I also made a manual and an auto script (wikibot-filemeta-manual, wikibot-filemeta-auto)
[12:55:50] AbdealiJK: nod, go on...
[12:56:11] Have you checked the logs?
[12:56:25] The error rate and so on?
[12:56:36] Not yet ... anything special there?
[12:56:59] :P There's nothing much to say about the updates, just finishing things as we'd discussed
[12:57:33] Do you reach the MVP requirements easily? Or is it hard?
[12:57:43] DrTrigon: Well, I'm just a little confused about the interpretation of what a graphic is and isn't, what a face is and isn't, etc. Hence I wanted you all to check the error rates
[12:58:08] Which day?
[12:58:18] In general, not for a specific day
[12:58:27] So give me some time ...
[12:58:31] Nod
[12:58:59] So, IMO faces would be around a 10% error rate. Graphics and Black and White photographs may not be 10%, and could be up to 15%
[12:59:09] As I mentioned, Barcodes are 66% because there are very few of them
[12:59:43] But the thing is I can remove Barcodes, and reduce the thresholds for Black and White to be very, very small without affecting things
[13:00:12] Because Black and White and Graphics are both in the "Type" bucket, and that's 100% - it won't change if the number of these categories decreases
[13:00:20] So do that for the automatic mode and keep it as it was for the manual
[13:00:33] ^ Exactly. I did that
[13:00:39] Perfect!
[13:00:44] Location bucket is always accurate ... it's more accurate than humans.
[13:00:59] ? What do you mean?
[13:01:13] Well, the location bucket uses the GPS coordinates
[13:01:25] So, it's as accurate as it gets
[13:01:41] The best a human can do is to use the same coordinates?
[13:01:49] Then I understand.
[13:01:50] Yep
[13:01:57] So I like File:2016 07 23 D-R. DUFOUR.jpg
[13:02:14] What is [[:]]?
[13:02:16] opening
[13:02:27] Ah, those are files that have been removed after I did the analysis
[13:02:42] * DrTrigon looks at https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/uncatimages/20160722
[13:03:07] DrTrigon: The uncatimages is not updated with the new 100% types bucket, because of the API error
[13:03:26] That's running as we speak ... with higher API limits set
[13:04:54] aa, sorry now I understand https://commons.wikimedia.org/wiki/User:AbdealiJK/file-metadata/GSoC2016#Final_Results ... let me check again...
[13:05:55] * DrTrigon looking at https://commons.wikimedia.org/wiki/User:AbdealiJKTravis/logs/newimages/20160720
[13:06:05] error: File:Tree of 40 Fruit - tree 071 diagram.jpg
[13:06:39] I like File:Chata wuja Tomasza page040.jpg but let's ask jayvdb for his opinion
[13:07:20] very good is: File:Rody Eynas.jpg
[13:08:08] correct and wrong: File:Juventus FC - Enschede, 1971 - Francesco Morini (cropped).jpg
[13:08:37] ?
[13:08:44] Why is it wrong?
[13:09:14] between the legs is supposed to be a face?
[13:09:49] Ah, no - ignore the bounding boxes
[13:10:11] They show the haarcascade detections too. But the haarcascade does not detect 2 other features (eyes/ears/etc.) there, so it's not actually used
[13:10:48] a cool pro feature: green boxes for used ones and red for the unused
[13:11:20] I also like: File:Лазар Баранович – Меч Духовний (1666).jpg
[13:11:56] error: File:Fälö by.jpg
[13:12:20] agree
[13:13:06] error: File:Gymnogobius urotaenia ukigori.jpg
[13:13:54] error (or not? ;) File:Debrecen Segner tér troli ZiU-9 323; Ganz-Solaris 372.jpg
[13:14:42] nice one: File:Nepalese Obverse of 10 ₨ (2008).jpg
[13:15:05] and again: File:Les Ternes église modillons (1).jpg
[13:15:30] "Human" is wrong for them...
[13:16:24] That's a little questionable
[13:19:10] @AbdealiJK: Have to check with John whether he wants to go through every single one and count, and then the question is how to count (image-wise, face-wise, ... ;) or whether he is fine with taking some probes/samples over a random set of e.g. 100 images, etc...
[13:19:19] checking graphics now...
[13:19:22] error: File:Chrzanów, stary dom Świętokrzyska 52.jpg
[13:19:55] error: File:WMCH Staff meeting Lugano 2016 DSC00175 05.jpg
[13:19:56] DrTrigon: What was the approx. error though?
[13:19:59] For faces
[13:20:51] you mean in the MVP
[13:21:19] No, I mean in the 22 Jul log file
[13:21:30] What according to you was the error rate of "Human Faces"
[13:22:31] let me calc properly...
[13:35:50] @AbdealiJK: went over about 50% of that page: 140 correct, 11 wrong
[13:36:13] 9 either deleted or non-human face
[13:36:31] That's 7%
[13:36:32] Cool
[13:36:55] agree
[13:38:17] (let's hope the variation in such a sample is less than +/-3 wrong ones)
[13:38:28] I had checked 25, 26, 27 July and they had given me 8-10% approx. I did rough calculations in my head though
[13:38:52] to me this is perfectly fine!
[13:39:00] let me have a look at graphics...
[13:41:44] DrTrigon: Is https://commons.wikimedia.org/wiki/File:Herrerasaurus_NT_small.jpg a graphic or not in your opinion?
[13:41:50] Also, how are you counting these things?
[13:42:02] i.e. are you going file by file and checking the categories?
[13:43:01] https://commons.wikimedia.org/wiki/File:Herrerasaurus_NT_small.jpg is graphics IMO
[13:45:54] And https://commons.wikimedia.org/wiki/File:Male_Kankuamo_(Labelled).jpg and https://commons.wikimedia.org/wiki/File:Luth_player-Sb_7899-IMG_0897-white.jpg ?
[13:47:59] @AbdealiJK: 1. we need to meet the minimal categorization percentages defined in the MVP; if we do not, we have to check all files for false negatives (e.g. missed graphics or faces etc.); if we meet the MVP, I do trust your numbers - 2. we need to check the false positives manually (checking categories only, not all files) as these will be the ones that most users might complain about, also regarding the bot request
[13:49:28] @DrTrigon I agree. But we already meet Point 1, right?
[13:49:52] And what we're doing right now is checking Point 2 - correct?
[13:53:32] Yes - so that was the answer to your question whether I am checking every file or just the categories
[13:53:45] Ah.
[13:53:46] checked about 50% of the page for graphics:
[13:53:51] correct: 91
[13:53:54] wrong: 14
[13:54:12] 15%
[13:54:22] Hmm, that's 15%. My number for the whole page gives me ~13%
[13:54:44] Just a moment though, I'm trying to figure out how you're actually counting these
[13:54:54] there are quite a number of pictures in there - are they so similar to graphics?
[13:55:48] DrTrigon: Well, Graphics is checked using the edges and so on. You can check the EdgeRatio, NumberOfGreyShades, PercentFrequentColors mentioned on the left to understand why it would have been categorized as graphics
[13:55:49] IMO: your last 2 examples are pictures
[13:56:11] aa, yes of course... wait...
[13:57:54] The condition for Graphic according to the bot is EdgeRatio < 2 OR EdgeRatio < 0.13
[13:58:48] @AbdealiJK: roughly guessing, about 50% of the wrong ones have a category "Taken with ..." or "Created with ", so that should exclude them from categorization as graphic...
[13:59:04] ...no idea how that would affect the overall percentages though
[13:59:22] Well, Graphics is of bucket "Type"
[13:59:32] So, it won't affect the stats in https://commons.wikimedia.org/wiki/User:AbdealiJK/file-metadata/GSoC2016
[13:59:57] and I guess the "Created with" and "Taken with" come from EXIF data, right?
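Note on the sample counts quoted so far (140 correct / 11 wrong for faces, 91 correct / 14 wrong for graphics): the percentage depends on which total the wrong count is divided by. The sketch below is only an illustration of that arithmetic, not something taken from the bot; the function name `error_rate` and its `per_checked` flag are hypothetical, and whether this explains the gap between the 15% and ~13% figures for graphics is a guess, since the ~13% was quoted for the whole page rather than for this sample.

```python
def error_rate(correct: int, wrong: int, *, per_checked: bool = True) -> float:
    """Return an error rate as a percentage.

    per_checked=True  divides by all checked files (correct + wrong).
    per_checked=False divides by the correct files only.
    """
    denominator = correct + wrong if per_checked else correct
    if denominator <= 0:
        raise ValueError("need at least one counted file")
    return 100.0 * wrong / denominator


# Counts quoted in the log for roughly half of the page that was checked.
samples = {
    "faces": {"correct": 140, "wrong": 11},
    "graphics": {"correct": 91, "wrong": 14},
}

for name, counts in samples.items():
    of_checked = error_rate(**counts)
    vs_correct = error_rate(**counts, per_checked=False)
    print(f"{name}: {of_checked:.1f}% of checked files, "
          f"{vs_correct:.1f}% relative to correct files")
```

For faces both readings round to roughly the quoted 7-8%; for graphics they give about 13% and about 15%, so it is worth agreeing on one denominator before comparing numbers.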
[14:00:03] Yep
[14:00:17] so that should be "rock solid info"
[14:00:38] Well, I think "Created with" may not be rock solid
[14:00:49] we have to take care with things like "created with gimp" and so on
[14:01:03] Yeah ^
[14:01:23] But a lot of real photos are also touched up in Photoshop. So, Photoshop or GIMP can be either graphic or photo
[14:01:36] ok, so exclude all "taken with" from categorization as graphic
[14:01:50] But, again coming back ... it seems that for the MVP stats we can simply remove Graphics and it won't really matter ...
[14:02:42] this is also a solution - so postpone my proposal for later - but at some point we have to cope with that.... ;)
[14:03:00] I just checked the stats for "Taken with" and it does help a lot.
[14:03:04] I'll add it :)
[14:03:14] And yeah - agree with you
[14:03:31] cool! ;)
[14:05:02] Awesome! I did not notice that pattern :)
[14:05:09] * AbdealiJK is very happy now.
[14:07:17] So then checking the "categorised correctly in one other category": we have about 100% categorized here, did not check for errors though, but we could have up to 30% (which I hope we do not) and would still be fine
[14:08:03] that is very nice!
[14:08:09] No, 1 moment
[14:08:15] We do not have 100% for categorized in one
[14:08:30] We have 86%
[14:08:42] Because File type categories were not counted in that, right?
[14:08:43] aaa yes the 0 ...
[14:09:51] But, logically ... the number of errors in that wouldn't be more than the errors in graphics + errors in faces
[14:10:02] that leaves us with 15% ... not that generous, but given we want to stay below a 10% error rate anyway - that's still fine
[14:10:13] ^ Yep
[14:11:11] Also, that 10% is on a much smaller subset ...
[14:11:14] yes, as we do percentages, it should actually be the weighted average of the graphics and faces errors, so the upper bound is graphics (~15% currently) and the lower bound is faces (~7% currently)
[14:11:57] you meant that's better or worse?
[14:12:20] So, see. If we have 14 errors in 140 faces => 10% error rate in faces
[14:13:19] But the 70% is out of all the images. So, that becomes 14 wrong categories out of ~700 files => Error rate is 2% only
[14:14:16] So, if we count the 14 errors in graphics and 11 errors in faces that you had counted (in half the files) we get 25 / 700 = ~4% errors
[14:14:44] So, we have 86 - 4 = 82% which have 1 correct category - which is above the minimum 70%
[14:14:47] yes, but I did a sample only so I checked about 50% of the whole page
[14:14:57] Yep, 700 is 50% of the whole page :)
[14:15:04] ... ;)
[14:15:05] Total in page is 1570
[14:15:27] Approximations everywhere ...
[14:15:52] But yeah - it does seem like the stats work out in general.
[14:16:08] yea, yea I see - sorry - denominator question again... ;))
[14:17:32] Okay! So enough math for me for the day.
[14:18:36] @AbdealiJK: Can you refresh my memory - we do not have an indicator for what are leaf categories yet, or do we?
[14:19:06] No, we do not have that
[14:20:06] hmmm
[14:20:11] Bump: Just a note here. The logs of uncategorized images are done for all days except 21 and 26 July
[14:20:19] If you want to take a look
[14:20:47] 26th was done but not updated, yet?
[14:21:10] No, it's still going on
[14:21:47] 40 files left for 26th
[14:22:08] 200 files left for 21st
[14:22:36] 26th has a link already... ;)
[14:22:59] Ah yes. But that is an old one
[14:23:25] i.e. that log is from when I had run it on 29th July
[14:23:43] The bot's better now, and also the number of files has drastically decreased because they've been categorized
[14:24:26] So, the major change in uncategorized files is that the Location bucket is only 3-5% as compared to 8+% earlier
[14:24:42] funny ... comparing the newfiles and uncat from 20th ...
[14:24:59] Ah
[14:25:08] DrTrigon: Don't check https://commons.wikimedia.org/w/index.php?title=User:AbdealiJK/file-metadata/GSoC2016 yet
[14:25:22] DrTrigon: Those stats don't get auto updated. I have to do it manually
[14:26:02] it seems the overall categorization rate is a bit lower on uncats (kind of makes sense - the good ones have already been done) but looking at single categories such as human faces or graphics they are quite similar
[14:26:16] I looked at the stats on the subpages
[14:26:39] COol
[14:26:41] Cool *
[14:27:43] so yes I agree with your former statement: "enough math for today"!
[14:28:57] do you want to ping jayvdb to agree on a new meeting date? or do you just want to talk to him about some questions?
[14:29:56] I don't have much to say actually. I think if he calculates the error rates and they turn out okay for the MVP - then we're good
[14:29:58] ...in any case just inform me what you need from my side - I guess we (jayvdb and I) should have a look at what you hand in before you do it as well, right?
[14:30:26] DrTrigon: I'm going to be handing in https://commons.wikimedia.org/wiki/User:AbdealiJK/file-metadata/GSoC2016
[14:30:31] yes! and if there is disagreement or so just ping me for discussion!
[14:30:37] They just want 1 link with all info
[14:30:39] Sure, sounds good!
[14:30:59] And this 1 link is allowed to be "manually done"?
[14:31:14] Yep
[14:31:23] They suggested a blog post ...
[14:31:28] Or a Google Drive doc
[14:31:53] ok, that's as good ...
[14:32:21] ... maybe we want a Commons admin to lock that page after the deadline? not sure whether this is possible for user pages...
[14:33:37] Again, I don't think it would matter - as blog posts and Google Drive docs can be edited
[14:33:49] I think they might take a snapshot of it
[14:34:19] But yeah, locking it doesn't hurt. Either way the history information is always preserved
[14:34:44] Yes but the wiki page can be edited by anybody - and then they would have to check the history and so on...
[14:35:09] True
[14:36:19] Otherwise you could check whether you can move your nice summary to another namespace where locking is possible - I guess jayvdb can answer that easily - this is just a suggestion.
[14:36:54] So I will look over that page carefully and ping you if something comes to my mind. Other than that I'm done if you are... :)
[14:36:54] Yep 👍
[14:37:08] I'm done too :)
[14:37:25] Have a nice evening and weekend!
[14:37:47] Cya :)
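
Note on the "86 - 4 = 82%" estimate from the meeting: it can be re-derived with a few lines of arithmetic. This is a rough sketch using only the approximate figures quoted in the log (14 wrong graphics, 11 wrong faces, ~700 files checked, 86% of files with at least one category, 70% MVP minimum). The function name `correct_category_share` is hypothetical, and the sketch assumes the pessimistic case in which every wrongly categorized file has no other correct category.

```python
def correct_category_share(categorized_pct: float, wrong: int, files_checked: int) -> float:
    """Estimate the percentage of files that have at least one correct category.

    Assumption of this sketch: each wrongly categorized file is counted as
    having no correct category at all, so the result is a lower estimate.
    """
    if files_checked <= 0:
        raise ValueError("files_checked must be positive")
    error_pct = 100.0 * wrong / files_checked
    return categorized_pct - error_pct


# Approximate figures quoted in the log.
wrong_graphics = 14
wrong_faces = 11
files_checked = 700        # "~700", about half of the 1570 files on the page
categorized_pct = 86.0     # share of files with at least one category
mvp_minimum_pct = 70.0     # minimum required by the MVP

estimate = correct_category_share(
    categorized_pct, wrong_graphics + wrong_faces, files_checked
)
print(f"estimated share with one correct category: {estimate:.1f}%")
print(f"meets MVP minimum: {estimate >= mvp_minimum_pct}")
```

The error share comes out at about 3.6%, which the meeting rounded to 4%, so the estimate stays near 82% and above the 70% minimum either way.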