[21:12:44] isaacj, yt?
[21:12:57] dsaez: yep
[21:13:31] I'm looking at your code and outputs.. ...
[21:13:38] uh oh
[21:13:50] === Human ===
[21:13:50] label prop
[21:13:50] 0 human 0.268812
[21:13:50] 2977 child 0.000023
[21:13:50] 6146 displaced person 0.000008
[21:13:52] === 0 ===
[21:13:54] label prop
[21:13:56] 1 0 0.063171
[21:13:59] 8 0 0.011147
[21:14:01] === Film ===
[21:14:03] label prop
[21:14:07] 2 film 0.030739
[21:14:10] 31 3D film 0.002277
[21:14:12] 55 animated feature film 0.001395
[21:14:16] not sure how to interpret that
[21:14:39] 3Dfild is a subset of film?
[21:14:49] *3D film
[21:15:37] sure -- in between the "===" is the wikidata class that met a given threshold and then listed under it are the top-three (by pageviews) subclasses that were aggregated into that class
[21:16:10] so Film has under it "film", "3D film", "animated feature film", and probably a bunch more subclasses
[21:16:35] toy story: https://www.wikidata.org/wiki/Q171048
[21:17:00] ok
[21:17:01] so in the data any views to toy story would be associated w/ "animated feature film" initially but based on this taxonomy would be considered "Film"
[21:17:49] the numbers on the far left in the printout are just the pandas index numbers, which happen to correspond with their pageview rank
[21:17:55] the 100% is computed out of the sum of pageviews? items? subclass-of relationships?
[21:18:17] 100% is sum of pageviews
[21:18:39] i think the numbers are 1.9M wikidata items -> 32K unique instance-of properties
[21:19:22] and then depending on the thresholds you use, those 32K instance-of properties are aggregated into somewhere between like 20 and 100 classes
[21:20:38] and it's 7.5M pageviews across those items
[21:20:44] interesting distribution ... it's a super long tail
[21:21:40] yeah because you have instance-of values like "GDR Badminton Championships" and "Canadian Forces base"
[21:21:59] subclass-of or instance-of or both?
[21:22:36] those are values for instance-of
[21:23:02] as far as unique subclasses...let me check
[21:23:53] 80k different subclasses across all of wikidata
[21:24:21] got you..
[21:24:44] we need some sort of hierarchical clustering
[21:24:46] and the 32k unique instance-of values are only based on the articles that i'm analyzing so there are probably way more instance-of values out there
[21:24:57] yeah, hence this aggregation technique
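
The aggregation discussed above (instance-of values rolled up through subclass-of until a class meets a pageview threshold, then the top three members printed per class) could be sketched roughly as follows. This is only an illustration and not the actual code from the conversation: the function name `aggregate`, the single-parent `subclass_of` map, the "other" bucket for labels that never reach the threshold, and all pageview counts are assumptions made up for the example.

```python
from collections import defaultdict

# Toy pageview counts per instance-of label (invented numbers, not real data).
pageviews = {
    "human": 700,
    "child": 10,
    "film": 150,
    "3D film": 40,
    "animated feature film": 30,
    "short film": 20,
    "Canadian Forces base": 50,
}

# Hypothetical subclass-of map with one parent per label.
# Real Wikidata classes can have several parents; this sketch ignores that.
subclass_of = {
    "child": "human",
    "3D film": "film",
    "animated feature film": "film",
    "short film": "film",
}


def aggregate(pageviews, subclass_of, threshold):
    """Assign each label to its nearest ancestor (or itself) whose
    rolled-up share of total pageviews is at least `threshold`."""
    total = sum(pageviews.values())

    # Rolled-up pageviews: each label's views count for itself and every ancestor.
    rolled = defaultdict(int)
    for label, views in pageviews.items():
        node, seen = label, set()
        while node is not None and node not in seen:
            seen.add(node)
            rolled[node] += views
            node = subclass_of.get(node)

    groups = defaultdict(list)
    for label, views in pageviews.items():
        node, seen = label, set()
        # Walk up the hierarchy until a class is large enough.
        while node is not None and rolled[node] / total < threshold:
            seen.add(node)
            parent = subclass_of.get(node)
            node = None if parent in seen else parent
        target = node if node is not None else "other"
        groups[target].append((label, views / total))

    # Sort members of each class by share of pageviews, largest first.
    for members in groups.values():
        members.sort(key=lambda item: item[1], reverse=True)
    return dict(groups)


groups = aggregate(pageviews, subclass_of, threshold=0.1)
for cls, members in groups.items():
    print(f"=== {cls} ===")
    for label, prop in members[:3]:
        print(f"{label} {prop:.6f}")
```

With these toy numbers, "animated feature film" is too small on its own and ends up under "film", which mirrors the Toy Story example in the chat. Raising or lowering `threshold` changes how many classes survive, which is presumably why the number of resulting classes varies with the thresholds used.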