[14:51:14] dsaez: yt?
[14:54:21] looks like that cirrus2hive wikidata table is only enwiki and frwiki, so going to probably look into processing the json dump and loading it into a single temporary hive table, because the interlanguage switching touches so many languages that it'd be a lot of joins in mariadb
[16:07:57] isaacj hi
[16:08:11] ok, maybe check joal script
[16:11:35] dsaez: thanks, will do
[16:51:53] dsaez, isaacj - Hi :)
[16:52:31] dsaez, isaacj: If the data the `big-join` generates is what you globally need, the best could be to generate it once and write it as parquet
[16:52:57] dsaez, isaacj - Given you seem to be working on wikidata these days, I'll also bring a newer snapshot onto hadoop
[16:58:22] hey joal :)
[16:59:38] agreed that a snapshot with project, page ID, page title, wikidata-id would be super helpful! is this the script you're talking about: https://phabricator.wikimedia.org/T215616#4944316
[17:00:03] yessir
[17:24:48] joal: mmkay, i think i see what's going on then. correct me if i'm wrong, but you're building a table that maps every revision to its corresponding wikidata ID. for my use case, where i am not mapping to a specific revision, joining to any revision where the wiki and page title match should be sufficient, because all of the revisions for a page will have the exact same wikidata ID
[17:26:31] isaacj: Ok - I can update the script to work with page-only data and provide a page-oriented table
[17:26:47] dsaez: Tell me if you need a similar table for revisions :)
[17:29:55] joal, having that table handy and updated would be great.
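The page-to-wikidata-ID mapping discussed above (project, page title, wikidata ID) can also be extracted directly from the Wikidata JSON dump's sitelinks. The following is a minimal hypothetical sketch, not joal's actual script, assuming the standard dump layout where each line of the file is one JSON entity (wrapped in a JSON array, so lines may carry a trailing comma):

```python
import json

def sitelink_rows(entity_line):
    """Yield (wiki, page_title, wikidata_id) rows for one line of the
    Wikidata JSON dump. Lines may have a trailing comma from the
    enclosing JSON array, and the first/last lines are the brackets."""
    line = entity_line.strip().rstrip(",")
    if not line or line in ("[", "]"):
        return
    entity = json.loads(line)
    qid = entity["id"]
    for wiki, link in entity.get("sitelinks", {}).items():
        yield (wiki, link["title"], qid)

# Example entity, trimmed to just the fields used here (illustrative data):
example = json.dumps({
    "id": "Q42",
    "sitelinks": {
        "enwiki": {"site": "enwiki", "title": "Douglas Adams"},
        "frwiki": {"site": "frwiki", "title": "Douglas Adams"},
    },
})
rows = sorted(sitelink_rows(example))
print(rows)
```

Loading rows like these into a single Hive table keyed on (wiki, page title) would avoid the per-language mariadb joins mentioned above, since one table covers every project at once.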
[17:35:47] dsaez: updated is the issue - I'll have it, then at some point it'll be updated automatically
[17:52:03] joal, cool
[20:07:03] isaacj: chelsy mentioned that you are interested in what kind of data we may have about usage of wikipedia content by amazon alexa etc - see the two links at https://www.mediawiki.org/wiki/Voice_assistants_and_Wikimedia#See_also (TLDR: we don't really know ;)
[20:09:41] HaeB: thanks for the share. yep, those question marks say a lot
[20:15:40] isaacj: my notes from talking to alexa folks: https://wikitech.wikimedia.org/wiki/Analytics/Alexa
[20:16:06] isaacj: it caches wikipedia heavily, so pulls of content and usage of content are quite different
[20:16:55] nuria: yeah, i'm seeing that. so no page views then, and API usage is disassociated from how often the content is served
[20:21:18] isaacj: Very much - they do a major effort not to be a burden
[20:26:19] nuria: mhmm, understandable but also opaque. thanks for sharing this too! i'm collecting these resources to see what it is we know / don't know / don't know but could know
[20:32:26] nuria: yes, that was one of the links at https://www.mediawiki.org/wiki/Voice_assistants_and_Wikimedia#See_also ;) (isaacj feel free to add more there if you find other material)
[20:33:04] ...or start a new page