[03:24:13] hi, i'm proposing a new type of data entry -- per-sitelink pageviews
[03:24:13] https://phabricator.wikimedia.org/T174981
[03:24:53] prefix:total_page_views [integer] .
[03:26:58] yurik: sounds quite useful. how difficult would it be to get the number of combined pageviews across all languages?
[03:28:16] YairRand, imo, very easy - i'm almost done with a python script to get the totals. With some polishing, it shouldn't take too long. The delays may be due to setting up a totally new pipeline for data imports - at this point, the only data importer is in Java, downloading from WD
[03:28:48] so to do it properly, we would also need some monitoring around this system
[03:29:03] shouldn't be too hard, but that's really in ops land
[03:29:15] please vote on the issue )
[03:30:21] wait, we still have votes on issues? I thought they got rid of the button?
[03:30:40] even before the move to phabricator?
[03:31:43] YairRand, award token :)
[03:32:23] so many options to choose from...
[03:33:18] and also apparently a weird bug in the system that causes tokens to sometimes be automatically rescinded?
[03:33:34] (attempt 2)
[03:33:40] that was weird
[03:35:28] and phabricator's phabricator doesn't allow posting issues without creating an account, so meh.
[03:35:44] hehe
[03:36:47] oki, let me finish the rough draft implementing it - you will instantly be able to play with it at http://88.99.164.208/wikidata/ once i'm done
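A minimal sketch of the kind of totals script described above, summing pageviews across an item's sitelinks with the public wbgetentities API and the Wikimedia Pageviews REST API. The Q-id, the date range, and the dbname-to-domain heuristic are illustrative assumptions, not the actual script from the discussion:

    import requests
    from urllib.parse import quote

    # Illustrative sketch only: total pageviews across all sitelinks of one
    # item. Q-id, date range and dbname->domain heuristic are placeholders.
    WIKIDATA_API = "https://www.wikidata.org/w/api.php"
    PV_API = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

    def total_pageviews(qid, start="20170801", end="20170831"):
        # Fetch all sitelinks (enwiki, dewiki, ...) for the item.
        r = requests.get(WIKIDATA_API, params={
            "action": "wbgetentities", "ids": qid,
            "props": "sitelinks", "format": "json",
        })
        total = 0
        for dbname, link in r.json()["entities"][qid]["sitelinks"].items():
            if not dbname.endswith("wiki"):  # rough Wikipedia-only filter
                continue
            project = dbname[:-4] + ".wikipedia.org"
            title = quote(link["title"].replace(" ", "_"), safe="")
            pv = requests.get("%s/%s/all-access/user/%s/monthly/%s/%s"
                              % (PV_API, project, title, start, end))
            if pv.ok:  # non-Wikipedia dbnames and deleted pages just 404
                total += sum(item["views"] for item in pv.json()["items"])
        return total

    print(total_pageviews("Q72"))  # combined monthly views for Zurich

A bulk pipeline like the one in the ticket would read the analytics dumps rather than making one API call per sitelink, but the aggregation is the same idea.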
[06:24:06] has anyone here heard of Unigraph.io?
[11:52:43] Thiemo_WMDE Aleksey_WMDE could you bring a microphone please?
[11:53:20] consider it done
[12:25:51] what do I do when someone is not married anymore, but they are still put as "spouse" in wikidata?
[12:26:13] I guess I can not just remove it
[12:26:30] atluxity: add qualifier start/end dates?
[12:26:44] you ask me?
[12:26:49] maybe also change the rank of the statement
[12:28:13] I don't know when their marriage ended
[12:28:25] atluxity: whose?
[12:28:26] we can put unknown
[12:28:33] Q1825205
[12:28:52] I tried doing that now
[12:29:01] should the change be instant?
[12:29:05] "unknown" https://www.wikidata.org/wiki/Help:Statements#Unknown_or_no_values
[12:29:07] on wikipedia article
[12:31:00] atluxity: how do you know the marriage ended?
[12:34:05] I have no references, but I have talked with the person in question
[12:34:53] he was unable to find out how to edit his article, so he called for help
[12:38:48] I really doubt this was correct but yeah I'll let it be
[12:41:27] we have a poor celebrity press in Norway, so someone getting divorced does not really make for great headlines and can be hard to find a reference for.
[12:42:06] so for us doing wikipedia, should we tell them that they should still be listed as married because we can not find proof online that they are divorced?
[12:42:57] changing the rank is the wrong thing to do, deprecated means that the statement was never correct
[12:43:18] nikki: i think we mark current spouse as preferred
[12:43:22] and former as normal
[12:43:30] yes, that's what we're supposed to do
[12:43:33] but you're right about deprecated
[12:43:39] but it's pretty difficult to get people to follow that :(
[12:43:43] yeah :/
[12:44:03] oh, ok
[12:44:08] if we don't know the exact date of divorce, but know the year or something, then just putting the year is ok
[12:44:33] not sure that's the case for this
[12:44:43] it's just natural to want to edit the statement you're trying to mark as less preferred, sometimes you might not even know which of the other statements is better
[12:44:57] setting an end time does not help
[12:45:16] it is still displayed in the article
[12:45:37] if the wikipedia article is fetching the data from wikidata, you should contact someone at that wikipedia about it
[12:45:54] they're probably not checking things like dates
[12:48:54] perhaps a preferred "spouse: novalue" would do the trick, but preferred statements should always have some verifiable source
[12:53:22] it would be best to get them to fix their template so that it stops displaying the data misleadingly when we don't have anything more current
[12:53:44] what if someone gets divorced and we don't know if they remarried? :/ we wouldn't be able to pick between novalue and somevalue
[12:56:22] +1
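To make the rank rules above concrete: the full statement history stays queryable, and rank plus the end-time qualifier (P582) decide what "current spouse" consumers see. A hedged sketch against the public WDQS endpoint, using the item mentioned in the discussion:

    import requests

    # Sketch: list spouse (P26) statements with their rank and optional
    # end-time qualifier (P582) for the item from the discussion.
    # WDQS predefines the wd:/p:/ps:/pq:/wikibase:/bd: prefixes.
    QUERY = """
    SELECT ?spouseLabel ?rank ?end WHERE {
      wd:Q1825205 p:P26 ?st .
      ?st ps:P26 ?spouse ;
          wikibase:rank ?rank .
      OPTIONAL { ?st pq:P582 ?end . }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    """

    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": QUERY, "format": "json"})
    for row in r.json()["results"]["bindings"]:
        print(row["spouseLabel"]["value"],
              row["rank"]["value"].rsplit("#", 1)[-1],  # e.g. PreferredRank
              row.get("end", {}).get("value", "(no end time)"))

A plain wdt:P26 triple would only return the best-rank statements, which is exactly why marking the current spouse as preferred (rather than deprecating the old one) matters.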
[20:31:21] We had 20 million edits last month.
[20:59:49] SMalyshev, i posted some examples in my last comment - https://phabricator.wikimedia.org/T174981#3579558 The most frequent usage is various importance rankings, which you cannot readily do unless you merge the data. Given 200 items that a sparql query could get, you often want to order them according to your preference - e.g. by population size, or by a combination of population plus popularity. Calling a separate api to get that is like calling wikidata directly after using wdqs because the labels are missing.
[21:00:57] well, the thing is pageviews are not very stable data, and they are not even a wikidata thing... and we already have a place which handles them
[21:01:24] so I am not convinced we need to create another separate sync system just to run them between two databases...
[21:02:22] that's very true (then again, wikidata is sometimes not very stable either :D). In a way, pageviews are like sitelink counts - you could get them via other ways, but it makes it much better to store them in wdqs to simplify many queries.
[21:02:35] it sounds to me, at least for now, that using an external service would be a better idea... also not 100% clear which pageviews you want to count for a wikidata entry
[21:03:20] using external data does not allow you to do joins (easily) - unless you are talking about syndication?
[21:03:41] yurik: the difference is that wikidata is tactically unstable because it's a work in progress, but it ultimately is supposed to converge somewhere (at least excepting changing things like people being married/divorced/elected/dying/etc.)
[21:04:06] sure sure, i was not serious about that point :)
[21:04:13] but pageviews are conceptually unstable - these are data series, not facts. and wikidata doesn't do data series very well right now
[21:04:46] with PV, the idea is to offer relative total counts (not series) - so that items can be easily compared in terms of popularity
[21:05:06] so it doesn't matter what the actual number is, as long as it is comparable with another number
[21:05:17] think percentages :)
[21:05:33] (as in - this article is 50% more popular)
[21:05:33] oh, so you need a popularity score! we already have that
[21:05:37] in the Elastic index
[21:06:13] check https://en.wikipedia.org/wiki/Destreza?action=cirrusDump - look for popularity_score
[21:06:44] not in WDQS - it's not possible to do a SELECT { population > X } ORDER BY popularity , or is it?
[21:07:14] nope, it's not exported now. Not sure how to do that yet...
[21:07:26] if it was a page_prop, it'd be relatively easy, except for the part where we don't know about pageprop updates
[21:07:36] also, this is a property of the wikipedia page, not the wikidata item
[21:07:57] sure, that's why it would be stored as popularity
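For the curious, that score can be inspected directly. A sketch that assumes cirrusDump returns the Elasticsearch document as a JSON array with a _source field, as the linked example suggests - the exact shape isn't guaranteed:

    import requests

    # Sketch: read CirrusSearch's popularity_score for one page via the
    # cirrusDump debug action. The JSON shape is an assumption based on
    # the example linked above and may vary.
    docs = requests.get("https://en.wikipedia.org/wiki/Destreza",
                        params={"action": "cirrusDump"}).json()
    source = docs[0]["_source"] if isinstance(docs, list) else docs
    print(source.get("popularity_score"))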
[21:09:24] yeah but how would you use it?
[21:09:29] SMalyshev, who calculates the popularity score? MW?
[21:09:46] what if this is the most popular article on dewiki but doesn't have an enwiki article?
[21:10:45] e.g. a page about elections in Spain would probably be super-popular on eswiki and rather mediocre on enwiki. so is it popular or not?
[21:13:11] SMalyshev, something like this: WHERE { ?sitelink about ?wd ; inLanguage 'en'; popularity ?rate } ORDER BY ?rate DESC.
[21:13:42] sure, but that all depends on what the goal of your query is - to find a concept (WD) or a wiki page
[21:14:24] so for example, if i am creating a map, and looking for english translations of a place in africa, I would probably want the popularity of english pages for that location
[21:15:11] because an english-speaker would likely want the most popular places to appear before the less popular ones
[21:15:47] this way, when creating a map of california, san francisco would automatically show up before san jose simply because it has (?) a higher popularity rating
[21:16:12] despite being smaller in size
[21:20:53] SMalyshev, as for popularity vs timeseries - i think it would be good to store yearly popularity numbers for research purposes - e.g. sitelink pageviews:2016 -- easy to do major trend checking. But I am not certain this is needed as much as pure popularity scores
[22:03:17] yurik: but I think we already have these data in the pageviews service... not sure the wikidata graph db is the right place
[22:03:39] unless we get a broader concept of storing timeseries/historical data
[22:04:12] SMalyshev, we have the data, sure, but we have no easy way to use it together with the query. It's like saying we have wikidata itself, why do we need a WDQS? you can simply pull the items one at a time
[22:04:22] I mean I get how it can be useful, but I am still not 100% convinced it's useful enough to not just use an external service for it
[22:05:40] yurik: it's not the same... this data is not frequently used in queries, as far as I know, and only is needed for ranking results, as far as I can see.
[22:06:55] I'll think about it more, but right now I'm not convinced it's a good match, given how much trouble we'd go through to get it there
[22:07:14] if there's an easy way to do it I'd not object
[22:07:41] basically it's the fundamental problem of convenience - one can parse all the dumps, import them into a database, run a WDQS query, join it locally, and get the final answer of which items are more useful/relevant. I am not sure i understood your point about the data not being frequently used in queries - there is no such data ATM
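That client-side join is straightforward to sketch, which is also what makes it tedious at scale. A hypothetical example - the query (cities in California) and the month are made up - that orders WDQS results by enwiki pageviews, one API call per result:

    import requests

    # Sketch of the "run a WDQS query, join it locally" workaround:
    # fetch items plus their enwiki titles, then rank by pageviews.
    QUERY = """
    SELECT ?item ?title WHERE {
      ?item wdt:P31 wd:Q515 ;       # instance of: city
            wdt:P131 wd:Q99 .       # located in: California
      ?page schema:about ?item ;
            schema:isPartOf <https://en.wikipedia.org/> ;
            schema:name ?title .
    }
    """
    rows = requests.get("https://query.wikidata.org/sparql",
                        params={"query": QUERY, "format": "json"}
                        ).json()["results"]["bindings"]

    def monthly_views(title):
        # One Pageviews API call per title -- fine for dozens of results,
        # painful for thousands; that is the convenience problem above.
        r = requests.get(
            "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
            "en.wikipedia.org/all-access/user/%s/monthly/20170801/20170831"
            % title.replace(" ", "_"))
        return sum(i["views"] for i in r.json()["items"]) if r.ok else 0

    ranked = sorted(rows, key=lambda b: monthly_views(b["title"]["value"]),
                    reverse=True)
    for b in ranked[:10]:
        print(b["title"]["value"])

With pageviews exported into the graph, the same ordering would collapse into a single ORDER BY inside the query itself.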
[22:09:10] agree about the difficulty... I am working on a python script for it, but it would be better to just use the popularity from ES
[22:09:47] well, the data is available via the API, isn't it?
[22:10:26] so we could make an api call from wdqs and fetch it, possibly?
[22:18:02] SMalyshev, sure, but it should be done in bulk, right? analytics makes full dumps available, and i thought those are easier to consume than making complex api calls, especially when you have to deal with disappearing values (e.g. if a page is deleted/renamed)
[22:18:37] if a page is deleted/renamed, you'd get no pageviews for it anyway, since it'd have no sitelinks
[22:26:29] SMalyshev, http://tinyurl.com/y8gafo42
[22:28:22] plaza Israel in Argentina is marked as Israel itself - popularity makes for a nice order of which objects should be fixed first :)
[22:30:35] I can see it, surely, for you use case... I'm just not convinced yet it is worth the trouble for the generic wdqs case
[22:30:39] *your
[22:32:16] sure, i understand your concern - you are right that we shouldn't do it just for my usecase. I guess this is more of a question for the WDQS community - how many people would be interested in actually using it / benefiting from it. So far it's only 2, so hard to say yea/nay just yet
[22:33:04] is this a discussion about the page view counts in wdqs?
[22:33:17] yes
[22:34:18] yurik: sure, if a significant number of people say they need it then it changes things... ultimately if the community wants it and it's possible then we should do it, no question
[22:34:47] I suspect that users of the data would very frequently prefer to have things ordered most-viewed to least. random pages that barely anyone ever looks at are less interesting.
[22:34:51] exactly :) i don't want to be the only one pushing for it :)
[22:35:34] YairRand: that depends on the nature of the query, I assume
[22:36:48] I'm having difficulty imagining any (at least moderate length) human-viewed query where the more interesting pages in general aren't more interesting to the user.
[22:38:51] unless they particularly wanted it sorted by something else
[22:39:23] (sorted by age, date, length of term, population size, etc.)
[22:40:55] i also think that at the moment the only sort criteria in WDQS are by less useful properties, such as name or age, plus some calculated value like "distance from X". But i do not know how many queries there are that would prefer to sort by popularity
[22:41:14] YairRand: that's easy. Any negative-results and give-all queries would be like this
[22:41:44] YairRand: e.g. "all paintings without specified author" or "oldest Russian-speaking writer"
[22:42:19] a lot of queries have either a natural ordering other than popularity or do not have an inherent order at all
[22:42:32] "moderate length" - referring to the length of results
[22:43:29] but yeah, there probably are quite a bunch that have a natural order or no order, for which this wouldn't be useful
[22:44:49] is there a general forum for sparql on Wikidata? WD:RAQ seems limited to specific requests.
[22:46:06] i think we should take it to a mailing list - either wikidata or wikidata-tech
[22:47:28] ...I didn't even know there was a wikidata-tech. (goes to subscribe.)
[22:49:43] (apparently I was already subscribed, yet it has low enough volume that I didn't even remember it.)
[22:53:11] SMalyshev, is there any real benefit from pre-declaring URIs in Blazegraph? E.g. saying that prefix:value is a frequent constant? I looked inside the jnl file - seems like the actual URI is only stored once.
[22:54:16] yurik: yes, dictionary items would get shorter storage ids as far as I can see
[22:54:55] i.e. dictionary ones are 2 bytes iirc while a general string is 9 bytes. Also BG probably doesn't need to look up the dictionary on disk when using them
[22:55:14] ("on disk" should of course be taken in a broad sense, as there's os cache, etc.)
[22:55:34] ah, i see. Interesting. thanks. But does adding to the dictionary mean a new binary schema?
[22:55:56] so if you use something a real lot, like standard predicates, it makes sense to put it in the dictionary for performance
[22:56:11] if not, BG will put it in the generic hash
[22:56:56] i mean - do i need to regen the whole DB if I add a new predicate constant?
[22:57:20] yurik: yes, changing the dictionary means you need to reload, since the binary data now means a different thing. Theoretically it could be incremental, practically I don't think it works that way
[22:57:36] you could hack it but that's weirdness territory...
[22:58:01] yurik: so theoretically if you add it at the end maybe not... but the official advice is "yes".
[22:58:07] is there a way to do an in-place migration - copy data from one instance to another?
[22:58:44] yurik: not really. I mean you could export the data into RDF, but that would be the same as loading new data. there's no other export really
[22:59:14] it's basically all indexes, and I don't think they are exportable in any form
[22:59:17] either running two java instances, or connecting two different DBs inside the same process, and doing some magical INSERT graph into graph
[22:59:41] yurik: no I don't think so...
[22:59:42] i saw sparql update had something like that, but wasn't sure
[22:59:45] bummer
[22:59:46] ok, thx
[23:36:50] SMalyshev, is there a maximum number of values that I can add to the dictionary? I have added about 60, was not sure if i should add a few hundred
[23:38:12] yurik: I don't think there is, but maybe since they are supposed to be 2 bytes, 64K is a natural limit
[23:38:41] not sure though that it won't just switch to a longer int... no idea. probably makes sense to check the code and see
[23:38:45] SMalyshev, sure, but how many are defined by Wikidata itself?
[23:39:03] or does it define just a few dozen?
[23:39:14] hmm I didn't count them :)
[23:39:29] it's probably a few dozen yes if you mean the initial vocabulary
[23:39:40] does it expand somehow??
[23:39:47] (other than my own)