[21:01:21] #startmeeting RFC meeting
[21:01:22] Meeting started Wed Mar 29 21:01:21 2017 UTC and is due to finish in 60 minutes. The chair is TimStarling. Information about MeetBot at http://wiki.debian.org/MeetBot.
[21:01:22] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
[21:01:22] The meeting name has been set to 'rfc_meeting'
[21:01:41] #topic T161527 Canonical data URIs and URLs for machine readable page content
[21:01:41] T161527: Canonical data URIs and URLs for machine readable page content - https://phabricator.wikimedia.org/T161527
[21:02:27] hi all!
[21:02:41] thanks tim for the link.
[21:03:14] I'll try to give a short intro to the topic, and the problem we are trying to solve
[21:03:55] On Commons, we now have the data namespace, with pages like https://commons.wikimedia.org/wiki/Data:Avignon_City_Wall.map
[21:04:12] We want to link to these from Wikidata. And we want to represent these links in RDF
[21:04:35] To do so, we need a nice canonical URI for the page content (not the HTML page)
[21:05:16] and we want tools that consume RDF data to be able to get data in raw form (not HTML page) easily
[21:05:40] yes, indeed. which means the canonical URI should be resolvable (i.e. it should be a URL)
[21:05:50] So here are the four proposed forms:
[21:05:58] A) https://commons.wikimedia.org/data/Avignon_City_Wall.map
[21:06:00] B) https://commons.wikimedia.org/data/Data:Avignon_City_Wall.map
[21:06:01] C) https://commons.wikimedia.org/raw/Data:Avignon_City_Wall.map
[21:06:03] D) https://commons.wikimedia.org/api/rest_v1/page/data/Data:Avignon_City_Wall.map
[21:06:40] The difference between A and B is that B includes the namespace, so the same pattern can be used with pages in other namespaces; it's not specific to this particular use case
[21:06:42] says WHATWG in https://url.spec.whatwg.org/#goals : "Standardize on the term URL. URI and IRI are just confusing. In practice a single algorithm is used for both so keeping them distinct is not helping anyone. URL also easily wins the search result popularity contest."
[21:06:53] The difference between B and C is just "data" vs "raw" in the path
[21:07:00] Option D is a RESTbase URL
[21:07:11] question: when we talk about Data:Av.map, would localizations of Data: also be used? I.e. Данные:Av.map or however this space is named in Russian?
[21:08:02] TimStarling: For HTTP URIs, I agree. There's also stuff like urn:isbn:123456789x....
[21:08:13] IRI is wider than URL IIRC. though for now it doesn't matter too much
[21:08:40] Yes, we could have cömmmons/dätä in an IRI ;)
[21:08:58] yes, technically a URI is either a URL or a URN
[21:09:21] but if we're actually talking about a URL namespace then maybe it makes sense to stop saying URI all the time?
[21:10:13] Fair enough. URI is the term used in the context of RDF, which drives this need. But we can just say URL here for now
[21:10:36] So, pretty URLs for raw page content.
[21:10:43] any preferences?
[21:10:47] My personal choice is option B
[21:11:23] (side note: do we want to back this with action=raw, or ditch action=raw?)
[21:11:28] I have a slight preference for A, due to Data: being just one of the namespace names, and the namespace having a bunch of other names
[21:12:08] SMalyshev: true. but what do we do when we need a canonical URL for the raw content of a page in another namespace?
[21:12:14] also, I'm not entirely sure that the code would work the same for all members of the Data namespace - .map and .tab might work differently?
[21:12:35] for now, the idea is to just serve the raw page content.
[21:12:55] DanielK_WMDE_: probably something other than /data/ since it'd not be data (if it were, why is it not in Data:? :)
[21:12:56] .map and .tab are both JSON, but the URL doesn't tell you that. And it's not necessary
[21:13:50] I'm just a bit worried about over-prescribing it now before we know other use cases. But I don't have too much of a problem with B, it's not a huge issue if it's B
[21:13:55] SMalyshev: I want a URL scheme for raw page content that knows nothing of Commons' magic Data namespace.
[21:14:04] DanielK_WMDE_: could you say more about the requirement to serve a machine readable representation?
[21:14:12] (DanielK_WMDE_: am getting this error message - "Error 404 – File not found https://commons.wikimedia.org/data/Avignon_City_Wall.map We could not find the above page on our servers." - when trying to view A and B)
[21:14:33] Scott_WUaS: none of the options are implemented. we are discussing which one we want.
[21:14:41] Scott_WUaS: it's not implemented yet, we are just discussing which one to implement
[21:14:42] Thanks
[21:15:02] more concretely, which use case do you see enabled by serving some representation to a client that doesn't know what it is going to get?
[21:15:26] gwicke: well, for one thing, it's best practice in the world of linked data to link data. ideally RDF to RDF, but if you have another data format, link it.
[21:15:41] I'd be OK with C too, though I would probably choose B then, just because "data" is a slightly clearer word than "raw" for a wider audience IMHO
[21:15:47] clients may want to do something with it
[21:16:17] okay, so this is targeted at RDF clients in particular?
[21:16:26] SMalyshev: i think /data/ is clearer, but it's a bit confusing together with the Data: namespace. the two instances of data in the URL are unrelated. they don't refer to the same idea
[21:16:40] gwicke: that is the current use case, yes.
[21:16:53] it's intended to be usable in other contexts too
[21:16:59] gwicke: they probably know what they are going to get. I.e., you want to draw all streets in Manhattan, you ask Wikidata for streets that are located in Manhattan, then get their commons geodata property, and that gives you geojson that you can draw on the map
[21:17:02] in that case, I guess the default response should always be RDF
[21:17:13] (Thinking in terms of keeping options open ahead into the future, which options would dovetail best with Google Streetview/Maps/Earth hypothetically?)
[21:17:19] DanielK_WMDE_: what idea is /data referring to?
[21:17:52] TimStarling: the content of the page. as opposed to the HTML rendering of that data.
[21:17:53] gwicke: the output of the URL won't be RDF, it'd be something like (Geo)JSON or text/csv or text/tsv
[21:18:00] what consumes the RDF?
[21:18:07] TimStarling: we could also use /content/, but that's more ambiguous
[21:19:14] gwicke: For a *generic* RDF client, yes. But the clients typically using Wikidata RDF are not general purpose RDF clients. They are special purpose clients for finding and showing a specific kind of information from wikidata
[21:19:27] like inventaire or sumofallpaintings
[21:19:49] yeah we're talking about things like "ask for data and then use it to draw stuff on the map"
[21:20:00] TimStarling: /content-data? /raw-content?
[21:20:16] (... especially Google Streetview's Time Slider function?)
[21:20:16] so it helps those clients to have a link to raw data? they know what to do with it?
[21:20:17] right now we have points, with this one we'll also have shapes
[21:20:27] We may want to reserve a place in the URL for specifying the desired format. But then we are talking about retrieval URLs, not identifiers. For that, I'd use the RESTbase scheme
[21:20:43] TimStarling: yes. SMalyshev gave a good example.
[21:20:55] TimStarling: yes, for a specific property they'd know what they expect to get from that URL
[21:21:09] oh, here's another example, which went live today:
[21:21:10] https://tools.wmflabs.org/monumental/#/object/704354
[21:21:15] A tool to show info about monuments
[21:21:23] a more concrete client we are trying to enable would be useful for the discussion, I think
[21:21:25] that could show the geo-shape of a monument, of it
[21:21:28] * gwicke basically seconds TimStarling's question
[21:21:29] ...if it's known
[21:21:35] is your proposal forwards compatible if we want to link to RDF instead of raw data in the future?
[21:22:12] TimStarling: it's not different - the link is always one URL, the context defines what you expect to find in this URL
[21:22:35] Following up on SMalyshev's question from earlier: what would the canonical URL be on a wiki where "Data:" is localized? Would it be "Data:" or would it be the localized name? The latter would be massively inconvenient when writing software
[21:22:41] for geodata, you'd expect to find some kind of JSON schema that defines geodata
[21:22:48] TimStarling: yes, since it does not restrict the kind of data served. but it leaves it to the client to check the Content-Type header. changing the format might break clients that assume a type.
[21:22:59] RoanKattouw: that exactly is my only issue with including namespace :)
[21:23:15] it seems that we agree about using regular retrieval URLs / APIs for specific clients
[21:23:19] RoanKattouw: the canonical URL will use the canonical namespace name for that wiki.
[21:23:29] so the question is more about the generic client?
[21:23:50] DanielK_WMDE_: So the localized name?
[21:24:07] TimStarling: Do you think the desired format should be embedded in the URL? In RDF land, that is generally not done. Clients rely on content negotiation instead
[21:24:13] In that case I'd ask for at least a redirect from the English name, otherwise just computing the canonical URL in an external tool is a lot of work
[21:24:19] that's also what we do for wikidata's linked data interface
[21:24:26] can you show us an example RDF document?
[21:25:59] does it not have the type in the link?
[21:25:59] gwicke: well, it'd be best if we could have a single nice URL for all data retrieval and not have clients that want json use one URL and clients that want csv use another
[21:25:59] we want a generic URL that says "here is data representation of this dataset" with possible addition of Accept: or such that can represent it in different formats
[21:25:59] Basically like www.wikidata.org/entity/Q704354 works
[21:25:59] RoanKattouw: localized in the wiki's content language, yes. The URL uses the canonical page title. whatever that is for the given wiki
[21:25:59] TimStarling: one that contains a reference to a geo-shape? no, that's not implemented, pending this discussion :)
[21:26:26] well, that gets you to HTML when you visit it with the browser
[21:26:29] is it likely that there will be more data namespaces than Data:? I feel like that would be weird, since Data: is such a general term and there are already two unrelated kinds of data (map and tab) in it
[21:26:30] because of content negotiation
[21:26:41] DanielK_WMDE_: it might be a bit confusing if canonical is not English, but I guess if you go there, you deserve it :)
[21:27:14] is there an actual generic client that follows linked data URLs and can handle a broad set of returned formats?
[21:27:15] TimStarling: no, the URIs we use in RDF do not have the type encoded in the link. clients select the desired type using the Accept header, as SMalyshev said
[21:27:29] lucaswerkmeister: theoretically there can be. Consider File: could have structured data too in the future, for example
[21:28:21] lucaswerkmeister: i anticipate a need to have canonical urls for other page content. We already have them for Wikidata, we currently use wikidata.org/wiki/Special:EntityData/Q1 as the canonical (format neutral) data url
[21:28:22] so /data/x would ultimately not be a simple redirect to /wiki/x?action=raw, it would be a content negotiating endpoint?
[21:28:40] lucaswerkmeister: but you may also want a nice URL for a JS "page", or a translation table, no?
[21:28:41] TimStarling: yes, I think that's the idea.
[21:28:55] at least I think that's a good idea :)
[21:29:07] TimStarling: I'm open to that idea, but I don't want to commit to doing that right now.
[21:29:25] If there is need, we should add content negotiation there
[21:29:29] for now though I'd be ok with sane redirect too
[21:29:32] actually, thinking about it
[21:29:48] I'd really like to replace wikidata.org/wiki/Special:EntityData/Q1 with wikidata.org/data/Q1
[21:29:50] if we have only one format implemented, negotiation is kind of futile anyway
[21:29:52] what do you think SMalyshev?
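The Accept-header mechanism described above can be sketched from the client side. This is a minimal illustration, not code from any of the tools mentioned: it builds a request for the Wikidata concept URI pattern quoted in the discussion and asks for one specific format. The function names and the User-Agent string are made up for the example, and the choice of Q1 simply follows the item used in the log.

```python
from urllib.request import Request, urlopen

# Concept URI pattern mentioned in the discussion; Q1 is only an example item.
ENTITY_URI = "https://www.wikidata.org/entity/Q1"


def build_request(uri: str, mime_type: str) -> Request:
    """Build a GET request that asks for one specific representation."""
    headers = {
        "Accept": mime_type,
        # Hypothetical client name; a real tool would identify itself properly.
        "User-Agent": "content-negotiation-example/0.1",
    }
    return Request(uri, headers=headers)


def fetch(uri: str, mime_type: str) -> tuple[str, str, bytes]:
    """Fetch uri and return (final URL, Content-Type, body).

    urlopen follows redirects by default, so the final URL can differ
    from the format-neutral URI that was requested.
    """
    with urlopen(build_request(uri, mime_type)) as response:
        content_type = response.headers.get("Content-Type", "")
        return response.geturl(), content_type, response.read()
```

Calling `fetch(ENTITY_URI, "application/json")` would perform the network request; comparing the returned final URL with `ENTITY_URI` shows whether the server redirected to a format-specific URL.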
[21:30:01] it would be really nice to be able to use the same URL pattern there
[21:30:10] DanielK_WMDE_: except the need to reload all dumps, I don't see any problem with it
[21:30:13] why not just /wiki/{title}?
[21:31:00] gwicke: I think it's harder to make negotiation work on /wiki/{title} especially with tons of options already using this URL.
[21:31:05] there are ways to link to linked data from HTTP
[21:31:10] and HTML
[21:31:21] gwicke: because that's the UI. we could also implement content negotiation there, true. But it's a bit dirty. When asking for HTML, you'd not be getting just HTML content, but HTML + chrome
[21:31:31] a client would already know what to ask for anyway
[21:31:37] also may cause confusion between "the data in the page" and "the page with all its GUI in all its glory"....
[21:32:04] gwicke: i think it would be good to have an easy distinction between "URL of the data" and "URL of the page I want to browse"
[21:32:20] SMalyshev: exactly
[21:32:21] yeah the chrome issue, so I'm a bit wary about reusing that URL... too much legacy on it
[21:32:37] that's a representation change only though, isn't it?
[21:32:38] gwicke: how would you distinguish "HTML rendering of content" from "UI to interact with content and the site"?
[21:32:53] no, that's not representation only, that's semantics in my mind
[21:33:16] gwicke: well, suppose we did want to see just HTML representation of the data, sans chrome - how would we ask for it then?
[21:33:33] gwicke: i'd rather use the REST API URL than using the page URL. But the /data/ URL would be even better, I think :)
[21:33:43] we could of course invent more parameters, but it becomes a bit ugly
[21:33:50] real clients can only process specific formats
[21:34:00] so they need a way to discover & ask for them
[21:34:36] this problem is basically the same between HTML, HTML+Chrome, some JSON flavor, some XML flavor and so on
[21:34:38] in the context of RDF, the standard mechanism for that is the Accept header
[21:34:41] discover not necessarily, we are not super-good at that now, ask for them - yes
[21:35:17] honestly - i don't expect the content negotiation use case to become relevant soon
[21:35:22] gwicke: technically yes, but conceptually I think "browser view" vs. "raw data" are different enough to warrant different URLs
[21:35:35] DanielK_WMDE_: is there a generic RDF client that implements content negotiation & asks for specific formats?
[21:35:37] what I want is a pragmatic solution for "reliable pretty URL for raw page content".
[21:36:05] gwicke: typically, they ask for one specific format. typically Turtle.
[21:36:20] gwicke: not for geoJSON data as far as I know. It wouldn't know how, since the fact that it's geoJSON data is a specific property of Wikidata (not even Wikibase)
[21:36:41] it's because it is a value of P-whatever that we know it's geoJSON
[21:36:44] i don't know of a general purpose RDF client that can handle geoJson. that doesn't seem a relevant use case to me
[21:36:49] so a generic tool wouldn't be able to do it, but a specialized one would
[21:37:05] I think it is better to have the content negotiating endpoint be a separate URL from the UI
[21:37:12] a generic tool would be able, given the URL, to extract raw data from that URL
[21:37:18] are there ways to mark up alternate links at the source?
[21:37:26] but it won't know the context of it
[21:37:37] as in link-to-turtle-here and link-to-geojson-there
[21:37:39] I think it's good to have an opaque title string so that the implementation is not dependent on hairy MW namespace normalisation/i18n, i.e. options B/C/D
[21:38:06] gwicke: though content negotiation *is* widely used in linked data world, just not for this specific use case. For federation - all the time.
[21:38:13] TimStarling: what do you mean by opaque title string?
[21:38:35] "Data:Avignon_City_Wall.map" means something to MW
[21:38:41] to me Data:foo is "opaque" in the sense of "whatever mw uses"
[21:39:19] it's opaque to the content negotiating frontend, it isn't decomposed into namespace and dbkey
[21:39:54] TimStarling: ah - you are saying you prefer B/C/D because the content negotiation mechanism doesn't have to know about namespaces?
[21:40:03] yes
[21:40:11] ok. i agree
[21:40:45] B vs C is just name-shedding. So let's concentrate on B vs D
[21:41:18] In my mind, REST API URLs are good for referring to a specific blob of bytes. a specific revision in a specific format.
[21:41:24] I think it would be nice to have a toy content negotiation script, which validates the Accept header in some trivial way
[21:41:42] I'd like the format-neutral (possibly content negotiating) current revision URL to be distinct from that, and simpler
[21:41:44] just to make it clear what our principles are
[21:42:03] I like B more out of those. Note that URLs are forever. So the longer and more specific a URL we have, the harder we will have to work to make it work forever.
[21:42:11] the issue with content validation & caching proxies is always normalizing accept headers without turning the caching layer into a mess
[21:42:36] alternative is to forgo caching
[21:42:42] TimStarling: for wikidata, we currently do content negotiation in PHP, via the Special:EntityData page.
[21:43:07] gwicke: content negotiation generally uses 303 redirects. the accept header shouldn't be relevant to the caching
[21:43:29] the cacheable response will be requested from a format specific url
[21:43:31] if you don't cache the redirect, sure
[21:43:42] Shouldn't Vary take care of it anyway? Or we're talking about silly caches that don't do Vary properly?
[21:43:43] that's forgoing caching
[21:43:51] for wikidata, we do /entity/Q1 -> /wiki/Special:EntityData/Q1 -> /wiki/Special:EntityData/Q1.json.
[21:43:58] the last response is cacheable
[21:44:01] the others are 303s
[21:44:28] SMalyshev: a plain vary: accept would fragment your cache to the point of making it fairly useless
[21:44:28] gwicke: the negotiation is not cached. the blob is.
[21:44:41] no vary:Accept needed
[21:44:41] I don't think btw we're going to get a deluge of .map requests that would make not caching redirects an issue anytime soon...
[21:45:51] TimStarling: do you think doing the content negotiation in PHP/MW for now is OK? Special:PageData would behave similar to Special:EntityData
[21:46:07] or do you want it in nginx/apache/varnish?
[21:46:33] yes, it's fine, and doesn't change my opinion on having the namespace in the URL
[21:46:46] agreed
[21:47:07] do you think we should redirect to action=raw? or the REST API?
[21:47:09] should /data/x return a 303 to /wiki/x?action=raw from the start?
[21:47:13] we could also serve data without redirecting
[21:47:26] but then we do get the cache fragmentation that gwicke is referring to
[21:47:36] TimStarling: yes, that's my question
[21:47:51] if we return 200 from the outset then we will be afraid to do 303s in the future for fear of breaking existing clients, right?
[21:48:08] i don't think it would break anything but caching
[21:48:29] but 303s are best practice for content negotiation in oD
[21:48:32] *LoD
[21:49:14] TimStarling: https://www.w3.org/TR/cooluris/#r303gendocument
[21:49:18] #link https://www.w3.org/TR/cooluris/#r303gendocument
[21:49:36] e.g. in curl, following the Location header is non-default, so we have to make sure clients that don't set FOLLOWLOCATION are broken
[21:49:56] to achieve this, it's best to use a redirect from the start
[21:51:01] can 303 responses be cacheable if we want that initially?
[21:51:18] I think this nicely illustrates the conflict between the linked data ideal of "here is a URL, the sufficiently smart client will figure it out" and the reality of limited clients having strong expectations about returned content formats etc
[21:51:21] the 303 itself? could, but you'd have to vary on accept
[21:51:54] ok, I'll take that as a no
[21:52:09] gwicke: yes, indeed. so let's build something with a sane default and the option to make it smarter later
[21:52:30] I was just wondering about the case where you have only one content type, whether there is an optimisation
[21:52:41] could optimise in VCL I guess
[21:52:47] I think we should return 303s and if the client can't do it, it should learn to :) most HTTP toolkits have the capability, though not all enable it by default
[21:53:18] TimStarling: if the negotiation code knows that there is only one acceptable format, then 303s can be cached, yes, with no vary on accept.
[21:53:30] TimStarling: yes, could hardcode it for now and then un-hardcode later if we need more
[21:53:46] if another type was requested, you'd get a... oh, 406 or whatever it was
[21:54:03] ok, straw man time:
[21:54:23] rewrite /data/Foo to /wiki/Special:PageData/Foo.
[21:54:40] Special:PageData will check the Accept header (if any) against the mime type of page Foo.
[21:55:09] If it matches, it will respond with a 303 redirect to /w/index.php?title=Foo&action=raw
[21:55:30] If there is no match, it will respond with the appropriate error code (406 or something)
[21:56:10] The 303 is cacheable, until we decide to implement content transformation to support more formats
[21:56:30] once we do, the 303s become uncacheable, or we need some serious logic to normalize Accept headers
[21:56:41] good?
[21:56:41] but then if a client sends Accept: foo/bar varnish will respond with the cached 303, not with a 406
[21:56:50] you have to send 303 from MW unconditionally to make it cacheable
[21:57:06] or have VCL implement Accept header validation
[21:57:21] hmmm... you are right. so scratch that part.
[21:57:25] redirects are not cacheable
[21:58:58] a possible VCL hack might be to check if */* is in the Accept list, since most clients ultimately send it
[21:59:27] browsers do. specialized clients... not sure
[21:59:34] but yes, it would be an option
[22:00:23] yeah specialized clients usually don't send */* since they only know to handle a narrow set of formats
[22:00:25] will users be waiting for these responses? is latency critical?
[22:00:41] browsers are omnivores
[22:00:55] I'm being a bad chair, it is :00
[22:00:59] TimStarling: you mean humans? quite possibly, yes. i imagine web apps and mobile apps as clients
[22:01:27] then best to do the VCL hack
[22:01:34] you know - click a qr code on the wall of the ancient fort you are visiting, and get a nice map overlay
[22:01:37] action items?
[22:01:41] time to wrap up now
[22:01:46] totally possible people would load it via JS, yes.
[22:02:13] TimStarling: i'll turn the straw man into a proposal. then we can discuss it again - or make it a last call.
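The straw man can be sketched as a small decision function. This is only an illustration of the steps as stated in the meeting, not the Special:PageData implementation (none existed at the time): the function names are invented, the page's mime type is passed in by the caller (application/json is assumed for .map pages, since the log says they are JSON), and Accept parsing is deliberately simplified. It honours q=0 and wildcards but does not implement the full precedence rules for overlapping media ranges.

```python
from urllib.parse import quote


def accepts(accept_header, mime_type):
    """Return True if mime_type satisfies the Accept header.

    A missing or empty header accepts anything. Quality values are only
    used to detect q=0, which marks a media range as not acceptable.
    """
    if not accept_header or not accept_header.strip():
        return True
    mime_type = mime_type.lower()
    main_type = mime_type.split("/", 1)[0]
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0].lower()
        params = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
        try:
            q = float(params.get("q", "1"))
        except ValueError:
            q = 1.0  # assumption: treat a malformed q as if it were absent
        if q <= 0:
            continue
        if media_range in ("*/*", mime_type, main_type + "/*"):
            return True
    return False


def page_data_response(title, page_mime_type, accept_header=None):
    """Return (status, headers) for a request rewritten from /data/<title>."""
    if accepts(accept_header, page_mime_type):
        location = "/w/index.php?title=" + quote(title, safe=":") + "&action=raw"
        return 303, {"Location": location}
    return 406, {}
```

For example, `page_data_response("Data:Avignon_City_Wall.map", "application/json", "text/turtle")` takes the 406 branch, while the same call with `"application/json"` or no Accept header takes the 303 branch.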
[22:02:28] we can bikeshed over the name in the path again, and the name of the special page :)
[22:02:39] (and whether it should rather be an action handler)
[22:02:44] #action DanielK_WMDE_ to write up a proposal based on discussion outcome, then move RFC to last call
[22:02:54] \o/
[22:02:59] thanks all for your input!
[22:03:23] #endmeeting
[22:03:23] Meeting ended Wed Mar 29 22:03:23 2017 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
[22:03:23] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-03-29-21.01.html
[22:03:23] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-03-29-21.01.txt
[22:03:23] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-03-29-21.01.wiki
[22:03:23] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-03-29-21.01.log.html
[22:03:34] there are two meetbots!
[22:03:42] hehehe
[22:03:46] better than zero I guess
[22:03:52] i hope they don't try to write to the same file at the same time...
[22:04:01] :)
[22:04:07] they are trying, the URLs are the same
[22:11:10] then let's hope they are not running on windows ;)
[23:04:20] [17:18] TimStarling: we could also use /content/, but that's more ambiguous [than data]
[23:04:36] Lawl at data being comparatively less ambiguous.
[23:09:39] We currently have at least action=raw, action=render, action=view, printable=yes.
[23:10:29] I'm still not clear on the actual purpose of that meeting, but I wonder whether it makes sense to re-visit some URL-related caching decisions.
[23:10:38] And/or using the URL as the cache key.
[23:13:18] [17:04] To do so, we need a nice canonical URI for the page content (not the HTML page)
[23:13:53] I'm still not really buying this argument. It feels like saying "we need a pretty/nice api.php?action=query URL", which is just silly. Who cares if the URL is ugly? The application code?
[23:15:46] Anyway, when it's an actual RFC, maybe I'll comment.
[23:16:30] did you miss the bit about content negotiation?
[23:17:10] [17:37] I think it's good to have an opaque title string so that the implementation is not dependent on hairy MW namespace normalisation/i18n, i.e. options B/C/D
[23:18:15] What's an opaque title string? A positive integer page ID?
[23:19:19] no, a page ID is not a title string
[23:19:42] you know what a title is
[23:19:45] Sure.
[23:20:08] I find https://phabricator.wikimedia.org/T161527 really lacking in the problem statement.
[23:20:18] > There is currently no canonical URI/URL for referring to and retrieving these data sets.
[23:20:30] Versus several paragraphs jumping into potential solutions.
[23:20:59] I think pretty URLs are pretty dubious in general. Programmers/developers seem to care about them more than any actual user/reader.
[23:21:53] TimStarling: So the last resort argument of people to defend pretty URLs is then to say, well the URL is used for caching!
[23:22:36] Which I guess brings me back to wondering if we should just care less about using the URL as a cache key.
[23:22:48] it probably won't be cacheable, you should really read the whole log
[23:23:03] Okay.
[23:23:15] I saw the part about 303s not being cacheable.
[23:23:32] I skimmed over it as I'm not sure I believe it.
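The caching concern from the meeting (a plain Vary: Accept fragments the cache, so Accept headers would need normalizing, for instance by checking for */*) can be illustrated with a toy cache key. This is a hypothetical sketch in Python rather than VCL, with invented function names; it assumes a single served type and deliberately ignores q-values and type/* ranges.

```python
def normalize_accept(accept_header, served_type):
    """Collapse an Accept header into one of two cache buckets.

    "ok"     -> the header lists served_type or */*, or is absent
    "reject" -> anything else, which the straw man would answer with 406
    """
    if not accept_header or not accept_header.strip():
        return "ok"
    ranges = {
        item.split(";", 1)[0].strip().lower()
        for item in accept_header.split(",")
    }
    return "ok" if ranges & {"*/*", served_type.lower()} else "reject"


def cache_key(path, accept_header, served_type):
    """Key a cached 303 on the path plus the normalized Accept bucket."""
    return (path, normalize_accept(accept_header, served_type))
```

With this normalization, a typical browser header and a bare `application/json` header map to the same key for a given path, whereas keying on the raw header string would store a separate entry for every distinct header.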