[21:00:16] #startmeeting RFC meeting [21:02:39] ECDSA host key for bastion-restricted.wmflabs.org has changed and you have requested strict checking. [21:03:16] I think Andrew mentioned something like that earlier.. [21:03:44] poor meetbot, sick again... [21:03:54] status in #-labs says network changes and the last time it did that (failed bastions to other hosts with different keys) [21:03:56] rebooted, don't know about keys though [21:04:35] * robla waves [21:04:40] bastion-01.bastion.eqiad.wmflabs seems to have not changed keys [21:04:50] hola robla [21:05:00] I'm back working (at least part-time) this week [21:05:20] awesome news! [21:05:25] * aude waves [21:05:42] welcome back robla [21:05:42] o/ [21:05:49] well, if someone who is able to log into tool labs could restart meetbot, that would be useful [21:05:57] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Tools/meetbot [21:06:38] https://wikitech.wikimedia.org/w/index.php?title=Help:SSH_Fingerprints/bastion-restricted.wmflabs.org&action=history doesn't have any updates [21:07:31] ah technology [21:07:49] in the worst case, should we just do some manual #info grepping afterwards? [21:08:27] DanielK_WMDE_: would you like to give an intro? [21:08:27] marktraceur might be able to restart meetbot [21:08:36] gwicke: yea, can do [21:09:07] I can't get logged into tools-dev or tools-login at all; probably the maintenance they are doing... [21:09:14] ok, let's start then [21:09:18] So, I'd like to get a first round of feedback today on my proposal for supporting multiple content streams per page [21:09:20] #link https://phabricator.wikimedia.org/T107595 [21:10:21] The idea is that we want to have a) multiple user-editable content objects on a single page (e.g. 
wikitext plus structured data for categories plus extra info for, say, the lead image for mobile) [21:10:57] ...and b) we'd want to permanently store various derived kinds of data for a given revision (rendered html, diff, blame map, etc) [21:11:39] To allow this, I propose to introduce another level of indirection between the revision record and the actual data blob (currently in the text table or external store) [21:11:55] we now have: page -> revision -> text (-> ext store) [21:12:33] we would then have: page -> revision -> slots -> urls; urls can refer to the text table, or ext store, or whatever other storage mechanism we like, e.g. RESTbase [21:13:16] any questions about the idea so far? [21:13:40] the capability to store multiple bits of content per revision is definitely something we need [21:13:45] * aude thinks this would be nice from a caching perspective [21:13:47] reminds me of resource forks on classic mac os :) [21:14:03] brion: NTFS calls them streams, I think [21:14:04] the same need prompted the creation of RESTBase in the first place [21:14:13] e.g. we need to invalidate the site links html on wikidata (because of dom change), but not have to invalidate everything [21:14:21] what do you think will be the first use case? [21:14:53] i'm a little nervous about being able to update 'derived' slot data [21:14:58] TimStarling: structured data associated with file description pages would be an obvious use case for two "primary" content objects [21:15:01] * brion likes immutable things [21:15:32] brion: either we have updates, or we have sub-revisions.
think of the derived data as a persistent parser cache [21:15:33] basic idea of multiple data blobs sounds useful for many cases [21:15:44] *nod* [21:15:47] (which would be one function it may actually have) [21:15:50] we have a lot of use cases in RB land, like HTML, data-parsoid, data-mw, revscore data, derived mobile html, etc [21:15:53] it seems like the problem of managing multiple resources gets a lot harder with the strictly linear revision model MW currently has [21:15:57] sub-revisions? [21:16:00] this does bring us back to classic debates such as 'should viewing an old revision show you the old versions of the images/templates' [21:16:19] and this extends into current revisions & template/etc updates [21:16:25] robla: with respect to the page history, any edit to any of the resources would create a new revision [21:16:56] DanielK_WMDE_: edit conflicts will be a lot more fun! :-) [21:16:58] while template updates for example would probably not create a new revision [21:17:27] they'd only add a new 'render' [21:17:34] if the page has streams A, B, and C, revision 1 would be (A1, B1, C1). If an edit changes only stream B, the second revision would be (A1, B2, C1). The unchanged slots would point to the same data blobs again [21:18:06] sure [21:18:11] and each slot is immutable? [21:18:16] robla: the code for displaying diffs and handling conflicts would need to be extended to support multiple objects. [21:18:32] robla: but it's not a new problem. it's like git doing a diff/patch over multiple files [21:18:42] it's not conceptually difficult.
but yes, code needs to be written [21:18:44] you call it a subrevision but it has a new rev_id, it is not like we have a subrevision ID [21:18:49] the main cases we are talking about here are a) primary data, and b) derived data [21:18:51] it is an actual revision [21:19:13] derived data would normally be updated whenever the primary is updated [21:19:21] adding revisions to the visible page history on template/image update would drive editors mad, I worry [21:19:21] it will appear in page history as such [21:19:22] the relationship between primary data items is more interesting [21:19:25] but it might actually be a good idea [21:19:33] TimStarling: well, there are user-editable slots, which would be immutable. and there are derived slots, which are mutable. The data url associated with that slot of the revision would be replaced with a different data url [21:20:05] if you have structured data which is not derived from wikitext, and allow that to be edited, then you need to represent that in history [21:20:05] DanielK_WMDE_: would it be too complicated to have the mutable and immutable slots be distinct? 
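[Editor's note: the (A1, B1, C1) -> (A1, B2, C1) behaviour described above amounts to copy-on-write over slot addresses. A minimal Python sketch of that idea, with invented blob addresses; this illustrates the concept, not actual MediaWiki code:]

```python
def save_edit(parent_slots, changed_slots):
    """Build the slot map for a new revision: slots untouched by the edit
    keep pointing at the same data blobs as the parent revision."""
    slots = dict(parent_slots)
    slots.update(changed_slots)
    return slots

# revision 1 has streams A, B, C; an edit changes only stream B
rev1 = {"A": "blob:A1", "B": "blob:B1", "C": "blob:C1"}
rev2 = save_edit(rev1, {"B": "blob:B2"})

assert rev2 == {"A": "blob:A1", "B": "blob:B2", "C": "blob:C1"}
assert rev2["A"] is rev1["A"]  # unchanged slot shares the very same blob reference
```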
[21:20:10] eg stored/accessed separately [21:20:22] TimStarling: my current idea is to not have sub-revisions, but to update derived data of revisions silently, just like we update the parser cache silently when templates change [21:20:23] i can see convenience in putting them together (one common infrastructure) [21:20:27] TimStarling: it depends on how central that information is to the article [21:20:39] but also the other way (immutable storage can be "really" immutable, archived to different disks, etc) [21:20:44] for example, it is debatable whether a lead image update is critical to the article content [21:20:47] brion: i think having images/template updates in page history is something that users would want (generally) [21:20:52] brion: it would add complexity, but it's a possibility, I think [21:20:54] lead images should be reviewable [21:20:57] * aude imagines a filter for it though [21:21:03] TimStarling: +2 [21:21:14] aude: +2 :) [21:21:18] brion: my idea was to code the distinction between mutable and immutable deep into the storage service. [21:21:18] yeah, but that doesn't necessarily mean that everything has to be the same kind of 'edit' [21:21:29] it just means that it needs to be trackable & reviewable [21:21:32] the storage layer would just refuse to update primary (user editable) slots [21:21:40] there would be a flag for that in the database [21:21:42] DanielK_WMDE_: so same API to access them, but potentially could be separate backend storage (something that the frontend won't have to worry about) [21:22:10] TimStarling: lead images would count as primary content. it's not derived data. changing them would create a new revision. [21:22:37] DanielK_WMDE_: annotations about the page content too? [21:22:43] well today, lead images are extracted from the list of images available in a file -- they literally are derived data [21:22:52] so what happens if something that's derived today becomes content tomorrow?
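[Editor's note: a rough sketch of the "storage layer would just refuse to update primary slots" idea above, with the per-slot mutability flag. Class, method, and error names are invented for illustration; this is not the proposed implementation:]

```python
class SlotStorage:
    """Toy storage service: derived slot addresses may be replaced
    (like refreshing a persistent parser cache); primary, user-editable
    slots are write-once."""

    def __init__(self):
        self._rows = {}  # (rev_id, role) -> {"address": ..., "derived": ...}

    def put(self, rev_id, role, address, derived):
        key = (rev_id, role)
        row = self._rows.get(key)
        if row is not None and not row["derived"]:
            # the flag in the database marks this slot as primary: refuse
            raise PermissionError(
                "primary slot %r of revision %d is immutable" % (role, rev_id))
        self._rows[key] = {"address": address, "derived": derived}

store = SlotStorage()
store.put(7, "main", "tt:100", derived=False)
store.put(7, "html", "rb:7/html/1", derived=True)
store.put(7, "html", "rb:7/html/2", derived=True)   # allowed: derived slot
try:
    store.put(7, "main", "tt:999", derived=False)   # refused: primary slot
except PermissionError:
    pass
```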
[21:22:54] brion: the storage of the actual blob could be configured per slot. wikitext could go one place, wikibase json another, html yet another. media content could live on the file system directly. [21:23:18] data urls are flexible. that's another layer of abstraction that i have only hinted at in the rfc, since i didn't want to overburden it [21:23:21] so derived data is not treated as an archive, for backups etc.? [21:23:40] gwicke: annotations could be stored separately, sure [21:23:58] TimStarling: depends on the interest [21:24:08] there is interest in HTML dumps, for example [21:24:11] TimStarling: that's my current thinking, yes: primary data is archived and reviewed, derived data is treated much like the parser cache. [21:24:22] doesn't even have to be persistent, if that is not desired. [21:24:35] dropping caches can be hell on performance though [21:24:43] of course [21:24:44] i would recommend keeping the data unless it's actually invalidated :) [21:25:11] yes, i just mean to say that the architecture can accommodate volatile as well as persistent data [21:25:17] And speaking of invalidation... I'm envisioning suppression could get a little messy [21:25:32] csteipp: why? [21:25:42] you could still suppress a revision, just like you do now [21:25:51] derivation makes suppression definitely more interesting [21:25:56] People might want to be able to suppress parts of content rather than the whole content? [21:26:04] there are more dependencies to track [21:26:27] Right, if something derived includes user_text, then renames / suppression gets hard. But we can manage it. [21:26:28] Krenair: that's a new feature, which would be doable, but i'm not sure whether it's terribly useful, or worth the cost [21:26:45] csteipp: just like the parser cache.
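[Editor's note: per-slot blob storage selected via data URLs, as brion and Daniel discuss above, could amount to scheme-based dispatch. The schemes ("tt", "es", "rb") and backend behaviours below are assumptions made for the sketch, not actual MediaWiki identifiers:]

```python
def make_fetcher(backends):
    """Return a fetch function that routes a blob URL to the storage
    backend registered for its URL scheme."""
    def fetch(url):
        scheme, sep, ref = url.partition(":")
        if not sep or scheme not in backends:
            raise ValueError("no blob backend for URL %r" % url)
        return backends[scheme](ref)
    return fetch

fetch = make_fetcher({
    "tt": lambda ref: "text-table row " + ref,       # classic text table
    "es": lambda ref: "external-store blob " + ref,  # External Store cluster
    "rb": lambda ref: "RESTBase object " + ref,      # e.g. stored HTML
})

assert fetch("tt:1234") == "text-table row 1234"
assert fetch("rb:enwiki/123/html") == "RESTBase object enwiki/123/html"
```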
[21:27:23] it's a bit more interesting than that [21:27:42] once you store derived content that's composed, you have to track those historical dependencies [21:27:52] that's what we realized while working through this in RB [21:28:05] gwicke: yes, any content blob can depend on any other content blob [21:28:16] dependency tracking would happen slot-to-slot. [21:28:25] nice [21:29:19] implementing this is not a requirement though. we don't *have* to store html for all revisions. we don't have to put any html in there at all. that's just one possible use case. [21:29:44] full fine grained dependency tracking would be cool, but poses some challenges wrt scalability. [21:29:53] blob-to-blob dependencies can easily go into the billions [21:30:24] we'll see what we can do ;) [21:30:45] it's definitely not a trivial problem [21:30:46] one thing i'm wondering about is backwards compat of the database schema. [21:31:00] Is it even possible? [21:31:03] would extensions be able to add arbitrary slots to pages? [21:31:14] 'easy' way -- store primary slot in text table, others in other table ... [21:31:18] do we still want the revision table to point to the text table at least for the "main" slot, so tools working directly against the database wouldn't break? [21:31:22] legoktm: or some service [21:31:27] but on labs, the text table is useless anyway... [21:31:29] tools working directly against the database should die [21:31:32] legoktm: yes. [21:31:36] at least for reading text :) [21:31:37] it's fairly easy to set up a service that keys on title/revision [21:31:39] no, the easy way is to use rev_text_id for the main content [21:31:50] brion: +1 [21:31:51] What if the "main" content changes? [21:32:03] E.g. move from wikitext to HTML. [21:32:06] there's no need to have a second text table [21:32:07] (late to the party .. ignore if already addressed) there seems to be some similarity in this rfc and what restbase wants to do ..
but i suppose this is more a proposal to change core mediawiki storage abstractions, and not just about a storage implementation? [21:32:09] brion: yeah [21:32:10] Are we just kicking that down the road? [21:32:16] James_F: ideally, you only switch format on new revisions [21:32:21] TimStarling: yes. for b/c, we could duplicate the link to the main content there [21:32:25] means you keep a wikitext parser around forever of course [21:32:31] brion: Is that ideal? Yeah. :-( [21:32:31] for consistency, i'd also want to have it in the new table [21:32:38] James_F: immutable data 4evahhhh [21:32:48] we have it in separate storage right now [21:33:09] subbu: exactly. in my mind, RESTBase would be one of the storage mechanisms used by core [21:33:16] brion: But "main type is X for rev < A, Y for A < rev < B, Z for rev >= B… [21:33:28] subbu: yeah imo the abstraction & how it affects our internal and external apis is the important part [21:33:34] DanielK_WMDE_: one question I had on the task is whether you see this as a wrapper for every service providing revision-related data out there from MW's perspective [21:33:52] DanielK_WMDE_, brion thanks .. (will read backlog after). [21:33:56] gwicke: Do you mean, would we also want to do page properties like this? [21:34:12] the obvious alternative to this proposal is to have multipart content for what we are calling "primary content", and keep derived content merely linked, like it is now [21:34:15] James_F: well, type of the main revision in a particular item should probably be able to change over time (even from arbitrary rev to rev maybe) [21:34:35] gwicke: yes, pretty much. i mean, nothing keeps some 3rd party service from providing extra data associated with a page revision, without it being recorded in the db. but we'd get the infrastructure that could record any such association of extra content [21:34:41] brion: Where would we store that knowledge? [21:34:50] brion: Just the current CH type?
[21:34:59] James_F: type should be stored along with the revision i think, conceptually at least :D [21:35:02] * brion rereads [21:35:08] with multipart content, the interface changes would be limited to EditPage and its consumers, instead of also touching Revision [21:35:16] TimStarling: it would be nice to at least have a standard infrastructure for storing such associated data, and for storing and querying the links to the revisions. [21:35:22] which is pretty much what i'm proposing [21:35:24] DanielK_WMDE_: we are looking into storing wikitext in Cassandra as well [21:35:50] brion, James_F: type already is stored with the revision. then it would be stored per slot per revision. [21:35:53] nothing short term, but it's a possibility [21:36:03] DanielK_WMDE_, what are the non-derived streams in a revision that you envision, concretely? [21:36:08] the RFC says [21:36:24] metadata, html, images, are all derived data .. from wikitext. [21:36:27] about why not to use multipart... [21:36:29] "1. they are more flexible and more efficient with respect to blob storage" [21:36:38] which could be addressed at the storage layer [21:36:40] gwicke: yea, you can implement blob storage services based on Cassandra, or whatever you like [21:36:46] "2. they avoid breaking changes to APIs that allow access to raw page content, by presenting the content of "main" slot there per default. Attempting the same with multi-part revisions would lead to round-trip issues when only the main part of the content gets posted back from an edit."
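[Editor's note: the round-trip issue named in point 2 of the RFC quote above can be illustrated in a few lines. This is a toy model of the behaviour being argued, not actual API code; the part names and content strings are made up:]

```python
# a revision with a primary wikitext part plus a structured-data part
multipart_rev = {"main": "old wikitext", "mediainfo": "structured data"}

def old_client_edit(fetched_text):
    # an older client that only knows about raw wikitext
    return fetched_text + " (edited)"

posted = old_client_edit(multipart_rev["main"])

# multipart: the whole content object is replaced by what got posted back,
# silently dropping every part the client did not know about
naive_save = {"main": posted}

# slot model: the edit only targets the main slot; other slots carry over
slot_save = dict(multipart_rev, main=posted)

assert "mediainfo" not in naive_save                          # data lost
assert slot_save["mediainfo"] == multipart_rev["mediainfo"]   # data kept
```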
[21:36:54] there is definitely a lot of overlap between this RFC and RESTBase [21:37:02] which could be addressed by having a reassembly layer between the API and Revision [21:37:24] it's basically moving the MW-internal storage to a model that's closer to RESTBase's [21:37:35] subbu: wikitext (obviously), media info (associated with file description pages), lead image data, possibly tags and categories (no more need to put them into the wikitext) [21:38:07] gwicke: yes, I think the two go well together. [21:38:09] And TemplateData, Graphs. [21:38:15] why is it nice to have a standard infrastructure for derived data? [21:38:16] Other things that abuse PageProps. [21:38:24] James_F: templatedata, yes! [21:38:39] it does not seem modular [21:38:43] i'd say there's some definite use for things that are done as subpages today [21:38:54] but that opens the 'why not just use subpages?' question ;) [21:39:09] subtitles in various languages for a video [21:39:15] subbu, James_F: also, template definition and documentation could be separate wikitext objects on the same page. no more need for cruft. [21:39:17] brion: Because sub-pages suck. [21:39:18] data table for a graph [21:39:19] to me, the question is mostly 'why should derived data be stored in MediaWiki'? [21:39:29] James_F: indeed. ;) being able to treat something as a unit is nice [21:39:33] DanielK_WMDE_, ok, so, you are proposing to change the current monolithic wikitext model .. to, at the very least, consider metadata and core data as separately and independently editable. [21:40:03] gwicke: yeah, that's what I mean, the MW borg is trying to assimilate all your data [21:40:04] So #REDIRECT[[]] would be deprecated (or, at least, not actually stored in the wikitext blob)? [21:40:08] TimStarling: to define a new slot for derived data, you give the name of the slot, and register the blob storage handle to be used with it. that's it. you can plug in whatever you like.
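[Editor's note: the slot registration Daniel describes above ("give the name of the slot, and register the blob storage handle") suggests a simple registry. All names and the registry shape below are guesses for illustration, not the real interface:]

```python
SLOT_REGISTRY = {}

def register_slot(role, storage_handle, derived=False):
    """Declare a slot role and the blob storage it uses. Extensions could
    call something like this to plug in arbitrary slots."""
    if role in SLOT_REGISTRY:
        raise ValueError("slot role %r already registered" % role)
    SLOT_REGISTRY[role] = {"storage": storage_handle, "derived": derived}

register_slot("main", "external-store")                # primary wikitext
register_slot("mediainfo", "external-store")           # primary structured data
register_slot("html", "restbase", derived=True)        # derived rendering

assert SLOT_REGISTRY["html"]["derived"] is True
assert not SLOT_REGISTRY["main"]["derived"]
```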
[21:40:10] it's not modular [21:40:19] DanielK_WMDE_: What derived data would you put there? [21:40:27] TimStarling: the advantage is that it is simple to find all data associated with a revision [21:40:31] TimStarling: yeah [21:40:36] I mean, if you had blame maps, that is probably a massive system written in some other language [21:40:45] James_F: diffs, blame maps, rendered html, ... [21:40:53] DanielK_WMDE_: I was imagining multiple 'real' slots ('primary' in your language). [21:40:55] listening on a change bus for article text changes [21:40:57] DanielK_WMDE_: https://en.wikipedia.org/api/rest_v1/?doc is providing a listing currently [21:41:15] so now it has to be half written in PHP and store its text in derived slots? [21:41:26] it's not an exhaustive list, for sure, but it's growing [21:41:26] James_F: yes, multiple primary slots for things like template data, template docs, media info, etc [21:41:39] DanielK_WMDE_: OK. So there'd be 'content' (one of which was 'primary') and 'derived' (which are derived from… all the content? just the primary content?)? [21:57:53] #action DanielK_WMDE_ to clarify interaction with services like RESTBase [21:58:20] #info James_F thinks B/C adds complexity with minimal benefits. Brion thinks that 3rd party code that accesses the database directly should die. [21:58:30] * James_F grins. [21:58:45] ok, I suppose we will need another meeting on this some time? [21:58:50] #info may be good to specifically mention access APIs [21:58:51] my earlier question: is the multi-part content alternative question resolved, or does it need to be addressed in the RFC? [21:58:55] :) [21:59:00] #info gwicke strongly cheers for establishing clear APIs for the storage layer [21:59:18] subbu: i tried to describe the downsides of the multi-part approach there. but we didn't discuss it [21:59:28] * subbu will read rfc [21:59:28] gwicke: i'm all with you there :) [21:59:59] subbu: please comment!
[22:00:26] in next week's RFC meeting we will discuss my tidy RFC https://phabricator.wikimedia.org/T89331 [22:00:34] #info gwicke prefers the association of data with revisions to be programmatic, rather than materialized in the sql database [22:00:34] #info gwicke sceptical about the utility of the indirection between blob and per-blob metadata [22:00:35] although we may be short on numbers since there is a management offsite [22:01:17] thanks everyone [22:01:24] #endmeeting [22:01:29] * TimStarling says, as if meetbot is listening [22:01:38] hehe [22:01:39] :) [22:01:41] * James_F grins. [22:01:58] thanks for running the show, tim! thanks everyone for your feedback! [22:02:20] DanielK_WMDE_: thanks for taking the time to write up the RFC! [22:03:56] as much as we quibble about the details, I think we are in agreement on a lot of the big picture [22:05:26] (fyi labs should be stable now) [22:06:18] famous last words ;) [22:07:09] well...reverted to a mean of instability :) [22:08:16] TimStarling: are you looking into grepping out the meetbot lines? [22:08:36] happy to do so unless you are already on it [22:09:26] go ahead, I haven't started, thanks