[19:55:16] Marti_J: hi there, are you available for a quick chat?
[21:00:03] #startmeeting RFC meeting
[21:00:23] ohnoes
[21:00:40] the bot was gone yesterday too :(
[21:00:46] where does it live?
[21:00:56] who maintains it?
[21:01:17] it lives at bots.wmflabs.org I guess
[21:01:25] https://tools.wmflabs.org/meetbot/
[21:02:07] https://tools.wmflabs.org/?tool=meetbot
[21:02:11] marktraceur!
[21:02:26] what instance?
[21:02:29] maintainers seem to be hashar, coren, and TimLandscheidt
[21:03:16] TimStarling: become meetbot?
[21:03:34] if we can find the labs instance then maybe we can restart the service
[21:03:54] plan B will be to just have the meeting and publish the notes on the wiki
[21:04:13] yea
[21:04:32] should we create an etherpad?
[21:04:41] good idea
[21:04:50] will you?
[21:04:51] http://etherpad.wikimedia.org/p/RFC-meeting-20150429
[21:05:06] thanks :)
[21:05:06] I'm not sure how that will work
[21:05:18] reading two things at once?
[21:05:20] we add things there, then transfer to MW
[21:05:27] copy&paste from IRC
[21:05:33] so, we are talking about integrating the file upload history with the history of the file description page.
[21:05:40] https://phabricator.wikimedia.org/T96384
[21:06:54] It seems to me that the best way to do this is to associate multiple content blobs with each revision, instead of just one. The association with the blob could be a URL-like thing, like we use for external storage now.
[21:07:29] so it could be in the text table, or in external store, or cassandra, or the file system, or wherever.
[21:08:12] but there are some open questions about that.
[21:08:16] how does this interact with the move to hash-based image addressing?
[21:08:39] is the revision history a linear sequence of mappings from names to hashes?
[21:09:24] gwicke: depends on how the hash based addressing works. would different "revisions" of the file have different hashes?
[21:09:33] yes
[21:09:34] would it still be possible to "update" a file?
[21:09:41] they are the hash of the original content
[21:09:54] so a re-upload would change the hash
[21:09:58] I think we already do that inside swift
[21:10:16] yeah, Aaron was working on that
[21:10:24] I think the plan is to also expose that in HTML
[21:10:35] gwicke: i'm not sure i fully understand how that works. my goal is to have a single sequential history for the file itself, its description page, and the associated structured metadata (license, author, etc)
[21:10:49] it's a separate concern, I think
[21:10:51] if that is associated with a name, a number, or a hash, i don't care much
[21:10:54] yea
[21:11:26] so... if we have a single history for multiple "facets", what would a diff for that look like?
[21:11:39] especially a diff spanning changes to multiple facets?
[21:11:51] what would it look like in an XML dump?
[21:11:54] well, there are precedents from VCS systems you can look at
[21:11:56] does it make sense to have different licenses for the same original?
[21:11:58] do we need separate permissions for each facet?
[21:12:12] sometimes they just say (binary file changed)
[21:12:16] or perhaps these should just be blobs in "associated namespaces"?
[21:12:24] or even different descriptions?
[21:12:43] for image changes you could show the two images side-by-side
[21:12:44] gwicke: it makes sense to be able to change the license or description, yes.
[21:12:59] DanielK_WMDE: I mean different licenses for the same file
[21:13:07] at different description pages
[21:13:14] at the same time
[21:13:25] Does it really make sense to have the same image on different description pages?
[21:13:29] because the file has been uploaded multiple times? no, that's bad, of course
[21:13:56] the XML dump can already include image content, with dumpBackup.php --include-files
[21:14:01] TimStarling: yea, side by side is good... and then have the text diff on the same page, just below?
[21:14:16] yes
[21:14:32] and the structured data diff below that?
yea, why not...
[21:14:37] you could have the media hierarchy be responsible for displaying a diff
[21:14:44] by default, (binary file changed) or something
[21:14:50] The XML format would need to change to accommodate multiple content blobs per revision.
[21:14:53] we have similar concerns with data-mw
[21:15:02] and possibly later page metadata
[21:15:02] then the image handler can override it and display side-by-side
[21:15:10] currently, the content data is directly inside the revision tag, iirc. we'd need another level in the XML dom
[21:15:36] also, we somehow need to be able to specify the "role" of the blob. And the content model, which is currently recorded in the revision.
[21:15:44] you know there are lots of interesting ways to compare images, side-by-side is not the most awesome thing imaginable
[21:16:22] http://www.abc.net.au/news/specials/christchurch-quake/
[21:16:35] http://www.imagemagick.org/script/compare.php
[21:16:37] :)
[21:16:56] multiple bits of content associated with a revision will be fairly common
[21:17:10] I think it's worth looking for a general solution that's not specific to images
[21:17:36] I'm just saying that diff display should be a responsibility of the media type handler
[21:17:47] or the frontend
[21:17:51] could even be done client-side
[21:17:59] the media type handler in the frontend or wherever it lives
[21:18:16] yes. i'm thinking splitting the revision table into two, one for the actual revision (timestamp, user, id) and one for the blobs (blob-id, url, content-model, format, hash)
[21:18:48] TimStarling: content handler is responsible for providing the diff view.
[21:18:51] right...
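The handler idea from 21:14:37 to 21:15:02 (a generic default diff that an image handler overrides) can be sketched roughly as follows. This is only an illustration in Python; the class and method names are hypothetical and are not MediaWiki's actual MediaHandler or ContentHandler API.

```python
class MediaHandler:
    """Hypothetical base handler: knows nothing about the file format."""

    def render_diff(self, old: bytes, new: bytes) -> str:
        # Generic fallback, like a VCS reporting a changed binary file.
        if old == new:
            return "(no difference)"
        return "(binary file changed)"


class ImageHandler(MediaHandler):
    """Hypothetical image handler: overrides the default diff view."""

    def render_diff(self, old: bytes, new: bytes) -> str:
        if old == new:
            return super().render_diff(old, new)
        # A real handler would emit markup for two thumbnails; this
        # placeholder string only stands in for that side-by-side view.
        return (
            f"[side-by-side: old image ({len(old)} bytes) | "
            f"new image ({len(new)} bytes)]"
        )


def diff_for(handler: MediaHandler, old: bytes, new: bytes) -> str:
    """The caller only depends on the common interface."""
    return handler.render_diff(old, new)
```

The history or diff page would then ask whichever handler is registered for the content type, without special-casing images itself.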
[21:18:56] that's roughly what we have in RB
[21:19:02] one consequence of this could be to merge the concepts of media handler and content handler
[21:19:02] the latter part
[21:19:26] it's title, revision, timeuuid
[21:19:30] and generalize the file store to a blob store, that would also cover the external store
[21:20:05] oh, splitting in two that way
[21:20:27] for a minute I thought you meant do a union query to merge the two tables for history display
[21:20:32] which you could basically do already
[21:20:34] the "blob" table needs to have the revision id, of course. forgot that
[21:20:51] a frontend table and a backend table is basically what we have already with revision/text
[21:20:54] TimStarling: that would be nice and hackish for B/C :)
[21:21:30] but the more i think about it, the more it seems like we should work to unify not only the history of media and page content, but also content handler and media handler, and external store and file store.
[21:21:55] all unified into one enormous 20k line class? ;)
[21:21:56] uploaded media would become a secondary blob in the revisions of the description page.
[21:22:07] hehe
[21:22:16] more like unified into implementing a common interface
[21:22:48] to facilitate reuise, i expect we'd rather be splitting the current code more
[21:22:51] media is going to be stored by hash
[21:22:55] *re-use
[21:23:01] so currently we have rev_text_id
[21:23:18] i.e. the schema does not allow multiple text rows to be associated with a revision row
[21:23:38] gwicke: no problem, the "url" in the blob table would just say media:6ab4f339e or whatever
[21:24:07] assuming you wanted to do that...
[21:24:08] we'd have a "store" to handle the media "protocol".
[21:24:29] it's much simpler to have the text row point to multiple underlying blobs
[21:24:32] TimStarling: when splitting the table, the "blob" part would have the revision id.
[21:24:34] basically an extension of how ES works
[21:24:42] that way, you could associate any number of blobs with a revision
[21:25:08] yes, exactly how ES works, with a pluggable architecture
[21:25:13] you don't even need a schema change to do that
[21:25:26] we already do it..
[21:25:33] well, not a schema change in revision/text/blob
[21:25:54] we could just put multiple pseudo-URLs into the text table, instead of one...
[21:26:02] there is still the matter of dropping image/oldimage and rewriting half the filerepo directory
[21:26:19] DanielK_WMDE: why store it in the text table at all?
[21:26:30] why not simply look it up elsewhere?
[21:26:36] gwicke: just because that mechanism is already there. we could do that without a schema change
[21:26:53] TimStarling: why would you need to rewrite the filerepo?
[21:27:03] we can look things up by revid already
[21:27:43] I mean the creating/deleting/moving file revisions, there is a lot of code to do that
[21:27:50] legoktm: I was given meetbot privileges but I've never actually used them, it's hashar's baby I think
[21:27:53] gwicke: we currently do: page -> rev -> text -> ES. We can keep that, or move to page -> rev -> content-blob -> ES/FR
[21:27:55] mostly it will just go away
[21:27:58] which is nice
[21:28:05] exactly :)
[21:28:21] DanielK_WMDE: or page -> rev -> http://en.wikipedia.org/api/rest_v1/page/html/Foobar/659900971
[21:28:46] gwicke: that's what the text table is for though, storing URLs
[21:29:00] why do we need to store them?
[21:29:04] gwicke: sure, but you need to maintain the relationship between rev and all the blob urls somewhere, along with role, content model, etc.
[21:29:42] gwicke: i want to be able to ask for the image blob or the meta data blob for a revision, specifically.
[21:29:50] DanielK_WMDE: you mean the available types of content per revision?
[21:29:52] that mapping needs to be somewhere.
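To make the "mapping needs to be somewhere" point concrete, here is a minimal Python sketch of the page -> rev -> content-blob -> store lookup discussed from 21:18:16 to 21:29:52. Everything in it is an assumption for illustration: the `BlobRef` fields follow the columns named in the chat (revision id, role, url, content model, hash), the `media:` scheme comes from the 21:23:38 message, and the `text:` scheme, role names, and in-memory stores are invented stand-ins, not MediaWiki's real schema or external store API.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class BlobRef:
    """One row of the hypothetical blob table."""
    rev_id: int         # revision this blob belongs to
    role: str           # assumed role names, e.g. "media" or "description"
    url: str            # pseudo-URL such as "media:<hash>"
    content_model: str  # illustrative values, e.g. "jpeg" or "wikitext"
    content_hash: str   # hash of the blob content


class BlobStoreRegistry:
    """Dispatches a pseudo-URL to the store registered for its scheme."""

    def __init__(self) -> None:
        self._stores: Dict[str, Callable[[str], bytes]] = {}

    def register(self, scheme: str, fetch: Callable[[str], bytes]) -> None:
        self._stores[scheme] = fetch

    def fetch(self, url: str) -> bytes:
        scheme, sep, address = url.partition(":")
        if not sep or scheme not in self._stores:
            raise KeyError(f"no store registered for {url!r}")
        return self._stores[scheme](address)


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


# In-memory stand-ins for a file repo and a text/external store.
image_bytes = b"\x89fake image bytes"
wikitext = b"== Summary ==\nA photo."
media_store = {sha1_hex(image_bytes): image_bytes}
text_store = {"17": wikitext}

registry = BlobStoreRegistry()
registry.register("media", media_store.__getitem__)
registry.register("text", text_store.__getitem__)

# The mapping from (revision, role) to blob: one revision, two blobs.
blob_table: Dict[Tuple[int, str], BlobRef] = {}
for ref in (
    BlobRef(42, "media", "media:" + sha1_hex(image_bytes), "jpeg",
            sha1_hex(image_bytes)),
    BlobRef(42, "description", "text:17", "wikitext", sha1_hex(wikitext)),
):
    blob_table[(ref.rev_id, ref.role)] = ref


def load_blob(rev_id: int, role: str) -> bytes:
    """Pick a specific blob of a revision by role, then resolve its URL."""
    return registry.fetch(blob_table[(rev_id, role)].url)
```

Whether the mapping lives in MySQL, in an external service, or is computed on demand is exactly the open question in the discussion; the sketch only shows that callers need some way to go from (revision, role) to a blob.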
[21:29:57] How do you envision displaying the file history table in the ui with this scheme? Or is that just going to be removed?
[21:30:01] gwicke: yes.
[21:30:19] bawolff: the idea is that it would be removed, that is an explicit goal
[21:30:38] which is a product discussion of course
[21:30:40] bawolff: I would suggest to just drop it. But we can keep it by flagging revisions that updated the media blob.
[21:30:43] DanielK_WMDE: one option would be to consult a list of sources, and return the result
[21:30:44] Users might want to see at a glance how many versions of the files there were
[21:31:03] DanielK_WMDE: including things that might not yet exist, but can be created on demand
[21:31:26] a bit like http://en.wikipedia.org/api/rest_v1/page/
[21:31:30] DanielK_WMDE: flagging as in revision tagging?
[21:31:36] bawolff, TimStarling: if we unify the history, we *can* remove the extra "upload" history. But flagging revisions as "upload" is easy enough, so we can also keep it.
[21:31:42] bawolff: yes, exactly
[21:32:29] gwicke: sure. i'm just saying the mapping needs to be somewhere. in mysql, or in some external service, programmatic or materialized, whatever.
[21:32:48] bawolff: history is a vertical slice, I think Daniel was talking about listing the types of content per revision, a horizontal slice
[21:33:13] gwicke: no, i was thinking about how to "emulate" the file upload history, bawolff got that right
[21:33:37] ok, so basically history per type of content
[21:33:38] gwicke: as to the horizontal slice, i don't really want to list, i want to be able to pick *which* blob i want for a revision.
[21:33:44] maybe listing them would also be useful, not sure
[21:33:44] we need to have a UI design process
[21:33:44] with content being 'upload metadata' or the like
[21:33:52] that's not really our responsibility
[21:34:06] gwicke: not really per type of content. i'd just use revision tagging to tag uploads.
[21:34:11] it is a product management responsibility
[21:35:11] uploads are different events from description edits
[21:35:17] they touch different content types
[21:35:42] one modifies the wikitext associated with the image name, the other the mapping of image name to content hash
[21:35:53] TimStarling: merging the histories implies that you'd see uploads in action=history. It doesn't necessarily imply we remove the upload history from the description page.
[21:35:57] we could, but we don't have to.
[21:36:19] gwicke: indeed. uploads could also modify both.
[21:36:26] the revision tag would be for marking the event.
[21:36:34] Well we already have dummy edits for upload. It would mean the dummy edits are actually "useful"
[21:36:55] what "parts" of the content were changed is independent of that. though in practice, uploads would always modify the media "part".
[21:37:02] DanielK_WMDE: you can already see uploads in the page history
[21:37:11] https://commons.wikimedia.org/w/index.php?title=File:Omar_al-Bashir,_12th_AU_Summit,_090202-N-0506A-137.jpg&diff=159138536&oldid=148791824
[21:37:18] bawolff: and we'd no longer have extra database tables and tons of special case code for dealing with media files
[21:37:36] which would be nice, certainly
[21:37:47] in a way, the per-name history is a merge of the histories of several content types
[21:37:51] TimStarling: "(No difference)" :)
[21:38:07] yeah, when you upload, it creates a null revision
[21:38:15] like page moves
[21:38:35] gwicke: yes. and in my mind, actually unifying the mechanisms on a low level would allow us to drop a lot of code and complexity, and make the platform more flexible.
[21:38:44] it's like content handler squared :)
[21:39:12] it would integrate well with what we are doing with RB
[21:39:20] you could hack a file diff feature on top of the existing null edit if you wanted to
[21:39:48] TimStarling: i'm actually thinking it would be useful if uploads could modify the media file and the description and/or meta-data in the same edit. often, these logically belong together
[21:40:21] TimStarling: sure, but hacking it on top will introduce more special case code.
[21:40:21] we could already do this without incrementing the revision number of the description page
[21:40:33] would be nicer to re-use code and infrastructure, and gain flexibility
[21:41:16] secondary events can already be represented by timeuuid, and merged into the history view
[21:41:29] Altering description on upload would be a lot easier (in terms of presenting a sane UI) when/if wikidata for images actually happens
[21:41:53] and it would probably be the same amount of easiness regardless of this proposal
[21:42:17] gwicke: that would work for integrating the view on the history page. i'd prefer to actually unify content blob management with media file management.
[21:42:31] that'll make it hard to add things later
[21:42:38] like what
[21:42:41] ?
[21:42:53] for example, annotate the history with derived events
[21:43:06] 'this image was added to page x'
[21:43:21] with a selection on which events to display
[21:43:23] Oooh
[21:43:36] gwicke: why would that become more difficult?
[21:43:39] Would you be able to revert from such an interface?
[21:43:59] DanielK_WMDE: if you associate all that with revision ids, how do you insert those events in the past?
[21:44:07] harej: excellent question
[21:44:16] harej: the way i'm imagining it now, you could revert an upload exactly like an edit, and you would be reverting the description page along with the file.
[21:44:26] "Butt.jpg added to article Barack Obama [rollback]"
[21:44:32] Isn't that essentially what abuse filter already is?
[21:44:38] just not with that ui focus
[21:45:00] I see history more as a timeline of events associated with a logical bit of content
[21:45:06] harej: ah, no, that's using images, not managing image uploads. different topic...
[21:45:12] And that way you could watch list files for inappropriate additions to articles. The Fair Use Police would like that also
[21:45:20] DanielK_WMDE: ah.
[21:45:50] yeah harej was getting excited for a minute there
[21:45:51] gwicke: i agree that this would be useful, and that we should think of action=history that way. we still have to manage actual revisions as part of that
[21:46:14] i don't see how my proposal would make the integration of these two perspectives more difficult. it seems to me like it would stay exactly as it is now.
[21:46:41] I'm basically arguing for separating the way things are stored from how they are presented
[21:47:08] yes, and i agree. but i think that's orthogonal to what i'm proposing
[21:47:16] * DanielK_WMDE just realized he's proposing something
[21:47:28] I'm a bit wary about emphasizing / overloading revision ids more
[21:47:45] DanielK_WMDE: you're proposing EditPage/VE changes as well?
[21:48:03] I suppose you are
[21:48:15] not really...
[21:48:16] since you already said you want to be able to edit the page while uploading
[21:48:26] that is an EditPage change
[21:48:33] The scope on this proposal seems humungous...
[21:48:42] TimStarling: it's really an UploadPage change, I think
[21:48:55] but that's just something that would become possible, and has been suggested before
[21:48:56] you could integrate an edit box on Special:Upload
[21:49:12] but then how do you revert and undo via links from the page history
[21:49:15] not something i would consider an integral part of the idea of unifying file management with revision management
[21:49:23] do they go to action=edit as they do now?
[21:49:47] probably not, they would work more like they do on wikidata
[21:50:01] they go to a diff page, with a big "save" button
[21:50:18] so you can check and approve the revert or undo action
[21:50:27] DanielK_WMDE: if we used timeuuids to identify secondary bits of content at least, we'd get the ability of displaying those in a timeline without having to mess with revision ids
[21:50:49] I wouldn't mind using timeuuids for the main content type either, but that'd be a breaking thing
[21:50:54] Why is messing with revision ids a bad thing?
[21:50:57] gwicke: we need some identifier to associate blobs with revisions. i don't care what it looks like
[21:51:16] I'd consider unifying revision id to represent a revision to the content in question a good thing
[21:51:17] has to be unique, and should be related to time
[21:51:36] so that we can order in a timeline
[21:52:14] bawolff: the difficult part is exhaustively defining 'content'
[21:52:32] TimStarling: we could stuff all the new extra info into the text table with some magic flags set, instead of splitting the revision. that's more along the lines of a "multi-part content model". i currently like the idea of a new table better...
[21:52:44] as we are moving from a single blob to multiple components, in some cases retroactively
[21:53:23] gwicke: i support moving away from an int as revision id, towards timestamp + uuid or something like that
[21:53:33] then that's the new and improved revision id.
[21:53:45] yeah, timeuuid is timestamp-based uuid
[21:54:02] https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_1_.28MAC_address_.26_date-time.29
[21:54:16] but agree that it's a detail
[21:54:33] can be some other bit of entropy + separate timestamp as well
[21:54:33] gwicke: can they be sorted chronologically?
[21:54:38] yes
[21:54:43] then i'm all for it
[21:55:12] they have 100 nanosecond time resolution
[21:55:18] disadvantage is length
[21:55:25] anyway, sorry for the distraction
[21:55:35] you should see the code in MW for that
[21:55:56] for uuids?
[21:56:00] poor Aaron trying to emulate Windows NT date functions in PHP
[21:56:16] it's actually pretty straightforward
[21:56:16] haha!
[21:56:23] you know it is a microsoft-authored RFC and it is very trivial to implement in the Windows API
[21:56:24] pretty sure there are libraries already
[21:56:37] nothing about date is ever straightforward.
[21:57:16] "Version 2 UUIDs are similar to Version 1 UUIDs, with the first 4 bytes of the timestamp replaced by the user's POSIX UID or GID (with the "local domain" identifier indicating which it is) and the upper byte of the clock sequence replaced by the identifier for a "local domain" (typically either the "POSIX UID domain" or the "POSIX GID domain")"
[21:57:19] I'm sure he loved writing it though
[21:57:19] sounds like fun :)
[21:57:31] it is includes/utils/UIDGenerator.php
[21:57:52] * gwicke implemented uuids in JS, which has weird behavior about binary arithmetic on 128 bit ints
[21:58:20] anyway.
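On the 21:54:33 question of chronological sorting: a version 1 UUID carries a 60-bit count of 100-nanosecond intervals, but the low bits of that timestamp come first in the textual form, so ordering has to use the decoded time field rather than the raw string. A small Python sketch using the standard `uuid` module; the helper `make_timeuuid` and its default clock sequence and node values are made up for the example so that the output is deterministic.

```python
import uuid


def make_timeuuid(timestamp_100ns: int, clock_seq: int = 0,
                  node: int = 0x010203040506) -> uuid.UUID:
    """Build a version 1 UUID from a 60-bit count of 100 ns intervals.

    Hypothetical helper for illustration; real code would normally call
    uuid.uuid1() or a library instead of packing the fields by hand.
    """
    if not 0 <= timestamp_100ns < (1 << 60):
        raise ValueError("timestamp must fit in 60 bits")
    time_low = timestamp_100ns & 0xFFFFFFFF
    time_mid = (timestamp_100ns >> 32) & 0xFFFF
    # The top 4 bits of this field hold the version number (1).
    time_hi_version = ((timestamp_100ns >> 48) & 0x0FFF) | (1 << 12)
    clock_seq_low = clock_seq & 0xFF
    # The top 2 bits select the RFC 4122 variant.
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


def chronological(ids):
    """Sort timeuuids by their embedded timestamp."""
    return sorted(ids, key=lambda u: u.time)


# Two timestamps one tick apart, straddling a carry into the time_mid field.
earlier = make_timeuuid(0x0FFFFFFFF)
later = make_timeuuid(0x100000000)

by_time = chronological([later, earlier])
by_text = sorted([earlier, later], key=str)
```

Here `by_time` puts `earlier` first, while `by_text` puts `later` first, because the string form starts with the low 32 bits of the timestamp. A store that orders timeuuids chronologically therefore needs a time-aware comparison rather than a plain string or byte comparison.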
[21:58:32] sorry, distracted
[21:58:41] aaanyway- my point is more that we'd avoid the need to store this in the revision table
[21:58:53] I think we need to have more of a product, UI-focused discussion
[21:59:00] with user involvement
[21:59:01] mockups
[21:59:21] then requirements and then architecture
[21:59:37] incorporating Daniel's and Gabriel's ideas of course
[22:00:15] proposal: multiple blobs per revision, with a role, model, and hash associated. may be a good time to move to timeuuid revision ids. unify content handler with media handler, and file repo with external store.
[22:00:30] UI questions: file description page, upload page, history view, undo/revert
[22:00:38] oh, and diff page, of course
[22:01:37] TimStarling, gwicke: shall i try and summarize this on the etherpad? or better someone who isn't me?...
[22:02:16] let's both try
[22:02:22] :)
[22:02:29] http://etherpad.wikimedia.org/p/RFC-meeting-20150429
[22:02:31] have fun
[22:02:38] thanks TimStarling!
[22:02:47] gwicke: i'll try in reverse chronological order.
[22:02:53] thanks everyone
[22:02:57] it's a fun topic
[22:03:01] #endmeeting
[22:03:21] that's the main thing meetbot is good for, spamming the channel at the end of the meeting so everyone knows to stop talking
[22:03:26] :P
[22:03:30] STOP TALKING
[22:03:31] STOP TALKING
[22:05:44] DanielK_WMDE: in RB we currently have both the MW revision id and a timeuuid
[22:05:50] for each blob
[22:06:00] ex: https://en.wikipedia.org/api/rest_v1/page/html/Foobar/
[22:11:39] DanielK_WMDE: another can of worms is stability of names & destructive renames
[22:12:27] some old notes on that at https://github.com/wikimedia/restbase/blob/master/doc/MediaWikiPageContent.md#page-renames
[22:12:54] gwicke: are you still working on the etherpad?
otherwise i'll copy it over to phab
[22:13:54] I'm done
[22:15:00] gwicke: thanks
[22:18:49] added 'support for multiple types of content associated with a logical 'page' or 'media', history and editing support for those' to the arch goals pad
[22:20:33] thanks!