[21:01:48] #startmeeting RFC meeting [21:02:17] #topic the future of rev_parent_id https://phabricator.wikimedia.org/T193690 [21:02:29] hm... no meetbot? [21:02:49] #link https://phabricator.wikimedia.org/T193690#4254713 [21:02:59] hmmm, weird I thought it was the wm-bot but is that not the right one? [21:03:14] no, that's not it [21:03:28] grr, I thought I was even checking for the bot beforehand :( [21:03:44] let's just do what we did last time and I can grep for the tags again [21:03:56] who is here for the RFC meeting today? [21:04:03] * DanielK_WMDE is [21:04:17] i have to adjust a server but i'll be watching intermittently :) [21:05:05] musikanimal: you there? [21:05:19] I can fix it, I did it a couple of weeks ago [21:05:25] hallo! [21:05:38] hey! [21:06:02] we are about to start. i was hoping to see some more people here who contributed to the ticket [21:06:16] o/ [21:06:17] #startmeeting RFC meeting [21:06:17] Meeting started Wed Jun 6 21:06:17 2018 UTC and is due to finish in 60 minutes. The chair is TimStarling. Information about MeetBot at http://wiki.debian.org/MeetBot. [21:06:18] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. [21:06:18] The meeting name has been set to 'rfc_meeting' [21:06:26] #chair kchapman DanielK_WMDE [21:06:26] Current chairs: DanielK_WMDE TimStarling kchapman [21:06:39] #topic the future of rev_parent_id https://phabricator.wikimedia.org/T193690 [21:06:48] #link https://phabricator.wikimedia.org/T193690#4254713 [21:07:46] I was hoping to see Lahwaacz and/or GeoffreyT2000 [21:07:47] oh, well. [21:08:08] let's get started anyway. [21:09:02] the core of the issue seems to be: it's unclear how rev_parent_id is to be interpreted. is it the "previous" revision the the page history as shown on the history page? [21:09:13] or is it the revision the author of the edit actually based their edit on? [21:09:20] in most cases, this is going to be the same thing. [21:09:36] but when revisions get deleted, or histories merged, or revisions get imported, then it is not the same [21:10:12] MW's notion of the "previous revision" is by timestamp, nearly everywhere. In particular, that'S howw diff=prev and diff=next works [21:10:57] on the other hand, rev_parent_id is used to compute the size difference shown on the history page and user contributions page. [21:11:01] should we talk about branching/merging, which I think was supposedly the original use case for rev_parent_id? [21:11:43] maybe later I guess [21:11:47] :) [21:11:57] i was hoping to get clarity on simpler issues first :) [21:13:25] <_joe_> if we interpret rev_parent_id == "previous revision by timestamp", it's just a performance trick, which might or might not be worth it purely on performance grounds [21:13:58] probably worth it, for contributions [21:14:05] right. [21:14:15] Right, the only user-visible consumer of rev_parent_id are the on history/contribs pages where either a "size difference" is shown or an "N" marker for new page. [21:14:29] we could just store the size difference instead, that would be even cheaper [21:14:31] I think we should consider where we want to be in the long term [21:14:36] I would peprsonally love to have a way to preserve the original author's intent, but I don't see how to do this with rev__parent_id in the current system. [21:14:41] <_joe_> TimStarling: I was getting there :) [21:14:47] for one thing, we have lost that info for many old revisions anyway. [21:14:58] do we consider difference in byte size a good edit size metric? [21:15:05] also, it's completely unclear how that would work with imports. the parent revision may be on a different wiki. [21:15:31] Preserving the original difference encompasses more than the numerical difference. I think users have a reasonable expectation that pressing "diff" will correlate to the size difference. [21:16:05] E.g. we'd have to preserve the diff (or some other data form that would allow producing a diff) as well. [21:16:08] tgr: i persoonaly do not - size-of-diff would be much better than diff-of-size. But we don't have that info for old edits. [21:16:09] an expectation which is broken when they intentionally screw up the history [21:16:32] Krinkle: users have a reasonable expectation that the diff size correlates with the diff they are seeing, and not something else [21:17:04] they also have a reasonable expectation that a change with size 0 is not large, which MediaWiki breaks [21:17:10] Krinkle, tgr: arguably, both options we have violate that under some conditions. [21:17:38] 1) when replacing 100 lines with compleptely different text of the same lingth, you get a big diff, but the size difference is 0. [21:17:41] DanielK_WMDE: If we opt for rev_parent_id to be a cache of "previous revision", then the text diff will match the size diff [21:17:54] IMO in the long term we want a diff metric that meaningfully corresponds to how much the edit changes - something like git's aded lines / removed lines [21:18:01] Sure, yes, but they logically match as being about the same change. [21:18:06] 2) when you know the actual size diff of the edit, but diff=prev gives you a diff against a different revision, that'S also quite confusing [21:18:09] the size diff was never a diff size. [21:18:24] <_joe_> version control systems have ways to deal with broken history trees; I don't think that problem is unsolvable for the future, and we can live with a somewhat inconsistent past [21:18:34] Krinkle: right - it depends on what you mean by "match". [21:18:40] my point is, rev_parent_id is not going to be useful for that, so I'd rather see it used for logical history trees, for which it is actually useful [21:18:51] and precompute diff sizes in a new table [21:18:53] matching expectation is first and foremost about relating to the same two revisions. What difference we show is a separate discussion :) - One that supposedly could be done orthogonally. [21:19:12] tgr: that would mean a) we don't have it for old revisions and b) we can't show *any* size difference any more. [21:19:38] as I said, we'd need a new table for size differences [21:19:46] more work, but not vastly more work [21:19:56] tgr: i'd love to have that, but that's a pretty big change in behavior. [21:20:07] The (size difference / N indicator) is currently the only consumer of rev_parent_id, as such, we have some lattitude with regards to how to solve the problems with its current inconsistency. The inconsistency being that we sometimes use it to cache "prev rev" and sometimes use it to restore the original "prev rev" at time of saving edit. [21:20:13] just make it a property of the revision, recompute when previous revision is messed with [21:20:46] The primary objective is to resolve that inconsistency, either by swinging one way, the other way, or removing it. Which are the three options that Daniel shows at https://phabricator.wikimedia.org/T193690#4254713. [21:20:54] tgr: sorry - recompute what? [21:21:00] the diff size [21:21:29] but why? I'd want to preserve the original intent. the original diff. that would never change [21:21:39] Which are 1) use it as a cache only, which means when doing selective undelete, we'll have to do some additional secondary updates to surrounding revisions, 2) use it as parent preservation only, which means we'll have to come up with another way to do the size diff feature, 3) remove it, and come up with another way for the size diff feature. [21:21:42] anyway. i feel that that would be a new feature [21:22:21] Are there additional options we want to consider? [21:22:33] DanielK_WMDE: MediaWiki shows a linear history, that won't change any time soon. In a linear history, every diff must be against the previous revision, otherwise navigation gets weird. [21:22:38] Krinkle: ...or remove or replace the size diff feature, which is what tgr is proposing. I'm actually up for that, but only *after* fixing inconsistencies in current behavior [21:22:50] can someone remind me why we even allow selective undelete? [21:23:09] tgr: then we are back to rev_parent_id as a cache for the previous rev in the linear sorted history [21:23:42] DanielK_WMDE: no, I'm saying we should use a new field for that and use rev_parent_id for logical parent [21:23:43] TimStarling: becausue back in the day, we didn't have suppression. we should probably kill the feature now :) [21:23:56] TimStarling: I haven't used it recently, but back when I was more active, I used it a lot, to make history less useless, given rev_del still occupies space visually and break's peoples workflow. E.g. after heavy vandalism. [21:24:24] If there was a way to surpress in a way that excludes the row from view by default, then that would be sufficiently obsolete indeed. [21:24:27] tgr: how about using rev_parent_id for the linear parent, which it currently is used for, and add a new field for new behavior/features? [21:24:27] undelete is used for history merges, and then to undo history merges. very fringe use case though. [21:24:48] we have Special:MergeHistory for history merges [21:25:16] DanielK_WMDE: I think in the long term we want nicer diff sizes, not just bytewise difference. rev_parent_id won't be able to provide that. [21:25:33] my take is that a) having a "logical history" would be nice and b) we can't really use rev_parent_id for that without breaking things [21:25:35] To reconcile issues with prev/ etc. we could still leave a placeholder in the view for "1 or more revisions" so that it does go straight from one to an unrelated edit. But when there is 2000 edits in between now and the "really" last edit, people sometimes want to make that go away. [21:26:19] can we provide a selective deletion feature instead? [21:26:20] arguably we are violating moral rights by not having logical parents - changes get misattributed [21:26:23] tgr: i agree. but the first step is to fix current behavior. we can also propose to drop/replace it. [21:26:28] TimStarling: sure. [21:26:41] does that change the problem for us, though? [21:27:01] user intention is clearer [21:27:13] Yeah, and much better UX. [21:27:27] maybe there is no need for a checkbox "restore rev_parent_id" [21:27:51] tgr: a week ago i would have agreed with you that that's what rev_parent_id should do. after discussing this a lot, it seems to me like rev_parent_id can't be made to do that. we can add that as a new feature elsewhere though. [21:27:52] Right. [21:28:17] TimStarling: i don't think there is no need for that checkbox in any case... [21:28:31] if you delete a range of revisions, as part of an explicitly anti-vandalism workflow, you probably want to relink the remaining rev_parent_id, right? [21:28:41] you want it to be as if those revisions never existed [21:28:44] I use rev_parent_id to JOIN on revision to get the rev_sha1 of the surrounding edits for simple revert detection. So for me it is not only used for diff sizes. I currently assume (perhaps wrongfully) that rev_parent_id points to the revision that was "forked", and not a surrounding imported edit, etc. [21:29:19] TimStarling: Maybe, but I think it'd be fair to not be able to view a diff between the revert and the last revision before the vandalism (which would be identical anyway). [21:29:24] Given there'd be a deleted rev in between. [21:29:35] DanielK_WMDE: we can use rev_parent_id for diff sizes and a new field for logical parents. Or we can use rev_parent_id for logical parents and a new field for diff sizes. rev_parent_id is not useful in the long run for diff sizes, because we want something more meaningful than bytewise difference. So... [21:29:38] musikanimal: yea, that assumption doesn't hold... logically, it would make more sense to directly join on the sha1. is that an option? [21:29:39] But yeah, if we do it as archiving, then we need to recompute the parent_id. [21:29:48] that's option 1. [21:30:15] the same would also happen when importing a revision from xml that is older than the latest revision. [21:30:38] tgr: the thing is that we already have a lot of data in rev_parent_id which is *not* the logical history, and we cannot restore that info. so I'd propose to only record it for new edits. that's much simpler when putting the new info into a new place. [21:30:49] DanielK_WMDE: yeah I suppose! but I also use it for diff sizes, all in the same query [21:31:15] no matter how we implement logical history, we can't recover for edits that already exist [21:31:23] that's not really an argument either way [21:31:30] I believe MergeHistory requires that revisions are from distinct timestamp ranges, the only rev_parent_id that needs updating was originally zero [21:31:36] musikanimal: what does the size diff indicate for you? how do you interpret it? [21:31:45] that could be a model for Special:Import history merging also [21:32:11] to me it should be the size of the text I see when I view the diff directly [21:32:32] TimStarling: I suppose the current method of it mixing in the history is indeed not very useful. It creates a UX mess of revisions with non-sensical differences. [21:32:35] tgr: well, if we wanted to use rev_parent_id for thhe logical history for new, we'd have to blank it for all old revisions. And not show any size diff info for old revisions. [21:32:38] I've only noticed this to be wrong for really old revisions (2001-2003 or so), and pages with such revisions throw off the calculations in XTools [21:33:15] On the other hand, what would we do to solve that? E.g. an option on Special:Import to automatically rename or delete an existing page if the timestamps overlap? [21:33:30] musikanimal: if a replace 100 lines with completely different 100 lines of the same length, i'd have a huge diff, but 0 size difference. is that expected? [21:34:00] DanielK_WMDE: instead of not having any history for old edits, pretending they had a linear history is much less inaccurate (since that history was in fact linear most of the time) [21:34:38] DanielK_WMDE: right yeah that can be confusing, but I at least (and I think many others) are at the understanding it's the "diff" size, net change, and not number of added/removed lines or what have you [21:34:39] That means as preparation for non-linear history, we would actively want it to be computed as a cache of "prev rev by timestamp" [21:34:45] that seems good. [21:35:29] DanielK_WMDE: imagine we have a feature where you get a warning saying "this edit is not an accurate representation of what the user did" when the logical and timewise history disagrees - surely we wouldn't want to do that for every single pre-2018 edit? [21:36:15] tgr: no, only if we actually have the logical parent, and it's diffent. and yes, I'd really like that warning. [21:36:48] tgr: so, would you want to compute the size diff based on the logical parent, even though the diff link next to the size leads to a diff aganist the chronological parent? [21:37:20] DanielK_WMDE: so, not having a logical parent and having the timeline parent as the logical parent would be no difference in behavior. So, no harm in reusing rev_parent_id. [21:37:51] DanielK_WMDE: no, I think the size diff should correspond to the visible diff always, even if it is an artificial diff [21:38:12] we don't have a way to do that, if rev_parent_id is the logical parent [21:38:20] Circling back to what we should do with rev_parent_id whilst we don't have logical parents in MW yet? [21:38:26] yeah, so we would have to add one [21:39:03] IMO add a new field for diff size caching first, change rev_parent_id to be the logical parent second [21:39:13] Krinkle: relink, the current system of restoring it from ar_parent_id breaks referential integrity when the rev_id doesn't exist [21:39:21] i think it's easier to fix the old way, than edd a new way, instead of changing the old way to be the new way and then emulating the old way with something new. [21:39:41] hard to see how that is defensible [21:40:16] TimStarling: yea - the logical parent may be in the revision table, or in the archive table, or on a different wiki... [21:40:32] I think undeletion should relink according to the new visible chronological order [21:40:51] that meets musikanimal's use case for diffs matching size differences [21:41:25] it makes bahviour of the current ui code consistent [21:41:57] but it gives a distored view of author's intent in some cases. [21:42:27] i'd love to see tracking for a "logical parent" added somehow. it's completely unclear to me how that could work with imports [21:43:10] we don't have a UI remotely capable of showing VC-style branches [21:44:28] DanielK_WMDE: you're saying the problem is overlapping merged histories? [21:44:41] TimStarling: Right, but for the purposes of non-linear history and preserving author intent, it' fine for rev_parent_id to point to a deleted revision. The UIc can deal with that in some way to indicate it is absent, like with rev_deleted. [21:44:49] UIs* [21:45:33] just a time check that there is 15 minutes left. I'm not seeing that we are at an agreed solution yet. [21:45:41] TimStarling: especially if revisions weree improted from another wiki. [21:45:49] if we have selective deletion and relink rev_parent_id, and then we reverse that selective deletion, we can easily relink again, the information is not really lost [21:46:04] because it is all in rev_timestamp [21:46:07] also use of /#info would be most appreciated when ready [21:46:22] My impression is that we have rough consensus on using rev_parent_id to mirror chronological order of page history. With a wish to add something for trackingh the logical history somehow in the future. tgr decents, and whats to have it the other way around: track logical history in rev_parent_id, and add something new for quick calculation of size diff, and ideally replace size diff with something better. [21:46:39] DanielK_WMDE: I'm saying overlapping merged histories are going to be a UX disaster unless we actually build a UI for that [21:46:52] if rev_parent_id is used to mirroe chronological order, do we have any use of ar_parent_id? should it be ignored? dropped? [21:46:53] so we should discourage overlapping merged histories as far as possible [21:47:08] oh yea. but they do exist. [21:47:23] fixing them is not the subject of this RFC [21:48:06] well, if we go with the chronological order, that would "fix" them in a way [21:48:11] we're just talking about new undeletion actions, right? [21:48:14] they are a UX disaster. Also extremely rare so a fringe problem IMO. With logical parents, they can at least be fixed with some future UI change. [21:48:45] meh [21:48:50] for any merge/undeletion/import done without logical parent tracking, the chance of providing a sane view for that edit is forever lost. [21:49:05] TimStarling: no... we hijacked the original RFC to talk about the intended semantics of rev_parent_id, and the consequences of fully adopting that semantics. [21:50:21] tgr: what is your proposal? [21:50:32] DanielK_WMDE: I use ar_parent_id to identify the first revision of deleted pages. Is there another way of doing this? [21:51:21] musikanimal: timestamp is stored in archive as well, right? So you'd pick the oldest, the same as we for non-deleted pages [21:51:34] musikanimal: the new page creation log. but only for pages created from now on. [21:52:06] can I use that, say, to find all the deleted pages created by a given user? [21:52:15] yes. [21:52:18] tgr: add a new field to revision and archive to preserve this information? [21:52:19] TimStarling: add a new field to store diff size, recompute it when the (timewise) previous revision changes. Use page creation log for the N marks. Use rev_parent_id for logical parents (and do not erase the information we have from the time period when it was actually used as a logical parent.) [21:53:19] tgr: do you want to retro-fill the new field? [21:53:27] that would be fine I guess [21:53:42] a much bigger project though [21:53:48] yeah, it would have to be populated [21:53:56] it's definitely more work [21:54:01] tgr: with diff sizes, not size diffs? [21:54:09] that's goind to take aloooooong time.... [21:54:13] IMO it's work we want to do eventually anyway [21:54:22] Using page creation log might not be feasible for N markers. How would one efficiently query edits by a user (e.g. user contribs) and also get data for the N marks? [21:54:24] DanielK_WMDE: I would use size diffs for now [21:54:26] rev_diff_size and rev_size_diff and a user preference to select which one ;) [21:54:40] You'd have to query the log for all relevant page IDs and then compare the log_params value afterwards (assuming it would store the revision id that created the page) [21:54:40] hehe [21:55:20] Krinkle: perhaps the page table should just have page_first_rev. [21:55:33] also doing both time-based size difference and author-based N marks seems potentially confusing. Presumably, like with parent_id, we coudl store the mark in the same blob as the size diff [21:56:29] ok, so it seems we agree that we a) would like to preserve the logical parent and b) need some kind of de-normalization to cache the size diff (or the chronological parent). [21:56:54] #info tgr proposes: add a new field to store diff size, recompute it when the (timewise) previous revision changes. Use page creation log for the N marks. Use rev_parent_id for logical parents (and do not erase the information we have from the time period when it was actually used as a logical parent.) [21:57:12] we can make rev_parent_id chronological, which makes the current system behave consistenly, but potentially loses some info on old revisions. we can add the logical parent tracking later. [21:57:15] Krinkle: join logging and revision on page id + timestamp, if the joined log event is create/create, show the N [21:57:22] Just a few minutes left :) [21:58:01] or we try to make logical parent tracking work nicely, start caching size siffs/diff sizes, and back-populate. this means schema changes, and inconsistent behavior until the new cached data is available. [21:58:29] did i capture these two options correctly? [21:59:26] i think so [22:00:01] we can rule out the first option if tgr is objecting to it [22:01:06] i still don't quite see how we can make the second option work. I'm afraid it's a "keep doing the wrong thing until we can do the big right thing, which will be never"... [22:01:08] anddddd...we have run out of time. TimStarling are you suggesting then the 2nd is the answer? [22:01:10] I think the first option is a small step away from where we want to go, and the second one is a large step towards it [22:01:40] if we don't think we'll have the resources for a large step any time in the forseeable future, that rules that option out [22:01:50] but that's pretty pessimistic [22:02:01] make a subtask tgr [22:02:10] #action tgr to write this up [22:02:57] we don't have a resolution on the RFC [22:03:38] we can talk about resourcing [22:03:43] no, but we now have a better basis to discuss that rfc as proposed. [22:03:52] of tgr's idea? [22:03:54] the meeting should be closed now, I have to go [22:04:24] #endmeeting [22:04:24] Meeting ended Wed Jun 6 22:04:24 2018 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) [22:04:24] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-06-06-21.06.html [22:04:24] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-06-06-21.06.txt [22:04:24] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-06-06-21.06.wiki [22:04:24] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-06-06-21.06.log.html [22:04:31] kchapman: thanks for chairing [22:04:38] thanks everyone for the discussion! [22:04:50] you're welcome and thanks all [22:04:56] thanks!