[14:31:17] Technical Advice IRC meeting *Wikimania Special* starting in 30 minutes in channel #wikimedia-tech, hosts: @addshore & @CFisch_WMDE - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[21:00:18] #startmeeting RFC meeting
[21:00:30] Meeting started Wed Jul 18 21:00:18 2018 UTC and is due to finish in 60 minutes. The chair is TimStarling. Information about MeetBot at http://wiki.debian.org/MeetBot.
[21:00:30] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
[21:00:30] The meeting name has been set to 'rfc_meeting'
[21:00:57] o/
[21:00:58] #topic RFC: Unify various deletion systems
[21:01:07] T20493
[21:01:08] T20493: Unify various deletion systems - https://phabricator.wikimedia.org/T20493
[21:02:05] I like it :)
[21:02:43] reading jcrespo's comment "one small nitpick" again since there's a lot of nuance in there
[21:03:27] DanielK_WMDE: are you here?
[21:04:10] huh, I never thought about it from the partitioning aspect
[21:04:53] but I think overall it still seems like an advantage of reducing writes and moving less rows around
[21:05:02] with his DBA hat he likes it but with his Database Engineer hat he doesn't
[21:06:05] normalizing of titles to reduce the size of the links tables, that really needs to happen but probably should be a separate task
[21:07:01] maybe we will have the bandwidth for it when MCR is finished
[21:07:09] Yeah, as-is, we can't (easily) use page_ids in link tables for the targets given destinations are title based, not id based. If memory serves that used to be different but it was changed to not use pageids.
[21:07:47] normalising separately would work yeah
[21:08:27] in MW 1.4 we had a brokenlinks table, which contained links to nonexistent pages
[21:08:45] For what it's worth, Contributors/Collaboration/Growth team has claimed the previously unowned component of Page deletion.
[21:08:57] so you had to move links from brokenlinks to pagelinks when a page was created, which was a problem
[21:09:05] that's why we got rid of it in MW 1.5
[21:09:56] and if you move a page, obviously the links still go by title, until the source text is updated
[21:09:57] Yeah, if we ignore the time it takes to process jobs, we could maybe use page_ids directly in link tables, on the assumption that any move or deletion/restoring of a page would trigger parse jobs for incoming linked pages.
[21:10:32] Hello everyone, I've Endnote x8 and need to activate it or a key for activation, any help?
[21:10:35] as we currently do to make links blue=>red, red=>blue, or blue plain =>blue (class=redirect)
[21:10:36] avoiding that sort of write traffic on move/create/delete was the motivation for the MW 1.5 schema changes
[21:11:03] Regenpfeifer: wrong channel, sorry
[21:11:22] legoktm: where to ask the q?
[21:11:28] Regenpfeifer: not here...
[21:11:47] Yes, do you have recommended channel?
[21:13:26] TimStarling: Interesting, but we still do?
[21:13:46] I suppose it does open the way for doing that in post-processing, but not currently the case afaik
[21:14:01] in which case the jobs could be a fairly light html purge only
[21:15:08] IIRC we don't do refreshLinks on move
[21:15:32] I think there's a bug about how category collations aren't updated properly on page moves
[21:15:41] (partly because we don't run refreshlinks iirc)
[21:15:46] ah, so the stub/redirect classes lazy update?
[21:16:15] Anyway, as Tim said, should be possible to treat orthogonally
[21:16:51] I'm not sure I understand what this proposal wants to do with Special:Contributions
[21:18:15] I suppose it, like action=history, will need to exclude deleted revisions by default.
[21:18:28] "archived revisions"
[21:18:54] it'd certainly make Special:DeletedContributions simpler, code and query-wise (no more special casing)
[21:19:19] revisiondelete shows deleted contributions with a strikethrough, see e.g.
https://en.wikipedia.org/wiki/Special:Contributions/Beletesh
[21:19:20] could even become a redirect to a param for Special:Contributions. Same for action=history.
[21:19:34] Yeah.
[21:20:03] Special:DeletedContributions is admin-only, but I think Daniel imagines opening it up like revision deletion?
[21:20:36] I dont think the community will accept page deletion in a way that leaves default-on rows on contribs and history.
[21:20:56] Too noisy and hard to scan for non-deleted revisions.
[21:21:03] also paging.
[21:21:11] and table indexes.
[21:21:22] Which may be reasons for why we haven't done that thus far.
[21:22:06] making it partly public can be separate change in the front-end if product wants that.
[21:22:37] yes avoiding table scanning to skip deleted revisions is a reason for doing it the way we do it, but this is Daniel's idea though, to make this UI change
[21:24:10] unless I misunderstood his comments in the committee meeting
[21:25:00] Special:DeletedContributions is also available on tool labs
[21:25:10] It's somewhat unrelated, but I've long been irritated by messy history, also for purposes of attribution. Should be possible to mark a revision as reverted/deleted without any visibility change, which could be excluded from credits and a (cleaner, possibly optional) history view. E.g. both the bad edit and revert in case of clean reverts.
[21:25:21] I personally would like to see it either restricted on tool labs, or public on site (Don't care which)
[21:25:29] [That's slightly offtopic though]
[21:25:45] you can either have table scanning, or strikethrough, or have an additional index (user,deleted,timestamp)
[21:26:06] to support a merged view by time you need to keep the existing (user,timestamp) index
[21:26:08] Tool Labs basically provides sysop-level visibility into database information. Afaik we still restrict oversight in labs for that reason. Tools are required by terms of use not to make those visible without approval or user authentication.
[21:26:38] #info - https://phabricator.wikimedia.org/T198156 -- wrt the unification of deletion systems, not sure if that would be related
[21:27:09] Krinkle: That may have been the case at original toolserver, but its very public at this point in toolforge. You can just go down to quarry.wmflabs.org to get the data
[21:27:19] Hauskatze: Yes, that task prompted T198176, which motivated this RFC.
[21:27:19] T198176: Mediawiki page deletions should happen in batches of revisions - https://phabricator.wikimedia.org/T198176
[21:27:54] bawolff: Yeah, I don't oppose making it available in the UI on-wiki. But I do think there are social and product reasons for making it hidden by default, not relating to security.
[21:27:55] I think I filed T198176 - I'm glad it is being discussed :)
[21:28:26] And pretending something is secret, but letting an open group have access to it, is even worse, because people on wiki will assume its secret and act accordingly, which is more dangerous than making it entirely public where nobody makes incorrect assumptions
[21:29:31] So.... to take it back to the RFC. What are technical problems with trying to implement current user experience of archive/restore with rev_del technology? So far we've identified the need for additional indexes.
[21:29:44] I assume we'd need an index for (deleted, timestamp) as well, TimStarling ?
[21:31:03] #info you can either have table scanning, or strikethrough, or have an additional index (user,deleted,timestamp)
[21:31:37] (deleted,timestamp) would show all revisions on the site?
[21:31:47] like recentchanges?
[21:31:56] Ah, I mean, (rev_page, deleted, timestamp)
[21:32:05] for action history
[21:32:44] pages themselves would have a deleted flag
[21:33:12] Right. But there's overlap with rev_deleted in existing history.
[21:33:42] and merges, splits, selective deletion.
[21:34:03] I suppose it depends on the size of those gaps whether `(rev_page,rev_timestamp);` suffices.
[21:34:08] there's no selective deletion, only selective undeletion, which I think should be killed with fire
[21:34:52] if we really need to support history splitting then I think that should make two separate pages
[21:35:03] not scan for deleted=0
[21:35:33] Maybe half a dozen times a year or so I delete pages on wikis and selectively restore. Mainly in the event of large volume vandalism that makes pages unusable afterwards. E.g. someone's talk page and unable to see when the last message was due to 20 pages of striked out revisions.
[21:36:19] and similarly, for current selective rev_delete use case, I think it'd be a good move (product/user-wise) to allow toggling visibility and have a way to page through non-deleted history
[21:36:48] you could probably just replace (user,timestamp) with (user,deleted,timestamp)
[21:37:06] but do we need to add an index for that toggle or can we just scan
[21:37:06] and then do queries with 'deleted in (0,1)'
[21:37:26] tgr: you mean like sorting a union?
[21:37:39] and the query optimizer can probably turn that into two parallel index walks
[21:37:55] a couple weeks ago on wikitech there was a lot of vandalism and it basically required a few database queries to find whether or not we'd missed any contribs from that user (for both user-wise and page-wise, given it was both many user names and many pages, and together trying to discover them all). The on-wiki UI wasn't enough to find it all within reasonable time.
[21:38:58] TimStarling: Yeah, could be a scan. It depends on whether it would perform well enough. If we suspect we may need an index one day, could be bad to find out late since revision table indexes are non-trivial to add.
[21:39:26] Maybe I'm overthinking it, but adding two new indexes might also be a motivation to prioritise normalised contribution tables instead. There's been some talk about that.
[21:39:47] adding indexes does have a cost in disk space and memory
[21:39:51] Yeah
[21:40:02] There's not infinite numbers of them we can add.
[21:40:06] especially on revision
[21:41:57] tgr: I always assume the query optimizer is pathologically stupid until proven otherwise
[21:43:02] there is https://mariadb.com/kb/en/library/index_merge-sort_intersection/ but that's for merging two different indexes (also behind a feature flag), not sure if there's something similar for a single index
[21:44:46] #info Krinkle says hiding revisions from Special:Contributions (not just strikethrough) is a hard requirement
[21:45:17] Regarding the page table. Reading Daniel's comment now https://phabricator.wikimedia.org/T20493#4417467
[21:45:43] Basically need to decide whether re-creation will do what it does now (new page ID), or re-use the old.
[21:45:44] #info There's also a need for selective deletion or some equivalent feature such as history splitting to clean up action=history on heavily vandalised pages
[21:46:03] (where re-create != undelete)
[21:47:45] #info for contributions, it's unclear whether table scanning or an additional index is optimal for performance: CPU/memory tradeoff
[21:48:49] I don't think there's any reason for getting a new ID on recreation other than it was hard to retrieve the old one
[21:49:13] although after this change it will still be hard
[21:49:46] unless deleted revisions are linked to pageid not title, which would be a nontrivial change in behavior
[21:49:48] #info third query plan option for contributions is tgr's idea of (user,deleted,timestamp) and have deleted IN(0,1) for unified views
[21:51:33] can deleted pages stay in the page table? and have the title index be non-unique?
[21:51:45] tgr: yeah, so now, when creating a new page that was deleted, it gets a new ID, but the logs (by title) do show its past creation/deletion, and admins do see (by title) the archived revisions.
[21:52:25] If creating and deleting a page multiple times, archive will have each rendition of the title with its own page id
[21:52:58] which feels like a useful distinction, except it's completely invisible to users and not usable in any way right now.
[21:53:16] we'd lose that given the page ID would be re-used
[21:53:28] but we'd only lose a potential future feature, not anything current.
[21:53:44] seems like such feature would be better suited by a more general tree-based history, not just based on create/delete.
[21:54:24] we would lose the "feature" that moves abandon deleted revisions (which is abused for history split and merge)
[21:54:40] TimStarling: Ah, that's interesting. We could have the deleted page as its own row? Basically merging the idea of page_archive into page.
[21:55:13] I'd say that's a win, not a loss, but yeah.
[21:55:38] it's a win as long as history split/merge remains possible
[21:55:59] E.g. if creating/deleting "A", then undeleting it only in part (for whatever reason), and then a re-name unrelated to that at a later time, would currently abandon some revisions (invisible from the title-based undelete)
[21:56:01] it was always very sketchy from a legal POV, mind you, but it does get used
[21:56:40] Hm.. I don't see why merge/split wouldn't be possible anymore but need to think about it.
[21:57:17] Merge is currently emulated by deleting the destination, moving the subject there, and undeleting underneath it?
[21:57:24] yes
[21:57:47] it's a disaster as we've previously discussed in the context of rev_parent
[21:58:10] merge needs to get its own UI
[21:58:15] that currently works because undelete searches 'archive' by title.
[21:58:46] in fact it already has its own UI, Special:MergeHistory, we just need to complete that project
[21:58:51] if it continues to do that, by searching revision for current page ID + other page ids from page=archived, Tim's idea could make that work.
[21:59:19] But I agree there is no reason not to just update MergeHistory to update pageid pointers instead.
[21:59:20] almost out of time, what are the next steps here?
[21:59:47] #info title table can be done first and separately
[22:00:56] #info perhaps possible to have deleted pages in the page table, with (page_deleted,page_title_id) index
[22:01:07] #info Without additional care, manual history merge via undelete might stop if no longer title based but page_id based. To be fixed by either continuing to be title based (using page_archive, or non-unique page_title), or by requiring use of Special:MergeHistory and updating that to do rev_page updates instead of delete/move/undelete.
[22:01:34] tgr: does that sound right?
[22:02:18] #info current selective deletion use case (hiding things from history) could be done by creating a new deleted page and moving selected revisions into it
[22:02:48] Ha, nice hack :)
[22:03:25] It's like how I use MergeUser to get rid of spam accounts on my third-party wikis. (Merge into "User:VandalDeposit")
[22:04:37] ok, all done I guess
[22:04:41] #endmeeting
[22:04:41] Meeting ended Wed Jul 18 22:04:41 2018 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
[22:04:41] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-07-18-21.00.html
[22:04:42] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-07-18-21.00.txt
[22:04:42] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-07-18-21.00.wiki
[22:04:42] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-07-18-21.00.log.html
[22:05:09] Krinkle: yes
[22:05:22] it's a big, complex project