[18:02:44] hello all, tech difficulties for credit. updated links coming shortly [18:04:58] use this for youtube: https://www.youtube.com/watch?v=1nKSUGGhpWk&feature=youtu.be [18:05:13] use this if presenting: https://hangouts.google.com/hangouts/_/bjmzkesqlja73hpiytbijb56jae [18:06:21] i can hear and see [18:07:43] thanks ebernhardson [18:13:57] fancy! thanks zjelko [18:15:45] ebernhardson: thanks :) [18:17:27] also i don't know how you can use the default listchars in vim, the ^I bothers me :P [18:18:12] ebernhardson: I don't even notice it :) [18:19:17] i guess i just had to be different, set listchars=eol:⏎,tab:▸·,trail:␠,nbsp:⎵ [21:03:33] #startmeeting ArchCom RFC meeting [21:03:34] Meeting started Wed May 3 21:03:34 2017 UTC and is due to finish in 60 minutes. The chair is DanielK_WMDE. Information about MeetBot at http://wiki.debian.org/MeetBot. [21:03:34] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. [21:03:34] The meeting name has been set to 'archcom_rfc_meeting' [21:03:50] #topic Storing long revision comments, see https://phabricator.wikimedia.org/T153333 [21:03:54] #link https://phabricator.wikimedia.org/T153333 [21:04:16] kaldari: you there? [21:04:27] yes [21:04:45] excellent! i'm afraid i didn't publicize this as well as I should have [21:05:02] I'm here, you're here, who else do we need? :-) [21:05:08] For anyone interested, Daniel's summary of the options at https://phabricator.wikimedia.org/T153333#3221042 is useful. [21:05:20] brion's not here [21:05:27] James_F: heh. well, brion would be good... [21:05:42] True. [21:05:59] yea. anyway, we'll probably have another related discussin next week, about brions proposal regarding the revision table. [21:06:09] that would be an opportinuty to resolve any open questions. [21:06:14] I'm here, but probably have nothing useful to add [21:06:26] * RainbowSprinkles settles back into his armchair [21:06:33] kaldari: do you have essential questions you want to have answered today? if not, i have some :) [21:06:43] would be good to have DBA input, but bad time zone for them :( [21:07:05] mostly just wanted to hear opinions on the different options you suggested. [21:07:13] In addition to purely architectural considerations, I'm extremely interested in migration path [21:07:27] per https://gerrit.wikimedia.org/r/#/c/350097/3/maintenance/tables.sql [21:08:01] ok. so here is one thing that could be done quickly: add a new table, which has (at least) a blob for the comment, and a revision id. [21:08:08] add code that writes comments to that table. [21:08:28] add code that reads comments from that table, if the revision has an entry there. otherwise, it falls back to the old system [21:08:55] what about log entries, does this discussion cover them? [21:08:59] this needs no schema change of existing tables, and no population script. it's easy to roll out, and fairly easy to back out of. [21:09:37] DanielK_WMDE: and do you feel like that's compatible with the long-term plans for the revision table? [21:09:40] MaxSem: that's one downside of this proposal: it does not cover log entries. [21:09:46] and it doesn't provide deduplication [21:09:59] it's not especially compatible [21:10:00] which is surprisingly relevant, according to jaime's analysis [21:10:30] I think it's an extra layer of legacy which would complicate the main schema change [21:10:55] so which approach would you prefer? [21:11:25] I think rev_comment_id -> comment_id, as proposed [21:11:41] and also log_comment_id -> comment_id and wherever else we need a comment concept [21:11:49] I'm not really sold on deduplication [21:12:11] or at least, I think it should be a storage concept [21:12:19] I don't have a strong opinion on deduplication but I'm really not sold on hashes as IDs [21:12:22] if comments are a UI concept, then I don't think it makes sense to deduplicate them [21:12:27] #info option 1: new table with comment blob and rev id. no deduplication. no support for log entries. easy to deploy. [21:12:52] RoanKattouw: that's not necessary for dedupe. but i really like the idea :) I seem to be the only one though, so let's not get hung up on that [21:13:04] I know statistically the probability of a collision may be minuscule, but when (not if) someone figures out a way to generate a collision, you're screwed [21:13:07] #info TimStarling re option 1: I think it's an extra layer of legacy which would complicate the main schema change [21:13:13] deduplication supposedly provides something like 50-70% saving, right? [21:13:23] cf. how WebKit bricked their SVN repo by checking in two files with the same hash [21:13:25] it was a bit hard to extract this figure from the task, I could have it wrong [21:13:33] #info TimStarling: I think rev_comment_id -> comment_id, and also log_comment_id -> comment_id [21:13:45] #info I'm not really sold on deduplication [21:14:17] TimStarling: for Wikidata at least, the saving would be > 50%. [21:14:32] Nobody has found a way to get duplicate hashes in git yet, but once they do, git repos will break too. So I think it's better to design for hash collisions happening eventually [21:14:35] RoanKattouw: yes, i loved that :D [21:14:50] or at least minimizing the impact of one to non-catastrophic levels [21:15:24] * anomie managed to be around for this meeting after all [21:15:36] RoanKattouw: yes, i was just going to say that: it should be easy to mitigate the impact for the use case at hand. but again, let's not get hung up on hashes-as-ids. you can have autoincremet ids plus hashes, for dedupe [21:15:45] Yeah I would favor that [21:15:46] hey anomie! [21:15:47] No DBA here? Sucks when it's an RFC about database stuff. [21:16:09] yes, i agree... [21:16:12] autoincrement as the "real" IDs, and we can still hav ehashes in there [21:16:19] Well that's because it's the same day as the DC switchover [21:16:25] TimStarling: for next week's meeting, and we bribe a DBA? Actually, one of them is in Berlin, no? [21:16:28] The DBAs are taking a well-deserved break :) [21:16:43] But yes we definitely one next week [21:16:48] RoanKattouw: In this case, the consequences of someone finding a colliding hash are either "revision is saved with the wrong comment" or "can't save due to hash collision" depending on how we validate. [21:17:05] i'd go for the latter. [21:17:11] Oh right, it's comment not content, good point [21:17:14] it's gonna be *RARE* [21:17:36] Here's another question, which is perhaps better for next week: do we want to do this as one massive schema change, or in parts [21:17:42] it would be nice if we could cluster by text and then apply innodb page compression on top of that [21:17:50] Because presumably the DBAs are going to be less than trigger-happy about anything that starts with ALTER TABLE revision [21:17:58] One important question is whether adding a column to revision is horrendously expensive or not. [21:18:22] e.g. have an 8-char prefix of comment_text at the start of the primary key [21:18:43] Brion's Schema Change To Rule Them All involves adding to rev fields, rev_comment_id and rev_actor [21:18:47] then deduplication would not be so necessary, right? [21:19:06] RoanKattouw: my idea is: build a test system that uses the new mechanism. deploy all the schema changes at once, but only start using the new structures bit by bit. [21:19:21] OK, so something like: [21:19:36] TimStarling: if we go for rev_comment_id -> comment_id, this is blocked on "all the stuff on the revision table". [21:19:41] so... 6 months? [21:19:42] 1. Deploy schema change that adds rev_comment_id, rev_actor, log_{whatever we need here}, bigger size for page_latest all at the same time [21:19:52] 2. Deploy code that starts using rev_comment_id with fallback [21:20:06] on rev_comment [21:20:18] 3. Migrate all rev_comment entries to rev_comment_id [21:20:29] 4. Deploy code that never looks at rev_comment any more [21:20:33] 5. Eventually drop rev_comment [21:20:38] Repeat 2-5 for other features [21:20:40] ? [21:21:13] Right, we could also consider ways to not make this blocked on the revision table schema cahnge [21:21:19] That sounds like a good enough plan. [21:21:21] e.g. by adding comment_rev with the plan to drop it eventually [21:21:41] RoanKattouw: that's kind of the subject of this meeting. kaldari wants to know whether he is blocked on that. [21:21:50] RoanKattouw: That'd be my thought too. I don't know if jynus and brion have a better idea. [21:22:04] Oooh this is for the community tech non-Latin languages thing, gotcha [21:22:09] and jaime said that's he'd *really* like to see comments move out of revision, *especially* before doing a big schema change [21:22:10] yep [21:22:16] Right, OK [21:22:33] Sounds good, Roan [21:22:35] Well in that case, for both comment and actor really, we can have a comment_rev field as a workaround for rev_comment_id [21:22:59] that's option 1, which TimStarling said he doesn't like... [21:23:01] that is a good point, ALTER TABLE would be faster if a script first blanked all the rev_comment fields [21:23:02] RoanKattouw: Wouldn't work too well for actor, we really need deduplication there. [21:23:04] One annoying detail is log comments [21:23:10] anomie: Good point [21:23:38] I suppose we could have a (rev_id, actor_id) table that we later merge into revision once rev_actor is created. [21:23:44] we can also do two new tables: comments, with a blog and an id (and whatever), and one that links comments to revision. [21:23:50] TimStarling: Oh, I see, you're saying if we blank all rev_comments before my step#1 then it'd be fsater [21:23:51] since it's a new table, we on't need to touch revision [21:24:02] Ooh, good idea [21:24:06] we can add rev_comment_id later, or keep relying on the relation table [21:24:07] Then getting rid of it later will also be cheap [21:24:14] yes, that it is actually jaime's main motivation for recommending this, you know [21:24:25] remove comments from revision to make future ALTER TABLEs faster [21:24:27] (anyone on labs that uses rev_comments is going to hate us) [21:24:32] Cause otherwise you have to ALTER the comment_rev field back out and that would be expensive as well [21:24:55] domas suggested just expanding the rev_comment field in place, because he doesn't care about ALTER TABLE time so much [21:24:58] Do we need a pointer from comment table to revision? [21:25:07] yeah, a bridge table sounds cleaner since you could just replace it with rev_comment_id later and drop the bridge table [21:25:10] rev_comment could be blob|bigint. [21:25:30] Although I guess that does't work with JOIN. [21:25:31] Krinkle: not if we have that link the other way around [21:25:46] Krinkle: We don't except to work around the initial absence of rev_comment_id [21:25:55] E.g. rev_comment OR rev_comment_id>comment_id [21:26:00] Storing the comment ID in the BLOB field is an interesting hack, I wonder how well that works [21:26:06] Krinkle: also, it can be done *now*, no blocking on the revision change [21:26:07] a left join should make that efficient [21:26:26] TimStarling: what do you think of having a comment table plus a revision_comment relation table? [21:26:28] It's a dirty hack that makes me cringe a bit but it's clever [21:26:41] Although... comments can be anything, so they could be numbers too and that would break [21:26:50] RoanKattouw: *ick* [21:27:06] RoanKattouw: Yeah, nevermind the multi-purpose field. We can add the field to revision but gradually migrate. [21:27:08] So I don't think this works unless you prefix all raw comments which defeats the purpose [21:27:14] We did that with rev_text [21:27:18] * DanielK_WMDE has lost track of the #info [21:27:18] but we never batch query that in the UI [21:27:20] Or \u0 or whatever. [21:27:32] Anyway [21:27:34] * DanielK_WMDE can't chair and take part in the discussion [21:27:35] Whereas do we batch query comments so we really want to avoid having to do a separate query right? [21:27:39] It sounds like the consensus is somethin glike [21:27:57] Initially put in a relation table / bridge table mapping rev_id to comment_id [21:28:06] Put in the comment table exactly the way we want it from the start [21:28:16] Migrate all comments to it, and blank rev_comment [21:28:20] #info option 3: two new tables; comment <-- revision_comment --> revision [21:28:28] Then alter the revision table to drop rev_comment and add rev_comment_id [21:28:49] And do some more migration stuff so we can eventually use rev_comment_id and drop the temporary revision_comment table [21:28:53] RoanKattouw: we may also decide we don't need or want that last step [21:28:59] Yeah we could also keep the relation table [21:29:04] maybe it is the best of a bad bunch of options [21:29:11] I wouldn't be opposed to that [21:29:15] It'd certainly ease the migration [21:29:19] #info possible, eventually drop the relation table, replacing it with rev_comment_id. [21:29:32] I suspect we'd want to drop the relation table eventually, more joins is more expensive. [21:29:45] #info It sounds like the consensus is somethin like: Initially put in a relation table / bridge table mapping rev_id to comment_id, Put in the comment table exactly the way we want it from the start, Migrate all comments to it, and blank rev_comment, Then alter the revision table to drop rev_comment and add rev_comment_id, And do some more [21:29:45] migration stuff so we can eventually use rev_comment_id and drop the temporary revision_comment table. [21:29:48] Is adding an optional field to revision still slow? I thought it was the migration that's slow. [21:29:54] i think this option gives us the most felxibility, and has the fewest blockers [21:30:12] #info It sounds like the consensus is somethin glike [21:30:12] the point is just to avoid two ALTER TABLEs of the revision table? [21:30:17] Krinkle: We don't have any DBAs in attendance, but I'm working on the assumption that anything that alters the revision table is slow [21:30:17] It's not clear to me why we need a separate table in the short term as opposed to adding rev_comment_id. [21:30:22] #info Initially put in a relation table / bridge table mapping rev_id to comment_id [21:30:24] Krinkle: I don't know, and the people I know would know aren't here. [21:30:50] afaik an alter to add a new optional column should be instantenous pretty much. [21:30:50] If adding fields is not super painful, then I agree there's no reason for this somewhat Rube-Goldberg-esque migration scheme [21:31:04] THe migration is gonna be slow and that will likely require a depool dance, but this proposal doesn't resolve that. [21:31:14] We'd just add the fields and then migrate towards using them over time [21:31:34] TimStarling: yes. because we want to do all the altering at a single time, and we'll need time to figure out all the altering we want to do to revision. the extra table allows long comments to be enabled soooner. [21:31:42] 14:24:15 yes, that it is actually jaime's main motivation for recommending this, you know [21:31:42] 14:24:26 remove comments from revision to make future ALTER TABLEs faster [21:31:52] how about having the relation table be very very temporary [21:31:52] Krinkle: It accomplishes not having the comments in the revision table when we start altering it [21:32:03] not even used by MW [21:32:06] #info afaik an alter to add a new optional column should be instantenous pretty much. [21:32:09] TimStarling: it also means we only need to alter revision after the comment fields have been blanked. that means moving less data around during the ALTER TABLE [21:32:25] TimStarling: Oooh, you mean purely as part of the migration? [21:32:32] yeah [21:32:39] Hm... interesting. Maybe the ALTER will be faster if the field is nulled out everywhere. [21:32:40] That's a good idea [21:32:46] #info how about having the relation table be very very temporary not even used by MW [21:33:04] Krinkle: Per Tim, Jaime appears to believe so [21:33:07] Krinkle: that'S how understood what jaime said [21:33:10] I still don't see why the ALTER would be slow for an extra field not part of any index at first, but yeah, if it is, a temporary table could be a neat way to speed up the migration script. [21:33:14] And things Jaime believes about DBs tend to be true :) [21:33:33] Okay, I'd like confirmation from Jamie on that, but +1 in that case. and if he says otherwise, I'm sure nobody would oppose adding the field directly to revision. [21:33:33] TimStarling: How would we have the relation table but not have MW using it at some point in the middle? [21:33:36] can we have a view that looks like the revision table, but with rev_comment being the result of a join with comment and the relation table? [21:33:46] That's a great DBA question [21:34:08] I think you probably can in principle, not sure if you can and expect things to stay up [21:34:23] TimStarling: I'd thnk so, but you'd have to rename the existing table, i think... not sure if that's real quick, or real slow. [21:34:34] surely it would not be worse for performance than having MW do the join [21:34:36] You'd never need to sort/where/join on that fake field though so maybe it'll be OK [21:34:38] For reads a view would certainly work, as long as we don't wind up with crazy-stupid query plans. Not sure about writes. [21:34:43] anomie: Presumably the migrate script we run on each slave would, instead of creating the table and moving values, and clear rev_comment, it would also at the same time then continue to create rev_comment_id and fill that too and then drop the table. [21:34:50] Krinkle: Maybe https://phabricator.wikimedia.org/T153333#3153430 answers your question. [21:35:32] ok, half time warning. i'll try to summarize, tell me if I'm wrong. [21:35:33] Krinkle: Except that as the maintenance script runs MW would be inserting more comments into rev_comment. [21:35:48] 1) no comment_rev. That's just wrong. [21:35:57] 2) introduce rev_comment_id eventually [21:36:00] So we'd probably have to go read-only at some point so the maintenance script could catch up. [21:36:01] Nikerabbit: Yeah, but it's not clear to me whether Jamie there is solely referring to the ALTER query that adds the optional field, or the migration as a whole (e.g. the filling of the new comments table etc.) [21:36:08] re #1 agreed, I only proposed it as a hack anyway [21:36:26] 3) maybe have a bridge table for a while, connecting comment and revision, so we can blank out rev_comment before altering the table [21:36:52] is that a fair summary? [21:37:19] do we need to doscuss this further? [21:37:50] if this is ok as a summary, i'd like to move on to another question regarding the comment table. [21:37:52] I think maybe not? [21:37:55] DanielK_WMDE: Summary seems good to me. [21:38:17] LGTM [21:38:20] ok. next question: what should be *in* the new comment table? [21:38:23] I really need to read the MySQL manual on views [21:38:29] I don't mind the bridge table as long as MW will never need to have code to support it. So it's a migration utility only. Otherwise we're looking at quite a bit more coding to do it. [21:38:47] we can have a blob field for plain text, or put json there and the plain text into the json [21:38:54] or have separate fields for plain text and meta-data [21:39:01] we can have a field for deletion flags [21:39:10] https://gerrit.wikimedia.org/r/#/c/350097/3/maintenance/tables.sql proposes auto_increment ID + a BLOB with raw text [21:39:20] any other fields that would be useful? [21:39:29] any thoughts on having a blob for structured meta-data here? [21:39:41] Hmm, what metadata would we have about comments? [21:39:55] Re deletion flags, doesn't deletion operate at the revision level? [21:39:56] If we don't ever want to deduplicate, the one piece of metadata I can think of is a field to indicate the comment has been RevDeled. [21:40:01] RoanKattouw: yes. but a flag for suppression would be good. and wikibase uses nasty hacks for translatable comments. [21:40:01] (Although I'm not sure how MCR proposes to change htat) [21:40:03] is it part of this proposal to make archive table use comment (like it does for text), or will that still "flatten" it (e.g. will rows ever be removed from the comments table) [21:40:43] Depending on that, we need a way to hide deleted content from the comments table (or decide not to publish it in labs) [21:40:56] #info Don't forget about the archive table: it stores revision comments of deleted revisions [21:41:01] Which is difficult, but I think we all agree that append-only is the preferred model. [21:41:03] Ah, publishing in labs is a good use case [21:41:07] RoanKattouw: I would *really* like a way to put a machine readable description of the edit there, and render it to users in their language. very helpful for multilingual projects [21:41:15] Krinkle: brion's change will flatten archive, I think? [21:41:34] James_F: As written in Gerrit it does not, but it needs to [21:41:35] Krinkle: I'd say that archive should have ar_comment_id. Really ideally we'd just do T57398, but that's another huge change to everything. [21:41:35] T57398: Move page deletion to a RevDelete mechanism; kill archive table (fire optional) - https://phabricator.wikimedia.org/T57398 [21:41:47] Right [21:41:48] Yeah, the draft has that as a TODO. [21:41:54] Oh yeah I'd love to kill the archive table, but agreed that in the meantime we need ar_comment_id [21:41:59] possible structired data: mark comment as supressed, an hypothetic translatable message data (like for displaying revert messages in the user language with a message key, user reverted)... [21:42:26] And possibly also another bridge table from archive to comment and all that [21:42:32] Oy. [21:42:39] nearly all edit comments done on wikidata need to be translated [21:42:48] Oooh, I like Vulpix's idea of translatable comments [21:43:07] +1 [21:43:09] At least in the case where the comment is a canned summary generated by software [21:43:10] That'd be very de-dupable, too. [21:43:14] Then it can just be a parameterized message [21:43:21] exactly [21:43:33] #info 14:42:00 possible structired data: mark comment as supressed, an hypothetic translatable message data (like for displaying revert messages in the user language with a message key, user reverted)... [21:43:39] we already do this for wikidata, but in a *very* hacky way [21:43:40] "translatable comment" == "serializing a logging table row in JSON and shoving it in the blob"? [21:43:41] Though we'd be in the exquisite case in the future if we wanted to add/change parameters to that message. [21:43:53] anomie: Something like that I gues [21:44:07] so, how about a type field? type could be plain, or parametrized, or message key, or JSON blob... [21:44:17] Yeah, that's how logging does it, right [21:44:28] Not using messages directly but using types that map to messages [21:44:34] I'd rather not make our data even harder to maintain, migrate, export, import. Historical data should ideally be least tied into active state. [21:44:54] Making it use mesages means everything breaks when you disable an extension that assisted in some edit. [21:45:00] that's true [21:45:00] comments are not diffs. [21:45:04] The logging table has this too [21:45:15] Yeah, which is an anti-feature. [21:45:32] Us never being able to disable LQT from production is not good. [21:45:38] You cannot render a log entry if the extension that generated it is no longer installed [21:45:38] Which, agreed, is an anti-feature [21:45:44] You could have the text as a backup though [21:46:12] The logging table does format entries in sentences using messages, but so does recent changes, but at a higher level, using only wiki-level concepts like pages and their actions, not at the content-level. [21:46:12] RoanKattouw: Also, Echo notifications. [21:46:29] Have the primary comment text be the text in the content language, and the metadata be the messageKey+params you used to generate that text, and the UI can decide which one it wants to render [21:46:30] But more volatile. [21:46:30] Similar issue if I recall correctly [21:46:51] Yes, same thing [21:46:53] Anyway -- it sounds like there are some use cases for comment metadata [21:46:56] RoanKattouw: having the text as a backup -> JSON with plain text plus message key plus parameters? [21:47:24] The ones I've heard are futuristic i18n message-based comments, and deletion flags so that labs replication is easier to censor [21:47:32] if the value is a blob either way, this could be moved to a separate proposal (dynamic/structured edit summaries). I thought we wanted long edit summaries sooner, not later :) [21:47:49] RoanKattouw: would you prefer a meta-blob, or a type field that would tell MW that the blob is JSON, and contains metadata along with the actual text? [21:47:57] Krinkle: Yes, but we would need to decide on whether it's plain text or a blob earlier [21:48:10] Krinkle: yes... my question is, do we need two blobs? [21:48:18] RoanKattouw: Hm. right, since you can't distinguish json from manually entered text otherwise. [21:48:23] DanielK_WMDE: Probably the latter [21:48:57] Since that's the pattern used in e.g. the text table to say "this text is actually gzipped / a URL / whatever" [21:49:06] ok. so we have id, text, meta, and.... more? [21:49:36] RoanKattouw: oh, sorry roan. you mean, a type field, not two blobs? [21:49:37] meta being a metadata blob, or flags/type things [21:49:50] Yes, I meant a type field or a flags field like the text table [21:49:53] But I have no strong opinion [21:49:59] Right. Either (id, blob, flags) where flags says whether blob is plain or json. Or, (id, blob text, blob metadata). In other words, if we have meta data should be put the blob in there as well? Or is there use in keeping in separate? [21:50:04] if we have two blobs, one could just always be the plain message. and the second one could contain whatever [21:50:05] I'm only saying that because the text table does it and because it allows us to have one blob instead of two [21:50:08] that makes things mor erobust... [21:50:17] Yeah that's a good idea [21:50:18] Presumably the meta data is not gonna be sql-queryable either way, so what's the added value in having it be in two columns? [21:50:22] Another thing to consider here is whether we're going to do deduplication. If so, we probably don't want to put RevDel in the comments table. [21:50:23] That would make things simpler for outsiders too [21:50:29] Less of a "WTF is this schema" factor [21:50:39] Krinkle: the value is in having one blob that is *always* plain text [21:50:47] DanielK_WMDE: Right. [21:50:54] Makes archiving easier as well. [21:50:57] Yeah I think I'm sold on that [21:50:57] I like that. [21:51:12] :) [21:51:23] Which means we don't need a type/flags column either, right. [21:51:23] ok. do we need a "deleted" field? [21:51:43] DanielK_WMDE: That depends on whether we want deduplication, IMO. [21:52:03] #info rough consensus forming for (id, text, meta-blob). possibly plus a hash for deduplication, or a deletion flag. [21:52:16] Reasons to not have comment_deleted: It's revision meta data, and like text, comment should be append-only and stateless. Reasons to have it: Deletion can be applied to user name, comment or text separately. So having it inside comment would make filtering for labs much easier. [21:52:21] anomie: does it really? i mean, if one occurrance of a comment should be supressed, shouldn't all? [21:52:30] So we would have one blob that is always just text for the raw comment and one blob that is for metadata? Is that the consensus? [21:52:38] kaldari: Yes [21:52:38] DanielK_WMDE: Then some fool RevDels a revision with the empty comment. [21:52:51] haha great point [21:53:02] anomie: so? [21:53:15] Now all revisions with empty comments have their comment hidden [21:53:24] Because the empty comment is massively deduplicated [21:53:33] DanielK_WMDE: So suddenly 100000 revisions have their comment hidden. How do we even log that? [21:53:47] I love how "How do we even log that" is your first concern [21:53:54] rev_deleted is already a bit field that indicates user ^ comment ^ text. So I don't think we want to move that? Same reason surpressing the user of 1 revision should not affect actor in the future. [21:53:56] ha, good point about the logging :D [21:54:06] #info Reasons to not have comment_deleted: It's revision meta data, and like text, comment should be append-only and stateless. Reasons to have it: Deletion can be applied to user name, comment or text separately. So having it inside comment would make filtering for labs much easier. [21:54:38] Yeah, I think I agree with that reasoning [21:54:44] To not have revdel be in the comment table [21:55:08] #info rough consensus forming against having a deletion flag in the comment table [21:55:12] One issue is though, labs would have to censor comment rows that have 0 non-revdel-ed uses in the revision table [21:55:23] That's probably going to be a pain [21:55:30] ugh. right. [21:55:34] OTOH a special-purpose index on rev_comment_id could deal with that [21:55:42] RoanKattouw: I already asked WMF Legal whether we'd really need to do that. [21:55:48] (Hopefully we wouldn't need that index everywhere, only at the replication point) [21:55:50] They haven't gotten back to me yet. [21:56:09] #info labs would have to censor comment rows that have 0 non-revdel-ed uses in the revision table. [21:56:26] anomie: Well if someone doxxes someone in an edit summary... [21:56:41] anomie: My getLazyLegalAdvice() says yes, we will. [21:56:42] ok, five minute warning [21:56:44] The drawback to that would be that every join on the comments table would have to re-join on every table with an xxx_comment_id field... [21:56:53] let's try to come to an end and summarize [21:57:19] anomie: For labs replication we would have to do something painful like that, but for production usage I hope we wouldn't have to [21:57:44] Because I don't see MW looking at the comment table without first/also looking at the revision/archive/logging row that directs it there, and that'll tell you if its revdeleted [21:57:51] * DanielK_WMDE is glad that he isn't responsible for toolserver replication anymore [21:58:02] RoanKattouw: For prod usage we wouldn't. Just for the Labs views and whatever generates the dumps. [21:58:08] Yeah [21:58:33] Good point on dumps [21:58:50] Though those would also use the revision/... tables as points of entry to get to the comment table [21:58:57] * anomie notes that someone could dox someone in a page title, which even when deleted is still in the part of the archive table that's visible in Labs [21:59:14] RoanKattouw: We have raw SQL dumps [21:59:18] Actually, my point about labs might be a red herring. I don't think we hide deleted content in labs, labs users have sysop view permission on the database essentially. It's one hte reasons why tools must use rev_deleted=0 in their queries whenever showing results publicly (unless they do OAuth). [21:59:19] Ugh right [21:59:42] Or maybe that is redundant nowadays, I can't remember. [21:59:48] #info consensus (earlier): no comment_rev field; have rev_comment_id eventually; use a bridge table between comment and revision as a stepping stone. Don't forget archive. have separate fields for text and meta-blob. [21:59:57] Krinkle: The views in Labs null out columns based on rev_deleted and the like. [22:00:02] Krinkle: Hmm, odd. I wonder how that deals with oversighted content, you need more than just sysop to see that [22:00:05] #info probably no deletion flag. but maybe, because labs. [22:00:21] #info deduplication needs further thought. attractive, but the details are tricky [22:00:26] RoanKattouw: yeah, maybe this policy changed when oversight changed to use rev_deleted too [22:00:41] Afaik archive is visible in labs though, and intended to be. [22:00:47] #info Also don't forget the logging table, although that's less urgent [22:01:03] which I guess now ironically is potentially less hidden than revdel in case of oversight. [22:01:06] ok, time is up! [22:01:12] that's the case in prod as well though, not labs related. [22:01:15] any last things to add to the log? [22:01:29] if you want to keep chatting, please take it elsewhere, kids :)I [22:01:47] DanielK_WMDE: Thanks for facilitating and note taking (and participating)! [22:01:57] elsewhere == #wikimedia-dev [22:02:10] kaldari: np! I hope this was helpful. [22:02:14] * anomie notes #wikimedia-dev belongs to the bots [22:02:23] we did find a way no to be blocked on the revision schema change, so yay :) [22:02:37] #mediawiki-core [22:02:42] that'S good, too! [22:02:47] i'm going to bed, though [22:02:55] #endmeeting [22:02:55] Meeting ended Wed May 3 22:02:55 2017 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) [22:02:55] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-05-03-21.03.html [22:02:55] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-05-03-21.03.txt [22:02:55] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-05-03-21.03.wiki [22:02:56] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-05-03-21.03.log.html [22:03:07] DanielK_WMDE: Thanks for facilitating!