[00:55:05] kaldari: still interested in pizza?
[01:37:35] Is the copyright shindig still happening? The livestream is blank.
[01:38:00] oh... it just popped up!
[20:59:49] #startmeeting RFC meeting
[20:59:50] Meeting started Wed Jul 29 20:59:49 2015 UTC and is due to finish in 60 minutes. The chair is TimStarling. Information about MeetBot at http://wiki.debian.org/MeetBot.
[20:59:50] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
[20:59:50] The meeting name has been set to 'rfc_meeting'
[21:00:25] #topic Content model storage | RFC meeting | Wikimedia meetings channel | Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
[21:00:45] #chair gwicke legoktm
[21:00:46] Current chairs: TimStarling gwicke legoktm
[21:00:47] o/
[21:01:23] #link https://www.mediawiki.org/wiki/Requests_for_comment/Content_model_storage
[21:02:05] we have jynus = jcrespo here who I think started this discussion on gerrit
[21:02:23] sorry for that :-)
[21:02:59] the original discussion started on https://gerrit.wikimedia.org/r/#/c/221306/, where we wanted to populate the rev_content_* fields by default, but that was vetoed due to space concerns
[21:03:28] so this RfC is a way to always populate those columns in a way that is efficient with respect to storage space
[21:03:45] I've been looking through the emails and gerrit changesets from 2012 to refresh my memory about the original design discussion
[21:04:14] there was this change: https://phabricator.wikimedia.org/rMW906a1ba51f149d659e270597e4b964cc8c550357
[21:04:40] you see that before that, content models and formats were integers allocated in Defines.php
[21:04:49] so basically this is normalizing the content model/format values
[21:04:53] and those integers were also used as an internal interface
[21:04:56] SMalyshev: yes
[21:05:46] I had concerns about that original Defines.php scheme: extensibility, and ease of use for non-MW clients of the MySQL database
[21:05:48] if they won't have to be denormalized each time somebody touches the page then it looks fine. If they do, I think a performance evaluation needs to be done
[21:06:00] to clarify one point, if I am allowed to chime in: space == performance on a database; disk storage is not that big of an issue, redundancy is
[21:06:44] I think I suggested normalized storage at the time but daniel ruled it out just on the basis of the work needed
[21:07:09] although I haven't found notes on that
[21:08:06] revision will always be the largest table on a wiki, and reducing its size by even a few bytes per row means more data in memory and more speed
[21:08:25] SMalyshev: we will need to know the content_models int -> text mapping pretty frequently; nearly all actions (including view) need to know it. I was thinking of either doing a JOIN on the table, or storing the mapping in APC/memcache
[21:08:56] unfortunately DanielK_WMDE is not here to confirm
[21:09:13] while a join on a table with 2-10 rows is negligible. "Joins are slow" is a myth
[21:09:16] legoktm: caching the mapping may be good, yes - but the question is how to invalidate it if somebody adds one in an extension, e.g.
[21:09:46] oh, the join doesn't concern me, the performance overhead will be the insert
[21:10:00] unless you have a subselect inside an insert? is that allowed?
[21:10:13] inserts indeed allow joins
[21:10:22] well there's insert ... select
[21:10:26] heh
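A rough sketch of the "insert ... select" idea floated just above: the model name is resolved to its numeric id inside the insert itself. All names here (content_models, cm_id, cm_name, rev_content_model_id) are placeholders rather than the RFC's final schema, and the other required revision columns are omitted for brevity.

```sql
-- Hypothetical: insert a revision row while looking up the model id by name,
-- so the application only ever deals with the string 'wikitext'.
-- (Other NOT NULL revision columns omitted for brevity.)
INSERT INTO revision (rev_page, rev_text_id, rev_content_model_id)
SELECT 1234, 5678, cm_id
FROM content_models
WHERE cm_name = 'wikitext';
```

In practice the discussion below settles on caching the mapping in the application and doing a plain insert, since new content models are rare.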
[21:10:53] SMalyshev: look up the cache; if an id from the database is missing, we can just hit the database and look it up. Adding new content models should be relatively rare
[21:11:16] (and after looking it up in the db, update the cache)
[21:11:43] yeah, you can't remove IDs, so that would work well enough
[21:11:52] and it would keep the queries simple
[21:12:14] legoktm, yes, for me caching in app memory makes sense too, although I cannot comment on how easy managing that is
[21:12:28] yeah, given the cache there's probably no need for fancy inserts, just look it up and do a plain insert
[21:12:45] how many content models are there? probably not a million
[21:13:31] I think we probably have 10? in core + Wikimedia-deployed extensions
[21:13:45] Probably not even that in WMF-deployed.
[21:13:55] yeah, so keeping it in memory is not an issue
[21:14:03] jynus: any comments on the migration scheme? https://www.mediawiki.org/wiki/Requests_for_comment/Content_model_storage#Migration
[21:14:25] wikitext, js, json, css, flow, wikibase, lqt…
[21:14:30] CSS, JS, JSON, wikitext, Scribunto, Flow, Wikibase*2, MassMessage, ProofreadPage, JsonConfig off the top of my head
[21:14:34] James_F, LQT does not have a content model.
[21:14:41] matt_flaschen: Point.
[21:14:54] as an ops person, the only thing I would be worried about is changing "revision"
[21:15:03] we've done it before
[21:15:13] it's work though, and that migration scheme requires doing it twice
[21:15:15] which I think is not proposed
[21:15:55] it suggests altering the revision table twice: once to add the new column, and once to remove the old one
[21:15:58] yes, it is more than possible, but it would be the only thing that would cause work
[21:16:25] but I do that every day :-)
[21:16:26] removing the old column is technically optional, but it would be confusing imo to leave it around
[21:16:55] so it could be deferred until there is another change to combine it with, if necessary
[21:17:10] up to jynus and springle I guess
[21:17:57] I discussed this briefly with gwicke last hour
[21:18:02] we will never ever have more than 30K/64K models, right?
[21:18:12] lol, I don't expect us to
[21:18:21] jynus: you want to save 2 bytes per row?
[21:18:43] I hope not :-)
[21:18:45] No, each model is basically a major feature, and not all major features require them.
[21:19:07] I can't imagine ever having more than 100.
[21:19:16] legoktm: Famous Last Words. But I agree.
[21:19:28] I see the proposed schema has a smallint as the primary key in the two new tables
[21:19:32] Yeah, I think it's unlikely we'll ever need more than 2^16
[21:19:37] Let's not make it 8 bits. :)
[21:19:46] It's probably prudent not to... yes
[21:19:50] let's be conservative for future changes but at the same time let's save space
[21:19:53] There are already more than I thought, as legoktm pointed out.
[21:19:58] we can always alter the table if it grows larger
[21:20:01] 2^8 is too small, we wanna leave a bit of wiggle room for content model migrations and stuff
[21:20:23] +1 to 16 bits. That should be fine for the foreseeable future.
[21:20:35] you know we may have to have a 64-bit text.old_id some day
[21:20:46] no big deal, just alter the table
[21:20:56] Hmm, yeah, that makes sense
[21:21:06] enwiki revision is already, what, 600-700M?
[21:21:51] max(old_id) is 679M
[21:21:55] Right
[21:22:11] So 2 billion text rows is a ways off but clearly going to happen
[21:22:29] 334G Jul 29 21:22 revision.ibd
[21:23:09] Just change it to unsigned int. ;)
[21:23:37] Oh, it is unsigned, so we're talking 4 billion
[21:23:39] Anyway ---
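For concreteness, this is roughly the shape of the two lookup tables being discussed, with smallint primary keys (2 bytes per row, up to 64K values). The table and column names are illustrative guesses, not the DDL from the RFC page.

```sql
-- Sketch of normalized content model / format lookup tables (MySQL).
CREATE TABLE content_models (
  cm_id   SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- 2 bytes, 64K values
  cm_name VARBINARY(64) NOT NULL,
  UNIQUE KEY cm_name (cm_name)
);

CREATE TABLE content_formats (
  cf_id   SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  cf_name VARBINARY(64) NOT NULL,
  UNIQUE KEY cf_name (cf_name)
);
```

The revision and archive tables would then carry the 2-byte ids instead of repeating the strings on every row.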
[21:23:50] Yeah, I agree, if we do run into trouble we can always alter the table again
[21:23:57] And 64K content models should be plenty for years to come
[21:24:03] so keep it as a smallint?
[21:24:14] smallint = 2 bytes?
[21:24:28] yes
[21:24:40] https://dev.mysql.com/doc/refman/5.0/en/integer-types.html
[21:24:57] Yeah, seems fine to me
[21:25:15] 64K is enough for everybody!
[21:25:17] so the ContentHandler classes would continue to use the same strings?
[21:25:25] yes
[21:26:01] only Revision/PageArchive/LinkCache (and maybe Title) would be aware of the ints (since those do direct db queries)
[21:26:19] and I suppose we'd have a helper function in ContentHandler to store the mapping
[21:26:49] ok, sounds good to me
[21:27:03] there's a bunch of mentions of *_content_format in the code
[21:27:09] as DB rows, seemingly
[21:27:51] so will those be simulated somehow?
[21:27:51] I'm probably being optimistic and there's other scary code hitting the database
[21:28:04] yeah, legoktm is clever, he can deal with them ;)
[21:28:35] legoktm, what code exactly do you see being responsible for populating the extension content models and content formats (e.g. flow-board) into the DB?
[21:28:36] e.g. includes/specials/SpecialUndelete.php etc.
[21:28:57] I see "_content_format" in Special:Undelete, Export, and WikiPage... I'll need to look at those further
[21:29:02] I have some other musings distantly related to this RFC which I would like to raise with this audience, but first I'll ask if this RFC is approved
[21:29:20] +1 to the RFC
[21:29:21] I'm in favor
[21:29:30] question: I assume the migration will take a long time (IIRC adding the content type for Flow took months). WMF's awesome sr. DBAs will manage it, but I wonder if some third-party wiki naively running update.php will be faced with a long outage?
[21:29:49] matt_flaschen: when something attempts to do the mapping, it will populate and create an id if one doesn't exist.
[21:30:00] spagewmf: That already happens :)
[21:30:07] Can I ask a possibly stupid question?
[21:30:17] spagewmf, any wiki with a big enough revision table to have such problems has probably stopped naively running update.php by now.
[21:30:30] There's this proposal from Daniel to have multi-part content revisions.
[21:30:51] It also involves changing the revision table.
[21:30:52] actually, this migration (aside from the alter) is quite trivial
[21:31:15] Is it worth considering that work as part of this RfC? Or just ignoring it for now?
[21:31:23] James_F: that's a bigger topic
[21:31:27] Yes.
[21:31:52] James_F: That seems like a much larger thing that should be handled on its own
[21:32:03] and probably good to do this change first
[21:32:16] legoktm: Will we delay that work by a year waiting for this schema change to be made, though?
[21:32:22] storing multiple content models per revision points to an even stronger need to normalize them
[21:32:38] * James_F nods.
[21:32:54] James_F, have faith in the DBAs, there has been a 100% increase since May
[21:33:02] James_F: dunno. I don't know how long this is going to take from a schema change perspective.
[21:33:18] jynus: I meant more that it'd take a year for it to run as a background process, not that you'd be in any way tardy. :-)
[21:33:35] OK, consider my question withdrawn. :-)
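The lazy "create an id the first time a model name is seen" behaviour described at 21:29:49 could look something like the following at the SQL level; the application would cache the result in APC/memcache. Names are placeholders, and the surrounding MediaWiki helper (something in ContentHandler, per 21:26:19) is assumed rather than quoted.

```sql
-- Look up the id for a model name on a cache miss.
SELECT cm_id FROM content_models WHERE cm_name = 'flow-board';

-- If no row exists yet, register the name and read the id back.
-- The unique key on cm_name makes a concurrent registration harmless.
INSERT IGNORE INTO content_models (cm_name) VALUES ('flow-board');
SELECT cm_id FROM content_models WHERE cm_name = 'flow-board';
```

Since ids are never removed (as noted at 21:11:43), cached values never need invalidation, only additions.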
[21:33:46] not really, some days at most
[21:33:59] hmm, but I think James_F brings up a good point, and that is whether we'd still want to have content models in the revision table at that point
[21:34:01] anyone remember the bug for (roughly) populating the content model that took a lonnng time?
[21:34:18] rather than storing those along with bits of content
[21:34:25] yeah, they'll probably move out
[21:34:27] spagewmf: You mean, the switch to content handler?
[21:34:30] that's just keyed on the revision id, in a separate table
[21:34:37] but I don't think we want to stall this work
[21:35:11] (https://phabricator.wikimedia.org/T96384 is the RfC Daniel's working on in this area.)
[21:35:16] I'm just wondering if it's still worth doing
[21:35:33] whether this RFC is worth doing?
[21:35:35] James_F: yeah, that one. bz508xx or something.
[21:35:42] Because they'd instead be in a different table only?
[21:35:53] the reason this RFC is happening is because Flow needs it yesterday
[21:35:56] Revision -> revision_part -> contenttype.
[21:36:30] The 'contenttypeid' column would move from revision to revision_part (or whatever it's called).
[21:37:09] I think the problems outlined in the RfC (https://www.mediawiki.org/wiki/Requests_for_comment/Content_model_storage#Problems) are real and impacting us on a daily basis, so it makes sense to fix them now
[21:37:15] * James_F nods.
[21:37:53] yeah, the default-as-null issue is probably the more pressing of the two?
[21:38:48] the framework inside MW to have a cached mapping of IDs to MIME types will remain useful whatever we do
[21:38:49] or are you running into concrete performance issues right now as well?
[21:38:49] Yeah, the big issue is that the default can change, which means the meaning of null changes.
[21:39:03] That's why it needs to be explicit.
[21:39:21] *nod*
[21:39:22] gwicke: there will be performance issues with "Moving a page from "MediaWiki:FooBar.js" -> "MediaWiki:FooBar.css" changes the default content model, so it has to update all revision history rows to set an explicit rev_content_model", for example
[21:39:24] gwicke: Converting a namespace from one default to another means a very expensive change.
[21:39:28] it was https://phabricator.wikimedia.org/T51193 (Set $wgContentHandlerUseDB = true on all WMF wikis), FYI jynus it took springle 3 months
[21:40:00] but presumably not 3 months of human work
[21:40:37] gwicke, there are also performance issues. E.g. having strings would make an index (which we need) more expensive.
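To make the default-as-null cost concrete: because NULL currently means "use the title's default model", renaming MediaWiki:FooBar.js to MediaWiki:FooBar.css changes what that default is, so every historical row has to be rewritten to pin the old model. A hedged illustration (the page id is hypothetical):

```sql
-- All prior revisions must get an explicit model so their meaning doesn't
-- silently change when the title's default flips from javascript to css.
UPDATE revision
SET rev_content_model = 'javascript'
WHERE rev_page = 1234
  AND rev_content_model IS NULL;
```

With compact per-revision ids always populated, no such mass update would be needed when a default changes.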
[21:40:42] TimStarling: no :)
[21:40:49] If we don't have this RfC, we will have to (1) run a maint script to populate *_content rows, (2) flip the default, (3) hope no one made any edits in the meantime, and (4) un-populate the new default, for every wiki and namespace
[21:41:15] where we want to change the default
[21:41:36] since all pages were populated with a string for format, that has forced us to buy new hardware faster
[21:42:00] I think it is time to move on from this
[21:42:03] kk
[21:42:07] #agreed RFC approved
[21:42:27] so gwicke and I were talking about what to do about revision storage in the longer term
[21:43:09] you know there is an outstanding bug to increase the maximum length of the rev_comment field
[21:43:35] which is already by far the largest field in the largest table
[21:43:55] and I believe edit summaries are the bulk of recentchanges as well
[21:44:32] https://phabricator.wikimedia.org/T6714 and https://phabricator.wikimedia.org/T6715
[21:44:33] the question is whether we should shard the table, or move the whole table to cassandra, or move the comments alone to cassandra
[21:44:34] they could follow the same pattern of storing things separately, keyed on revision id or perhaps a uuid
[21:44:57] TimStarling: The comments could be part of the multi-part revision concept.
[21:45:12] TimStarling: Given they've got their own special content type (not called as such).
[21:45:13] can I mention a technicality? MySQL stores long blobs physically separated
[21:45:32] and they are not read unless selected
[21:45:47] the reason comments are in the revision table is for efficient construction of the history page and similar things
[21:45:52] jynus: comments are probably not long enough
[21:45:54] all lists of revisions come with comments attached
[21:45:57] watchlists, RC, etc.
[21:46:44] move the whole thing to cassandra if it is up to the task and is fast enough so we don't need a cache for revision content anymore
[21:47:11] legoktm: Is there an actual RfC for it?
[21:47:33] jynus, is the physical separation based on the data type, or the actual size used?
[21:47:50] James_F: no, bawolff+springle just did it. Now it's pending a schema change on Wikimedia wikis, and making the UI actually allow it
[21:47:51] jzerebecki: moving the whole thing is tricky because of all the joins we are doing on the revision table
[21:48:05] in InnoDB, rows longer than 8K are split into static and dynamic parts
[21:48:16] legoktm: :-(
[21:48:46] dynamic parts are stored in a different extent (disk and memory allocation)
[21:48:49] (some things currently use memcached to not hit mysql every time they need a revision)
[21:48:54] If MySQL can handle the physical separation and optimization (only reading the comments when actually SELECTed), I don't see much reason to move it to Cassandra.
[21:49:10] jzerebecki: yeah, we can certainly move in that direction
[21:49:34] but externalstore might be an easier target at first
[21:50:04] the extra length byte is not the thing that will bloat the revision table
[21:50:08] With cassandra there is some history, but I am not the best op to comment on that
[21:50:33] the revision table will bloat when the UI shows a larger text box and people start writing more stuff
[21:52:10] Agreed, people will definitely use more space if it's available.
[21:52:16] for vertical partitioning there must be a reason, such as many queries not needing the comment, if the relationship is 1:1
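The vertical-partitioning option under discussion amounts to something like the sketch below: edit summaries live in their own table keyed on revision id, so queries that don't need them never read them, while history-style views join them back in. Table and column names are invented for illustration only.

```sql
-- Hypothetical split of edit summaries out of the revision table.
CREATE TABLE revision_comment (
  revc_rev_id  INT UNSIGNED NOT NULL PRIMARY KEY,  -- 1:1 with revision.rev_id
  revc_comment BLOB NOT NULL
);

-- A history-page style query joins the comment back in; template reads and
-- page views would skip the join (and the table) entirely.
SELECT r.rev_id, r.rev_timestamp, c.revc_comment
FROM revision r
JOIN revision_comment c ON c.revc_rev_id = r.rev_id
WHERE r.rev_page = 1234
ORDER BY r.rev_timestamp DESC
LIMIT 50;
```

Whether this wins depends on query traffic, as discussed below: it only helps if the common queries really can avoid the comment table.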
[21:52:28] I guess my question is: is revision table size a problem from the DBMS perspective, and do we need to do things in the app layer to fix it?
[21:52:50] so, to compare the previous RFC and this one
[21:53:00] o/
[21:53:23] the previous one was a 1:N relationship, where there is a smaller amount of data read and written
[21:53:55] this is 1:1, which is only helpful if queries can avoid reading one table
[21:54:07] (instead of the whole row)
[21:54:24] I do not know if that is the case, sorry
[21:54:28] I guess page views do not need comments
[21:54:52] I think template reads are the most common revision table queries at the moment
[21:55:01] which obviously don't need comments
[21:55:03] unless you're viewing a diff or old revision, which does show the comment
[21:55:44] yeah, but that's rare in comparison
[21:55:44] yeah, I mean maybe there is a case for vertical partitioning if you look at query traffic
[21:56:50] compression for comments could be better if they're in a separate table
[21:56:51] if that is the case, it is helpful, and can be implemented internally or externally
[21:57:21] anyway, jynus, if you and springle have things that you need from MW, I would like to know that and have prioritized projects
[21:57:40] TimStarling, thank you for the feedback on that
[21:57:53] link tables are another possible project area
[21:57:59] I wanted to make you think a bit more about disk usage and the things that can be touched
[21:58:58] on the other hand, let me offer all developers free database training at some point so you can make our life easier, too
[21:59:03] we are starting to think about generalizing dependency tracking, which is related to link tables (see https://phabricator.wikimedia.org/T105766)
[22:00:21] * Krinkle would love free db training any day
[22:00:31] ok, any action items for the notes?
[22:00:38] Same. I'd also like to take you up on that sometime.
[22:01:03] TimStarling: can't think of anything concrete
[22:01:09] thanks for the discussion!
[22:01:21] I guess the action is that I'm going to implement the RfC? :)
[22:01:35] with help, legoktm
[22:01:40] heh, yes ;)
[22:01:41] #action legoktm to do that thing
[22:01:49] #endmeeting
[22:01:50] Meeting ended Wed Jul 29 22:01:50 2015 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
[22:01:50] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-07-29-20.59.html
[22:01:50] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-07-29-20.59.txt
[22:01:50] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-07-29-20.59.wiki
[22:01:51] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-07-29-20.59.log.html
[22:02:11] jynus: indeed :)
[22:02:15] thanks everyone!
[22:02:17] Hm.. content model and content format are N:1, right?
[22:02:33] e.g. schema and json are both json, but one has a stricter model
[22:04:26] goodnight
[22:04:58] Krinkle: afaik yes
[22:17:41] Krinkle, legoktm: so why isn't revision.rev_content_format also an ID to a separate table?
[22:18:47] Both rev_content_format and _model are not IDs right now; the RFC makes them both IDs
[22:18:52] https://www.mediawiki.org/wiki/Requests_for_comment/Content_model_storage
[22:19:32] Krinkle: oh right, the RFC changes content_format as well as content_model. Thx.
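As a toy illustration of the N:1 point from 22:02:17 (several models can share one serialization format), the lookup tables might hold rows like these; the values are examples, not data from the RFC:

```sql
-- One format, multiple models: plain JSON pages and stricter JSON-Schema
-- pages would both serialize to application/json.
INSERT INTO content_formats (cf_name) VALUES ('application/json');
INSERT INTO content_models  (cm_name) VALUES ('json'), ('JsonSchema');
```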
[22:19:46] If they're N:1 though, I'm not sure what the point is of storing the content format in the revision database
[22:19:52] gwicke: legoktm: ^
[22:20:01] we don't seem to store it in the page table
[22:20:10] that one seems content with just the content model
[22:20:18] (no pun intended)
[22:20:20] content format is how the text is serialized
[22:20:46] nothing uses a non-default content model right now, but when I talked with DanielK_WMDE about it, he suggested it could be used for content format migrations
[22:21:02] except you can't right now, because it's practically impossible to change the default
[22:21:23] I think we'd be fine to have just one of them
[22:22:15] parsoid html for example uses a content-type of text/html;profile=mediawiki.org/specs/html/1.0.0;charset=utf8
[22:22:33] and we use that for format versioning in RESTBase
[22:23:02] similar with json
[22:25:26] the code reading it needs to know how to convert old content types to newer ones in any case
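One way to read the "content format migrations" idea above: if each revision carries a cheap format id, old and new serializations can coexist and a maintenance job can convert them incrementally, while readers pick a converter based on the stored value. A hedged sketch with placeholder names (content_formats, cf_id, cf_name, rev_content_format_id) and an invented format string:

```sql
-- Find a batch of revisions still stored in an old serialization format so a
-- migration job can re-serialize them in the background.
SELECT r.rev_id
FROM revision r
JOIN content_formats f ON f.cf_id = r.rev_content_format_id
WHERE f.cf_name = 'text/html; profile=1.0.0'
ORDER BY r.rev_id
LIMIT 1000;
```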