[21:00:26] #startmeeting RFC meeting [21:00:26] Meeting started Wed Oct 3 21:00:26 2018 UTC and is due to finish in 60 minutes. The chair is kchapman. Information about MeetBot at http://wiki.debian.org/MeetBot. [21:00:26] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. [21:00:26] The meeting name has been set to 'rfc_meeting' [21:00:36] #topic Spec for representing multiple content objects per revision (MCR) in XML dumps https://phabricator.wikimedia.org/T199121 [21:00:52] #link https://phabricator.wikimedia.org/T199121 [21:01:11] who is here to discuss today? [21:01:19] me. [21:01:41] here's the draft rfc: https://www.mediawiki.org/wiki/Requests_for_comment/Schema_update_for_multiple_content_objects_per_revision_(MCR)_in_XML_dumps [21:01:56] o/ [21:01:58] o/ [21:02:42] \o/ [21:03:32] may I ask: who is facilitating, and what is our agenda? [21:04:27] apergos: kchapman is chairing, the agenda is up to you :) [21:04:51] you could start with a quick recap of the proposal [21:04:58] yes, so does anyone attending have any questions for apergos? apergos do you have any specific areas you would like to get feedback? [21:05:07] (yes, a quick recap would be a good start) [21:05:22] ideally, I want to get this rfc into final form: that is, any issues with the draft as it currently exists, and agreement on any fixes, right here in this meeting [21:06:06] apergos: can you give a brief rationale for why you wrote the spec as it is, over some of the alternatives that were discussed before? [21:06:08] recap: xml dumps are used by Special:Export (people export one or a few pages in XML) and [21:06:48] by dump scripts that dump the entire content; the latter get used for various purposes (research, sucked into analytics, used by editing bots etc) [21:07:19] MCR means that instead of one revision having one piece of text, it may now have several, each contained in a 'slot'; this needs to be accounted for with a new schema [21:08:02] an early draft proposal is here (dues put this together, all credit to him): https://www.mediawiki.org/wiki/Multi-Content_Revisions/Dumps [21:09:08] we need to keep backwards compatibility for the main slot (at least for awhile) and keep any new elements separate (hoping they will simply be ignored by scripts out there) [21:10:26] to confirm, the current version of the proposal is what's given at https://www.mediawiki.org/wiki/Requests_for_comment/Schema_update_for_multiple_content_objects_per_revision_(MCR)_in_XML_dumps#Proposal right? [21:10:41] correct. [21:11:24] excellent [21:11:44] one idea was to have multiple text tags inthe schema at the same level, rather than any content tags at all, but this could conceivably run into issues as described here: https://phabricator.wikimedia.org/T199121#4494050 [21:11:52] #info to confirm, the current version of the proposal is what's given at https://www.mediawiki.org/wiki/Requests_for_comment/Schema_update_for_multiple_content_objects_per_revision_(MCR)_in_XML_dumps#Proposal right? [21:11:52] for existing consumers [21:11:54] *nod* [21:12:20] i'm pretty happy with the proposal that keeps them on a separate level; it should maintain compatibility (IIRC I tested an earlier version of that proposal on the Special:Import to make sure it didn't expode) [21:12:42] I would love it if you test the final version (whatever we agree on here) too :-) [21:12:56] I also like the fact that the new content tags are structurally the same as the old, and just ocurr in a different place. [21:13:26] yes! [21:13:30] apergos: just to make things easier to discuss: the current, pre-mcr spec is version 10 of the xml schema, and we are discussing version 11, right? [21:13:41] yes, this would be version 11. [21:14:21] #info final version of the proposed schema should be tested with old special:import to confirm no problems (earlier version was tested, looks good in theory) [21:14:34] is this a sufficient recap? [21:14:49] apergos: considering the audience, I believe so :) [21:14:53] so, i have two questions, one of which we have already discussed on phab, and another one which I now realize i probably should have brought up earlier... [21:14:55] :-) [21:14:57] Yep, the entry element name not being the same seems fine and a big win for compat whilst only a minor annoyance to implement support for (given support would require a change either way). Having the internal structure remain fully the same is what sold me, which makes consuming still fairly easy by re-using the same code. [21:15:27] Oh dues is daniel [21:15:28] Hi! [21:15:32] :P [21:15:35] :D [21:15:36] new job, new nick ;) [21:15:53] or rather, old, old nick [21:16:32] so, the issue we discussed was the id attribute used on the text tag in stub dumps. it contains rev_text_id, which is going away. [21:16:44] content now as an uri style address. [21:17:02] right. [21:17:23] we agreed to add a "location" attribute that contains the URI of the text. and id tag can still be supplied for b/c as long as we only use text-table based storage internally. [21:17:58] we didn't reach agreement on this for the main slot [21:18:09] I'd propose to output both, id and location, if the location is in the text table (using the "tt:" prefix) [21:18:25] and omit the id if the location is something else (e.g. a direct pointer to external store) [21:18:46] I'm curious, what's the rationale of exposing the blob URLs at all? I don't think it's done anywhere else [21:19:16] it's an optimization for generating full dumps from stub dumps [21:19:21] Yeah, afaik only page id and rev id are publicly usable. [21:19:25] the (big huge hairy dumps) are done in two phases [21:19:31] phase one is write al the metadata [21:19:42] the internal ID is completely useless to other consumers, except perhaps as an ID. but the hash would be better for that [21:19:46] phase two is write the text content, reusing text content from previous dumps where possible [21:20:12] so basically it's to spare the revision table lookup on the second pass? [21:20:24] but retrieving from the database where needed; this means that the metadata dumps should have that information [21:20:39] apergos: right, but are revisions not considered immutable for that purpose? The meta data might change between dumps, but not the slots or text afaik. [21:21:07] Ah, I see. To save a look up. [21:21:39] we check to make sure that the content is ok (righ now it's a boring length check, but we've needed to in the past, we've had weird corruption issues, for example); if it's not, we retrieve anew [21:22:10] there may be another reason to expose it. which is the other issue i wanted to discuss. but I'd like to sort of the id-vs-location issue first. [21:22:12] Hm.. if it's not publicly usable, I suppose id="" could be re-used as is. Or would that be terrible? [21:22:17] and for revisions where we don't have the data in the previous dujmp, of course we request it. [21:22:30] Afaik rev_text_id supports that internally as well. E.g. we support numeral values and external store URIs there right now in core [21:22:31] I don't love re-using id with a tt:xxx in it [21:22:58] Krinkle: that was my original proposal, but apergos didn't like it, so we agreed on introducing a new attribute, which is cleaner [21:23:10] ah nvm, we always go via the text table. [21:23:32] yes, for now it's the text table; sometime in the future it will be ext:db:something as I understand it [21:23:37] apergos: i still think that the new location attribute should be on all text tags, though. for consistency, and to encourage any consumers to start using it instead of the id attribute. [21:23:45] The optimisation I'm referring to exists in Revision::getRevisionText, but not in rev_text_id. [21:23:50] I don't see any downside to also having it on the "main" text tag [21:24:09] I worry about possibly breaking fragile scripts [21:24:28] Krinkle: BlobStore is the service that supports that kind of thing. [21:24:42] reusing the id would be slightly more robust when dealing with old code, which might not expect the id not being there at all [21:24:51] apergos: if they are so fragile that they break on an unknown attribute, they will also break on an unknown tag, or multiple text tags. [21:24:55] but then I doubt scripts have much reason to use the id at all [21:25:17] tgr: but they would not expect the id to contain something that isn't a number, i wouldn't call that robust. [21:25:20] Yeah, I would assume id="" to be documneted as a private field containing an arbitrary string for import/export scripts to understand only. [21:25:24] well some scripts that convert xml to sql for import might use the text id as a convenience [21:25:25] yeah i'd worry about id changing formats [21:25:53] it's changing formats either way [21:26:00] Krinkle: [21:26:02] If some consumer does strictly expect it to be like an integer, then the fact that the schema version is changed should suffice in terms of "strictness". For non-strict consumers, I would assume id="" to be ignored or at least not parsed. [21:26:13] short of some horrible hack to express arbitrary blob URLs as integers [21:26:15] tgr: no, id is going away, not changing format [21:26:16] dues: Yeah, I'm proposing the new spec redefined it as arbitrary string [21:26:44] so another thing abuot the id is that everywhere else it's a number [21:26:49] or as anyUri, but I'm not sure it's worth it, it would only lock us in again in the future. [21:26:54] Krinkle: if we have consumers that expect ints there, they would break. [21:26:56] this builds an expectation; it's another reason I like having a separate attribute [21:27:29] but really, the most pressing issue I'd like to discuss is: can we make id *optional* in version 11? Or does that (or removing it) have to wait until version 12? [21:27:49] this was also raised on the phab task and flagged for dicussion here [21:27:50] dues: If a consumer exists that acknowledges the existence of id="" (narrow subset), and strictly parses it (even narrower), I think that's an area that would not matter here given it would not allow other attributes or chnages either. That's why we're bumping the schema version. So that consumers can support it cleanly. [21:28:04] The compat is only about consumers that don't do it cleanly, which may be the majority, but also doesn't care about id, there, right? [21:28:29] we have a ticket for using external store directly internally with mcr, without the text table: . We cannot do that as long as we need to suppor the id attribute [21:28:55] ah, adding to the agenda: what is the time frame for things: [21:29:08] for moving off the text table, for adding extra slots? [21:29:27] dues: is moving off text table a proposal, or approved/implemented and awaiting wmf migration only? [21:29:29] that would be helpful for dumps consumers to know [21:30:15] Krinkle: i'm not worrying about stricly validating parsers. I'm worried about code that *ready* the id into an int variable. in php, it would then always be 0, which may cause breakage/confusion. (Note that I initially also proposed doing it that way, and let myself be convinced by apergos to use the safer approach of keeping id as an int and adding a new attribute) [21:31:17] apergos: adding extra slots: next month, to support SDC, with high priority. moving things off the text table: some time next yea, not urgent, but DBAs would love it. [21:31:25] dues: The only code doing that though would be MediaWiki itself, which presumably we can afford to properly check on the schema version before doing anything. Don't we do that already? [21:31:35] dues: in theory you could move to extstore and put some kind of hash in the id [21:31:42] I don't think we can guarantee that at all, that it's only mediawiki [21:31:55] gets messy if it gets stored into an int with overflow issues, though [21:32:38] Krinkle: moving off the text table is something DBAs would love, but will not happen without further discussion/sign-off by stakeholders. [21:32:45] (e.g. if it's not a text id, use the content record id instead and prefix it with something to make it unique) [21:32:49] apergos: you mean that aside from a strictly validating consumer storing it as unused int for the current version, someone would actually do something with it besides maybe using it as a hashtable key? [21:33:13] #info ah, adding to the agenda: what is the time frame for things? for moving off the text table, for adding extra slots? [21:33:18] such use case would break regardless in the future version when they won't exist. The question is how. [21:33:23] tgr: the hash may also clash with existing text rows. we have a few billion of these, in a 32 bit hash the chance of collision is rather high [21:33:28] #info apergos: adding extra slots: next month, to support SDC, with high priority. moving things off the text table: some time next yea, not urgent, but DBAs would love it. [21:33:34] yes, I mean scripts written in perl or python or any kind of thing that use the id (expecting it to be an int) to write some sql [21:33:54] we shuld at least not break their code right away [21:33:59] yeah, it would have to be large enough to avoid that and then it probably wouldn't fit a C int for example [21:34:06] But, I withdraw my argument that we should re-purpose id="". I think the better model for any kind of breaking change is to only remove, not change. So we'd add it as a new attribute, "location" seems as good as name as any. [21:34:49] So we'd make id="" optional , and add location as optional field as arbitrary string, then in the next, remove id="". That should provide a smooth transition. [21:34:52] Krinkle: that'S exactly how ilet myself be convinced :) [21:34:59] so are we agreed on 'location', all of us? [21:35:03] +1 [21:35:08] going once... [21:35:12] #info ‎<‎Krinkle‎> I think the better model for any kind of breaking change is to only remove, not change. So we'd add it as a new attribute, "location" seems as good as name as any. [21:35:21] going twice... [21:35:33] :boom: [21:35:36] :) [21:35:40] sold to the highest bidder, thank you Krinkle [21:35:50] next is whether the id tag should be optional right away or not [21:35:53] is that next? [21:35:55] ok, now: should the "main" text tag also get the new location attribute? [21:36:04] I say yes, apergos is worried [21:36:05] ah. ok good [21:36:15] new location attribute, yes or no? [21:36:22] I assumed it would apply to both main and slots. Why would it not? [21:36:23] in main text [21:36:29] (next: should id be optional or guaranteed in version 11) [21:36:35] can code break from an unexpected attribute? seems unlikely [21:36:49] tgr: it *can*, but i also find it quite unlikely [21:36:55] that's what I don't know; it seems unlikely but ... [21:37:04] people write a lot of awful stuff in the world [21:37:07] tgr: regex based code, mostly :) [21:37:13] and we'll be giving virtually 0 lead time [21:37:14] if we consider not adding "location" to
, what is the alternative? [21:37:23] at that point it could break from the unexpected tags as well [21:37:30] that's possible too [21:37:40] so I wouldn't worry about it [21:37:43] that's a month away maybe? [21:37:44] tgr: that's kind of my point too, yes [21:37:53] we will break the *really* aweful consumers anyway [21:38:07] but a month of lead time is better than nothing [21:38:10] and all others will benefit from the consistency, and the options to start using the new attribute right away [21:38:11] #info For non-strict parsers, we can expect unknown tags like to be equally (in)tolerable as unknown attributes. [21:38:28] same issue with sha1 (for the main slot): would be nice to have it somewhere but requires adding an extra attribute [21:39:08] I guess the alternative would be a content tag for the main slot, which has somewhat different data from the other content tags? [21:39:19] apergos: I'd like to understand the concern about adding location to
, specifically what the alternative would be - maybe I'm missing something. [21:39:30] tgr: sha1 is actualy an important wrinkle that I wanted to bring up after we sorted out the id question. but we are running out of time [21:39:46] the alternative would be that we don't add it until we need it (until the text table is no longer used) [21:40:24] apergos: Ah, you mean to add in the N+1 version rather than now, but add either way? [21:40:24] ...no longer guaranteeed to be used [21:40:28] yes [21:40:35] that means we'd have to have a new spec before we can do that [21:40:52] IMO anything that breaks when new attributes are added deserves to die :) [21:41:00] we'd have to plan for version 12 very soon, then. it should then be part of the proposal. [21:41:01] presumably that would be a pretty tiny change to the spec [21:41:05] Hm.. I think splitting up versioning is useful to aid in migration, e.g. add the /new/ optional thing first, remove the old then-optional thing later. [21:41:20] But I'm not sure whether adding something new later would be beneficial. [21:41:21] +1 [21:41:25] add first, remove later [21:41:33] give people the option to adopt early [21:41:36] Yeah [21:41:44] :) [21:41:53] if we're adding now, might as well add location="" now, on that, and on
[21:41:57] I'm curious when the last XML schema version bump was, and if we have a record of the fallout, which consumers broke... [21:42:17] I never hear about that sort of thing. unfortunately. [21:42:17] #info I'm curious when the last XML schema version bump was, and if we have a record of the fallout, which consumers broke [21:42:20] excellent point [21:42:33] * Krinkle doesn't have the answer [21:42:33] #info <‎Krinkle‎>‎ Hm.. I think splitting up versioning is useful to aid in migration, e.g. add the /new/ optional thing first, remove the old then-optional thing later. [21:42:57] well I would be willing to negotiate: [21:43:16] (It might not be too hard to take a look at the top 10 awful consumers by importance and make sure they're doing something safe involving the version number.) [21:43:29] it could be optional in the spec as long as we don't actually write it until we start writing content tags (because there is a second slot, i.e. when structured data goes live) [21:44:40] what do folks think about that? [21:45:17] apergos: for later reference/discussion, there are two things I realized that need to be considered: (1) the sha1 of the revision is *not* the same as the sha1 of the content, we need a place for both, also for the main slot. (2) some slots are rarely changed and would be repeated over and over for every revision if we don't add a back-reference mechanism (based on hash or laocation). [21:45:40] the second concern had occured to me as well [21:45:52] (2) is already in issue with text in the current format, no? [21:45:56] no need to discuss it now, let's keep talking about id/location [21:46:04] I don't have a proposal for that now [21:46:08] just making sure these are not forgotten [21:46:12] right [21:46:22] and there doesn't seem much we can do about it in practice, building some sort of hash table seems super unfeasible [21:46:57] tgr: agreed, we can mull it over out of meeting time (unless someone has a brilliant idea right now) [21:47:03] apergos: whether the location attribute is needed doesn't have much to do with whether we have multiple slots. it's about the blob storage mechanism. these are pretty orthogonal, they just happen to be introduced at the same time. [21:47:21] wrt sha1, it's also worth thinking about which will cause more B/C break, putting the main slot sha1 or the revision sha1 in the current location [21:47:32] my notion is that if we are concerned about fragile scripts, let's introduce both things that might break them at the same time [21:47:36] instead of one of them right away [21:47:37] that's all [21:47:41] (I'd guess the first as people mainly use it for revert detection, but I don't actually know) [21:48:48] remind me later to find out how the revision sha1 will be computed when there's multiple slots of content [21:49:07] as soon as the dump script is updated, we'll potentially have multiple content tags right? [21:49:09] apergos: ok. for that i propose the have a flag in the dump script that tells it what version to generate. we keep outputting version 10 until we need version 11. when we switch to version 11, it will have both, the location attrib and (when needed) the context tag. [21:49:13] on testwiki if nowhere else [21:49:34] and it needs to be updated for 1.32 as third party wikis also make dumps [21:49:40] just 10 minutes left all [21:49:42] apergos: so you still have full control when that happens for our dumps, and you could even generate both. but the spec still has location on all text tags. [21:49:43] need version 11 = we have multiple slots with content? [21:49:43] so I don't see much wiggle room there [21:50:11] apergos: have multiple slots and want to include them in the dumps. we can generate dumps that just ignore the extra slots, nothing keeps us from doing that [21:50:15] for testwiki we can just keep dumping main slot only [21:50:25] I mean, it's not great, but it's ok-ish for a short time [21:50:36] when we roll out to production then it would be a requirement to dump all slots of course [21:51:10] apergos: that's actually not a hard requirement right now, according the product folks. and we'll likely have some slots somehwere before we have dump support. [21:51:14] but before wide adoption, yes [21:51:31] I have concerns about that but they can be discussed outside of this meeting [21:51:54] apergos: so, would you be ok with location-on-all-text-tags-always in version 11, and a way to tell the dump script which version to output? [21:52:05] a flag is fine, it can be removed later [21:52:43] #info <‎dues‎>‎ apergos: so, would you be ok with location-on-all-text-tags-always in version 11, and a way to tell the dump script which version to output? ‎<‎apergos‎>‎ a flag is fine, it can be removed later [21:52:52] now is that id tag optional (except that we will write it until we are using something other than the text table)? [21:53:01] for the main slot [21:53:20] again this is 'what the schema says' vs 'what we actually dump' [21:53:27] I'd love it to be optional in the spec. it will be a while until we would stop poutputting it ourselves [21:53:43] the alternative is to push a new version of the spect at that point [21:54:01] as long as we are clear about the decision of when to stop writing it, I'm ok with optional [21:54:05] no harm in making it optional [21:54:11] I'd say: optionsl in the proposal, but dump it as long as we can, i.e. until we move data out of the text table [21:54:17] +1 [21:54:29] I would still consider if we can find some way to use fake IDs [21:54:47] but that's not relevant for the spec [21:54:55] what would the purpose of having fake ids be there? [21:55:10] #info the id attriute should be optionsl in the proposal, but dump it as long as we can, i.e. until we move data out of the text table [21:55:18] not breaking consumers which expect an id attrib with a number being there [21:55:28] I see [21:55:42] I'll think about that (please comment on the phab task about this too) [21:55:52] 5 minutes, what else can we tackle? [21:55:59] tgr: only if they are guaranteed to not conflict with actual text tale ids that may show up in the same dump. that would be bad [21:56:23] apergos: i think that's all we can really tackle. the "repeated data" issue can be left for version 12, probably [21:56:41] we still do need to sort out revision sha1 vs content sha1. [21:56:47] just a couple more minutes [21:56:48] but there's not time for that now [21:56:56] let's flag both of those as needing followup, on the phab task [21:57:02] #info <‎dues‎>‎ we still do need to sort out revision sha1 vs content sha1 [21:57:12] apergos: yep. [21:57:58] 3 minutes... :-) [21:58:05] two quick notes on sha1: if there is only one slot, the revision hash is the content hash. and: all uses of the revision hash i know of are for detecting reverts and copies. [21:59:22] anything else we should put in the minutes with #info? [21:59:34] i think this was quite productive :) [22:00:04] some good progress indeed [22:00:44] thanks everyone [22:00:52] thanks for running the show [22:01:02] #endmeeting [22:01:02] Meeting ended Wed Oct 3 22:01:02 2018 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) [22:01:02] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-10-03-21.00.html [22:01:02] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-10-03-21.00.txt [22:01:02] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-10-03-21.00.wiki [22:01:02] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-10-03-21.00.log.html [22:01:18] f course 10 minutes before the end I got 24 pages at once ... [22:01:28] :P [22:01:51] ok, t's 1 am, I will not add to the phab task tonight [22:02:07] anything tht people did not add by the time I wake up this morning, I'll add then however [22:02:10] I'll put the log and minutes in there now [22:02:15] thank you! [22:02:43] see folks tomorrow [22:02:47] apergos you're welcome, have a good night