[21:00:09] #startmeeting ArchCom RFC Meeting W32: A spec for Wikitext
[21:00:11] Meeting started Wed Aug 10 21:00:09 2016 UTC and is due to finish in 60 minutes. The chair is Krinkle. Information about MeetBot at http://wiki.debian.org/MeetBot.
[21:00:11] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
[21:00:11] The meeting name has been set to 'archcom_rfc_meeting_w32__a_spec_for_wikitext'
[21:00:34] https://phabricator.wikimedia.org/E259
[21:00:48] #topic Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
[21:01:01] o/
[21:01:06] #link https://phabricator.wikimedia.org/E259
[21:01:09] Hey all
[21:01:24] o/
[21:01:42] hey
[21:01:44] * Krinkle has his first time meetbot experience
[21:01:46] thanks for chairing Krinkle!
[21:02:21] o/
[21:03:09] (Yay Krinkle :)
[21:03:15] The main topic for today will be about whether and how we'll proceed with the specification of wikitext. This follows an essay Subbu wrote on mediawiki.org and a wikitech-l thread.
[21:03:17] #link https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext
[21:03:24] #link https://lists.wikimedia.org/pipermail/wikitech-l/2016-August/086200.html
[21:04:27] so, should we have a spec? I say "yes"
[21:05:06] as i have argued in that essay, "whether we want a spec cannot be separated from the question of what the goals are of wanting a spec".
[21:05:14] and there are different specs for different needs / audiences.
[21:05:15] surely we should. It'd be quite a time-consuming task though
[21:05:23] A spec for what, in what context :)
[21:05:56] One of the problems people would like to see solved in this area is to be able to confidently interact with older content. The status quo is that things change with time, and that rendering is somewhat unpredictable for older revisions (expansion of templates and external links was mentioned).
[21:06:25] specifically, older content stored in wikitext
[21:06:29] # subbu quotes from essay "whether we want a spec cannot be separated from the question of what the goals are of wanting a spec"
[21:06:33] #info subbu quotes from essay "whether we want a spec cannot be separated from the question of what the goals are of wanting a spec"
[21:06:39] We could for instance specify "legacy" Wikitext as it exists, but that doesn't make it an easy system. It'll still be complex
[21:06:49] a spec is also helpful to make sure that the markup means the same thing to humans and computers
[21:06:58] I'd say at least a narrative spec from which a reasonably knowledgeable programmer would be able to implement a parser which renders at least 90% (arbitrary high percentage here) of wikitext properly
[21:07:08] I suspect we need that for the context of interacting with old archival content, even if what we use for future content changes
[21:07:33] SMalyshev: we pretty much have that already, in the form of a PEG grammar
[21:07:47] gwicke, well ... tokenizer, you mean.
[21:08:14] specifying "legacy" wikitext does not necessarily meet subbu's goals, e.g. "Ease implementation of tools and libraries that need to operate with wikitext directly"
[21:08:15] would the spec include stuff like "call the Lua compiler version X with parameters Y"?
[21:08:19] Also the concept of "what is wikitext" differs by installation. Are the core parser functions part of the spec? What about the others? What about syntax? Etc. Each change we make is generally breaking for people that use it.
[21:08:23] it covers the "easy" parts SMalyshev mentioned
[21:08:24] gwicke: PEG grammar alone doesn't seem enough for me... it specifies what can be said, but not what it means?
[21:08:25] gwicke, there is no guarantee those tokens will render as they are tokenized.
[21:08:26] a spec can also help us comment those areas where computers will do counterintuitive things with the markup
[21:08:37] if it would, how is that different from "call the MediaWiki parser version X with parameters Y"?
[21:08:43] #info would the spec include stuff like "call the Lua compiler version X with parameters Y"?
[21:08:59] gwicke, right. it is useful for sure.
[21:09:11] This is an interesting end use-case. How to deal with underlying dependencies. Same goes for syntax inside extension tags.
[21:09:33] I think for future facing needs it's more useful to have a clean document model
[21:09:35] tgr: I think "call the Lua compiler" is not what most people mean by spec...
[21:09:49] would extensions that add parser functions / tags have their own specs?
[21:10:01] For instance if layouts for tables etc were more structured and separate from the tokens for content
[21:10:11] SMalyshev: we do rely heavily on Lua scripts, how would you specify around that then?
[21:10:17] I think it's pretty clear that a spec that would fully address the archival use case would be extremely expensive & probably harder to read than an actual implementation
[21:10:19] brion, yup .. to me, as well, the most useful argument for a spec is for future needs, i.e. clean up the processing / document model.
[21:10:23] Cleaner interfaces between components such as templates and Lua modules
[21:10:27] SMalyshev: it's a reasonable spec-writing crutch to refer to other specs, even if those specs aren't very well specified
[21:10:34] legoktm: ideally, yes. at least, they can't really be covered by the main spec...
[21:10:53] tgr: have docs for lua scripts... but I think we need to be declarative, not procedural, here
[21:10:57] We could consider extension behaviour outside the scope of the spec. Extensions would need to either maintain indefinite compatibility, or somehow have required attributes for versioning (e.g. ) and then decide to keep older parsers or to transform it somehow internally.
[21:10:59] gwicke, +1
[21:11:36] One thing to consider with extensions is having a clean enough interface that we can actually do that :)
[21:11:42] #info I think it's pretty clear that a spec that would fully address the archival use case would be extremely expensive & probably harder to read than an actual implementation
[21:11:45] i think it might be useful to address the first question: why a spec? before going into discussions about what kind of spec maybe.
[21:11:46] gwicke: subbu : do you think the implementation should be the spec, then?
[21:11:50] Krinkle: Including inside-core 'extension' tags like ?
[21:11:54] Yeah.
[21:11:59] SMalyshev: wikinews uses a Lisp interpreter written in Lua for some of its templates
[21:12:00] For instance the data model of the text passed into a ref tag should be known
[21:12:04] have fun documenting that
[21:12:07] Essentially yes.
[21:12:14] as gwicke said, it's just not a realistic goal
[21:12:15] But the data model of a classic gallery tag is a distinct domain specific language
[21:12:35] tgr: it's not realistic to invest in if we don't believe the data is very important
[21:12:38] But I think it's more important for the spec to detail what the impact is of returned html from a ref tag with the rest of the content, more important than the syntax of the text inside
[21:12:40] robla, for legacy wikitext, i think that is the unfortunate reality.
[21:12:57] #info But the data model of a classic gallery tag is a distinct domain specific language
[21:12:57] robla: that would imply that a spec would be the only way to preserve data
[21:12:57] That's something our current system doesn't grok, so we lose the ability to refer to the contents of the ref
[21:13:43] gwicke: no it doesn't. we would still have our implementation of the spec
[21:13:52] it's far from clear that a spec is a viable way to do that at all, and even less so that it is the only viable way
[21:14:05] I don't think focusing on a spec as the way to solve the "old wikitext no longer works properly" problem is a good way to frame this discussion - we should focus on the reasons we need a spec, and that it could possibly help with that problem
[21:14:31] One radical idea would be to consider our shortcuts (extensions, templates) merely a way to provide input and create a revision, rather than being the revision itself. Essentially producing something in between that is still minimal and canonical but not end-user/localised/skinned.
[21:14:48] I propose we first address the question: why a spec, i.e. what are the goals for writing a spec.
[21:14:50] that would move expansion out of scope somewhat. (e.g. does ~~~~ need a spec?)
[21:15:13] #info I propose we first address the question: why a spec, i.e. what are the goals for writing a spec.
[21:15:39] * gwicke is curious
[21:15:41] goal for writing a spec: have a human readable description of how wikitext will be interpreted by computers
[21:15:58] Goal: to have a consistent way to interpret, edit, and display Wikitext from any era after the spec epoch (? Sample)
[21:16:16] some reasons to have a spec: allows a canonical test suite, allows transformation into other markup formats, enables multiple alternative parsers/processors (and editors)
[21:16:35] * tgr finds the MediaWiki parser more human-readable than twenty pages of ABNF
[21:16:35] brion: I suppose the example I keep trotting out is ANSI C. Very complicated, but very worthwhile
[21:16:36] a spec would need to define the semantics of each syntactical element
[21:16:37] DanielK_WMDE: we already have a test suite
[21:16:42] just an accepting grammar would be useless
[21:16:58] brion: does your proposal require a spec, or an implementation?
[21:17:00] subbu's idea of an executable (HTML5-like) spec is interesting to me
[21:17:01] gwicke: sure. so?
[21:17:09] maybe there's a parallel to be drawn between wikitext and js's don't break the web model
[21:17:29] gwicke: depends on what we're specing
[21:17:29] goal: write a future-looking spec to evolve the wikitext language and processing model .. which doesn't help with old wikitext of course.
[21:17:31] gwicke: is it canonical? in the way that we say if the software breaks a test, this MUST be because the software is wrong?
[21:17:43] Are we specing Wikitext the character sequence
[21:17:44] tgr: for C programming, do you think that compilers are easier to read than the ANSI C spec?
[21:17:48] Or the document model?
[21:17:49] i often end up fixing parser tests, not code... because the tests make assumptions, or rely on unspecified behavior
[21:17:52] DanielK_WMDE: pretty much, yes
[21:17:59] considering the amount of stored, existing content
[21:18:07] TimStarling, yes .. that is one way html authors probably addressed the problem of old html out there and html compatibility.
[21:18:12] brion: both I'd say. Character sequence is probably the easier part :)
[21:18:15] A document model spec gives us what we and others need to transform our parsed documents into other formats etc
[21:18:19] for some purposes, it would be nice to be reductionist, write grammars etc., but for archiving and ease of reimplementation it makes more sense to be complete
[21:18:27] which is probably not dissimilar to the current problem we have.
[21:18:28] and that really means specifying algorithms
[21:18:33] Even if we only have one implementation of the tokenizer/parser
[21:18:34] gwicke: the best spec is the stored, existing content, then. if you break it, you are doing it wrong...
[21:18:47] TimStarling: subbu: an executable spec, would that mean expansion is part of the model, or left to producers of content (e.g. notion of a "template" could be in the data attributes, but not required for consumers to understand)
[21:18:55] that's how tests are set up, and parsoid was tested
[21:19:08] robla: no clue about that, but there are several orders of magnitude of difference between the resources behind C and wikitext, and also the potential userbase, so I don't think it's a useful comparison
[21:19:08] another goal: interoperability between multiple implementations
[21:19:22] DanielK_WMDE: the problem is it's not a constructive spec :) It doesn't give you a way to do it right, only tells you when it's wrong
[21:19:24] robla: hell yea
[21:19:32] so, let me reframe my goal: an executable spec for "old / legacy wikitext" + clean wikitext processing model as a spec for wikitext 2.0
[21:19:41] the former lets you deal with old content.
[21:19:43] :)
[21:19:51] the latter lets you clean up wikitext and move forward. :)
[21:19:58] SMalyshev: many things in life are like that ;)
[21:19:59] I like
[21:20:04] #info just an accepting grammar would be useless
[21:20:14] subbu: is there an example of a really good executable spec?
[21:20:14] DanielK_WMDE, i agree reg. an accepting grammar.
[21:20:17] html5
[21:20:24] Krinkle: I think expansion would have to be fully part of the executable spec, including what version of Lua to use etc.
[21:20:37] Subbu do you imagine a common document model attached to both old and new grammars?
[21:20:40] well, "yes" is an accepting grammar
[21:20:46] since all text is valid wikitext
[21:20:50] Heh
[21:20:55] #info Are we specing Wikitext the character sequence Or the document model?
[21:21:03] this is what subbu means by an executable spec: https://www.w3.org/TR/html5/syntax.html
[21:21:06] * subbu is trying to find a link to the html5 tree building algo.
[21:21:12] oh, there TimStarling posted it
[21:21:17] it's a natural language description of an algorithm
[21:21:25] * robla looks at the link
[21:21:33] many have made the case that wikitext should be treated as a textual UI & not as a storage format
[21:21:52] * robla sees a lot of English in that ;-)
[21:21:53] I tend to agree, it's not a good storage format
[21:21:57] i think we need both: characters -> AST -> semantics. We could rely on the HTML spec for some parts of the AST and semantics.
[21:22:18] brion, i didn't understand your qn. reg. common document model
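The HTML5 "syntax" document linked above is the usual example of an executable spec: it describes tokenizer states and tree-building steps precisely enough that independent implementations converge on the same output. As a purely illustrative sketch of that style — the states, token names, and handling of [[...]] links below are invented for this example and are not taken from Parsoid, the PHP parser, or any actual spec:

    // Illustrative only: a toy state-machine tokenizer in the spirit of the
    // HTML5 tokenizer states linked above. States, token names, and the
    // handling of [[...]] links are invented for this sketch.
    type State = "data" | "linkTarget";

    interface Token { type: "text" | "link-start" | "link-end"; value?: string }

    function* tokenize(input: string): Generator<Token> {
      let state: State = "data";
      let buffer = "";
      for (let i = 0; i < input.length; i++) {
        const ch = input[i];
        if (state === "data") {
          if (ch === "[" && input[i + 1] === "[") {
            if (buffer) { yield { type: "text", value: buffer }; buffer = ""; }
            yield { type: "link-start" };
            i++;                      // consume the second "["
            state = "linkTarget";
          } else {
            buffer += ch;
          }
        } else {                      // state === "linkTarget"
          if (ch === "]" && input[i + 1] === "]") {
            yield { type: "text", value: buffer };
            yield { type: "link-end" };
            buffer = "";
            i++;                      // consume the second "]"
            state = "data";
          } else {
            buffer += ch;
          }
        }
      }
      // Leftover input is emitted as text here; a real spec would have to say
      // exactly what happens to an unclosed link.
      if (buffer) yield { type: "text", value: buffer };
    }

The point is only the shape of such a spec: explicit states and transitions over characters, rather than a grammar that merely accepts or rejects input.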
[21:22:22] and then the spec would throw up its hands any time an extension is involved? what is the use case for that?
[21:22:29] gwicke: yes, though our main alternative now is HTML which is at the wrong abstraction level :)
[21:22:41] #info let me reframe my goal: an executable spec for "old / legacy wikitext" + clean wikitext processing model as a spec for wikitext 2.0. the former lets you deal with old content.
[21:22:41] brion: is it?
[21:22:46] to be able to write an alternative parser that absolutely cannot handle real wikitext as it appears on Wikipedia?
[21:22:49] #info the latter lets you clean up wikitext and move forward.
[21:23:02] brion: which issues do you see in the Parsoid DOM spec?
[21:23:10] (If there's a choice in spec development between archiving a lot/all and archiving some, I'd be for the former, with some limitations).
[21:23:13] subbu: I'm thinking like, would old and new parser rules end up creating compatible in-memory object representations that could be transformed to one another
[21:23:19] robla: "Script data double escaped less-than sign state" <-- do we really want to go there?
[21:23:43] #info many have made the case that wikitext should be treated as a textual UI & not as a storage format. I tend to agree, it's not a good storage format
[21:23:48] brion, ah .. i see. well, parsoid is an example implementation that can bridge between the two.
[21:23:52] I don't think separating old/new wikitext format is workable
[21:24:07] Platonides, content-handler.
[21:24:07] gwicke: in general HTML has too much low level detail: images list several distinct URLs, you have lots of presentation markup, etc. nothing wrong with Parsoid but it makes it inefficient and easy to change in ways that will be weird
[21:24:13] lots of old-page calling new-template
[21:24:24] I'd love a model that's slightly higher level than HTML
[21:24:25] or in the other order
[21:24:28] tgr: I think there are degrees of it. Even right now wikitext on one wiki may be not reproducible on another wiki because of missing modules/templates. But we can get at least basic syntax?
[21:24:40] wikitext fragments being passed as parameters...
[21:24:48] DanielK_WMDE: it's simpler to write that than to try to reduce the existing algorithm to a formal description
[21:24:54] brion: it's trivial to simplify to XML, but then you lose the rendering
[21:25:22] Yes, you also lose the rendering if your details change like URL structure
[21:25:23] in any case, the purpose of the DOM model is to clearly define what is semantically important, and what is just one way to format it
[21:25:35] Yep
[21:25:39] DOM is good :)
[21:25:52] (If MediaWiki Content Translation became part of this spec writing, or similar, in what ways would it be best to write a spec to facilitate much translation of past resources?)
[21:26:00] otherwise, Parsoid could just have said "here, it's HTML5"
[21:26:13] & not bothered with a DOM sepc
[21:26:17] brion: We can express a higher model in HTML potentially. Parsoid does that to some extent already when an expansion stage exists (e.g. we could have instead of , the model doesn't need to be renderable in browsers as-is per se)
[21:26:17] *spec
[21:26:22] Scott_WUaS: persistent ids for content snippets that don't interfere with human readability
[21:26:38] #info in general HTML has too much low level detail: images list several distinct URLs, you have lots of presentation markup, etc.
[21:26:42] subbu: What were you hoping to accomplish in this session?
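To make the "slightly higher level than HTML" idea concrete: a document-model spec would name node types and say which of their properties are semantically significant, leaving rendering details (URLs, presentation markup) to consumers. The interfaces below are a hypothetical sketch in TypeScript, not the Parsoid DOM spec; all names are invented for illustration:

    // Hypothetical document-model node types ("slightly higher level than
    // HTML"). Invented for illustration; not the Parsoid DOM spec.
    interface TextNode { kind: "text"; value: string }

    interface LinkNode { kind: "link"; target: string; label?: DocNode[] }

    interface TransclusionNode {
      kind: "transclusion";
      target: string;                  // e.g. a template name
      params: Record<string, string>;  // raw wikitext argument values
      expandedHtml?: string;           // optional cached expansion, for rendering
    }

    interface ExtensionNode {
      kind: "extension";
      name: string;                    // e.g. "ref" or "gallery"
      body: string;                    // opaque to the core model; the extension's
                                       // own spec defines what it means
    }

    type DocNode = TextNode | LinkNode | TransclusionNode | ExtensionNode;

In a model of this kind, a consumer can refer to the contents of a ref or the parameters of a transclusion without reverse-engineering them out of rendered HTML.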
[21:26:46] brion: thanks
[21:26:55] Krinkle: yes, that's about what I'm thinking :)
[21:26:58] hm, has anyone looked at the attempt to build an ANTLR grammar for wikitext? If I remember correctly, the effort came quite far
[21:26:59] brion, but to answer your qn. i think using that is an implementation question ... but, the compatibility between old & new would be bridged via the output spec perhaps .. i am thinking aloud here.
[21:27:15] brion: Okay, so you didn't mean that the higher model should be something that isn't HTML syntax
[21:27:28] DanielK_WMDE: but failed as always
[21:27:30] DanielK_WMDE: there is not much benefit over PEG
[21:27:32] * subbu is personally not interested in the grammar as a spec direction
[21:27:33] Kringle yeah I'm agnostic to syntax really
[21:27:34] if I remember right
[21:27:36] Heh autocorrect
[21:27:38] (it could be browser renderable too, with custom elements nowadays)
[21:27:45] the first 90% is easy
[21:27:49] Platonides: sure, it failed to cover the critical last 5%, but perhaps it's a good start. or a good lesson.
[21:27:54] but the last 10% makes you mad
[21:27:58] for reference: https://www.mediawiki.org/wiki/Markup_spec/ANTLR/draft
[21:28:02] note that we don't actually have a full PEG spec
[21:28:15] #info I think there are degrees of it. Even right now wikitext on one wiki may be not reproducible on another wiki because of missing modules/templates.
[21:28:18] Debra, I was responding to robla's call which I felt was a good question to delve into about a spec ... to understand where perspectives are wrt wikitext, spec, old / new wikitext, evolving wikitext, etc.
[21:28:25] a full grammar spec is impossible anyway
[21:28:35] Debra, so far, i think it is being met well. as far as I am concerned, not sure what robla thinks. :)
[21:28:41] All right.
[21:28:44] unless the grammar is fully turing complete
[21:28:52] subbu: I think I mostly agree with you (not interested in grammar being the sole focus), but I do think syntax is important
[21:28:52] I think general discussion is fine, but sometimes people want more concrete action items from these meetings.
[21:29:04] we have PEG+stops, and I think it would be interesting to replace stops with an extension to the PEG formalism, instead of being implemented in JS
[21:29:21] stops are just a compression technique
[21:29:26] since we are probably forking PEG.js anyway
[21:29:30] you can unroll that into a larger grammar
[21:29:47] yeah, but unrolling defeats the purpose of having the grammar, the purpose is reduction
[21:29:51] #info * is personally not interested in the grammar as a spec direction
[21:29:51] very tedious, but possible
[21:30:21] #link https://www.mediawiki.org/wiki/Markup_spec/ANTLR/draft
[21:30:33] robla, i think wikitext syntax parsing is a solved problem ... the peg grammar with stops is good enough .. TimStarling also pointed me to an ebnf grammar for the php preprocessor .. i think between the two, we have wikitext syntax tokenization covered.
[21:30:56] do we have a link to the PEG thing?
[21:30:58] but, that syntax spec is not useful for actually understanding wikitext semantics or generating html from it.
[21:31:06] Full data model is more complex potentially yes!
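For readers unfamiliar with the "stops" mentioned above, the idea — described here with invented code, not Parsoid's actual tokenizer — is that a single inline rule takes the set of characters that terminate the current context as a parameter, instead of duplicating the rule for every context. A minimal sketch:

    // Invented sketch of the "stops" idea: one inline rule parameterized by
    // the characters that end the current context, rather than one copy of
    // the rule per context. Not Parsoid's actual tokenizer.
    function parseInline(
      input: string,
      pos: number,
      stops: Set<string>
    ): { text: string; next: number } {
      let text = "";
      while (pos < input.length && !stops.has(input[pos])) {
        text += input[pos];
        pos++;
      }
      return { text, next: pos };
    }

    // Inside a template argument, "|" and "}" act as stops:
    //   parseInline("foo|bar}}", 0, new Set(["|", "}"]))  -> { text: "foo", next: 3 }
    // Inside a link label, "]" acts as a stop instead:
    //   parseInline("label]]", 0, new Set(["]"]))          -> { text: "label", next: 5 }
    // "Unrolling" would mean writing a separate copy of this rule for each
    // stop set, which is what an unparameterized grammar would force.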
[21:31:08] DanielK_WMDE: you won't like it
[21:31:11] :)
[21:31:20] DanielK_WMDE: https://github.com/wikimedia/parsoid/blob/master/lib/wt2html/pegTokenizer.pegjs.txt
[21:31:35] subbu: yup, the latter is way more interesting (understanding wikitext semantics or generating html from it)
[21:31:45] #info syntax spec is not useful for actually understanding wikitext semantics or generating html from it.
[21:31:49] gwicke: thanks
[21:31:56] subbu: does the Parsoid DOM model help in terms of defining some elements?
[21:31:59] subbu: Exactly, we need to specify not the syntax, but what the tokens mean in relation to each other.
[21:32:00] #link https://github.com/wikimedia/parsoid/blob/master/lib/wt2html/pegTokenizer.pegjs.txt
[21:32:12] Or do we need to go farther with semantic info
[21:32:18] spec for tokenization in the MW preprocessor: https://www.mediawiki.org/wiki/Preprocessor_ABNF
[21:32:19] brion, https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext#What_kind_of_specs_can_we_develop.3F addresses your qn i think
[21:32:20] Eg, here's how a section is ordered
[21:32:35] #link https://www.mediawiki.org/wiki/Markup_spec
[21:32:41] I'm personally most interested in developing the DOM spec further, as well as cleaning up the messy semantics around transclusion & templating
[21:33:00] the awkward thing about ABNF is that precedence is unspecified
[21:33:08] in PEG, there is a specified precedence
[21:33:10] #info spec for tokenization in the MW preprocessor: https://www.mediawiki.org/wiki/Preprocessor_ABNF
[21:33:24] Subbu: Yes good :)
[21:33:26] robla, for semantics ... i think TimStarling is on the right track about an executable spec .. since we have html5 tree builder as a successful model for dealing with legacy formats.
[21:33:33] gwicke: wow that file breaks syntax highlighter :)
[21:33:42] but, i think if we stuck just there, that would be unfortunate.
[21:33:47] if we *were* ..
[21:33:55] SMalyshev: I think the .txt suffix throws it off
[21:33:56] (a side note for us to cover in the last few minutes: should we have a Phab task to track the state of Parsing/Notes/A_Spec_For_Wikitext and turn the latter into an RFC)
[21:34:00] The idea that extensions may need some more description is on point I think.
[21:34:32] Params for modules/templates/extensions may themselves be Wikitext, or may be just little string tokens
[21:34:47] subbu: TimStarling: So the executable spec would dictate that when encountering an extension tag (or {{template}}) it include that target in a certain way?
[21:34:53] If you want a library that greps or search replaces in text, that matters to you
[21:35:24] Krinkle, it would define that that token be preprocessed to generate expanded wikitext, for example.
[21:35:35] a reference implementation does not have to be a high-performance implementation.
[21:35:37] Similar to needing to know that some HTML tags are self-closing, etc
[21:35:50] it can be very slow, but built with the goal of being understandable and easy to grok.
[21:35:57] TimStarling: hm... many of the problematic parts are related to tag extensions, parser functions, and other transclusion mechanisms. it seems to me the preprocessor is a tool to separate these from the "wikitext proper" parts. that would perhaps make it easier to write a spec just for these parts.
[21:36:20] subbu: Is the goal for the spec to allow third-parties to do what MediaWiki does now when viewing an old revision? (e.g. current version of templates) - or do we intend to improve that behaviour as part of this?
[21:36:25] the preprocessor is relatively easy
[21:36:31] "relatively"
[21:36:44] Heh
[21:36:45] "just do it as this code does"
[21:36:46] Krinkle, that is a reasonable goal, yes.
[21:36:47] Krinkle: i think an executable spec is called a "parser".
[21:36:56] (or perhaps even move it out of the spec by, as gwicke mentioned, considering it only an input method to the model, and never a storage format)
[21:37:33] the model would still have a canonical disk representation
[21:37:49] DanielK_WMDE, maybe .. but, I think it would also extract the most important semantics out of the guts of mediawiki.
[21:38:09] a really well written, nicely readable parser could serve as a spec. probably a recursive descent parser.
[21:38:25] DanielK_WMDE: PEG is recursive descent
[21:38:28] i.e. how far can you pull the parser out and see how much spaghetti links back into the mediawiki guts you can cut out without breaking the essential interpretation of content.
[21:38:44] so, for example, red links, etc. may not be essential in the executable spec.
[21:39:01] they are all best viewed as post-parser transformations even for old revisions.
[21:39:07] and need not be part of the spec.
[21:39:15] gwicke: so is the PEG code sufficiently readable that others could reasonably use it to build their own grammar or parser?
[21:39:27] the thing that's nice about a natural language version of a spec (as opposed to an executable one) is that it's possible to have an "incomplete" spec that's still useful
[21:39:35] DanielK_WMDE: yes, it even compiles to a tokenizer out of the box
[21:39:55] it's still not easy, but there is only so much complexity you can pretend to not be there
[21:40:24] focusing on making the spec executable seems like a "nice to have", not a hard and fast requirement. It seems like overengineering
[21:40:47] gwicke: can we package the complexity into nice bundles, that can be covered or ignored? a modular spec?
[21:40:58] In other words, if a third party has an xml dump of all page titles and revisions, can they use this to figure out how to render them? Or do we involve other stuff that wouldn't be in there. References to other pages are doable I guess, but references to other stuff get more complicated ({{int:}}, {{gender:}}, including special pages, extension tags)
[21:41:08] (to an extent, all specs are modular, since they all build on top of pre-established conventions)
[21:41:17] like wikitext-without-italic-and-bold?
[21:41:19] Mediawiki standard library ;)
[21:41:22] the PEG grammar would be more readable if the JS event code was separated from the PEG recognizer
[21:41:25] DanielK_WMDE: yes, exactly (building on other specs)
[21:41:31] #info so, for example, red links, etc. may not be essential in the executable spec.
[21:41:33] many PEG libraries actually do that
[21:41:57] #info the thing that's nice about a natural language version of a spec (as opposed to an executable one) is that it's possible to have an "incomplete" spec that's still useful
[21:42:17] I think the executable spec would be written in natural language. The HTML5 executable spec is that way.
[21:42:17] an incomplete spec isn't useful for the archival problem
[21:42:30] Krinkle: I like the idea of a layered spec. Things like extensions and parser functions are an additional layer needed for some uses but not all
[21:42:33] TimStarling: we should do that
[21:43:02] gwicke: Well, a parser/etc. for the log wikitext sub-type (bold/italics/links and nothing else) is probably needed.
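One way to picture the layered / modular spec idea raised above is as a set of small interfaces, where each layer can be specified and implemented independently and a consumer takes only what it needs. The shapes below are hypothetical, sketched in TypeScript for illustration; nothing here is an agreed design:

    // Hypothetical layer interfaces for a modular / layered spec. Names and
    // shapes are invented for illustration only.
    type DocNode = unknown;  // stand-in for whatever document-model nodes the spec defines

    interface CoreMarkupLayer {
      // bold/italics/links/headings only; no templates, no extension tags
      parse(wikitext: string): DocNode[];
    }

    interface TransclusionLayer {
      // resolves {{...}} against some source of template text; the source is
      // abstracted so an offline dump could supply it
      expand(doc: DocNode[], getTemplate: (title: string) => string | null): DocNode[];
    }

    interface ExtensionLayer {
      // each extension tag gets its own spec; unknown tags can be passed
      // through as opaque nodes
      render(doc: DocNode[], handlers: Record<string, (body: string) => string>): DocNode[];
    }

A consumer that only needs the restricted "log wikitext" sub-type mentioned above would implement just the first layer; an archive reader might add the second, with the template source backed by a dump.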
[21:43:17] gwicke: I think usefulness is a matter of degrees, not binary. A really complete spec is most useful, but an incomplete spec can be useful.
[21:43:19] also, no links / images, I guess
[21:43:21] Eg if mediawiki burns in a fire and we have all this data, what do we need to separate out the various levels of data in its place
[21:43:27] as those depend on the target of the link
[21:43:39] And how can we rearrange and refactor internally to represent those layers more maintainably
[21:43:43] * subbu is trying to imagine software bits burning in a fire ...
[21:43:46] Hehe
[21:43:46] If I understand correctly, Parsoid currently considers extension tags as instructions to make a request for generated content (aside from the few ones it implements natively). So it would depend on the availability of an HTTP service.
[21:44:06] this is extendable, but probably not desirable for the archival use case.
[21:44:21] hey, you can make each extension its own spec
[21:44:37] Krinkle, wikitext-native extensions need to have a parsoid-native equivalent.
[21:44:49] i mean .. extensions that process wikitext.
[21:44:57] ex: ref, gallery
[21:45:18] And then there's extensions which don't process wikitext but do depend on it, like
[21:45:43] the old cobol folks would have smugly argued that their language was so close to human readable that they wouldn't need to bother writing separate prose
[21:46:10] gwicke: :-)
[21:46:11] perhaps an improved PEG grammar as proposed by Tim with lots of semi-formal comments would be a decent compromise
[21:46:15] Hehe
[21:46:24] it has the advantage that we already have half of it.
[21:46:48] it would not cover all layers. the missing layers should have well-defined interfaces.
[21:46:51] DanielK_WMDE, it only tokenizes right now.
[21:47:05] well, that's at least the first layer
[21:47:06] #info perhaps an improved PEG grammar as proposed by Tim with lots of semi-formal comments would be a decent compromise
[21:47:17] my position on project planning is that a reductionist spec, which does not necessarily precisely reflect legacy wikitext, would be more useful than a complete spec
[21:47:43] for purposes of archiving, I think HTML+CSS+images, like kiwix, is good enough for most things
[21:47:49] it's more likely to actually happen
[21:47:58] but for archival purposes, it doesn't seem to have much value
[21:48:05] #info for purposes of archiving, I think HTML+CSS+images, like kiwix, is good enough for most things
[21:48:09] although storing wikitext is still essential, to keep a record of user intentions with each edit
[21:48:27] maybe we should be storing parsoid HTML before and after each VE edit for the same reason
[21:48:29] For historical review and annotation/blame
[21:48:40] #info although storing wikitext is still essential, to keep a record of user intentions with each edit
[21:49:05] we already store Parsoid HTML for each edit
[21:49:10] but only going forward, not for the past
[21:49:19] gwicke: can we publish dumps of that?
[21:49:25] TimStarling, given that parsoid converts that back to equivalent wikitext, why would you need to store parsoid html before/after each ve edit?
[21:49:40] DanielK_WMDE: yes, subject to some attention from ops
[21:49:47] Well there's the whole template update issue
[21:49:48] subbu: because parsoid changes
[21:49:53] That too
[21:50:05] brion: yes, indeed
[21:50:07] one consideration with storing expanded canonical form is oversight/deletion.
[21:50:19] brion, but, that is a generic wikitext problem, it is not restricted to ve edits.
[21:50:25] yeah, because parsoid changes, but I'm not going to try to sell you this idea right now
[21:50:26] Yup
[21:50:49] any spec needs versioning & format upgrades
[21:51:08] so....this seems worthy of being an RFC going forward, no?
[21:51:11] #info we already store Parsoid HTML for each edit gwicke: can we publish dumps of that? DanielK_WMDE: yes, subject to some attention from ops
[21:51:22] maybe it should be noted what an abysmally bad job we are doing with historical preservation of edits, for reasons unrelated to a spec
[21:51:36] #info the idea that storing (and possibly publishing) parsoid HTML for each revision for archival seems to have some support
[21:51:41] for example, most of the first 12 months of the history of the project are still missing
[21:51:46] DanielK_WMDE: https://phabricator.wikimedia.org/T133547
[21:52:01] an old dump at https://dumps.wikimedia.org/htmldumps/dumps/
[21:52:37] TimStarling: I wouldn't call that abysmally bad, but I agree it'd be really fantastic to make the first 12 months more easily accessible
[21:52:41] 4 of those 12 months only exist in a landfill probably somewhere in Orange County
[21:53:19] faithfully rendering early history would require changes similar to what the Memento project did
[21:53:28] the other 8 just nobody could be bothered importing
[21:53:28] now that we covered one part of the picture ... anyone have thoughts on moving to a future wikitext spec with an improved processing model? :)
[21:53:31] * robla is now depressed after Tim's landfill remark :-( (because he imagines Tim's not wrong)
[21:53:41] and even that would only go half way at best
[21:53:53] subbu: :)
[21:54:02] but, fortunately early content is also a lot simpler, so in practice not much actual content should be lost
[21:54:18] I love the idea of specing things like extension input models better
[21:54:22] #info now that we covered one part of the picture ... anyone have thoughts on moving to a future wikitext spec with an improved processing model? :)
[21:54:26] maybe we could start importing that early data
[21:54:46] then we could more optimistically go ahead with the future wikitext
[21:54:47] subbu: I wouldn't call it a wikitext spec
[21:54:51] subbu, it'd be great for you to file a stub Phab task for us to track the state of the wiki page, could you do that?
[21:55:08] "wiki content processing spec" or "wiki content model spec"?
[21:55:16] gwicke, sure .. spec
[21:55:25] * DanielK_WMDE whispers "hooks" @brion
[21:55:31] subbu: yeah, content model or document model perhaps is the angle
[21:55:41] the page component / composition stuff is aiming in that direction as well
[21:55:52] robla, i didn't follow reg. "track state of the wiki page" part .. can you say more?
[21:55:52] brion: If we go with the route of promoting (expanded, but annotated) html as archival format, that would remove dependencies like extensions. We'd store wikitext for review only (as direct input, no intent to re-parse).
[21:55:52] we can morph https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext into an RFC
[21:55:52] * brion is a pirate and has hooks for hands. Arrrrr!
[21:55:54] ah, morphing that into an rfc .. task for that?
[21:56:16] if yes, sure. i can.
[21:56:19] Krinkle: mmmmm, depends on the extension. If it needs js, your life is harder
[21:56:22] subbu, yup
[21:56:44] #action subbu file a Phab task to track https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext for possible conversion to RFC
[21:56:58] the fun part is that anything new will have to consider existing content
[21:57:01] brion: enhancement (same for CSS). Presumably the core content wouldn't depend on JS. Interaction and styling are up to the consumer to decide on.
[21:57:11] (Cheers:)
[21:57:12] *nod*
[21:57:15] If not, that's a bug in the extension :)
[21:57:17] "fun"
[21:57:18] :)
[21:57:20] (and we have some)
[21:57:33] Any last thoughts?
[21:57:56] specs are hard
[21:57:58] #info subbu to create RFC
[21:58:05] Specs and models for everyooooooone
[21:58:24] #info specs are hard
[21:58:29] :-)
[21:58:31] True :)
[21:58:36] #info Specs and models for everyooooooone
[21:58:38] #endmeeting
[21:58:39] Meeting ended Wed Aug 10 21:58:38 2016 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
[21:58:39] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2016/wikimedia-office.2016-08-10-21.00.html
[21:58:39] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2016/wikimedia-office.2016-08-10-21.00.txt
[21:58:39] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2016/wikimedia-office.2016-08-10-21.00.wiki
[21:58:39] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2016/wikimedia-office.2016-08-10-21.00.log.html
[21:58:40] Haha
[21:58:45] Thanks everybody!
[21:58:58] thanks y'all! thanks Krinkle for chairing!
[21:59:02] :)
[21:59:05] thanks .. great to hear all these varied perspectives.
[22:01:29] I feel i'm in 2010 again
[22:01:45] * gwicke feels more like 2011
[22:02:25] hmmm, maybe
[22:02:37] I was giving an approximate date
[22:03:35] * subbu doesn't understand the year references
[22:03:55] I'm guessing this ground has been partially covered before
[22:04:32] ah, i see .. could be .. i entered this universe may 2012.
[22:04:54] * bd808 is a mid-2013 youngster
[22:05:26] my how you've aged
[22:05:46] bd808, arlolra is calling you old.
[22:05:48] :)
[22:05:53] * subbu runs away
[22:06:33] the UI for your client needs a DAG
[22:06:41] * bd808 waves his cane at arlolra and shoos the kids off of his lawn
[22:07:06] :)
[22:09:41] wikitext-l was created in November 2007
[22:09:53] split from a long wikitech thread
[22:10:09] about creating a wikitext grammer
[22:10:14] *grammar