[20:59:37] Phab:E203 starting in just a sec, after I get fully situated
[21:01:22] #startmeeting ArchCom RFC Meeting: T89331 Replace Tidy in MW parser with HTML 5 parse/reserialize
[21:01:22] Meeting started Wed Jun 8 21:01:58 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot.
[21:01:23] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
[21:01:23] The meeting name has been set to 'archcom_rfc_meeting__t89331_replace_tidy_in_mw_parser_with_html_5_parse_reserialize'
[21:01:23] T89331: Replace Tidy in MW parser with HTML 5 parse/reserialize - https://phabricator.wikimedia.org/T89331
[21:01:36] #topic Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
[21:01:58] hi everyone!
[21:01:59] o/
[21:02:41] #link https://phabricator.wikimedia.org/E203 <-Phab event info, with links
[21:03:05] hey there
[21:03:08] I wrote a bit of an update in the task description
[21:03:49] TimStarling: you had wanted to have this discussion be about the migration, and not so much about the design choices, right?
[21:04:15] yes
[21:04:48] I think the design was decided last year, and we've gone ahead and implemented it, but there's still a fair bit of work to do to get it out the door
[21:05:12] could you perhaps give a short summary of how this fits into your larger-picture roadmap?
[21:05:58] whose roadmap specifically?
[21:06:05] in the foreseeable future, we plan on maintaining both the MW parser and Parsoid
[21:06:08] the parsing team roadmap
[21:06:39] so it makes sense to do some work to bring them closer together in terms of syntax
[21:07:38] there are two questions Tim brought up at the end of the description of T89331. Question 1 ...
[21:07:38] T89331: Replace Tidy in MW parser with HTML 5 parse/reserialize - https://phabricator.wikimedia.org/T89331
[21:07:41] "Are we close enough now in visual diff testing to call that part of the project done? (96.79% showed less than 1% differences, 93.35% rendered with pixel-perfect accuracy.)"
[21:08:10] I think there is benefit in and of itself to replace Tidy ... it has been requested for a long time. Plus, what TimStarling said about this bringing Parsoid and PHP parser output closer together.
[21:08:14] are you planning to add more DOM functionality in Html5Depurate?
[21:08:56] cscott has some ideas about maybe using it for #balance
[21:09:04] in which case I suppose it would need more DOM functionality
[21:09:09] but, why is that relevant to this discussion?
[21:09:35] I'm mainly trying to understand where you are planning to go with Html5Depurate
[21:10:06] ok ..
[21:10:26] (one option for implementing #balance is to do the actual balancing in tidy or post-tidy, when we have clean serialized html5 with all tags matched, etc.)
[21:10:34] depending on your answers, it will be more or less required for third party users
[21:10:58] currently we do not actually build a DOM in depurate, the parser gives us an event stream which we serialize
[21:11:34] (one of *my* goals for a tidy replacement is obtaining better semantics for what the "tidy phase" does, exactly. tidy's actual transformations are not written down anywhere, and WMF wikis depend on their exact behavior.)
[21:11:45] #link https://www.mediawiki.org/wiki/Html5Depurate
[21:12:10] gwicke, we want a tidy-replacement solution for 3rd party users and that replacement solution right now is html5depurate. but, other options are not ruled out for someone wanting to provide them.
[21:12:18] https://github.com/wikimedia/html5depurate/blob/master/src/main/java/org/wikimedia/html5depurate/CompatibilitySerializer.java is a first step to teasing out exactly what tidy is doing, although I'd like to eventually document this properly on-wiki in english, not in code.
[21:12:42] whatever that solution is has to solve the problems that we are solving right now, and for deployment on the wmf cluster, we need to solve the migration / how-to-roll-out problem.
[21:13:38] have we looked at https://github.com/Masterminds/html5-php for a pure-php solution?
[21:13:39] if we can separate the "tidy compatibility" part from the html5 parsing/serialization part, we can (a) more easily use different html5 parsing/serialization solutions, depending on performance, ops considerations, etc, and (b) gradually deprecate the strangest corners of the "tidy compatibility" part.
[21:13:49] so, are you planning to basically develop html5depurate to eventually be equivalent to Parsoid's DOM passes?
[21:14:02] brion: yes, I think I had some notes about it on the task
[21:14:09] nice, i'll read up
[21:14:34] yeah, show older comments and then search for it
[21:15:00] I'm actually working on my own HTML 5 parser in PHP now
[21:15:06] afaik, html5depurate is intended to be html5 parse+serialize only, with "as little as possible" of mediawiki-specific compat stuff. this means that some of what tidy is doing we end up moving into php.
[21:15:17] gwicke, we haven't gotten that far yet since the core parser doesn't provide the functionality needed to implement Parsoid's DOM passes on top of that.
[21:15:25] https://github.com/tstarling/remex-html
[21:15:46] so, you are planning to have a DOM in PHP, too?
[21:15:51] so, short answer .. we won't be implementing those dom passes in depurate.
[21:15:58] for example, https://gerrit.wikimedia.org/r/286928 moves some tidy-specific self-closing tag functionality into the sanitizer, so it doesn't have to be part of tidy (any of the possible tidy implementations)
[21:16:10] #link https://github.com/tstarling/remex-html parser that Tim is working on
[21:16:28] gwicke, still trying to figure how that discussion is relevant to tidy replacement.
[21:16:31] ah looks like it (html5-php) may not handle all error corrections correctly per spec, which is worrying :)
[21:16:45] subbu: still trying to figure out where you are headed with this
[21:16:52] and i think tim's doBlockLevels work is also intended to move more of the fixups done by tidy (sometimes incorrectly) into core. right?
[21:16:57] we are replacing tidy with html5depurate .. that is where we are headed right now.
[21:17:14] #info first 15-20 minutes of the meeting have been about different Tidy alternatives
[21:17:40] #info we are replacing tidy with html5depurate .. that is where we are headed right now.
[21:17:58] doBlockLevels is actually generating invalid HTML from valid input
[21:18:07] creating an awful mess for tidy to clean up
[21:18:08] brion: just for completeness, https://gerrit.wikimedia.org/r/#/c/279669/7/includes/tidy/Balancer.php is another html5 "parser" (actually just the treebuilder phase) written in PHP.
[21:18:08] "pp
[21:18:13] I would like it to not do that in the first place
[21:18:23] cscott: tx
[21:19:08] right, so the general idea is to gradually move to the place where "tidy" is just a standards-compliant html5 parse-and-reserialize, and all the other fixups are done in core (or avoided entirely, like with the doBlockLevels fixes)
[21:19:23] this sounds a lot like parsoid
[21:19:24] we're not going to get there in one leap, the initial html5depurate will have some tidy compatibility hacks still.
[21:20:05] gwicke: one goal is easier compatibility between php and parsoid, yes. but parsoid doesn't have a separate 'tidy' phase really.
[21:20:16] I wonder if we should make Sanitizer into a proper balancing HTML parser
[21:20:50] it does beg the question if you are planning to gradually convert the PHP parser into a Parsoid port
[21:20:54] it would simplify the main pass if the input were valid
[21:21:13] well, parsoid is just differently structured. parsoid does token stream -> manipulations -> tree builder phase -> final manipulations. php does wikitext -> partially parsed html -> sanitizer -> doBlockLevels -> tidy -> languageconverter. or something like that.
[21:21:26] gwicke: we've already talked about that twice, I don't really want to get into it again
[21:22:03] should we talk about visual diff test pass requirements?
[21:22:16] i think this discussion is side-tracking into all the other things we can do with depurate / what the parsing team is doing / might be doing .. and is not about whether html5depurate is an acceptable tidy replacement and what blocks its deployment.
[21:22:26] TimStarling: it would be interesting if you could clarify what your answer is
[21:22:28] TimStarling: the input to removeHTMLtags is not fully-parsed HTML, so we can't actually do a proper HTML5 parse at that point. I tried that once.
[21:22:31] robla: I think there are no comments on that
[21:22:45] TimStarling: OTOH arlo has been gradually fixing the attribute regexps in the sanitizer (for example) to be html5-spec-compliant.
[21:23:02] but https://phabricator.wikimedia.org/T134469#2281710 are my thoughts.
[21:23:39] i'll prepare a wiki page addressing questions about parsing team roadmap / 2 parsers / etc .. i don't think that discussion is relevant to this rfc.
[21:23:51] we need to start getting the community involved in text migration
[21:24:10] subbu: it matters wrt third party support, which is related to the migration
[21:24:17] then they will have opinions on diff targets
[21:24:57] gwicke, as it stands today .. TimStarling has written an abstract interface into which a tidy replacement can be dropped, and html5depurate is one viable option.
[21:25:01] TimStarling: I guess that's what I meant by requirements. I'm assuming you're trying to figure out how to get editors engaged on fixing the few bugs that are past the point of diminishing returns, right?
[21:25:51] if a pure php alternative is available that can be used .. if parsoid is available, then parsoid can be used ... but if parsoid is used for read views there, tidy is irrelevant ... so many possibilities.
[21:26:12] but, for mediawiki installs that don't want parsoid, parsoid is not a solution for replacing tidy.
[21:26:13] robla: yes, although diffs are not necessarily bugs
[21:26:51] we don't really want to have exactly the same behaviour as tidy because we don't actually like tidy's behaviour
[21:27:13] so it comes down to rolling out changes gradually, providing tools for editors
[21:27:20] robla, https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy has a classification of diffs we currently see and possible strategies for addressing them.
[21:27:32] perhaps James_F would have a sense of how the project would go
[21:27:51] do you envision a heterogeneous deployment for this? (i.e. one specific wiki first, then rolling out more widely)?
[21:27:52] again, for completeness: https://gerrit.wikimedia.org/r/282733 is 90% of a pure-PHP implementation of html5depurate. It's not as fast as html5depurate, so WMF probably wouldn't use that in production, but it could ease the political issue for non-WMF users.
[21:28:16] if html5depurate is the equivalent of a HTML5 round-trip, then there are many alternative implementations
[21:28:39] I'm mostly worried about other features that require a DOM
[21:28:50] it's an HTML parse with a custom serializer
[21:29:05] *not* a compliant serializer
[21:29:23] gwicke: sure. but only a pure PHP implementation really helps with the political problem.
[21:29:29] right now, html5depurate implements some additional simple tidy-like transforms to ease the migration from tidy .. otherwise the diffs would be numerous. the plan is to remove those passes one at a time using visual diffing and the process we come up with now ... that will migrate syntax gradually.
[21:29:45] TimStarling: "not a compliant serializer" meaning the tidy-compatibility hacks? or the XHTML-compatibility hacks? or something else?
[21:29:59] the tidy compatibility hacks
[21:30:03] ok. sure.
[21:30:13] the XHTML hacks will give you roughly the same thing if you reparse it
[21:30:25] cscott: there are two schools of thought there- one is that really the most important thing for third party users is a) resource needs, and b) setup / maintenance complexity
[21:30:32] the other is that it has to be PHP
[21:31:22] +1 to subbu's comment about gradually migrating syntax.
[21:31:25] those are of course two separate things :) but related, in that extra daemons/languages/binaries add to the setup cost
[21:31:32] gwicke: the argument being that PHP still provides the best bang for the buck in terms of (b)
[21:31:34] how html5depurate fits into either remains to be seen; for the first case, it depends a lot on its resource needs
[21:31:49] the "90% implementation" i mention above doesn't have tim's tidy compatibility hacks yet, that's the part that's still missing there.
[21:32:32] we can solve the setup issue with containers & automation (see mediawiki-containers)
[21:32:38] gwicke, this is rehashing the first RFC discussion.
[21:33:02] what is the expected memory usage of html5depurate?
[21:33:05] what are you getting at exactly?
[21:33:29] are you saying we shouldn't deploy html5depurate on the wmf cluster?
[21:33:44] i didn't mean to distract us with my mention of the pure php option. in fact, what i meant was the opposite -- we should feel free to ignore some of the political issues wrt services because we might be able to do an end-run around them.
[21:33:54] memory isn't an issue on the cluster, but it is a critical resource for small VMs
[21:34:10] so hopefully we can concentrate on wmf needs, and then have the "what's best for other deployments" as a separate discussion
[21:34:11] ok, I think this meeting is about deployment to the wmf cluster, no?
[21:34:56] so...I'd like to reask my earlier question: which production wiki would you plan to deploy to first?
[21:35:13] i'm not quite clear on where the service is going to run. it's not going to be a lonely tomcat instance somewhere, will it?
[21:35:44] we will have to do something about third party use sooner rather than later
[21:35:46] it can run with one instance per MW host
[21:35:51] listening on localhost
[21:35:57] can i read the plan for the service setup somewhere?
[21:35:58] gwicke: not in this meeting, though
[21:36:30] certainly not *doing*, but I think we'll need a plan soon
[21:36:33] the plan is to basically install the existing debian packaging, and use it with the bundled configuration
[21:37:19] silly question: *are* we using tomcat? is it decent these days?
[21:37:34] #info we will have to do something about third party use sooner rather than later this is a topic for a different meeting
[21:37:41] (another silly question: can we use unix sockets instead of tcp-over-loopback?)
[21:37:51] no, it's grizzly, it does its own daemonization and has an embedded grizzly webserver
[21:37:53] TimStarling, can you also address robla's questions about deployment?
[21:38:14] I started off in tomcat but switched to grizzly to make it easier to manage
[21:38:17] what servlet engine are we using for other stuff?
[21:38:19] if you are committing to addressing this issue as a team in a way that doesn't result in a significant change in overall MW system resource requirements, then this works for me
[21:38:37] there is still a servlet class in the source tree but I'm not sure if it still works
[21:38:42] #info i have (the start of a) slower pure-PHP implementation which might help with third party use. (note for that different meeting)
[21:39:01] DanielK_WMDE__: we don't have other servlets AFAIK
[21:39:07] most java services do the same, they run an embedded webserver
[21:39:15] that's how cirrus and gerrit work
[21:40:08] TimStarling: ah ok, i only saw that you talked about tomcat on the ticket, and I got worried ;)
[21:40:33] subbu: I think the answer to your question is "no" :-)
[21:40:45] so...I'd like to reask my earlier question: which production wiki would you plan to deploy to first?
[21:40:54] TimStarling: also "is tomcat decent these days" - "no, it's grizzly" actually makes sense in its own way ;)
[21:41:11] first it should be deployed to all wikis as a gadget or other opt-in tool
[21:41:36] when it is fully deployed, I'm not sure, maybe it would just follow the release train
[21:42:08] the weekly train is not a bad deployment sequence
[21:42:09] #info first it should be deployed to all wikis as a gadget or other opt-in tool
[21:42:09] TimStarling: how would the opt in work? split or skip the parser cache based on user preferences, or a cookie or something?
[21:42:14] why not deploy to meta first?
[21:42:29] performance would likely be poor when used as a beta feature, because of parser cache misses
[21:42:30] I think what we don't have full clarity on is at what point we pull the switch on this being the default .. since there will be some rendering diffs that require fixing syntax on pages.
[21:42:53] and when do we just deploy it and have some pages look broken and rely on editors fixing up the wikitext / templates.
[21:43:05] maybe fetch a DocumentFragment via an API, switch it out when you click a button
[21:43:15] there are communities that are very open to trying new things, like Catalan
[21:43:17] i feel per-user opt-in makes this pretty tricky, and will be confusing to users. they can't easily show issues to each other, either...
[21:43:29] maybe have a position:fixed button hovering over the page so you can flip back and forth
[21:43:37] our visual diffing results and documentation on https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy gives us a good basis to make claims about what kind of things might possibly break.
[21:43:54] mw.org could also be a good testing ground
[21:43:55] TimStarling: ah, you want this on demand, with user interaction. that would be ok I guess.
[21:44:08] gwicke, yes .. mw.org and catalan wikis are a good idea.
[21:44:12] otherwise, i'd have suggested to do per-namespace experiments
[21:44:25] and cscott mentioned meta as well.
[21:44:59] we could have a special page as well, to make it linkable
[21:45:22] /wiki/Special:NewParser/Barack_Obama or something
[21:45:23] commons ought to be safe-ish as well, since in theory most of the content there should be reasonably well-formed.
[21:45:29] * robla tries to remember who at WMF is most appropriate to loop into a rollout discussion
[21:45:46] liaisons and james might have other ideas for communities to target
[21:45:48] #info Tim intends to make the new html available to users as an experiment, by clicking a button to replace the old html with the new html, on request.
[21:46:01] nice
[21:46:34] #info we could have a special page as well, to make it linkable; /wiki/Special:NewParser/Barack_Obama or something
[21:47:14] thanks DanielK_WMDE__ and thanks TimStarling . That clarifies the strategy a lot
[21:47:53] TimStarling, robla what about deployment to individual wikis at some point .. like mw.org, meta .. before doing a full rollout?
[21:47:54] #info Tim suggests to run one instance of html5depurate on each app server, so mw can access it on localhost; html5depurate runs an integrated webserver (grizzly).
[21:48:58] subbu: yeah, I think that'll be a must. I think we'll need to get someone from liaisons to weigh on rollout ordering
[21:49:09] k
[21:49:39] who is going to decide on where service instances live, how many we need, how they get managed, etc?
[21:49:45] mw.org has a good dogfooding factor
[21:49:46] that would be the next thing that needs planning, right?
[21:50:05] operational ownership as well
[21:50:06] hmmm dog food...
[21:50:22] om nom nom
[21:50:38] yes, we will need to have a conversation with ops about it
[21:51:02] are there already metrics & logging?
[21:51:03] I imagine the instances would be fully rolled out before the gadget deployment
[21:51:25] logging yes, metrics no
[21:51:46] the package installs a local log file
[21:51:51] w
[21:52:04] TimStarling: one per app server makes communication and setup easy, but may be overkill. do you think we need that many instances?
[21:52:52] so, the order is 1) rollout html5depurate instances 2) rollout special page+gadget 3) rollout to first wikis 4) plan full rollout based on step 3 (?)
[21:53:40] hm, most parsing doesn't happen on web accessible hosts, but on the jobqueue runners, right?
[21:53:46] DanielK_WMDE__: I don't mind, I think that plan came out of a previous IRC discussion but it can go wherever ops wants it to go
[21:53:48] step 0: talk with ops, puppetize, etc. i suspect.
[21:54:14] prerequisite for step 1: discussion with ops. prerequisite for step 3: discussion with liaisons, correct?
[21:54:15] TimStarling: yea, i'm just saying, we'll need a plan for that :)
[21:54:28] robla, sounds about right.
[21:54:34] 3) is a wmf-config change, should be easy to quickly rollback if there are any problems.
[21:54:53] #info so, the order is 1) rollout html5depurate instances 2) rollout special page+gadget 3) rollout to first wikis 4) plan full rollout based on step 3 (?)
[21:55:03] subbu: are you planning to help with third-party support work?
[21:55:18] #info prerequisite for step 1: discussion with ops. prerequisite for step 3: discussion with liaisons, correct? robla, sounds about right.
[21:55:38] robla: (2) should also be discussed with comcom/liaisons, i think
[21:55:52] 2a) is publicize the gadget well in tech news so that folks can help fix up bad markup, know what to look for, etc.
[21:55:58] we're coming up on the top of the hour
[21:56:32] gwicke, we can have additional conversation about that .. but initially, a simple solution is: tidy replacement is not recommended for 3rd parties till we have some experience with this .. that is just my thought .. cscott and TimStarling might have other ideas.
[21:56:45] #info robla: (2) should also be discussed with comcom/liaisons, i think
[21:57:22] yeah, not many benefits for third parties at this point
[21:57:36] vagrant is another question
[21:57:44] there's a bit of a messaging question there: do we pitch this as "just a faster tidy, don't worry about it if you don't have a huge wiki"?
[21:57:51] TimStarling: anything else you want to make sure we cover in this hour, or should I plan on hitting #endmeeting in a minute?
[21:57:51] good discussion!
[21:58:03] if we pitch it as "cleaner markup" or something like that, then folks may wish to run it immediately
[21:58:08] ya .. adventurous wikis might decide to go with it, but it is at their own risk (like broken rendering, need to fix wikitext, templates, etc.).
[21:58:15] there's no time for new topics
[21:58:26] we'll have it packaged, so there's nothing *stopping* 3rd parties from running it
[21:58:35] config option
[21:58:38] discussion can continue on #mediawiki-parosid?
[21:58:38] it's just we won't push it as a "you should run this". not yet at least.
[21:58:47] at some point in the future we'll probably deprecate the WMF fork of tidy.
[21:58:48] robla, works for me.
[21:58:53] parsoid even? :-)
[21:58:54] #mediawiki-parsoid perhaps
[21:58:56] (since WMF doesn't even run stock tidy)
[21:59:10] * cscott has to turn into a pumpkin
[21:59:16] * DanielK_WMDE__ is getting parasoid
[21:59:17] to clarify, I strongly support the general direction of moving to HTML5 parsing over tidy; my concerns are mostly about the details of the implementation
[21:59:23] i'm just saying let's be careful about messaging.
[21:59:24] gwicke, thanks.
[21:59:26] alright meeting turning into a pumpkin in a few seconds :-)
[21:59:41] thanks everyone!
[21:59:43] #endmeeting
[21:59:44] Meeting ended Wed Jun 8 22:00:19 2016 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
[21:59:44] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2016/wikimedia-office.2016-06-08-21.01.html
[21:59:44] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2016/wikimedia-office.2016-06-08-21.01.txt
[21:59:44] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2016/wikimedia-office.2016-06-08-21.01.wiki
[21:59:45] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2016/wikimedia-office.2016-06-08-21.01.log.html