[16:02:28] Hey folks, just getting set up for the office hour today. [16:02:35] * Keegan waves [16:02:51] #startmeeting Reimagining WMF grants office hour [16:02:51] Meeting started Wed Aug 26 16:02:51 2015 UTC and is due to finish in 60 minutes. The chair is i_jethrobot. Information about MeetBot at http://wiki.debian.org/MeetBot. [16:02:51] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. [16:02:51] The meeting name has been set to 'reimagining_wmf_grants_office_hour' [16:03:51] fyi my bouncer is lagging something awful [16:06:54] For those of you who haven't had a chance to read over the idea, feel free to refer to it here through the office hour today: [16:06:55] https://meta.wikimedia.org/wiki/Grants:IdeaLab/Reimagining_WMF_grants [16:07:39] Before we jump into things, I wanted to provide a brief overview of the consultation and the actual Reimagining WMF grants idea itself. [16:08:23] (Also, welcome everyone, and thanks for coming!) [16:08:37] So, the main reason behind organizing this consultation was due to a few recurring problems within the existing grants program. [16:09:21] First, in the existing system, it is not always clear which grant is the best fit for a request. [16:10:00] Second, the processes behind applying and reporting are complicated and burdensome. [16:10:25] And finally, our grant committees are overburdened. [16:11:06] Community Resources has developed an idea to address these concerns, which I've linked to above. [16:11:32] I'll let everyone refer to it as needed for details, but the basic structure consists of three main types of grants: Projects, Events, and Annual Plans. [16:12:19] Each type contains a few options that fit into common grant requests we receive (https://meta.wikimedia.org/wiki/Grants:IdeaLab/Reimagining_WMF_grants#Three_types_of_grants). 
[16:13:40] We're first looking to make changes to the Annual Plan Grant program, specifically by incorporating the Simpler Process APG option this fall, and changes to the other programs are likely to occur this year. [16:14:27] er, not "later this year," but "over the next year. [16:15:36] We've gotten helpful feedback so far that has resulted in some changes to the idea, and we'd like to hear more from you during office hours today. [16:17:18] So, please feel free to chime in with questions and comments about the idea or the consultation. [16:23:32] i_jethrobot: First of all, thank you. My interest regards the annual plan process. The project and event grant process currently emphasizes the outcomes of individual projects, which misses a large and necessary component: the growth and maturity of organizations that carry them out. Is there any plan to address this? [16:24:05] Great point, James. Yes, there is. [16:24:25] We are making a simpler APG process, that will be for organizations with annual plans that are not in the full APG process. [16:24:40] So the funds through these grants will be used for both projects and operating expenses. [16:24:53] Along with funding, we want to offer organizations more support from WMF staff. [16:25:37] We have been working on different ways to do capacity building, including the org effectiveness work and the community capacity development framework (which is for communities rather than organizations, however). [16:25:59] We hope that by having a funding option that's specifically for organizations with annual plans that are not in full process APG, [16:26:05] we can better address those needs. [16:27:21] We realize volunteer-powered organizations or organizations with lean staff have particular challenges and strengths. So we want the new funding option to help them. [16:28:08] James, does that answer your question / concern? [16:28:53] hare: ^ [16:30:06] It does to some extent. 
I'm also interested in knowing how you plan on evaluating organizations for organizational health as you do for program effectiveness. [16:30:37] We have been doing this as part of the full process APG program for some time. [16:30:48] I think the criteria will be simpler, but based on the full process APG criteria. [16:31:17] Let me point you to the draft version of the full process APG staff proposal assessment form. [16:31:22] To give you a general idea. [16:31:40] https://meta.wikimedia.org/wiki/Grants:APG/Staff_proposal_assessment_form [16:32:40] We probably won't have a staff proposal assessment as part of the simpler APG process, [16:32:51] but it's likely we'll have a rubric or scoring table like the one you see in that form. [16:33:09] Organizational health / effectiveness / capacity, will be an important component, alongside program effectiveness. [16:34:55] You can see the staff proposal assessment framework has 6 criteria in the organizational effectiveness dimension, focusing on different aspects of an organization's potential to achieve impact. [16:35:42] It's likely that the assessment for simpler process APG will have fewer, and that they will be more relevant to the organizations in that program. [16:35:45] All that said, [16:36:06] we know organizational effectiveness and capacity is a complex dimension of any annual plan grant application. [16:36:14] We can't reduce it to a few criteria in a table. [16:36:18] And we don't plan to. [16:36:27] For simpler process or full process APGs. [16:36:58] We want the support to continue to be specific and relevant to each organization, as for full process APGs now. [16:37:21] One thing we do want to do, [16:38:04] is make more guidelines (not requirements) as part of the simpler process APG option. [16:38:04] These will be general guidelines to help organizations make sure they are on a good path, for their own contexts and goals. [16:38:46] E.g. 
"These are some recommended steps organizations should take before increasing staff from 0.5 to 1.0 FTE, based on the experience of other organizations." [16:39:12] E.g. "Here are some ways other organizations have experimented with pilot projects before scaling them." [16:39:34] E.g. "Here are some controls you should put in place to prepare yourself for handling a larger budget." [16:41:24] James, are those details helpful? Can we provide more? [16:41:34] hare: ^ [16:42:23] wolliff_: What about volunteer-to-staff transitions? 0.0 to 0.5? [16:42:55] Yes, we want to provide guidelines for that too. That's going to be an important transition for many in this program. 0.5 to 1.0 was just one example :) [16:44:08] Works for me. Thank you. [16:44:16] Great. Thanks for the question! [16:48:28] Any other questions out there before we start wrapping up? [16:48:41] Ideas about grants? ") [16:49:24] One other thing we [16:50:06] We will have more opportunities for conversations at these hangouts: https://meta.wikimedia.org/wiki/Grants:IdeaLab/Events ! [16:50:41] One other thing we're interested in hearing about is how we can better gather community feedback during grants. If you have ideas on how we can better perform this outreach, feel free to join us at one of our future events. [16:53:21] Thanks for joining us for office hours today. If you didn't get a chance to participate today, you can also feel free to provide feedback on the Idea talk page: [16:53:31] https://meta.wikimedia.org/wiki/Grants_talk:IdeaLab/Reimagining_WMF_grants [16:54:30] Here's a question: how many of your grantees use IRC? :) [16:55:32] James, we didn't include that in the survey!! :) [16:55:35] Tally is one so far. : ) [16:55:35] Evidently one. [16:55:55] That's not true though. We've had IRCs about the FDC process in the past with more participation. [16:56:00] Not a lot... but some :) [16:56:06] But I'll be charitable and include myself as a former IEG grantee. 
: P [16:56:07] We like to be available. ;) [16:56:31] #endmeeting [16:56:32] Meeting ended Wed Aug 26 16:56:31 2015 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) [16:56:32] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-08-26-16.02.html [16:56:32] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-08-26-16.02.txt [16:56:32] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-08-26-16.02.wiki [16:56:32] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-08-26-16.02.log.html [20:59:20] #startmeeting RFC meeting [20:59:20] Meeting started Wed Aug 26 20:59:20 2015 UTC and is due to finish in 60 minutes. The chair is TimStarling. Information about MeetBot at http://wiki.debian.org/MeetBot. [20:59:20] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. [20:59:20] The meeting name has been set to 'rfc_meeting' [20:59:47] #topic Replace Tidy in MW parser with HTML 5 parse/reserialize | Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/ [21:03:22] so, I'm all for it :) [21:03:27] we are missing a few people today due to the management offsite [21:05:00] TimStarling, anyone in particular you wanted to see? [21:05:06] previously a few people were sceptical about SOA in general [21:05:15] I would like to hear from those people [21:05:20] but failing that, there are a few things to discuss [21:05:25] It would be nice if parsing wikitext didn't require a chain of services. [21:05:39] Some running Node, others running Java, etc. [21:05:54] yeah I guess I can make do with Marybelle [21:05:58] JAVAJAVAJAVA! JAVA FOR PEOPLE [21:06:31] I understand that Wikimedia can set up and maintain such services, but others really aren't interested in all that. 
[21:06:41] the implementation has changed in the last week or so [21:06:58] it used to be a servlet, but now it has an HTTP server embedded in it [21:07:12] and I used apache commons daemon to daemonize it [21:07:25] rationale? [21:07:49] the idea is to package it as a .deb, which makes it easy for ops [21:08:01] since it won't change much [21:08:15] they don't seem to have problems with deploying jars with git-fat [21:08:16] TimStarling: can people still use tidy if they want to? or nothing? Or is the new service a requirement? [21:08:17] apache commons daemon is packaged already [21:08:43] I haven't done the client side yet, but yes, I plan on making tidy an option [21:09:02] the new service won't be a requirement [21:09:08] I'm not sure the analogy to Hierator holds, since almost nobody wants hieroglyphics support on their own wiki. [21:09:36] Marybelle: if you can still use tidy in php, then why shouldn't wmf use something more targeted to the needs of giant wikis? [21:09:38] TimStarling: maybe not the time and place, but commons-daemon has always been a pain [21:09:40] also there is gwicke's rust parser, which he says can start up from the shell and parse a big page in 70ms or so [21:10:01] so that would be a simpler migration path from tidy for small installations [21:10:22] * DanielK_WMDE needs a reason to play with rust [21:10:36] what's the recommended solution for standalone installations? something shell-out-able? [21:10:38] Do small installations typically support executing Rust? [21:10:38] ah ok [21:10:56] it sounds like the small installation plan is to continue using tidy? [21:11:04] it could be distributed as a statically linked binary, like what we do with lua [21:11:05] ebernhar1son: I thought the goal was to kill Tidy. [21:11:08] just behind some interface [21:11:09] nice [21:11:15] :) rust++ [21:11:20] Marybelle: for WMF [21:11:27] Not for everyone else? 
[21:11:33] Marybelle: rust has no runtime, it's static-compilation [21:11:36] Wouldn't that make the situation even worse? [21:11:48] Where templates from Wikipedia that people want to re-use would behave even more unexpectedly? [21:12:13] Marybelle: i was under the impression that rust compiles to ELF [21:12:24] but i may be wrong, never played with it [21:12:24] I know almost nothing about Rust. [21:12:30] so a binary install can be done same as with tidy, texvc, etc, with no dependencies [21:12:31] yeah, it compiles to ELF [21:12:45] Which I imagine is also true of the people running the small installations we're attempting to support. [21:13:19] ebernhardson: My understanding was that a major goal of killing Tidy was more consistency long-term. [21:13:30] Marybelle: reuse templates from wikipedia?... haha... urg. [21:13:50] yeah, people do try to reuse templates from wikipedia [21:13:55] It's incredibly common. [21:13:57] yes; html5 standard is a known quantity and while it may change it'll change in well-documented ways if it does :) [21:13:58] and people who do that have always tried to match our configuration [21:14:07] so presumably they'll want to switch away from tidy when we do [21:14:30] * cscott pokes his head in [21:14:30] Of course. Otherwise you'll be HTML debugging hell forever. [21:14:35] Do most people who want to copy templates from wikipedia control their server? [21:15:03] I'm not sure it's all sysadmins. [21:15:04] * subbu lifts up his head as well [21:15:22] TimStarling: Is there a wiki page or some other thing that describes the proposal? [21:15:31] #link https://phabricator.wikimedia.org/T89331 [21:15:42] I see that there's now a disabletidy API parameter. [21:15:47] Aha thanks [21:15:47] Right to fork for the content is important IMO. I'm not sure adding this feature specifically is the straw that breaks the camel's back but I have some of the same concerns as Marybelle that there be a reasonable replacement for small and mid size wikis. 
[21:15:51] Krenair: most wikis have a sysadmin of some kind [21:16:11] I don't think completely unmaintained wikis are something we need to worry about [21:16:35] Most sysadmins (including our own?) are hesitant to stand up more micro-services in various languages. [21:16:44] you think it's better to depend on tidy than on any HTML 5 parser? [21:16:47] That was my reading of part of the Phabricator task. [21:17:34] look, here is the plan [21:17:54] the user will have a number of options: [21:17:57] it is easy to provide a html5 parser in a service in multiple langs ( php doesn't have right now since there isn't one .. but i imagine that will change over time). [21:18:04] 1. use the old pure PHP default [21:18:06] 2. use tidy [21:18:15] 3. shell out to a new thing equivalent to tidy [21:18:20] 4. use a web service [21:18:54] i think most low profit sites will opt to shell out [21:18:58] which should work fine [21:19:01] for 3 there are a few options, the fastest is html5ever/rust [21:19:04] I think that's the most encompassing solution anyone could ask for [21:19:14] (an overarching goal here is to replace tidy with something *with a written specification*. that allows multiple implementations (including PHP!) but I understand that the PHP pure HTML5 parsers are pretty pokey.) [21:19:22] probably the most portable option for 3 is python [21:19:29] there is a pure python parser which would suit [21:19:47] isn't python slow? [21:19:59] not really. it's faster than php [21:20:00] TimStarling: can your service use unix sockets instead of tcp? [21:20:16] (at least for interpreter startup time) [21:20:19] php is probably the "worst" for option 3, but it would satisfy the "i don't want to install anything else" crowd. [21:20:19] (does java support unix sockets? 
i don't remember) [21:20:25] python gives 2.1s for [[Barack Obama]] [21:20:46] of course the wikitext parser is far slower than that for the same page [21:20:46] MaxSem: honestly, there are more html5 parsers out there than you can shake a stick at. [21:21:23] if there was a PHP HTML 5 parser then it would replace 1 not 3 [21:21:26] Do all HTML5 parsers try to fix markup? [21:21:29] i think the bigger qn. to ask here is more about what the implications are of replacing tidy with a html5 parser. [21:21:35] I didn't realize the default is $wgUseTidy = false in DefaultSettings.php, so it seems to me this proposal doesn't affect small wikis at all. [21:21:36] Using "parser" to mean "magic fixer" seems like a novel usage to me. [21:21:47] Marybelle: no [21:22:04] perhaps i should mention that we eventually hope to do tidy on something like a per-template basis, to ensure that templates don't screw up the rest of the page. so performance is not entirely an academic interest here. [21:22:20] or should i say, "even small wikis might want to use something reasonably performant" [21:22:22] like I was saying on the ML, there is a distinction between a valid document and error recovery [21:22:25] spagewmf: Many wikis change that to true when they try to import content from elsewhere. [21:22:45] spagewmf: And many shared hosts include Tidy. I think it comes with PHP(???). [21:22:49] Marybelle, what do you mean by "fix markup"? .. but all html5-tree-build-algorithm parsers will build a well-formed DOM and they will be identical independent of implementation. [21:22:53] but both are specified in the "syntax" chapter of the HTML 5 spec [21:23:00] Marybelle: all HTML5 parsers generate a specific DOM for a specific input. 
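The four cleanup options listed earlier in this log (old PHP default, tidy, shell out to a tidy equivalent, web service) can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the function name `clean_html`, the mode names, and the idea that the "old default" simply passes markup through are all hypothetical, not MediaWiki's actual configuration or API, and the demo command is an uppercase filter standing in for a real HTML5 parse/reserialize binary.

```python
import subprocess
import sys
import urllib.request


def clean_html(html, mode="none", command=None, url=None):
    """Route markup through one cleanup backend (hypothetical sketch)."""
    if mode == "none":
        # Option 1 (assumption): rely on built-in cleanup done elsewhere,
        # so the markup is returned unchanged here.
        return html
    if mode in ("tidy", "shell"):
        # Options 2 and 3: pipe the markup through an external program,
        # reading the cleaned result from its standard output.
        if not command:
            raise ValueError("an external command is required for mode " + mode)
        result = subprocess.run(
            command,
            input=html.encode("utf-8"),
            stdout=subprocess.PIPE,
            check=True,
        )
        return result.stdout.decode("utf-8")
    if mode == "service":
        # Option 4: POST the markup to a web service and read the reply.
        if not url:
            raise ValueError("a service URL is required for mode service")
        request = urllib.request.Request(
            url, data=html.encode("utf-8"), method="POST"
        )
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8")
    raise ValueError("unknown cleanup mode: " + mode)


# Stand-in for a real cleanup binary: uppercases whatever it reads.
DEMO_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
```

Calling `clean_html("<p>x", mode="shell", command=DEMO_COMMAND)` runs the stand-in filter; a small installation would swap in whatever statically linked binary it ships, while a large one would pass `mode="service"` and a URL.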
[21:23:02] Marybelle: all tags are balanced in the DOM by definition [21:23:33] spagewmf: it does, because plenty of wikis turn it on to use Wikipedia's templates [21:23:33] but there are other parts of the HTML content model which are not guaranteed by the parsing algorithm. like, nested inside , IIRC. [21:23:38] subbu: How do they know where the closing tags should be if they're missing? [21:23:44] subbu: s/will be/should be/ [21:23:46] Is it just guesswork? [21:23:52] if you have a complete implementation of the syntax chapter of the HTML 5 spec, then you have a magic fixer [21:23:55] Marybelle, it is well specificed as part of the tree building algorithm. [21:23:58] * subbu looks for the link [21:24:01] *specified [21:24:17] yeah, missing closing tags and all that are specified [21:24:23] Don't you also run into a lot of issues (as I think we have with Tidy) where the spec isn't what actual clients/browsers support? [21:24:51] http://www.w3.org/TR/2011/WD-html5-20110525/tree-construction.html#tree-construction [21:24:56] we're going to use essentially the same HTML parser as firefox [21:25:03] we may in fact want to apply certain clean-ups beyond simple parse/reparse. we could integrate the sanitizer, for instance. but it's easier to write those based on DOM manipulations, not regexps on unparsed string input. [21:25:11] DanielK_WMDE, right :) [21:25:16] will vs. should [21:25:27] so it is the same as what actual browsers support [21:25:38] not that that is really critical [21:26:00] Supporting browsers seems somewhat critical. [21:26:04] and again, the goal is that if we do any cleanups past simple parse/serialize, we will have an actual specification for those, and they could be straightforwardly ported to , all of which are going to export the same DOM API on the parsed content. [21:26:15] Marybelle, browsers and html5 parsers are all based on the same spec. 
[21:26:24] no, it's not at all critical for browsers to be able to parse wikitext [21:26:28] * cscott is getting ahead of TimStarling here, we should probably get back to discussing packaging issues. [21:26:32] they only need to parse the HTML output [21:26:38] Sure. [21:26:41] Marybelle: subbu: in fact, the spec is based on browsers always did :P [21:26:42] TimStarling: were there things other than SOA and Java that you wanted to get broader input on? [21:26:50] yeah, testing [21:27:01] that is the biggest and most complex part of the project [21:27:20] I want to know if anyone has ideas about how testing should be done [21:27:39] Isn't the standard approach to use sampling with a dataset this large? [21:27:50] I've only started to sketch out a solution, and I wrote a few lines of code in java, but I'm really not sure if that is the right language [21:28:19] TimStarling, one option is to use the visual-diffing solution we have in parsoid. [21:28:35] compare the html rendering for tidy-generated html and the html5-parser-generated html [21:28:39] and see how bad the diffs are. [21:28:49] it diffs the source? or the rendered layout? [21:28:53] you can start with a small sample initially .. 100 odd pages. [21:28:53] I'm not sure what you're trying to test. Performance/throughput? Code differences in HTML output? Visual differences in rendered HTML? [21:28:56] rendered layout. [21:28:57] compares images. [21:29:01] TimStarling: not sure what you are getting at wrt testing, but if we are talking about html DOM manipulation... we could use a nice replacement for phpunit's deprecated assertTag [21:29:08] yeah, that could be useful [21:29:14] any chance of getting that as a by-product of all this? [21:29:31] writing complex tests in Java can often be painful. For integration testing things I would personally prefer some scripting language (python, ruby, even php) [21:29:41] I don't know what assertTag is [21:29:47] bd808: Groovy? 
[21:29:51] TimStarling, ex: http://parsoid-tests.wikimedia.org/visualdiff-item/pngs/enwiki/Leon_Garfield.diff.png [21:30:01] Is the new disabletidy API parameter intended to be used for testing? [21:30:03] that compares php parser rendering and parsoid rendering [21:30:06] bd808: it would have been a whole new set of dependencies, really it deserved a new project [21:30:27] the pink pieces are where there were diffs .. in this case, minor pixel diffs which are mostly irrelevant [21:30:47] that kind of visual diffing should give you a first pass sense of how bad the diffs are going to be. [21:31:00] Marybelle: it is explicitly specified in the html5 spec. [21:31:09] that is why all browsers now display the same content when given the same "broken html". [21:31:22] If the long term intent is to have competing implementations then maybe a standard test suite as a separate project would be appropriate? [21:31:32] Marybelle: all modern clients/browsers implement the html5 parsing spec. [21:31:45] http://parsoid-tests.wikimedia.org/visualdiff-item/pngs/enwiki/Medha_Patkar.diff.png is another one that shows more pixel diffs. [21:32:06] bd808: yeah, I guess so [21:32:22] cscott: So browsers already know how to handle mismatched tags, you're saying? [21:33:14] Marybelle: the concept for disabletidy is to have a script which fetches the untidied output, then it can be tidied in multiple ways by the script [21:33:31] All right. [21:33:33] DanielK_WMDE: one replacement for assertTag() to use real browser tests [21:33:56] I had a thought that you could put the page content in a frame. [21:34:03] it's basically a codified version of what the different browser vendors had groped their way blindly towards for interoperability [21:34:11] bd808: not really, since via the browser it's hard to do unit tests. [21:34:14] And that would probably stop mismatched tags from escaping? [21:34:20] and full integration tests cover fewer code paths [21:34:22] But I think everyone hates frames. 
[21:34:43] Marybelle, perhaps we could continue the tag fixup conversation after this is done? [21:35:14] Perhaps. [21:35:18] so we can do visual diffs [21:35:47] seems subbu's https://www.mediawiki.org/wiki/Parsoid/Visual_Diffs_Testing is tanned, rested, and ready [21:35:50] I think we should also assemble a set of test cases based on tidy bugs according to the tidy tracking bug on phabricator [21:36:42] spagewmf, yes .. mostly. [21:36:51] https://github.com/subbuss/parsoid_visual_diffs can also be used on the commandline. [21:37:25] if we have a set of fixed test cases, we can do regression testing with it [21:37:43] parser tests? a bunch of them mention tidy specific things [21:38:12] well, we can just plug it into MW and then have MW run its parser tests [21:38:18] does that work? [21:38:33] there's the wikilint project, parsoid has an rt-testing framework, and we've also got a visual diff framework. [21:38:43] it probably works, I should probably do that, it would be easy [21:38:45] for starters, commandline testing should help get a grip on it, but, you could also integrate it with a client-server testing setup (right now, unfortunately, still part of the core parsoid repo). [21:39:05] my question would be: how much of the content must be 'unchanged' in order to turn on the HTML5 tidy? [21:39:26] maybe that is not the right question [21:39:38] since there are desirable changes [21:40:23] maybe the desirable changes are specifically rare that we can make that the question [21:40:31] s/specifically/sufficiently/ [21:40:58] cscott: we have a visual diff framework? we were just discussing the need for that the other day with the wikibase team [21:41:04] the harder question is: what if it's impossible to fix a certain template so that it displays correctly in both tidy and the new html5 tidy? [21:41:11] DanielK_WMDE, https://github.com/subbuss/parsoid_visual_diffs [21:41:22] neat! 
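The visual-diff idea discussed above (render both outputs, compare the images, then ask how much must be "unchanged") can be sketched roughly as follows. This is not how parsoid_visual_diffs works internally; it is a minimal illustration that assumes the two screenshots have already been decoded into equal-sized grids of RGB tuples, and the names `diff_ratio` and `tolerance` are hypothetical.

```python
def diff_ratio(image_a, image_b, tolerance=0):
    """Return the fraction of pixels that differ between two renderings.

    Each image is a list of rows; each row is a list of (r, g, b) tuples.
    A pixel counts as changed when any channel differs by more than
    `tolerance`, which lets minor anti-aliasing noise be ignored.
    """
    same_height = len(image_a) == len(image_b)
    same_widths = all(len(a) == len(b) for a, b in zip(image_a, image_b))
    if not (same_height and same_widths):
        raise ValueError("images must have the same dimensions")

    total = 0
    changed = 0
    for row_a, row_b in zip(image_a, image_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += 1
            if any(abs(ca - cb) > tolerance for ca, cb in zip(pixel_a, pixel_b)):
                changed += 1
    # An empty image has nothing that could differ.
    return changed / total if total else 0.0


# Two 2x2 "screenshots" that differ slightly in one pixel.
WHITE = (255, 255, 255)
tidy_render = [[WHITE, WHITE], [WHITE, WHITE]]
html5_render = [[WHITE, WHITE], [WHITE, (250, 255, 255)]]
```

Here `diff_ratio(tidy_render, html5_render)` reports a quarter of the pixels as changed, while passing `tolerance=10` treats the same pair as identical. A per-page threshold on this ratio is one possible way to turn the "how much must be unchanged" question into a number.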
[21:41:37] can we export a parser function or some such identifying which is turned on? are there better options? [21:41:44] subbu: specific to parsoid, right? [21:41:52] TimStarling, no. you can feed it any two urls [21:41:57] because I guarantee that there will need to be some template cleanup. html tidy has *so many bugs*. [21:42:10] i mean .. some code clenaup will be required, but mostly minor. [21:42:11] i mean, "existing tidy has so many bugs". and they've been worked around in quite clever ways. [21:42:58] for example, almost all navboxes export empty table rows at the bottom, then rely on tidy to remove empty rows. [21:43:24] but i think we've worked around that in CSS now, by adding a CSS rule to `display:none` empty rows. [21:43:41] but there will be things like that, where template authors are relying on specific obscure tidy fixups. [21:43:52] cscott: if a template is relying on tidy and can't be made to work on both simultaneously, maybe it should stop relying on tidy and generate balanced output [21:43:55] TimStarling, maybe more than a little since it right now always fetches from a wiki (for php). [21:44:15] but yes, this brings up the question of migration timeline [21:44:15] cscott: I agree with Tim. Fixing the output is what I want. [21:44:30] I don't think there's any rush and I think we could fix most of it in a few years. [21:44:32] TimStarling: well, let's have this discussion later after we get some visual diff results. my intuition is that the ways we depend on tidy can get extremely baroque. [21:44:34] first there will be parser tests and what not [21:44:37] then visual diffs [21:44:57] let's assume that we get past those steps and discover that some templates will need to be fixed [21:44:58] TimStarling, DanielK_WMDE i am happy to do additional cleanup of that codebase to make it simpler to feed it two html files or two urls. [21:45:11] what then? how do we expose that? [21:45:21] or feel free to jump in as well -- fork the repo and all. 
[21:45:27] DanielK_WMDE, TimStarling: there's also https://github.com/Huddle/PhantomCSS which is another framework for visual diffs [21:45:39] there are quite a lot of them, actually, most listed at http://phantomjs.org/related-projects.html [21:45:44] can we identify templates that need to be fixed so that we can add them to a tracking category or something? [21:45:53] or is it only possible to identify them by eye? [21:46:03] it's possible that using a third-party visual diff framework might be better than hacking the parsoid one, since we're looking for specific sorts of things with parsoid. [21:46:24] TimStarling, we should be able to detect those automatically in some fashion ... [21:46:42] You could generate database reports. [21:46:45] there will need to be some heuristics, probably [21:46:48] Wikipedians love working through reports. [21:46:48] yes, heuristics. [21:47:02] like, for example, "ignore diffs which are only the removal of empty tags" [21:47:07] to reuse my favorite example [21:47:21] Does Tidy remove all empty tags or just table rows? [21:47:28] just about all empty tags. [21:47:29] seems to be all. [21:47:40] which we can mimic easily with css as well in the new output. [21:47:49] so, that is not such a big deal, imo. [21:47:50] really? wouldn't that be semantically wrong? [21:47:50] i'm not sure it's *actually* all, because if there's one thing you can predict about tidy, it's its inconsistency. [21:47:51] I don't think adding CSS is a reasonable solution. [21:47:51] what is the simplest migration path if everything goes well? [21:48:08] Marybelle: the css is already live on all production wmf wikis [21:48:12] Marybelle, why not? [21:48:25] I think layering CSS on top of bad HTML puts us back into a position where we're ignoring/hiding problems instead of fixing them. [21:48:36] it's part of how VE ensures that the layout doesn't jump and shift when it loads parsoid HTML over top of the PHP HTML. 
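The heuristic mentioned above, "ignore diffs which are only the removal of empty tags", can be sketched with Python's standard `html.parser`. This is a rough illustration only: `html.parser` is a tokenizer, not an HTML5 tree builder, so it does no tag balancing, and the helper names here are hypothetical. It assumes that an element containing nothing but whitespace counts as empty.

```python
from html.parser import HTMLParser


class _TokenCollector(HTMLParser):
    """Collect start tags, end tags and non-blank text as simple tuples."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, tuple(attrs)))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are content, not empty elements.
        self.tokens.append(("void", tag, tuple(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.tokens.append(("data", data.strip()))


def tokens_without_empty_elements(html):
    """Tokenize markup and drop every element that has no content."""
    collector = _TokenCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for token in collector.tokens:
        closes_empty = (
            token[0] == "end"
            and kept
            and kept[-1][0] == "start"
            and kept[-1][1] == token[1]
        )
        if closes_empty:
            kept.pop()  # start tag directly followed by its end tag
        else:
            kept.append(token)
    return kept


def same_ignoring_empty_elements(html_a, html_b):
    """True when two fragments match once empty elements are ignored."""
    return tokens_without_empty_elements(html_a) == tokens_without_empty_elements(html_b)
```

With this, `<table><tr><td>x</td></tr><tr></tr></table>` and the same table without the trailing empty row compare as equal, so a report generator could skip that page and list only pages with other kinds of differences.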
[21:48:39] Which is where Tidy has gotten us, IMO. [21:49:03] simplest? just turn it on and slowly purge the parser cache? [21:49:28] i think people are underestimating the scale of the fixups that could be required. [21:49:29] it would be nice if pages could opt-in somehow to not using tidy (__NOTIDY__) or something first [21:49:39] well, depending on where you set the "compatibility" dial. [21:49:57] cscott, i think testing will give us that input. so, let us not jump ahead of ourselves yet. [21:50:07] Ten minute warning folks [21:50:08] for example, if you turned off the "don't display empty tag" CSS, just about every page on enwiki would have an issue. at least all the pages with navbars. [21:50:15] cscott: subbu: hmm, i didn't know we do that. sounds suboptimal to me too. can we replace that with some DOM transform? (either in regular JS or PHP code, or perhaps an XSLT transform if you want to be "declarative") [21:50:20] i'm just saying, let's not try to boil the ocean in one go. [21:50:30] cscott: I think the concerns are pretty overblown. There's a lot of talk about regular wiki editors writing bad HTML, which doesn't seem to be a very common scenario. [21:50:38] Marybelle: ha! [21:50:42] Time to help TimStarling out by getting things into the meeting minutes [21:50:53] Sorry I'm late. I'll chime in on an issue that was already discussed because I don't think what I have to say is controversial. I looked at Tim's Java code yesterday. I am not a Java developer, but it struck me as idiomatic and clear, and I think it would be straightforward to maintain. Regarding 3rd party support, we could even consider providing a free service. The Foundation could afford it. But that's another conversation. [21:50:54] we've seen a *lot* of bad HTML (and bad wikitext) [21:51:07] cscott: In ten years of wiki editing, I can't remember ever adding a "
" to a Wikipedia article, for example. [21:51:11] ok, summaries and action items only please [21:51:28] anyway, what i'm saying is lets try to get compatibility as good as possible first, using any tricks we need to, and then we can turn them off gradually over time. [21:51:36] #info first step: plug into MW and run parser tests [21:51:47] it would be easy to generate a "this page has an empty tag" report, once the html5 tidy is deployed, for instance. [21:51:54] but it's not useful doing that quite yet. [21:51:57] #info second step: visual diffing [21:51:59] #info second step: generate visual diffs on a sample of test pages [21:52:01] oops [21:52:04] heh [21:52:18] :) [21:52:25] third step: profit [21:52:30] TimStarling: note that there's already support in parser tests for "html+tidy" output, which is where the output of the parser test after you run tidy differs dramatically from the vanilla php output. [21:52:49] (You might want #action instead of #info?) [21:53:13] first step is probably to add something like "html+timstidy" support to parser tests, and start documenting places where we are making a conscious decision to fix tidy bugs. [21:53:42] #info third step: pilot deployment, e.g. opt-in, report generation [21:53:49] The last item you all seemed to agree on was to defer hashing out a migration strategy for things that may exist that may depend on tidy until the test results come in and give us a sense of the magnitude of the problem. [21:54:01] ori: +1 [21:54:06] DanielK_WMDE, wrong. third step is always ... 
[21:54:29] TimStarling: note that there's a branch in the Sanitizer that depends on isTidy and otherwise does some cleanup of its own [21:54:35] #summary lots of options for wiki installs in terms of tidying (1) no tidy (2) tidy (3) shell out to html5 parser (4) web service providing html5 parse/serialize [21:54:47] * subbu is not sure if that was the right meeting bot incantation [21:55:28] ori: faidon seemed dubious about supporting the Hierator Java service in production, https://phabricator.wikimedia.org/T93787#1384436 [21:55:41] "Java (& JVM) is not one of the production platforms we support" [21:55:46] ori do you want to summarize that for meeting bot? [21:55:50] we didn't talk much about production packaging since we don't have ops representation [21:55:55] we can save that for another time [21:55:59] spagewmf: yeah, but I know him well, and I think he'd be cool with what tim came up with. I can't speak for him but I don't think it'd be a huge issue. [21:56:11] yeah, +1 [21:56:52] #summary The last item you all seemed to agree on was to defer hashing out a migration strategy for things that may exist that may depend on tidy until the test results come in and give us a sense of the magnitude of the problem. [21:56:59] subbu: thanks [21:57:02] I don't think #summary is a real thing. [21:57:16] http://meetbot.debian.net/Manual.html#commands [21:57:40] #info lots of options for wiki installs in terms of tidying (1) no tidy (2) tidy (3) shell out to html5 parser (4) web service providing html5 parse/serialize [21:57:46] #info The last item you all seemed to agree on was to defer hashing out a migration strategy for things that may exist that may depend on tidy until the test results come in and give us a sense of the magnitude of the problem. [21:58:06] any document updates needed? [21:58:21] We should cross-reference the meeting highlights with the Phabricator task. [21:58:21] test plans? 
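Note on the "this page has an empty tag" report mentioned at 21:51:47: a minimal sketch of how such a report might be generated, assuming plain Python and the standard-library html.parser rather than the html5 tidy under discussion. The names EmptyTagFinder and empty_tags are hypothetical and are not part of MediaWiki or any existing tool.

```python
from html.parser import HTMLParser

# Void elements can never have content, so they are not reported as "empty".
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class EmptyTagFinder(HTMLParser):
    """Collect names of elements that close with no text and no children."""

    def __init__(self):
        super().__init__()
        self.stack = []   # entries are [tag, has_content]
        self.empty = []   # tags found empty, in the order they were closed

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.stack[-1][1] = True      # the parent now has a child
        if tag not in VOID:
            self.stack.append([tag, False])

    def handle_data(self, data):
        # Whitespace-only text does not count as content.
        if self.stack and data.strip():
            self.stack[-1][1] = True

    def handle_endtag(self, tag):
        # Find the nearest matching open tag. Unclosed elements nested inside
        # it are discarded without being reported (a simplification).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                has_content = self.stack[i][1]
                del self.stack[i:]
                if not has_content:
                    self.empty.append(tag)
                return


def empty_tags(html):
    """Return the list of empty element names found in an HTML fragment."""
    finder = EmptyTagFinder()
    finder.feed(html)
    finder.close()
    return finder.empty
```

Run over a sample of rendered pages, a per-page count of these results would give a rough sense of how many pages rely on empty elements being hidden or removed.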
[21:59:02] #action TimStarling to document test plans and update phabricator task summary [21:59:51] #endmeeting [21:59:51] Meeting ended Wed Aug 26 21:59:51 2015 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) [21:59:51] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-08-26-20.59.html [21:59:51] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-08-26-20.59.txt [21:59:51] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-08-26-20.59.wiki [21:59:51] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-08-26-20.59.log.html [21:59:57] subbu, cscott: Wikitext (without HTMLTidy enabled) also seems to have some sanity checking in it. For example, ''' (bold markup) doesn't extend past a paragraph. [22:00:22] thanks TimStarling [22:00:48] next week we have master/slave datacentre strategy [22:01:01] AaronSchulz's RFC [22:01:01] Marybelle: yes, there are hacks all over. [22:01:23] cscott: That's a hack? [22:01:33] the purpose of next week's master/slave datacenter discussion is not merely to get sign-off but also to socialize the ideas, so please come even if it's not something you expect to have a strong opinion on [22:01:37] hack == "poorly documented fixup" [22:01:39] Lee Daniel Crocker once said that he considered wikitext to be a line-based language [22:01:42] Marybelle, that is just normal browsers fixing up broken html. [22:01:54] subbu: I believe MediaWiki is outputting . [22:02:00] ah, ok. [22:02:01] so bold runs to the end of the line because all markup runs to the end of a line [22:02:24] So we're not worried about wikimarkup screwing up the site interface, then? 
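Note on the "line-based" behaviour described at 22:01:39 and 22:02:01: a deliberately simplified sketch of bold markup that is closed at the end of each line. This is not MediaWiki's doQuotes, which also handles '' italics and ambiguous apostrophe runs; the function names render_bold_line and render are hypothetical.

```python
def render_bold_line(line):
    """Toggle <b> on each ''' and close any bold still open at end of line."""
    parts = line.split("'''")
    out = []
    bold_open = False
    for i, part in enumerate(parts):
        if i > 0:
            # Every ''' separator flips the bold state.
            out.append("</b>" if bold_open else "<b>")
            bold_open = not bold_open
        out.append(part)
    if bold_open:
        # Line-based rule: markup never carries over to the next line.
        out.append("</b>")
    return "".join(out)


def render(text):
    """Apply the line-based rule to every line independently."""
    return "\n".join(render_bold_line(line) for line in text.split("\n"))
```

Under this rule, wrapping a line in the middle of a bold run ends the bold at the line break, which is why the second line of such a run comes out plain.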
[22:02:25] Marybelle: and yes, i've had unexpected results when i wrapped a line in the middle of some boldface [22:03:02] TimStarling: there is a crop of so-called 'perceptual diff testing' tools, like https://dpxdt-test.appspot.com/ , but for a first pass, `links -dump` will generate formatted text output from html [22:03:09] If we're not worried about wikimarkup and that's what most users are using, I continue to wonder what problems we're actually solving with HTMLTidy and equivalent. [22:03:09] Marybelle: the in that case isn't done by tidy or the sanitizer, though. it's done by doQuotes as part of the parsing. [22:03:22] cscott: Sure. [22:03:32] ori: we can do a lot better than links --dump [22:03:45] cscott: I'm still trying to understand what problems that Tidy is solving for us. [22:03:48] ori: if you want plaintext, ocg just turned on its plaintext backend in production, for instance. [22:04:04] ori: to really socialize the ideas WMF should arrange field trips to the master and slaves :) Feel the RTT overhead [22:04:11] cscott: i know, i merged that patch :P [22:04:24] ori: oh right, thanks again for that btw! [22:04:48] but sure, there are lots of tools that could be used [22:04:51] anyway, i think there are lots of possibilities for visual diff testing. [22:04:53] :) [22:04:58] Marybelle: oh, that's a question with a very long answer. [22:06:19] cscott: I'm ready. [22:06:19] starting with https://www.mediawiki.org/wiki/Manual:$wgUseTidy, but see also https://www.mediawiki.org/wiki/Manual:$wgUseTidy and search for 'tidy' [22:06:32] That's the same link twice. [22:07:29] sorry, the second link should have been https://github.com/wikimedia/mediawiki/blob/master/tests/parser/parserTests.txt [22:07:39] > Note: HTML tidy will irreversibly and unexpectedly mangle standard HTML markup when it feels like it. [22:07:45] yes indeed! [22:07:53] that's why we're replacing it with html5 tidy. [22:07:53] Sounds great. 
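Note on the first-pass text comparison suggested at 22:03:02: a rough sketch of diffing two HTML renderings by their text content, assuming Python's standard library as a stand-in for `links -dump` (it does not call links and does not reproduce its layout). The names TextDump, dump_text and text_diff are hypothetical.

```python
import difflib
from html.parser import HTMLParser

# Tags treated as line breaks in the text dump (an illustrative subset).
BLOCK = {"p", "div", "li", "tr", "br", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextDump(HTMLParser):
    """Accumulate text content, inserting newlines around block-level tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK:
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def dump_text(html):
    """Return the non-blank text lines of an HTML fragment."""
    parser = TextDump()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Collapse runs of whitespace within each line and drop blank lines.
    lines = [" ".join(line.split()) for line in text.split("\n")]
    return [line for line in lines if line]


def text_diff(old_html, new_html):
    """Unified diff of the text dumps; an empty list means no text change."""
    return list(difflib.unified_diff(dump_text(old_html), dump_text(new_html),
                                     fromfile="old", tofile="new", lineterm=""))
```

A diff like this only catches changes in text and line structure; layout regressions would still need the image-based perceptual diffing tools mentioned in the same message.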
[22:08:23] it does some good things (see https://github.com/wikimedia/mediawiki/blob/master/tests/parser/parserTests.txt#L1987 for a tiny example) but it also does lots of bad things and has some terrible bugs. [22:08:46] so the goal here is to replace it with an html5-based tidy which does only the (rather minor) good things. [22:08:47] Right. I'm still in favor of fixing bad markup. [22:08:50] I guess I'm an outlier. [22:08:59] subbu: in https://www.mediawiki.org/wiki/Parsoid/Visual_Diffs_Testing, parsoid_vd is now your parsoid_visual_diffs on GitHub, correct? [22:09:10] ori: if you want plaintext, ocg just turned on its plaintext backend in production, for instance. [22:09:22] Marybelle: I agree with you, but there is a lot of bad markup. [22:09:28] this ties into a recent fantasy of mine [22:09:40] cscott: Sure. It grows and spreads when we put hacks on top of it that mask/hide it. [22:09:40] telnet wikipedia? [22:09:51] Marybelle: so something like html5 tidy is a good step in any case, because it sets expectations for exactly what sort of fixup we are going to do. [22:09:53] I was reading a news article recently where someone was quoted to have said "I would never trust a Tor hidden service anymore" [22:10:06] because you know they are all compromised by the FBI via zero days [22:10:08] spagewmf, yes, it is the diffing service that is based on that github repo [22:10:08] the html5 spec even has written places where warnings should be issues because the input is invalid [22:10:16] and then the FBI compromises your browser when you visit them [22:10:22] subbu: I'll update [22:10:25] Marybelle: so really the html5 tidy work is a good first step to realizing your ambition as well. [22:10:38] so the solution is obvious [22:10:51] hidden telnet service [22:10:54] The NSA, not the FBI, but yeah. [22:11:10] deliver wikipedia content via telnet over tor [22:12:02] cscott: Perhaps, yeah. I'm still pretty hesitant to see another layer. 
[22:12:04] arlolra might have something to say about tor services all being compromized by the nsa/fbi via zero days [22:12:09] the media says the NSA, the leaks say the FBI [22:12:14] > look [22:12:14] You are in a maze of twisty little passages, all alike. There is a quill on the floor. [22:12:14] > pick up quill [22:12:16] You pick up the quill. [22:12:18] > edit page with quill [22:12:34] * subbu has wasted a lot of time on that stupid game :) [22:12:42] yeah [22:12:51] cscott: If we only want very minor behavior from Tidy, I'm not sure we need a separate (micro)service or whatever. [22:12:59] * ori lost a year of life in his teens to MUDs [22:13:06] but this has all been done before, large text files delivered by BBS [22:13:29] zmodem! [22:13:56] no, if you download an HTML file and then open it in your browser, you are equally screwed as if you went to the site with HTTP [22:14:06] it has to be actual plain telnet [22:14:40] maybe some ASCII art is permissible [22:14:44] ha [22:14:49] MatmaRex, to circle back to your remark .. i don't if there is css currently live that turns off empty elements .. i was remarking we could do that .. but alternatively, yes, we could done a post-dom-building normalizing pass that might do things like that. [22:14:58] i don't *know* if [22:15:24] TimStarling: my text backend uses unicode so you get some nice math and super/subscripts, etc. [22:15:25] TimStarling: let me know the .onion of your telnet services [22:15:46] TimStarling: but you could convert mw-ocg-texter into your plaintext service very easily. [22:15:56] TimStarling: it takes parsoid HTML output, but you can get that from restbase. 
[22:16:15] there's nothing http specific about hidden services [22:16:36] I actually asked politely on alt.ascii-art (still exists) for an ascii WMF logo and got one: https://gist.githubusercontent.com/atdt/acd6a851fb1a488176ea/raw/230ca20bad9c13d723fd7e61ef1fc2c731d4c895/gistfile1.txt [22:16:38] telnet should just work [22:16:45] i think you'd actually want telnets [22:17:04] there's a cool wikipedia one but i can't find it now (not the crappy one generated programmatically from the bitmap) [22:17:04] so that the connection between WMF and the tor entry node wasn't snoopable [22:18:08] oh yeah, here we go [22:18:22] Wikipedia telnet server: https://gist.github.com/atdt/4037228 [22:19:01] TimStarling, what if NSA is badass enough to 0day your telnet via ASCII art? [22:19:06] var logo = fs.readFileSync( 'wiki-logo.txt' ); [22:19:06] i wish i had saved that [22:19:10] "secure telnet" is on port 992 according to my /etc/services [22:19:32] ha [22:21:22] legoktm: I think you were around when I hacked that together, do you know which logo I'm talking about? [22:21:39] ori: speaking of ASCII, your innate IRC alignment skill is impressive https://phabricator.wikimedia.org/F2403163 [22:22:17] visual poetry [22:23:16] spagewmf: ph33r my l33t alignment skills ph33r my l33t alignment skills [22:23:46] is the secret message "him think 'd be"? [22:24:58] or "pao him think 'd be " [22:26:06] The secret message is "Join the Navy" [22:32:15] subbu|afk: BTW, the @remote path in visualdiff results returns 502 Bad Gateway, e.g. http://parsoid-tests.wikimedia.org/visualdiff-item/diff/enwiki/Fragnes