[21:02:46] #startmeeting RFC meeting: Migrate to HTML5 section ids
[21:02:46] Meeting started Wed Aug 30 21:02:46 2017 UTC and is due to finish in 60 minutes. The chair is TimStarling. Information about MeetBot at http://wiki.debian.org/MeetBot.
[21:02:46] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
[21:02:46] The meeting name has been set to 'rfc_meeting__migrate_to_html5_section_ids'
[21:03:18] task ID for those with leaky memories? :)
[21:03:25] https://phabricator.wikimedia.org/T152540
[21:03:51] thanks :)
[21:04:00] MaxSem: ping
[21:06:03] or kaldari?
[21:07:35] I just realized I didn't announce this meeting on the ticket. sorry about that.
[21:07:40] it was in the radar mail, though
[21:08:22] TimStarling, pong
[21:08:32] ah, good. *phew*
[21:08:35] yay
[21:09:44] so question 1 carried over from the committee meeting: what exactly is the problem here?
[21:10:09] MaxSem said https://gerrit.wikimedia.org/r/#/c/370101/ is "stuck in bikeshedding" but we could not find the actual bikeshedding, maybe it is offline?
[21:10:31] the bikeshedding about it is in the bug itself, not gerrit
[21:11:20] MaxSem: so the question is where in the DOM the legacy encoded ID should reside, right?
[21:12:21] Krinkle said that the issue is mainly what to tell people who need to migrate their scripts. They need to know how to find the actual heading based on the element with the legacy id.
[21:12:28] what shall we tell them?
[21:13:56] regarding Krinkle's comment https://phabricator.wikimedia.org/T152540#3559993
[21:14:26] ...and the original community request that prompted this is https://meta.wikimedia.org/wiki/2016_Community_Wishlist_Survey/Categories/Miscellaneous#Non-Latin_section_headings_are_displayed_terribly_in_URL_anchors_and_can.27t_be_reached_directly
[21:14:26] with this patch, there should be no difference: the <h> tag is still the parent of the element whose id is matched
[21:14:59] that should be uncontroversial, then.
was there any opposition to this approach?
[21:15:02] before the patch, the old ID is moved outside the <h>, right?
[21:15:14] MaxSem: Yep, I just left a comment on the patch. The change LGTM. It maintains viewport compat for end-users and tree logic for JS code.
[21:15:31] TimStarling, yes, it used the existing code that put it in a div before the h
[21:15:39] Just one issue left for element-finding of the mw-headline element, which in CSS cannot go via the heading parent
[21:16:41] one thing I would really like to have discussed is whether we should percent-encode
[21:17:25] doing so appears more "correct", however raw UTF-8 has better browser support
[21:17:51] we have to percent-encode at least any double quotes, right?
[21:17:58] you mean percent-encoded in IDs
[21:18:11] double quotes are HTML-encoded
[21:18:24] ah, right, of course
[21:18:57] was there any argument in favour of it apart from it appearing correct?
[21:20:23] if there are *any* characters that need to be thus encoded, it may be good to encode more, to raise awareness
[21:20:24] one argument was that consistently percent-encoded URLs are easier to copypaste and automatic linking works better
[21:20:44] are there any characters that need to be encoded? what about line breaks?
[21:21:16] whitespace is basically the only thing that needs to be encoded, and we replace it with underscores
[21:21:27] MaxSem: but that is only relevant if you copy&paste from the html source, right?
[21:21:49] no, the use case discussed was from browser address bar
[21:22:19] ah - i guess the address bar would keep the encoding, at least when copying, if not for display
[21:22:34] that's a valid concern, i guess
[21:22:51] yep, even when it "decodes" the URL for display it still copies the original
[21:22:56] without the %-encoding, copy&paste of such a url into wikitext won't work reliably
[21:23:11] but honestly, with percent encoding we're under 50% support
[21:23:27] without it, even Microsoft junk works
[21:23:48] works as in displays actual unicode in the location bar?
[21:23:53] yup
[21:24:03] definitely nice to have
[21:24:18] Hello, I, as initial proposer of this change, would like to emphasize that with percent-encoding, the solution of this issue doesn't address the initial problem, which is: "Non-Latin section headings are displayed terribly in URL anchors and can't be reached directly"
[21:24:19] the most important browser we're losing is Chrome
[21:24:24] which is what most ppl use
[21:24:47] I think displaying the actual unicode is way better UX-wise. same for c&p.
[21:25:02] <_joe_> MaxSem: so in chrome percent-encoding is not working, but plain unicode does?
[21:25:04] there was an argument made that this should be fixed in the browser
[21:25:08] originally I spoke in favour of having percent-encoded URLs, but I think a close-reading of the spec showed UTF-8 to be technically valid?
[21:25:35] btw do we verify anywhere it is a valid UTF-8 actually?
[21:25:50] _joe_, yup
[21:25:58] because if we put direct utf-8 non-encoded there could be trouble without validating
[21:26:05] well, parser input is NFC
[21:26:18] TimStarling: i think copy&paste from address bar to wikitext is a valid concern.
[21:26:40] we normalize both input and output
[21:26:42] SMalyshev: %00%00 :)
[21:27:04] DanielK_WMDE__: also the %E0 thing
[21:27:08] it doesn't work in the non-fragment part, does it DanielK_WMDE__ ?
[21:27:17] MaxSem: ok, cool then :)
[21:27:37] you can't copy the path part of the URL into an internal link, people try it but it's not the right way to do it
[21:28:11] I was thinking of external link syntax
[21:28:24] for internal link syntax, percent-encoding is actually a problem, because it's ambiguous
[21:28:24] DanielK_WMDE__, the thing here is that we don't percent-encode () anyway, which was one of original arguments for copypastability
[21:28:58] MaxSem: we could just allow () in the url regex
[21:29:31] but i'm starting to realize that there is a fundamental problem with the copy&paste argument
[21:29:45] what's better for the external link syntax is worse for the internal link syntax
[21:29:46] in every email/IM client, every markup parser? :P
[21:29:52] which one do we want to work nicely?...
[21:30:32] MaxSem: if all browsers decoded properly for display, we could just encode everything, and be safe :)
[21:30:41] our external link syntax severely sucks, we should just change it
[21:30:55] +1
[21:31:05] we could make it the same as internal links
[21:31:06] It's not only copy&paste that matters. Readability itself and ability to exchange links with readable fragments matters a lot.
[21:31:22] although I suppose that would bring protocols into conflict with namespace names
[21:31:42] so right now we're talking about stuff like https://ru.wikipedia.beta.wmflabs.org/wiki/%D0%93%D0%BE%D1%82%D1%8C%D0%B5,_%D0%96%D0%B0%D0%BD-%D0%9F%D0%BE%D0%BB%D1%8C#%D0%A5%D3%99%D0%B9%D1%80%D0%B8%D3%99%D0%BB%D0%B5%D0%BA_%D1%8D%D1%88%D0%BC%D3%99%D0%BA%D3%99%D1%80%D0%BB%D0%B5%D0%B3%D0%B5
[21:31:43] TimStarling: ok, amen to that! but we'd still have to support the old syntax, at least for old revisions. or support multiple syntax versions
[21:32:16] you know cscott's nuclear option for syntax change
[21:32:23] MaxSem: is there any problem with that except for display in the address bar?
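For readers following the percent-encoding argument above, here is a minimal Python sketch of the two link-fragment forms under discussion. It is an illustration only, not MediaWiki code: `fragment_forms` and the sample heading are hypothetical, and the only behaviour taken from the log is that whitespace becomes underscores.

```python
from urllib.parse import quote, unquote


def fragment_forms(heading: str) -> tuple[str, str]:
    """Return (raw, percent_encoded) fragments for a section heading."""
    # Per the log, whitespace is replaced with underscores in either variant.
    raw = heading.replace(" ", "_")
    # safe="" percent-encodes everything except unreserved ASCII characters,
    # writing non-ASCII characters as their UTF-8 bytes.
    encoded = quote(raw, safe="")
    return raw, encoded


raw, encoded = fragment_forms("Ранние годы")  # hypothetical heading
print("#" + raw)      # readable Unicode fragment
print("#" + encoded)  # ASCII-only fragment, safe to paste but hard to read
assert unquote(encoded) == raw  # both forms name the same anchor
```

Both strings identify the same section; the disagreement in the meeting is about which one to emit in `href` attributes, given how browsers display and copy them.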
[21:32:36] I'd like to note that the ONLY browser that was reported to decode fragments in the address bar is Firefox. Not only Chrome lacks support for it.
[21:32:44] just parse all existing wikitext and reserialize it into a different format
[21:33:01] well, people wanting fragments to display nicely was kinda the reason we're doing this :P
[21:33:10] Exactly.
[21:33:36] right now my chrome shows the URL as proper words and fragment as percent-encoded mess.... weird
[21:33:51] strangely inconsistent...
[21:33:52] why the difference?
[21:34:06] Jack_: you support having unicode in the URL fragment then?
[21:34:31] <_joe_> SMalyshev: it has to do with security concerns
[21:34:36] and that is the current implementation, right?
[21:34:52] _joe_: do you know any rtfm link for that?
[21:35:07] <_joe_> people used weird unicode tricks to do url spoofing IIRC, but that was for domain names, now that I recall
[21:35:14] TimStarling, right now we encode the links, but not IDs
[21:35:21] <_joe_> SMalyshev: lemme look
[21:36:14] what happens if we stop encoding the links?
[21:36:39] dancing angels etc.
[21:36:53] btw, regarding copy&paste: iirc, firefox will copy the text as displayed if you only select part of the url. If you select all, you get the original encoding, and the leading https://
[21:36:57] I am fine with changing it to plain unicode in both places, if there are no objections to that
[21:37:32] I remember when this feature was introduced into firefox and may have been influenced by firefox in previously recommending percent-encoding
[21:37:56] TimStarling: I saw two options from the beginning: 1) urge Chrome developers to decode fragments in address bar, 2) implement Unicode fragments. I don't see any enthusiasm regarding the first part, so only the second is left for me to support.
[21:38:04] obviously what firefox is doing is inelegant, and there were odd consequences like what DanielK_WMDE__ says
[21:38:45] i actually like the fact that someone sat down and thought of a heuristic for this :) but i'm not sure i like the inconsistency.
[21:39:00] I followed a mozilla bug wherein copying to the clipboard worked, but the middle-button register on X11 was getting the unicode URL
[21:39:16] I read the patch where they fixed it, it was an eyesore
[21:39:28] hehehe...
[21:39:56] it's nasty hacks all the way down...
[21:40:04] and what's more important, we can reenable percent encoding if something goes wrong without having to invalidate any HTML
[21:40:57] DanielK_WMDE__: "firefox will copy text as displayed..." - chrome does that too.
[21:41:01] yeah copypaste is still percents on firefox for me (mac) but at least the display is nice
[21:41:07] (I just searched on "developing wikidata wikibase with HTML 5 plan" and also found this - https://www.mediawiki.org/wiki/HTML5 - which is a kind of a Mediawiki HTML5 plan. Will today's Wikidata office hour further support this Mediawiki HTML5 please - or where does today's office hour fit into both Wikidata/Wikibase working with HTML5 and MediaWiki also working with HTML5? Thanks)
[21:41:52] maybe we should just enable unicode and tell people to go complain to chrome devs so they make it nice :)
[21:42:26] Scott_WUaS: this is about making use of the additional freedom that html5 gives to allow better support for the display of non-ascii section titles in the browser address bar, when jumping to sections following a link.
[21:42:50] SMalyshev: you mean enable links with percent-encoded href attributes, that's the thing that currently works in firefox only
[21:43:13] but if it is security, they probably won't change it
[21:43:25] Scott_WUaS: so, a slight improvement for internationalization, touching a surprising number of nasty technicalities.
[21:43:31] someone only needs to say "but security" and that will end the discussion
[21:44:00] Thnx, DanielK_WMDE__
[21:44:18] I'm not seeing any security issue with this... of course, it is almost always wrong to say "it's irrelevant for security" but fragments kinda seem that way
[21:44:31] TimStarling, http://i.imgur.com/KxivB9k.jpg
[21:45:03] since we are now posting memes, let me ask an ignorant question...
[21:45:22] i read something about cache warming. i didn't quite get that bit. can someone explain?
[21:45:29] <_joe_> SMalyshev: I wasn't implying that
[21:45:52] _joe_: yeah I was answering to TimStarling
[21:46:00] <_joe_> ahah ok I didn't get that :P
[21:46:03] also, https://memegenerator.net/instance/64669371
[21:46:11] DanielK_WMDE__: articles need to have new IDs before you can link to the new IDs
[21:46:11] <_joe_> DanielK_WMDE__: cache warming?
[21:46:16] DanielK_WMDE__, we start by serving IDs for both old and new encoding. then we switch encoding in links
[21:46:44] ah, i see, that bit. i thought it was something else. thanks!
[21:46:54] so for a month we'll be populating parser cache
[21:49:10] maybe this is relevant: https://bugs.chromium.org/p/chromium/issues/detail?id=265346
[21:49:21] just doing some searches in the chromium bug tracker, there may be more relevant bugs
[21:50:32] The implementation in core was made configurable so we can switch from legacy-encoding to clean-looking html5-encoding. In addition, the configuration allows two modes to be enabled, in which case the HTML output for ==heading== will produce 2 html elements, one for each of the two ID encodings (one empty element, and one element with the actual heading text). And then once parser cache has rolled over, we can make the new method
[21:50:32] primary for [[Links#heading]] produced by parser.
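The dual-ID transition described in the 21:50:32 message can be sketched roughly as follows. This is a hedged Python illustration, not the MediaWiki implementation: `html5_id`, `legacy_id`, and `render_heading` are hypothetical names, and the legacy rule shown here (percent-encode, then turn `%` into `.`) is an assumption about the old dot-encoded form that the log itself does not spell out.

```python
import html
from urllib.parse import quote


def html5_id(heading: str) -> str:
    """Readable ID: keep Unicode, only replace spaces with underscores."""
    return heading.replace(" ", "_")


def legacy_id(heading: str) -> str:
    """Assumed legacy ID: percent-encode, then replace '%' with '.'."""
    return quote(heading.replace(" ", "_"), safe=":").replace("%", ".")


def render_heading(text: str, level: int = 2) -> str:
    """Emit a heading carrying both IDs while caches still hold old links."""
    primary, fallback = html5_id(text), legacy_id(text)
    spans = ""
    if fallback != primary:
        # Empty element inside the heading so old #fragments keep resolving.
        spans += f'<span id="{html.escape(fallback, quote=True)}"></span>'
    # Element with the actual heading text; quotes in the ID are HTML-escaped.
    spans += (
        f'<span class="mw-headline" id="{html.escape(primary, quote=True)}">'
        f"{html.escape(text)}</span>"
    )
    return f"<h{level}>{spans}</h{level}>"


print(render_heading("Ж a"))
```

Keeping the fallback element inside the heading matches the point agreed earlier in the meeting: scripts that match the legacy ID can still reach the heading through the matched element's parent.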
[21:51:54] sounds like you can go ahead and change the link encoding, MaxSem
[21:52:01] hehe, that bug reminds me of an issue with a CA issuing certificates for domains with line breaks in them, and browsers comparing the domain only up to the newline... so you get yourself a cert for google.com\nfuddledaddy.cx, and...
[21:52:32] ok, I'll remove percent-encoding
[21:52:44] +1 for no percent-encoding
[21:53:05] also +1 for moving the element with the id into the <h> tag
[21:53:32] for the latter, you can just press +2 ;P
[21:54:15] in that chromium bug, U+2028 "Line Separator" in the URL causes a line break in the address bar, and then the user sees only the second half of the URL
[21:55:01] presumably it would happen regardless of percent encoding, it just depends on unicode being displayed
[21:55:10] TimStarling: *that* can be a security issue
[21:55:24] MaxSem: shouldn't the second span also have class="mw-headline"?
[21:55:32] there are also similar bugs relating to bidi
[21:55:49] <_joe_> TimStarling: but the link if not percent encoded would be harder to spot as a forgery, maybe
[21:55:51] DanielK_WMDE__, hmm
[21:55:52] also, who had the bright idea of displaying newlines in one-line URL bar?
[21:56:27] <_joe_> so maybe that's the reason for that, but - is there a point in trying to guess chromium developers' intentions?
[21:56:27] in any case, it was fixed in 2014
[21:57:08] Thanks, All!
[21:59:33] #endmeeting
[21:59:33] Meeting ended Wed Aug 30 21:59:33 2017 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot .
(v 0.1.4)
[21:59:33] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-08-30-21.02.html
[21:59:33] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-08-30-21.02.txt
[21:59:33] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-08-30-21.02.wiki
[21:59:33] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-08-30-21.02.log.html