Final Sigma (was: RE: Esszett, Final Sigma, ZWJ and ZWNJ)

Mon Mar 2 17:51:17 CET 2009

I imagine that one of the biggest objections to this idea is that many
users copy and paste domain names or URLs into the URL bar, and they
will expect it to work, even if the original was written to work with
IDNA2003 or a local mapping for a different language. This is indeed
one of the problems with local mapping in general.

One idea might be to try the user's language's local mapping first,
then IDNA2003, then local mappings for other languages. This could
lead to security problems, so if the protocol is HTTPS, the client
might want to avoid the other language mappings (and maybe even the
IDNA2003 mapping).
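
A rough sketch of that try-in-order idea, only to make the ordering and the HTTPS restriction concrete. The function name, the `local_mappings` table and the `skip_idna2003_on_https` flag are all hypothetical, and the mapping functions are supplied by the caller; nothing here is an existing client API.

```python
from typing import Callable, Dict, List

Mapping = Callable[[str], str]


def candidate_labels(
    label: str,
    user_lang: str,
    local_mappings: Dict[str, Mapping],
    idna2003_map: Mapping,
    scheme: str,
    skip_idna2003_on_https: bool = False,
) -> List[str]:
    """Return mapped forms of `label` in the order a client might try them."""
    secure = scheme.lower() == "https"
    candidates: List[str] = []

    def add(mapped: str) -> None:
        # Different mappings often agree; keep only the first occurrence.
        if mapped not in candidates:
            candidates.append(mapped)

    # 1. The local mapping for the user's own language.
    if user_lang in local_mappings:
        add(local_mappings[user_lang](label))
    # 2. The global IDNA2003 mapping (optionally skipped for HTTPS).
    if not (secure and skip_idna2003_on_https):
        add(idna2003_map(label))
    # 3. Local mappings for other languages, never for HTTPS.
    if not secure:
        for lang, mapping in local_mappings.items():
            if lang != user_lang:
                add(mapping(label))
    return candidates
```

With a Turkish mapping registered under "tr" and plain lower-casing standing in for IDNA2003, the label "ISIK" would be tried first with dotless ı and then as "isik" over HTTP, and only with dotless ı over HTTPS when the strict flag is set.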

Any thoughts about local mapping in general, John?

Erik

On Sun, Mar 1, 2009 at 10:52 AM, Erik van der Poel <erikv at google.com> wrote:
> Hi John,
>
> This is an interesting discussion about server-side matching and
> display issues. More comments below.
>
> On Fri, Feb 27, 2009 at 3:05 PM, John C Klensin <klensin at jck.com> wrote:
>> However, there is one difference if one went to a server-side
>> matching model (independent of whether the input to that model
>> was Punycode, UTF-8, or something else).  If the comparison and
>> equality/equivalence check is done on the server, then we could
>> go back to separating the question of "what is represented and
>> encoded" from that of "what matches", just as the ASCII DNS
>> model does with case-matching.
>>
>> From that point of view, I don't know where Eszett or the French
>> discussions would fall out, but it is clear to me that the
>> preferred solution to Final Sigma would be to keep it in the
>> stored domain names (facilitating the desired display) but be
>> sure that the matching procedure treated upper-case sigma,
>> lower-case sigma, and final sigma as equivalent for DNS purposes.
>>
>> Doing that matching operation on the server would, however,
>> require modifications to the DNS at least as significant as
>> those that Andrew described and, like them, could not
>> realistically be expected to be deployed in less than a decade
>> and perhaps much longer. And, as you suggest, we would then
>> have to wrestle with exactly the same issues about what should
>> be considered equal (matching) and what would not -- doing
>> server-side matching would merely parse out the display issues.
>
> I agree that server-side matching would not work, because we would
> have the same issues about what should be considered equal. This is
> language-dependent, as we have seen with the Turkish case-mappings for
> the letter i. We probably cannot perform language-dependent matching
> on the server side, because it would be hard to pass language
> information to the server in the current DNS.
>
> So, we are left with client-side /mapping/ because you cannot /match/
> on the client, where we don't have the set of names to match against.
>
> Until now, most implementations have been performing global
> (language-independent) mappings, but there appears to be a need for
> local (language-dependent) mappings. The most commonly used example of
> local mapping is the Turkish i.
>
> So, the key question is where to introduce local mapping. I believe
> the answer may be protocol-dependent. In this email, I will divide and
> conquer, focusing on HTML only. Later, we can see whether this model
> applies to other protocols.
>
> In the case of HTML, the implementers have unfortunately allowed the
> href attribute to become polluted with non-Punycode IDN labels. They
> did this by applying the IDNA2003 mappings to labels found inside
> hrefs. Once they did this, we saw a number of HTML files (small in
> relative terms, but not in absolute terms) take advantage of the mappings.
>
> Now we seem to be stuck with these mappings because a number of key
> individuals appear to be reluctant to get rid of the mappings. Note
> that I said "seem to be" stuck.
>
> Let's assume that we are stuck with those mappings in HTML, just for
> the moment. Where would we perform local mappings? I am going to give
> two examples: keyboard and registrar.
>
> Keyboard. Specifically, the place where the user (not HTML author) can
> type a URL. The user selects the language of the UI menus, dialogs,
> etc., in some fashion. For example, a Turkish user might buy a PC in
> Turkey. That copy of Windows and MSIE might come up in Turkish. In
> that case, MSIE ought to perform Turkish case-mappings for the letter
> i.
>
> Registrar. A Turkish registrant uses a Web search engine to find a
> registrar with a Turkish UI. The registrant clicks on that link and is
> presented with a Turkish HTML document containing a form where the
> registrant can type the desired domain name. Given that the HTML
> document is in Turkish, the registrar performs Turkish case-mappings
> and asks the registrant to confirm the (lower-cased) name. This might
> be accompanied by a "Help" link that the registrant can click on to
> get an explanation of the case-mappings that were just performed. In
> general, however, there would be no surprises, because the registrar
> performs locally well-known mappings.
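
To illustrate the keyboard and registrar cases above, here is a minimal sketch of a Turkish lower-casing step. The function names are made up for this example, and it only handles the two Turkish-specific pairs (I/ı and İ/i) before falling back to generic lower-casing; a real client would more likely use a locale-aware library.

```python
DOTLESS_I = "\u0131"    # ı, the Turkish lower-case partner of I
DOTTED_CAP_I = "\u0130"  # İ, the Turkish upper-case partner of i


def turkish_lower(label: str) -> str:
    """Lower-case `label` using the Turkish rules for the letter i."""
    # Handle the two Turkish pairs first, then lower-case everything else.
    return label.replace("I", DOTLESS_I).replace(DOTTED_CAP_I, "i").lower()


def registrar_preview(label: str, ui_lang: str) -> str:
    """Return the lower-cased name a registrar form would ask the
    registrant to confirm, chosen by the language of the form."""
    if ui_lang == "tr":
        return turkish_lower(label)
    return label.lower()
```

A Turkish-language form would show "ISPARTA" as "ısparta" for confirmation, while an English-language form would show "isparta".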
>
> Now let's turn to the Greek tonos. Unlike the case of the letter i,
> where the Turkish case-mappings are different from those used in
> English and many other languages, the Greek script and tonos accents
> are /not/ used by very many languages other than Greek. Even if there
> are language communities that use the Greek script but do not want the
> tonos to be stripped the way .gr folks want, those language
> communities can be served by language-specific mappings in e.g.
> keyboard and registrar.
>
> So, for the Greek script, Latin-script-based language UIs like English
> can strip the tonos and fold the 3 sigma cases to normal lower-case
> sigma. This means that even if some special-interest group (e.g.
> mathematicians, scientists) want to distinguish normal sigma from
> final sigma in the DNS, they cannot do so unless they successfully
> lobby for their own "UI language" and new prefix other than xn--
> (given that xn-- labels containing final sigma will cause problems in
> contexts like HTML).
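
The Greek local mapping described above (strip the tonos, fold the three sigma forms to σ) might look roughly like this. It is a sketch, not a proposed table: it assumes the tonos shows up as U+0301 after canonical decomposition, and the name `greek_fold` is invented here.

```python
import unicodedata

TONOS = "\u0301"        # combining acute accent; what the tonos becomes in NFD
FINAL_SIGMA = "\u03c2"  # ς
SIGMA = "\u03c3"        # σ


def greek_fold(label: str) -> str:
    """Lower-case, strip the tonos and fold final sigma to normal sigma."""
    # Upper-case sigma becomes σ or ς here, depending on its position.
    lowered = label.lower()
    # Decompose so that accented vowels become base letter + U+0301.
    decomposed = unicodedata.normalize("NFD", lowered)
    stripped = decomposed.replace(TONOS, "")
    # Recompose whatever is left, then fold ς to σ.
    return unicodedata.normalize("NFC", stripped).replace(FINAL_SIGMA, SIGMA)
```

Note that this removes only the acute accent; other marks, such as the dialytika, are left in place.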
>
> Then one of the biggest remaining issues is what to do about display.
> You correctly pointed out that allowing the full range of IDNA2003
> mappings in the checking that is performed on the display name (in the
> idndisp.txt scenario) could lead to a confusing variety of displays.
> This is an area where we could try a number of different things,
> perhaps dependent on language or context. For example, if Japanese
> mobile phone implementers do not wish to take up a lot of space with
> full-width Latin letters, they might disallow those in display (and
> simply display the normal width ones).
>
> Alternatively, we could pursue Mark's proposal, where you use the
> upper/lower case of the A-label in the DNS response to indicate a
> display preference. For example, Vaggelis is written as Βαγγέλης.
> IDNA2003 maps this to βαγγέλησ (note the lower-case β, the tonos on έ
> and the normal sigma σ). The local mapping would strip the tonos to
> give βαγγελησ and the A-label would be xn--mxabeakm1a9d. As far as I
> can tell, only the α, ε and η in this name can take a tonos, and the
> σ should be displayed as ς in this case. So we need 4 bits of info
> (one each for α, ε, η and σ), and in this
> case we have far more than 4 ASCII letters in the A-label that can be
> used for those bits. Since we only need 4 bits for βαγγελησ, we use
> the first 4 letters of the A-label, namely, x, n, m and x. Since we
> want the ε to have a tonos (έ) and since we want the sigma to be
> final, the n and the 2nd x should be upper-case, giving
> xN--mXabeakm1a9d.
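
The worked example above can be sketched in code. This is only one reading of the scheme: the set of characters that consume a bit (`HINTABLE`), the function names, and the rule for ignoring hints are assumptions made for illustration. The A-label itself comes from Python's standard "punycode" codec.

```python
import unicodedata

TONOS = "\u0301"
# Assumed set of characters that each consume one hint bit:
# the vowels that can take a tonos, plus sigma.
HINTABLE = "αεηιουωσ"


def fold(label: str) -> str:
    """Local mapping: lower-case, strip tonos, fold final sigma."""
    s = unicodedata.normalize("NFD", label.lower()).replace(TONOS, "")
    return unicodedata.normalize("NFC", s).replace("ς", "σ")


def encode_hint(display: str) -> str:
    """Build an A-label whose letter case carries the display hints."""
    mapped = fold(display)
    shown = unicodedata.normalize("NFC", display.lower())
    # One bit per hintable character: set if the displayed form differs.
    bits = iter([m != d for m, d in zip(mapped, shown) if m in HINTABLE])
    a_label = "xn--" + mapped.encode("punycode").decode("ascii")
    # Spend the bits on the ASCII letters of the A-label, left to right.
    return "".join(
        ch.upper() if ch.isalpha() and next(bits, False) else ch
        for ch in a_label
    )


def decode_hint(a_label: str) -> str:
    """Recover the display form from a case-hinted A-label."""
    bits = [ch.isupper() for ch in a_label if ch.isalpha()]
    mapped = a_label[4:].lower().encode("ascii").decode("punycode")
    needed = sum(ch in HINTABLE for ch in mapped)
    # Upper-case letters beyond the bits we need suggest that some software
    # upper-cased the whole label, so ignore the hint entirely.
    if any(bits[needed:]):
        return mapped
    pending = iter(bits)
    out = []
    for ch in mapped:
        if ch in HINTABLE and next(pending, False):
            out.append("ς" if ch == "σ"
                       else unicodedata.normalize("NFC", ch + TONOS))
        else:
            out.append(ch)
    return "".join(out)
```

Under these assumptions, encode_hint("Βαγγέλης") gives xN--mXabeakm1a9d, and decoding that gives βαγγέλης; the upper-case Β is not recoverable because the scheme spends no bit on it. An all-lower-case or all-upper-case A-label decodes to the plain mapped form βαγγελησ. Labels with more hintable characters than ASCII letters in the A-label would lose hints in this sketch.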
>
> Now, each language or script is likely to have certain needs in the
> local mapping and display scheme. So if we pursued this, I believe it
> would be best to separate the local mapping and display scheme from
> the current IDNA2008 documents (definition, protocol, table, bidi and
> rationale). It might be OK to publish the local mapping and display
> scheme later (after the core IDNA2008 documents).
>
> During the transition period, we'd have new clients that strip tonos
> and old clients that don't, so DNAMEs would still be necessary (until
> the old clients almost disappear). We'd also have old clients that
> ignore the display scheme and various pieces of software that break
> the display scheme by lower-casing or upper-casing the whole A-label,
> but that might not be so serious. If all of the letters in the A-label
> were upper-case but the Unicode string did not contain enough
> characters that need those bits (e.g. for tonos and final sigma), then
> the client could ignore that display hint. (And there's the special
> Appendix A in Punycode (RFC 3492), but I don't know whether anybody
> has implemented that, and it may not have been specified properly,
> since the last character in the delta might be a digit, which does not
> have upper/lower case.)
>
> This email is getting too long, but Eszett, ZWJ and ZWNJ should not be
> placed in the xn-- space. They should receive a different prefix. The
> local mappers should try both (e.g. Eszett with new prefix, and ss
> with or without xn--, depending on the rest of the string).
>
> And if the French really want to distinguish ecole and Ecole, they
> will not only need a different prefix, but also something other than
> Punycode (as far as I can tell).
>
> Erik
>