Final Sigma (was: RE: Esszett, Final Sigma, ZWJ and ZWNJ)

Erik van der Poel erikv at google.com
Sun Mar 1 19:52:50 CET 2009


Hi John,

This is an interesting discussion about server-side matching and
display issues. More comments below.

On Fri, Feb 27, 2009 at 3:05 PM, John C Klensin <klensin at jck.com> wrote:
> However, there is one difference if one went to a server-side
> matching model (independent of whether the input to that model
> was Punycode, UTF-8, or something else).  If the comparison and
> equality/ equivalence check is done on the server, then we could
> go back to separating the question of "what is represented and
> encoded" from that of "what matches", just as the ASCII DNS
> model does with case-matching.
>
> From that point of view, I don't know where Eszett or the French
> discussions would fall out, but it is clear to me that the
> preferred solution to Final Sigma would be to keep it in the
> stored domain names (facilitating the desired display) but be
> sure that the matching procedure treated upper-case sigma,
> lower-case sigma, and final sigma as equivalent for DNS purposes.
>
> Doing that matching operation on the server would, however,
> require modifications to the DNS at least as significant as
> those that Andrew described and, like them, could not
> realistically be expected to be deployed in less than a decade
> and perhaps much longer.   And, as you suggest, we would then
> have to wrestle with exactly the same issues about what should
> be considered equal (matching) and what would not -- doing
> server-side matching would merely parse out the display issues.

I agree that server-side matching would not work, because we would
have the same issues about what should be considered equal. This is
language-dependent, as we have seen with the Turkish case-mappings for
the letter i. We probably cannot perform language-dependent matching
on the server side, because it would be hard to pass language
information to the server in the current DNS.

So, we are left with client-side /mapping/ because you cannot /match/
on the client, where we don't have the set of names to match against.

Until now, most implementations have been performing global
(language-independent) mappings, but there appears to be a need for
local (language-dependent) mappings. The most commonly used example of
local mapping is the Turkish i.

So, the key question is where to introduce local mapping. I believe
the answer may be protocol-dependent. In this email, I will divide and
conquer, focusing on HTML only. Later, we can see whether this model
applies to other protocols.

In the case of HTML, the implementers have unfortunately allowed the
href attribute to become polluted with non-Punycode IDN labels. They
did this by applying the IDNA2003 mappings to labels found inside
hrefs. Once they did this, we saw a number of HTML files (small in
relative terms, but not in absolute terms) take advantage of the
mappings.

Now we seem to be stuck with these mappings because a number of key
individuals appear to be reluctant to get rid of the mappings. Note
that I said "seem to be" stuck.

Let's assume that we are stuck with those mappings in HTML, just for
the moment. Where would we perform local mappings? I am going to give
two examples: keyboard and registrar.

Keyboard. Specifically, the place where the user (not HTML author) can
type a URL. The user selects the language of the UI menus, dialogs,
etc, in some fashion. For example, a Turkish user might buy a PC in
Turkey. That copy of Windows and MSIE might come up in Turkish. In
that case, MSIE ought to perform Turkish case-mappings for the letter
i.

Registrar. A Turkish registrant uses a Web search engine to find a
registrar with a Turkish UI. The registrant clicks on that link and is
presented with a Turkish HTML document containing a form where the
registrant can type the desired domain name. Given that the HTML
document is in Turkish, the registrar performs Turkish case-mappings
and asks the registrant to confirm the (lower-cased) name. This might
be accompanied by a "Help" link that the registrant can click on to
get an explanation of the case-mappings that were just performed. In
general, however, there would be no surprises, because the registrar
performs locally well-known mappings.
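
A minimal sketch of the keyboard and registrar examples above, assuming
the UI language is available as a simple language tag. The function name
local_lower and the ui_language parameter are made up for illustration
and are not part of any IDNA document:

```python
def local_lower(label: str, ui_language: str) -> str:
    """Lower-case a label, using Turkish rules when the UI language is Turkish."""
    if ui_language == "tr":
        # Turkish: I lower-cases to dotless i (U+0131),
        # and dotted capital I (U+0130) lower-cases to plain i.
        label = label.replace("I", "\u0131").replace("\u0130", "i")
    # str.lower() is the global (language-independent) mapping.
    return label.lower()


print(local_lower("ISTANBUL", "en"))  # istanbul
print(local_lower("ISTANBUL", "tr"))  # ıstanbul
```

A registrar could show the result of local_lower to the registrant for
confirmation before registering the name.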

Now let's turn to the Greek tonos. Unlike the case of the letter i,
where the Turkish case-mappings are different from those used in
English and many other languages, the Greek script and tonos accents
are /not/ used by very many languages other than Greek. Even if there
are language communities that use the Greek script but do not want the
tonos to be stripped the way .gr folks want, those language
communities can be served by language-specific mappings in e.g.
keyboard and registrar.

So, for the Greek script, UIs for Latin-script-based languages like
English can strip the tonos and fold the three sigma forms (upper-case,
lower-case and final) to normal lower-case sigma. This means that even
if some special-interest group (e.g. mathematicians, scientists) wants
to distinguish normal sigma from final sigma in the DNS, it cannot do
so unless it successfully lobbies for its own "UI language" and a new
prefix other than xn-- (given that xn-- labels containing final sigma
will cause problems in contexts like HTML).
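
As a rough illustration of the Greek mapping described above, the
following sketch strips the tonos and folds all sigma forms to normal
lower-case sigma. It assumes that the tonos can be removed by dropping
U+0301 after canonical decomposition; the function name greek_local_map
is made up for this example:

```python
import unicodedata

TONOS = "\u0301"        # combining acute accent; tonos decomposes to this
FINAL_SIGMA = "\u03c2"
SIGMA = "\u03c3"


def greek_local_map(label: str) -> str:
    """Strip the tonos and fold upper-case and final sigma to normal sigma."""
    decomposed = unicodedata.normalize("NFD", label)
    stripped = "".join(ch for ch in decomposed if ch != TONOS)
    # lower() may yield a final sigma, so fold after lower-casing.
    folded = stripped.lower().replace(FINAL_SIGMA, SIGMA)
    return unicodedata.normalize("NFC", folded)
```

Other accents, such as the dialytika, are left alone in this sketch.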

Then one of the biggest remaining issues is what to do about display.
You correctly pointed out that allowing the full range of IDNA2003
mappings in the checking that is performed on the display name (in the
idndisp.txt scenario) could lead to a confusing variety of displays.
This is an area where we could try a number of different things,
perhaps dependent on language or context. For example, if Japanese
mobile phone implementers do not wish to take up a lot of space with
full-width Latin letters, they might disallow those in display (and
simply display the normal width ones).

Alternatively, we could pursue Mark's proposal, where you use the
upper/lower case of the A-label in the DNS response to indicate a
display preference. For example, Vaggelis is written as Βαγγέλης.
IDNA2003 maps this to βαγγέλησ (note the lower-case β, the tonos on έ
and the normal sigma σ). The local mapping would strip the tonos to
give βαγγελησ and the A-label would be xn--mxabeakm1a9d. As far as I
can tell, only the α, ε and η can have a tonos, and the σ should be
displayed as ς in this case. So we need 4 bits of info, and in this
case we have far more than 4 ASCII letters in the A-label that can be
used for those bits. Since we only need 4 bits for βαγγελησ, we use
the first 4 letters of the A-label, namely, x, n, m and x. Since we
want the ε to have a tonos (έ) and since we want the sigma to be
final, the n and the 2nd x should be upper-case, giving
xN--mXabeakm1a9d.
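
The worked example above can be sketched as follows. This is only an
illustration of the idea, not a specification: the names set_hint and
display_form are invented here, and the rule that every Greek vowel and
every sigma consumes one bit, in order, is an assumption that happens to
agree with the example. The A-label is taken from the example rather
than computed.

```python
import unicodedata

# Assumption: these letters (alpha, epsilon, eta, iota, omicron, upsilon,
# omega) may carry a tonos, so each one consumes a hint bit.
VOWELS = "\u03b1\u03b5\u03b7\u03b9\u03bf\u03c5\u03c9"
SIGMA, FINAL_SIGMA, TONOS = "\u03c3", "\u03c2", "\u0301"


def set_hint(a_label, bits):
    """Upper-case the i-th ASCII letter of the A-label when bits[i] is set."""
    remaining = list(bits)
    out = []
    for ch in a_label.lower():
        if ch.isalpha() and remaining:
            ch = ch.upper() if remaining.pop(0) else ch
        out.append(ch)
    return "".join(out)


def display_form(u_label, a_label):
    """Rebuild the display string from the mapped U-label and hinted A-label."""
    bits = [ch.isupper() for ch in a_label if ch.isalpha()]
    out = []
    for ch in u_label:
        if ch in VOWELS or ch == SIGMA:
            if bits and bits.pop(0):
                ch = FINAL_SIGMA if ch == SIGMA else ch + TONOS
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))


hinted = set_hint("xn--mxabeakm1a9d", [False, True, False, True])
print(hinted)  # xN--mXabeakm1a9d
```

Note that these 4 bits restore the tonos and the final sigma only; the
upper-case first letter of the name is not covered by them.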

Now, each language or script is likely to have certain needs in the
local mapping and display scheme. So if we pursued this, I believe it
would be best to separate the local mapping and display scheme from
the current IDNA2008 documents (definition, protocol, table, bidi and
rationale). It might be OK to publish the local mapping and display
scheme later (after the core IDNA2008 documents).

During the transition period, we'd have new clients that strip tonos
and old clients that don't, so DNAMEs would still be necessary (until
the old clients almost disappear). We'd also have old clients that
ignore the display scheme and various pieces of software that break
the display scheme by lower-casing or upper-casing the whole A-label,
but that might not be so serious. If all of the letters in the A-label
were upper-case but the Unicode string did not contain enough
characters that need those bits (e.g. for tonos and final sigma), then
the client could ignore that display hint. (And there's the special
Appendix A in Punycode (RFC 3492), but I don't know whether anybody
has implemented that, and it may not have been specified properly,
since the last character in the delta might be a digit, which does not
have upper/lower case.)
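
One possible reading of the "ignore that display hint" rule, as a
sketch: if any upper-case letter appears beyond the positions that the
Unicode string actually needs, assume that some software upper-cased the
label and do not trust the hint. The name usable_hint and the choice of
hint-carrying characters are assumptions made for this example:

```python
# Assumption: Greek vowels and normal sigma are the characters that need a bit.
HINT_CHARS = "\u03b1\u03b5\u03b7\u03b9\u03bf\u03c5\u03c9\u03c3"


def usable_hint(u_label, a_label, hint_chars=HINT_CHARS):
    """Return False when the A-label has upper-case letters the hint cannot use."""
    needed = sum(1 for ch in u_label if ch in hint_chars)
    flags = [ch.isupper() for ch in a_label if ch.isalpha()]
    # Upper-case letters past the needed positions carry no display info.
    return not any(flags[needed:])
```

A label that was lower-cased as a whole simply yields an empty hint, so
the client would display the mapped form unchanged.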

This email is getting too long, but Eszett, ZWJ and ZWNJ should not be
placed in the xn-- space. They should receive a different prefix. The
local mappers should try both (e.g. Eszett with new prefix, and ss
with or without xn--, depending on the rest of the string).
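
To make the "try both" idea concrete for Eszett, here is a hedged
sketch. The prefix "zz--" is purely hypothetical (no such prefix has
been assigned), and the function names are made up. Python's built-in
"punycode" codec is used for the encoding step:

```python
NEW_PREFIX = "zz--"  # hypothetical prefix, for illustration only
ACE_PREFIX = "xn--"
ESZETT = "\u00df"


def to_ace(label, prefix):
    """Return the label as-is if it is ASCII, else prefix + Punycode."""
    if label.isascii():
        return label
    return prefix + label.encode("punycode").decode("ascii")


def eszett_candidates(label):
    """List the names a local mapper might look up, in order."""
    label = label.lower()
    if ESZETT not in label:
        return [to_ace(label, ACE_PREFIX)]
    return [
        to_ace(label, NEW_PREFIX),                         # Eszett preserved
        to_ace(label.replace(ESZETT, "ss"), ACE_PREFIX),   # Eszett mapped to ss
    ]
```

The second candidate carries xn-- only when the rest of the string is
still non-ASCII after the Eszett has been replaced.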

And if the French really want to distinguish ecole and Ecole, they
will not only need a different prefix, but also something other than
Punycode (as far as I can tell).

Erik

