Q2: What mapping function should be used in a revised IDNA2008 specification?

Wed Apr 1 17:29:06 CEST 2009

I tend to agree that many of the NFKC mappings are unnecessary since
people don't type those accidentally. They are potentially even
confusing in the context of domain names. One downside is that
implementations would need separate tables or routines for NFKC and
IDNA.
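
To make the kinds of NFKC mappings under discussion concrete, here is
a minimal Python sketch using only the standard unicodedata module. The
helper name compat_tag and the sample characters are my own choices for
illustration; nothing here is taken from an IDNA draft.

```python
import unicodedata

def compat_tag(ch):
    """Return the compatibility tag (e.g. '<super>') from the character's
    decomposition field in the Unicode data, or None if the character has
    no decomposition or only a canonical one."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp.split(">", 1)[0] + ">"
    return None

# Illustrative samples: superscript two, vulgar fraction one half,
# fullwidth A, halfwidth katakana KA, black-letter capital H.
samples = ["\u00B2", "\u00BD", "\uFF21", "\uFF76", "\u210C"]
for ch in samples:
    mapped = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {compat_tag(ch)} -> {mapped!r}")
```

The output depends on the Unicode version bundled with the Python
build, so treat it as a way to explore the data rather than as a
normative table.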

I also think that it would be good to look for 3-way relationships
like Eszett/ss/SS and Final/Normal/Capital Sigma in Unicode CaseFold
and decide on a case-by-case basis which ones should be mapped,
especially if we can provide a display mechanism.
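
As a rough illustration of where LowerCase and CaseFold diverge for
these two 3-way groups, the following Python sketch compares the
built-in str.lower() and str.casefold(). Python's methods are only a
stand-in for the Unicode operations here, and the sample list is my
own.

```python
# Eszett, capital Eszett, final sigma, normal sigma, capital sigma.
samples = ["\u00DF", "\u1E9E", "\u03C2", "\u03C3", "\u03A3"]

for ch in samples:
    # lower() applies the lowercase mapping; casefold() applies the
    # full case folding, which also merges Eszett with "ss" and
    # final sigma with normal sigma.
    print(f"U+{ord(ch):04X} lower={ch.lower()!r} casefold={ch.casefold()!r}")
```

With LowerCase alone, Eszett and final sigma stay distinct from "ss"
and normal sigma; with CaseFold they are merged, which is the
difference a case-by-case decision would have to settle.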

However, I would take this one step further, and look for 3-way
relationships that are not in Unicode CaseFold, such as Greek
Capital/Small-with-Tonos/Small-without-Tonos. For these, I think one
possible answer is to disallow the Small-with-Tonos. We wouldn't even
need a transition mechanism. Over time, the Small-with-Tonos would
disappear in stored and registered domain names, and the Greeks could
stop bundling for that 3-way.
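
A sketch of what such a check could look like, in Python. The function
name has_small_with_tonos is hypothetical, and the approach (NFC first,
then per-character NFD and a test for U+0301 on a lowercase Greek base)
is my assumption about one way to detect these letters, not a proposed
rule for the specification.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # the tonos decomposes canonically to this mark

def has_small_with_tonos(label):
    """Return True if the label contains a lowercase Greek letter whose
    canonical decomposition includes the combining acute (tonos)."""
    # Assumption: compose first so that base + combining mark sequences
    # are seen as single precomposed letters.
    label = unicodedata.normalize("NFC", label)
    for ch in label:
        nfd = unicodedata.normalize("NFD", ch)
        base_name = unicodedata.name(nfd[0], "")
        if ch.islower() and COMBINING_ACUTE in nfd and "GREEK" in base_name:
            return True
    return False

# Case folding does not relate alpha-with-tonos to plain alpha, which
# is why this 3-way group falls outside CaseFold.
print("\u03AC".casefold() == "\u03B1".casefold())
print(has_small_with_tonos("\u03AC\u03B1"))
```

A registry or application could use a test like this to reject such
labels, while capital letters with tonos would still need their own
decision.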

Erik

2009/4/1 John C Klensin <klensin at jck.com>:
> FWIW, I pretty much agree with Martin on this, but would prefer
> that we try to restrict (1) to the simple LowerCase operation,
> rather than using CaseFold for mapping.  When the latter
> produces results different from LowerCase, some of them seem to
> astonish non-expert users.
>
> I also believe that compatibility mappings identified with
> <font> are clear candidates for exclusion.
>
>    john
>
>
> --On Wednesday, April 01, 2009 16:29 +0900 "Martin J. Dürst"
> <duerst at it.aoyama.ac.jp> wrote:
>
>> My preference would be to use a significantly more restricted
>> set of mappings than for IDNA2003. At the very highest level,
>> the IDNA2003 mappings contained:
>> 1) Case mappings
>> 2) NFC mappings (canonical equivalence)
>> 3) NFKC mappings (compatibility equivalence)
>>
>> I think the best thing would be to retain 1) and 2), but only
>> a very small part of 3). The reason for this is that 1) is
>> used as a parallel to the ASCII case equivalence in the ASCII
>> DNS, 2) is an inherent representational issue of an encoding
>> that (like Unicode) provides composing of accents and the
>> like, but 3) is a hodgepodge collection of various kinds of
>> equivalences.
>>
>> Indeed, in the Unicode data file, compatibility equivalences
>> are marked with various tags such as <super>,
>> <fraction>, <final>, <medial>, <vertical>, <small>, <wide>,
>> <narrow>, and so on.
>>
>> I haven't done a full analysis, but I think we need to keep
>> <wide> and <narrow> because of how East Asian IMEs work,
>> but we should definitely get rid of <super>, <fraction> (which
>> can produce slashes), and so on, because they really, really
>> don't make sense in a domain name context.
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>