NFKC and dots

Erik van der Poel erikv at google.com
Mon Mar 3 17:42:27 CET 2008


Martin and Simon,

The current drafts of IDNA200X are based on a model that is somewhat
different from IDNA2003's. IDNA200X specifies which characters are
allowed in Unicode labels that are encoded in Punycode and put on the
wire in a DNS packet.

IDNA200X does not specify (in detail) the "mappings" that were
introduced in IDNA2003 (and Stringprep2003), i.e. case folding, NFKC
normalization, mapping "to nothing" (deletion), and even the mapping of
dot-like characters to the regular dot.
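
To make the list of mappings concrete, here is a minimal Python sketch of
them, assuming the standard library's stringprep tables are an acceptable
stand-in for the Nameprep tables. The names map_label and split_labels are
mine, purely for illustration, and the sketch skips the prohibited-character
and bidi checks that Nameprep also performs.

```python
import stringprep
from unicodedata import ucd_3_2_0 as ucd  # Stringprep is defined on Unicode 3.2

# Dot-like characters that IDNA2003 (RFC 3490, section 3.1) maps to ".".
IDNA2003_DOTS = "\u3002\uff0e\uff61"


def map_label(label):
    """Apply Nameprep-style mappings to one label (hypothetical helper).

    Only the mapping steps are shown; prohibition and bidi checks are omitted.
    """
    mapped = []
    for ch in label:
        if stringprep.in_table_b1(ch):  # map "to nothing" (delete)
            continue
        mapped.append(stringprep.map_table_b2(ch))  # case folding
    return ucd.normalize("NFKC", "".join(mapped))


def split_labels(name):
    """Map the IDNA2003 dots to the regular dot, then split into labels."""
    for dot in IDNA2003_DOTS:
        name = name.replace(dot, ".")
    return name.split(".")


print([map_label(label) for label in split_labels("WWW\u3002Ex\u00adample\uff0eCOM")])
```

Note that split_labels only knows about the IDNA2003 dots; any other
dot-like character stays inside its label.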

IDNA200X essentially pushes those mappings to the UI and other
intervening layers. For example, we may even map a larger number of
different dot-like characters to the regular dot *in the UI*.

However, I see a clear need to maintain interoperability among HTML
processors, which have implemented the mappings specified in IDNA2003.
We cannot easily expand or shrink the set of dot-like characters that
are mapped to the regular dot *in HTML*.

So, we need a spec for interchange in the "intervening layer" that is
at the level of HTML. That spec must maintain interoperability with
IDNA2003 and must leave local mappings like the Turkish 'i' and
dot-like characters other than the IDNA2003 dots to the UI.
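
As a rough illustration of why the order of operations matters, the sketch
below runs the test string from Simon's message (quoted below) through
Python's built-in "idna" codec, which implements IDNA2003. The "whole-name
NFKC first" variant is only my approximation of the pre-processing idea
under discussion, not a description of what any particular browser does.

```python
import unicodedata

# Simon's test string: U+5341, U+2024 ONE DOT LEADER, then "com".
name = "\u5341\u2024com"

# IDNA2003 order: split on the IDNA2003 dots, then Nameprep each label.
# U+2024 is not one of those dots, so the whole string is a single label,
# and NFKC turns U+2024 into "." *inside* that label.
per_label = name.encode("idna")

# Pre-processing order (assumption): NFKC on the entire name first,
# then split into labels and encode each one.
whole_name = unicodedata.normalize("NFKC", name).encode("idna")

print(per_label)   # one ACE label with a literal "." inside it
print(whole_name)  # two labels; only the first is ACE-encoded
```

The two results differ, which is exactly the interoperability problem:
the same input yields different names depending on where NFKC is applied.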

Initially, I expect HTML to be the only application of this spec. It
looks like the email specs are going to specify U-labels (which are
not mapped; they are only Punycode-encoded). Of course, we may start
to see email implementations that relax this rule (against the wishes
of many), thereby opening certain cans of worms, in much the same way
that HTML implementors ignored the IDNA2003 rule that IDNA-unaware
domain name slots must receive Punycode-encoded labels.
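
For contrast with the mapped case, "only Punycode-encoded" can be sketched
as follows. This is a simplified illustration using Python's "punycode"
codec; the helper names are hypothetical, and the validity checks that a
real U-label must pass are not shown.

```python
ACE_PREFIX = "xn--"


def u_label_to_a_label(u_label):
    """Punycode-encode a label as-is: no case folding, no NFKC, no deletion."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def a_label_to_u_label(a_label):
    """Inverse of the above for a label that carries the ACE prefix."""
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


label = "b\u00fccher"
print(u_label_to_a_label(label))
print(a_label_to_u_label(u_label_to_a_label(label)) == label)
```

Because nothing is mapped, an upper-case input is encoded differently from
its lower-case form, whereas IDNA2003's ToASCII would fold the case first.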

Erik

On Mon, Mar 3, 2008 at 12:00 AM, Simon Josefsson <simon at josefsson.org> wrote:
> I think such a document could update IDNA2003, or at least provide
>  informational documentation on how parts of the community have chosen to
>  implement IDNA2003 instead.  As far as I understand, this relates to all
>  user entered hostnames, and is not restricted to HTML.
>
>  For reference and an example of a confusing string, see:
>  http://josefsson.org/idn.php/?data=%E5%8D%81%E2%80%A4com&profile=Nameprep&mode=toascii&charset=UTF-8&lastcharset=UTF-8
>
>  I'm told that neither MSIE nor Firefox yields the same IDN as the
>  correct xn--.com-pg0g here.  Arguably the MSIE/Firefox behaviour is more
>  reasonable.
>
>  /Simon
>
>
>
>  Martin Duerst <duerst at it.aoyama.ac.jp> writes:
>
>  > I think that some aspects of this may be related to HTML.
>  > But domain names are used much more widely than HTML, and
>  > it would be a bad idea to have HTML behave differently from
>  > other, similar formats. Insofar as IDNA2003 led to
>  > unintuitive or clearly underspecified behavior for
>  > generic (from an IDN viewpoint) "higher-level protocols",
>  > it should be fixed and the fix documented in IDNAbis.
>  > These considerations are crossing label boundaries, but
>  > then so do bidi considerations. Although wherever
>  > possible, we should limit IDN work to single-label
>  > considerations, cross-label issues are sometimes
>  > unavoidable.
>  >
>  > Regards,    Martin.
>  >
>  > At 15:42 08/03/03, Simon Josefsson wrote:
>  >>"Erik van der Poel" <erikv at google.com> writes:
>  >>
>  >>> Hi Shawn,
>  >>>
>  >>> Thanks for the info. After I sent that email, I discussed it with some
>  >>> of the ICU folks, and they also said that one way to do this would be
>  >>> to perform NFKC on the entire domain name before splitting it into
>  >>> labels. Mark's pre-processing draft says something similar:
>  >>>
>  >>> http://docs.google.com/Doc?id=dfqr8rd5_51c3nrskcx&pli=1
>  >>>
>  >>> Actually, I've been meaning to gather folks who are interested in HTML
>  >>> and IDNA so that we can discuss this pre-processing spec. However, I
>  >>> do not want to distract the nascent working group, which probably
>  >>> wants to focus on the on-the-wire specs (IDNA200X, 4 drafts: issues,
>  >>> protocol, tables and bidi).
>  >>
>  >>For what it's worth, I'm interested in seeing the work-around
>  >>documented.  Old IDNA behaviour is unintuitive here.
>  >>
>  >>/Simon
>  >>_______________________________________________
>  >>Idna-update mailing list
>  >>Idna-update at alvestrand.no
>  >>http://www.alvestrand.no/mailman/listinfo/idna-update
>  >
>  >
>  > #-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
>  > #-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp
>

