NFKC and dots

Simon Josefsson simon at
Tue Jan 15 16:26:26 CET 2008

Erik directed my attention to this thread, and I've looked into how
libidn behaves.

As far as I can tell, libidn follows the IDNA specification to the
letter.  See <>
for background to that conclusion.

Arguably the Firefox/MSIE behaviour is better in the sense that it
reduces user confusion, but it doesn't follow the IDNA specification.
I've put together a problem description, some discussion and a
recommendation (based on Erik's ideas) for the libidn manual.  Is there
any interest in updating RFC 3490 with the simple ideas here?  It seems
the ideas are deployed, and they offer some advantage to what's in the

Comments on whether ToASCII(NFKC(in)) would lead to other problems are
appreciated, as are all other comments on the text as well.


Appendix B On Label Separators

Some strings contain characters whose NFKC normalized form contains the
ASCII dot (0x2E, ".").  Examples of these characters are U+2024 (ONE
DOT LEADER) and U+248C (DIGIT FIVE FULL STOP).  The strings have the
interesting property that their IDNA ToASCII output will contain
embedded dots.  For example:

     ToASCII (hi U+248C com) =
     ToASCII (räksmörgås U+2024 com) =

   This demonstrates the two general cases: the first, where the ASCII
dot is part of an output that does not begin with the IDN prefix "xn--";
the second, where the dot is part of an output that is prefixed with
"xn--".
   The input strings are, from the DNS point of view, a single label.
The IDNA algorithm translates one label at a time.  Thus, the output is
expected to be only one label.  What is important here is to make sure
the DNS resolver receives the correct query.  The DNS protocol does not
use the dot to delimit labels on the wire, rather it uses length-value
pairs.  Thus the correct query would be for `{7}' and
`{22}' respectively.
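
   To make the length-value point concrete, here is a small sketch of
how a name is laid out on the wire.  The function name `encode_qname'
is an assumption for illustration, and the 7-octet label "hi5.com" is
only an illustrative stand-in for a label with an embedded dot.

```python
def encode_qname(labels):
    """Encode a sequence of byte-string labels as a DNS wire-format name:
    each label is preceded by a one-octet length, and the name ends with
    the zero-length root label."""
    out = bytearray()
    for label in labels:
        if not 0 < len(label) <= 63:
            raise ValueError("DNS labels must be 1..63 octets long")
        out.append(len(label))
        out += label
    out.append(0)  # root label terminates the name
    return bytes(out)


# One label that happens to contain a dot: the dot is just another octet.
single = encode_qname([b"hi5.com"])

# Two labels: what a resolver sends if the dot is treated as a separator.
split = encode_qname([b"hi5", b"com"])

print(single)
print(split)
```

   The two encodings differ, so whether the dot is treated as data or
as a separator changes which name is actually queried.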

   Some implementations (1) have decided that these input strings are
potentially confusing for the user.  The string "hi U+248C com" looks
like "hi5.com" on systems that support Unicode properly.  These
implementations do not follow RFC 3490.  They yield:

     ToASCII (hi U+248C com) = hi5.com
     ToASCII (räksmörgås U+2024 com) = xn--rksmrgs-5wao1o.com

   The DNS queries they perform are `{3}hi5{3}com' and
`{18}xn--rksmrgs-5wao1o{3}com' respectively.  Arguably, this leads to a
better user experience, and suggests that the IDNA specification is
sub-optimal in this area.

B.1 Recommended Workaround

It has been suggested to normalize the entire input string using NFKC
before passing it to IDNA ToASCII.  You may use
`stringprep_utf8_nfkc_normalize' or `stringprep_ucs4_nfkc_normalize'.
This avoids the problem, and appears to lead to behaviour similar to
that of IE/Firefox.
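
   The workaround can be sketched as follows.  This is not libidn code:
it assumes Python's standard `idna' codec (an RFC 3490 implementation
that splits on dots and applies ToASCII per label) as a stand-in for the
`idna_*' functions, and the helper name `to_ascii_nfkc_first' is
hypothetical.

```python
import unicodedata


def to_ascii_nfkc_first(name: str) -> bytes:
    """Hypothetical helper: NFKC-normalize the entire input first, so that
    any dots hidden inside compatibility characters become real ASCII
    dots, then convert to ASCII label by label."""
    normalized = unicodedata.normalize("NFKC", name)
    # The stdlib "idna" codec splits on dots and runs ToASCII on each label.
    return normalized.encode("idna")


print(to_ascii_nfkc_first("hi\u248ccom"))
print(to_ascii_nfkc_first("r\u00e4ksm\u00f6rg\u00e5s\u2024com"))
```

   With normalization done up front, the dot produced by U+2024 or
U+248C is seen as a label separator, which is the behaviour attributed
to IE/Firefox above.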

   Alternative workarounds are being considered.  Eventually Libidn may
implement a new flag to the `idna_*' functions that implements a
recommended way to work around this problem.

   ---------- Footnotes ----------

   (1) Notably Microsoft's Internet Explorer and Mozilla's Firefox, but
not Apple's Safari.
