NFKC and dots

Sun Jan 6 15:03:39 CET 2008

I think it may help with this discussion if I respond to Ken's
comments and Erik's data together, rather than separately.  This
is going to be long, but the discussion has turned up several
interesting issues and more than a few bugs in current practice
and/or specifications.

--On Saturday, 05 January, 2008 08:48 -0800 Erik van der Poel
<erikv at google.com> wrote:

> On Dec 12, 2007 6:42 PM, Kenneth Whistler <kenw at sybase.com>
> wrote:
>> If we had wanted to extend this set to all the compatibility
>> NFKC variants, then we would also add the following:
>> 
>> 2024  ONE DOT LEADER
>> FE12  PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
>> FE52  SMALL FULL STOP
>> 
>> However, there is no need for that at all, since those
>> characters will not be entered in by accident on Chinese and
>> Japanese computers.

But, while I'll stipulate that it is unlikely, they might be
entered by accident by someone picking characters from a chart.
I think this may be precisely where our view of the problem
differs.  That conclusion holds only as long as we assume that
all character entry will be by a person who uses language X
that, in turn, uses script Y, entering characters on a keyboard
layout familiar to people using that language and script.  If
one includes the
possibility of someone faced with a "foreign" (i.e., very
unusual to them) keyboard or picking characters by looking at a
pictorial chart and selecting them with a mouse or equivalent,
the picture changes to "does this character look (in whatever
font that chart is presented in) enough like that other one
that, together with whatever additional information is presented
with it, it might plausibly be confused with the relevant base
character".

It seems to me that we need to be careful that we are designing
for normal human beings -- people who, at most, are likely to be
thoroughly familiar with only a handful of scripts but who still
might encounter the Internet in unusual (to them) environments
-- rather than those who have spent time studying a broad range
of scripts or Unicode specialists who know immediately what NFKC
does to different characters.

Also note that IDNA does not use NFKC to perform the dot
equivalence mappings -- more on that below.

Once one moves to "might be selected from a chart or in a very
foreign environment" then, ignoring an issue that I'll discuss
below, it seems to me that one has two choices:

(i) map (actually convert) all dot-ish characters to U+002E,
including ones that don't map under NFKC because they are
something else entirely, such as U+0660 or U+06F0 (Arabic-Indic
Digit Zero and Extended Arabic-Indic Digit Zero, respectively)
or U+0701 or U+0702 (Syriac Superlinear and Sublinear Full
Stops).  Conversely, someone who is used to writing African
languages in Arabic-derived script but who is also used to
transliterations of those languages into Roman characters,
might, if we start dot-mapping, plausibly expect _their_ full
stop (U+061E, Arabic Triple Dot Punctuation Mark) to be
substitutable for ASCII U+002E.

(ii) Treat dot-mapping as a user interface matter, with
appropriate cautions, and make sure that identifiers on the wire
are as standardized --permit as few variant forms--  as possible.

I think the thinking behind the mapping/conversion rule in 3490
was correct.  If we could get it right, and get everything to
behave consistently and predictably to non-experts, then
treating all plausible dot-forms as if they were ASCII full stop
(U+002E) and converting them from whatever form was used into
U+002E when one converted from U-labels to A-labels would make a
lot of sense.  The problem is that one can't get it right in
enough cases to make it predictable, intuitive, and consistent
across the Internet.  If variant dots appear to an application
that is not IDNA-aware and does not make an LDH-and-U+002E
check, it will not be able to parse the FQDN into labels.  Such
an application is especially likely to get into trouble if
variant dots are used to separate labels all of which are
LDH-conforming, a situation that IDNA does not precisely
prohibit.

If one could know definitively at DNS lookup time whether an
FQDN was an extended internationalized one such that an
unextended application simply could not look up an IDN, we'd
have a lot more flexibility about this.  But IDNA was carefully
designed, for good reason, to avoid needing or having that
particular bit of knowledge.

Converting dots also subjects us to two other vulnerabilities.
First, we need to make fairly subjective decisions about what
converts and what does not (see the examples discussed above)
and, especially among users who don't understand the details of
our logic, those decisions may be intuitive to some and
astonishing to others.   And, second, it locks us in place at
some current version of Unicode since more characters that look
(to someone) like dots may be added later.  We run into nasty
forward and backward compatibility problems if we treat any of
the new characters as equivalent to dots and therefore subject
to conversion.   Conversely, to tell the users of such
characters that theirs don't get converted just because they
were added to Unicode too late is ethically and politically
untenable.

So I think we are forced into UI-level mapping, even though, in
a more perfect world, conversion in the protocol might be better
(although I'd still be arguing for a single form on the wire).

>> I'm agnostic about where FULL STOP and IDEOGRAPHIC FULL STOP
>> equivalence get handled in the protocol stack, by the way.

I'm not agnostic for the reasons discussed above.  But I want to
stress that we agree that they should be treated as equivalent,
at least in environments in which Ideographic Full Stop is likely
to occur in context.

> Speaking of U+2024 and where in the protocol stack to handle
> things, I just discovered that MSIE 7 and Firefox 2 both
> perform NFKC on this character, to yield U+002E (.). After
> that, they divide the host name into labels *again*, so the
> new U+002E becomes a new label separator.

I hope you are referring to U+3002 (Ideographic Full Stop) or
its width-variants (U+FF0E, fullwidth, and U+FF61, halfwidth),
since U+2024 (One Dot Leader) is not treated as a dot (ASCII
full stop, U+002E, variant) at all.  It is _mapped_ to U+002E by
NFKC, but that is intra-label and much too late to be relevant.
More on this below. I'm going to examine U+3002 as a better
example in the discussion below and then come back to the
problems with U+2024.  
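
For anyone who wants to check the NFKC behavior of the characters
mentioned in this thread, the following Python sketch prints what
each one normalizes to.  It relies on the Unicode database shipped
with the Python interpreter, which is an assumption on my part and
is newer than the Unicode 3.2 tables that Nameprep specifies; the
function name nfkc_report is made up for illustration.

```python
import unicodedata

# Dot-like characters mentioned in this thread.
CANDIDATES = ["\u002E", "\u2024", "\u3002", "\uFE12", "\uFE52", "\uFF0E", "\uFF61"]

def nfkc_report(chars):
    """Return (code point, name, code points after NFKC) for each character."""
    rows = []
    for ch in chars:
        mapped = unicodedata.normalize("NFKC", ch)
        rows.append((
            f"U+{ord(ch):04X}",
            unicodedata.name(ch),
            " ".join(f"U+{ord(m):04X}" for m in mapped),
        ))
    return rows

for code, name, mapped in nfkc_report(CANDIDATES):
    print(f"{code}  {name:<55} -> {mapped}")
```

Note that U+3002 is left unchanged by NFKC, which is why it needs a
separate rule if it is to be treated as a dot at all.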

> If we ever get around to writing a document about IDNA in
> HTML, we may want to make a note of this. I.e. the steps are:
> 
> (1) Divide the domain name into labels by looking for IDNA2003
> dots. (2) Perform Nameprep2003 on each non-ASCII label.
> (3) Divide each label into multiple labels, by looking for
> regular dots. (4) Perform Punycode2003 on each non-ASCII label.

This works reasonably well as long as the application that
invokes IDNA is the same as the application that performs
FQDN-with-dots translation to a length-value label list.  That
will not always be the case.  It is important to remember that
even U+002E is a valid character within a label.   For example,
while I don't believe that a sensible registry should permit
registration of what would typically appear (in dot-separated
form) as 
     foo\.bar.TLD.
it is valid in the DNS (see the discussion in RFC 1035 Section
5.1 and elsewhere).  Worse, once things are converted to
length-value form (using a notation I just made up but that
should be obvious), it would turn into
     (7)foo.bar(3)TLD(0)
An attempt to re-parse the string and turn it into
     (3)foo(3)bar(3)TLD(0)
would be an error and a fairly serious one at that.
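
The difference between parsing once and re-parsing can be sketched
in a few lines of Python.  This is only an illustration under
simplifying assumptions: it handles the backslash-character escape
but not the \DDD decimal form from RFC 1035, and the function names
(split_labels, length_value, naive_resplit) are hypothetical.

```python
def split_labels(name):
    """Split a presentation-form name on unescaped U+002E.

    A backslash protects the character that follows it.
    """
    labels, current, i = [], [], 0
    while i < len(name):
        ch = name[i]
        if ch == "\\" and i + 1 < len(name):
            current.append(name[i + 1])  # keep the escaped character literally
            i += 2
        elif ch == ".":
            labels.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    if current:
        labels.append("".join(current))
    return labels

def length_value(labels):
    """Render labels in the (length)label notation, ending with the root (0)."""
    return "".join(f"({len(label)}){label}" for label in labels) + "(0)"

def naive_resplit(labels):
    """The erroneous second pass: split already-parsed labels on U+002E again."""
    return [part for label in labels for part in label.split(".")]

parsed = split_labels("foo\\.bar.TLD.")
print(length_value(parsed))                 # correct single parse
print(length_value(naive_resplit(parsed)))  # the re-parse error
```

Once the first pass has consumed the backslash, nothing in the
label list records that the dot was escaped, so the second pass
cannot tell it from a separator.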

It is important to note that some protocols that invoke the DNS
permit escaping of characters within labels using the syntax
above (and that I've used for illustration below) and that some
do not.  In particular, SMTP (important in its own right and
because many other protocols use its rules) does not permit
within-label escaping at all (another good reason why sensible
registries should not get anywhere near this issue).  So we
already have application-to-application differences in how some
domain names that contain only ASCII characters are interpreted
and whether or not they are permitted.   We are not creating new
problems if we say that interpretation of variant dots is
application or UI-dependent.

Now consider applying the algorithm you describe above when some
mappable dot is used instead of U+002E in the middle-of-label
position in the example above.  Assume, for consistency with
your (corrected) example, that it is U+3002 and let me represent
it as X in deference to readers who either can't handle UTF-8 or
who don't have the appropriate fonts installed.  Then we start
with
     foo\Xbar.TLD.
It is parsed into the equivalent of
     (7)fooXbar(3)TLD(0)
Now, since the original escape information is lost, it is
re-parsed into
     (3)foo(3)bar(3)TLD(0)
which is exactly the error above.

So I contend that, in cases of escaped dots, the MSIE 7 and
Firefox 2 behavior you describe is a bug if U+3002 is used (it
is even worse for U+2024, see below).

But your description of what is occurring in IDNA terms may not
be precisely accurate because, regardless of the NFKC mapping
performed by Stringprep/Nameprep, the conversion/equivalence of
dot-equivalents is not performed by them but by a conformance
statement in IDNA (RFC 3490) itself.  And I can find no language
in RFC 3490 that indicates it, as distinct from Nameprep, should
(or even may) be involved multiple times on the same string.
Viewed that way, MSIE 7 and Firefox 2 are trying to work around
a protocol deficiency (whether they do it in an optimal fashion
or not).  That they have noticed and tried to compensate for the
protocol deficiency strengthens my view that there is something
a bit wrong with the model (YMMV, of course).

There is a little bit of ambiguity here because RFC 3490 does
not appear to make it precisely clear whether condition 3.1(1) is
applied before or after Nameprep.  This is one of the problems
with defining a standard in terms of an algorithm and then
putting algorithmic steps into a conformance clause and one of
the motivations for changing the definition model in IDNA200X.
However, I believe that a careful reading of RFC 3491 makes the
intent perfectly clear: since Nameprep is used only on labels,
not domain names (see the last two sentences of Section 1.1),
it is up to IDNA (RFC 3490) to figure out where the label
boundaries lie.  And its only discussion about label boundaries
is in its Section 3.1, at least as far as I can find.

> Interestingly, Opera 9 appears to perform a slightly different
> set of steps (see step 2):
> 
> (1) Divide the domain name into labels by looking for IDNA2003
> dots. (2) Perform Nameprep2003 on each non-ASCII label, and,
> if result is non-ASCII, perform Punycode2003.
> (3) Divide each label into multiple labels, by looking for
> regular dots.

Through step 2, this is actually the only way I can plausibly
interpret the way that IDNA invokes the equivalence rule in a
conformance statement.  One has to divide the FQDN into labels
treating all of the variant dot-forms as equivalent and only
then apply Nameprep and (if needed) Punycode conversion to the
individual labels.   I think step 3 either gets Opera into the
same trouble that MSIE and Firefox are in (see above and below)
or is a NOOP depending on whether they pay attention to escapes.
Using this algorithm, and the examples above,
   foo\.bar.TLD -> (7)foo.bar(3)TLD(0)
   foo.bar.TLD -> (3)foo(3)bar(3)TLD(0)
as specified in RFC 1035 and
   foo\Xbar.TLD -> (7)fooXbar(3)TLD(0)
   fooXbar.TLD -> (3)foo(3)bar(3)TLD(0)
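
The four cases above can be reproduced with a single-pass split
followed by per-label processing.  The sketch below is one possible
reading, not a reference implementation: the separator set is taken
from RFC 3490 Section 3.1(1), applying the backslash escape to
variant dots is my assumption, and split_once and to_ascii_labels
are hypothetical names.  ToASCII is the function provided by
Python's standard encodings.idna module.

```python
from encodings.idna import ToASCII

# The four separators listed in RFC 3490 Section 3.1(1).
IDNA2003_DOTS = {"\u002E", "\u3002", "\uFF0E", "\uFF61"}

def split_once(name):
    """Split on unescaped IDNA2003 dots; a backslash protects the next character."""
    labels, current, i = [], [], 0
    while i < len(name):
        ch = name[i]
        if ch == "\\" and i + 1 < len(name):
            current.append(name[i + 1])
            i += 2
            continue
        if ch in IDNA2003_DOTS:
            labels.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if current:
        labels.append("".join(current))
    return labels

def to_ascii_labels(name):
    """Split once, then run ToASCII (Nameprep, plus Punycode if needed) per label."""
    return [ToASCII(label) for label in split_once(name)]

for example in ("foo\\.bar.TLD", "foo.bar.TLD",
                "foo\\\u3002bar.TLD", "foo\u3002bar.TLD"):
    print(repr(example), "->", to_ascii_labels(example))
```

Because the labels are never split a second time, an escaped U+3002
stays inside its label and that label goes through Punycode, while
an unescaped one simply acts as a separator.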

There is a little bit of uncertainty in the next-to-last case
above although I think the rules are actually quite clear.
Since X (aka U+3002) is not mapped to U+002E in Nameprep or
Stringprep, I think the 1034/1035 rules probably apply and the
first label ends up as "foo(U+3002)bar".  Note that RFC 3490
doesn't say "convert to U+002E" but "MUST be recognized as..."
and, in the context of IDN-unaware slots only, "changing all the
label separators to U+002E".  Note that is "all label
separators" not "all dot-equivalents".  So, since X in foo\Xbar
is not a label separator, I don't believe there is any protocol
justification for changing it.  Hence IDNA presumably ends up
with
   foo\Xbar.TLD -> (7)fooXbar(3)TLD(0) -> 
        (11)foobar-rr3e(3)TLD(0)

If my reading of the spec is correct, then both the ICU and GNU
Libidn test pages get it wrong, which is pretty scary, arguably
even more scary than the three browsers getting it wrong (albeit
in different ways and if Opera applies its third step as you
describe).

> Opera 9 is somewhat more conformant to RFC 3490, but it
> re-divides the labels instead of inserting 0x2E (.) into the
> DNS packet. (One might argue that RFC 3490 did not really take
> this into account.)

If the dot is a label-separator, it doesn't end up in the DNS
packet in any event.  But, regardless, I don't think one can
"re-divide" without violating IDNA as I read it.

> I haven't tried it in Safari 3 or MSIE 6 with Verisign
> plug-in. The HTML I used for testing was:
> 
> <a href="http://google&#x2024;com">one</a><br>
> <a href="http://&#x5341;&#x2024;com">two</a>

Again, if U+2024 is being treated as a label-separating dot by
anything at all, it is a protocol violation.  The "dot
equivalence" text in RFC 3490 doesn't list it and while
Stringprep maps it to U+002E (via NFKC), RFC 3491 is very clear
that Nameprep is used on labels, not FQDNs.

But, given this, let's come back to U+2024.   We start with,
e.g.,
   google&#x2024;com
Since none of the characters in this string is on the "dot
equivalent" list of RFC 3490 Section 3.1(1), this string parses
as a single label which is then passed to Nameprep.  It comes
out of Nameprep (via NFKC) as a single label  
   google.com
or what would be represented in DNS and HTTP escape forms,
respectively, as 
   google\.com  or  google&#x002E;com
That is an ASCII label.  SMTP would reject it because it won't
permit "\" to appear in a domain name nor "." in a label.    I
presume that HTTP would try to treat it as 
   (10)google.com(0)
but that typical browsers, before getting to HTTP, would either
get hysterical or would turn it into
   (3)www(10)google.com(3)com(0)
Not being particularly fond of obvious phishing opportunities,
I'd prefer "hysterical", but that is another matter.
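
The U+2024 case can be checked with Python's standard
encodings.idna.nameprep function.  This is a sketch of the reading
given above, in which the separator test comes first and Nameprep
sees the whole string as one label; the helper name
prep_as_single_label is made up for illustration.

```python
from encodings.idna import nameprep

# Separators recognized by RFC 3490 Section 3.1(1); U+2024 is not among them.
RFC3490_DOTS = {"\u002E", "\u3002", "\uFF0E", "\uFF61"}

def prep_as_single_label(name):
    """Return the Nameprep output if the name has no RFC 3490 separator.

    Raises ValueError if a separator is present, since the name would
    then not be a single label.
    """
    if any(ch in RFC3490_DOTS for ch in name):
        raise ValueError("name contains a label separator")
    return nameprep(name)

label = prep_as_single_label("google\u2024com")
print(repr(label), "length", len(label))
```

The result is one ten-octet ASCII label with U+002E in the middle
of it, not two labels.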

Now, to complete this already over-long story, let's compare the
above, with perceived ambiguities about processing, bugs
(different bugs) in browsers and test programs, etc., to what
happens with IDNA200X as proposed.

First, because there is no dot-equivalence rule in the protocol,
the only label separator is U+002E itself.

Second, because NFKC mapping is not really used and characters
that are mapped into others by NFKC are prohibited in the
protocol (i.e., treated as NEVER), U+2024 MUST NOT appear in a
label.  Any label containing it --in the protocol or on the
wire-- is an unambiguous error and gets rejected by a conforming
implementation.
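
A minimal sketch of that second rule, assuming the test is simply
"any character that NFKC would change is disallowed".  The real
IDNA200X tables involve more than this one criterion, the check
uses the interpreter's own Unicode database, and the function name
reject_nfkc_variants is hypothetical.

```python
import unicodedata

def reject_nfkc_variants(label):
    """Return the label unchanged, or raise ValueError if any single
    character in it would be altered by NFKC."""
    for ch in label:
        if unicodedata.normalize("NFKC", ch) != ch:
            raise ValueError(
                f"U+{ord(ch):04X} is changed by NFKC and may not appear in a label"
            )
    return label

for candidate in ("google", "google\u2024com"):
    try:
        reject_nfkc_variants(candidate)
        print(repr(candidate), "accepted")
    except ValueError as err:
        print(repr(candidate), "rejected:", err)
```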

And, third, because these rules, and the prohibitions against
characters that are mapped into other things being put on the
wire, are clear, an application user interface may say "in my
particular environment users are as likely to confuse and
mis-enter U+2024 with U+002E as they are, say U+3002.  Indeed,
they are more likely to do so since they are not using an
ideographic script.  So I will  treat U+2024 as equivalent to
U+002E and warn users if I see U+3002 at all, both before I get
to IDNA".   A different formulation, also possible for a UI,
would be to say "all of the several-times cursed compatibility
characters are bogus and I'm going to apply a DNS-extended
version of NFKC to anything that will later be processed as an
FQDN".  That rule would be a near-necessity if something in the
operating environment applied NFKC to entire bodies of text
before strings got anywhere near the relevant application, which
I can imagine happening.  It is probably also the right rule to
use if one wanted a single pre-IDNA rule globally rather than
having localized UIs.   And, for that purpose, "DNS-extended
NFKC" would presumably consist of 
    NFKC
    Case-mapping as specified in Stringprep appendix B.2
    U+3002 -> U+002E
(note that the other two cases listed in RFC 3490 don't need
special treatment here because NFKC maps them to U+002E and
U+3002 respectively).
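
As a sketch only, the three steps listed above might look like this
in Python.  The name dns_extended_nfkc is hypothetical, the steps
are applied in the order listed, and two assumptions should be kept
in mind: unicodedata uses the interpreter's current Unicode version
while the standard stringprep module uses the Unicode 3.2 tables,
and this is meant as UI-level preprocessing, before IDNA is invoked.

```python
import stringprep
import unicodedata

def dns_extended_nfkc(text):
    """Apply NFKC, Stringprep B.2 case mapping, then U+3002 -> U+002E."""
    normalized = unicodedata.normalize("NFKC", text)
    folded = "".join(stringprep.map_table_b2(ch) for ch in normalized)
    return folded.replace("\u3002", "\u002E")

for raw in ("Google\u2024COM", "foo\uFF0Ebar", "foo\uFF61bar", "foo\u3002bar"):
    print(repr(raw), "->", repr(dns_extended_nfkc(raw)))
```

Only after this step would the string be split on U+002E, so there
is exactly one parse and no opportunity to lose escape information.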

Without the prohibitions, one can't do this and could easily end
up in a situation in which two applications, with different
interpretations about what IDNA really intends, treat
    google&#x2024;com
as any of 
    itself
    google\.com  (i.e., google&#x002E;com )   or
    google.com   (i.e., two labels)
And that is a fairly severe interoperability failure.

The proposed IDNA200X rule, with "DNS-extended NFKC" applied
first, interestingly produces a result equivalent to what
you describe the browsers as doing, with no trick reparsing that
could foul up escapes, assuming that one really wanted 
    google&#x2024;com
to turn into
    google.com (i.e., (6)google(3)com(0) )
My guess is that, if you want to permit U+2024 anywhere in the
system, that is exactly what you would want, both from the
standpoint of user intuition and because the other cases would
create phisher paradise.

    john