NFKC and dots

Sun Jan 6 19:14:21 CET 2008

John,

Thank you for taking the time to analyse these issues so thoroughly.

I assure you that I was referring to U+2024, and not U+3002.

From my point of view, Opera, ICU and GNU libidn are relatively minor
players in the Web arena. The players that truly determine how the Web
evolves and how its components interoperate are MSIE, Firefox and to
some extent Safari. Since MSIE 7 supports IDNA and there are quite a
lot of MSIE 7 users, more and more registrants are trying to use IDNA
on the Web. However, MSIE 6 still has a large market share and it does
not support IDNA, so the registrants and others tend to use A-labels.
The numbers I posted previously show this too.

However, since MSIE6's market share is dwindling, MSIE7's and
Firefox2's behavior may start to have more of an effect on the Web. Of
course, the MSIE and Firefox developers may come up with patches for
MSIE7 and Firefox2, or they may choose to implement MSIE8 and Firefox3
differently.

But my guess is that this is unlikely, since MSIE7's and Firefox2's
current behavior with U+2024 (and other characters that yield U+002E
under NFKC) is actually quite reasonable.
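
For anyone who wants to check which of the dot-like characters in this thread NFKC actually turns into U+002E, here is a minimal sketch using Python's standard unicodedata module. The candidate list and the helper name nfkc_report are my own choices for illustration; the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Dot-like characters mentioned in this thread (list chosen for illustration).
CANDIDATES = ["\u002E", "\u2024", "\u3002", "\uFE12", "\uFE52", "\uFF0E", "\uFF61"]

def nfkc_report(chars):
    """Return (code point, character name, NFKC result as code points) per character."""
    rows = []
    for ch in chars:
        mapped = unicodedata.normalize("NFKC", ch)
        rows.append((
            f"U+{ord(ch):04X}",
            unicodedata.name(ch),
            " ".join(f"U+{ord(c):04X}" for c in mapped),
        ))
    return rows

for code_point, name, mapped in nfkc_report(CANDIDATES):
    print(f"{code_point}  {name:55s} -> {mapped}")
```

Note that some of these end up as U+3002 rather than U+002E, which is why the IDNA2003 dot list and the NFKC results have to be looked at separately.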

Your point about users accidentally entering the "wrong" type of dot
is well taken. In fact, I only discovered this U+2024 issue after
someone else at Google sent me some data that happened to include it.
I.e. it does occur on the Web, whether accidental or not.

Of course, these things don't occur very often on the Web. But that is
not the point. The point is that implementors need to make some
decision about these details. And I would hope that the participants
on this mailing list agree that it would be good to get implementors
to move in the same direction.

I like your idea of a DNS-extended NFKC, if I understand it correctly.
How much of this would actually appear in your protocol draft,
rationale draft and/or elsewhere?

Erik

PS Opera 9 does roughly the same thing with U+2024 as ICU's and GNU
libidn's demo pages.

On Jan 6, 2008 6:03 AM, John C Klensin <klensin at jck.com> wrote:
> I think it may help with this discussion if I respond to Ken's
> comments and Erik's data together, rather than separately.  This
> is going to be long, but the discussion has turned up several
> interesting issues and more than a few bugs in current practice
> and/or specifications.
>
>
> --On Saturday, 05 January, 2008 08:48 -0800 Erik van der Poel
> <erikv at google.com> wrote:
>
> > On Dec 12, 2007 6:42 PM, Kenneth Whistler <kenw at sybase.com>
> > wrote:
> >> If we had wanted to extend this set to all the compatibility
> >> NFKC variants, then we would also add the following:
> >>
> >> 2024  ONE DOT LEADER
> >> FE12  PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
> >> FE52  SMALL FULL STOP
> >>
> >> However, there is no need for that at all, since those
> >> characters will not be entered in by accident on Chinese and
> >> Japanese computers.
>
> But, while I'll stipulate that it is unlikely, they might be
> entered by accident by someone picking characters from a chart.
> I think this may be precisely where our view of the problem
> differs.  That holds only as long as we assume that all
> character entry will be by a person who uses language X that,
> in turn, uses script Y, entering characters on a keyboard layout
> familiar to people using that language and script.  If one
> includes the possibility of someone faced with a "foreign"
> (i.e., very unusual to them) keyboard or picking characters by
> looking at a pictorial chart and selecting them with a mouse or
> equivalent, the picture changes to "does this character look (in
> whatever font that chart is presented in) enough like that other
> one that, together with whatever additional information is
> presented with it, it might plausibly be confused with the
> relevant base character".
>
> It seems to me that we need to be careful that we are designing
> for normal human beings -- people who, at most, are likely to be
> thoroughly familiar with only a handful of scripts but who still
> might encounter the Internet in unusual (to them) environments
> -- rather than those who have spent time studying a broad range
> of scripts or Unicode specialists who know immediately what NFKC
> does to different characters.
>
> Also note that IDNA does not use NFKC to perform the dot
> equivalence mappings -- more on that below.
>
> Once one moves to "might be selected from a chart or in a very
> foreign environment" then, ignoring an issue that I'll discuss
> below, it seems to me that one has two choices:
>
> (i) map (actually convert) all dot-ish characters to U+002E,
> including ones that don't map under NFKC because they are
> something else entirely, such as U+0660 or U+06F0 (Arabic-Indic
> Digit Zero and Extended Arabic-Indic Digit Zero, respectively)
> or U+0701 or U+0702 (Syriac Supralinear and Sublinear Full
> Stops).  Conversely, someone who is used to writing African
> languages in Arabic-derived script but who is also used to
> transliterations of those languages into Roman characters,
> might, if we start dot-mapping, plausibly expect _their_ full
> stop (U+061E, Arabic Triple Dot Punctuation Mark) to be
> substitutable for ASCII U+002E.
>
> (ii) Treat dot-mapping as a user interface matter, with
> appropriate cautions, and make sure that identifiers on the wire
> are as standardized --permit as few variant forms--  as possible.
>
> I think the thinking behind the mapping/conversion rule in 3490
> was correct.  If we could get it right, and get everything to
> behave consistently and predictably to non-experts, then
> treating all plausible dot-forms as if they were ASCII full stop
> (U+002E) and converting them from whatever form was used into
> U+002E when one converted from U-labels to A-labels would make a
> lot of sense.  The problem is that one can't get it right in
> enough cases to make it predictable, intuitive, and consistent
> across the Internet.  If variant dots appear to an application
> that is not IDNA-aware and does not make an LDH-and-U+002E
> check, it will not be able to parse the FQDN into labels.  Such
> an application is especially likely to get into trouble if
> variant dots are used to separate labels all of which are
> LDH-conforming, a situation that IDNA does not precisely
> prohibit.
>
> If one could know definitively at DNS lookup time whether an
> FQDN was an extended internationalized one such that an
> unextended application simply could not look up an IDN, we'd
> have a lot more flexibility about this.  But IDNA was carefully
> designed, for good reason, to avoid needing or having that
> particular bit of knowledge.
>
> Converting dots also subjects us to two other vulnerabilities.
> First, we need to make fairly subjective decisions about what
> converts and what does not (see the examples discussed above)
> and, especially among users who don't understand the details of
> our logic, those decisions may be intuitive to some and
> astonishing to others.   And, second, it locks us in place at
> some current version of Unicode since more characters that look
> (to someone) like dots may be added later.  We run into nasty
> forward and backward compatibility problems if we treat any of
> the new characters as equivalent to dots and therefore subject
> to conversion.   Conversely, to tell the users of such
> characters that theirs don't get converted just because they
> were added to Unicode too late is ethically and politically
> untenable.
>
> So I think we are forced into UI-level mapping, even though, in
> a more perfect world, conversion in the protocol might be better
> (although I'd still be arguing for a single form on the wire).
>
> >> I'm agnostic about where FULL STOP and IDEOGRAPHIC FULL STOP
> >> equivalence get handled in the protocol stack, by the way.
>
> I'm not agnostic for the reasons discussed above.  But I want to
> stress that we agree that they should be treated as equivalent,
> at least in environments in which Ideographic Full Stop is likely
> to occur in context.
>
>
> > Speaking of U+2024 and where in the protocol stack to handle
> > things, I just discovered that MSIE 7 and Firefox 2 both
> > perform NFKC on this character, to yield U+002E (.). After
> > that, they divide the host name into labels *again*, so the
> > new U+002E becomes a new label separator.
>
> I hope you are referring to U+3002 (Ideographic Full Stop) or
> its width-variants (U+FF0E, fullwidth, and U+FF61, halfwidth),
> since U+2024 (One Dot Leader) is not treated as a dot (ASCII
> full stop, U+002E, variant) at all.  It is _mapped_ to U+002E by
> NFKC, but that is intra-label and much too late to be relevant.
> More on this below. I'm going to examine U+3002 as a better
> example in the discussion below and then come back to the
> problems with U+2024.
>
> > If we ever get around to writing a document about IDNA in
> > HTML, we may want to make a note of this. I.e. the steps are:
> >
> > (1) Divide the domain name into labels by looking for IDNA2003
> > dots. (2) Perform Nameprep2003 on each non-ASCII label.
> > (3) Divide each label into multiple labels, by looking for
> > regular dots. (4) Perform Punycode2003 on each non-ASCII label.
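
The four steps above can be sketched roughly as follows. This is only an illustration of the described behavior, not the browsers' actual code: NFKC plus lower() stands in for real Nameprep, and the function name browser_like_to_ascii is hypothetical.

```python
import re
import unicodedata

# Label separators listed in RFC 3490 section 3.1(1).
IDNA2003_DOTS = re.compile("[\u002E\u3002\uFF0E\uFF61]")

def browser_like_to_ascii(host):
    """Apply steps (1)-(4) as described; NFKC + lower() only approximates Nameprep."""
    result = []
    for label in IDNA2003_DOTS.split(host):                        # step 1
        if not label.isascii():
            label = unicodedata.normalize("NFKC", label).lower()   # step 2 (stand-in)
        for sub in label.split("."):                               # step 3: re-split
            if sub.isascii():
                result.append(sub)
            else:                                                  # step 4
                result.append("xn--" + sub.encode("punycode").decode("ascii"))
    return result

print(browser_like_to_ascii("google\u2024com"))
print(browser_like_to_ascii("\u5341\u2024com"))
```

With this sketch, U+2024 inside a single label becomes U+002E in step 2 and then acts as a separator in step 3, which is the behavior under discussion.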
>
> This works reasonably well as long as the application that
> invokes IDNA is the same as the application that performs
> FQDN-with-dots translation to a length-value label list.  That
> will not always be the case.  It is important to remember that
> even U+002E is a valid character within a label.   For example,
> while I don't believe that a sensible registry should permit
> registration of what would typically appear (in dot-separated
> form) as
>      foo\.bar.TLD.
> it is valid in the DNS (see the discussion in RFC 1035 Section
> 5.1 and elsewhere).  Worse, once things are converted to
> length-value form (using a notation I just made up but that
> should be obvious), it would turn into
>      (7)foo.bar(3)TLD(0)
> An attempt to re-parse the string and turn it into
>      (3)foo(3)bar(3)TLD(0)
> would be an error and a fairly serious one at that.
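
To make the escaped-dot case concrete, here is a small sketch of an escape-aware split next to a naive re-split. It handles only the single-character backslash escape, not the decimal escape form that RFC 1035 also defines, and the function names are made up for this example.

```python
def parse_presentation(name):
    # Split a dotted name into labels; a backslash makes the next character
    # literal label data, so an escaped dot does not end the label.
    labels, current, i = [], [], 0
    while i < len(name):
        ch = name[i]
        if ch == "\\" and i + 1 < len(name):
            current.append(name[i + 1])
            i += 2
        elif ch == ".":
            labels.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    if current:
        labels.append("".join(current))
    return labels

def length_value(labels):
    # Render labels in the ad hoc (n)label notation, ending with the root (0).
    return "".join(f"({len(label)}){label}" for label in labels) + "(0)"

def naive_resplit(labels):
    # What a second pass does if it looks for dots without escape information.
    return ".".join(labels).split(".")

labels = parse_presentation("foo\\.bar.TLD.")
print(length_value(labels))                 # escape-aware result
print(length_value(naive_resplit(labels)))  # result after the erroneous re-parse
```

Once the labels are held as a list, the escape information is gone, so any later pass that searches for dots cannot tell label data from separators.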
>
> It is important to note that some protocols that invoke the DNS
> permit escaping of characters within labels using the syntax
> above (and that I've used for illustration below) and that some
> do not.  In particular, SMTP (important in its own right and
> because many other protocols use its rules) does not permit
> within-label escaping at all (another good reason why sensible
> registries should not get anywhere near this issue).  So we
> already have application-to-application differences in how some
> domain names that contain only ASCII characters are interpreted
> and whether or not they are permitted.   We are not creating new
> problems if we say that interpretation of variant dots is
> application or UI-dependent.
>
> Now consider applying the algorithm you describe above when some
> mappable dot is used instead of U+002E in the middle-of-label
> position in the example above.  Assume, for consistency with
> your (corrected) example, that it is U+3002 and let me represent
> it as X in deference to readers who either can't handle UTF-8 or
> who don't have the appropriate fonts installed.  Then we start
> with
>      foo\Xbar.TLD.
> It is parsed into the equivalent of
>      (7)fooXbar(3)TLD(0)
> Now, since the original escape information is lost, it is
> re-parsed into
>      (3)foo(3)bar(3)TLD(0)
> which is exactly the error above.
>
> So I contend that, in cases of escaped dots, the MSIE 7 and
> Firefox 2 behavior you describe is a bug if U+3002 is used (it
> is even worse for U+2024, see below).
>
> But your description of what is occurring in IDNA terms may not
> be precisely accurate because, regardless of the NFKC mapping
> performed by Stringprep/Nameprep, the conversion/equivalence of
> dot-equivalents is not performed by them but by a conformance
> statement in IDNA (RFC 3490) itself.  And I can find no language
> in RFC 3490 that indicates it, as distinct from Nameprep, should
> (or even may) be involved multiple times on the same string.
> Viewed that way, MSIE 7 and Firefox 2 are trying to work around
> a protocol deficiency (whether they do it in an optimal fashion
> or not).  That they have noticed and tried to compensate for the
> protocol deficiency strengthens my view that there is something
> a bit wrong with the model (YMMD, of course).
>
> There is a little bit of ambiguity here because RFC 3490 does
> not appear to make it precisely clear whether condition 3.1(1) is
> applied before or after Nameprep.  This is one of the problems
> with defining a standard in terms of an algorithm and then
> putting algorithmic steps into a conformance clause and one of
> the motivations for changing the definition model in IDNA200X.
> However, I believe that a careful reading of RFC 3491 makes the
> intent perfectly clear: since Nameprep is used only on labels,
> not domain names (see the last two sentences of Section 1.1),
> it is up to IDNA (RFC 3490) to figure out where the label
> boundaries lie.  And its only discussion about label boundaries
> is in its Section 3.1, at least as far as I can find.
>
> > Interestingly, Opera 9 appears to perform a slightly different
> > set of steps (see step 2):
> >
> > (1) Divide the domain name into labels by looking for IDNA2003
> > dots. (2) Perform Nameprep2003 on each non-ASCII label, and,
> > if result is non-ASCII, perform Punycode2003.
> > (3) Divide each label into multiple labels, by looking for
> > regular dots.
>
> Through step 2, this is actually the only way I can plausibly
> interpret the way that IDNA invokes the equivalence rule in a
> conformance statement.  One has to divide the FQDN into labels
> treating all of the variant dot-forms as equivalent and only
> then apply Nameprep and (if needed) Punycode conversion to the
> individual labels.   I think step 3 either gets Opera into the
> same trouble that MSIE and Firefox are in (see above and below)
> or is a NOOP depending on whether they pay attention to escapes.
> Using this algorithm, and the examples above,
>    foo\.bar.TLD -> (7)foo.bar(3)TLD(0)
>    foo.bar.TLD -> (3)foo(3)bar(3)TLD(0)
> as specified in RFC 1035 and
>    foo\Xbar.TLD -> (7)fooXbar(3)TLD(0)
>    fooXbar.TLD -> (3)foo(3)bar(3)TLD(0)
>
> There is a little bit of uncertainty in the next-to-last case
> above although I think the rules are actually quite clear.
> Since X (aka U+3002) is not mapped to U+002E in Nameprep or
> Stringprep, I think the 1034/1035 rules probably apply and the
> first label ends up as "foo(U+3002)bar".  Note that RFC 3490
> doesn't say "convert to U+002E" but "MUST be recognized as..."
> and, in the context of IDN-unaware slots only, "changing all the
> label separators to U+002E".  Note that is "all label
> separators" not "all dot-equivalents".  So, since X in foo\Xbar
> is not a label separator, I don't believe there is any protocol
> justification for changing it.  Hence IDNA presumably ends up
> with
>    foo\Xbar.TLD -> (7)fooXbar(3)TLD(0) ->
>         (11)foobar-rr3e(3)TLD(0)
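
The encoded form of that first label can be reproduced with the punycode codec in Python's standard library. This sketch encodes only the bare label; a real A-label would additionally carry the "xn--" ACE prefix, which the notation above appears to leave out.

```python
# Label with U+3002 kept as ordinary label data (the escaped case).
label = "foo\u3002bar"

# Raw Punycode (RFC 3492) of the label, without the "xn--" ACE prefix.
encoded = label.encode("punycode").decode("ascii")

print(f"({len(encoded)}){encoded}")
```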
>
> If my reading of the spec is correct, then both the ICU and GNU
> Libidn test pages get it wrong, which is pretty scary, arguably
> even more scary than the three browsers getting it wrong (albeit
> in different ways and if Opera applies its third step as you
> describe).
>
>
> > Opera 9 is somewhat more conformant to RFC 3490, but it
> > re-divides the labels instead of inserting 0x2E (.) into the
> > DNS packet. (One might argue that RFC 3490 did not really take
> > this into account.)
>
> If the dot is a label-separator, it doesn't end up in the DNS
> packet in any event.  Either way, I don't think one can
> "re-divide" without violating IDNA as I read it.
>
> > I haven't tried it in Safari 3 or MSIE 6 with Verisign
> > plug-in. The HTML I used for testing was:
> >
> > <a href="http://google&#x2024;com">one</a><br>
> > <a href="http://&#x5341;&#x2024;com">two</a>
>
> Again, if U+2024 is being treated as a label-separating dot by
> anything at all, it is a protocol violation.  The "dot
> equivalence" text in RFC 3490 doesn't list it and while
> Stringprep maps it to U+002E (via NFKC), RFC 3491 is very clear
> that Nameprep is used on labels, not FQDNs.
>
> But, given this, let's come back to U+2024.   We start with,
> e.g.,
>    google&#x2024;com
> Since none of the characters in this string is on the "dot
> equivalent" list of RFC 3490 Section 3.1(1), this string parses
> as a single label which is then passed to Nameprep.  It comes
> out of Nameprep (via NFKC) as a single label
>    google.com
> or what would be represented in DNS and HTML escape forms,
> respectively, as
>    google\.com  or  google&#x002E;com
> That is an ASCII label.  SMTP would reject it because it won't
> permit "\" to appear in a domain name nor "." in a label.    I
> presume that HTTP would try to treat it as
>    (10)google.com(0)
> but that typical browsers, before getting to HTTP, would either
> get hysterical or would turn it into
>    (3)www(10)google.com(3)com(0)
> Not being particularly fond of obvious phishing opportunities,
> I'd prefer "hysterical", but that is another matter.
>
> Now, to complete this already over-long story, let's compare the
> above, with perceived ambiguities about processing, bugs
> (different bugs) in browsers and test programs, etc., to what
> happens with IDNA200X as proposed.
>
> First, because there is no dot-equivalence rule in the protocol,
> the only label separator is U+002E itself.
>
> Second, because NFKC mapping is not really used and characters
> that are mapped into others by NFKC are prohibited in the
> protocol (i.e., treated as NEVER), U+2024 MUST NOT appear in a
> label.  Any label containing it --in the protocol or on the
> wire-- is an unambiguous error and gets rejected by a conforming
> implementation.
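
A rough sketch of that second rule, under the simplifying assumption that "mapped into others by NFKC" can be tested one character at a time with Python's unicodedata module. The actual draft rules are table-based and cover more than this check; the function names are hypothetical.

```python
import unicodedata

def changed_by_nfkc(label):
    """Return the characters in the label that NFKC would map to something else."""
    return [ch for ch in label if unicodedata.normalize("NFKC", ch) != ch]

def check_label(label):
    """Reject a label containing any character that NFKC would change."""
    bad = changed_by_nfkc(label)
    if bad:
        listed = ", ".join(f"U+{ord(ch):04X}" for ch in bad)
        raise ValueError(f"label contains characters changed by NFKC: {listed}")
    return label

print(check_label("google"))
try:
    check_label("google\u2024com")
except ValueError as err:
    print(err)
```

Under this approximation U+2024 is rejected outright, while U+3002 passes the check because NFKC leaves it alone.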
>
> And, third, because these rules, and the prohibitions against
> characters that are mapped into other things being put on the
> wire, are clear, an application user interface may say "in my
> particular environment users are as likely to confuse and
> mis-enter U+2024 with U+002E as they are, say, U+3002.  Indeed,
> they are more likely to do so since they are not using an
> ideographic script.  So I will treat U+2024 as equivalent to
> U+002E and warn users if I see U+3002 at all, both before I get
> to IDNA".   A different formulation, also possible for a UI,
> would be to say "all of the several-times cursed compatibility
> characters are bogus and I'm going to apply a DNS-extended
> version of NFKC to anything that will later be processed as an
> FQDN".  That rule would be a near-necessity if something in the
> operating environment applied NFKC to entire bodies of text
> before strings got anywhere near the relevant application, which
> I can imagine happening.  It is probably also the right rule to
> use if one wanted a single pre-IDNA rule globally rather than
> having localized UIs.   And, for that purpose, "DNS-extended
> NFKC" would presumably consist of
>     NFKC
>     Case-mapping as specified in Stringprep appendix B.2
>     U+3002 -> U+002E
> (note that the other two cases listed in RFC 3490 don't need
> special treatment here because NFKC maps them to U+002E and
> U+3002 respectively).
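
The three-part "DNS-extended NFKC" described above might look like this as a pre-IDNA UI step. This is a sketch under stated assumptions: str.casefold() is used as a stand-in for the Stringprep B.2 case-mapping table, which it only approximates, and the function name is invented here.

```python
import unicodedata

def dns_extended_nfkc(text):
    """Pre-IDNA cleanup: NFKC, case folding, then U+3002 -> U+002E."""
    text = unicodedata.normalize("NFKC", text)   # compatibility mapping
    text = text.casefold()                       # stand-in for Stringprep B.2
    return text.replace("\u3002", "\u002E")      # the one dot NFKC does not map

for sample in ["google\u2024com", "Foo\uFF61Bar", "example\uFF0ETLD"]:
    cleaned = dns_extended_nfkc(sample)
    print(cleaned, cleaned.split("."))
```

Because the mapping runs once, before any label parsing, there is no second pass that could trip over escaped dots.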
>
> Without the prohibitions, one can't do this and could easily end
> up in a situation in which two applications, with different
> interpretations about what IDNA really intends, treat
>     google&#x2024;com
> as any of
>     itself
>     google\.com  (i.e., google&#x002E;com )   or
>     google.com   (i.e., two labels)
> And that is a fairly severe interoperability failure.
>
>
> Interestingly, the proposed IDNA200X rule, with "DNS-extended
> NFKC" applied first, produces a result equivalent to what
> you describe the browsers as doing, with no trick reparsing that
> could foul up escapes, assuming that one really wanted
>     google&#x2024;com
> to turn into
>     google.com (i.e., (6)google(3)com(0) )
> My guess is that, if you want to permit U+2024 anywhere in the
> system, that is exactly what you would want, both from the
> standpoint of user intuition and because the other cases would
> create phisher paradise.
>
>     john
>
>