NFKC and dots
John C Klensin
klensin at jck.com
Sun Jan 6 20:20:31 CET 2008
--On Sunday, 06 January, 2008 10:14 -0800 Erik van der Poel
<erikv at google.com> wrote:
> Thank you for taking the time to analyse these issues so
Glad it was useful. My apologies for the delay after your note
(and Ken's), but I sometimes need to try to examine these things
from several different directions before I can say anything
coherent and it often takes a while.
> I assure you that I was referring to U+2024, and not U+3002.
That is what I was afraid of :-)
> From my point of view, Opera, ICU and GNU libidn are
> relatively minor players in the Web arena.
Agreed and understood. But I'm at least as worried about the
precedents they (especially the libraries) set for the next sets
of applications (the non-web ones) as I am for the web. In
particular, I'm expecting that, once the specs for email address
i18n are clearly stable, we will see a number of implementations
--which necessarily include IDNA implementations-- very quickly.
And, while there may be more users of web clients (as a group)
in the world today than there are of mail clients (other than
webmail users), there are a lot more mail client
implementations. I expect the libraries may be important to at
least some of them. This is also one of the reasons I'm
anxious to get IDNAbis wrapped up, rather than
continuing to theorize: once we have mail implementations
deployed against a spec, however the implementers choose to
interpret the spec, it gets significantly harder to change...
both because the number of code bases is larger and because
fewer of the mail programs come with automatic updating.
> The players that
> truly determine how the Web evolves and how its components
> interoperate are MSIE, Firefox and to some extent Safari.
> Since MSIE 7 supports IDNA and there are quite a lot of MSIE 7
> users, more and more registrants are trying to use IDNA on the
> Web. However, MSIE 6 still has a large market share and it does
> not support IDNA, so the registrants and others tend to use
> A-labels. The numbers I posted previously show this too.
Understood, although with the further understanding that the
"only registered scripts" and "suspicious TLD" logic in MSIE and
Firefox may limit the number of IDNs that are seen as native
characters rather than ACE and thereby make the issues less
obvious in practice than they might be otherwise. That is
another reason why the mail cases are important: since there is
no ACE representation for the local part, the principle of least
astonishment may cause implementations that display the
equivalent of UTF-8 as ACE, rather than UTF-8 as UTF-8, to take
a lot of criticism.
> However, since MSIE6's market share is dwindling, MSIE7's and
> Firefox2's behavior may start to have more of an effect on the
> Web. Of course, the MSIE and Firefox developers may come up
> with patches for MSIE7 and Firefox2, or they may choose to
> implement MSIE8 and Firefox3 differently.
Yes. But the point I was trying to make is that MSIE7 and
Firefox2 are not conformant to IDNA2003, at least with regard to
characters that are compatibility equivalents of ASCII FULL
STOP. That incompatibility indicates one of three things:
* a bug that they may or may not choose to fix;
* a symptom of a poor definition in the standard that
resulted in variant understandings of what they were
supposed to do; or
* a decision on the part of the implementers that they
knew better about what to do for their users, regardless
of what the spec says.
The fact that there is even a question about that tells me that
we need to fix IDNA2003 in that area. A spec that at least five
implementers / implementer groups, each presumably working in
good faith, are interpreting in ways different from what the
spec says is an indication that the spec is broken. Whether the
correct fix is to better align the spec with practice, to
explain better why it says what it says and should be followed,
or to do something more drastic, is a separate question but,
coming back to my "minimal changes" discussion of a couple of
weeks ago, _something_ needs to be done.
> But my guess is that this is unlikely, since MSIE7's and
> Firefox2's current behavior with U+2024 (and other characters
> that yield U+002E under NFKC) is actually quite reasonable.
I agree as long as the way in which they do it doesn't break
embedded use of U+002E which is clearly now permitted under RFC
1035. For reasons I hope are obvious, I'm a lot more concerned
right now about "breaks 1035" than I am about "doesn't precisely
conform to IDNA". If I correctly understand the implications of
your description, they are probably doing exactly that, even
though a slightly different algorithm would solve the problem
and preserve the reasonable behavior for the reasonable cases.
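The normalization behavior at issue here is easy to check directly. A minimal sketch using Python's stdlib, assuming only that the implementation applies NFKC to the whole string before splitting on dots:

```python
import unicodedata

# U+2024 ONE DOT LEADER has a compatibility decomposition to U+002E,
# so whole-string NFKC silently manufactures a label separator:
assert unicodedata.normalize("NFKC", "\u2024") == "."
assert unicodedata.normalize("NFKC", "www\u2024example\u2024com") == "www.example.com"

# U+3002 IDEOGRAPHIC FULL STOP, by contrast, is NOT touched by NFKC;
# IDNA2003 maps it to U+002E by an explicit rule, not by normalization:
assert unicodedata.normalize("NFKC", "\u3002") == "\u3002"
```

This is why the two dots behave so differently in deployed code: the U+3002 mapping happens only where an implementation follows the explicit IDNA2003 dot-separator rule, while the U+2024 mapping falls out of wherever NFKC happens to be applied.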
> Your point about users accidentally entering the "wrong" type
> of dot is well taken. In fact, I only discovered this U+2024
> issue after someone else at Google sent me some data that
> happened to include it. I.e. it does occur on the Web, whether
> accidental or not.
And that means to me that we had better be very careful and
specific about what gets registered and how the strings are
interpreted. Otherwise, we create "opportunities".
> Of course, these things don't occur very often on the Web. But
> that is not the point. The point is that implementors need to
> make some decision about these details. And I would hope that
> the participants on this mailing list agree that it would be
> good to get implementors to move in the same direction.
Certainly I agree (in case that hasn't been clear).
> I like your idea of a DNS-extended NFKC, if I understand it
> correctly. How much of this would actually appear in your
> protocol draft, rationale draft and/or elsewhere?
I need to think about this a little more (i.e., I may change my
mind), but my current thinking is that most of this needs to end
up in the rationale draft as part of a discussion about what UI
implementations should and should not do. (See below.)
> PS Opera 9 does roughly the same thing with U+2024 as ICU's
> and GNU libidn's demo pages.
That is what I had inferred from your description.
I think the UI situation, at least as I see it right now, should
be based on the following assumptions. All but the first are
conditioned on the assumption, for which I think we already have
ample evidence, that, if we tell people to do (or not do)
something that they believe is important for their users, they
will ignore us. That is ultimately reasonable, so part of our
job is to define things so that they interoperate anyway.
(1) Actual URLs to be transmitted "on the wire" (and, if
relevant to protocols, IRIs), as distinguished from what people
type, should be in as minimal and standardized form as possible.
In the language of IDNA200X, that means they should contain
A-labels or U-labels only, with U+002E as a label separator.
The IDNA200X protocol functions only on U-labels and A-labels:
anything else is either an error or something that needs to be
fixed up elsewhere.
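The U-label/A-label relationship in (1) can be sketched with the stdlib punycode codec; real IDNA processing adds validity checks on top of this, so treat it as an illustration of the encoding only:

```python
# Minimal sketch: an A-label is the ACE prefix "xn--" plus the
# Punycode encoding of the U-label (validity checks omitted here).
ulabel = "bücher"
alabel = "xn--" + ulabel.encode("punycode").decode("ascii")
assert alabel == "xn--bcher-kva"

# The conversion is reversible, which is why nothing outside the
# U-label/A-label pair should ever reach the wire.
decoded = alabel[len("xn--"):].encode("ascii").decode("punycode")
assert decoded == "bücher"
```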
(2) Character mappings, especially from characters that people
are likely to type (by mistake or for convenience) instead of
the characters that actually appear in U-labels, are an
important function of pre-IDNA processing. There are two
theories about how to do such mappings. Either is acceptable
from an IDNA standpoint (since this is all about preprocessing
-- none of those mapped characters will ever appear as the
result of conversion between an A-label and a U-label). They are:
(2a) Apply DNS-extended-NFKC to a domain string before
IDNA processing. This is the global approach; it is
maximally permissive about the use of compatibility
characters and case variations, and, modulo some edge
cases (particularly the ones concerned with funny dots),
it is forward compatible from IDNA2003.
(2b) Apply mapping rules that make sense from the point
of view of a particular localized implementation. This
is a localized approach that may not map some of the
characters that (2a) maps (and treats them as errors
instead) and that may map some characters that would
otherwise fall into NEVER into valid characters (funny
dots are particularly important here). The one thing
that must not be done is to map any IDNA-valid character
(ALWAYS, MAYBE YES, MAYBE NO) into anything else (to do
so breaks interoperability in a way that I hope is obvious).
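One way to read (2a) together with the RFC 1035 concern above is the sketch below. The function name, and the choice to reject NFKC-manufactured dots rather than re-split on them, are my assumptions, not text from any spec:

```python
import unicodedata

# IDNA2003's explicitly recognized dot equivalents (RFC 3490):
# IDEOGRAPHIC FULL STOP, FULLWIDTH FULL STOP, HALFWIDTH IDEOGRAPHIC FULL STOP
DOT_EQUIVALENTS = ("\u3002", "\uff0e", "\uff61")

def dns_extended_nfkc(domain: str) -> str:
    """Sketch of option (2a): map dot-like separators first, then
    NFKC each label, and treat any NFKC-produced dot (e.g. from
    U+2024) as an error instead of a new label boundary."""
    for ch in DOT_EQUIVALENTS:
        domain = domain.replace(ch, ".")
    labels = []
    for label in domain.split("."):
        norm = unicodedata.normalize("NFKC", label.casefold())
        if "." in norm:
            # re-splitting here would break RFC 1035's permission
            # for embedded U+002E within a label
            raise ValueError(f"compatibility dot inside label: {label!r}")
        labels.append(norm)
    return ".".join(labels)
```

This is the "slightly different algorithm" shape mentioned earlier: the reasonable cases (explicit dot equivalents, case folding, compatibility characters) still work, but a dot that appears only as a side effect of normalization is surfaced as an error.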
(3) In making the choice between (2a) and (2b) and, in
particular, in deciding what to map and what not to map in (2b),
the possibility of user astonishment is an important
consideration. If a reasonable and moderately educated user
types a character in the good-faith belief that it will be
treated in a particular way, that had better happen. In
particular, if a user types something dot-like assuming it will
be treated as a label separator, it should be treated as a label
separator and not, e.g., be converted into some strange
embedding. Similarly, if a reasonable user whose native script
makes case distinctions expects case-independent matching for
that script, then mappings should occur that make that happen.
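As a concrete instance of (3): a user typing in a fullwidth input mode produces fullwidth letters and a fullwidth dot, and NFKC already maps all of these to their ASCII counterparts, which matches the user's good-faith expectation:

```python
import unicodedata

# "ｅｘａｍｐｌｅ．ｃｏｍ" as a user might type it in a fullwidth mode:
# fullwidth Latin letters plus U+FF0E FULLWIDTH FULL STOP
typed = "\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45\uff0e\uff43\uff4f\uff4d"
assert unicodedata.normalize("NFKC", typed) == "example.com"
```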
(4) Because we know that there are URIs and IRIs out there that
assume the mappings of IDNA2003 (rather than containing
exclusively U-labels or A-labels in domain name slots), an
implementation that chooses (2b) rather than pushing all
putative URIs and IRIs through DNS-extended-NFKC should be very
careful about error handling and reporting so that the user gets
a clue about malformed syntax rather than just a "not found" error.
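Point (4) is about error reporting, not resolution. A sketch of the distinction, with an entirely hypothetical wrapper (the function name and messages are illustrative, not from any spec):

```python
# Hypothetical wrapper illustrating point (4): a (2b)-style
# implementation that rejects a character should say so explicitly,
# rather than letting the lookup fall through to a generic "not found".
def lookup(domain: str) -> str:
    if "\u2024" in domain:
        # report the syntax problem so the user gets a clue...
        return "error: disallowed dot-like character U+2024 in %r" % domain
    # ...instead of an opaque NXDOMAIN-style failure
    return "ok: would resolve %r" % domain

assert lookup("evil\u2024com").startswith("error")
assert lookup("example.com").startswith("ok")
```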
If that about covers it, I pretty much know what I need to put
in the rationale document.