NFKC and dots

Mon Jan 7 02:40:01 CET 2008

On Jan 6, 2008 11:20 AM, John C Klensin <klensin at jck.com> wrote:
> --On Sunday, 06 January, 2008 10:14 -0800 Erik van der Poel
> <erikv at google.com> wrote:
> > From my point of view, Opera, ICU and GNU libidn are
> > relatively minor players in the Web arena.
>
> Agreed and understood.  But I'm at least as worried about the
> precedents they (especially the libraries) set for the next sets
> of applications (the non-web ones) as I am for the web.  In
> particular, I'm expecting that, once the specs for email address
> i18n are clearly stable, we will see a number of implementations
> --which necessarily include IDNA implementations-- very quickly.
> And, while there may be more users of web clients (as a group)
> in the world today than there are of mail clients (other than
> webmail users), there are a lot more mail client
> implementations.  I expect the libraries may be important to at
> least some of them.   This is also one of the reasons I'm
> getting concerned about getting IDNAbis wrapped up, rather than
> continuing to theorize: once we have mail implementations
> deployed against a spec, however the implementers choose to
> interpret the spec, it gets significantly harder to change...
> both because the number of code bases is larger and because
> fewer of the mail programs come with automatic updating
> machinery.

That is a good point. Apart from the U+2024 issue, the libraries also
offer the AllowUnassigned and UseSTD3ASCIIRules flags. The former is
clearly problematic, as we have discussed, but the latter might be
left off by an implementor who wants to allow the underscore (_),
which is quite widely supported, at least on the Web browser side. The
problem is that the flag is all-or-nothing, or rather, all-ASCII or
only-LDH, so it does not let an email app implementor accept the
underscore while still rejecting the other ASCII syntax characters
that are clearly problematic in the email protocols.
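
To make the all-or-nothing problem concrete, here is a minimal sketch
in Python of the finer-grained check an email implementor might want.
The function name check_ascii_label and its allow_underscore parameter
are hypothetical and are not options in any of the libraries discussed
here. The hyphen and length rules are the usual host name ones (no
leading or trailing hyphen, at most 63 octets per label).

```python
import string

# Letters, digits and hyphen: the "LDH" repertoire.
LDH = set(string.ascii_letters + string.digits + "-")


def check_ascii_label(label, allow_underscore=False):
    """Return True if an ASCII label is acceptable.

    Hypothetical policy: LDH only, no leading or trailing hyphen,
    plus the underscore when the caller explicitly enables it.
    """
    allowed = (LDH | {"_"}) if allow_underscore else LDH
    if not label or len(label) > 63:
        return False
    if label[0] == "-" or label[-1] == "-":
        return False
    return all(ch in allowed for ch in label)
```

With such a check, "foo_bar" can be accepted on request while "foo@bar"
is still rejected, which a single all-ASCII or only-LDH switch cannot
express.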

The AllowUnassigned and UseSTD3ASCIIRules options in the libraries are
a direct result of the IDNA2003 spec, so this should be addressed in
IDNA200X too.

> > However, since MSIE6's market share is dwindling, MSIE7's and
> > Firefox2's behavior may start to have more of an effect on the
> > Web. Of course, the MSIE and Firefox developers may come up
> > with patches for MSIE7 and Firefox2, or they may choose to
> > implement MSIE8 and Firefox3 differently.
>
> Yes.  But the point I was trying to make is that MSIE7 and
> Firefox2 are not conformant to IDNA2003, at least with regard to
> characters that are compatibility equivalents of ASCII FULL
> STOP.  That incompatibility indicates either
>
>         * a bug that they may or may not choose to fix
>         * a symptom of a poor definition in the standard that
>         resulted in variant understandings of what they were
>         supposed to do.
>         * a decision on the part of the implementers that they
>         knew better about what to do for their users, regardless
>         of what the spec says.
>
> The fact that there is even a question about that tells me that
> we need to fix IDNA2003 in that area.  A spec that at least five
> implementers / implementer groups, each presumably working in
> good faith, are interpreting in ways different from what the
> spec says is an indication that the spec is broken.  Whether the
> correct fix is to better align the spec with practice, to
> explain better why it says what it says and should be followed,
> or to do something more drastic, is a separate question but,
> coming back to my "minimal changes" discussion of a couple of
> weeks ago, _something_ needs to be done.

Yes, and my vote is to fix the specs and other implementations to
align with MSIE7 and Firefox2. See below.

> > But my guess is that this is unlikely, since MSIE7's and
> > Firefox2's current behavior with U+2024 (and other characters
> > that yield U+002E under NFKC) is actually quite reasonable.
>
> I agree as long as the way in which they do it doesn't break
> embedded use of U+002E which is clearly now permitted under RFC
> 1035.   For reasons I hope are obvious, I'm a lot more concerned
> right now about "breaks 1035" than I am about "doesn't precisely
> conform to IDNA".  If I correctly understand the implications of
> your description, they are probably doing exactly that, even
> though a slightly different algorithm would solve the problem
> and preserve the reasonable behavior for the reasonable cases.

I haven't been able to figure out a way to get the browsers to insert
U+002E into a label (from the HTML <a> tag). Although I agree with you
that RFC 1035 clearly allows for it via the \. syntax, I do not
consider the browsers' non-support of this feature to be "breaking"
RFC 1035. The browsers do not "break" the most important parts of RFC
1035, and I hereby suggest that the \. syntax of RFC 1035 is both
against the principle of least astonishment and probably quite
unnecessary.

By the way, MSIE, Firefox and Opera reject URLs with %5C in the host
name, while Safari removes the %5C unless it comes at the end of a
label, in which case it emits \ in the DNS packet. MSIE, Safari and
Opera treat a literal \ in the host name as if it were / (thereby
terminating the host name and beginning the path), presumably for
compatibility with the DOS/Windows file system use of \ as the
directory separator, while Firefox rejects host names with \ in them.

Note that I am not saying that the browsers' behavior with \ is not broken. :-)

I am saying that RFC 1035's \. syntax seems unnecessary and surprising.
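
To illustrate why the \. syntax can surprise people, here is a rough
Python sketch of splitting a name written in RFC 1035 master-file
notation, where a backslash quotes the next character and \DDD gives a
decimal octet value. The function name split_master_file_name is
hypothetical, and the sketch is not taken from any resolver or browser.

```python
def split_master_file_name(name):
    """Split an RFC 1035 master-file style name into its labels.

    A backslash quotes the following character, so an escaped dot stays
    inside the label; a backslash followed by three decimal digits is
    the octet with that value. Only unescaped dots separate labels.
    """
    digits = "0123456789"
    labels, current = [], []
    i = 0
    while i < len(name):
        ch = name[i]
        if ch == "\\":
            rest = name[i + 1:i + 4]
            if not rest:
                raise ValueError("dangling backslash at end of name")
            if rest[0] in digits:
                # \DDD form: exactly three decimal digits, value 0..255.
                if len(rest) < 3 or any(c not in digits for c in rest):
                    raise ValueError("malformed \\DDD escape")
                value = int(rest)
                if value > 255:
                    raise ValueError("\\DDD escape out of range")
                current.append(chr(value))
                i += 4
            else:
                # \X form: take X literally, even if X is a dot.
                current.append(rest[0])
                i += 2
        elif ch == ".":
            labels.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    labels.append("".join(current))
    return labels
```

Under this reading, a\.b.example.com has three labels, the first of
which is "a.b", whereas most users would expect four.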

> > I like your idea of a DNS-extended NFKC, if I understand it
> > correctly. How much of this would actually appear in your
> > protocol draft, rationale draft and/or elsewhere?
>
> I need to think about this a little more (i.e., I may change my
> mind), but my current thinking is that most of this needs to end
> up in the rationale draft as part of a discussion about what UI
> implementations should and should not do. (See below.)

I tend to think of the rationale as the set of reasons for a
particular spec (and maybe some background). So I'd think that the
DNS-extended NFKC prose should either go in the spec (protocol) or in
a document called "guidelines".

> I think the UI situation, at least as I see it right now, should
> be based on the following assumptions.   All but the first are
> conditioned on the assumption, for which I think we already have
> ample evidence, that, if we tell people to do (or not do)
> something that they believe is important for their users, they
> will ignore us.  That is ultimately reasonable, so part of our
> job is to define things so that they interoperate anyway.
>
> (1) Actual URLs to be transmitted "on the wire" (and, if
> relevant to protocols, IRIs), as distinguished from what people
> type, should be in as minimal and standardized form as possible.
> In the language of IDNA200X, that means they should contain
> A-labels or U-labels only, with U+002E as a label separator.
> The IDNA200X protocol functions only on U-labels and A-labels:
> anything else is either an error or something that needs to be
> fixed up elsewhere.
>
> (2) Character mappings, especially from characters that people
> are likely to type (by mistake or for convenience) instead of
> the characters that actually appear in U-labels, are an
> important function of pre-IDNA processing.  There are two
> theories about how to do such mappings.  Either is acceptable
> from an IDNA standpoint (since this is all about preprocessing
> -- none of those mapped characters will ever appear as the
> result of conversion between an A-label and a U-label).  They
> are:
>
>         (2a) Apply DNS-extended-NFKC to a domain string before
>         IDNA processing.  This is the global approach, it is
>         maximally permissive about the use of compatibility
>         characters and case variations, and, modulo some edge
>         cases (particularly the ones concerned with funny dots),
>         it is forward compatible from IDNA2003.
>
>         (2b) Apply mapping rules that make sense from the point
>         of view of a particular localized implementation.   This
>         is a localized approach that may not map some of the
>         characters that (2a) maps (and treats them as errors
>         instead) and that may map some characters that would
>         otherwise fall into NEVER into valid characters (funny
>         dots are particularly important here).  The one thing
>         that must not be done is to map any IDNA-valid character
>         (ALWAYS, MAYBE YES, MAYBE NO) into anything else (to do
>         so breaks interoperability in a way that I hope is
>         obvious).
>
> (3) In making the choice between (2a) and (2b) and, in
> particular, in deciding what to map and what to not map in (2b),
> the possibility of user astonishment is an important
> consideration.  If a reasonable and moderately educated user
> types a character in the good-faith belief that it will be
> treated in a particular way, that had better happen.   In
> particular, if a user types something dot-like assuming it will
> be treated as a label separator, it should be treated as a label
> separator and not, e.g., be converted into some strange
> embedding.  Similarly, if a reasonable user whose native script
> makes case distinctions expects case-independent matching for
> that script, then mappings should occur that make that happen.
>
> (4) Because we know that there are URIs and IRIs out there that
> assume the mappings of IDNA2003 (rather than containing
> exclusively U-labels or A-labels in domain name slots), an
> implementation that chooses (2b) rather than pushing all
> putative URIs and IRIs through DNS-extended-NFKC should be very
> careful about error handling and reporting so that the user gets
> a clue about malformed syntax rather than just a "not found"
> error.
>
> If that about covers it, I pretty much know what I need to put
> in the rationale document.

I think that does cover it, pretty much. The challenge is to write it
in clear language (not that the above is unclear). And then we just
hope that implementors will read it and understand it (and follow it).
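
As a rough illustration of the (2a) approach, here is a small Python
sketch of a preprocessing step that applies NFKC and case folding to a
whole domain string and only then splits it into labels. This is one
possible reading of "DNS-extended NFKC", which has not been defined
precisely in this thread; the function name preprocess_domain is
hypothetical, and the explicit handling of U+3002 is an assumption
based on its being a label separator in IDNA2003.

```python
import unicodedata

IDEOGRAPHIC_FULL_STOP = "\u3002"


def preprocess_domain(text):
    """Hypothetical pre-IDNA step: map compatibility forms, then split."""
    # Normalize, fold case, and normalize again, because case folding
    # can leave a string that is no longer in NFKC form.
    mapped = unicodedata.normalize("NFKC", text).casefold()
    mapped = unicodedata.normalize("NFKC", mapped)
    # NFKC maps U+2024, U+FF0E and U+FE52 to U+002E, and U+FF61 to U+3002.
    # U+3002 itself has no compatibility mapping, so map it explicitly.
    mapped = mapped.replace(IDEOGRAPHIC_FULL_STOP, ".")
    # Every dot, however it was typed, now acts as a label separator.
    return mapped.split(".")
```

Note that because the split happens after normalization, a typed
U+2024 ends up as a label separator rather than as a dot embedded in a
label. The sketch makes no attempt to preserve an embedded U+002E.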

Erik