Comments on IDNAbis protocol-03

Fri Jan 11 16:39:41 CET 2008

Mark,

I had hoped to send the response to your comments on "protocol"
and those on "issues" together because they are so intertwined.
But the latter is requiring even more study and consideration
than this one so, as not to appear unresponsive, I'm going to
send them one at a time after all.  The "issues" comments are
not being ignored; I will send that response as soon as
possible consistent with other commitments. 

--On Wednesday, 09 January, 2008 16:26 -0800 Mark Davis
<mark.davis at icu-project.org> wrote:

> I sent this almost a month ago, and got no reply. I'm assuming
> that the lack of response was due to the holidays, and some
> discussion or response for these items will be forthcoming
> soon.

At least in my case, your assumption is largely correct,
although an extremely heavy travel schedule in the last two
months of the year and some other priorities (the EAI and
net-utf8 work are the two that have been visible to the IETF,
but I have non-IETF responsibilities as well) were at least as
important as the holidays.  FWIW, I explained that situation to
this list as part of my note to Erik on December 23, so it
should not have come as a surprise.

Despite the travel schedule and other constraints, I have also
responded already to some of the comments in other contexts (in
particular, in the note entitled "Minimal IDNAbis requirements"
sent on 17 December and some follow-ups to it).  For those
topics, I felt that a more comprehensive response would be better
for all concerned than a specific line-by-line response to each
separate note from a different person, but perhaps you
disagree.  I've cited those earlier notes below and in the
forthcoming response to your comments on "issues" in case you
want to examine the responses --and the development of thinking
as the result of comments-- in detail, but will try to
summarize the reactions and current plan below. 

I've elided considerable text below to make this response
shorter and easier to follow.  I don't believe I've dropped
anything important but, if you believe I've made a mistake
about that, please identify the outstanding questions in
another note and I'll get to them as soon as possible.

On Dec 13, 2007 7:47 PM, Mark Davis
<mark.davis at icu-project.org> wrote:

> http://www.ietf.org/internet-drafts/draft-klensin-idnabis-protocol-02.txt
> Overview:

> Protocol-1. By excluding case/width folding, there will be
> significant backwards compatibility problems, caused by
> having no standard folding. Examples of current usage:
> 
> 
> U-Label
> U-Label Escaped
>...

> I'll copy one portion. As of last March, "Out of a
> significantly large sampling of the web, there were about
> 800,000 cases where an HTML document contained an href="..."
> that contained a host name that was valid IDNA2003. We tested
>...
> But the folding case is different. The case/NFKC folding of
> IDNA is not just a UI issue; there are a huge number in
> email, web pages, and so on. I'm very leary of causing 4% of
> embedded URLs to break. And we haven't seen any real evidence
> that case/width folding is a real, demonstrable problem.

I covered this fairly extensively in the "Minimal IDNAbis
requirements" document mentioned above and other parts of it in
the 15 December response to Erik's "IDN trends" note and the 17
December response to him with a subject of "Re: idna
folding...".  See particularly point (4) of the original
("Minimal") message, which explicitly addresses the Turkish
dotless "i".  Those discussions led to a summary of UI issues
in a note sent 6 January, with a subject line of "Re: NFKC and
dots", in response to additional comments on that subject from
Erik and Ken (start about halfway down the message with the
paragraph that begins "I think the UI situation...").  Text is
being prepared for issues-06 that incorporates that material.

Some of that text parallels work for a possible revision of the
EAI framework document that discusses normalization and casing
of local-parts and is also reflected in notes I posted within
the last 24 hours about related issues in net-utf8.  In a more
perfect world, I think we would do net-utf8 first, then IDNs,
then EAI, but circumstances have resulted in our having to deal
with them in parallel, with the thinking about each informing
the others.  I hope to get the revised version of "issues" up
soon, but will probably wait until next week to see if you or
others have reactions to this note and the next one that can be
incorporated.

To summarize the comments that will go into issues-06 and
elsewhere (this is a different form of the comments in the 6
January "Re: NFKC and dots" note), as IDNs become more heavily
used and in more contexts, it will become ever more important
(if we want interoperability) that the on-the-wire forms be as
simple, consistent, and variation-free as possible.  That
argues for the use of punycode-processed forms (A-labels) in
URIs.  Since U-labels are not permitted to appear explicitly in
URIs (as distinct from IRIs), the alternative would be
octet-level encodings of the UTF-8 form of those labels.  That
is consistent with how the URI and IRI specs are written today
(as well as with IDNA2003) so doesn't raise any new issues.
However, as Erik (and others) have pointed out, there are
important contexts other than direct URI uses in protocols.
For them, and even for presentation of URIs and IRIs, there are
two types of contexts in which IDNs are likely to be
interpreted.  One of those is intended to be international.  It
must make tradeoffs for the best balance of presentation and
operations in an international environment (rather than being
tailored to a specific script and culture).  The other is
highly localized, with the goal of presenting the best possible
user experience to a local population.  Both should have an
objective of confusing the user as little as possible, but will
likely reach different answers because of the tradeoffs.
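
As a rough illustration of the two on-the-wire forms discussed
above, the sketch below uses only the Python standard library.
The helper names (to_a_label, to_percent_encoded) are made up
for this example, and the input is assumed to be already
lower-cased and normalized; this is not the IDNA200X procedure
itself, which also checks which code points are permitted.

```python
from urllib.parse import quote

ACE_PREFIX = "xn--"  # prefix that marks a punycode-processed label


def to_a_label(label: str) -> str:
    """Return the A-label form of an already-mapped label (no validity checks)."""
    if label.isascii():
        return label  # plain ASCII labels need no encoding
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def to_percent_encoded(label: str) -> str:
    """Return the octet-level form: percent-encoded UTF-8."""
    return quote(label, safe="")


label = "b\u00fccher"  # "bücher", lower-case and normalized
print(to_a_label(label))          # xn--bcher-kva
print(to_percent_encoded(label))  # b%C3%BCcher
```

Either form is pure ASCII, so both fit in a URI; the A-label is
the one that also works unchanged in the DNS.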

Remembering that this is all about presentation, rather than
what happens in the protocol, for the first, "internationalized
application", case one would have to be nuts to not do
case-mapping and one would probably have to be nuts to not
apply NFKC.  That is the right answer from a user experience
standpoint.  Perhaps not coincidentally, it is the right thing
to do from the standpoint of compatibility with IDNA2003.  
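
A minimal sketch of that kind of pre-processing, done by the
application before anything is handed to the protocol, might
look like the following.  The function name is hypothetical,
and Python's generic lower() is used here as an assumption; it
is a stand-in for, not a copy of, the Nameprep mapping tables
that IDNA2003 actually uses.

```python
import unicodedata


def premap_for_international_use(label: str) -> str:
    """Normalize with NFKC, lower-case, then normalize again.

    The second NFKC pass covers the case where lower-casing
    produces a sequence that is not itself normalized.
    """
    folded = unicodedata.normalize("NFKC", label).lower()
    return unicodedata.normalize("NFKC", folded)


# Fullwidth Latin capitals fold to plain lower-case ASCII.
print(premap_for_international_use("\uff25\uff38\uff21\uff2d\uff30\uff2c\uff25"))
# A decomposed "u" + combining diaeresis is composed by NFKC.
print(premap_for_international_use("Bu\u0308cher"))
```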

However, for the second, "highly localized", case, we shouldn't
try to tell the designers and implementers how to handle
characters with which their local users may be unfamiliar and
that might be mapped to others.  One can imagine switches about
whether some sorts of mappings occur, warnings before applying
them or, in a slightly more extreme version of the IE7 case, utter
refusal to handle "strange" characters at all in U-label form.

We have lots of sad experience in the IETF that tells us that, if
we tell implementers to do something they consider bad for
their users, they will ignore us (the wide use of punycode
display, triggered by various conditions not anticipated in RFC
3490, in existing browsers is a fairly typical example of the
problem).  So, for the long term, it is much better to say
"here is some general guidance that you may follow if it meets
your needs" but to focus on what is actually needed for
interoperability and how that is accomplished.  And that means
A-labels and, in other than presentation and input contexts,
U-labels as defined in idnabis-issues, i.e., the forms obtained
by mapping A-labels back.  As a result, there are no upper-case
characters (and no unnormalized strings) in U-labels,
independent of what forms may be used to lead to them.

> Note: There is only really one locale where locale-sensitive
> lowercasing is needed, and that is for Turkish (and related
> languages using the same conventions in Latin).
>...

This takes us to one of the places where we may have a
substantive philosophical disagreement about protocol design
(there is a better example below, from which I will point back
to this text).  Patrik and, less recently, Paul Hoffman have
discussed aspects of this from a different perspective but,
rather than getting tied up with comments about
highly-contextual definitions of stability, agreements,
promises, or who has or has not committed evil deeds, let's
just treat it as a philosophical disagreement.  In designing
protocols we prefer, whenever possible, to not rely on
predictions of the future, promises about the future, and, most
important, on assertions about universal negatives about
membership in sets that cannot be comprehensively enumerated
(i.e., that no other cases will ever be found).   In this case,
it isn't necessary to rely on "really only one case" and
therefore we shouldn't.

Instead, we (i) keep case-folding out of the protocol, (ii)
recommend that applications and implementations that have "use
internationally in the same form or with the same tools"
constraints do the same case-folding and NFKC mapping that they
would do under a forward projection of IDNA2003 and do it
before the string is presented for IDNA200X processing (note
that this also solves the IDNA2003 dot-mapping problem as
discussed on the list) and (iii) recommend to highly localized
applications and implementations that they do whatever mapping
makes sense to them and their users (including local
interpretation as needed of what constitutes a "dot") before
passing strings off to IDNA200X, conditioned only on not
changing a codepoint that can occur in a U-label into something
else.
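
To make (iii) concrete, here is a sketch of a localized
(Turkish-style) mapping together with a check of the one
condition stated above.  Everything in it is an assumption made
for illustration: the two-entry mapping is a simplification of
the Turkish casing rules, and may_occur_in_u_label() is a crude
stand-in for the real tables, not something defined in any of
the drafts.

```python
import unicodedata

# Simplified Turkish tailoring: "I" lower-cases to dotless i,
# and capital I-with-dot lower-cases to plain "i".
TURKISH_MAP = {"I": "\u0131", "\u0130": "i"}


def may_occur_in_u_label(ch: str) -> bool:
    """Hypothetical approximation: lower-case, NFKC-stable letters/digits, hyphen."""
    stable = unicodedata.normalize("NFKC", ch) == ch and ch.lower() == ch
    return stable and (ch.isalnum() or ch == "-")


def mapping_is_acceptable(mapping: dict) -> bool:
    """True if no code point that could be in a U-label is changed."""
    return not any(
        may_occur_in_u_label(src) and dst != src for src, dst in mapping.items()
    )


def local_premap(label: str, mapping: dict) -> str:
    """Apply the local mapping first, then generic lower-casing and NFKC."""
    mapped = "".join(mapping.get(ch, ch) for ch in label)
    return unicodedata.normalize("NFKC", mapped.lower())


print(mapping_is_acceptable(TURKISH_MAP))        # only upper-case sources
print(mapping_is_acceptable({"\u0131": "i"}))    # would rewrite dotless i
print(local_premap("DI\u015e", TURKISH_MAP))     # localized result
print(local_premap("DI\u015e", {}))              # generic result
```

The Turkish mapping passes the check because both of its source
characters are upper-case and so cannot appear in a U-label; a
mapping that rewrote dotless "i" itself would not pass.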

> Now, one possibility is that we have a separate IDNA-Folding
> document that preserves the case/width folding of IDNA2003.
> Then other standards, protocols, and implementations (such as
> browsers) could also claim conformance to that. This wouldn't
> be as good as keeping it inside the IDNA umbrella, but would
> be better than a potential huge backwards compatibility
> breakage.

See above.   This is a plausible alternative, except that, in
highly localized contexts, it is likely to be ignored by
someone who says "if we have to let users supply or read those
silly-looking Latin-derived (or Greek or Cyrillic) characters
at all, we want them to be forced to supply them in U-label form
so that they, and we, can be as sure as possible that they are
getting what they want to get".

> Protocol-2.  Section 5 has Normalization (5.5), but it is
> missing from Section 4. It must be there also (probably just
> an oversight).

Yes.  Caught from an earlier note (I can go back and figure out
which one if it is important) and fixed in the working draft
some time ago.

> Protocol-3.  It needs to have a prohibition on a leading
> combining mark. See Michel's emails.

Yes.  There is an interesting question about just how to apply
this prohibition (I don't remember how much of it has been
on-list), but it is clear that the prohibition must exist.
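
One possible reading of the prohibition, sketched here only as
an assumption since the exact rule is still open, is to reject
any label whose first code point has a Unicode General Category
of Mark (Mn, Mc, or Me).  The function name is hypothetical.

```python
import unicodedata


def starts_with_combining_mark(label: str) -> bool:
    """True if the first code point is in General Category M* (Mn, Mc, Me)."""
    return bool(label) and unicodedata.category(label[0]).startswith("M")


print(starts_with_combining_mark("\u0301abc"))  # combining acute accent first
print(starts_with_combining_mark("abc"))
```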

> Protocol-4.  Some of the same issues as
> draft-faltstrom-idnabis-tables-03.txt<http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt>,
> such as MAYBE YES vs MAYBE NO.

Assuming that you are referring to the belief that there is no
semantic difference between MAYBE YES and MAYBE NO, there is no
protocol difference (which is why they are not distinguished in
the protocol).   However, we anticipate differences in
registration policies and, more important, in registration
metapolicies as they might be applied to different types of
domains.  It appears to us to be worth preserving the
distinction, and the rules that support it, for that reason.
Because there is no protocol difference, little harm is done if
the distinction is made now and dropped later if it isn't
useful.  On the other hand, as I think we all know, making
distinctions later that were not made initially is often
terribly difficult.

> Protocol-5.  The "Contextual Rules" need to be supplied.
> (What is the format? Machine readable? Are there default
> required ones -- there should be, for ZWJ/ZWNJ).

Yes they need to be supplied, and as quickly as possible.  The
list, and the rules themselves, are presumably a job for an
IANA registry, probably initialized by a piece of the "tables"
document (since that is how related things are being done).
Machine readable would certainly be good, but there has been no
in-depth discussion yet about how to do that.  In particular it
is not clear whether all rules that are possible (i.e., that
may be required) can be appropriately expressed in regular
expression form using set elements of those expressions that
are well-defined and persistent.  But I don't understand what
you intend by "default required ones".  When contextual rules
are required, they are required and there is no default other
than "treat the corresponding code point as invalid".  Could
you explain? 
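
To show what "no default other than invalid" means in practice,
here is a sketch of a rule lookup.  The registry layout, the
names, and the single example rule (ZWJ allowed only directly
after a character with combining class 9, i.e. a virama) are
all invented for illustration; no actual rule set or format has
been agreed.

```python
import unicodedata
from typing import Callable, Dict

ZWNJ = "\u200c"
ZWJ = "\u200d"
VIRAMA_CCC = 9  # canonical combining class used by virama characters

# Code points that are only valid when a contextual rule says so.
CONTEXT_REQUIRED = {ZWJ, ZWNJ}


def after_virama(label: str, i: int) -> bool:
    """Illustrative rule: the preceding code point must be a virama."""
    return i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC


# Hypothetical registry; ZWNJ deliberately has no rule registered.
CONTEXT_RULES: Dict[str, Callable[[str, int], bool]] = {ZWJ: after_virama}


def contextual_ok(label: str) -> bool:
    """Reject a context-dependent code point if its rule is missing or fails."""
    for i, ch in enumerate(label):
        if ch not in CONTEXT_REQUIRED:
            continue
        rule = CONTEXT_RULES.get(ch)
        if rule is None or not rule(label, i):
            return False
    return True


print(contextual_ok("\u0915\u094d\u200d\u0937"))  # ZWJ after virama
print(contextual_ok("a\u200db"))                  # ZWJ, rule fails
print(contextual_ok("\u0915\u094d\u200c\u0937"))  # ZWNJ, no rule registered
```

A code point that needs a rule but has none registered is
treated exactly like one whose rule fails: the label is invalid.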

> Protocol-6.  Section 5.1 assumes that URLs are entered by
> users, when they are often (perhaps most often) interpreted
> by machines. That is of great importance, of course, for
> search engines, email readers, browsers, and others.

I'll see if the text can be tuned to better reflect this, but,
again, one of the goals of this work is to encourage as little
variation as possible in URLs themselves (e.g., only A-labels
go into URLs and only A-labels or U-labels go into IRIs).  Your
statement of the problem above is symptomatic of the confusion
here, which is that there is no conflict or contrast between
"entered by users" and "interpreted by machines".  In other
words, URLs are, except in very rare circumstances, interpreted
by machines no matter how they are entered and where they come
from.  Certainly some URLs are _generated_ by machines rather
than being entered by users, but we hope and expect (and, read
carefully, IDNA2003 expects) that such URLs will contain IDNs
in A-label or U-label form, not in any of the arbitrary forms
that can be converted into A-labels by IDNA2003.    The intent
of IDNA200X is to turn that expectation into a firm rule (while
allowing some flexibility in interpretation of legacy URLs that
didn't meet it).

Even when users interpret addresses and then have to enter them
(e.g., in what we often describe as the "side of the bus"
problem), people are generally better off with the U-label
forms.  The easiest example of this is, I believe, originally
due to your remarks: lower-case is more distinguishable than
upper case or, put differently, far more of the upper-case
characters of Greek and Cyrillic are confusable with their
Latin look-alikes than are their lower-case counterparts.
Perhaps that is what you intended?

Or perhaps I don't understand what you are concerned about.  If
so, could you clarify?

> Details
> Protocol-7.
> 
>    Unicode (without surrogates), paralleling the process above
> 
> (Minor) this is unnecessary. The tables disallow surrogates.

But the tables aren't authoritative, the rules are.  I can try
to work out a better way to say this with Patrik but, unless
you think it is harmful (as well as unnecessary), it might be
easier to just leave it.

> Protocol-8.
> 
>       a character is never removed from
>       it unless it is removed from Unicode.
> 
> This is not necessary. If you really have to have it, then
> add "(however, the Unicode stability policies expressly
> forbid this)"

This is the other case I was referring to above.  Certainly, if
it will improve world harmony, we can insert the parenthetical
note, presumably with a reference to the "Stability of
Properties" discussion in TUS5.0, Section 3.5, or Appendix F of
that document, and/or the "policies" web pages, as you prefer.
But we need to say something like this.  If we do not, we'll
have to go back to being version dependent, because each
version of the Unicode policy documents will need to be
separately evaluated when it appears.

thanks for the comments,
    john