Comments on IDNAbis protocol-03
mark.davis at icu-project.org
Sat Jan 19 20:24:05 CET 2008
Thanks for your responses. Most of your suggestions seem fine to me, so I'm
only including the ones where I have a comment.
> > Protocol-1. By excluding case/width folding, there will be
> > significant backwards compatibility problems, caused by
> > having no standard folding. Examples of current usage:
> Instead, we (i) keep case-folding out of the protocol,
> recommend that applications and implementations that have "use
> internationally in the same form or with the same tools"
> constraints do the same case-folding and NFKC mapping that they
> would do under a forward projection of IDNA2003 and do it
> before the string is presented for IDNA200X processing (note
> that this also solves the IDNA2003 dot-mapping problem as
> discussed on the list) and
(i), (ii) As per Michel's suggestion, that is fine as long as we have a
standard "Preprocessing IDNA" specification that is available for those who
wish to state conformance to it. I'll send around a draft for discussion.
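
To make the discussion concrete, here is a rough Python sketch of the kind of preprocessing step being talked about. It is not the draft itself: it assumes the preprocessing consists only of the IDNA2003 dot mapping followed by case folding and NFKC, and the name `preprocess_idna` is hypothetical. The full IDNA2003 mapping (Nameprep) has further steps that are not shown.

```python
import unicodedata

# Dot-like characters that IDNA2003 treats as label separators
# in addition to U+002E FULL STOP.
IDNA2003_DOTS = ("\u3002", "\uff0e", "\uff61")


def preprocess_idna(domain: str) -> str:
    """Hypothetical preprocessing: dot mapping, case folding, NFKC."""
    for dot in IDNA2003_DOTS:
        domain = domain.replace(dot, ".")
    # Normalize, fold case, then normalize again, since case folding
    # can produce sequences that are no longer in NFKC form.
    folded = unicodedata.normalize("NFKC", domain).casefold()
    return unicodedata.normalize("NFKC", folded)
```

The result would then be split on "." and each label handed to IDNA200X processing; an implementation stating conformance to a shared specification would produce the same output everywhere.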
> (iii) recommend to highly localized
> applications and implementations that they do whatever mapping
> makes sense to them and their users (including local
> interpretation as needed of what constitutes a "dot") before
> passing strings off to IDNA200X, conditioned only on not
> changing a codepoint that can occur in a U-label into something
I think "recommending" this would be much too strong. As you say, some people
might do it anyway, but it does represent an interoperability problem, since
the same string would be preprocessed differently in different environments.
And the IDNs have a way of percolating into many different environments.
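
A small illustration of that interoperability problem, as a sketch only: both mapping functions below are hypothetical, one doing locale-independent lowercasing and the other applying Turkish-style casing for the letter I. Neither changes a code point that can occur in a U-label, yet the same input ends up as two different labels.

```python
def generic_mapping(label: str) -> str:
    """Locale-independent lowercasing (hypothetical 'standard' mapping)."""
    return label.lower()


def turkish_style_mapping(label: str) -> str:
    """Hypothetical local mapping using Turkish casing rules for I."""
    # Dotless capital I maps to dotless small i (U+0131); dotted capital
    # I (U+0130) maps to the ordinary small i.
    return label.replace("I", "\u0131").replace("\u0130", "i").lower()


label = "TITLE"
# The two environments disagree about what was typed.
print(generic_mapping(label) == turkish_style_mapping(label))
```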
> > Protocol-4. Some of the same issues as
> > draft-faltstrom-idnabis-tables-03.txt
> > <http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt>,
> > such as MAYBE YES vs MAYBE NO.
> Assuming that you are referring to the belief that there is no
> semantic difference between MAYBE YES and MAYBE NO, there is no
> protocol difference (which is why they are not distinguished in
> the protocol). However, we anticipate differences in
> registration policies and, more important, in registration
> metapolicies as they might be applied to different types of
> domains. It appears to us to be worth preserving the
> distinction, and the rules that support it, for that reason.
> Because there is no protocol difference, little harm is done if
> the distinction is made now and dropped later if it isn't
> useful. On the other hand, as I think we all know, making
> distinctions later that were not made initially is often
> terribly difficult.
I'll comment on the MAYBE topic later.
> > Protocol-5. The "Contextual Rules" need to be supplied.
> > (What is the format? Machine readable? Are there default
> > required ones -- there should be, for ZWJ/ZWNJ).
> Yes they need to be supplied, and as quickly as possible. The
> list, and the rules themselves, are presumably a job for an
> IANA registry, probably initialized by a piece of the "tables"
> document (since that is how related things are being done).
> Machine readable would certainly be good, but there has been no
> in-depth discussion yet about how to do that. In particular it
> is not clear whether all rules that are possible (i.e., that
> may be required) can be appropriately expressed in regular
> expression form using set elements of those expressions that
> are well-defined and persistent. But I don't understand what
> you intend by "default required ones". When contextual rules
> are required, they are required and there is no default other
> than "treat the corresponding code point as invalid". Could
> you explain?
What I mean is that certain contextual rules, such as no combining mark at the
start or restrictions on ZWJ/ZWNJ, always need to be present. Others may be
optional, depending on the registry.
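
A minimal Python sketch of what two "always present" rules could look like. The actual rules have not been written yet, so the ZWJ/ZWNJ condition here (allowed only directly after a character with canonical combining class 9, i.e. a virama) is an assumption made purely for illustration, as is the function name.

```python
import unicodedata

ZWNJ, ZWJ = "\u200c", "\u200d"
VIRAMA_CCC = 9  # canonical combining class of virama characters


def passes_required_rules(label: str) -> bool:
    """Check a label against two hypothetical always-required rules."""
    if not label:
        return False
    # Rule 1: no combining mark (general category M*) at the start.
    if unicodedata.category(label[0]).startswith("M"):
        return False
    # Rule 2 (assumed): ZWJ/ZWNJ may appear only directly after a virama.
    for i, ch in enumerate(label):
        if ch in (ZWNJ, ZWJ):
            if i == 0 or unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True
```

Registry-specific rules would then be layered on top of checks like these rather than replacing them.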
However, we should also point out that registries may have rules that
require access to more than just a label, such as:
- use folding table A to map a proposed registration to a canonical
form (e.g., Simplified Chinese form).
- if any already registered label has that form as well, reject the
registration.
That would allow both traditional and simplified characters, but the first
one registered takes precedence. I mention this as an example; there are
many other possibilities.
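
The two steps above can be sketched as follows. This is only an illustration: the `Registry` class is hypothetical, and the one-entry folding table stands in for a real traditional-to-simplified mapping.

```python
class Registry:
    """Toy registry that rejects labels colliding after folding."""

    def __init__(self, folding_table: dict) -> None:
        self.folding_table = folding_table
        self.registered = {}  # canonical form -> label first registered

    def canonical_form(self, label: str) -> str:
        # Map each character through the folding table, if listed.
        return "".join(self.folding_table.get(ch, ch) for ch in label)

    def register(self, label: str) -> bool:
        key = self.canonical_form(label)
        if key in self.registered:
            return False  # another label already has this canonical form
        self.registered[key] = label
        return True


# Stand-in for "folding table A": traditional U+570B -> simplified U+56FD.
registry = Registry({"\u570b": "\u56fd"})
print(registry.register("\u4e2d\u56fd"))  # simplified form, registered first
print(registry.register("\u4e2d\u570b"))  # traditional variant of the same name
```

Note that the check needs the set of existing registrations, which is exactly the kind of information a per-label contextual rule does not have.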
> > Protocol-6. Section 5.1 assumes that URLs are entered by
> > users, when they are often (perhaps most often) interpreted
> > by machines. That is of great importance, of course, for
> > search engines, email readers, browsers, and others.
> I'll see if the text can be tuned to better reflect this, but,
> again, one of the goals of this work is to encourage as little
> variation as possible in URLs themselves (e.g., only A-labels
> go into URLs and only A-labels or U-labels go into IRIs). Your
> statement of the problem above is symptomatic of the confusion
> here, which is that there is no conflict or contrast between
> "entered by users" and "interpreted by machines". In other
> words, URLs are, except in very rare circumstances, interpreted
> by machines no matter how they are entered and where they come
> from. Certainly some URLs are _generated_ by machines rather
> than being entered by users, but we hope and expect (and, read
> carefully, IDNA2003 expects) that such URLs will contain IDNs
> in A-label or U-label form, not in any of the arbitrary forms
> that can be converted into A-labels by IDNA2003. The intent
> of IDNA200X is to turn that expectation into a firm rule (while
> allowing some flexibility to interpretation of legacy URLs that
> didn't meet it).
> Even when users interpret addresses and then have to enter them
> (e.g., in what we often describe as the "side of the bus"
> problem), people are generally better off with the U-label
> forms. The easiest example of this is, I believe, originally
> due to your remarks: lower-case is more distinguishable than
> upper case or, put differently, far more of the upper-case
> characters of Greek and Cyrillic are confusable with their
> Latin look-alikes than are their lower-case counterparts.
> Perhaps that is what you intended?
> Or perhaps I don't understand what you are concerned about. If
> so, could you clarify?
This would be satisfied if we have a separate Preprocessing IDNA document.
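
For readers following the A-label/U-label terminology in the quoted text, here is a minimal Python sketch of how the two forms of a single label relate, using the standard library's Punycode codec. The function names are hypothetical, and the sketch assumes the input is already a valid, lowercase, normalized label; it performs none of the validity checks IDNA200X would require.

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Convert a single U-label to its A-label form (no validation)."""
    if u_label.isascii():
        return u_label  # plain ASCII labels are left alone
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Convert a single A-label back to its U-label form (no validation)."""
    if not a_label.lower().startswith(ACE_PREFIX):
        return a_label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
```

Any mapping of what a user actually typed (case, width, dots) would happen before `to_a_label`, which is where a preprocessing document would fit.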
> > Details
> > Protocol-7.
> > Unicode (without surrogates), paralleling the process above
> > (Minor) this is unnecessary. The tables disallow surrogates.
> But the tables aren't authoritative, the rules are. I can try
> to work out a better way to say this with Patrik but, unless
> you think it is harmful (as well as unnecessary), it might be
> easier to just leave it.
There are many other exclusions in the rules that form the tables. If you
are going to say that, it would be better to rephrase to indicate that.
> > Protocol-8.
> > a character is never removed from
> > it unless it is removed from Unicode.
> > This is not necessary. If you really have to have it, then
> > add "(however, the Unicode stability policies expressly
> > forbid this)"
> This is the other case I was referring to above. Certainly, if
> it will improve world harmony, we can insert the parenthetical
> note, presumably with a reference to the "Stability of
> Properties" discussion in TUS5.0, Section 3.5, or Appendix F of
> that document, and/or the "policies" web pages, as you prefer.
> But we need to say something like this. If we do not, we'll
> have to go back to being version dependent, because each
> version of the Unicode policy documents will need to be
> separately evaluated when it appears.
Let's discuss this separately. I think I see why you want to say it, but
there are many other possible changes (like the removal of "-" from hostname
labels by the IETF) that we don't call out.
> thanks for the comments,