Rationale-01 and issues list
John C Klensin
klensin at jck.com
Mon Jul 14 04:40:53 CEST 2008
draft-ietf-idnabis-rationale-01 has just been queued for
posting. The balance of this note consists of a first cut at an
issues and status list and discussion based on comments received.
Similar notes on the Protocol document, and -02 of that
document, will follow before the posting cutoff.
I look forward to interesting and helpful on-list conversations
in the next two weeks and to productive sessions in Dublin.
Issues list, IDNABIS Rationale (as of 20080711)
Section numbers refer to both draft-ietf-idnabis-rationale-00
and -01 except as noted.
This is more of a status summary and issue discussion for that
document in the hope of facilitating wider discussion. I will
further update the list as needed and create a list without
comments for discussion and tentative decisions before Dublin.
The numbers given are to facilitate referencing.
The issues/comments from R.10 through R.25 inclusive are
derived from messages from Mark Davis, primarily the note sent
Monday, 07 July, 2008 08:05 -0700. Consequently and just as a
convenience, in places the text reads as a reply directed to
** R.1 ** Should this document exist at all?
There is an ongoing discussion about whether the documents
should be reorganized in part to move material that is
necessary to the protocol implementation entirely out of
this document and then to drop this document.
Status: Under discussion. The rest of this list assumes
the answer is "no"; otherwise it would be irrelevant.
** R.2 ** Normative text in this document.
If the document is retained, should we try to remove all
normative references from the other documents to it and/or
remove all normative text from it and put it in the other
documents? (Those two definitions of the issue probably
amount to the same thing.)
Comment: This issue is discussed at some length in my note
on document organization.
** R.3 ** PROTOCOL-VALID Explanation
Section 6.1.1 contains an overview of Protocol-Valid. Is
that explanation adequate? If more is needed, what?
In addition, there is an inconsistency between Rationale
and Tables about the relationship between the two CONTEXT
categories and PROTOCOL-VALID. Tables treats the two
categories as disjoint. The explanation in Rationale
treats them as special subset cases of PROTOCOL-VALID,
more or less as PROTOCOL-VALID with special flags set.
Comment: The approach used in Tables is computationally
easier and reflects the structure of both documents before
a tentative decision in January to try the subset approach.
My impression from discussions of these revisions in
various meetings around the world is that the Tables
(disjoint) approach is more generally understood. The two
documents obviously should be consistent; unless there are
strong enough reasons to use the subset approach to justify
changing Tables, Rationale will be changed to match it in
Status: The overview text about PROTOCOL-VALID has been
changed to something that may be more clear. Comments or
suggestions for improvements on that issue or the more
general ones discussed above are welcome.
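
As a rough illustration of the two models contrasted above, the following Python sketch puts the disjoint (Tables-style) classification next to the subset-with-flags reading. Everything in it is an assumption made for the example: the code point assignments are illustrative rather than taken from Tables, the second CONTEXT category is written CONTEXTJ although this note names only CONTEXTO, and the names `Category`, `SubsetEntry`, and the two `needs_contextual_rule_*` functions are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    """Disjoint model: every code point has exactly one category."""
    PVALID = "PROTOCOL-VALID"
    CONTEXTJ = "CONTEXTJ"
    CONTEXTO = "CONTEXTO"
    DISALLOWED = "DISALLOWED"


# Illustrative assignments only; real values would come from Tables.
DISJOINT = {
    0x0061: Category.PVALID,
    0x200D: Category.CONTEXTJ,
    0x00B7: Category.CONTEXTO,
    0x0040: Category.DISALLOWED,
}


@dataclass(frozen=True)
class SubsetEntry:
    """Subset model: PROTOCOL-VALID plus an optional context flag."""
    protocol_valid: bool
    context_flag: Optional[str] = None  # "J", "O", or None


# The same illustrative code points expressed in the subset model.
SUBSET = {
    0x0061: SubsetEntry(True),
    0x200D: SubsetEntry(True, "J"),
    0x00B7: SubsetEntry(True, "O"),
    0x0040: SubsetEntry(False),
}


def needs_contextual_rule_disjoint(cp: int) -> bool:
    # One lookup, one membership test.
    return DISJOINT[cp] in (Category.CONTEXTJ, Category.CONTEXTO)


def needs_contextual_rule_subset(cp: int) -> bool:
    # Two fields must be consulted and kept consistent with each other.
    entry = SUBSET[cp]
    return entry.protocol_valid and entry.context_flag is not None
```

Both functions give the same answers for this data, which suggests the choice between the models is mostly about clarity and ease of computation rather than expressive power.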
** R.4 ** Contextual rules and their application
Section 126.96.36.199 contains a discussion of contextual rules
and a placeholder for an in-depth explanation of the syntax
for those rules. It may not be right and/or clear.
Status: Deferred to -02 and, more important, to a review
of the alternatives that are now shown in Protocol.
** R.5 ** Permanence of DISALLOWED
There has been extensive on-list discussion about whether
migration of characters from DISALLOWED to PROTOCOL-VALID
(or CONTEXTO) should be easy, or at least easier than
migration from PROTOCOL-VALID to DISALLOWED. I do not
believe we have reached consensus even though the material
in 6.1.2 reflects what I believe to have been the general
trend of the discussions when they ran down. Does anyone
have anything new to say about this and, if not, should I
remove the placeholder?
Status: Placeholder still present in -01.
** R.6 ** User agents and warnings
The last paragraph in Section 6 (actually 6.3 on layered
restrictions) explicitly points out the role of user agents,
and then concludes with a warning against threats that
cannot be completely prevented or blocked. That sentence
is redundant with a disclaimer in Security Considerations.
Should it be removed? (I believe that someone explicitly
asked that something along these lines be said in 6.3, but
it is redundant, so I'm checking.) Silence will be
interpreted as "leave the text, drop the placeholder".
Status: The placeholder/discussion question still appears
** R.7 ** Explanation of removal of symbols
Section 10.5 discusses the reasons why symbols are not
permitted in IDNA2008. There has been controversy about
some of the statements and examples, with some disagreement
about whether some of them are even factually correct. We
either need to identify everything that is controversial in
this section and trim it out (which might leave very
little), fix specific examples on which we can agree, accept
(and possibly note) the disagreements, or adopt some other
strategy. Of course, if we drop rationale and explanatory
material in favor of a "this is just how it is" approach,
the specific issues with this section vanish.
See also R.25, which focuses on a specific part of the
** R.8 ** Mechanisms for updating the context registry
Section 13.2 ("IDNA Context Registry") contains a discussion
of the updating rules for the Contextual Rules registry.
Are those rules the ones we want?
Status: Pursuant to discussion on-list, this section has
been rewritten to require IETF review and approval. As
discussed on the list, I (and at least some others) still
hope that we can eventually get to a process that is more
based on expert review, but it appears to be well in the
future. A placeholder has been left on that section, but
it has been rewritten.
** R.9 ** Scope of requirements on Registries
The language of the first sentence of Section 10.1.2
appears to make a requirement on all DNS names and servers.
That would require an update to RFC 1123, among other
things. Such an update is out of scope and probably
Comment: That reading was never intended. The language
assumed more context than the reader might have had and was
Status: This text has been changed in the -01 document
to restrict its scope to zones supporting IDNA and
eliminate the restriction on other prefixes.
WG members should review both the change and the note in
the text of -01 to verify that they are what is desired.
** R.10 ** Stability of Labels.
Mark Davis wrote: I believe quite strongly that once a
domain name is valid, it should not be invalidated by any
later version of IDNA. Now, while we cannot prevent a later
RFC from doing that, we *can* prevent such invalidation by
the normal process of updating tables under these RFCs for
new versions, adding exceptions, and changing contextual
Comment: I've included this in the Rationale list to be sure
that it doesn't get lost. I believe that there is consensus
on this point. If that is the case, the only issue is
identifying any places where the documents might be
inconsistent with it.
** R.11 ** Instability of Nonlabels.
I also think that making nonlabels stable should *not* be a
goal. It can't really be achieved anyway, since the presence
of an UNALLOWED character can make a label be invalid in
version X yet valid in version Y (where that character is
Comment: See above. I've included this in the Rationale
list to be sure that it doesn't get lost. I believe that
there is consensus on this point although perhaps not as
clearly so. If that is the case, the only issue is
identifying any places where the documents might be
inconsistent with it.
On the other hand, we may just have a misunderstanding. See
the second "Description of major changes..." list, below.
** R.12 ** Management.
(Again from Mark's note and again in this list to prevent
its getting lost)
The process of adding backwards compatibility characters,
context conditions, and exceptions needs to be much more
Comment: I think the discussions during and since IETF 71
have focused on "change the RFCs using normal IETF
processes". -01 has been changed to reflect that (see R.8
for more discussion on the specific topic of Contextual
Rules). If we believe that, I don't know how much more can
be said at this stage.
** R.13 ** Statement about the reason why IDNA uses Unicode
The statement in -00 was "strange" and problematic.
Status: After discussion on the list, Ken Whistler's
suggested text was substituted into the text of -01. To the
extent necessary (I think we reached agreement on the list),
people should verify that is what is wanted.
** R.14 ** General editorial suggestions from Mark.
Status: all of these that appear editorial and
uncontroversial have been incorporated. Those that were
not considered editorial and uncontroversial are noted
elsewhere in this list.
People should check diffs and/or Mark's list to be sure they
are satisfied with the changes.
** R.15 ** Description of major changes from IDNA2003, Bidi
Mark suggested removing item 9 ("Make bidirectional domain
names in a paragraph display in a non-surprising fashion.")
because it is just a special case of the previous item.
Comment: that text wasn't mine and I'd like to hear from Paul
and/or Harald and Cary before making any changes. I believe
the intent of the two separate statements was to distinguish
between the case in which one knows that the string is a
domain and the case in which one needs to deduce that the
string is a domain name from running text.
Status: A placeholder comment has been inserted in the -01
document to identify this issue.
** R.16 ** Description of major changes from IDNA2003, invalid
Should item 11 ("Make some currently-valid labels that are
not actually IDNA labels invalid.") be dropped?
Mark said: 'Why do we care that labels invalid under IDNA2003
are also invalid under IDNA2008? Why wouldn't they be?
Perhaps an example would help to clarify this.'
Comment: That isn't what it says, actually. I think the
intent of the statement was to call out the fact that
conforming IDNA2003 implementations can look up labels
starting in "xn--" that are not valid A-labels and labels
that contain "--" in the third and fourth positions that don't
start in "xn". In IDNA2008, both of those lookup operations
are somewhat discouraged and conforming applications MAY
decline to look them up. (N.B., those two reasons for not
looking something up are separate and part of the
outstanding issues list for Protocol.)
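
The two structural cases described in the comment above can be told apart with a simple check. The Python sketch below is only an illustration: the function name `classify_label` and its return strings are hypothetical, the case-insensitive prefix comparison is an assumption, and the check deliberately does not decide whether an "xn--" label is a valid A-label, which needs far more validation than shown here.

```python
def classify_label(label: str) -> str:
    """Classify a label by its hyphen structure only (illustrative)."""
    lowered = label.lower()
    if lowered.startswith("xn--"):
        # Carries the prefix; may or may not be a valid A-label.
        return "xn-prefixed"
    if len(lowered) >= 4 and lowered[2:4] == "--":
        # "--" in the third and fourth positions without the "xn" prefix.
        return "reserved-hyphens"
    return "ordinary"
```

An application following the MAY described above could decline to look up anything in the second class, and anything in the first class that fails its A-label validation.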
** R.17 ** Safe, but only in conjunction...
Mark wrote: '"that are safe for use only in
conjunction". Since you never say why they are unsafe, this
needs clarification. Do you mean this because of visible
Comment: The reference was in conjunction with combining
characters that represent much the same situation as
the joiners, i.e., that they don't add decoration to the
previous character but, instead, just change its
presentation form. An example of this is Arabic Tatweel
(U+0640), which has been discussed extensively in comments
by individuals and from ASIWG. I don't know how many other
examples there are. In the case of Tatweel, the ASIWG
recommendation has been to ban it entirely (i.e., treat it as
DISALLOWED). If we follow that advice, it may not be a
real example, but perhaps it illustrates the point.
It is worth noting that, as long as the approval process for
changes remains IETF Review, it is sufficient for this
document to use a relaxed definition of "safe", to be
evaluated on a case-by-case basis, rather than trying to
narrow things down to definitions that would be automatic.
Text that would further clarify this would be welcome.
** R.18 ** DISALLOWED in error.
Mark commented on the statement 'If a character is
classified as "DISALLOWED" in error and the error is
sufficiently problematic, the only recourse would be either
to introduce a new code point into Unicode and classify it as
"PROTOCOL-VALID..."'. Unless you have some evidence to
think that this is a real possibility (I don't), it should
Comment: I don't feel that this is worth fighting wars, or
even writing long explanations (again), over. If there is
consensus that it should come out, it will come out.
However, I think we should all understand that most of
these kinds of "unless you have evidence", "unless you
can prove this could happen", and "unless you can show an
example" arguments can produce different conclusions
depending on how they are stated. For example, the issue
with the above could be stated (with apologies for being
obnoxious) "Unless there is evidence that the Unicode
Consortium has never made, and will never make, a serious
mistake, then the text should stay in". Again, I'm not
arguing for keeping the text, but I do think we need to make
decisions without getting trapped by the phrasing of
** R.19 ** Slightly-redundant text.
The last paragraph of section 6.3 is redundant with a
similar comment in Security Considerations. Should it be removed?
Comment: Mark suggests "yes" and I'm inclined to agree. If
others agree, I'll remove the "anchor" comment.
Status: Discussion anchor still present in -01
** R.20 ** Display of A-labels (punycode-coded strings) to users
The text says 'Applications MAY allow the display and
user input of A-labels, but are encouraged to not do so
except as an interface for special purposes, possibly for
debugging, or to cope with display limitations.'
Mark writes: There is widespread use of the A-Label to
signal a possible spoof -- while you discuss that later, I
think it's swimming against the tide not to mention it here.
Comment: We definitely need to talk about this. There is a
difference between recognizing that something is done, even
on a "widespread" basis, and encouraging it. From a
security practices and human factors standpoint, switching
into A-labels for too many different reasons is a bad idea
under both the principle that excessive warnings cause
typical users to ignore all of them after a while and
because the user has no way to differentiate among the cases
(at least without a handy A-label -> Unicode code point
list mapper in addition to an A-label -> U-label mapper and
knowledge as to what to do with both). Consider the two
most popular causes of A-label display today -- failure to
have the relevant script installed (much more likely an
indicator of "you are unlikely to be able to read the
content at that destination" than of evil-doing unless one
makes the classic assumption that anyone you don't know is a
nasty barbarian) and failure to be part of a TLD that
practices approved registration hygiene (extremely prone to
false positives unless the TLD is actively recruiting
evildoers) -- and then think about the effectiveness of
A-labels as a spoof warning in environments in which we know
that the typical user's response to being barraged by "an
<incomprehensible> thing might happen if you continue, do
you want to continue" messages is to click "yes" every time.
Your original suggestion (a half-dozen years ago) to color
these things was actually much better.
I don't think there is a place for any of that discussion or
associated recommendations in the document, but I also don't
think we should be saying things that constitute
recommending the practice. The text as it stands is just a
MAY with a fairly weak "encouraged" clause so people who
believe that A-label display is the best solution are still
Whatever is said about display of A-labels, it seems clear
that users should be able to type them, in some way, on
input. Does that need to be said explicitly?
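
For readers who have not seen the two mappers mentioned in the comment above, here is a minimal Python sketch of an A-label to U-label mapper and an A-label to code point list mapper. It uses the `punycode` codec from the Python standard library. The function names are hypothetical, and the sketch only undoes the encoding; it does not check that the result is a valid label under any version of IDNA.

```python
from typing import List


def a_label_to_u_label(a_label: str) -> str:
    """Decode the Punycode part of an xn-- label (no IDNA validation)."""
    if not a_label.lower().startswith("xn--"):
        raise ValueError("label does not start with xn--")
    return a_label[4:].encode("ascii").decode("punycode")


def code_points(u_label: str) -> List[str]:
    """List the code points of a decoded label in U+XXXX form."""
    return [f"U+{ord(ch):04X}" for ch in u_label]
```

For example, `a_label_to_u_label("xn--bcher-kva")` yields "bücher", and `code_points` on that result shows U+00FC in the second position. A typical user is unlikely to have either tool at hand, or to know what to do with the output.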
** R.21 ** The "ae" example and discussion text.
In Section 7.3, Paragraph 4, the text reads 'the
two-character sequence "ae" is usually treated as a fully
acceptable alternate orthography.' Mark suggests adding
'for the "umlauted a" character'.
Comment: This should have been clear from the previous
paragraph and the fact that the sentence in question starts
with "That character (U+00E4)", a very explicit
back-reference to that previous paragraph. However, the
suggested change seems relatively harmless.
Status: The suggested change has been made to the text.
Anyone who is unhappy about it should say so. Previous
experience indicates that the RFC Editor may take exception
to that many repetitions of "umlauted a", but we will deal
with that if and when we get there.
** R.22 ** "cannot be represented directly in domain names"
In the first paragraph of Section 9, the text "use
characters that cannot be represented directly in domain
names but for which interpretations are provided." appears.
Mark asks: "What is meant by this, and how is it different
in IDNA2008? In both IDNA2003 and 2008 they are illegal."
Comment: This was intended to get at the mapping issues,
both with characters that disappear under NFKC and hence can
be interpreted to be part of domain names and, in
particular, to odd cases like the Sharp S (Eszett) mapping
to "ss". The situations are definitely different between
IDNA2003 and 2008. I think what I was trying to do with
that convoluted sentence was to identify the situation
without getting into a discussion of mapping and mapping
issues. I obviously failed.
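
To make the mapping issue a little more concrete, the Python sketch below shows the kind of interpretation being discussed: a compatibility character that is replaced under NFKC, and Sharp S becoming "ss" under case folding. The helper name `compat_fold` is hypothetical, and combining NFKC with `str.casefold` is only an approximation chosen for illustration, not the mapping procedure of IDNA2003 or of any draft.

```python
import unicodedata


def compat_fold(text: str) -> str:
    """Approximate an IDNA2003-style mapping: NFKC, then case folding."""
    return unicodedata.normalize("NFKC", text).casefold()


# U+FF41 FULLWIDTH LATIN SMALL LETTER A is replaced by plain "a".
fullwidth_result = compat_fold("\uff41")

# U+00DF LATIN SMALL LETTER SHARP S folds to the two letters "ss".
sharp_s_result = compat_fold("stra\u00dfe")
```

In both cases the character the user typed is not the character that ends up in the name, which is the situation the convoluted sentence was trying to describe.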
Status: A placeholder has been inserted in the text, along
with preliminary suggested text from Patrik. Please review,
comment, and suggest improvements or alternatives if
** R.23 ** Detecting domain names in text
Section 9, Paragraph 2 of the text contains the statement
"If a domain name appears in an arbitrary context (such as
running text), one may be faced with the requirement to know
that a string is a domain name in order to adjust for the
different forms of dots but also to have traditional dots to
recognize that a string is a domain name -- an obvious
Mark wrote: "Not a contradiction, remove. Example, if one
recognizes full-width dot in detecting URLs, then one can
clearly use them in parsing within labels."
Comment: Either I don't understand your example, or it
supports my point. If one "recognizes full-width dot in
detecting URLs", then one has already made a decision that
goes outside IDNA and the URL standards. One could as
easily "recognize" any other character, whether dot-like or
not and whether on the IDNA2003 "treat as dot" list or not.
That recognition would occur in one of two cases:
(i) One treats the dot-like character as equivalent to an
ASCII dot throughout the application. The domain name is
then recognized, in essence, by recognizing the ASCII dot as
usual. That might well work for a full-width character in
contexts that are used to treating all full-width
Latin-derived characters as identical to their ASCII
equivalents, but certainly would not apply to various other
characters that people have insisted are dot-like enough (or
sentence-separator-like enough) to be treated as label
(ii) One treats the dot-like character as an ASCII dot
because one knows that one is in a domain name or URL
context. But therein lies the contradiction to which I
referred: if you need to know the context in order to
determine whether the dot-oid is to be treated as an ASCII
dot and/or label separator, then you cannot use the
dot-oid to determine the context.
Obviously the explanation in the text is not good enough.
Status: New text has been inserted, along with a placeholder
note. Comments and suggestions welcome.
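
The circularity in case (ii) can be sketched in a few lines of Python. The dot-like characters listed are the ones IDNA2003 requires to be recognized as dots (U+3002, U+FF0E, U+FF61); the function names and the deliberately naive regular expression are assumptions made for this illustration and are not proposed as a detection algorithm.

```python
import re
from typing import List

# Characters that IDNA2003 treats as equivalent to the ASCII dot.
IDNA2003_DOTS = "\u3002\uff0e\uff61"


def split_known_domain(name: str) -> List[str]:
    """Split into labels, given that we already know this is a domain."""
    for dot in IDNA2003_DOTS:
        name = name.replace(dot, ".")
    return name.split(".")


def find_candidates(text: str) -> List[str]:
    """Naive detection in running text that relies on ASCII dots only."""
    return re.findall(r"[^\s.]+(?:\.[^\s.]+)+", text)
```

`split_known_domain` works only because the caller has already established the context. `find_candidates` has no such knowledge, so a name written with U+FF0E is not found at all, and teaching it to accept dot-like characters means deciding, outside IDNA, which characters count.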
** R.24 ** Local (user interface) mappings
The text says: "None of those local decisions are a threat
to interoperability as long as (i) only U-labels and
A-labels are used in interchange with systems outside the
Mark writes: Doesn't really follow that there are no
problems. The obvious example of interoperability problems
are where a Turkish friend has a URL that works in his
browser, copies the text in an email and sends to me. When
I click on it, it either 404's or **much worse**, goes to a
Comment: First of all, Mark and others have suggested many
times that the Turkish "i" is such a rare and unique case
that there is no point worrying about it. I'm not convinced
of that, but I think we either need to take it seriously as
a problem... or not.
More to the point, the key point is in the text following
that which you quoted, i.e., "(ii) no character that would
be valid in a U-label as itself is mapped to something
else,...". As long as the dotless "i" (U+0131) remains
PVALID (it is today), then the case I assume you are calling
out above (since you specifically mentioned a Turkish
friend) is prohibited (which doesn't mean it won't happen --
But I think this case identifies the sources of our
disagreement about required standardized mapping and a
number of other issues (including the one about the [mis]use
of A-label display discussed above). If I'm making the right
inference from several of your comments, you tend to look at
whatever is going on today and say "this is the pattern, it
is in wide use, we need to accept it, standardize it, and
promote it". My view is that, in a world in which we are
talking about planning for the next billion Internet users
and the billion after that, we should be figuring out what
is optimal, including learning from the consequences of
previous behaviors and decisions, and then developing plans
about how to get there (even if that implies a
In this particular case, I look at some of the possibilities
that could lead one to end up with a different target host
or site than intended and say "another reason why we need to
restrict URLs to final (non-mapped) characters only and get
people out of the habit of believing that they can use, or
advertise, one character in a URL or email address and
expect a different character in the DNS". You look at
figures about how often Google encounters characters that
need mapping and say "we have to preserve the mappings
forever" while I look at the same data and say "we have to
think through the transition and legacy issues very
carefully, but need to get that behavior under control
before it gets even worse".
Suggestions about better text would be welcome, but the
group needs to figure out what it wants to do with the
underlying difference in models between "make the Internet
better by stabilizing current behavior patterns" and "make
the Internet better by facilitating evolution to
Status: Placeholder/discussion anchor inserted in text.
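
As a small illustration of how a local mapping can send two users to different places, the Python sketch below compares a default lowercase mapping with a Turkish-style one. Both function names are hypothetical, and the Turkish mapping is hand-written for the example (capital I to dotless U+0131, capital dotted U+0130 to "i") rather than taken from any locale library.

```python
def default_lower(label: str) -> str:
    """Locale-independent lowercase mapping."""
    return label.lower()


def turkish_lower(label: str) -> str:
    """Illustrative Turkish-style mapping applied before lowercasing."""
    mapped = label.replace("I", "\u0131").replace("\u0130", "i")
    return mapped.lower()
```

With the input "ISTANBUL", the first function gives "istanbul" and the second gives a label beginning with U+0131. If the unmapped string is what gets copied into email, the sender and the recipient can end up looking up different labels, which is the interchange condition in item (i) of the quoted text.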
** R.25 ** Explanation of the symbol prohibition
In Section 10.5, one of the bullet points starts "Most
Unicode names for letters are, in most cases, fairly
intuitive, unambiguous and recognizable to users of the
relevant script....and there are far more squares of
various flavors in Unicode than there are hearts or stars."
Mark wrote: This just needs to be removed; the argumentation
is faulty. For the same pronunciation, Chinese has hundreds
of possible characters. If you want another reason (and
someone to point a finger at), you could say: "The Unicode
Standard recommends that these types of identifiers not
contain symbols [UAX31]."
Comment: This needs further discussion. I believe that the
comment about Chinese is totally irrelevant: after all, a
large number of languages have multiple characters within
the same script for the same phoneme (as well as using the
same character to represent multiple phonemes) and none of
that generally affects the name of the character. We may
need to agree to disagree and then let the WG decide whether
the discussion is important enough to try to tune in the
light of the disagreement. I don't know if better phrasing
would help things; I gather from the above that you are
convinced that it would not.
Status: A placeholder/discussion anchor has been inserted
into the text.
** R.26 ** Trimming additional text.
An editorial note appears in the Change Log at the end of
Section 15.8 about whether a list of additional text
sections should be rewritten, trimmed, or dropped. The
group should review that list.
Status: The note appeared in -00 and is retained in -01. No
comments have been received that are not addressed above or
in changes already made. The note itself will disappear in
-02 when the change log entries describing pre-WG drafts are