Rationale-01 and issues list

Mon Jul 14 04:40:53 CEST 2008

Hi.

draft-ietf-idnabis-rationale-01 has just been queued for
posting.  The balance of this note consists of a first cut at an
issues and status list and discussion based on comments received
to date.

Similar notes on the Protocol document, and -02 of that
document, will follow before the posting cutoff.

I look forward to interesting and helpful on-list conversations
in the next two weeks and to productive sessions in Dublin.

   john

-----------

Issues list, IDNABIS Rationale  (as of 20080711)

Section numbers refer to both draft-ietf-idnabis-rationale-00
and -01 except as noted.

This is more of a status summary and issue discussion for that
document in the hope of facilitating wider discussion.  I will
further update the list as needed and create a list without
comments for discussion and tentative decisions before Dublin.
The numbers given are to facilitate referencing.

The issues/comments from R.10 through R.25 inclusive are
derived from messages from Mark Davis, primarily the note sent
Monday, 07 July, 2008 08:05 -0700.  Consequently and just as a
convenience, in places the text reads as a reply directed to
him.

** R.1 ** Should this document exist at all?

	There is an ongoing discussion about whether the documents
	should be reorganized in part to move material that is
	necessary to the protocol implementation entirely out of
	this document and then to drop this document.

	Status: Under discussion.  The rest of this list assumes
	that the document is retained; otherwise it would be irrelevant.

** R.2 ** Normative text in this document.

    If the document is retained, should we try to remove all
	normative references from the other documents to it and/or
	remove all normative text from it and put it in the other
	documents?  (Those two definitions of the issue probably
	amount to the same thing.)

    Comment: This issue is discussed at some length in my note
	on document organization.

** R.3 ** PROTOCOL-VALID Explanation

    Section 6.1.1 contains an overview of Protocol-Valid.  Is
    that explanation adequate?  If more is needed, what?

    In addition, there is an inconsistency between Rationale
	and Tables about the relationship between the two CONTEXT
	categories and PROTOCOL-VALID.  Tables treats the two
	categories as disjoint. The explanation in Rationale
	treats them as special subset cases of PROTOCOL-VALID,
	more or less as PROTOCOL-VALID with special flags set.

    Comment: The approach used in Tables is computationally
    easier and reflects the structure of both documents before
	a tentative decision in January to try the subset approach.
	My impression from discussions of these revisions in
	various meetings around the world is that the Tables
	(disjoint) approach is more generally understood.  The two
	documents obviously should be consistent; unless there are
	strong enough reasons to use the subset approach to justify
	changing Tables, Rationale will be changed to match it in
	-02.

    Status: The overview text about PROTOCOL-VALID has been
	changed to something that may be clearer.  Comments or
	suggestions for improvements on that issue or the more
	general ones discussed above are welcome.
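
    To make the disjoint-versus-subset distinction concrete, here
    is a minimal sketch of the two models.  The class and field
    names are hypothetical, the CONTEXTJ name is assumed from
    Tables, and nothing here is text from either document.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    """Disjoint model: each code point has exactly one category."""
    PVALID = "PROTOCOL-VALID"
    CONTEXTJ = "CONTEXTJ"      # name assumed from Tables
    CONTEXTO = "CONTEXTO"
    DISALLOWED = "DISALLOWED"


@dataclass(frozen=True)
class SubsetEntry:
    """Subset model: a contextual character is PROTOCOL-VALID plus a flag."""
    pvalid: bool
    context_rule: Optional[str] = None  # hypothetical flag: "J", "O", or None


def needs_rule_disjoint(cat: Category) -> bool:
    # One membership test on a single value.
    return cat in (Category.CONTEXTJ, Category.CONTEXTO)


def needs_rule_subset(entry: SubsetEntry) -> bool:
    # Two fields must be inspected together.
    return entry.pvalid and entry.context_rule is not None


print(needs_rule_disjoint(Category.CONTEXTJ))      # True
print(needs_rule_subset(SubsetEntry(True, "J")))   # True
print(needs_rule_subset(SubsetEntry(True)))        # False
```

    In the disjoint form a lookup returns one value and the
    caller branches on it; in the subset form every caller has to
    remember to check the flag as well.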

** R.4 ** Contextual rules and their application

    Section 6.1.1.2 contains a discussion of contextual rules
    and a placeholder for an in-depth explanation of the syntax
    for those rules.   It may not be right and/or clear.

    Status:  Deferred to -02 and, more important, to a review
	of the alternatives that are now shown in Protocol.

** R.5 ** Permanence of DISALLOWED

    There has been extensive on-list discussion about whether
    migration of characters from DISALLOWED to PROTOCOL-VALID
    (or CONTEXTO) should be easy, or at least easier than
    migration from PROTOCOL-VALID to DISALLOWED.   I do not
    believe we have reached consensus even though the material
    in 6.1.2 reflects what I believe to have been the general
    trend of the discussions when they ran down.  Does anyone
    have anything new to say about this and, if not, should I
    remove the placeholder?

	Status: Placeholder still present in -01.

** R.6 ** User agents and warnings

    The last paragraph in Section 6 (actually 6.3 on layered
    restrictions) explicitly points out the role of user agents,
    and then concludes with a warning against threats that
    cannot be completely prevented or blocked.   That sentence
    is redundant with a disclaimer in Security Considerations.
    Should it be removed (I believe that someone explicitly
    asked that something along these lines be said in 6.3, but
    it is redundant, so I'm checking).  Silence will be
    interpreted as "leave the text, drop the placeholder".

    Status: The placeholder/discussion question still appears
	in -01.

** R.7 ** Explanation of removal of symbols

    Section 10.5 discusses the reasons why symbols are not
    permitted in IDNA2008.   There has been controversy about
    some of the statements and examples, with some disagreement
    about whether some of them are even factually correct.   We
    either need to identify everything that is controversial in
    this section and trim it out (which might leave very
    little), fix specific examples on which we can agree, accept
    (and possibly note) the disagreements, or adopt some other
    strategy.    Of course, if we drop rationale and explanatory
    material in favor of a "this is just how it is" approach,
    the specific issues with this section vanish.

    See also R.25, which focuses on a specific part of the
	explanation.

** R.8 ** Mechanisms for updating the context registry

    Section 13.2 ("IDNA Context Registry") contains a discussion
    of the updating rules for the Contextual Rules registry.
	Are those rules the ones we want?

    Status: Pursuant to discussion on-list, this section has
	been rewritten to require IETF review and approval.  As
	discussed on the list, I (and at least some others) still
	hope that we can eventually get to a process that is more
	based on expert review, but it appears to be well in the
	future.  A placeholder has been left on that section, but
	it has been rewritten.

** R.9 ** Scope of requirements on Registries

	The language of the first sentence of Section 10.1.2
	appears to make a requirement on all DNS names and servers.
	That would require an update to RFC 1123, among other
	things.  Such an update is out of scope and probably
	undesirable.

	Comment: That reading was never intended.  The language
	assumed more context than the reader might have had and was
	generally sloppy.

	Status: This text has been changed in the -01 document
	to restrict its scope to zones supporting IDNA and
	eliminate the restriction on other prefixes. 
	WG members should review both the change and the note in
	the text of -01 to verify that they are what is desired.

** R.10 ** Stability of Labels.

   Mark Davis wrote: I believe quite strongly that once a
   domain name is valid, it should not be invalidated by any
   later version of IDNA.  Now, while we cannot prevent a later
   RFC from doing that, we *can* prevent such invalidation by
   the normal process of updating tables under these RFCs for
   new versions, adding exceptions, and changing contextual
   rules.

   Comment: I've included this in the Rationale list to be sure
   that it doesn't get lost.  I believe that there is consensus
   on this point.  If that is the case, the only issue is
   identifying any places where the documents might be
   inconsistent with it.

** R.11 ** Instability of Nonlabels.

   Mark wrote:
   I also think that making nonlabels stable should *not* be a
   goal. It can't really be achieved anyway, since the presence
   of an UNALLOWED character can make a label be invalid in
   version X yet valid in version Y (where that character is
   defined).

   Comment: See above. I've included this in the Rationale
   list to be sure that it doesn't get lost.  I believe that
   there is consensus on this point although perhaps not as
   clearly so.  If that is the case, the only issue is
   identifying any places where the documents might be 
   inconsistent with it.

   On the other hand, we may just have a misunderstanding.  See
   the second "Description of major changes..." list, below.

** R.12 ** Management.

   (Again from Mark's note and again in this list to prevent
   its getting lost) 
   The process of adding backwards compatibility characters,
   context conditions, and exceptions needs to be much more
   definitive. 

   Comment: I think the discussions during and since IETF 71
   have focused on "change the RFCs using normal IETF
   processes".  -01 has been changed to reflect that (see R.8
   for more discussion on the specific topic of Contextual
   Rules).  If we believe that, I don't know how much more can
   be said at this stage.

** R.13 ** Statement about the reason why IDNA uses Unicode

  The statement in 00 was "strange" and problematic.

  Status: After discussion on the list, Ken Whistler's
  suggested text was substituted into the text of -01.  To the
  extent necessary (I think we reached agreement on the list),
  people should verify that is what is wanted. 

** R.14 ** General editorial suggestions from Mark.

  Status: All of those that appear editorial and
  uncontroversial have been incorporated.  Those that were
  not considered editorial and uncontroversial are noted
  elsewhere in this list. 
  People should check diffs and/or Mark's list to be sure they
  are satisfied with the changes.

** R.15 ** Description of major changes from IDNA2003, Bidi

  Mark suggested removing item 9 ("Make bidirectional domain
  names in a paragraph display in a non-surprising fashion.")
  because it is just a special case of the previous item.

  Comment: that text wasn't mine and I'd like to hear from Paul
  and/or Harald and Cary before making any changes.  I believe
  the intent of the two separate statements was to distinguish
  between the case in which one knows that the string is a
  domain and the case in which one needs to deduce that the
  string is a domain name from running text.

  Status: A placeholder comment has been inserted in the -01
  document to identify this issue.

** R.16 ** Description of major changes from IDNA2003, invalid
       labels 

  Should item 11 ("Make some currently-valid labels that are
  not actually IDNA labels invalid.") be dropped?

  Mark said: 'Why do we care that labels invalid under IDNA2003
  are also invalid under IDNA2008? Why wouldn't they be?
  Perhaps an example would help to clarify this.'

  Comment: That isn't what it says, actually.  I think the
  intent of the statement was to call out the fact that
  conforming IDNA2003 implementations can look up labels
  starting in "xn--" that are not valid A-labels and labels
  that contain "--" in the third and fourth positions that don't
  start in "xn".   In IDNA2008, both of those lookup operations
  are somewhat discouraged and conforming applications MAY
  decline to look them up.  (N.B., those two reasons for not
  looking something up are separate and part of the
  outstanding issues list for Protocol.)
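
  As a rough illustration of the two label shapes described in
  the comment above, the following sketch sorts labels by prefix
  only.  The function name and the three result strings are
  hypothetical; it does not decide whether an "xn--" label is a
  valid A-label, which needs a separate check.

```python
def classify_label(label: str) -> str:
    """Classify a single label by its hyphen pattern (sketch only)."""
    lower = label.lower()
    if lower.startswith("xn--"):
        # Candidate A-label; validity still has to be tested separately.
        return "xn-prefixed"
    if len(lower) >= 4 and lower[2:4] == "--":
        # "--" in the third and fourth positions without the "xn" prefix.
        return "reserved-hyphens"
    return "ordinary"


print(classify_label("xn--bcher-kva"))  # xn-prefixed
print(classify_label("ab--cd"))         # reserved-hyphens
print(classify_label("example"))        # ordinary
```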

** R.17 ** Safe, but only in conjunction...

   Mark wrote:   '"that are safe for use only in
   conjunction". Since you never say why they are unsafe, this
   needs clarification. Do you mean this because of visible
   confusability?'

   Comment: The reference was in conjunction with combining
   characters that represent much the same situation as
   the joiners, i.e., that they don't add decoration to the
   previous character but, instead, just change its
   presentation form.  An example of this is Arabic Tatweel
   (U+0640), which has been discussed extensively in comments
   by individuals and from ASIWG.  I don't know how many other
   examples there are.  In the case of Tatweel, the ASIWG
   recommendation has been to ban it entirely (i.e., treat it as
   DISALLOWED).  If we follow that advice, it may not be a
   real example, but perhaps it illustrates the point.

   It is worth noting that, as long as the approval process for
   changes remains IETF Review, it is sufficient for this
   document to use a relaxed definition of "safe", to be
   evaluated on a case-by-case basis, rather than trying to
   narrow things down to definitions that would be automatic.

   Text that would further clarify this would be welcome.

** R.18 ** DISALLOWED in error.

   Mark commented on the statement 'If a character is
   classified as "DISALLOWED" in error and the error is
   sufficiently problematic, the only recourse would be either
   to introduce a new code point into Unicode and classify it as
   "PROTOCOL-VALID..."'.  He wrote: 'Unless you have some
   evidence to think that this is a real possibility (I don't),
   it should be removed.'

   Comment: I don't feel that this is worth fighting wars, or
   even writing long explanations (again), over. If there is
   consensus that it should come out, it will come out.
   However, I think we should all understand that most of
   these kinds of "unless you have evidence", "unless you
   can prove this could happen", and "unless you can show an
   example" arguments can produce different conclusions
   depending on how they are stated.   For example, the issue
   with the above could be stated (with apologies for being
   obnoxious) "Unless there is evidence that the Unicode
   Consortium has never made, and will never make, a serious
   mistake, then the text should stay in".  Again, I'm not
   arguing for keeping the text, but I do think we need to make
   decisions without getting trapped by the phrasing of
   questions.

** R.19 ** Slightly-redundant text.

   The last paragraph of section 6.3 is redundant with a
   similar comment in Security Considerations.  Should it be
   retained? 

   Comment: Mark suggests "yes" and I'm inclined to agree.  If
   others agree, I'll remove the "anchor" comment.

   Status: Discussion anchor still present in -01

** R.20 ** Display of A-labels (punycode-coded strings) to users

   The text says 'Applications MAY allow the display and
   user input of A-labels, but are encouraged to not do so
   except as an interface for special purposes, possibly for
   debugging, or to cope with display limitations.'

   Mark writes: There is widespread use of the A-Label to
   signal a possible spoof -- while you discuss that later, I
   think it's swimming against the tide not to mention it here.

   Comment: We definitely need to talk about this.  There is a
   difference between recognizing that something is done, even
   on a "widespread" basis, and encouraging it.  From a
   security practices and human factors standpoint, switching
   into A-labels for too many different reasons is a bad idea
   under both the principle that excessive warnings cause
   typical users to ignore all of them after a while and 
   because the user has no way to differentiate among the cases
   (at least without a handy A-label -> Unicode code point
   list mapper in addition to an A-label -> U-label mapper and
   knowledge as to what to do with both).  Consider the two
   most popular causes of A-label display today -- failure to
   have the relevant script installed (much more likely an
   indicator of "you are unlikely to be able to read the
   content at that destination" than of evil-doing unless one
   makes the classic assumption that anyone you don't know is a
   nasty barbarian) and failure to be part of a TLD that
   practices approved registration hygiene (extremely prone to
   false positives unless the TLD is actively recruiting
   evildoers) -- and then think about the effectiveness of
   A-labels as a spoof warning in environments in which we know
   that the typical user's response to being barraged by "an
   <incomprehensible> thing might happen if you continue, do
   you want to continue" messages is to click "yes" every time.
   Your original suggestion (a half-dozen years ago) to color
   these things was actually much better.

   I don't think there is a place for any of that discussion or
   associated recommendations in the document, but I also don't
   think we should be saying things that constitute
   recommending the practice.   The text as it stands is just a
   MAY with a fairly weak "encouraged" clause so people who
   believe that A-label display is the best solution are still
   conforming.

   Whatever is said about display of A-labels, it seems clear
   that users should be able to type them, in some way, on
   input.  Does that need to be said explicitly?
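
   For readers unsure what the two mappers mentioned above would
   involve, here is a minimal sketch using the punycode codec in
   the Python standard library.  The function names are
   hypothetical, and the code only decodes; it performs no IDNA
   validity checking of the result.

```python
def a_label_to_u_label(a_label: str) -> str:
    """Decode the part after "xn--" with the Punycode algorithm."""
    prefix = "xn--"
    if not a_label.lower().startswith(prefix):
        raise ValueError("not an xn-- label")
    return a_label[len(prefix):].encode("ascii").decode("punycode")


def code_points(label: str) -> list:
    """List the code points of a label in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in label]


u_label = a_label_to_u_label("xn--bcher-kva")
print(u_label)               # bücher
print(code_points(u_label))  # ['U+0062', 'U+00FC', 'U+0063', 'U+0068', 'U+0065', 'U+0072']
```

   Even with both outputs in hand, a typical user still has to
   know what to do with them, which is the point made above.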

** R.21 ** The "ae" example and discussion text.

   In Section 7.3, Paragraph 4, the text reads 'the
   two-character sequence "ae" is usually treated as a fully 
   acceptable alternate orthography.'  Mark suggests adding
   'for the "umlauted a" character'. 

   Comment: This should have been clear from the previous
   paragraph and the fact that the sentence in question starts
   with "That character (U+00E4)", a very explicit
   back-reference to that previous paragraph.  However, the
   suggested change seems relatively harmless.

   Status: The suggested change has been made to the text.
   Anyone who is unhappy about it should say so.  Previous
   experience indicates that the RFC Editor may take exception
   to that many repetitions of "umlauted a", but we will deal
   with that if and when we get there.

** R.22 ** "cannot be represented directly in domain names"

   In the first paragraph of Section 9, the text "use
   characters that cannot be represented directly in domain
   names but for which interpretations are provided." appears.
   Mark asks: "What is meant by this, and how is it different
   in IDNA2008? In both IDNA2003 and 2008 they are illegal."

   Comment: This was intended to get at the mapping issues,
   both with characters that disappear under NFKC and hence can
   be interpreted to be part of domain names and, in
   particular, to odd cases like the Sharp S (Eszett) mapping
   to "ss".  The situations are definitely different between
   IDNA2003 and 2008.    I think what I was trying to do with
   that convoluted sentence was to identify the situation
   without getting into a discussion of mapping and mapping
   issues.   I obviously failed.

   Status: A placeholder has been inserted in the text, along
   with preliminary suggested text from Patrik.  Please review,
   comment, and suggest improvements or alternatives if
   appropriate.
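
   The mapping behavior at issue can be seen in a small sketch
   built on the Python standard library "stringprep" and
   "unicodedata" modules.  It approximates only the mapping and
   normalization steps used with IDNA2003; the function name is
   hypothetical and the prohibited-character and bidi checks are
   deliberately omitted.

```python
import stringprep
import unicodedata


def idna2003_style_map(label: str) -> str:
    """Approximate the IDNA2003 mapping step (sketch, not a full Nameprep)."""
    mapped = []
    for ch in label:
        if stringprep.in_table_b1(ch):
            # Characters "mapped to nothing" simply disappear.
            continue
        # Case folding; this is where Sharp S becomes "ss".
        mapped.append(stringprep.map_table_b2(ch))
    return unicodedata.normalize("NFKC", "".join(mapped))


print(idna2003_style_map("Stra\u00dfe"))    # strasse
print(idna2003_style_map("ex\u00adample"))  # example (soft hyphen removed)
```

   Under the IDNA2008 approach discussed in this list, such
   mapping is not part of the protocol itself, which is the
   difference the sentence in Section 9 was trying to identify.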

** R.23 ** Detecting domain names in text

   Section 9, Paragraph 2 of the text contains the statement
   "If a domain name appears in an arbitrary context (such as
   running text), one may be faced with the requirement to know
   that a string is a domain name in order to adjust for the
   different forms of dots but also to have traditional dots to
   recognize that a string is a domain name -- an obvious
   contradiction."

   Mark wrote: "Not a contradiction, remove. Example, if one
   recognizes full-width dot in detecting URLs, then one can
   clearly use them in parsing within labels."

   Comment: Either I don't understand your example, or it
   supports my point.  If one "recognizes full-width dot in
   detecting URLs", then one has already made a decision that
   goes outside IDNA and the URL standards.  One could as
   easily "recognize" any other character, whether dot-like or
   not and whether on the IDNA2003 "treat as dot" list or not.
   That recognition would occur in one of two cases:

   (i) One treats the dot-like character as equivalent to an
   ASCII dot throughout the application.  The domain name is
   then recognized, in essence, by recognizing the ASCII dot as
   usual.  That might well work for a full-width character in
   contexts that are used to treating all full-width
   Latin-derived characters as identical to their ASCII
   equivalents, but certainly would not apply to various other
   characters that people have insisted are dot-like enough (or
   sentence-separator-like enough) to be treated as label
   separators.

   (ii) One treats the dot-like character as an ASCII dot
   because one knows that one is in a domain name or URL
   context.  But therein lies the contradiction to which I
   referred: if you need to know the context in order to
   determine whether the dot-oid is to be treated as an ASCII
   dot and/or label separator, then you cannot use the
   dot-oid to determine the context.

   Obviously the explanation in the text is not good enough.

   Status: New text has been inserted, along with a placeholder
   note.  Comments and suggestions welcome.
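
   Case (i) above can be sketched as follows, using the four
   characters that IDNA2003 (RFC 3490) treats as label
   separators.  The function name is hypothetical.  Note that the
   sketch assumes the caller already knows the string is a domain
   name, which is exactly the knowledge that case (ii) shows to
   be unavailable in running text.

```python
# Label separators recognized by IDNA2003: full stop, ideographic
# full stop, fullwidth full stop, halfwidth ideographic full stop.
IDNA2003_DOTS = "\u002e\u3002\uff0e\uff61"

_DOT_TABLE = {ord(ch): "." for ch in IDNA2003_DOTS}


def split_labels(name: str) -> list:
    """Split a string ALREADY KNOWN to be a domain name into labels."""
    return name.translate(_DOT_TABLE).split(".")


print(split_labels("www\uff0eexample\u3002com"))  # ['www', 'example', 'com']
```

   Applied blindly to ordinary prose, the same table would also
   split at every ideographic full stop that merely ends a
   sentence.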

** R.24 ** Local (user interface) mappings

   The text says: "None of those local decisions are a threat
   to interoperability as long as (i) only U-labels and
   A-labels are used in interchange with systems outside the
   local environment,...".

   Mark writes: Doesn't really follow that there are no
   problems. The obvious example of interoperability problems
   are where a Turkish friend has a URL that works in his
   browser, copies the text in an email and sends to me. When
   I click on it, it either 404's or **much worse**, goes to a
   different website.

   Comment: First of all, Mark and others have suggested many
   times that the Turkish "i" is such a rare and unique case
   that there is no point worrying about it.  I'm not convinced
   of that, but I think we either need to take it seriously as
   a problem...  or not.

   More to the point, the key is in the text following the
   passage you quoted, i.e., "(ii) no character that would
   be valid in a U-label as itself is mapped to something
   else,...".  As long as the dotless "i" (U+0131) remains
   PVALID (it is today), then the case I assume you are calling
   out above (since you specifically mentioned a Turkish
   friend) is prohibited (which doesn't mean it won't happen --
   see below).
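
   Condition (ii) can be expressed as a simple check on a local
   mapping table.  This is a sketch only: the function names are
   hypothetical, and "toy_is_pvalid" is a two-character stand-in
   for the real tables.

```python
def rule_ii_violations(local_map, is_pvalid):
    """Return source characters that are valid as themselves yet get remapped."""
    return [src for src, dst in local_map.items()
            if is_pvalid(src) and dst != src]


def toy_is_pvalid(ch):
    # Toy stand-in: only "i" and dotless "i" (U+0131) count as PVALID here.
    return ch in {"i", "\u0131"}


# A Turkish-aware user-interface mapping touches only uppercase input.
turkish_ui_map = {"I": "\u0131", "\u0130": "i"}
# A mapping that folds dotless "i" to "i" remaps a PVALID character.
folding_map = {"\u0131": "i"}

print(rule_ii_violations(turkish_ui_map, toy_is_pvalid))  # []
print(rule_ii_violations(folding_map, toy_is_pvalid))     # ['ı']
```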

   But I think this case identifies the sources of our
   disagreement about required standardized mapping and a
   number of other issues (including the one about the [mis]use
   of A-label display discussed above). If I'm making the right
   inference from several of your comments, you tend to look at
   whatever is going on today and say "this is the pattern, it
   is in wide use, we need to accept it, standardize it, and
   promote it".  My view is that, in a world in which we are
   talking about planning for the next billion Internet users
   and the billion after that, we should be figuring out what
   is optimal, including learning from the consequences of
   previous behaviors and decisions, and then developing plans
   about how to get there (even if that implies a
   slightly-bumpy transition).

   In this particular case, I look at some of the possibilities
   that could lead one to end up with a different target host
   or site than intended and say "another reason why we need to
   restrict URLs to final (non-mapped) characters only and get
   people out of the habit of believing that they can use, or
   advertise, one character in a URL or email address and
   expect a different character in the DNS".   You look at
   figures about how often Google encounters characters that
   need mapping and say "we have to preserve the mappings
   forever" while I look at the same data and say "we have to
   think through the transition and legacy issues very
   carefully, but need to get that behavior under control
   before it gets even worse".

   Suggestions about better text would be welcome, but the
   group needs to figure out what it wants to do with the
   underlying difference in models between "make the Internet
   better by stabilizing current behavior patterns" and "make
   the Internet better by facilitating evolution to
   better-designed patterns".

   Status: Placeholder/discussion anchor inserted in text.

** R.25 ** Explanation of the symbol prohibition

   In Section 10.5, one of the bullet points starts "Most
   Unicode names for letters are, in most cases, fairly
   intuitive, unambiguous and recognizable to users of the
   relevant script....and there are far more squares of
   various flavors in Unicode than there are hearts or stars."

   Mark wrote: This just needs to be removed; the argumentation
   is faulty. For the same pronunciation, Chinese has hundreds
   of possible characters. If you want another reason (and
   someone to point a finger at), you could say: "The Unicode
   Standard recommends that these types of identifiers not
   contain symbols [UAX31]."

   Comment: This needs further discussion.  I believe that the
   comment about Chinese is totally irrelevant: after all, a
   large number of languages have multiple characters within
   the same script for the same phoneme (as well as using the
   same character to represent multiple phonemes) and none of
   that generally affects the name of the character.  We may
   need to agree to disagree and then let the WG decide whether
   the discussion is important enough to try to tune in the
   light of the disagreement.  I don't know if better phrasing
   would help things; I gather from the above that you are
   convinced that it would not.

   Status: A placeholder/discussion anchor has been inserted
   into the text.

** R.26 ** Trimming additional text.

   An editorial note appears in the Change Log at the end of
   Section 15.8 about whether a list of additional text
   sections should be rewritten, trimmed, or dropped.  The
   group should review that list.

   Status: The note appeared in -00 and is retained in -01.  No
   comments have been received that are not addressed above or
   in changes already made.  The note itself will disappear in
   -02 when the change log entries describing pre-WG drafts are
   dropped.