Rationale problems

John C Klensin klensin at jck.com
Thu Nov 27 22:07:01 CET 2008


Suggestions for which comments do not appear below have been
responded to and incorporated into the document, although not
always exactly as suggested.  See the comment about "harmless"
in an earlier note for several of these -- if people don't like
suggested rephrasing or other changes, they should object
on-list in a timely fashion.  And, whether they like the precise
changes suggested or not, they should check the document when it
appears to be sure they are comfortable with it.


--On Thursday, 20 November, 2008 13:12 -0800 Mark Davis
<mark at macchiato.com> wrote:

>...
> ------------------------------
> 
> Some
> characters are sufficiently problematic for use in IDNs that
> they should be excluded for both registration and lookup
> (i.e., IDNA-conforming applications performing name lookup
> should verify that these
> characters are absent; if they are present, the label strings
> should be rejected rather than converted to A-labels and
> looked up.
> 
> =>
> 
> Some characters are inappropriate for use in IDNs and are thus
> excluded for both registration and lookup (i.e., IDNA-
> conforming applications performing name lookup should verify
> that these characters are absent; if they are present, the
> label strings should be rejected rather than converted to
> A-labels and looked up. Some of these characters are
> problematic for use in IDNs (such as the FRACTION SLASH
> character), while some of them (such as the HEART symbol)
> simply fall outside the conventions for typical identifiers
> (basically letters and numbers).
> 
> 
> *Rationale. *For only a minuscule fraction of the characters
> that were unmapped in IDNA2003 and illegal in IDNA2008 is
> there any evidence of being "problematic". Also the "should
> be" language is more appropriate for a proposal, not
> describing the current spec. The above wording makes clear the
> main reason for this break in compatibility, while noting the
> problematic nature of some characters. Also supplies some
> concrete examples (we could use more of that!).

While I've made this change almost as suggested, people should
check it carefully, since "problematic" is somewhat in the eye
of the beholder and, more important, in operational experience
with IDNs.  I also note that I've had as many requests to take
examples out (because they can be misleading and/or because
people disagree with the exact ones used) as I have to include
more of them.  Either the WG needs to make up its collective
mind, or I will keep applying my editorial judgment.
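
As an illustration of the lookup-time check described in the quoted text, here is a minimal sketch in Python. It is not the IDNA2008 classification: the set of general categories standing in for "letters and numbers" is an assumption made for this example, and the function names are hypothetical. Uppercase letters are left out on the assumption that any case mapping has already happened.

```python
import unicodedata

# Assumption for this sketch only: "letters and numbers" is approximated by
# these Unicode general categories. The real IDNA2008 tables are more detailed.
IDENTIFIER_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}


def offending_characters(label: str) -> list[str]:
    """Return the names of characters that fall outside the approximation."""
    return [
        unicodedata.name(ch, f"U+{ord(ch):04X}")
        for ch in label
        if ch != "-" and unicodedata.category(ch) not in IDENTIFIER_CATEGORIES
    ]


def check_before_lookup(label: str) -> str:
    """Reject the label instead of converting it and looking it up."""
    bad = offending_characters(label)
    if bad:
        raise ValueError(f"label rejected, contains: {', '.join(bad)}")
    return label


if __name__ == "__main__":
    for candidate in ["example", "a\u2044b", "i\u2665ny"]:
        print(repr(candidate), offending_characters(candidate))
```

The point of the sketch is only the control flow: the check runs before any conversion to an A-label, and a failure stops the lookup.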

>...
> ------------------------------
> 
> If
> a character is classified as "DISALLOWED" in error and the
> error is sufficiently problematic, the only recourse would be
> either to introduce a new code point into Unicode and classify
> it as "PROTOCOL-VALID" or for the IETF to accept the
> considerable costs of an incompatible change and replace the
> relevant RFC with one containing appropriate exceptions.
> 
> =>
> 
> If a character is classified as "DISALLOWED" when it should
> not have been, there are two possible recourses:
> (a)
> Replace the relevant RFC with one containing appropriate
> exceptions, accepting a change that many people feel to have
> considerable costs. (b) Propose a new character in Unicode
> that would be identical (except for its behavior in IDNA),
> which would be extremely unlikely to be accepted, since it
> would violate Unicode and ISO policies on duplicate encoding.

People have complained about statements that appear to predict
what the IETF will do in the future or to commit it to future
actions and those statements have been removed or corrected as
they have been identified.  It is, IMO, much less reasonable to
make statements that predict or attempt to constrain future
Unicode or ISO policies.  If the WG wants this sort of
statement, I believe it should be made only as a direct and
attributable quotation, e.g., "Mark Davis, speaking for the
Unicode Consortium and ISO/IEC JTC1/SC2, says 'such a proposal
would be extremely unlikely...' [Ref to date/time/place of
statement]".

There are some similar statements in the documents that we need
help identifying so that they can be removed or converted to the
above form.

> *Rationale.* Reflect reality. There is no particular consensus
> that changing DISALLOWED to PVALID would in fact have such
> costs. 

I note that the claim that it is reasonably safe to change
DISALLOWED to PVALID has been suggested several times and has
gotten little or no traction in the WG or at its meetings.   If
anything, there is rough consensus in the other direction (not
just "some people do feel..."). This one is up to Vint but,
absent instructions from him, I do not believe it is appropriate
to change this section (or the corresponding text in Tables) in
the direction of weakening the requirement.

> However, because some people do feel that is to be the
> case, we can certainly reflect that opinion here. We also
> don't want to hold out false hope that Unicode/ISO would do a
> duplicate encoding just for the purpose of IDNA.

> ------------------------------
> 
> For
> example, it is generally believed that labels containing
> characters from more than one script are a bad practice
> although there may be some important exceptions to that
> principle.
> 
> =>
> 
> For example, labels containing characters from more than one
> script are problematic where those characters can cause
> problems of visual confusion, such as using a Cyrillic
> character for "a" -- which looks exactly like a Latin "a" --
> in the midst of an otherwise Latin label. In other cases,
> mixing scripts may be perfectly acceptable, such as using
> Latin letters in the midst of Chinese characters.
> 
> *Rationale. *Use concrete examples of problems.

See comment above about examples.  Confusion is not the only
criterion, and what is or is not appropriate in mixing scripts
is very much a local judgment, so I don't believe we should
go much further than the existing text unless there is fairly
clear WG consensus (and maybe some evidence of registry/operator
consensus) to do so.
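
For readers who want to see what a mixed-script label looks like to software, the following is a rough Python sketch. Python's standard library does not expose the Unicode Script property, so the first word of each character name is used as a stand-in; that shortcut and the function name are assumptions of this example. The sketch only reports which scripts are present and makes no judgment about whether the mix is acceptable.

```python
import unicodedata


def scripts_in(label: str) -> set[str]:
    """Approximate the scripts in a label from Unicode character names.

    Assumption: the first word of the character name ("LATIN", "CYRILLIC",
    "CJK", ...) is close enough to the script for illustration purposes.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


if __name__ == "__main__":
    # Second character is CYRILLIC SMALL LETTER A (U+0430), not Latin "a".
    print(scripts_in("p\u0430ypal"))
    # Latin letters next to Chinese characters.
    print(scripts_in("abc\u4e2d\u6587"))
```

Both examples come back with two scripts, which is why a script count alone cannot separate the confusable case from the ordinary one.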

> ------------------------------
> 
> Other
> issues in domain name identification and processing arise
> because IDNA2003 specified that several other characters be
> treated ...
> [[anchor16:
> Above text is a substitute for an earlier (pre -01) version
> and is hoped to be more clear. Comments and improvements
> welcome.]]
> 
> =>
> 
> [Remove]
> 
> 
> Rationale. Unless at least one example of a concrete problem
> can be provided, this needs to be removed. What is wrong with
> changing the Regex for recognizing a URL from using [.] as the
> label delimiter to using [.。。﹒.]  (that is,
> [\x{002E}\x{FF0E}\x{FE52} \x{3002}\x{FF61}])?

This has been explained several times in terms of the
requirement to be able to identify domain names in text
independent of whether they contain non-ASCII characters or not
and independent of whether the classification algorithm is
IDNA-aware.  There are firm DNS and security considerations for
such identification, and the main clue is the presence of what
is called a "faceted name" in other contexts, i.e., a string of
characters containing embedded periods without spaces before or
after the periods.  There is no regex inherently involved in
that test, so "changing the regex" is irrelevant.  Issues about
URL recognition independent of domain names are almost as much
so, since URLs do come with additional clues about recognition.
If the text isn't good enough, I invite suggestions about how to
fix it, but the DNS community has already told us, multiple
times, that eliminating the rule or broadening the selection of
"dot-oids" is a showstopper.

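The "faceted name" clue described above (characters with embedded periods and no spaces before or after the periods) can be written down in a few lines of Python. A regular expression is used here only as a compact notation; nothing in the text requires one. The pattern and function name are assumptions of this sketch, and only U+002E is treated as a separator.

```python
import re

# One or more runs of non-space, non-period characters joined by U+002E,
# with no whitespace on either side of each period.
FACETED_NAME = re.compile(r"[^\s.]+(?:\.[^\s.]+)+")


def find_faceted_names(text: str) -> list[str]:
    """Return substrings that look like dotted names, ASCII or not."""
    return FACETED_NAME.findall(text)


if __name__ == "__main__":
    print(find_faceted_names("see www.example.com for details"))
    print(find_faceted_names("end of sentence. Next one"))
    # Non-ASCII labels are found the same way when the separator is U+002E.
    print(find_faceted_names("\u4f8b\u3048.\u30c6\u30b9\u30c8"))
    # An ideographic full stop (U+3002) is not treated as a separator here.
    print(find_faceted_names("\u4f8b\u3048\u3002\u30c6\u30b9\u30c8"))
```

The test never inspects which characters make up the labels, so it behaves the same whether or not the code doing the classification is IDNA-aware.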

> ------------------------------
> 
> Highly Localized Preprocessing.
> 
> =>
> [Remove this section and reword neighboring text. ]
> 
> 
> *Rationale.* Major security issue (see notes on protocol).

Whether the text can be improved or not, the principles behind
the text reflect operational reality.  As far as I can tell,
there are only two ways to eliminate the issue you are concerned
about:

	(i) Require support, in the protocol, for all IDNA2003
	mappings, including character-elimination and case
	folding mappings.   That would prevent some changes we
	have decided to make because they are substantively
	important for some scripts, would cause Unicode
	characters added after 3.2 to behave differently from
	characters present in 3.2, and would deprive us of the
	isomorphism between A-label and U-labels.   It would
	probably not require support for base characters that
	IDNA2008 disallows (such as symbols), but any mappings
	onto an IDNA2008 PVALID target would have to be
	supported.
	
	(ii) Prohibit all mappings, insisting that the _only_
	native-character form for IDNs is the IDNA2008 U-label
	form.  That would yield absolutely consistent and
	predictable behavior across implementations but, by
	prohibiting obvious case mappings and the like, would be
	nearly certain to be ignored by at least some
	implementers.  Note that "ignored by some in violation of
	the protocol" is far more dangerous than "apply a
	permitted local interpretation" because, in the former
	case, other implementations not only have no idea what
	to expect but may safely believe that no mappings occur.

If one doesn't like either of those extreme options, there is no
choice but to put in some weasel-words about local mappings and
interpretations.   I have to say that I've got a growing level
of sympathy for the second, but I do not believe it to be
acceptable to the WG or the community.
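
The loss of isomorphism mentioned under option (i) can be seen with Python's built-in "idna" codec, which implements the IDNA2003 rules including case folding. The sketch below is illustrative only; the helper names are hypothetical and it says nothing about how IDNA2008 implementations behave.

```python
def to_a_label_2003(label: str) -> str:
    """Convert a label with Python's built-in codec (IDNA2003 mappings)."""
    return label.encode("idna").decode("ascii")


def from_a_label(a_label: str) -> str:
    """Convert an A-label back to its native-character form."""
    return a_label.encode("ascii").decode("idna")


def round_trips(label: str) -> bool:
    """True if native -> A-label -> native returns the original string."""
    return from_a_label(to_a_label_2003(label)) == label


if __name__ == "__main__":
    for label in ["b\u00fccher", "B\u00fccher"]:
        print(repr(label), to_a_label_2003(label), round_trips(label))
```

Both inputs map to the same A-label, so only the lowercase one survives the round trip. With mappings in the protocol, several native strings share one A-label; under option (ii) only the string that round-trips would be a valid native form.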

> ------------------------------
> 
> Anyone looking up a label in a DNS zone is required to
> ...
> o Avoid validating other contextual rules about characters,
> including mixed-script label prohibitions, although such rules
> may be used to influence presentation decisions in the user
> interface.
> 
> =>
> [Remove last clause. Add editorial note that this needs to be
> reviewed against final text in protocol.]
>
> *Rationale. *The protocol does not, and should not, *require*
> someone not to validate that a purported U-Label is actually a
> U-Label!! Secondly, this and any other place in the document
> that reiterates what protocol requires needs to be marked to
> be verified before publication for accuracy, so that items
> like this don't mistakenly get through.

Note added.  Last clause will not be removed until we finalize
the protocol text and can do that verification.  

That said, the purpose of this statement is to do something you
have argued for elsewhere and have even argued that not doing it
is a security risk.  Statements like this draw a clear boundary
between things about which it is appropriate to warn the user
but then look up and things that are appropriately not looked
up.  We have to have predictable behavior about the latter or we
get into really strange territory.

> ------------------------------
> 
> characters are permanently excluded
> 
> =>
> 
> characters are excluded
> 
> *Rationale. *
> In accordance with "expected" language elsewhere. We need only
> say "excluded".

See comments above.  The IETF expectation is "permanently".
This has been discussed several times and no other conclusion
has been reached.

> ------------------------------
> 
> For example, an essential element of the ASCII case-mapping
> functions is that uppercase(character) must be equal to
> uppercase(lowercase(character)). That requirement may not be
> satisfied with IDNs. For example, there are some characters in
> scripts that use case distinction that do not have
> counterparts in one case or the other.
> 
> =>
> [Delete]
> 
> *Rationale. *While the roundtripping under case operations of
> ASCII characters is a feature, it is not an "essential"
> feature of ASCII. 

It is an essential feature of DNS, hostname, ASCII.

> Moreover, even in ASCII, strings do not
> roundtrip: "McGowan" doesn't roundtrip, for example. And
> neither of these points are relevant to the argument at hand,
> they just weaken it.

See if you can get WG consensus for changing this text.
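
The two properties being argued about are easy to confuse, so here is a small Python sketch that separates them. It uses Python's own Unicode case operations, which is an assumption of the example and not a statement about what any DNS implementation does. The first function is the per-character identity quoted in the text; the second is the string round trip that the "McGowan" example refers to.

```python
import string


def upper_is_stable(ch: str) -> bool:
    """The quoted identity: uppercase(c) == uppercase(lowercase(c))."""
    return ch.upper() == ch.lower().upper()


def string_round_trips(s: str) -> bool:
    """A different property: lowercasing the uppercased string restores it."""
    return s.upper().lower() == s


KELVIN_SIGN = "\u212a"  # lowercases to ASCII "k", which uppercases to ASCII "K"

if __name__ == "__main__":
    print(all(upper_is_stable(c) for c in string.ascii_letters))
    print(upper_is_stable(KELVIN_SIGN))
    print(string_round_trips("McGowan"))
    print(upper_is_stable("M") and "McGowan".upper() == "McGowan".lower().upper())
```

The identity holds for every ASCII letter and fails for KELVIN SIGN, while "McGowan" fails only the string round trip and still satisfies the identity.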

> ------------------------------
> 
> For
> example, putative labels that contain unassigned code points
> will now be rejected, while IDNA2003 permitted them (something
> that is now recognized as a considerable source of risk)
> 
> =>
> 
> For
> example, putative labels that contain unassigned code points
> will now be rejected, while IDNA2003 permitted them (something
> many feel to be a considerable source of risk)
> 
> 
> *Rationale.*
>  It isn't so recognized. There is no example of any case where
> it is a risk, since a label containing unassigned characters
> was always rejected in registration in IDNA2003, and therefore
> couldn't be matched. Note that IDNA2008 also does not require
> lookup to completely verify that putative U-Labels are actual
> U-Labels.

While I respect your opinion, please see if you can get WG
consensus for changing this text.   Even if you do, I'd expect
that you would get some pushback in IETF Last Call because
several people have done the analysis, particularly about
normalization of code points not yet assigned, and come to a
conclusion different from yours.
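
A check for unassigned code points in a putative label can be sketched as follows. The result depends on the Unicode version compiled into the Python interpreter in use, which is not necessarily the version an IDNA implementation would track; that dependency and the function name are assumptions of this example.

```python
import unicodedata


def unassigned_code_points(label: str) -> list[str]:
    """List code points whose general category is Cn (not assigned)."""
    return [
        f"U+{ord(ch):04X}"
        for ch in label
        if unicodedata.category(ch) == "Cn"
    ]


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    # U+0378 is a reserved gap in the Greek block.
    for candidate in ["example", "ab\u0378"]:
        print(repr(candidate), unassigned_code_points(candidate))
```

Because the answer changes as new Unicode versions assign code points, two implementations built against different versions can disagree about the same label.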

    john



