I-D ACTION:draft-klensin-idnabis-issues-01.txt

John C Klensin klensin at jck.com
Thu Mar 1 15:34:04 CET 2007


--On Tuesday, 27 February, 2007 10:40 -0800 Mark Davis
<mark.davis at icu-project.org> wrote:

> Some very quick notes on
> http://www.ietf.org/internet-drafts/draft-klensin-idnabis-issues-01.txt
> (I'll be out the rest of this week).

Some equally quick responses...  "#" denotes quotes from the
I-D text, and ">" denotes your comments.

# 2.1
# 
#    The registrant or user typically produces the request
#    string by keyboard entry of a character sequence.  That
#    sequence is validated only on the basis of its displayed
#    appearance, without knowledge of the character coding used
#    for its internal representation or other local details of
#    the way the operating system processes it.

> This makes it sound like software validates the sequence,
> which is incorrect.

I certainly did not read that into it and note that "software"
does not appear anywhere in the statement.  The text was
intended to indicate that whatever validation is performed is
performed by the user or not at all.  Suggestions for better
wording would be appreciated, but see below.

> No software validates input character
> sequences on the basis of displayed appearance. The user
> might look at the sequence and "validate" it (although that
> is odd phrasing; "inspect" might be better); there is also some
> "validation" in that the user is typing, and generally knows
> what keys are hit -- the validation is only in the sense of
> verifying that the correct keys are hit.

And the notion of "correctness" in hitting keys, and in what is
echoed on the screen (which may not be the same thing as the
exact keys hit, of course), is very much a matter of user
validation (or, if you prefer, inspection and approval).


# 2.3.  Character Mappings
# 
#    NFKC [Unicode-UAX15] which converts compatibility
#    characters to their base forms, resolves the different
#    ways in which some characters can be represented in
#    Unicode into a canonical form, and performs one-way case
#    mapping (partially simulating the query-time folding
#    operation that the DNS provides for ASCII strings).

> NFKC does not perform case mapping.

Cut and paste error.  Will be fixed -- my apologies.

# 3.2.1.2.  Conversion to Unicode
 
> This is still a bit much, and is not substantiated with
> examples.

Your opinion is noted.   Ultimately, we have different
perspectives on this and on the related topic of what is in
scope.  Based on your earlier comments, you believe that almost
all systems these days are Unicode-based or, when they use
a local CCS, use ones that were incorporated directly enough
into Unicode that precise mapping is trivial.  We keep hearing
about CCSs based on ISO 2022 code switching, about local
adaptations of what you consider presentation forms, about the
need to distinguish characters that you have unified, and so on.
I suspect that at least part of this has to do with hearing from
different parties, based on their perceptions of what you/we are
willing to hear.

I don't want to be harsh about this or have discussion of it
turn into a distraction, but there appears to be a perception in
some parts of the Internet and character-set-using community
that the Unicode Consortium is sufficiently confident about the
knowledge and skills of the core Unicode technical team that
there is little point in raising problematic issues or critiques
of Unicode with the Consortium generally or with UTC in
particular.  At best, such comments are rejected or ignored; at
worst, they result in the individual or organization raising
them being abused, or set up in ways that are not directly
related to the substance of the concern but that permit them to
be dismissed as uninformed idiots.  Whether the perception is
accurate or not, its existence might explain why we are hearing
some complaints, and about some problems, that aren't making it
through to you or the UTC.


# 3.2.1.4.  Nameprep Mappings
 
> There is reference elsewhere, but should be clear here, that
> if all characters cp such that NFKC(cp) != cp are removed,
> then NFC can be used instead of NFKC.

As we have discussed in other contexts, at least some of us
with experience with poor-quality implementations of, or
deliberate variations on, Internet protocols generally and IDNA
in particular are inclined to be extremely conservative about
some of these steps.  If, in fact, we succeed in eliminating
all cp such that NFKC(cp) != cp, then NFC can be used instead,
but use of NFKC is, at worst, harmless.  Retaining NFKC
provides robustness against future compatibility characters
that, for one reason or another, are not eliminated by the
tabulated "Never" rules.  

Of course, an argument against it and in favor of specifying
things in terms of NFC only would be the possibility of having
future characters introduced that UTC identifies as
compatibility characters for existing ones but that the relevant
language community considers distinct from them.  If NFKC is
used, those characters are effectively permanently banned.  If
only NFC is used, then those characters can be permitted by
appropriate decisions about the "IDN-permitted" property.
Thinking about that tradeoff is leading me to believe that the
document should reference NFC (watch for separate note to the
list).  However, FWIW, any argument that NFC would provide the
potential for an additional check and flexibility in the use of
"IDN-permitted" would argue against having UTC make final
decisions about the content of "IDN-permitted".

# 3.2.2.  Flow Model for Domain Name Resolution (Lookup)

> This needs an example to help substantiate the claims.

In this case, and elsewhere, while examples might help clarify
the text, it does not seem helpful to get into a battle of
examples and counterexamples.  Terms like "substantiate" seem
to imply an invitation to the latter.

# 3.2.2.3.  User Interface Character Changes

> I find the MAY here quite troublesome for backwards
> compatibility. If a webpage right now has <a
> href="http://Bücher.de">... then any IDN compliant browser
> will work correctly. With the proposed change, it may or may
> not fail, depending on the brower (or other interpreter of
> the HTML). I am less concerned by compatiblity (NFKC)
> variants not mapping, simply because of their frequency of
> use, but case changes are not uncommon. We really need to see
> evidence that this will not cause problems before we make
> case mapping a MAY.

On the other hand, we received a good deal of input from parts
of the browser vendor community that, if the domain name in
either the text associated with a link or the link itself
differed from the name that actually appeared in the DNS, they
wanted to treat the name as suspicious.  That position was
reinforced by input from you and your colleagues that strings
that were all lower case were less subject to spoofing than
strings containing upper-case characters.  So we are in a
difficult position, one in which
  <a href="http://bücher.de">
is likely, at least in those browsers and other applications
that imitate them, to be displayed to the user that way if the
link is displayed, while
  <a href="http://Bücher.de"> 
is likely to be displayed as 
  http://xn--bcher-kva.de/
This gets even worse if the context for that link is, e.g.,
  ... please click http://Bücher.de 
      <a href="http://Bücher.de"> ...
which may result in Punycode display or a nasty warning pop-up
whether the link is written in terms of Bücher.de or bücher.de,
since only "bücher" can actually be a U-label.

There is nothing that makes any of these UI behaviors
non-conforming (unless one interprets the standards such that
_any_ display of Punycode in a Unicode-capable environment is
non-conforming), so the browsers are IDN-compliant.  But, from
a user point of view, Punycode display certainly defeats the
intent of having IDNs.   

Note also that exactly the same arguments apply to the use of
compatibility characters in IRIs.  Although we can quibble about
the intent of those doing so, we know that some of them are used
today.   So there is some risk that this will constitute an
incompatible and surprising change.  Making UIs take
responsibility for case-mapping (or compatibility character
mapping) and their consequences (and for being consistent about
how they are handled) and keeping upper-case characters out of
IRIs will actually yield better interoperability and
compatibility in the long term than having in-protocol
requirements about case-mapping.  

The problem discussed in 8.2 reinforces this point, even though
you apparently disagree with it also.

# 6.1.  Display and Network Order
# 
#    Questions remain about protocol constraints implying that
#    the overall

> This is all out of scope; if present, it should probably be
> in an appendix (and needs some work).

What makes it out of scope, Mark?  Users of IDNs in context
believe that it is an issue, and that puts it in scope, even if
nothing could be done about it.

# 6.2.  The Ligature and Digraph Problem

> ditto. Also, the definition and usage of ligature, digraph,
> phoneme needs considerable work.

See above.  If you have specific suggestions about improvements
to those definitions, please make them.  However, note that we
have been led to understand that the definitions used are
appreciably closer to those traditionally used in linguistics
and the study of typography and writing systems over the
centuries than the subset of them that appears in the Unicode
Standard.

# 7.  IDNs and the Robustness Principle

#   Registries, registrars, or other actors who do not do so,
#   or who get too liberal, too greedy, or too weird may
#   deserve punishment that will primarily be meted out in the
#   marketplace or by consumer protection rules and
#   legislation.

> This language seems inappropriate; and examples need to be
> provided.

See comments above about examples.  As far as appropriateness
of language is concerned, this language is generally consistent
with language used in discussions about "enforcement" of
Internet Standards in contexts in which there is no
regulation-based enforcement mechanism.   Again, if you have
specific alternate suggestions, please make them.

# 8.1.  Design Criteria

#       *  Characters that are unassigned in the version of
#          Unicode being used by the registry or application
#          are not permitted, even on resolution (lookup).

> This needs better justification, with examples.

> There is a general problem with the lack of substantiation,
> at least examples of perceived problems motivating the
> changes.

See above.

# 8.2.  More Flexibility in User Agents

#   For example, an essential element of the ASCII
#   case-mapping functions, that uppercase(character) =
#   uppercase(lowercase(character)),

> Replace character by string, and you see that this is false
> for ASCII (and it is not clear what the relevance is).

I have made that replacement and, if it is not true for ASCII
even after it, I'm missing something very fundamental.  Unless
I have somehow mis-stated the condition, it is essential to
the matching rules of the DNS, so, if there is a flaw, it
hasn't been obvious.
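
FWIW, the string form of the condition is easy to check
exhaustively for ASCII, since ASCII case mapping is one-to-one
and context-free, so checking all 128 single characters covers
all strings (a quick sketch of my own):

  # Verify uppercase(s) == uppercase(lowercase(s)) for ASCII.
  ok = all(chr(c).upper() == chr(c).lower().upper()
           for c in range(128))
  print(ok)   # True: the identity holds throughout ASCII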

> (I ran out of time, and will try to get to this next week.)

I look forward to your additional comments.

regards,
   john


