Comments on IDNAbis issues-05

Mark Davis mark.davis at icu-project.org
Sat Jan 19 20:27:34 CET 2008


Original document:
http://www.ietf.org/internet-drafts/draft-klensin-idnabis-issues-05.txt

Thank you very much for providing detailed responses. I'll comment on
them below. Most of my original comments were textual, but some were
model issues (substantive). I'll try to point out which are which,
and then summarize what I think are the substantive issues in a
separate email.

I think we are making a lot of progress, and I see basically two
substantive issues to settle. (That doesn't count areas like BIDI,
where I think we are on course and making reasonably steady
technical progress.)


On Jan 13, 2008 10:48 AM, John C Klensin <klensin at jck.com> wrote:
> --On Wednesday, 09 January, 2008 16:28 -0800 Mark Davis
>
> In rereading both your note and my first draft of my response,
> I realized that I have assumed that most of your comments were
> substantive --i.e., suggesting that the model or underlying
> design of the specification was incorrect-- rather than
> requests for editorial clarifications.  If they were the
> latter, please give me that information and let's try to focus
> on substantive matters now and editorial ones later (or at
> least separately).
>
>
>
> On Dec 13, 2007 7:48 PM, Mark Davis
> <mark.davis at icu-project.org> wrote:
>
> > http://www.ietf.org/internet-drafts/draft-klensin-idnabis-issues-05.txt
> >
> > Overview. Many nice improvements to the text.
>
> Thanks.  It is good to hear that we are making progress.
>
>
> > Issues-1. IDNAbis has a major backwards compatibility issue
> > with IDNA2003: thousands of characters are excluded that used
> > to be valid. What reason might people have to believe that
> > despite the terms NEVER and ALWAYS that some future version,
> > IDNAbis-bis, might not also do the same?

This was not a model issue; it was a textual issue. The issues
document should contain text that justifies the incompatibilities in
more detail.

Now, some of these incompatibilities -- perhaps even most -- I think
we have general consensus on: in particular, the exclusion of
non-LMN (non-Letter, Mark, or Number) characters. But textually, we
can't overplay the value of this change either.

>
> Those "thousands of characters" fall into three categories, and

There is no need to quote "thousands"; it really is thousands any way
you cut it. See below for figures.

(Note: While I'm quoting figures with several digits of precision,
the exact values depend on what is in the tables document, which is
still in flux. So don't read anything beyond the magnitude into
them.)

> it is worth examining them in the following groups.  In no
> particular order, they are:
>
>        (i) Characters that cannot actually be represented in a
>        domain name (i.e., in A-label or ToUnicode(ToASCII(string))
>        form) even though they can be mapped into it.  These
>        characters include upper-case ones and ones mapped into
>        other things by NFKC plus, depending on how things are
>        defined, the "variant dots" that have been extensively
>        discussed on list, I think since your note was sent.  The
>        issues with them have been extensively explored elsewhere,
>        most notably in the recent thread about dot-mapping and my
>        recent response to your note about the
>        "protocol" document.

In terms of numbers, this is about 4,598 code points. That figure is
arrived at by comparing (with the utilities):

[[:idna=remapped:][:idna=ignored:]]
with
[[:L:][:Nd:][:Mn:][:Mc:]
&[:isCaseFolded:]
-[:NFKC_QuickCheck=NO:]
-[:Default_Ignorable_Code_Point:]
 [\u00B7\u05F3\u05F4\u3007\u30FB]
 [a-zA-Z0-9\-]
&[:age=3.2:]]

Here is a link.
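(For anyone who wants to reproduce the count, here is a minimal
sketch using ICU4J's UnicodeSet. It is only an approximation: the
[:idna=...:] and [:isCaseFolded:] properties come from the Unicode
online utilities, not stock ICU4J, so [:isCaseFolded:] is
approximated below by subtracting [:Changes_When_Casefolded:].)

    import com.ibm.icu.text.UnicodeSet;

    public class IdnaSetSize {
        public static void main(String[] args) {
            // The second set from the comparison above, left to right:
            // letters, digits, and marks; restricted to case-folded
            // characters; minus NFKC-unstable and default-ignorable
            // code points; plus a few exception characters and the
            // LDH range; restricted to characters in Unicode 3.2.
            UnicodeSet output = new UnicodeSet(
                "[[:L:][:Nd:][:Mn:][:Mc:]"
                + "-[:Changes_When_Casefolded:]"
                + "-[:NFKC_Quick_Check=No:]"
                + "-[:Default_Ignorable_Code_Point:]"
                + "[\\u00B7\\u05F3\\u05F4\\u3007\\u30FB]"
                + "[a-zA-Z0-9\\-]"
                + "&[:age=3.2:]]");
            System.out.println("output characters: " + output.size());
        }
    }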

That does not count any further exclusions in the tables document,
such as removing historic scripts.

The difference between input and output is a model issue, although one
that I think can be handled along the lines of Michel's proposal:

    * Have a separate document for "Preprocessing IDNAs" -- it would
not be required for the use of IDNA, but strongly encouraged in
certain environments: anywhere backwards compatibility with IDNA2003
is needed, in UIs, and in places where people might be copying
strings that are valid in UIs (e.g., href). A sketch follows below.
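(A minimal sketch of such a preprocessing step, assuming an
IDNA2003-style mapping approximated by NFKC plus case folding;
ICU4J's NFKC_Casefold does both and also removes default-ignorable
code points. The real Nameprep tables differ in some details, so
this is illustrative only.)

    import com.ibm.icu.text.Normalizer2;

    public class IdnaPreprocess {
        // Map a label roughly the way IDNA2003/Nameprep did, then hand
        // the result to the IDNA200X lookup protocol unchanged.
        static String preprocess(String label) {
            Normalizer2 nfkcCf = Normalizer2.getNFKCCasefoldInstance();
            return nfkcCf.normalize(label);
        }

        public static void main(String[] args) {
            System.out.println(preprocess("FAÇADE"));  // prints "façade"
        }
    }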


More on this later.


>
>        (ii) Characters that can be represented in a domain name
>        but that have always (i.e., since IDNA2003 was published)
>        been discouraged or prohibited by various statements and
>        guidelines which were intended to be applicability
>        statements about the protocol.  This group of characters
>        includes characters that are not used to write the words of
>        any language, such as the various symbols, line-drawing,
>        and punctuation objects.  While we know that some of those
>        characters (fortunately a very small percentage) have been
>        used in domain names, it seems to us that a few bad
>        practices, some of them usage on the "because we can"
>        principle rather than out of perceived necessity, should
>        not prevent revising the protocol to make IDNs more robust
>        and interoperable.

This amounts to around 3,000 code points.

Here I think we have general consensus. I just don't think we want to
overplay this in the text. The reasons that we feel comfortable
removing them are:

    * None of these have any great value.
    * They are very rarely used (currently).
    * A small number of them can cause problems (like fraction slash).


Thus the cost of removal is low, and there is some benefit to removal.
But we can't overplay the value of the removal either, because any
user-agent worth its salt will need much more sophisticated
spoof detection to catch the "paypal.com" type of problem anyway. So
the incremental value of removing these characters is maybe a few
percent better than IDNA2003. It is because the value and frequency of
the characters are so low that the incompatibility is worth it.
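(To make "spoof detection" concrete, here is a minimal sketch of the
kind of mixed-script check a user-agent might apply, using ICU4J's
UScript. It catches a Latin label with one Cyrillic 'а' slipped in,
but not whole-script confusables; real detection needs the confusable
data described in UTR #36.)

    import com.ibm.icu.lang.UScript;

    public class MixedScriptCheck {
        // True if the label mixes two or more real scripts, ignoring
        // COMMON and INHERITED characters (digits, hyphens, etc.).
        static boolean isMixedScript(String label) {
            int seen = UScript.INVALID_CODE;
            for (int i = 0; i < label.length(); ) {
                int cp = label.codePointAt(i);
                int script = UScript.getScript(cp);
                if (script != UScript.COMMON
                        && script != UScript.INHERITED) {
                    if (seen != UScript.INVALID_CODE && script != seen) {
                        return true;
                    }
                    seen = script;
                }
                i += Character.charCount(cp);
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(isMixedScript("paypal"));        // false
            System.out.println(isMixedScript("p\u0430ypal"));   // true:
            // U+0430 is Cyrillic 'a'
        }
    }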


>
>        (iii) Characters that are excluded by IDNA200X but
>        permitted by IDNA2003 that do not fall into one of the
>        above groups.  There are very few of these characters,
>        perhaps none, certainly not "thousands" or even "tens".
>        Any of them can be dealt with as special cases if they are
>        important enough.

If we take the very strong approach that Patrik currently takes, where
only Latin, Cyrillic, and Greek are really guaranteed, then over
90,000 characters are no longer guaranteed to be part of IDNA. So it
is "thousands".

I use the phrase "not guaranteed" deliberately. At Google, for
example, we want to be able to look at a URL and say that it is either
compliant or not, and we don't want its compliance to change according
to browser, or in the future. The difference between MAYBE and ALWAYS
is a problem. This is a substantive issue, and I'll raise it
afterwards.

>
> I repeat the above, not because the statements are new (we have
> been over that list, in various forms, many times before), but
> because the answer to your question really lies in it.
> IDNA200X is excluding many characters that were previously
> permitted (by one definition of "permitted" or another) because
> it represents a change in design principles from IDNA2003.
> Those design principles are far more consistent with normal
> design principles for Internet protocols than the earlier ones.
> For example they exclude things that are not demonstrably
> necessary, rather than including everything because it appears
> that one can, and move more of the prohibitions into the
> protocol rather than "guidance" hand-waving about what
> registration entities, at all levels of the domain tree, should
> do.

Trying to assess every last Unicode character to "see if it is
demonstrably necessary" is an intractable problem. More on this issue
separately.

>
> Could the principle change yet again?  In principle, of course.
> The community could, for example, decide that IDNs are
> impossible without language information, that a user or system
> needs to have language information available when strings are
> looked up, and, as a consequence, that the prefix and coding
> structure must be changed to include a language code in both
> the ACE and native strings.  I believe the likelihood of that
> is on a par with your belief in the likelihood of Unicode ever
> removing a character (or fundamentally changing the definition
> of a character, which is the same thing in practice).  But in
> neither case is there a firm way to bind behavior in the
> long term.

The main issue here is MAYBE. If a hostname can change overnight from
valid (MAYBE or ALWAYS) to prohibited (NEVER), that is a major
instability. That's the main issue I was trying to raise. More on
that later.

>
> Just as you have suggested that we incorporate a parenthetical
> note in various places indicating that something could not
> happen without violating a Unicode policy, it would be
> plausible to insert notes of that type in some of this text.
> However, there is a difference which may be important.
>
> Let me illustrate with an example from my history with
> programming languages and standards.  The users expected that
> the language definitions and implementations would be much more
> stable than the applications they wrote using them.  If
> something compiled one day and not the next or, worse, compiled
> both days but produced different behavior, the programmers and
> perhaps the end-users would typically be severely irritated
> --even if the change was to fix a bug that they had worked
> around with the effect that the change broke the work-around
> and hence the program.  We are, I think, in much the same
> situation with Unicode.  For applications that depend on the
> standard to be stable, the requirements for Unicode stability
> have to be much stronger than the requirements on the
> applications themselves.   IDNs are not, in that sense, quite
> an application but just as stability in IDN behavior is
> necessary to keep application behavior stable, an even
> stronger standard of stability in Unicode is needed to keep
> IDNs stable.

Stability of ALWAYS/MAYBE is a key issue -- see other threads, though,
for its resolution.

>
>
>
> > Issues-2. IDNA provided for backwards compatibility, by
> > disallowing Unassigned characters in registration, but
> > allowing them in lookup. That let old clients work despite
> > new software. While once we update to U5.1 that is not as
> > much of a problem, it should be made clear why this change is
> > made.
>
> It has been explained several times and in several places.  I
> can try adding more text here, but the key reason is that there
> is no way to know the normalization and character class
> properties, or other key properties, of a code point that has
> not been assigned yet.  I thought we had agreement on the
> importance of that principle back when stabilized string
> normalization and what is now called NPSS were being discussed.

While you've given an explanation, there needs to be more. Suppose we
have an unassigned code point X in Unicode version V1, that is later
assigned in V2. We'll consider a possible IDNA "aXb.com". The model in
IDNA2003 is that:

    * if aXb is not a valid input label in V2, there is no change from
V1 to V2 -- whether or not the registry or user-agent upgrades to V2.
    * if aXb is a valid output label in V2, then once the registry
upgrades to V2, it can be registered, and the user-agent will work
without an upgrade to V2.
    * if aXb is a valid input but not a valid output label in V2
(e.g., it needs normalization), then it will only work once both the
registry and the user-agent upgrade to V2.


The advantage of this approach is that user-agent 1 (working on V2)
can normalize a label to aXb (which works), and pass it to user-agent
2. User-agent 2 can still process aXb correctly despite not being on
V2.
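(The three cases restated as a sketch; "valid input" here means the
label is accepted for processing under V2, and "valid output" means
it is already in normalized, registerable form. Purely illustrative.)

    public class Idna2003Compat {
        enum Outcome {
            NO_CHANGE,                // invalid under V1 and V2 alike
            WORKS_AFTER_REGISTRY_V2,  // old user-agents keep working
            NEEDS_BOTH_ON_V2          // e.g., label needs normalization
        }

        // The IDNA2003 model for a label aXb whose code point X is
        // unassigned in Unicode V1 and assigned in V2.
        static Outcome classify(boolean validInputInV2,
                                boolean validOutputInV2) {
            if (!validInputInV2) return Outcome.NO_CHANGE;
            if (validOutputInV2) return Outcome.WORKS_AFTER_REGISTRY_V2;
            return Outcome.NEEDS_BOTH_ON_V2;
        }
    }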

What is clearly a problem in IDNA2003 is that it was unclear *what*
was being registered, the input form or the output form. That problem
is solved in IDNAbis; it is only the "output" (normalized) form. But I
haven't seen text that clearly states what was wrong with the rest of
the above model for unassigned characters. Perhaps I have missed the
text, and you could point me to it.

However, bottom line, as far as this text goes, what we could say is:

Because the characters added after Unicode 5.1 will be far less
frequently used, there is no need for a model that tries to allow for
forward compatibility for unassigned characters.

>
>
> > Issues-3. In general, whenever a statement is made about some
> > class of characters causing a problem, at least one clear
> > example should be provided, as in
> > draft-alvestrand-idna-bidi-01.txt
> > <http://www.ietf.org/internet-drafts/draft-alvestrand-idna-bidi-01.txt>
>
> Noted.  However, in some places, examples have been omitted
> because we know that they would cause more controversies than
> they are worth.  See comments below about appropriation of
> languages into scripts.

I strongly disagree here. In my experience, when trying to address a
problem, you always have to have clear examples that illustrate the
problem before trying to correct it. Unless we can provide them, we
have no leg to stand on.

(A textual, not a model, issue.)

>
>
> > Issues-4. I would strongly suggest separating all of the "why
> > did we do this" and "how is it different from IDNA2003" into
> > a separate document. It will be of only historical interest
> > after this becomes final, and will then only clutter the
> > document.
>
> This document is primarily about explanation and rationale (and
> should probably be retitled accordingly).   In other words,
> "issues" is intended to become that "separate document", with
> all of the actual protocol materials moved into the "protocol"
> specification. If you believe there is remaining material here
> that should be there, your help in identifying it would be
> appreciated.  However, in circumstances in which client
> implementers are likely to do whatever they think best (see
> discussion in my response to your notes on "protocol") the type
> of material you characterize as "why did we do this" may be
> important for more than historical reasons.  It, far more than
> some "IETF stamp of approval", may be critical in persuading
> people that this is the right thing to do and to do it.

Good. Making this a Rationale document is a good approach.


>
>
>
> > Details.
> > Issues-5.
> >
> >    IDNA uses the Unicode character repertoire, which avoids
> >    the significant delays that would be inherent in waiting
> >    for a different and specific character set be defined for
> >    IDN purposes, presumably by some other standards
> >    developing organization.
> >
> > Seems odd. There are no other contenders in the wings. Would
> > be better, if this has to be said, to just cite other IETF
> > documents describing the reasons for using Unicode.
>
> A frighteningly large (frightening, at least, to me) number of
> the discussions of IDNA in the target communities start or end
> with a statement of belief that Unicode is hopelessly broken or
> very badly designed for their purposes, at least with their
> languages or scripts.  While some of those concerns ultimately
> involve misunderstandings, others involve information that
> appears to be needed but isn't available, presentation form
> difficulties, etc.

Any enterprise as large as Unicode, dealing with issues as complex
and with as much impact, is bound to have controversies. Some of
these may be legitimate; most are based on misunderstandings and
resolve over time. For example, a number of Japanese users were
unhappy with Unicode in the early 90s -- that has long since blown
over. Khmer was a hot topic a few years ago, with some high-level
meetings with representatives from the Cambodian government. But just
recently Lisa encountered a minister involved in those discussions,
and he expressed how happy people were with Unicode. Yet there are
still some people unhappy with the way Unicode represents Khmer, and
you'll undoubtedly hear from them.

The key question is whether all text in any given language can be
represented by a sequence of Unicode characters.

The principle we adhere to is to ensure that Unicode is sufficient to
represent the text of languages using a particular script. It may or
may not be the "preferred" way according to particular groups of
people. The goal is representability. Often there are philosophical
differences within the language communities, and you hear most
vocally from the people whose position didn't end up being
incorporated in Unicode. There are cases where we would have done
things differently if we could roll back the clock, but stability
prevents us from making a change. We do make additions, and sometimes
behavioral changes, where it turns out that the set of characters or
behavior is insufficient for representing all text in a particular
script. For example, in Unicode 5.1 we added some characters for
Myanmar because the previous model missed a few written distinctions
made in Burmese and other languages using the Myanmar script.

> Some of the issues involved are very
> specific to IDNs (e.g., when language information needed,
> Unicode strings in XML can be tagged with it, but domain names
> cannot be).
> However odd it might be, that paragraph is an
> attempt to say to the folks who want to propose different
> character coding systems that, to use your words above, "there
> are no other contenders in the wings" and it is time to adapt
> as needed to Unicode's coding structure and move on.  Proposals
> as to better ways to say that would be welcome but, although
> I'd welcome pointers to alternatives, I don't see a better
> place to put it.

How about simply:

IDNA uses the character repertoire from the Unicode Standard, which is
the preferred representation of international text on the web.
>
>
> > Issues-6.
> >
> >    To improve clarity, this document introduces three new
> >    terms.  A string is "IDNA-valid" if it meets all of the
> >    requirements of this
> >
> >    specification for an IDNA label.  It may be either an
> >    "A-label" or a "U-label", and it is expected that specific
> >    reference will be made to the form appropriate to any
> >    context in which the distinction is important.
> > ...
> >    A "U-label" is an IDNA-valid string of
> >    Unicode-coded characters that is a valid output of
> >    performing ToUnicode on an A-label, again regardless of
> >    how the label is actually produced.
> >
> > These definitions appear circular, so they need to be teased
> > out a bit.
>
> They are certainly not circular.  I'll take another look at the
> form of the definitions.  Again, specific suggestions welcome.
>

   1. IDNA-valid = A-label OR U-label
   2. U-label = IDNA-valid string of ....


That appears circular: definition 1 depends on definition 2, and
definition 2 depends on definition 1.

I think it is sufficient to make the change:

A "U-label" is an IDNA-valid string of
=>
A "U-label" is a string of

because the real condition (looking at the text) is "valid output of
ToUnicode..."

>
>
> > Issues-7.
> >
> >    Depending on the system involved, the major difficulty may
> >    not lie in the mapping but in accurately identifying the
> >    incoming character set and then applying the correct
> >    conversion routine.  It may be especially difficult when
> >    the character coding system in local use is based on
> >    conceptually different assumptions than those used by
> >    Unicode about, e.g., how different presentation or
> >    combining forms
> >    are handled.  Those differences may not easily yield
> >    unambiguous conversions or interpretations even if each
> >    coding system is internally consistent and adequate to
> >    represent the local language and script.
> >
> > I suggest the following rewrite:
> >
> > The main difficulty typically is that of  accurately
> > identifying the incoming character set so as to apply the
> > correct conversion routine. Theoretically, conversion could
> > be difficult if the non-Unicode character encoding system
> > were based on conceptually different assumptions than those
> > used by Unicode about, e.g., how different presentation or
> > combining forms are handled. Some examples are the so-called
> > "font-encodings" used on some Indian websites. However, in
> > modern software, such character sets are rarely used except
> > for specialized display.
>
> While it is clearly easier to read, it doesn't say quite the
> same thing.  I will try to rework. More important, there seems
> to be some disagreement about the last comment.  Can you supply
> a citation to something that I can use in the document?  That
> should ideally be to other than a Unicode Consortium study or
> opinion, since the people who claim that these transcoding
> functions are causing pain tend to believe their solutions are,
> or should be, much more widely deployed and that Unicode
> Consortium statements and data articulate the organization's
> own views by definition, and can therefore not be unilaterally
> invoked in as fully objective discussion.

Unicode and its member companies and organizations, plus ISO/IEC JTC1
SC2/WG2, have had concerted, long-term efforts to identify all code
pages that were in any kind of common use, and to make sure that there
were mappings to Unicode/10646. The only concrete example I know of is
the Indic code pages, but they are only used for display of HTML (and
they are being replaced by UTF-8, which is growing apace on the web
according to our (Google's) data).

Unless you have some concrete example of a code page that doesn't map
to Unicode and is in any kind of common usage, you don't have any
evidence to make that claim.

>
>
> > Issues-8.
> >
> >    That, in turn, indicates that the script community
> >    relevant to that character, reflecting appropriate
> >    authorities for all of the known languages that use that
> >    script, has agreed that the script and its components are
> >    sufficiently well understood.  This subsection discusses
> >    characters, rather than scripts, because it is explicitly
> >    understood that a script community may decide to include
> >    some characters of the script and not others.
> >
> >    Because of this condition, which requires evaluation by
> >    individual script communities of the characters suitable
> >    for use in IDNs (not just, e.g., the general stability of
> >    the scripts in which those characters are embedded) it is
> >    not feasible to define the boundary point between this
> >    category and the next one by general properties of the
> >    characters, such as the Unicode property lists.
> >
> > There is no justification given for this process. Moreover,
> > it will be doomed to failure. Merely the identification of
> > "script communities" is an impossible task. Who speaks for
> > the Arabic script world? Saudi Arabia (Arabic)? Iran
> > (Persian,...)? Pakistan (Urdu,...)?, China (Uighur,...)?
>
> Actually, the description above, although it could be made much
> longer and more detailed, is the motivation and justification
> for some careful per-script process that involves
> representatives of each script community.  This is obviously a
> key problem.  The fact that non-specialists think in terms of
> languages and language-specific writing systems, not scripts
> (as Patrik has pointed out many times), makes it much more
> difficult.

Agreed. That is a huge problem. The Latin script is used by hundreds
of languages; who speaks for each of those languages? Trying to set up
any process that would terminate within our lifespans, or even our
children's, is hopeless.

>
> I note that, while you say that identification of script
> communities is an "impossible task", you don't offer a solution
> or even a better suggestion.

My suggestion is and has been: permit all characters from all modern
scripts. Those are easily identified, and they do not disadvantage any
language group. This protocol is the wrong place to be making
fine-grained linguistic determinations. Restrictions can be imposed by
registries, user-agents, and other parties. More on that in a
separate mail.

===========
Slightly out of order, a side comment on your side comment
...
> With the understanding that some of these are the same folks
> who are arguing for junking Unicode for IDN purposes and
> starting over (see my comments on your "protocol" notes and
> above), the one thing that they will agree on is that some
> small group of people operating out of California (or even the
> US) are not the appropriate decision-makers for the use of
> their languages in IDNs.

This is not some "small group in California". It is made up of
representatives from major computer companies and other organizations
that have a strong interest in seeing that their customers' needs are
met, and it has been operating for almost two decades in close concert
with ISO/IEC JTC1 SC2/WG2, which has representatives from a large
number of ISO member national bodies.
===========

>
>
>
> > Issues-9.
> >
> >       it is removed from Unicode.
> >
> > (multiple instances)
> >
> > This is not necessary; characters aren't removed from
> > Unicode. If you really have to have it, then add "(however,
> > the Unicode stability policies expressly forbid this)"
>
> See comment on this subject in the "protocol" response and the
> comments on your Issues-1 above.  If it will never happen, then
> the condition will never apply.

Adding all possible caveats, no matter how unlikely, is not
particularly productive. You might as well add a caveat for ASCII
characters about their being removed, since "if it will never happen,
then the condition will never apply."

>
>
> > Issues-10.
> >
> >    Applications are expected to not treat "ALWAYS" and "MAYBE"
> >    differently with regard to name resolution ("lookup").
> >    They may choose to provide warnings to users when labels
> >    or fully-qualified names containing characters in the
> >    "MAYBE" categories are to be
> >
> > In practice, expecting applications to treat these
> > differently is wishful thinking; especially if it seems
> > Eurocentric to users (see other notes on MAYBE). In practice,
> > registries always have the ability to filter characters out.
> > See above on removing Maybe.
>
> You didn't include your notes on removing MAYBE in this
> message, but let me respond on the basis of previous remarks by
> you and others.
>
> We've had many requests, from browser vendors, from zone
> administrators (many of them completely unaffected by whatever
> policies ICANN makes about TLDs they control or advise), and
> others who are dependent on domain names and things that look
> like them (including parts of the security and PKI Certificate
> community), and others for guidance about what is safe, what is
> clearly bad, and where they need to exercise caution.  Some of
> those requests have come from users of scripts that you have
> implied that we have put into MAYBE because of a Eurocentric
> view -- they are encouraging us to go cautiously on their scripts
> until they have time locally to get some issues straightened
> out. Note that some of these requests are on the registration
> or registration policy side and others are on the lookup or
> other application side.

Those should be explicitly cited in this document, in that case. This
is exactly what I meant about specific examples and justification.
- "requests have come from users of a script": which users, and which
scripts? What are their concerns? Unless there are specifics, nobody
reading this document can judge whether these are reasonable requests
or not.

>
> The MAYBE category, the distinction between MAYBE YES and MAYBE
> NO, and the discussion associated with them, are intended to
> provide exactly that guidance, especially in cases where we, or
> the relevant script community (see above), have concluded that
> there are still details to be sorted out.  Now perhaps you are
> right and that, having asked for guidance, those groups will
> ignore it and do whatever they like locally.  I would predict
> that they would make use of the advice.  However, if they did
> not, removing the distinctions implied by MAYBE somewhere down
> the line would be fairly easy.  By contrast, adding
> distinctions or restrictions later always proves very
> difficult, especially with regard to forward and backward
> compatibility.
>
> The observation that MAYBE NO and MAYBE YES are
> indistinguishable from a lookup protocol standpoint does not
> make them useless for all purposes.
>
> Finally, while registries always have the ability to forbid
> specific characters (I'm not quite sure what "filter characters
> out" would mean on the registry side of things), some
> registries make their own registration policies and some do
> not.  For those who make policies, categories and distinctions
> may be helpful in discussing and writing possible rules even if
> those categories don't directly affect the protocol.  The text
> suggests some ways in which such rules could be formulated.
> Again, in the contexts in which extra distinctions are not
> important, it is harmless to ignore them.

See forthcoming document.

> > Issues-11.
> >
> >    5.1.3.  CONTEXTUAL RULE REQUIRED
> >
> > I know what the point is supposed to be (and don't disagree),
> > but this section was very hard to make out.
>
> I'll take another look at the text.  Thanks.  Again, specific
> suggestions (from you or others) would be welcome.
>
>
> > Issues-12.
> >
> >    Characters that are placed in the "NEVER" category are
> >    never removed from it or reclassified.  If a character is
> >    classified as "NEVER" in error and the error is
> >    sufficiently problematic, the only recourse is to
> >    introduce a new code point into Unicode and classify it as
> >    "MAYBE" or "ALWAYS" as appropriate.
> >
> > The odds of this happening are extremely low. Anything in
> > Never has to be extremely certain.
>
> We agree (and that is another reason for the somewhat gray area
> described as "MAYBE").  Do you think that needs to be said
> more clearly?

This is all too complex. See other thread on MAYBE.

>
>
>
> > Issues-13.
> >
> >    Instead, we need to have a variety of approaches that,
> >    together, constitute multiple lines of defense.
> >
> > Defense against what? Without examples, it is hard to say
> > what the problems are.

What we are worried about is spoofing, and there are examples aplenty
in UTR #36.

>
> <<<>>>
>
> > Issues-14.
> >
> >    Applications MAY
> >    allow the display and user input of A-labels, but are not
> >    encouraged to do so except as an interface for special
> >    purposes, possibly for debugging, or to cope with display
> >    limitations.  A-labels are opaque and ugly, and, where
> >    possible, should thus only be exposed to users who
> >    absolutely need them.  Because IDN labels can be rendered
> >    either as the A-labels or U-labels, the application may
> >    reasonably have an option for the user to select the
> >    preferred method of display; if it does, rendering the
> >    U-label should normally be the default.
> >
> > Add:
> >
> >  It is, however, now common practice to display a suspect
> >  U-Label (such as a mixture of Latin and Cyrillic) as an
> > A-Label.
>
> I deliberately didn't say that although I certainly agree that
> it is true.  Let me explain why and then ask for your
> suggestions and those of others as to how to handle this.
>
> After a series of informal and very unscientific tests with end
> users, I'm convinced that they don't see A-labels as much
> different from a sequence of question marks or boxes.  Both are
> completely opaque with regard to what the original string was
> and that is bad news if there is some possibility that there
> are some labels that are legitimate but would fail the test and
> be displayed in A-label form... the user would have essentially
> no way to discriminate between legitimate mixed-script labels
> and evil ones.
>
> In retrospect, while "display punycode (A-labels) instead" was
> a good fast patch once the problems with phishing and suspect
> labels became clear, I think your original suggestion of
> displaying the strings in U-label form but with some special
> emphasis (colors, special fonts, flags, warning dialog boxes
> when touched,...)  is a much better idea for "suspect" labels
> and that "display A-labels" should be reserved for labels that
> are known to be bad, not just ones that are suspect because of
> some combination that may actually be legitimate.
>
> I note, for example, that various Russians have claimed that
> Cyrillic-Latin mixtures are absolutely essential.  If one
> accepts their position, then one either needs to do a lot of
> localization about what the user finds reasonable or to treat
> Cyrillic-Latin mixes differently from Latin-Cyrillic ones (and
> I have no idea how one would write a rule that described the
> difference).
>
> Does this need to be discussed in "issues", regardless of what
> is or is not said about common practice?

I fully agree with you on this matter concerning the UI. We could urge
other mechanisms, and point to descriptions of them, but we can't
ignore common practice -- we need to mention it.

>
>
>
>
>
> > Issues-15.
> >
> >    6.3.  The Ligature and Digraph Problem
> >
> >    There are a number of languages written with alphabetic
> >    scripts in which single phonemes are written using two
> >    characters, termed a
> >
> >    "digraph", for example, the "ph" in "pharmacy" and
> >    "telephone".
> >
> > The text has been improved considerably from earlier
> > versions, but the whole issue is just a special case of the
> > fact that words are spelled different ways in different
> > languages or language variants. And it has really nothing to
> > do with ligatures and diagraphs. The same issue is exhibited
> > between theatre.com and theater.com as between a Norwegian URL
> > with ae and a Swedish one with a-umlaut.
> >
> > So if you retain this section, it should be recast as
> > something like
> >
> >
> > 6.3 Linguistic Expectations
> >
> >  Users often have certain expectations based on their
> >  language. A Norwegian user might expect a label with the
> > ae-ligature to be treated as the same label using the Swedish
> > spelling with a-umlaut. A user in German might expect a label
> > with a u-umlaut and the same label with "ae" to resolve the
> > same. For that matter, an English user might expect
> > "theater.com" and "theatre.com" to resolve the same. [more in
> > that vein].
>
> I think this is an improvement, although part of that text was
> motivated by the list discussion about the handling of Eszett
> (Latin Small Letter Sharp S, U+00DF) and its friends.  The
> anomaly there is that it is treated as a "real" Latin character
> by NFKC (and special-cased in IDNA2003), while its very close
> relative U+FB05 is treated as a ligature, named as one ("Latin
> Small Ligature Long S T") and defined as a compatibility
> character that is mapped out by NFKC.
>
> One can argue for the difference on the basis of frequency of
> use and on the basis of what the relevant community decided to
> include in ISO 8859-1 (I assume the latter was the ultimate
> cause of the Unicode decision).  But, if one is going to treat
> one as a letter and the other as a compatibility ligature, then
> some communities think it is strange.  Perhaps that is the only
> case of this, perhaps not, but we have heard claims that, e.g.,
> some of the Arabic ligatures raise similar issues.

OK, then it sounds like the recast paragraph is fine, plus some
additional text on eszett.

>
>
> > Issues-16.
> >
> >  there is no evidence that
> > they are important enough to Internet operations or
> > internationalization to justify large numbers of special cases
> > and character-specific handling (additional discussion and
> >
> > I suggest the following wording instead:
> >
> >  there is no evidence that
> >  they are important enough to Internet operations or
> >  internationalization to justify inclusion (additional
> >  discussion and
> >
> > It doesn't actually involve "large numbers of special cases",
> > there are a rather small percentage of demonstrable problems
> > in the symbol/punctuation area. What we could say is that
> > there is general consensus that removing all but letters,
> > digits, numbers, and marks (with some exceptions) causes
> > little damage in terms of backwards compatibility, and does
> > remove some problematic characters like fraction slash.
>
> Much better.  Thanks.

good.

>
>
>
> > Issues-17.
> >
> >    For example, an essential
> >    element of the ASCII case-mapping functions is that
> >    uppercase(character) must be equal to
> >    uppercase(lowercase(character)).
> >
> > Remove or rephrase. It is a characteristic, but not an
> > essential one. In fact, case mappings of strings are lossy;
> > once you lowercase "McGowan", you can't recover the original.
>
> Hmm.  The original statement doesn't say "not lossy" (or, more
> accurately, "not lossy from an information standpoint").  It
> just says that a particular function applies.  It is
> incontrovertible that the function applies.  The fact that the
> function applied (and that it could be applied very
> mechanically) was one reason for the case-matching rule that
> made it into the base DNS specs from the Hostname rules.  And
> it is easy to demonstrate that it does not apply in every case
> once one starts examining the Unicode characters with a case
> property rather than the version of Base Latin characters that
> appears in ASCII/ ISO 646IRV.

Can you explain why this is an "essential element"? I just don't see
it. Case folding does not require that uppercase(character) equal
uppercase(lowercase(character)). It just requires that one can come
up with an idempotent mapping such that

CF(character) = CF(uppercase(character))
              = CF(lowercase(character))
              = CF(titlecase(character))
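(A quick illustration of that invariant with ICU4J, showing that it
holds even for lossy mappings like "McGowan" and eszett:)

    import com.ibm.icu.lang.UCharacter;

    public class CaseFoldInvariant {
        static String cf(String s) {
            return UCharacter.foldCase(s, true);  // full default folding
        }

        public static void main(String[] args) {
            for (String s : new String[] { "McGowan", "Stra\u00DFe" }) {
                String up = UCharacter.toUpperCase(s);
                String lo = UCharacter.toLowerCase(s);
                // Folding the string, its uppercase, and its lowercase
                // all yield the same result, even though the individual
                // case mappings are lossy.
                System.out.println(cf(s).equals(cf(up))
                    && cf(s).equals(cf(lo)));  // prints true twice
            }
        }
    }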

>
>
>
> > Issues-18.
> >
> >    o  Unicode names for letters are fairly intuitive,
> >    recognizable to uses of the relevant script, and
> >       unambiguous.  Symbol names are more problematic because
> >       there may be no general agreement on whether a
> >       particular glyph matches a symbol, there are no uniform
> >       conventions for naming, variations such as outline,
> >       solid, and
> >
> > Actually, the formal Unicode names are often far from
> > intuitive to users of the relevant script. That's because the
> > constraints of using ASCII for the name, to line up with ISO
> > standards for character encodings.
>
> > This section is not really needed. The use of I<heart>NY.com
> > is not really problematic; the main justification for
> > removing it is that we don't think it is needed (and has not
> > been used much since IDNA was introduced). Better to just
> > stick with that.
> >
>
> I would welcome a better way to say this.  But I<heart>NY.com
> is problematic because the user hearing it described that way
> has no way to know --absent careful study of the code tables--
> whether it is ambiguous or not.  Conversely, the user seeing
> the "name" doesn't know whether to read it --try to transmit it
> to a colleague by voice-- as I<heart>NY.com (or any of the
> examples below) or as I<love>NY.com, where the latter is
> undoubtedly what the owner of such a domain would intend.
> Please consider, as an example, U+2605 and U+2606.  While
> <heart> may be unambiguous in U+2764 (actually it is not
> either, but let me come back to that), one could not have
> I<star>NY.com because one has an immediate ambiguity, at the
> level of the Unicode character name, between I<black
> star>NY.com and I<white star>NY.com.  Worse, because these are
> not what we have informally called language characters, the
> user has no way to know, absent careful study of the Unicode
> tables, whether <white star> is just a different font
> stylization of <black star> or is a separate character.
>
> Of course, it is even worse than this.  There is no possible
> way for a normal, casual, user to tell the difference between
> the stars of U+2606 and U+2729 or the hearts of U+2665 and
> U+2765 without somehow knowing to look for a distinction.
> Since we have a white heart (U+2661) as well a few black
> hearts, the problem with hearts is ultimately the same as the
> one with stars and I<heart>NY.com is hopelessly ambiguous.  I
> note that I live in an area in which someone might well believe
> that harvard<square>.com would be an attractive domain and that
> there are far more Squares of various flavors than there are
> hearts or stars.
>
> Unlike font and style variations in language (and
> "mathematical") characters, identification of compatibility
> encodings and application of NFKC is of no help here.  All of
> these symbols (and many other pairs and triples) are treated as
> valid, independent, non-reducible, code points.
>
> We just do not need this stuff.  They are serious threats to
> interoperability and clear descriptions of domain names.
>
> By the way, while one could have a discussion about motivation,
> there are several registered domain names, appearing in TLDs,
> that use characters that the "no symbols" rule excludes.  See
> the comments under Issues-1 (ii) above.

We are all in agreement that we don't need symbols. But the argument
you give is unnecessary in the text, and very debatable.

There are some 100-ish characters in Chinese pronounced "yi1" (I'm
not sure of the tone, but it is a specific one). It is not really a
problem that people have to give some extra information to describe
such a character orally, nor do we want to say that the character
cannot be used! People find ways of communicating these things on
their own: how many times in English do we say "fifty -- that's
five-zero, not one-five"?

Similarly, I can say "EYE heart-symbol EN WHY DOT COM (with a white
heart)", and distinguish I♡NY.com from I♥NY.com. And if people find
that nobody uses that symbol, they'll drop the registration. In the
grand scheme of things, compared with "paypal.com" with either a
digit 1 or a Cyrillic 'a', this problem is utterly in the noise.

Moreover, there are many more complicated URLs that people try to
communicate. How many of you have been in telecons where someone
painfully reads off a URL like

http://www.amazon.com/Statistical-Scientific-Database-Management-International/dp/B000UIQACY/ref=sr_1_1?ie=UTF8&s=books&qid=1200770132&sr=1-1

URLs can be vastly more complicated to read out than a simple "EYE
heart-symbol EN WHY DOT COM (with a white heart)".

We have sufficient reason to remove symbols without this argument, so
the text should simply be removed.

>
>
> > Issues-19
> >
> >    11.  IANA Considerations
> >
> >    11.1.  IDNA Permitted Character Registry
> >
> >    The distinction between "MAYBE" code points and those
> >    classified into "ALWAYS" and "NEVER" (see Section 5)
> >    requires a registry of characters and scripts and their
> >    categories.  IANA is requested to
> >
> > Expecting an IANA registry to maintain this is setting it up
> > for failure. If this were to be done, precise and lengthy
> > guidance as to the criteria for removing characters (moving
> > to NEVER) would have to be supplied, because of the
> > irrevocable nature of this step. The odds of a registry being
> > able to perform this correctly are very small.
>
> We do not expect IANA to make the decisions, only to maintain
> the registry (IETF rarely expects IANA to make decisions about
> protocol registries any more, so this is not an exceptional
> case).   We are working on procedures but, of course, this
> interacts with the discussion about "script communities"
> above.

see above.

>
>
> > The best alternative would be to simply have all the
> > non-historic scripts have the same status in
> > draft-faltstrom-idnabis-tables-03.txt
> > <http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt>, by
>
> > moving the non-historic scripts to the same status as Latin,
> > Greek, and Cyrillic.
>
> Only if one believes that the IDN implications of using all of
> those scripts, and the restrictions that need to be applied,
> are as well understood as they are for Latin, Greek, and
> Cyrillic.  Some of the language communities (and I mean
> "language" here and not "script") do not believe that.

Let's see some concrete examples here. What are the problems you
think exist, or what are the problems that have been reported? We
cannot come to reasonable conclusions if the possible evidence is not
brought into the light of day for review.

>
>
> > The second best would be to have the Unicode consortium make
> > the determinations (and take the heat for objections).
>
> I'd like to put that discussion off to another time, if only in
> the interest of getting these notes out to you.  But some of
> those language communities have made it clear that while, for
> example, they might trust determinations by ISO/IEC JTC1/SC2
> and its formal approval processes linked to National Member
> Bodies, they are strongly disinclined to ascribe similar
> authority to the Unicode Consortium.  However misguided that
> attitude might be, I don't believe the IETF is able (or
> willing) to take the heat that would result from handing this
> over to the Unicode Consortium.
>
> A different way to look at that situation is that there is some
> feeling in some of the language communities that an independent
> check and signoff is needed on whether the way that the Unicode
> Consortium has handled a particular script is adequate for IDN
> use.

Again, citations/examples are necessary. (It is often not clear who
speaks for a given language group.)

> In that context, having the Unicode Consortium review its
> own work would obviously not be considered appropriate.

ISO/IEC JTC1 SC2/WG2 might be a possibility, but I don't think we
need a heavyweight process in the first place. Let's see after you
provide the problem cases.

>
>
>
> > Issues-20
> >
> >    Some specific suggestion
> >    about identification and handling of confusable characters
> >    appear in a Unicode Consortium publication [???]
> >
> > Use: [UTR36]
> >       UTR #36: *Unicode Security Considerations*
> >       http://www.unicode.org/reports/tr36/
>
> Ack.  Thanks.  Actually, my apologies -- I knew the
> reference and had intended to go back and insert it before
> posting the draft, then slipped up.  This has been fixed in the
> working text for -06 since a _very_ short time after -05 was
> posted.

no problem.

>
> Again, thanks for the comments. This is a difficult process,
> but one that certainly results in a better specification.
>
>    john
>
>



-- 
Mark

