I-D Action:draft-ietf-idnabis-mappings-00.txt

Mon Jun 29 06:28:56 CEST 2009

Returning to the discussion, now that some of my other standards work is
under control (RFC4646bis was approved, whew!).

------------------------------
I've had a chance to do some data mining, and it is now clear which are the
most prominent characters that are remapped under the current scheme (the
relative frequencies vary, as one might expect, depending on the language):
they are case variants, width variants, and presentation variants.

A copy of this email is at http://www.macchiato.com/unicode/idna/remap

Now, my position is still that the simplest and most compatible option open
to us is to simply map with NFKC + Casefold. However, in the interest of
getting this process moving, I offer the following as a possible compromise
approach. It limits the remapped characters to those that are the most
useful in practice. I'll first give the proposal, then list some of the
details afterwards.

Proposal

A. Tables document

Add a new type of character: REMAP. A character is REMAP if it meets
*all of* the following criteria:

   1. The character is not PVALID or CONTEXTO
   2. If remapped by the Unicode property NFKC_Casefold*, then the resulting
   character(s) are all PVALID or CONTEXTO
   3. The character is a LetterDigit or Pd
   4. The character has one of the following Decomposition_Type values:
   initial, medial, final, isolated, wide, narrow, or compat
   5. The character does not have the Script value: Hangul

The REMAP characters are removed from DISALLOWED, so that the TABLES values
form a partition (all the values are disjoint).
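
To make the five criteria concrete, here is a rough Python sketch of the test. Several things in it are my assumptions and not part of the proposal: NFKC_Casefold is approximated with the standard library (NFKC, casefold, NFKC again), the "PVALID or CONTEXTO" check is a crude stand-in called toy_is_valid (the real answer comes from the Tables document), and Script=Hangul is approximated by looking at the character name, since the standard library does not expose the Script property.

```python
import unicodedata

# IDNA "LetterDigits" general categories (Ll, Lu, Lo, Nd, Lm, Mn, Mc), plus Pd.
LETTER_DIGIT = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}
LETTER_DIGIT_OR_PD = LETTER_DIGIT | {"Pd"}
REMAP_DECOMP_TYPES = {"initial", "medial", "final", "isolated",
                      "wide", "narrow", "compat"}


def nfkc_casefold_approx(text):
    """Approximate NFKC_Casefold with stdlib calls (an assumption)."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


def toy_is_valid(ch):
    """Crude stand-in for 'PVALID or CONTEXTO'; NOT the real Tables rules."""
    stable = nfkc_casefold_approx(ch) == ch
    return stable and (ch == "-" or unicodedata.category(ch) in LETTER_DIGIT)


def decomposition_type(ch):
    """Return the tag from unicodedata.decomposition(), e.g. 'wide', or None."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp[1:decomp.index(">")]
    return None


def is_remap(ch, is_valid=toy_is_valid):
    """Apply criteria 1 to 5 to a single character."""
    if is_valid(ch):                                            # criterion 1
        return False
    if not all(is_valid(c) for c in nfkc_casefold_approx(ch)):  # criterion 2
        return False
    if unicodedata.category(ch) not in LETTER_DIGIT_OR_PD:      # criterion 3
        return False
    if decomposition_type(ch) not in REMAP_DECOMP_TYPES:        # criterion 4
        return False
    # Criterion 5: the character name is used as a rough proxy for
    # Script=Hangul, because the stdlib has no Script property.
    return "HANGUL" not in unicodedata.name(ch, "")
```

A real implementation would pass in a predicate backed by the actual Tables values in place of toy_is_valid.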

B. Protocols document

Change sections 4.2.1 and 5.3 so as to require:

   1. Mapping all REMAP characters according to the Unicode property
   NFKC_Casefold,
   2. Then normalizing the result according to NFC.

The rest of the tests for U-Label remain unchanged.
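
As an illustration of the two steps, here is a minimal Python sketch. It assumes the REMAP set is available as a precomputed table (REMAP_SAMPLE below is a hypothetical three-character stand-in), and it approximates NFKC_Casefold with standard-library calls, so treat it as a sketch and not as the exact property.

```python
import unicodedata


def nfkc_casefold_approx(text):
    """Approximate NFKC_Casefold with stdlib calls (an assumption)."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


def map_label(label, remap_chars):
    """Step 1: map only the REMAP characters. Step 2: normalize to NFC."""
    mapped = "".join(
        nfkc_casefold_approx(ch) if ch in remap_chars else ch
        for ch in label
    )
    return unicodedata.normalize("NFC", mapped)


# Hypothetical REMAP table holding just the characters used below:
# FULLWIDTH LATIN SMALL LETTER E, HALFWIDTH KATAKANA LETTER KA,
# HALFWIDTH KATAKANA VOICED SOUND MARK.
REMAP_SAMPLE = {"\uFF45", "\uFF76", "\uFF9E"}

print(map_label("\uFF45xample", REMAP_SAMPLE))
print(map_label("\uFF76\uFF9E", REMAP_SAMPLE))
```

The second call shows why the NFC step matters: mapping the two halfwidth characters one at a time yields U+30AB followed by U+3099, and NFC then composes them into the single character U+30AC.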

C. Defs document

   1. Define REMAP
   2. Define an M-Label to be one which, if remapped according to B1+B2,
   results in a U-Label.

------------------------------
Details on REMAP

   1. The character is not PVALID or CONTEXTO
      - This guarantees that the class of REMAP is disjoint from the others
   2. If remapped by the Unicode property NFKC_Casefold*, then the resulting
   character(s) are all PVALID or CONTEXTO
      - This condition is not strictly necessary. Because of the way in
      which REMAP is used in the protocol above, if a character results that
      is not PVALID, then it would fail the later tests. So as far as I'm
      concerned, this could be dropped. However, restricting the characters
      in this way will probably make a character listing clearer to people.
      - The derived property NFKC_Casefold is being added to Unicode 5.2,
      and is already present in the 5.2 beta. It provides a convenient way
      to fold characters for identifiers (and not just for IDNA). It is
      defined in http://unicode.org/reports/tr44/tr44-3.html, and the
      characters affected are listed in http://unicode.org/Public/5.2.0/ucd/
      under DerivedNormalizationProps. If we don't want to wait for U5.2,
      we can define it on our own, but it would be convenient to use it, as
      long as we release in October or later.
   3. The character is a LetterDigit or Pd
      - This limits the input characters by eliminating symbols, punctuation,
      etc.
      - Pd is included only to pick up the fullwidth hyphen.
   4. The character has one of the following Decomposition_Type values:
   initial, medial, final, isolated, wide, narrow, or compat
      - The initial/medial/final/isolated forms are all Arabic presentation
      forms, such as:
      U+FE91 <http://unicode.org/cldr/utility/character.jsp?a=FE91> ( ‎ﺑ‎ )
      ARABIC LETTER BEH INITIAL FORM
      - The narrow/wide are all width variants, such as:
      U+FF71 <http://unicode.org/cldr/utility/character.jsp?a=FF71> ( ｱ )
      HALFWIDTH KATAKANA LETTER A
      - The compat forms include various digraphs or forms such as:
      U+013F <http://unicode.org/cldr/utility/character.jsp?a=013F> ( Ŀ )
      LATIN CAPITAL LETTER L WITH MIDDLE DOT,
      U+01C6 <http://unicode.org/cldr/utility/character.jsp?a=01C6> ( ǆ )
      LATIN SMALL LETTER DZ WITH CARON
      Most compat characters are, however, eliminated by other conditions,
      such as:
      U+2474 <http://unicode.org/cldr/utility/character.jsp?a=2474> ( ⑴ )
      PARENTHESIZED DIGIT ONE, which is eliminated by both condition #2 and #3
      - The excluded decomposition types are: font, super, sub, vertical,
      circle, fraction, nobreak, small, square
   5. The character does not have the Script value: Hangul
      - This is consistent with the exclusion in Tables of OldHangulJamo.
      Sample exclusions are:
      U+3131 <http://unicode.org/cldr/utility/character.jsp?a=3131> ( ㄱ )
      HANGUL LETTER KIYEOK
      U+FFA1 <http://unicode.org/cldr/utility/character.jsp?a=FFA1> ( ﾡ )
      HALFWIDTH HANGUL LETTER KIYEOK
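
For anyone who wants to check the sample characters above, a small Python sketch along these lines prints the general category and decomposition type of each one. The helper name describe is mine, and the values reported depend on the Unicode version bundled with the Python build in use.

```python
import unicodedata

# The sample characters cited in the details above.
SAMPLES = ["\uFE91", "\uFF71", "\u013F", "\u01C6", "\u2474", "\u3131", "\uFFA1"]


def describe(ch):
    """Return (code point, general category, decomposition type or None)."""
    decomp = unicodedata.decomposition(ch)
    dtype = decomp[1:decomp.index(">")] if decomp.startswith("<") else None
    return "U+%04X" % ord(ch), unicodedata.category(ch), dtype


for ch in SAMPLES:
    print(describe(ch), unicodedata.name(ch))
```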

On Sun, Jun 7, 2009 at 13:49, Paul Hoffman <phoffman at imc.org> wrote:

> At 11:57 AM -0400 6/7/09, John C Klensin wrote:
> >--On Saturday, June 06, 2009 16:38 -0700 Paul Hoffman
> ><phoffman at imc.org> wrote:
> >
> >>...
> >>> I continue to believe that use of NFKC without exclusion of
> >>> character groups for which there are no justifications is
> >>
> >> Pete's proposed mapping happens before the
> >> is-it-valid-IDNA2008 check. Why should we use a modified NFKC
> >> instead of plain-vanilla-NFKC and let the second step
> >> (is-it-valid-IDNA2008 check) happen as-is?
> >
> >My concern is not those NFKC mappings that will result in
> >invalid (DISALLOWED) characters.   It is
> >
> >(1) NFKC mappings of characters that, if used in domain names,
> >are probably used to cause mischief and for which there is no
> >substantive justification.   The "Mathematical" characters are
> >examples of this.
>
> I'm still confused. If someone enters a mathematical character that is
> mapped to an allowed character, the result is a valid domain name that could
> have been entered as allowed characters. This is identical to what we have
> today in IDNA2003, no worse.
>
> > Martin's original list identified others.
> >Note that, except in specialized systems, these characters are
> >very difficult to type and ones for which fonts are unlikely to
> >be present.
>
> Yes, exactly. I'm still missing your point of concern.
>
> >(2) NFKC mappings of characters that result in characters in
> >CONTEXTO or CONTEXTJ.  Unless I missed something in my search,
> >this is a null set at present.  But I can find no stability rule
> >that would prevent adding such a character and the same
> >presentation and ambiguity issues that apply to the listed
> >CONTEXTx characters would apply to their compatibility
> >equivalents.
>
> And I don't see a problem with that. Someone enters a
> name-which-needs-mapping, it is mapped, and out pops some characters that
> are valid. How is this of more concern than a valid, no-mapping IDNA2008
> name?
>
> >>> (i) A violation of the "inclusion" model of IDNA2008
> >>
> >> Completely agree. However, this whole document is a violation
> >> of the "no-mapping" model of IDNA2008, so that seems like an
> >> odd objection.
> >
> >We are likely to have to agree to disagree about this, but I
> >believe that "inclusion" and "no mapping" are separate
> >principles.  The acceptance of mapping in some contexts does not
> >seem to me to justify, in any way, abandoning "inclusion".  From
> >that point of view, the argument to abandon inclusion in the
> >mapping context has to be made separately... and that argument
> >has not, as far as I can remember, been made yet.
>
> We do agree to disagree here. Up to this point, we have held the whole set
> of rules constant, and some of them intertwined. I do *not* want us to
> abandon any of the rules in the end, just to allow sensible mapping before
> doing the final rule check.
>
> >>> (ii) A violation of the closely-related protocol design
> >>> principle that one should include only those things for which
> >>> one has both use and understanding because it is easier to add
> >>> later than it is to remove.
> >>
> >> Implementers of IDNA2008 will understand NFKC as well as
> >> implementers of IDNA2003.
> >
> >Which is to say, IMO, not at all.
>
> And here we agree. Thus, I see no harm in extending the protocol. We have
> not seen any significant damage from the lack of understanding in IDNA2003
> names, just as the lack of understanding of some crypto properties (for
> example) in developers hasn't had any significant negative effect on, say
> IPsec and TLS. And I really do see a parallel between the two types of
> protocols.
>
> >>> (iii) An increased risk, however slight, that we will, in the
> >>> future, get strong demands from some particular community to
> >>> treat a character classified by Unicode as "compatibility" as
> >>> a real and distinct character.  If such a character is
> >>> disallowed by virtue of not being mapped, we will have the
> >>> difficult problem of changing a disallowed character to a
> >>> PVALID one. But, if it is mapped to something else, we will
> >>> have to revisit the very complex discussions that we have had
> >>> over Eszett and Final Sigma.  We should not incur that risk
> >>> unless there is a reason to do so.
> >>
> >> There is: increased (but, of course, not full) backwards
> >> compatibility with the large installed base of IDNA2003.
> >
> >We've seen no evidence that any of these other categories of
> >compatibility characters are used --other than in possible
> >demonstrations or out of malice-- in IDNA2003-compatible
> >applications, much less enough to constitute a large installed
> >base.
>
> Really? I thought that multiple presentations by Asians showed that some of
> the compatibility characters got entered automatically by the UIs of some
> browsers and so on. I could be mistaken, of course, but that is certainly
> what I interpret slide 3 of
> <http://www.ietf.org/proceedings/09mar/slides/idnabis-4.pdf>.
>
> >YMMD.
>
> As we age, all of our mileages are degrading, yes. :-)