I-D Action:draft-ietf-idnabis-mappings-00.txt

Wed Jul 1 00:06:48 CEST 2009

Mark

On Tue, Jun 30, 2009 at 03:29, "Martin J. Dürst" <duerst at it.aoyama.ac.jp> wrote:

>
>
> On 2009/06/30 2:55, John C Klensin wrote:
>
>> Mark,
>>
>> Several comments inline...
>>
>> --On Sunday, June 28, 2009 21:28 -0700 Mark Davis ⌛
>> <mark at macchiato.com>  wrote:
>>
>>  Returning to the discussion, now that some of my other
>>> standards work is under control (RFC4646bis was approved,
>>> whew!)
>>> ...
>>>
>>
>>  Now, my position is still that the simplest and most
>>> compatible option open to us is to simply map with NFKC +
>>> Casefold.
>>>
>>
>> I continue to believe that CaseFold is a showstopper.  When its
>> results are not identical to those produced by LowerCase, it
>> produces results that are astonishing to some users and leads us
>> into the "is that a separate character or not" trap that we've
>> seen manifested at least twice.  I note that TUS recommends
>> against its use for mapping (as distinct from comparison) and
>> appears to do so for just the reason that it involves too much
>> information loss.
>>
>
> I have earlier said that I think Mark's proposal goes in the right
> direction, but I agree with John that LowerCase is better than CaseFold. If
> anything, the burden of proof should be on the CaseFold side (show, for each
> case of mapping that's in CaseFold but not LowerCase, why it's needed)
> rather than on the LowerCase side.

The key issue for me is that we don't map *differently -- unless it is vital
--* from how IDNA2003 mapped. Different mappings are *very* painful for
implementations, and will be very painful for years to come for users.
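
To make the CaseFold versus LowerCase question concrete, here is a rough
Python sketch (an illustration only, not part of any draft) that lists the
characters for which the two operations disagree. It relies on Python's
built-in str.lower() and str.casefold(), so the result depends on the
Unicode version bundled with the interpreter; the function name is made up
for this example.

```python
import sys
import unicodedata


def casefold_differs_from_lower(limit=sys.maxunicode + 1):
    """Return the characters whose casefold() result differs from lower()."""
    differing = []
    for cp in range(limit):
        ch = chr(cp)
        if ch.casefold() != ch.lower():
            differing.append(ch)
    return differing


diffs = casefold_differs_from_lower()
print(len(diffs), "characters fold differently from lowercasing")

# Show a few of them with both results side by side.
for ch in diffs[:10]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}: "
          f"lower={ch.lower()!r} casefold={ch.casefold()!r}")
```

Each entry in that list is one of the cases for which the "burden of proof"
argument above would ask for a justification.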

>
> Mark wrote, in a later mail:
>
> You make it sound like final sigma, ZWJ/NJ, eszett and the other cases
> under discussion were oversights in the process of developing the current
> IDNA. That wasn't the case; these were deliberate choices made at the time.
> A case mapping is also a 'loss of information', but one that people clearly
> want.
>
>
> Eszett wasn't exactly an oversight, I knew at the time that it was
> problematic and told others. However, I didn't have the zeal to defend it
> because as a Swiss, I didn't and don't feel as attached to it as Germans and
> Austrians do.
>
> My understanding of why the eszett got mapped in IDNA 2003 was that the
> IETF wanted a one-stop shopping table, and Unicode had such a table, and any
> discussions about individual characters were out of fashion because it was
> felt that if we started discussing individual characters, we would never
> finish.

The main reason, as I recall, for favoring the mapping to "ss" was that the
uppercase forms then match correctly. But we'd have to go back and dredge
through all the email to get any feeling for what the decisive arguments
were. There are certainly a number of exceptional cases in IDNA2003 that are
called out in different categories...
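
The uppercase argument can be seen in a couple of lines of Python. This is
only a sketch using the interpreter's built-in case conversion, and the
helper name is invented for the example:

```python
def same_uppercase(a: str, b: str) -> bool:
    """True if the two strings have identical uppercase forms."""
    return a.upper() == b.upper()


# Eszett uppercases to "SS", so both spellings agree once uppercased,
# even though lowercasing leaves the eszett untouched.
print(same_uppercase("stra\u00dfe", "strasse"))
print("stra\u00dfe".lower() == "strasse")
```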

>
>
>
>  ...
>>> Proposal: A. Tables document
>>>
>>> Add a new type of character: REMAP. A character is REMAP if it
>>> meets *all of* the following criteria:
>>>
>>>    1. The character is not PVALID or CONTEXTO
>>>    2. If remapped by the Unicode property NFKC_Casefold*, then
>>> the resulting character(s) are all PVALID or CONTEXTO
>>>    3. The character is a LetterDigit or Pd
>>>    4. The character has one of the following
>>> Decomposition_Type values: initial, medial, final,
>>> isolated, wide, narrow, or compat
>>>
>>
>> I am very concerned that collapsing initial, medial, and final
>> together may get us into problems with other language
>> communities similar to those we have gotten into with Final
>> Sigma, especially when those communities denote word boundaries
>> by the appearance of final or initial forms and hence would use
>> those forms in a style similar to the way "BigCompany" or
>> "big-company" might be used in ASCII.
>>
>
> The only character currently not containing the word "ARABIC" in its name
> for <initial>, <medial>, <final>, or <isolated> is U+FDFC, RIAL SIGN, which
> is just as well Arabic even if it doesn't say so in its name.
>
> I strongly doubt that the UTC would encode other backwards compatibility
> contextual forms in these four categories, and it might be possible to make
> sure that doesn't happen with a stability guarantee if that's really
> necessary.
>
> What I already asked Mark for, and what I'm still looking for, is some data
> on how (in)frequent these actually are.

These are infrequent, as far as my initial sampling goes. I can't really
release all data (sorry for not answering before), and for the less-frequent
cases I really have to get a much larger sample in order to have meaningful
statistics. Here's an example:

U+FEA9    ( ﺩ )    ARABIC LETTER DAL ISOLATED FORM
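
For anyone who wants to check which characters fall under each
Decomposition_Type discussed here, the following Python sketch tallies them
from the interpreter's own Unicode data (so the counts will reflect that
Unicode version, not necessarily 5.2). It also prints any positional form
whose name lacks the word "ARABIC". The helper names are made up for this
example.

```python
import sys
import unicodedata
from collections import defaultdict

TAGS = {"initial", "medial", "final", "isolated", "wide", "narrow", "compat"}


def decomposition_tag(ch):
    """Return the compatibility tag (e.g. 'wide'), or None if there is none."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp[1:decomp.index(">")]
    return None


def chars_by_tag():
    """Group all code points by the decomposition tags of interest."""
    table = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        tag = decomposition_tag(ch)
        if tag in TAGS:
            table[tag].append(ch)
    return table


table = chars_by_tag()
for tag in sorted(table):
    print(tag, len(table[tag]))

# Positional forms whose character name does not contain "ARABIC".
positional = (table["initial"] + table["medial"]
              + table["final"] + table["isolated"])
for ch in positional:
    name = unicodedata.name(ch, "")
    if "ARABIC" not in name:
        print(f"U+{ord(ch):04X} {name}")
```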

>
>
> As for <wide>, that includes only U+3000 (full width space, irrelevant
> here) and U+FFxx characters that contain FULLWIDTH in their name.
>
> As for <narrow>, that includes HANGUL, KATAKANA, and 11 characters in the
> U+FFxx area, all of which contain the word HALFWIDTH. The one to watch out
> for is U+FF61, HALFWIDTH IDEOGRAPHIC FULL STOP. Its fullwidth sibling
> (U+3002) is part of IDNA 2003.
>
> For these two (wide/narrow), I know from local experience here in Japan
> that they are probably necessary. Still, it would be good to get some
> numbers from Mark.

The highest frequency width-variant characters are:

U+FF70    ( ｰ )    HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
U+FF83    ( ﾃ )    HALFWIDTH KATAKANA LETTER TE
...
U+FF0D    ( － )    FULLWIDTH HYPHEN-MINUS
U+FF43    ( ｃ )    FULLWIDTH LATIN SMALL LETTER C
...

The fullwidth hyphen is the highest frequency character that would be
subject to remapping under the proposed scheme; so higher frequency than any
case-mapped character outside of ASCII. The top 4 characters are:

U+FF0D    ( － )    FULLWIDTH HYPHEN-MINUS
U+00AD    (  )    SOFT HYPHEN
U+00C3    ( Ã )    LATIN CAPITAL LETTER A WITH TILDE
U+00DF    ( ß )    LATIN SMALL LETTER SHARP S

Note that soft-hyphen is mapped away by IDNA2003. It is invisible, and only
serves to indicate hyphenation points, so it is easy for someone to cut and
paste a word into an href, for example, without knowing that it is there.
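
As a quick illustration of the existing IDNA2003 behavior, Python's built-in
"idna" codec (which implements RFC 3490 with Nameprep) can be used to see
these mappings. This is only a sketch; the wrapper name is invented, and the
sample labels are arbitrary.

```python
def to_ascii_2003(name: str) -> bytes:
    """Apply IDNA2003 ToASCII via Python's built-in 'idna' codec."""
    return name.encode("idna")


print(to_ascii_2003("exam\u00adple"))  # soft hyphen inside the label
print(to_ascii_2003("a\uff0db"))       # fullwidth hyphen-minus
print(to_ascii_2003("stra\u00dfe"))    # eszett
```

The soft hyphen disappears, the fullwidth hyphen becomes an ASCII hyphen, and
the eszett becomes "ss", which is the behavior that any new mapping would
either preserve or break.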

The eszett is not included in REMAP because it is PVALID.
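
A rough Python sketch of the REMAP test may help in reading the criteria. It
encodes only the five criteria quoted in this message, and it makes several
assumptions that are not part of the proposal: is_valid is a hypothetical
caller-supplied predicate standing in for "PVALID or CONTEXTO";
NFKC_Casefold is approximated with the standard library; LetterDigit is
taken to mean the general categories Ll, Lu, Lo, Nd, Lm, Mn, Mc; and the
Hangul script test is approximated by looking for "HANGUL" in the character
name.

```python
import unicodedata

LETTER_DIGIT = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}  # assumed reading
REMAP_TAGS = {"initial", "medial", "final", "isolated",
              "wide", "narrow", "compat"}


def nfkc_casefold_approx(s):
    """Stdlib approximation of NFKC_Casefold (not the exact property)."""
    folded = unicodedata.normalize("NFKC", s).casefold()
    return unicodedata.normalize("NFKC", folded)


def decomposition_tag(ch):
    """Return the compatibility tag (e.g. 'wide'), or None if there is none."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp[1:decomp.index(">")]
    return None


def is_remap(ch, is_valid):
    """Apply criteria 1-5; is_valid(c) is a hypothetical PVALID/CONTEXTO test."""
    if is_valid(ch):                                            # criterion 1
        return False
    if not all(is_valid(c) for c in nfkc_casefold_approx(ch)):  # criterion 2
        return False
    category = unicodedata.category(ch)
    if category not in LETTER_DIGIT and category != "Pd":       # criterion 3
        return False
    if decomposition_tag(ch) not in REMAP_TAGS:                 # criterion 4
        return False
    if "HANGUL" in unicodedata.name(ch, ""):                    # criterion 5 (approx.)
        return False
    return True


# Toy stand-in for the real tables: a handful of characters treated as valid.
def toy_valid(ch):
    return ch in "abcdefghijklmnopqrstuvwxyz0123456789-\u00df"


print(is_remap("\uff0d", toy_valid))  # fullwidth hyphen-minus
print(is_remap("\u00df", toy_valid))  # eszett, already valid in the toy table
```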

>
>
> As for <compat>, that's the "everything else" bucket. That's a total of 720
> characters in Unicode 5.2 (as of UnicodeData-5.2.0d9.txt). Not all of them
> qualify by Mark's rules (in particular things such as parenthesized numbers
> don't because parentheses aren't allowed), but there are still way too many
> in my opinion that qualify. It would be good to know from Mark how many of
> these he really thinks need to be mapped, and why. If that's let's say 90%
> or 95% of the characters that would qualify by Mark's rules, it might be
> okay to just leave the rest as is, provided we can see no harm. Otherwise, I
> think a more detailed analysis may be necessary.
>
> To be more explicit, I think *at least* the following are included by the
> rules that Mark proposes but shouldn't be used for mapping:
>
> - ROMAN NUMERALs (32)
> - CJK/KANGXI RADICALs (216)
> - IDEOGRAPHIC TELEGRAPH SYMBOLs (68)

I put the exact cases generated on my page; the cases you are worried about
are not there. There are 59 characters, and they don't include the above.
The characters are:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[µĲĳĿŀŉſǄ-ǌǱ-ǳϐ-ϒϕϖϰ-ϲϵϹևٵ-ٸำຳໜໝ\u0F77\u0F79ẚℇℵ-ℸﬀ-ﬆﬓ-ﬗﭏ]
You can look at details on the other categories in the breakdown with the
same tool.

>
> Excluding characters with the words HANGUL, PARENTHESIZED, COMMA, and FULL
> STOP (all of which are excluded by Mark's rules) reduces the overall total
> from 720 to 456. In these, there are at least three categories:
> - Some more that are already excluded by Mark's rules but that my simple
> greps didn't catch.
> - Those that I think definitely shouldn't be included (see above, 316 in
> total)
> - The rest, possibly okay to include, which is at most 140.
>
>
>  As I've said several times before, even if we disallow the
>> NFKC-affected forms of those characters, if a need arises, we can
>> (painfully) redefine them as PVALID and allow them.  But, if we
>> map them to something else, we lose all information about what
>> was intended/desired and end up in precisely the mess we have
>> with e.g., Final Sigma  and ZWJ/ZWNJ in which "the right thing
>> to do" poses enough compatibility problems to cause opposition
>> to making changes.
>>
>
> We definitely have to look at this carefully. I'm not overly concerned in
> general, but we shouldn't just gloss over it.

I agree. Feedback on the data would be appreciated.

>
>
>     5. The character does not have the Script value: Hangul
>>>
>>> The REMAP characters are removed from DISALLOWED, so that the
>>> TABLES values form a partition (all the values are disjoint).
>>>
>>
>> This strikes me as dangerous -- see below.
>>
>>  B. Protocols document
>>>
>>> Change sections 4.2.1 and 5.3 so as to require:
>>>
>>>    1. Mapping all REMAP characters according to the Unicode
>>> property NFKC_Casefold,
>>>    2. Then normalizing the result according to NFC.
>>>
>>
> We have to make sure this transform is idempotent on all strings we are
> concerned about, or introduce additional steps if necessary.

It is, but if we are paranoid it is easy to ensure idempotence in the spec
with a "repeat until there is no change" clause.
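
Such a clause could look roughly like the following Python sketch. It is an
illustration under stated assumptions, not spec text: is_remap is a
hypothetical caller-supplied predicate for the REMAP class, NFKC_Casefold is
approximated with the standard library, and the max_rounds guard is my own
addition.

```python
import unicodedata


def nfkc_casefold_approx(s):
    """Stdlib approximation of NFKC_Casefold (not the exact property)."""
    folded = unicodedata.normalize("NFKC", s).casefold()
    return unicodedata.normalize("NFKC", folded)


def remap_once(label, is_remap):
    """Step 1: map REMAP characters; step 2: normalize the result to NFC."""
    mapped = "".join(nfkc_casefold_approx(ch) if is_remap(ch) else ch
                     for ch in label)
    return unicodedata.normalize("NFC", mapped)


def remap_until_stable(label, is_remap, max_rounds=8):
    """Repeat the mapping until there is no change."""
    for _ in range(max_rounds):
        mapped = remap_once(label, is_remap)
        if mapped == label:
            return label
        label = mapped
    raise ValueError("mapping did not converge")


def is_idempotent_on(label, is_remap):
    """Check that applying the mapping twice equals applying it once."""
    once = remap_once(label, is_remap)
    return remap_once(once, is_remap) == once


# Toy REMAP class: fullwidth hyphen-minus and fullwidth small letter c.
def toy_is_remap(ch):
    return ch in "\uff0d\uff43"


print(remap_until_stable("\uff43\uff0d\uff43", toy_is_remap))
print(is_idempotent_on("\uff43\uff0d\uff43", toy_is_remap))
```

Running is_idempotent_on over a sample of real labels would be one way to
test the claim empirically before deciding whether the loop is needed.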

>
> Regards,    Martin.
>
>
>  Making this change to 4.2.1 eliminates the requirement that the
>> registrant understand _exactly_ what is being registered, i.e.,
>> that the communication path between the registrant and registry
>> occur only using U-labels and/or A-labels.  My understanding was
>> that we had reached one of the clearer points of consensus we had in
>> these discussions that the "no mapping on registration"
>> restriction was appropriate.  Are you proposing to reopen that
>> question?
>>
>>  The rest of the tests for U-Label remain unchanged.
>>>
>>
>> I believe that doing this by the type of change to Tables that
>> you recommend either requires a change to the way that the
>> definition of U-label is stated or requires us to abandon the
>> very clear concept of a U-label that is completely symmetric,
>> with no information loss in either direction, with an A-label.
>>
>> There is also a subtle interaction with Section 5.5: if the
>> mapping is performed by the time Section 5.3 concludes (or,
>> under special circumstances, not applied at all), then Section
>> 5.5 must also prohibit REMAP.
>>
>>  C. Defs document
>>>
>>>    1. Define REMAP
>>>    2. Define an M-Label to be one which if remapped according
>>> to B1+B2, results in a U-Label.
>>>
>>
>> The idea of an M-Label still makes me uncomfortable.  Again, we
>> have had that discussion before.
>>
>> regards,
>>    john
>>
>>
>>
>>
>> _______________________________________________
>> Idna-update mailing list
>> Idna-update at alvestrand.no
>> http://www.alvestrand.no/mailman/listinfo/idna-update
>>
>
> --
> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
> #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp
>