"Martin J. Dürst"
duerst at it.aoyama.ac.jp
Tue Jun 30 12:29:27 CEST 2009
On 2009/06/30 2:55, John C Klensin wrote:
> Several comments inline...
> --On Sunday, June 28, 2009 21:28 -0700 Mark Davis ⌛
> <mark at macchiato.com> wrote:
>> Returning to the discussion, now that some of my other
>> standards work is under control (RFC4646bis was approved,
>> Now, my position is still that the simplest and most
>> compatible option open to us is to simply map with NFKC +
> I continue to believe that CaseFold is a showstopper. When its
> results are not identical to those produced by LowerCase, it
> produces results that are astonishing to some users and leads us
> into the "is that a separate character or not" trap that we've
> seen manifested at least twice. I note that TUS recommends
> against its use for mapping (as distinct from comparison) and
> appears to do so for just the reason that it involves too much
> information loss.
I have earlier said that I think Mark's proposal goes in the right
direction, but I agree with John that LowerCase is better than CaseFold.
If anything, the burden of proof should be on the CaseFold side (show,
for each case of mapping that's in CaseFold but not LowerCase, why it's
needed) rather than on the LowerCase side.
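To make the LowerCase-vs-CaseFold difference concrete, here is a small sketch (using Python's str.lower and str.casefold as stand-ins for the Unicode LowerCase and CaseFold operations) showing the characters that keep coming up in this discussion:

```python
# Characters where CaseFold (approximated by Python's str.casefold)
# diverges from simple lowercasing (str.lower).
samples = ["\u00DF",        # ss  <- ß  under casefold; lower keeps ß
           "\u1E9E",        # ss  <- ẞ  under casefold; lower gives ß
           "\u0391\u03A3"]  # ΑΣ: lower gives final sigma ς, casefold gives σ

for s in samples:
    print(" ".join(f"U+{ord(c):04X}" for c in s),
          "lower:", s.lower(), "casefold:", s.casefold())
```

The eszett and final-sigma rows are exactly the cases where folding loses information that lowercasing preserves.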
Mark wrote, in a later mail:
You make it sound like final sigma, ZWJ/ZWNJ, eszett and the other cases
under discussion were oversights in the process of developing the current
IDNA. That wasn't the case; these were deliberate choices made at the time.
A case mapping is also a 'loss of information', but one that people clearly
Eszett wasn't exactly an oversight, I knew at the time that it was
problematic and told others. However, I didn't have the zeal to defend
it because as a Swiss, I didn't and don't feel as attached to it as
Germans and Austrians do.
My understanding of why the eszett got mapped in IDNA 2003 was that the
IETF wanted a one-stop shopping table, and Unicode had such a table, and
any discussions about individual characters were out of fashion because
it was felt that if we started discussing individual characters, we
would never finish.
>> Proposal: A. Tables document
>> Add a new type of character: REMAP. A character is REMAP if it
>> meets *all of * the following criteria:
>> 1. The character is not PVALID or CONTEXTO
>> 2. If remapped by the Unicode property NFKC_Casefold*, then
>> the resulting character(s) are all PVALID or CONTEXTO
>> 3. The character is a LetterDigit or Pd
>> 4. The character has one of the following
>> Decomposition_Type values: initial, medial, final,
>> isolated, wide, narrow, or compat
> I am very concerned that collapsing initial, medial, and final
> together may get us into problems with other language
> communities similar to those we have gotten into with Final
> Sigma, especially when those communities denote word boundaries
> by the appearance of final or initial forms and hence would use
> those forms in a style similar to the way "BigCompany" or
> "big-company" might be used in ASCII.
The only character in the <initial>, <medial>, <final>, or <isolated>
categories whose name does not contain the word "ARABIC" is U+FDFC, RIAL
SIGN, which is Arabic all the same, even if its name doesn't say so.
I strongly doubt that the UTC would encode other backwards-compatibility
contextual forms in these four categories, and it might be possible to
make sure that doesn't happen with a stability guarantee if that's deemed
necessary.
What I already asked Mark for, and what I'm still looking for, is some
data on how (in)frequent these actually are.
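The four contextual categories are easy to enumerate mechanically; a sketch (counts will depend on the Unicode version bundled with the Python build, so treat the exact numbers as illustrative):

```python
import sys
import unicodedata

# Tally code points by contextual Decomposition_Type tag.
contextual = {"<initial>": [], "<medial>": [], "<final>": [], "<isolated>": []}
for cp in range(sys.maxunicode + 1):
    decomp = unicodedata.decomposition(chr(cp))
    if decomp:
        tag = decomp.split()[0]
        if tag in contextual:
            contextual[tag].append(cp)

for tag, cps in sorted(contextual.items()):
    print(tag, len(cps))
```

Running this confirms that U+FDFC (RIAL SIGN) sits in the <isolated> bucket alongside the Arabic presentation forms.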
As for <wide>, that includes only U+3000 (full width space, irrelevant
here) and U+FFxx characters that contain FULLWIDTH in their name.
As for <narrow>, that includes HANGUL, KATAKANA, and 11 characters in
the U+FFxx area, all of which contain the word HALFWIDTH. The one to
watch out for is U+FF61, HALFWIDTH IDEOGRAPHIC FULL STOP. Its fullwidth
sibling (U+3002) is part of IDNA 2003.
For these two (wide/narrow), I know from local experience here in Japan
that they are probably necessary. Still, it would be good to get some
numbers from Mark.
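For the <wide> and <narrow> cases the mapping in question is just what NFKC already does; a small sketch, including the U+FF61 case flagged above:

```python
import unicodedata

# NFKC maps <wide> and <narrow> compatibility forms to their ordinary siblings.
pairs = [("\uFF21\uFF22\uFF23", "ABC"),            # FULLWIDTH A, B, C
         ("\uFF76\uFF80\uFF76\uFF85", "カタカナ"),   # HALFWIDTH KATAKANA KA TA KA NA
         ("\uFF61", "\u3002")]                     # HALFWIDTH IDEOGRAPHIC FULL STOP
for src, expected in pairs:
    mapped = unicodedata.normalize("NFKC", src)
    print(repr(src), "->", repr(mapped), mapped == expected)
```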
As for <compat>, that's the "everything else" bucket. That's a total of
720 characters in Unicode 5.2 (as of UnicodeData-5.2.0d9.txt). Not all
of them qualify by Mark's rules (in particular things such as
parenthesized numbers don't because parentheses aren't allowed), but
there are still way too many, in my opinion, that qualify. It would be good
to know from Mark how many of these he really thinks need to be mapped,
and why. If that's, say, 90% or 95% of the characters that would
qualify by Mark's rules, it might be okay to just leave the rest as is,
provided we can see no harm. Otherwise, I think a more detailed analysis
may be necessary.
To be more explicit, I think *at least* the following are included by
the rules that Mark proposes but shouldn't be used for mapping:
- ROMAN NUMERALs (32)
- CJK/KANGXI RADICALs (216)
- IDEOGRAPHIC TELEGRAPH SYMBOLs (68)
Excluding characters with the words HANGUL, PARENTHESIZED, COMMA, and
FULL STOP (all of which are excluded by Mark's rules) reduces the
overall total from 720 to 456. In these, there are at least three groups:
- Some more that are already excluded by Mark's rules but that my simple
greps didn't catch.
- Those that I think definitely shouldn't be included (see above, 316 in
total).
- The rest, possibly okay to include, which is at most 140.
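The rough counting done above can be reproduced mechanically; a sketch, where the name-based exclusion is only a crude stand-in for Mark's actual rules and the totals depend on the Unicode version shipped with Python:

```python
import sys
import unicodedata

# All code points whose decomposition carries the <compat> tag.
compat = [cp for cp in range(sys.maxunicode + 1)
          if unicodedata.decomposition(chr(cp)).startswith("<compat>")]

# Crude filter mirroring the greps in the text: drop names containing
# these words (all excluded, one way or another, by Mark's rules).
excluded = ("HANGUL", "PARENTHESIZED", "COMMA", "FULL STOP")
remaining = [cp for cp in compat
             if not any(w in unicodedata.name(chr(cp), "") for w in excluded)]

print(len(compat), len(remaining))
```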
> As I've said several times before, even if we disallow the
> NFKC-affected forms of those characters, if a need arises, we can
> (painfully) redefine them as PVALID and allow them. But, if we
> map them to something else, we lose all information about what
> was intended/desired and end up in precisely the mess we have
> with e.g., Final Sigma and ZWJ/ZWNJ in which "the right thing
> to do" poses enough compatibility problems to cause opposition
> to making changes.
We definitely have to look at this carefully. I'm not overly concerned
in general, but we shouldn't just gloss over it.
>> 5. The character does not have the Script value: Hangul
>> The REMAP characters are removed from DISALLOWED, so that the
>> TABLES values form a partition (all the values are disjoint).
> This strikes me as dangerous -- see below.
>> B. Protocols document
>> Change sections 4.2.1 and 5.3 so as to include:
>> 1. Mapping all REMAP characters according to the Unicode
>> property NFKC_Casefold,
>> 2. Then normalizing the result according to NFC.
We have to make sure this transform is idempotent on all strings we are
concerned about, or introduce additional steps if necessary.
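An idempotence spot-check is easy to write. A sketch, noting that Python's unicodedata does not expose NFKC_Casefold directly, so NFKC plus str.casefold is used here as a stand-in (it omits the removal of Default_Ignorable code points):

```python
import unicodedata

def map_label(s: str) -> str:
    # Approximation of the proposed B1+B2 mapping:
    # NFKC + CaseFold, then NFC on the result.
    return unicodedata.normalize("NFC",
                                 unicodedata.normalize("NFKC", s).casefold())

for s in ["stra\u00DFe", "\uFF26\uFF35\uFF2A\uFF29", "\u0391\u03A3", "\uFF76\uFF9E"]:
    once = map_label(s)
    print(repr(s), "->", repr(once), "idempotent:", map_label(once) == once)
```

A real check would of course have to run over all strings of interest, not a handful of samples, and against the actual NFKC_Casefold data.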
> Making this change to 4.2.1 eliminates the requirement that the
> registrant understand _exactly_ what is being registered, i.e.,
> that the communication path between the registrant and registry
> occur only using U-labels and/or A-labels. My understanding was
> that we had reached one of the more clear consensus we had in
> these discussions that the "no mapping on registration"
> restriction was appropriate. Are you proposing to reopen that
> decision?
>> The rest of the tests for U-Label remain unchanged.
> I believe that doing this by the type of change to Tables that
> you recommend either requires a change to the way that the
> definition of U-label is stated or requires us to abandon the
> very clear concept of a U-label that is completely symmetric,
> with no information loss in either direction, with an A-label.
> There is also a subtle interaction with Section 5.5: if the
> mapping is performed by the time Section 5.3 concludes (or,
> under special circumstances, not applied at all), then Section
> 5.5 must also prohibit REMAP.
>> C. Defs document
>> 1. Define REMAP
>> 2. Define an M-Label to be one which if remapped according
>> to B1+B2, results in a U-Label.
> The idea of an M-Label still makes me uncomfortable. Again, we
> have had that discussion before.
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst at it.aoyama.ac.jp