vint at google.com
Tue Jun 30 14:47:15 CEST 2009
Mark, John, et al,
In an effort to truly get to a conclusion in time for Stockholm, may I
suggest that we start with a minimum mapping position (e.g. lowercasing) and
build up from there? That's more or less what Pete was trying to do.
John K and Martin D make arguments that favor that tactic and it will
allow us to balance how much mapping (information loss) we accept.
I think we have agreed and should stick with the 1:1 nature of A- and
U-labels and with no mapping on registration. This is long-since agreed.
Paul and Pete are making an attempt, I believe, to formulate a proposal.
Mark has made one that for some people may overshoot the degree of
mapping believed necessary to preserve elements (but not all) of
backward compatibility with IDNA2003. I think it is fair to say that we would
not be having this discussion were it not for the fact that a consensus
has been reached that IDNA2003 had properties that led to the creation
of the IDNABIS working group. Our task is to find a path forward that
balances no-mapping on lookup and overmuch mapping.
Preserving the esszet and other special characters, lowercasing, and
probably dealing with the CJK characters would likely be a minimum
treatment to start, as I think Pete attempted in his first proposal.
Mark says that for lookup purposes we are creating a larger
class than A-labels with the property that the lookup process will
convert them into A-label format. For precision, describing that
fact has some value. John, I can make a stab at drawing it just to
see how difficult that gets. Maybe we can just say it without drawing
On Jun 30, 2009, at 6:29 AM, Martin J. Dürst wrote:
> On 2009/06/30 2:55, John C Klensin wrote:
>> Several comments inline...
>> --On Sunday, June 28, 2009 21:28 -0700 Mark Davis ⌛
>> <mark at macchiato.com> wrote:
>>> Returning to the discussion, now that some of my other
>>> standards work is under control (RFC4646bis was approved,
>>> Now, my position is still that the simplest and most
>>> compatible option open to us is to simply map with NFKC +
>> I continue to believe that CaseFold is a showstopper. When its
>> results are not identical to those produced by LowerCase, it
>> produces results that are astonishing to some users and leads us
>> into the "is that a separate character or not" trap that we've
>> seen manifested at least twice. I note that TUS recommends
>> against its use for mapping (as distinct from comparison) and
>> appears to do so for just the reason that it involves too much
>> information loss.
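The difference John describes between CaseFold and LowerCase can be seen with a few characters. This is a minimal Python sketch, assuming that Python's built-in `str.casefold()` and `str.lower()` are acceptable stand-ins for the Unicode operations under discussion; the helper name `fold_differs_from_lower` is made up for illustration.

```python
def fold_differs_from_lower(ch: str) -> bool:
    """Return True when case folding and lowercasing disagree for ch."""
    return ch.casefold() != ch.lower()


# LATIN CAPITAL LETTER A, LATIN SMALL LETTER SHARP S (eszett),
# GREEK SMALL LETTER FINAL SIGMA
samples = ["A", "\u00DF", "\u03C2"]

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        "lower:", ascii(ch.lower()),
        "casefold:", ascii(ch.casefold()),
        "differs:", fold_differs_from_lower(ch),
    )
```

Eszett and final sigma are unchanged by lowercasing but are rewritten by case folding (to "ss" and to the non-final sigma), which is the kind of information loss at issue here.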
> I have earlier said that I think Mark's proposal goes in the right
> direction, but I agree with John that LowerCase is better than CaseFold.
> If anything, the burden of proof should be on the CaseFold side (show,
> for each case of mapping that's in CaseFold but not LowerCase, why it is
> needed) rather than on the LowerCase side.
> Mark wrote, in a later mail:
> You make it sound like final sigma, ZWJ/NJ, eszett and the other
> under discussion were oversights in the process of developing the
> IDNA. That wasn't the case; these were deliberate choices made at
> the time.
> A case mapping is also a 'loss of information', but one that people
> Eszett wasn't exactly an oversight, I knew at the time that it was
> problematic and told others. However, I didn't have the zeal to defend
> it because as a Swiss, I didn't and don't feel as attached to it as
> Germans and Austrians do.
> My understanding of why the eszett got mapped in IDNA 2003 was that the
> IETF wanted a one-stop shopping table, and Unicode had such a table;
> any discussions about individual characters were out of fashion because
> it was felt that if we started discussing individual characters, we
> would never finish.
>>> Proposal: A. Tables document
>>> Add a new type of character: REMAP. A character is REMAP if it
>>> meets *all of* the following criteria:
>>> 1. The character is not PVALID or CONTEXTO
>>> 2. If remapped by the Unicode property NFKC_Casefold*, then
>>> the resulting character(s) are all PVALID or CONTEXTO
>>> 3. The character is a LetterDigit or Pd
>>> 4. The character has one of the following
>>> Decomposition_Type values: initial, medial, final,
>>> isolated, wide, narrow, or compat
>> I am very concerned that collapsing initial, medial, and final
>> together may get us into problems with other language
>> communities similar to those we have gotten into with Final
>> Sigma, especially when those communities denote word boundaries
>> by the appearance of final or initial forms and hence would use
>> those forms in a style similar to the way "BigCompany" or
>> "big-company" might be used in ASCII.
> The only character currently not containing the word "ARABIC" in its
> name for <initial>, <medial>, <final>, or <isolated> is U+FDFC, RIAL
> SIGN, which is just as well Arabic even if it doesn't say so in its name.
> I strongly doubt that the UTC would encode other backwards
> contextual forms in these four categories, and it might be possible to
> make sure that doesn't happen with a stability guarantee if that's
> really necessary.
> What I already asked Mark for, and what I'm still looking for, is some
> data on how (in)frequent these actually are.
> As for <wide>, that includes only U+3000 (full width space, irrelevant
> here) and U+FFxx characters that contain FULLWIDTH in their name.
> As for <narrow>, that includes HANGUL, KATAKANA, and 11 characters in
> the U+FFxx area, all of which contain the word HALFWIDTH. The one to
> watch out for is U+FF61, HALFWIDTH IDEOGRAPHIC FULL STOP. Its
> sibling (U+3002) is part of IDNA 2003.
> For these two (wide/narrow), I know from local experience here in Japan
> that they are probably necessary. Still, it would be good to get some
> numbers from Mark.
> As for <compat>, that's the "everything else" bucket. That's a total of
> 720 characters in Unicode 5.2 (as of UnicodeData-5.2.0d9.txt). Not all
> of them qualify by Mark's rules (in particular things such as
> parenthesized numbers don't because parentheses aren't allowed), but
> there are still way too many in my opinion that qualify. It would be good
> to know from Mark how many of these he really thinks need to be mapped,
> and why. If that's let's say 90% or 95% of the characters that would
> qualify by Mark's rules, it might be okay to just leave the rest as is,
> provided we can see no harm. Otherwise, I think a more detailed analysis
> may be necessary.
> To be more explicit, I think *at least* the following are included by
> the rules that Mark proposes but shouldn't be used for mapping:
> - ROMAN NUMERALs (32)
> - CJK/KANGXI RADICALs (216)
> - IDEOGRAPHIC TELEGRAPH SYMBOLs (68)
> Excluding characters with the words HANGUL, PARENTHESIZED, COMMA, and
> FULL STOP (all of which are excluded by Mark's rules) reduces the
> overall total from 720 to 456. In these, there are at least three groups:
> - Some more that are already excluded by Mark's rules but that my
> greps didn't catch.
> - Those that I think definitely shouldn't be included (see above,
> 316 in total).
> - The rest, possibly okay to include, which is at most 140.
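Tallies of the kind Martin reports can be approximated with Python's standard `unicodedata` module instead of grepping UnicodeData.txt. This is a sketch under assumptions: the function names are invented for illustration, and the module tracks whatever Unicode version ships with the interpreter, not the 5.2 draft data Martin used, so the counts it prints will not match his figures exactly.

```python
import sys
import unicodedata
from collections import Counter

CONTEXTUAL_TAGS = {"initial", "medial", "final", "isolated"}


def decomposition_tag(ch):
    """Return the compatibility tag of ch (e.g. 'wide'), or None."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp[1:decomp.index(">")]
    return None  # no decomposition, or a canonical one


def tally_tags():
    """Count code points per compatibility decomposition tag."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        tag = decomposition_tag(chr(cp))
        if tag is not None:
            counts[tag] += 1
    return counts


def non_arabic_contextual():
    """Contextual-form characters whose name lacks the word ARABIC."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if decomposition_tag(ch) in CONTEXTUAL_TAGS:
            if "ARABIC" not in unicodedata.name(ch, ""):
                found.append(ch)
    return found


if __name__ == "__main__":
    print("Unicode version:", unicodedata.unidata_version)
    for tag, count in sorted(tally_tags().items()):
        print(f"<{tag}>: {count}")
    print([f"U+{ord(c):04X}" for c in non_arabic_contextual()])
```

Filtering the `compat` bucket further by words in the character name (HANGUL, PARENTHESIZED, and so on) would follow the same pattern as `non_arabic_contextual`.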
>> As I've said several times before, even if we disallow the
>> NFKC-affected forms of those characters, if a need arises, we can
>> (painfully) redefine them as PVALID and allow them. But, if we
>> map them to something else, we lose all information about what
>> was intended/desired and end up in precisely the mess we have
>> with e.g., Final Sigma and ZWJ/ZWNJ in which "the right thing
>> to do" poses enough compatibility problems to cause opposition
>> to making changes.
> We definitely have to look at this carefully. I'm not overly concerned
> in general, but we shouldn't just gloss over it.
>>> 5. The character does not have the Script value: Hangul
>>> The REMAP characters are removed from DISALLOWED, so that the
>>> TABLES values form a partition (all the values are disjoint).
>> This strikes me as dangerous -- see below.
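The five REMAP criteria quoted above can be read as a single predicate. The following Python sketch rests on several assumptions that are not part of the proposal text: the PVALID/CONTEXTO test is passed in as a caller-supplied function `is_valid` (the standard library has no IDNA tables), NFKC_Casefold is approximated by NFKC plus `str.casefold()`, LetterDigit is taken to mean the general categories Ll, Lu, Lo, Nd, Lm, Mn and Mc, and the Hangul script test is approximated by looking for HANGUL in the character name.

```python
import unicodedata

# Assumed reading of "LetterDigit" (general categories); not from this thread.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}
REMAP_DECOMPOSITION_TYPES = {
    "initial", "medial", "final", "isolated", "wide", "narrow", "compat",
}


def approx_nfkc_casefold(text):
    """Approximate NFKC_Casefold; the stdlib does not expose the property."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


def is_remap(ch, is_valid):
    """Apply criteria 1-5 to one character.

    is_valid is a hypothetical callable answering "is this character
    PVALID or CONTEXTO?".
    """
    if is_valid(ch):                                   # criterion 1
        return False
    mapped = approx_nfkc_casefold(ch)
    if not mapped or not all(is_valid(c) for c in mapped):  # criterion 2
        return False
    category = unicodedata.category(ch)
    if category not in LETTER_DIGIT_CATEGORIES and category != "Pd":
        return False                                   # criterion 3
    decomp = unicodedata.decomposition(ch)
    tag = decomp[1:decomp.index(">")] if decomp.startswith("<") else None
    if tag not in REMAP_DECOMPOSITION_TYPES:           # criterion 4
        return False
    # Criterion 5, approximated by name because the stdlib has no Script data.
    return "HANGUL" not in unicodedata.name(ch, "")
```

With a toy `is_valid` that accepts only ASCII lowercase letters, digits and the hyphen, FULLWIDTH LATIN CAPITAL LETTER A (U+FF21) satisfies all five criteria as quoted, while plain "A" does not, because it has none of the listed decomposition types.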
>>> B. Protocols document
>>> Change sections 4.2.1 and 5.3 so as to
>>> 1. Mapping all REMAP characters according to the Unicode
>>> property NFKC_Casefold,
>>> 2. Then normalizing the result according to NFC.
> We have to make sure this transform is idempotent on all strings we
> are concerned about, or introduce additional steps if necessary.
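The two mapping steps and the idempotence check Martin asks for can be sketched as follows. This is an illustration under assumptions, not the proposed specification text: NFKC_Casefold is approximated with NFKC plus `str.casefold()` because Python's standard library does not expose that property, and the choice of which characters count as REMAP is left to a caller-supplied predicate `is_remap` (hypothetical).

```python
import unicodedata


def approx_nfkc_casefold(text):
    """Approximate NFKC_Casefold using stdlib primitives only."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


def map_label(label, is_remap):
    """Step 1: map the REMAP characters; step 2: normalize to NFC."""
    mapped = "".join(
        approx_nfkc_casefold(ch) if is_remap(ch) else ch for ch in label
    )
    return unicodedata.normalize("NFC", mapped)


def is_idempotent(label, is_remap):
    """True when mapping an already-mapped label changes nothing."""
    once = map_label(label, is_remap)
    return map_label(once, is_remap) == once


def find_non_idempotent(labels, is_remap):
    """Return the labels for which a second pass would alter the result."""
    return [s for s in labels if not is_idempotent(s, is_remap)]
```

Running `find_non_idempotent` over a corpus of candidate labels would show whether additional steps are needed for the strings of interest; an empty result for a sample does not prove idempotence in general.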
> Regards, Martin.
>> Making this change to 4.2.1 eliminates the requirement that the
>> registrant understand _exactly_ what is being registered, i.e.,
>> that the communication path between the registrant and registry
>> occur only using U-labels and/or A-labels. My understanding was
>> that we had reached one of the clearer points of consensus we had in
>> these discussions that the "no mapping on registration"
>> restriction was appropriate. Are you proposing to reopen that
>>> The rest of the tests for U-Label remain unchanged.
>> I believe that doing this by the type of change to Tables that
>> you recommend either requires a change to the way that the
>> definition of U-label is stated or requires us to abandon the
>> very clear concept of a U-label that is completely symmetric,
>> with no information loss in either direction, with an A-label.
>> There is also a subtle interaction with Section 5.5: if the
>> mapping is performed by the time Section 5.3 concludes (or,
>> under special circumstances, not applied at all), then Section
>> 5.5 must also prohibit REMAP.
>>> C. Defs document
>>> 1. Define REMAP
>>> 2. Define an M-Label to be one which if remapped according
>>> to B1+B2, results in a U-Label.
>> The idea of an M-Label still makes me uncomfortable. Again, we
>> have had that discussion before.
>> Idna-update mailing list
>> Idna-update at alvestrand.no
> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
> #-# http://www.sw.it.aoyama.ac.jp mailto:duerst at it.aoyama.ac.jp