I-D Action:draft-ietf-idnabis-mappings-00.txt

Tue Jun 30 14:47:15 CEST 2009

Mark, John, et al,

In an effort to truly get to a conclusion in time for Stockholm, may I
ask that we start with a minimum mapping position (e.g. lowercasing)
and build up from there? That's more or less what Pete was trying to
do. John K and Martin D make arguments that favor that tactic, and it
will allow us to balance how much mapping (information loss) we accept.

I think we have agreed on, and should stick with, the 1:1 relationship
between A-labels and U-labels and no mapping on registration. Both
points have long been agreed.

Paul and Pete are, I believe, attempting to formulate a proposal.
Mark has made one that, for some people, may overshoot the degree of
mapping believed necessary to preserve some (but not all) elements of
backward compatibility with IDNA2003. I think it is fair to say that
we would not be having this discussion were it not for the consensus
that IDNA2003 had properties that led to the creation of the IDNABIS
working group. Our task is to find a path forward that strikes a
balance between no mapping on lookup and too much mapping.

Preserving the eszett and other special characters, lowercasing, and
probably dealing with the CJK characters would likely be a minimum
treatment to start with, as I think Pete attempted in his first proposal.

Mark says that, for lookup purposes, we are creating a class larger
than the A-labels, with the property that the lookup process will
convert its members into A-label format. For precision, describing that
fact has some value. John, I can make a stab at drawing it just to
see how difficult that gets. Maybe we can just say it without drawing
it?
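
One way to say it without drawing it is a short Python sketch: a larger set of input labels, every member of which the lookup process converts to a single A-label. This is only an illustration under stated assumptions. The mapping shown (NFKC, lowercasing, then NFC) is a placeholder for whatever mapping is finally agreed, the function names are invented for this example, and the built-in punycode codec is used directly with none of the validity checks the Protocol document requires.

```python
import unicodedata

def map_for_lookup(label: str) -> str:
    """Placeholder mapping: compatibility-normalize, lowercase, then NFC."""
    lowered = unicodedata.normalize("NFKC", label).lower()
    return unicodedata.normalize("NFC", lowered)

def to_a_label(u_label: str) -> str:
    """Convert an already-mapped label to A-label form (no validity checks)."""
    if u_label.isascii():
        return u_label
    return "xn--" + u_label.encode("punycode").decode("ascii")

def lookup_form(label: str) -> str:
    """Map first, then convert; many inputs share one result."""
    return to_a_label(map_for_lookup(label))

# Several members of the larger class; the last one starts with U+FF42.
variants = ["bücher", "BÜCHER", "Bücher", "\uff42ücher"]
print({lookup_form(v) for v in variants})
```

The set printed at the end has a single element, which is the point: the relation from the larger class to A-labels is many-to-one, while the relation between U-labels and A-labels stays 1:1.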

vint

On Jun 30, 2009, at 6:29 AM, Martin J. Dürst wrote:

>
>
> On 2009/06/30 2:55, John C Klensin wrote:
>> Mark,
>>
>> Several comments inline...
>>
>> --On Sunday, June 28, 2009 21:28 -0700 Mark Davis ⌛
>> <mark at macchiato.com>  wrote:
>>
>>> Returning to the discussion, now that some of my other
>>> standards work is under control (RFC4646bis was approved,
>>> whew!)
>>> ...
>>
>>> Now, my position is still that the simplest and most
>>> compatible option open to us is to simply map with NFKC +
>>> Casefold.
>>
>> I continue to believe that CaseFold is a showstopper.  When its
>> results are not identical to those produced by LowerCase, it
>> produces results that are astonishing to some users and leads us
>> into the "is that a separate character or not" trap that we've
>> seen manifested at least twice.  I note that TUS recommends
>> against its use for mapping (as distinct from comparison) and
>> appears to do so for just the reason that it involves too much
>> information loss.
>
> I have earlier said that I think Mark's proposal goes in the right
> direction, but I agree with John that LowerCase is better than
> CaseFold. If anything, the burden of proof should be on the CaseFold
> side (show, for each case of mapping that's in CaseFold but not
> LowerCase, why it's needed) rather than on the LowerCase side.
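
The set of cases that would need such a justification can be enumerated mechanically. The sketch below uses Python's built-in `str.lower()` and `str.casefold()` as stand-ins for LowerCase and CaseFold; the results depend on the Unicode version bundled with the interpreter, so the count is not stated here.

```python
import sys

def casefold_differs_from_lower():
    """Yield (char, lower, casefold) wherever the two mappings disagree."""
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        low, fold = ch.lower(), ch.casefold()
        if low != fold:
            yield ch, low, fold

differences = list(casefold_differs_from_lower())
print(len(differences), "code points where casefold() != lower()")
for ch, low, fold in differences[:10]:
    print(f"U+{ord(ch):04X} lower={low!a} casefold={fold!a}")
```

Eszett (U+00DF) and final sigma (U+03C2) both appear in the list: lowercasing leaves each unchanged, while case folding maps them to "ss" and to U+03C3 respectively.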
>
> Mark wrote, in a later mail:
>
> You make it sound like final sigma, ZWJ/NJ, eszett and the other
> cases under discussion were oversights in the process of developing
> the current IDNA. That wasn't the case; these were deliberate choices
> made at the time. A case mapping is also a 'loss of information', but
> one that people clearly want.
>
>
> Eszett wasn't exactly an oversight, I knew at the time that it was
> problematic and told others. However, I didn't have the zeal to defend
> it because as a Swiss, I didn't and don't feel as attached to it as
> Germans and Austrians do.
>
> My understanding of why the eszett got mapped in IDNA 2003 was that
> the IETF wanted a one-stop shopping table, and Unicode had such a
> table, and any discussions about individual characters were out of
> fashion because it was felt that if we started discussing individual
> characters, we would never finish.
>
>
>>> ...
>>> Proposal: A. Tables document
>>>
>>> Add a new type of character: REMAP. A character is REMAP if it
>>> meets *all of* the following criteria:
>>>
>>>    1. The character is not PVALID or CONTEXTO
>>>    2. If remapped by the Unicode property NFKC_Casefold*, then
>>>       the resulting character(s) are all PVALID or CONTEXTO
>>>    3. The character is a LetterDigit or Pd
>>>    4. The character has one of the following
>>>       Decomposition_Type values: initial, medial, final,
>>>       isolated, wide, narrow, or compat
>>
>> I am very concerned that collapsing initial, medial, and final
>> together may get us into problems with other language
>> communities similar to those we have gotten into with Final
>> Sigma, especially when those communities denote word boundaries
>> by the appearance of final or initial forms and hence would use
>> those forms in a style similar to the way "BigCompany" or
>> "big-company" might be used in ASCII.
>
> The only character with <initial>, <medial>, <final>, or <isolated>
> that does not currently contain the word "ARABIC" in its name is
> U+FDFC, RIAL SIGN, which is Arabic all the same even if its name
> doesn't say so.
>
> I strongly doubt that the UTC would encode other backwards
> compatibility contextual forms in these four categories, and it might
> be possible to make sure that doesn't happen with a stability
> guarantee if that's really necessary.
>
> What I already asked Mark for, and what I'm still looking for, is some
> data on how (in)frequent these actually are.
>
>
> As for <wide>, that includes only U+3000 (full width space, irrelevant
> here) and U+FFxx characters that contain FULLWIDTH in their name.
>
> As for <narrow>, that includes HANGUL, KATAKANA, and 11 characters in
> the U+FFxx area, all of which contain the word HALFWIDTH. The one to
> watch out for is U+FF61, HALFWIDTH IDEOGRAPHIC FULL STOP. Its
> fullwidth sibling (U+3002) is part of IDNA 2003.
>
> For these two (wide/narrow), I know from local experience here in
> Japan that they are probably necessary. Still, it would be good to get
> some numbers from Mark.
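
The per-category observations above can be reproduced with a short script. This is a sketch that reads the decomposition tag from Python's `unicodedata` module; it reflects whatever Unicode version the interpreter ships with, not UnicodeData-5.2.0d9.txt, so the counts may differ from those quoted in this thread.

```python
import sys
import unicodedata
from collections import defaultdict

def chars_by_decomposition_type():
    """Group code points by the <tag> in their decomposition mapping."""
    groups = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        if decomp.startswith("<"):
            tag = decomp[1:decomp.index(">")]
            groups[tag].append(cp)
    return groups

groups = chars_by_decomposition_type()
for tag in ("initial", "medial", "final", "isolated", "wide", "narrow", "compat"):
    print(f"{tag:9} {len(groups[tag])}")

# Contextual forms whose character name does not mention ARABIC.
contextual = [cp for t in ("initial", "medial", "final", "isolated")
              for cp in groups[t]]
non_arabic = [cp for cp in contextual
              if "ARABIC" not in unicodedata.name(chr(cp), "")]
print([f"U+{cp:04X} {unicodedata.name(chr(cp))}" for cp in non_arabic])
```
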
>
>
> As for <compat>, that's the "everything else" bucket. That's a total
> of 720 characters in Unicode 5.2 (as of UnicodeData-5.2.0d9.txt). Not
> all of them qualify by Mark's rules (in particular, things such as
> parenthesized numbers don't, because parentheses aren't allowed), but
> in my opinion there are still way too many that qualify. It would be
> good to know from Mark how many of these he really thinks need to be
> mapped, and why. If that's, let's say, 90% or 95% of the characters
> that would qualify by Mark's rules, it might be okay to just leave the
> rest as is, provided we can see no harm. Otherwise, I think a more
> detailed analysis may be necessary.
>
> To be more explicit, I think *at least* the following are included by
> the rules that Mark proposes but shouldn't be used for mapping:
>
> - ROMAN NUMERALs (32)
> - CJK/KANGXI RADICALs (216)
> - IDEOGRAPHIC TELEGRAPH SYMBOLs (68)
>
> Excluding characters with the words HANGUL, PARENTHESIZED, COMMA, and
> FULL STOP (all of which are excluded by Mark's rules) reduces the
> overall total from 720 to 456. In these, there are at least three
> categories:
> - Some more that are already excluded by Mark's rules but that my
>   simple greps didn't catch.
> - Those that I think definitely shouldn't be included (see above,
>   316 in total).
> - The rest, possibly okay to include, which is at most 140.
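
A rough equivalent of those greps, as a sketch: the substring lists below are assumptions about what the greps matched, and the counts come from the Unicode version bundled with Python rather than from the 5.2 draft data, so they need not equal 720, 456, 316 and 140.

```python
import sys
import unicodedata

# Name substrings; assumed approximations of the greps described above.
EXCLUDED_WORDS = ("HANGUL", "PARENTHESIZED", "COMMA", "FULL STOP")
DOUBTFUL_WORDS = ("ROMAN NUMERAL", "RADICAL", "TELEGRAPH SYMBOL")

compat = [
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.decomposition(chr(cp)).startswith("<compat>")
]
names = {ch: unicodedata.name(ch, "") for ch in compat}

# Drop what the name-based exclusions catch, then split the remainder.
remaining = [ch for ch in compat
             if not any(w in names[ch] for w in EXCLUDED_WORDS)]
doubtful = {w: [ch for ch in remaining if w in names[ch]]
            for w in DOUBTFUL_WORDS}
rest = [ch for ch in remaining
        if not any(w in names[ch] for w in DOUBTFUL_WORDS)]

print("compat total:", len(compat))
print("after name-based exclusions:", len(remaining))
for word, chars in doubtful.items():
    print(f"  {word}: {len(chars)}")
print("rest:", len(rest))
```
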
>
>
>> As I've said several times before, even if we disallow the
>> NFKC-affected forms of those characters, if a need arises, we can
>> (painfully) redefine them as PVALID and allow them.  But, if we
>> map them to something else, we lose all information about what
>> was intended/desired and end up in precisely the mess we have
>> with, e.g., Final Sigma and ZWJ/ZWNJ, in which "the right thing
>> to do" poses enough compatibility problems to cause opposition
>> to making changes.
>
> We definitely have to look at this carefully. I'm not overly concerned
> in general, but we shouldn't just gloss over it.
>
>>>    5. The character does not have the Script value: Hangul
>>>
>>> The REMAP characters are removed from DISALLOWED, so that the
>>> TABLES values form a partition (all the values are disjoint).
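
The five REMAP criteria can be sketched as a predicate. Several parts are assumptions made for illustration only: `is_valid` stands in for the real "PVALID or CONTEXTO" test from the Tables document, `nfkc_casefold` approximates the NFKC_Casefold property (it does not remove default-ignorable code points), the LetterDigit set is taken to be the general categories Ll, Lu, Lo, Nd, Lm, Mn and Mc, and the Hangul test uses the character name as a proxy because the standard library does not expose the Script property.

```python
import unicodedata

LETTER_DIGIT = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}  # assumed definition
REMAP_TYPES = {"initial", "medial", "final", "isolated",
               "wide", "narrow", "compat"}

def nfkc_casefold(s: str) -> str:
    """Approximation of NFKC_Casefold: NFKC, case fold, NFKC again."""
    folded = unicodedata.normalize("NFKC", s).casefold()
    return unicodedata.normalize("NFKC", folded)

def is_remap(ch: str, is_valid) -> bool:
    """Apply criteria 1-5; is_valid(c) is a hypothetical validity test."""
    decomp = unicodedata.decomposition(ch)
    dtype = decomp[1:decomp.index(">")] if decomp.startswith("<") else None
    return (
        not is_valid(ch)                                       # criterion 1
        and all(is_valid(c) for c in nfkc_casefold(ch))        # criterion 2
        and unicodedata.category(ch) in LETTER_DIGIT | {"Pd"}  # criterion 3
        and dtype in REMAP_TYPES                               # criterion 4
        and "HANGUL" not in unicodedata.name(ch, "")           # criterion 5
    )

def toy_valid(c: str) -> bool:
    """Toy stand-in for the Tables lookup: lowercase ASCII letters, digits."""
    return c.isascii() and (c.islower() or c.isdigit())

for ch in ("\uff21", "\ufb01", "A", "\u2460", "\uffa1"):
    print(f"U+{ord(ch):04X}", is_remap(ch, toy_valid))
```

With the toy predicate, U+FF21 (fullwidth A) and U+FB01 (the fi ligature) qualify, while plain "A" (no decomposition type), U+2460 (category No) and U+FFA1 (halfwidth Hangul) do not.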
>>
>> This strikes me as dangerous -- see below.
>>
>>> B. Protocols document
>>>
>>> Change sections 4.2.1 and 5.3 so as to require:
>>>
>>>    1. Mapping all REMAP characters according to the Unicode
>>>       property NFKC_Casefold,
>>>    2. Then normalizing the result according to NFC.
>
> We have to make sure this transform is idempotent on all strings we
> are concerned about, or introduce additional steps if necessary.
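
A first check of that property could look like the sketch below. It is an assumption-laden approximation: `remap` applies NFKC and Python's `casefold()` to every character rather than only to REMAP characters, and it stands in for the real NFKC_Casefold property. It also tests single code points only, which cannot prove idempotence for arbitrary strings, since normalization can interact across character boundaries.

```python
import sys
import unicodedata

def remap(s: str) -> str:
    """Steps B1 then B2: approximate NFKC_Casefold, then NFC."""
    folded = unicodedata.normalize("NFKC", s).casefold()
    return unicodedata.normalize("NFC", folded)

def non_idempotent_code_points(transform):
    """Yield code points c where transform(transform(c)) != transform(c)."""
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogates
            continue
        once = transform(chr(cp))
        if transform(once) != once:
            yield cp

offenders = list(non_idempotent_code_points(remap))
print(len(offenders), "code points where remap(remap(c)) != remap(c)")
for cp in offenders[:10]:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp), '?')}")
```

Any code point reported here would be a candidate for the "additional steps" mentioned above, for example applying the mapping repeatedly until the result stops changing.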
>
> Regards,    Martin.
>
>> Making this change to 4.2.1 eliminates the requirement that the
>> registrant understand _exactly_ what is being registered, i.e.,
>> that the communication path between the registrant and registry
>> occur only using U-labels and/or A-labels.  My understanding was
>> that one of the clearer points of consensus we had reached in
>> these discussions was that the "no mapping on registration"
>> restriction was appropriate.  Are you proposing to reopen that
>> question?
>>
>>> The rest of the tests for U-Label remain unchanged.
>>
>> I believe that doing this by the type of change to Tables that
>> you recommend either requires a change to the way that the
>> definition of U-label is stated or requires us to abandon the
>> very clear concept of a U-label that is completely symmetric,
>> with no information loss in either direction, with an A-label.
>>
>> There is also a subtle interaction with Section 5.5: if the
>> mapping is performed by the time Section 5.3 concludes (or,
>> under special circumstances, not applied at all), then Section
>> 5.5 must also prohibit REMAP.
>>
>>> C. Defs document
>>>
>>>    1. Define REMAP
>>>    2. Define an M-Label to be one which, if remapped according
>>>       to B1+B2, results in a U-Label.
>>
>> The idea of an M-Label still makes me uncomfortable.  Again, we
>> have had that discussion before.
>>
>> regards,
>>    john
>>
>>
>>
>>
>> _______________________________________________
>> Idna-update mailing list
>> Idna-update at alvestrand.no
>> http://www.alvestrand.no/mailman/listinfo/idna-update
>
> -- 
> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
> #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp