I-D Action:draft-ietf-idnabis-mappings-00.txt

Mark Davis ⌛ mark at macchiato.com
Tue Jun 30 21:41:30 CEST 2009


Mark


On Mon, Jun 29, 2009 at 13:26, Vint Cerf <vint at google.com> wrote:

> Mark, et al,
> broad re-mapping, even if only for lookup, makes me wonder about ways to
> mislead users through the use of labels in domain names that would not be
> allowed until mapped. I think I can understand how we might want NOT to map
> some of the recently introduced characters (sharp-s for instance) and how we
> might want to map upper case into lower case prior to lookup (emulating the
> case independent matching of the purely ASCII domain names/labels of the
> past). I am having some difficulty with the full range of characters that
> might be invalid under IDNA2008 but mapped into valid IDNA2008 characters.
> If there has been a trend in the discussions it has been towards limiting
> the set of characters that may be mapped prior to lookup.
>

This does limit the characters remapped, and *is* a compromise (as I said,
I'd rather just map them all; it would be the simplest and offer the most
compatibility).

Let me provide some figures to illustrate this.

Caveats: sizes may change depending on tweaks/changes to the formulation,
etc.

       Size  Description                             Clause
     90,261  PVALID or CONTEXTx
  1,023,851  Not PValid or ContextO                  1
      5,815  & NFKC/Case mapped                      2
      4,513  & LetterDigit or Pd                     3
      4,312  & Results are PValid or ContextO        4
      2,324  & Results are allowed DTs & not Hangul  5,6

I reordered the proposed criteria for REMAP to make the above clearer:

   1. The character is not PVALID or CONTEXTO
   2. The character is mapped by NFKC_Casefold
   3. The character is a LetterDigit or Pd
   4. If remapped by the Unicode property NFKC_Casefold*, then the resulting
   character(s) are all PVALID or CONTEXTO
   5. The character has one of the following Decomposition_Type values:
   canonical, initial, medial, final, isolated, wide, narrow, or compat
   6. The character does not have the Script value: Hangul
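
As a rough sketch of how the six tests chain together (Python, standard
library only). Everything named here is an assumption for illustration:
`is_valid` is a hypothetical caller-supplied predicate meaning "PVALID or
CONTEXTO", NFKC_Casefold is approximated by NFKC + `str.casefold` + NFKC
(which ignores the removal of default-ignorable code points), the
LetterDigit category set is my reading of the Tables draft, and the Hangul
script test is approximated by looking for "HANGUL" in the character name,
since the standard library has no Script property.

```python
import unicodedata

# Decomposition_Type values allowed by criterion 5.
ALLOWED_DTS = {"canonical", "initial", "medial", "final",
               "isolated", "wide", "narrow", "compat"}
# Assumed LetterDigit general categories (reading of the Tables draft).
LETTER_DIGIT = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def nfkc_casefold_approx(s):
    """Approximate NFKC_Casefold; does not drop default-ignorables."""
    return unicodedata.normalize(
        "NFKC", unicodedata.normalize("NFKC", s).casefold())


def decomposition_type(ch):
    """Return the Decomposition_Type of ch, or None if it has no mapping."""
    d = unicodedata.decomposition(ch)
    if not d:
        return None
    if d.startswith("<"):
        return d[1:d.index(">")]   # e.g. "<wide> 0041" -> "wide"
    return "canonical"             # untagged mappings are canonical


def is_remap(ch, is_valid):
    """Apply criteria 1-6 in order to a single character.

    is_valid(c) is a hypothetical predicate: 'c is PVALID or CONTEXTO'.
    """
    if is_valid(ch):                                        # criterion 1
        return False
    mapped = nfkc_casefold_approx(ch)
    if mapped == ch:                                        # criterion 2
        return False
    cat = unicodedata.category(ch)
    if cat not in LETTER_DIGIT and cat != "Pd":             # criterion 3
        return False
    if not mapped or not all(is_valid(c) for c in mapped):  # criterion 4
        return False
    if decomposition_type(ch) not in ALLOWED_DTS:           # criterion 5
        return False
    # criterion 6, approximated via the character name
    return "HANGUL" not in unicodedata.name(ch, "")
```

With a stub `is_valid` that accepts only lowercase ASCII letters and digits,
U+FF21 FULLWIDTH LATIN CAPITAL LETTER A passes all six tests, while "a" is
rejected at the first one.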

I also put a breakdown of the mapped characters on
http://www.macchiato.com/unicode/idna/remap, and updated the text there for
John's comments.


> I think we need to find a space around which compromise and consensus can
> be built as to what chars are allowed to be mapped prior to look up. I think we
> all agree that there should be no implicit or explicit mapping in the
> registration process.
>

Yes, agreed; that was just an oversight on my part.
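
A minimal sketch of that split, under stated assumptions: `is_remap` is a
hypothetical caller-supplied predicate standing in for the REMAP category,
and NFKC_Casefold is approximated by NFKC + `str.casefold` + NFKC. Lookup
maps REMAP characters and then normalizes to NFC; registration never maps
and, in this sketch, simply refuses a label that contains a REMAP character.

```python
import unicodedata


def nfkc_casefold_approx(s):
    """Approximate NFKC_Casefold; does not drop default-ignorables."""
    return unicodedata.normalize(
        "NFKC", unicodedata.normalize("NFKC", s).casefold())


def map_for_lookup(label, is_remap):
    """Map only the REMAP characters, then normalize the result to NFC."""
    mapped = "".join(
        nfkc_casefold_approx(ch) if is_remap(ch) else ch for ch in label)
    return unicodedata.normalize("NFC", mapped)


def check_for_registration(label, is_remap):
    """No mapping on registration: reject labels with REMAP characters."""
    for ch in label:
        if is_remap(ch):
            raise ValueError(
                "REMAP character U+%04X not allowed at registration" % ord(ch))
    return label
```

The same label can therefore be accepted (after mapping) at lookup and still
be refused at registration, which is the asymmetry being agreed to here.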


>
> Looking for more common ground.
>

Me too.


>
> vint
>
>
> On Jun 29, 2009, at 3:46 PM, Mark Davis ⌛ wrote:
>
>
> Mark
>
>
> On Mon, Jun 29, 2009 at 10:55, John C Klensin <klensin at jck.com> wrote:
>
>> Mark,
>>
>> Several comments inline...
>>
>> --On Sunday, June 28, 2009 21:28 -0700 Mark Davis ⌛
>> <mark at macchiato.com> wrote:
>>
>> > Returning to the discussion, now that some of my other
>> > standards work is under control (RFC4646bis was approved,
>> > whew!)
>> >...
>>
>> > Now, my position is still that the simplest and most
>> > compatible option open to us is to simply map with NFKC +
>> > Casefold.
>>
>> I continue to believe that CaseFold is a showstopper.  When its
>> results are not identical to those produced by LowerCase, it
>> produces results that are astonishing to some users and leads us
>> into the "is that a separate character or not" trap that we've
>> seen manifested at least twice.  I note that TUS recommends
>> against its use for mapping (as distinct from comparison) and
>> appears to do so for just the reason that it involves too much
>> information loss.
>
>
> You need to provide actual data behind this. Please list exactly the
> characters that you mean, and why you think they are problematic. Note also
> that the formulation that I gave means that any character that is PVALID
> would automatically be excluded, eg if final-sigma is PVALID then it is
> unaffected. And we can certainly introduce other exceptions.
>
> And I know full well about the issues in TUS, having written or
> participated in the writing of them.
>
>  >...
>> > Proposal: A. Tables document
>> >
>> > Add a new type of character: REMAP. A character is REMAP if it
>> > meets *all of * the following criteria:
>> >
>> >    1. The character is not PVALID or CONTEXTO
>> >    2. If remapped by the Unicode property NFKC_Casefold*, then
>> >       the resulting character(s) are all PVALID or CONTEXTO
>> >    3. The character is a LetterDigit or Pd
>> >    4. The character has one of the following
>> >       Decomposition_Type values: initial, medial, final,
>> >       isolated, wide, narrow, or compat
>>
>> I am very concerned that collapsing initial, medial, and final
>> together may get us into problems with other language
>> communities similar to those we have gotten into with Final
>> Sigma, especially when those communities denote word boundaries
>> by the appearance of final or initial forms and hence would use
>> those forms in a style similar to the way "BigCompany" or
>> "big-company" might be used in ASCII.
>
>
> The mechanism used to indicate boundaries is not, as you think, the use of
> the presentation forms; it is the use of the ZWNJ/J, which we already
> provide for.
>
>
>>
>> As I've said several times before, even if we disallow the
>> NFKC-affected forms of those characters, if a need arises, we can
>> (painfully) redefine them as PVALID and allow them.  But, if we
>> map them to something else, we lose all information about what
>> was intended/desired and end up in precisely the mess we have
>> with e.g., Final Sigma  and ZWJ/ZWNJ in which "the right thing
>> to do" poses enough compatibility problems to cause opposition
>> to making changes.
>
>
> You make it sound like final sigma, ZWJ/NJ, eszett and the other cases
> under discussion were oversights in the process of developing the current
> IDNA. That wasn't the case; these were deliberate choices made at the time.
> A case mapping is also a 'loss of information', but one that people clearly
> want.
>
> If you have any particular characters that you think would be of concern,
> you should raise them as issues.
>
>
>>
>> >    5. The character does not have the Script value: Hangul
>> >
>> > The REMAP characters are removed from DISALLOWED, so that the
>> > TABLES values form a partition (all the values are disjoint).
>>
>> This strikes me as dangerous -- see below.
>>
>> > B. Protocols document
>> >
>> > Change sections 4.2.1 and 5.3 so as to require:
>> >
>> >    1. Mapping all REMAP characters according to the Unicode
>> >       property NFKC_Casefold,
>> >    2. Then normalizing the result according to NFC.
>>
>> Making this change to 4.2.1 eliminates the requirement that the
>> registrant understand _exactly_ what is being registered, i.e.,
>> that the communication path between the registrant and registry
>> occur only using U-labels and/or A-labels.  My understanding was
>> that we had reached one of the clearer points of consensus in
>> these discussions that the "no mapping on registration"
>> restriction was appropriate.  Are you proposing to reopen that
>> question?
>
>
> Sorry, you are correct. This would only affect the lookup part.
>
>
>>
>>
>> > The rest of the tests for U-Label remain unchanged.
>>
>> I believe that doing this by the type of change to Tables that
>> you recommend either requires a change to the way that the
>> definition of U-label is stated or requires us to abandon the
>> very clear concept of a U-label that is completely symmetric,
>> with no information loss in either direction, with an A-label.
>
>
> I don't see why you would think that.  A U-Label remains just the way it
> is, and has a 1-1 relation with an A-Label. The only difference is that we
> have an additional category of M-Label; one that is not a U-Label but maps
> to one.
>
>
>>
>> There is also a subtle interaction with Section 5.5: if the
>> mapping is performed by the time Section 5.3 concludes (or,
>> under special circumstances, not applied at all), then Section
>> 5.5 must also prohibit REMAP.
>
>
> You are correct; that was my intention, but I forgot to mention it. Yes,
> there needs to be a change in 5.5.
>
> So below:
>
>    o  Labels containing prohibited code points, i.e., those that are
>       assigned to the "DISALLOWED" category in the permitted character
>       table [IDNA2008-Tables <http://tools.ietf.org/html/draft-ietf-idnabis-protocol-12#ref-IDNA2008-Tables>].
>
>
>  add
>
>    o  Labels containing remapped code points, i.e., those that are
>       assigned to the "REMAP" category in the permitted character
>       table [IDNA2008-Tables <http://tools.ietf.org/html/draft-ietf-idnabis-protocol-12#ref-IDNA2008-Tables>].
>
>
>
>>
>> > C. Defs document
>> >
>> >    1. Define REMAP
>> >    2. Define an M-Label to be one which, if remapped according
>> >       to B1+B2, results in a U-Label.
>>
>> The idea of an M-Label still makes me uncomfortable.  Again, we
>> have had that discussion before.
>>
>> regards,
>>    john
>>
>>
>>
>>
>> _______________________________________________
>> Idna-update mailing list
>> Idna-update at alvestrand.no
>> http://www.alvestrand.no/mailman/listinfo/idna-update
>>
>