I-D Action:draft-ietf-idnabis-mappings-00.txt

Mon Jun 29 22:26:26 CEST 2009

Mark, et al,

broad re-mapping, even if only for lookup, makes me wonder about ways  
to mislead users through the use of labels in domain names that would  
not be allowed until mapped. I think I can understand how we might  
want NOT to map some of the recently introduced characters (sharp-s  
for instance) and how we might want to map upper case into lower case  
prior to lookup (emulating the case independent matching of the purely  
ASCII domain names/labels of the past). I am having some difficulty  
with the full range of characters that might be invalid under IDNA2008  
but mapped into valid IDNA2008 characters. If there has been a trend  
in the discussions it has been towards limiting the set of characters  
that may be mapped prior to lookup.
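
To make the difference between plain lower-casing and a broader mapping concrete, here is a minimal Python sketch. The helper nfkc_casefold_approx is an assumption made for illustration: it approximates the Unicode NFKC_Casefold property with the standard library only and is not the exact property (for instance, it does not remove default-ignorable code points). The sample labels are likewise made up.

```python
import unicodedata


def nfkc_casefold_approx(label: str) -> str:
    """Approximate NFKC_Casefold: NFKC, then case folding, then NFKC again.

    Assumption: this is only a stand-in for the real Unicode property.
    """
    folded = unicodedata.normalize("NFKC", label).casefold()
    return unicodedata.normalize("NFKC", folded)


# Illustrative labels only; none of these come from a real registry.
samples = {
    "upper-case ASCII": "EXAMPLE",
    "sharp-s": "stra\u00dfe",
    "final sigma": "\u03bb\u03cc\u03b3\u03bf\u03c2",
    "fullwidth Latin": "\uff41\uff42\uff43",
}

for name, label in samples.items():
    lowered = label.lower()
    mapped = nfkc_casefold_approx(label)
    print(f"{name}: lower={lowered!r} mapped={mapped!r}")
```

For upper-case ASCII the two operations agree. For sharp-s, final sigma, and the fullwidth letters the broader mapping changes the label while lower-casing leaves it alone, which is the kind of divergence the concern above is about.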

I think we need to find a space around which compromise and consensus  
can be built as to what characters are allowed to be mapped prior to lookup.  
I think we all agree that there should be no implicit or explicit  
mapping in the registration process.

Looking for more common ground.

vint

On Jun 29, 2009, at 3:46 PM, Mark Davis ⌛ wrote:

>
> Mark
>
>
> On Mon, Jun 29, 2009 at 10:55, John C Klensin <klensin at jck.com> wrote:
> Mark,
>
> Several comments inline...
>
> --On Sunday, June 28, 2009 21:28 -0700 Mark Davis ⌛
> <mark at macchiato.com> wrote:
>
> > Returning to the discussion, now that some of my other
> > standards work is under control (RFC4646bis was approved,
> > whew!)
> >...
>
> > Now, my position is still that the simplest and most
> > compatible option open to us is to simply map with NFKC +
> > Casefold.
>
> I continue to believe that CaseFold is a showstopper.  When its
> results are not identical to those produced by LowerCase, it
> produces results that are astonishing to some users and leads us
> into the "is that a separate character or not" trap that we've
> seen manifested at least twice.  I note that TUS recommends
> against its use for mapping (as distinct from comparison) and
> appears to do so for just the reason that it involves too much
> information loss.
>
> You need to provide actual data behind this. Please list exactly the  
> characters that you mean, and why you think they are problematic.  
> Note also that the formulation that I gave means that any character  
> that is PVALID would automatically be excluded, eg if final-sigma is  
> PVALID then it is unaffected. And we can certainly introduce other  
> exceptions.
>
> And I know full well about the issues in TUS, having written or  
> participated in the writing of them.
>
> >...
> > Proposal: A. Tables document
> >
> > Add a new type of character: REMAP. A character is REMAP if it
> > meets *all of* the following criteria:
> >
> >    1. The character is not PVALID or CONTEXTO
> >    2. If remapped by the Unicode property NFKC_Casefold*, then
> > the resulting character(s) are all PVALID or CONTEXTO
> >    3. The character is a LetterDigit or Pd
> >    4. The character has one of the following
> > Decomposition_Type values: initial, medial, final,
> > isolated, wide, narrow, or compat
>
> I am very concerned that collapsing initial, medial, and final
> together may get us into problems with other language
> communities similar to those we have gotten into with Final
> Sigma, especially when those communities denote word boundaries
> by the appearance of final or initial forms and hence would use
> those forms in a style similar to the way "BigCompany" or
> "big-company" might be used in ASCII.
>
> The mechanism used to indicate boundaries is not, as you think, the  
> use of the presentation forms; it is the use of the ZWNJ/J, which we  
> already provide for.
>
>
>
> As I've said several times before, even if we disallow the
> NFKC-affected forms of those characters, if a need arises, we can
> (painfully) redefine them as PVALID and allow them.  But, if we
> map them to something else, we lose all information about what
> was intended/desired and end up in precisely the mess we have
> with e.g., Final Sigma  and ZWJ/ZWNJ in which "the right thing
> to do" poses enough compatibility problems to cause opposition
> to making changes.
>
> You make it sound like final sigma, ZWJ/NJ, eszett and the other  
> cases under discussion were oversights in the process of developing  
> the current IDNA. That wasn't the case; these were deliberate  
> choices made at the time. A case mapping is also a 'loss of  
> information', but one that people clearly want.
>
> If you have any particular characters that you think would be of  
> concern, you should raise them as issues.
>
>
>
> >    5. The character does not have the Script value: Hangul
> >
> > The REMAP characters are removed from DISALLOWED, so that the
> > TABLES values form a partition (all the values are disjoint).
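
The five REMAP criteria quoted above can be sketched as a predicate in Python. Several parts are assumptions: is_valid is a hypothetical caller-supplied test for "PVALID or CONTEXTO" (the real answer comes from the Tables document), nfkc_casefold_approx only approximates the NFKC_Casefold property, "LetterDigit" is read as the General_Category values Ll, Lu, Lo, Nd, Lm, Mn, Mc, and the Hangul test uses the character name because the standard library does not expose the Script property.

```python
import unicodedata
from typing import Callable

# Criterion 3: LetterDigit (assumed to mean these categories) or Pd.
LETTER_DIGIT_OR_PD = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc", "Pd"}

# Criterion 4: the Decomposition_Type values listed in the proposal.
REMAP_DECOMPOSITION_TYPES = {
    "<initial>", "<medial>", "<final>", "<isolated>",
    "<wide>", "<narrow>", "<compat>",
}


def nfkc_casefold_approx(text: str) -> str:
    """Stand-in for NFKC_Casefold (assumption, not the exact property)."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


def decomposition_type(ch: str) -> str:
    """Return the compatibility tag such as '<wide>', or '' if there is none."""
    decomp = unicodedata.decomposition(ch)
    return decomp.split(" ", 1)[0] if decomp.startswith("<") else ""


def is_remap(ch: str, is_valid: Callable[[str], bool]) -> bool:
    """Apply criteria 1 to 5 to a single character."""
    if is_valid(ch):                                           # criterion 1
        return False
    mapped = nfkc_casefold_approx(ch)
    if not mapped or not all(is_valid(c) for c in mapped):     # criterion 2
        return False
    if unicodedata.category(ch) not in LETTER_DIGIT_OR_PD:     # criterion 3
        return False
    if decomposition_type(ch) not in REMAP_DECOMPOSITION_TYPES:  # criterion 4
        return False
    # Criterion 5: rough name-based stand-in for Script=Hangul.
    return "HANGUL" not in unicodedata.name(ch, "")


def toy_is_valid(ch: str) -> bool:
    """Toy validity test for the demo only: lower-case ASCII LDH."""
    return ch in "abcdefghijklmnopqrstuvwxyz0123456789-"


for ch in ["\uff21", "\uff0d", "a"]:
    print(f"U+{ord(ch):04X} REMAP={is_remap(ch, toy_is_valid)}")
```

The toy validity test is deliberately tiny; with the real tables the set of REMAP characters would differ.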
>
> This strikes me as dangerous -- see below.
>
> > B. Protocols document
> >
> > Change sections 4.2.1 and 5.3 so as to require:
> >
> >    1. Mapping all REMAP characters according to the Unicode
> > property NFKC_Casefold,
> >    2. Then normalizing the result according to NFC.
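
The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions: is_remap is a hypothetical caller-supplied predicate standing in for the proposed REMAP category, nfkc_casefold_approx only approximates the NFKC_Casefold property, and the toy predicate treats the fullwidth ASCII range as REMAP purely for the demo.

```python
import unicodedata
from typing import Callable


def nfkc_casefold_approx(text: str) -> str:
    """Stand-in for NFKC_Casefold (assumption, not the exact property)."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


def map_for_lookup(label: str, is_remap: Callable[[str], bool]) -> str:
    """Step 1: map only the REMAP characters. Step 2: normalize to NFC."""
    step1 = "".join(
        nfkc_casefold_approx(ch) if is_remap(ch) else ch for ch in label
    )
    return unicodedata.normalize("NFC", step1)


def toy_is_remap(ch: str) -> bool:
    """Toy predicate for the demo only: fullwidth ASCII variants."""
    return "\uff01" <= ch <= "\uff5e"


print(map_for_lookup("\uff25xample", toy_is_remap))
print(map_for_lookup("cafe\u0301", toy_is_remap))
```

Characters outside the REMAP set pass through step 1 untouched, so whether they are acceptable is still left to the remaining U-label tests.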
>
> Making this change to 4.2.1 eliminates the requirement that the
> registrant understand _exactly_ what is being registered, i.e.,
> that the communication path between the registrant and registry
> occur only using U-labels and/or A-labels.  My understanding was
> that we had reached one of the clearer points of consensus we had in
> these discussions that the "no mapping on registration"
> restriction was appropriate.  Are you proposing to reopen that
> question?
>
> Sorry, you are correct. This would only affect the lookup part.
>
>
>
> > The rest of the tests for U-Label remain unchanged.
>
> I believe that doing this by the type of change to Tables that
> you recommend either requires a change to the way that the
> definition of U-label is stated or requires us to abandon the
> very clear concept of a U-label that is completely symmetric,
> with no information loss in either direction, with an A-label.
>
> I don't see why you would think that.  A U-Label remains just the  
> way it is, and has a 1-1 relation with an A-Label. The only  
> difference is that we have an additional category of M-Label; one  
> that is not a U-Label but maps to one.
>
>
>
> There is also a subtle interaction with Section 5.5: if the
> mapping is performed by the time Section 5.3 concludes (or,
> under special circumstances, not applied at all), then Section
> 5.5 must also prohibit REMAP.
>
> You are correct; that was my intention, but I forgot to mention it.  
> Yes, there needs to be a change in 5.5.
>
> So below:
>    o  Labels containing prohibited code points, i.e., those that are
>       assigned to the "DISALLOWED" category in the permitted character
>       table [IDNA2008-Tables].
>
>  add
>    o  Labels containing remapped code points, i.e., those that are
>       assigned to the "REMAP" category in the permitted character
>       table [IDNA2008-Tables].
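
A check along the lines of the two bullets above might look like the following Python sketch. The category_of lookup is a hypothetical stand-in for the permitted character table, and toy_category_of exists only to make the example runnable; neither reflects the actual table contents.

```python
from typing import Callable, List

# Categories that would cause a label to be rejected under the amended 5.5.
REJECTED_CATEGORIES = {"DISALLOWED", "REMAP"}


def rejected_code_points(
    label: str, category_of: Callable[[str], str]
) -> List[str]:
    """List the code points in a label that fall into a rejected category."""
    return [
        f"U+{ord(ch):04X}"
        for ch in label
        if category_of(ch) in REJECTED_CATEGORIES
    ]


def toy_category_of(ch: str) -> str:
    """Toy table for the demo only, not the IDNA2008 tables."""
    if ch in "abcdefghijklmnopqrstuvwxyz0123456789-":
        return "PVALID"
    if "\uff01" <= ch <= "\uff5e":
        return "REMAP"
    return "DISALLOWED"


print(rejected_code_points("abc", toy_category_of))
print(rejected_code_points("a\uff42c", toy_category_of))
```

An empty list means the label passes this particular test; any entry would mean the mapping step was skipped or incomplete.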
>
>
>
>
> > C. Defs document
> >
> >    1. Define REMAP
> >    2. Define an M-Label to be one which if remapped according
> > to B1+B2, results in a U-Label.
>
> The idea of an M-Label still makes me uncomfortable.  Again, we
> have had that discussion before.
>
> regards,
>   john
>
>
>
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>
