MVALID (was Re: M-Label or MVALID, and dangers with mappings?)

Sat Apr 11 22:54:44 CEST 2009

It sounds to me like the MVALID idea needs some tuning. John's point about
operating on characters or full labels, rather than on substrings, sounds
like a pretty critical choice. Can we make this work on a purely
character-by-character mapping basis?
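
To make the choice concrete, here is a minimal sketch of John's Eszett
example (quoted below) under two readings of the rule: map the whole label as
soon as any character requires mapping, or map only the characters that
require it. Everything named here is an assumption for illustration, not
text from Protocol: the tiny default-ignorable set, treating Eszett as
PVALID, and the helper names.

```python
import unicodedata

# Illustrative subset of Default_Ignorable_Code_Point (assumption: the real
# property covers many more code points).
DEFAULT_IGNORABLE_SAMPLE = {"\u00ad", "\u200b", "\u2060", "\ufeff"}

# Assumption for this sketch only: Eszett is treated as PVALID.
ASSUMED_PVALID = {"\u00df"}


def remove_di(text):
    """Drop the sample default-ignorable characters."""
    return "".join(ch for ch in text if ch not in DEFAULT_IGNORABLE_SAMPLE)


def map_idna2003_style(text):
    """toNFKC(removeDI(toCaseFold(toNFKC(text)))), as in the quoted proposal."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", remove_di(folded))


def requires_mapping(ch):
    """True if the mapping changes ch and ch is not assumed PVALID."""
    return ch not in ASSUMED_PVALID and map_idna2003_style(ch) != ch


def map_whole_label(label):
    """Map the entire label once any character in it requires mapping."""
    if any(requires_mapping(ch) for ch in label):
        return map_idna2003_style(label)
    return label


def map_per_character(label):
    """Map only the characters that require it, then normalize to NFC."""
    mapped = "".join(
        map_idna2003_style(ch) if requires_mapping(ch) else ch for ch in label
    )
    return unicodedata.normalize("NFC", mapped)


if __name__ == "__main__":
    for label in ("stra\u00dfe", "Stra\u00dfe"):
        print(label, map_whole_label(label), map_per_character(label))
```

With the whole-label reading, "straße" is left alone but "Straße" comes out
as "strasse", which is the surprise John describes. With the per-character
reading both come out as "straße".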

2009/4/11 John C Klensin <klensin at jck.com>

>
>
> --On Saturday, April 11, 2009 09:14 -0700 Mark Davis
> <mark at macchiato.com> wrote:
>
> > We are thinking along very similar lines. Yes, I think what we
> > want to do is have the definition of MVALID as those
> > characters that are subject to IDNA2003-style mapping. I think
> > it is best to call it a slightly different name, since it is
> > those characters subject to mapping, and we don't want people
> > to think it is all those characters valid in an M-Label. I'll
> > use the working name MSUBJECT.
> > The process in Protocol would be along the following lines.
> >
> > 1. For any substring of the input whose characters are all in
> > MSUBJECT,
>
> I think this has MSUBJECT as a superset of PVALID, not as
> characters that are actually mapped under IDNA2003-like rules.
> I don't necessarily have a problem with that, but we need to be
> very, very, clear... especially since I'm not at all sure what
> "IDNA2003-style" covers.
>
> If my supposition about the subset relationship is not correct,
> then I'm still not sure about the implications of selecting a
> substring.  It would seem to introduce additional confusion; I
> think it would be much more sensible to operate on either
> characters or on full labels rather than dividing labels into
> substrings that have to be recombined.
>
> > convert that substring via the following mapping,
> > and replace in the source.
> >
> > substring = toNFKC(removeDI(toCaseFold(toNFKC(substring))))
> >
> > // the "removeDI" step would be dropped if we decide not to
> > remove them
>
> Note another peculiarity of this rule.  If we decide to side
> with the language authorities rather than the "having accepted
> the mistake in IDNA2003, we would rather live with it than
> change" position of the registries and allow, e.g., Eszett, then
> applying "toCaseFold(toNFKC(..." to a substring containing
> Eszett but no other characters that are non-PVALID skips this
> step and uses Eszett in the substring to be looked up.  But one
> that contains at least one character that requires mapping would
> result in a final substring that contains "ss".  I believe that
> a user would find that even more astonishing than being forced
> to use lower-case.
>
> Similarly, while we have already decided to DISALLOW Hangul
> Jamo, this rule would allow those combinations of Jamo that NFKC
> maps into Hangul syllables while not allowing those combinations
> that do not.  I'm not nearly familiar enough with Korean usage
> to know how problematic that would be in practice, but it is not
> what I think we agreed to.
>
> The latter is one of the reasons why Protocol now says "must be
> in NFC form" rather than "apply NFC".
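
The Jamo point can be checked with a small sketch. It assumes Python's
unicodedata module and two sample sequences chosen only for illustration;
the helper names are hypothetical. It contrasts "apply NFKC" with a "must
already be in NFC form" test.

```python
import unicodedata


def nfkc(text):
    """Apply NFKC, as the mapping step in the proposal would."""
    return unicodedata.normalize("NFKC", text)


def is_in_nfc(text):
    """'Must be in NFC form': a check on the input, not a transformation."""
    return unicodedata.normalize("NFC", text) == text


def contains_conjoining_jamo(text):
    """True if any character is in the Hangul Jamo block U+1100..U+11FF."""
    return any("\u1100" <= ch <= "\u11ff" for ch in text)


# Leading consonant + vowel + trailing consonant: composes to one syllable.
composable = "\u1112\u1161\u11ab"  # CHOSEONG HIEUH, JUNGSEONG A, JONGSEONG NIEUN
# Two leading consonants in a row: no precomposed syllable exists.
non_composable = "\u1100\u1100"  # CHOSEONG KIYEOK twice

if __name__ == "__main__":
    for sample in (composable, non_composable):
        mapped = nfkc(sample)
        print(
            [hex(ord(ch)) for ch in mapped],
            "Jamo left after NFKC:", contains_conjoining_jamo(mapped),
            "input already NFC:", is_in_nfc(sample),
        )
```

Applying NFKC turns the first sequence into a single Hangul syllable with no
Jamo left, so it would get through, while the second still contains Jamo and
would not. A "must be in NFC form" check rejects the first sequence as
entered instead of silently composing it.
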
>
> > 2. Transform the entire string via NFC.
> >
> > // we need to do this to make sure the result is NFC, because
> > of possible interactions between characters that are inside
> > and outside MSUBJECT.
>
> I agree, but, again, decomposing labels into substring
> components and then recombining them seems exceptionally likely
> to get us into surprises -- and implementations into states of
> confusion.
>
> > 3. Proceed with the rest of Protocol
> >...
>
>    john
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>