MVALID (was Re: M-Label or MVALID, and dangers with mappings?)

Sat Apr 11 21:37:40 CEST 2009

--On Saturday, April 11, 2009 09:14 -0700 Mark Davis
<mark at macchiato.com> wrote:

> We are thinking along very similar lines. Yes, I think what we
> want to do is have the definition of MVALID as those
> characters that are subject to IDNA2003-style mapping. I think
> it is best to call it a slightly different name, since it is
> those characters subject to mapping, and we don't want people
> to think it is all those characters valid in an M-Label. I'll
> use the working name MSUBJECT.
> The process in Protocol would be along the following lines.
> 
> 1. For any substring of the input whose characters are all in
> MSUBJECT,

I think this has MSUBJECT as a superset of PVALID, not as
characters that are actually mapped under IDNA2003-like rules.
I don't necessarily have a problem with that, but we need to be
very, very clear, especially since I'm not at all sure what
"IDNA2003-style" covers.

If my supposition about the subset relationship is not correct,
then I'm still not sure about the implications of selecting a
substring.  It would seem to introduce additional confusion; I
think it would be much more sensible to operate on either
characters or on full labels rather than dividing labels into
substrings that have to be recombined.

> convert that substring via the following mapping,
> and replace in the source.
> 
> substring = toNFKC(removeDI(toCaseFold(toNFKC(substring))))
> 
> // the "removeDI" step would be dropped if we decide not to
> remove them

Note another peculiarity of this rule.  Suppose we decide to side
with the language authorities, rather than with the registries'
"having accepted the mistake in IDNA2003, we would rather live
with it than change" position, and allow, e.g., Eszett.  Then a
substring that contains Eszett but no non-PVALID characters skips
the "toCaseFold(toNFKC(..." step, and Eszett stays in the string
to be looked up.  But a substring that contains at least one
character that requires mapping would end up containing "ss"
instead.  I believe that a user would find that even more
astonishing than being forced to use lower-case.
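
To make the Eszett case concrete, here is a rough Python sketch.
It is an illustration under assumptions, not Protocol text:
remove_di strips only a small sample of default-ignorable code
points, ASSUMED_PVALID models the hypothetical decision to allow
Eszett, and map_if_needed models the reading above, in which a
substring is mapped as a whole only when some character in it
requires mapping.  Python's str.casefold stands in for toCaseFold.

```python
import unicodedata

# Illustrative sample only; the real default-ignorable set is much larger.
DI_SAMPLE = {"\u00ad", "\u200c", "\u200d"}

# Assumption: Eszett (U+00DF) has been made PVALID.
ASSUMED_PVALID = {"\u00df"}


def remove_di(s):
    """Drop the sample default-ignorable code points."""
    return "".join(ch for ch in s if ch not in DI_SAMPLE)


def idna2003_style_map(s):
    """toNFKC(removeDI(toCaseFold(toNFKC(s)))) as quoted in the proposal."""
    s = unicodedata.normalize("NFKC", s)
    s = s.casefold()
    s = remove_di(s)
    return unicodedata.normalize("NFKC", s)


def requires_mapping(ch):
    """True if the character is changed by the mapping and is not assumed PVALID."""
    return ch not in ASSUMED_PVALID and idna2003_style_map(ch) != ch


def map_if_needed(substring):
    """Map the whole substring only if some character in it requires mapping."""
    if any(requires_mapping(ch) for ch in substring):
        return idna2003_style_map(substring)
    return substring


print(map_if_needed("stra\u00dfe"))  # no character requires mapping
print(map_if_needed("Stra\u00dfe"))  # the capital S triggers the full mapping
```

Under these assumptions the all-lower-case input keeps its
Eszett, while the same word with a capital "S" comes out with
"ss", which is the inconsistency described above.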

Similarly, while we have already decided to DISALLOW Hangul
Jamo, this rule would allow those combinations of Jamo that NFKC
maps into Hangul syllables while not allowing those combinations
that do not.  I'm not nearly familiar enough with Korean usage
to know how problematic that would be in practice, but it is not
what I think we agreed to.

The latter is one of the reasons why Protocol now says "must be
in NFC form" rather than "apply NFC".
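
The Jamo point can be sketched the same way.  The two checks
below are hypothetical stand-ins, not Protocol text:
apply_then_check normalizes first and only then looks for
leftover conjoining Jamo, while must_be_nfc refuses input that is
not already normalized.  The Jamo test is simplified to the
U+1100..U+11FF block.

```python
import unicodedata


def is_conjoining_jamo(ch):
    """Simplified test: the Hangul Jamo block U+1100..U+11FF only."""
    return "\u1100" <= ch <= "\u11ff"


def apply_then_check(label):
    """'Apply NFKC' reading: normalize, then reject any Jamo that remain."""
    mapped = unicodedata.normalize("NFKC", label)
    return not any(is_conjoining_jamo(ch) for ch in mapped)


def must_be_nfc(label):
    """'Must be in NFC form' reading: reject unnormalized input and any Jamo."""
    already_nfc = unicodedata.normalize("NFC", label) == label
    return already_nfc and not any(is_conjoining_jamo(ch) for ch in label)


# CHOSEONG KIYEOK + JUNGSEONG A: composes to the syllable U+AC00.
MODERN = "\u1100\u1161"
# CHOSEONG NIEUN-KIYEOK + JUNGSEONG A: no precomposed syllable exists.
ARCHAIC = "\u1113\u1161"

for name, label in (("modern", MODERN), ("archaic", ARCHAIC)):
    print(name, apply_then_check(label), must_be_nfc(label))
```

With the "apply" reading the first sequence is accepted and the
second is not, although both are written entirely in Jamo.  With
the "must be in NFC form" reading both are rejected, and only the
precomposed syllable itself passes.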

> 2. Transform the entire string via NFC.
> 
> // we need to do this to make sure the result is NFC, because
> of possible interactions between characters that are inside
> and outside MSUBJECT.

I agree, but, again, decomposing labels into substring
components and then recombining them seems exceptionally likely
to get us into surprises -- and implementations into states of
confusion.
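
A small Python sketch of the boundary interaction, for
illustration only.  The split into pieces and the idea that
U+0301 (combining acute) falls outside MSUBJECT are assumptions
made up for the example, and the removeDI step is left out for
brevity.

```python
import unicodedata


def map_piecewise(pieces):
    """Map only the pieces flagged as inside MSUBJECT, then rejoin.

    pieces is a list of (text, in_msubject) pairs; the split is hypothetical.
    """
    out = []
    for text, in_msubject in pieces:
        if in_msubject:
            folded = unicodedata.normalize("NFKC", text).casefold()
            text = unicodedata.normalize("NFKC", folded)
        out.append(text)
    return "".join(out)


# "A" is mapped; the combining acute accent is assumed to be left alone.
pieces = [("A", True), ("\u0301", False)]
joined = map_piecewise(pieces)

# Each piece is normalized on its own, but the rejoined string is not NFC.
print(unicodedata.normalize("NFC", joined) == joined)
print(unicodedata.normalize("NFC", joined) == "\u00e1")
```

Each piece is fine in isolation, yet the recombined string still
needs a whole-string NFC pass (or a whole-string NFC check), which
is the kind of surprise that operating on full labels avoids.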

> 3. Proceed with the rest of Protocol
>...

    john