Possible definition for MVALID and a mapping table

Mon Apr 13 21:52:03 CEST 2009

On Sat, Apr 11, 2009 at 5:15 PM, John C Klensin <klensin at jck.com> wrote:
> I hope we can formulate this as rules that generate tables, but
> I'd like to see if we can agree on principles before we get down
> to details.

I agree that this approach seems likely to lead to rough consensus
(and I think I said something like this before).

> Principles:
>
> (1) No character is mapped if it would map to a DISALLOWED
> character (I think we are agreed about that one).

Also, if a character is PVALID, CONTEXTO or CONTEXTJ to begin with, it
must not be mapped to something else, with the possible exception of
combining marks that would combine with base letters under NFC and
maybe even Jamos(?) that combine with each other under NFC. (I don't
know how much consensus there is on the Jamo issue. I'm just including
it for completeness.)
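
To make the NFC exception concrete, here is a minimal sketch using Python's standard unicodedata module. The helper name changed_by_nfc is made up for illustration, and the results reflect whatever Unicode version the Python build bundles, not necessarily the one the WG is targeting.

```python
import unicodedata

def changed_by_nfc(seq):
    """Return True if NFC normalization alters the sequence (hypothetical helper)."""
    return unicodedata.normalize("NFC", seq) != seq

# Base letter + combining acute composes to U+00E9.
print(changed_by_nfc("e\u0301"))        # True
# Leading Jamo U+1100 + vowel Jamo U+1161 compose to the syllable U+AC00.
print(changed_by_nfc("\u1100\u1161"))   # True
# No precomposed "q with acute" exists, so the sequence is left alone.
print(changed_by_nfc("q\u0301"))        # False
```

The first two cases are the ones where a character that is PVALID on its own would still end up "mapped" simply because NFC is applied.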

> (2) Only those NFKC mappings that are identified in UnicodeData
> as <wide> or <narrow> are automatically included.  <compat> will
> have to be considered on a case-by-case basis or with
> discrimination based on other rules.  There appear to be 673
> characters (fortunately quite a few less once rule (1) is
> applied) in that group in Unicode 5.1, so I certainly hope we
> can come up with a better discrimination function.

I don't know of any other Unicode properties that would help us
subdivide the <compat> set, but I'd like to tentatively suggest one
criterion we have discussed earlier: how easy it is to (mis)type the
character on a keyboard or in an input method. This suggestion is
tentative because it may be controversial, given that the IETF often
stays away from UI issues. Even more tentatively, I suggest that such
data could be gathered from Wikipedia's pages on keyboard layouts, WG
experts on input methods, and maybe even character usage frequency
data from the Web or other sources.
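
For anyone who wants to look at the groups in rule (2), the following is a rough sketch that tallies compatibility decomposition tags using Python's unicodedata module. The function names compat_tag and tally_tags are my own, and the counts will follow the Unicode version bundled with the Python build (see unicodedata.unidata_version), so they should not be expected to match the Unicode 5.1 figure quoted above.

```python
import sys
import unicodedata
from collections import Counter

def compat_tag(ch):
    """Return the compatibility tag (e.g. '<wide>') for ch, or None.

    Canonical decompositions carry no tag, so they also yield None.
    """
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp.split(" ", 1)[0]
    return None

def tally_tags():
    """Count code points per compatibility tag (hypothetical helper)."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        tag = compat_tag(chr(cp))
        if tag is not None:
            counts[tag] += 1
    return counts

if __name__ == "__main__":
    print("Unicode version:", unicodedata.unidata_version)
    for tag, n in sorted(tally_tags().items()):
        print(tag, n)
```

A further filter that drops characters whose mapping target would be DISALLOWED (rule (1)) would need the IDNA property tables, which this sketch does not attempt.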

> (3) Of the case-related operations, only toLowerCase is used to
> form mapping functions.  The additional cases that result in
>    toLowerCase(cp) <> toCaseFold(cp)
> are all potentially problematic and, if they are to be included
> (mapped), require case-by-case consideration.

Yes, toLowerCase vs toCaseFold is something we should look into. I'd
also suggest looking into characters that were mapped to nothing in
IDNA2003 and default ignorables that were added after Unicode 3.2.
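
As a starting point for the toLowerCase vs toCaseFold comparison, here is a small sketch in Python. It assumes that str.lower() and str.casefold() are acceptable stand-ins for toLowerCase and toCaseFold; both use full (possibly multi-character) mappings and follow the Unicode version bundled with the Python build, so the resulting list is only indicative. The helper names are hypothetical.

```python
import sys

def lower_differs_from_fold(ch):
    """True if lowercasing and case folding give different results for ch."""
    return ch.lower() != ch.casefold()

def differing_code_points():
    """List every code point where the two operations disagree."""
    return [cp for cp in range(sys.maxunicode + 1)
            if lower_differs_from_fold(chr(cp))]

if __name__ == "__main__":
    for cp in differing_code_points():
        ch = chr(cp)
        # Show the code point and both results as escaped strings.
        print("U+%04X lower=%a fold=%a" % (cp, ch.lower(), ch.casefold()))
```

Two well-known members of the list are U+00DF (sharp s, which folds to "ss") and U+03C2 (final sigma, which folds to U+03C3).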

The mapping spec should probably also say something about per-label
versus per-FQDN processing, given that some IDNA2003 implementations
process entire FQDNs, so that NFKC can generate "extra" dots. Although
the characters involved might not be MVALID under the WG's rough
consensus, it might be prudent to mention the issue explicitly as a
warning or guide for implementers.
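
A minimal sketch of the "extra dots" effect, using Python's unicodedata module and U+2024 ONE DOT LEADER, whose compatibility decomposition is U+002E. The two function names are invented for illustration, and neither function is meant to model a complete IDNA2003 or IDNA2008 pipeline.

```python
import unicodedata

def labels_per_fqdn(name):
    """Normalize the whole name first, then split into labels."""
    return unicodedata.normalize("NFKC", name).split(".")

def labels_per_label(name):
    """Split into labels first, then normalize each label separately."""
    return [unicodedata.normalize("NFKC", label) for label in name.split(".")]

name = "a\u2024b.example"
print(labels_per_fqdn(name))    # ['a', 'b', 'example']
print(labels_per_label(name))   # ['a.b', 'example']
```

Whole-FQDN processing silently yields three labels, while per-label processing yields a label that contains a dot and would have to be rejected.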

Erik