I-D Action:draft-ietf-idnabis-mappings-00.txt

Sun Jun 7 17:57:26 CEST 2009

--On Saturday, June 06, 2009 16:38 -0700 Paul Hoffman
<phoffman at imc.org> wrote:

>...
>> I continue to believe that use of NFKC without exclusion of
>> character groups for which there are no justifications is
> 
> Pete's proposed mapping happens before the
> is-it-valid-IDNA2008 check. Why should we use a modified NFKC
> instead of plain-vanilla-NFKC and let the second step
> (is-it-valid-IDNA2008 check) happen as-is?

My concern is not with those NFKC mappings that result in
invalid (DISALLOWED) characters.  It is with:

(1) NFKC mappings of characters that, if used in domain names,
are probably used to cause mischief and for which there is no
substantive justification.   The "Mathematical" characters are
examples of this.  Martin's original list identified others.
Note that, except on specialized systems, these characters are
very difficult to type, and fonts for them are unlikely to be
present.
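
To make point (1) concrete, here is a minimal sketch using
Python's standard unicodedata module.  It is only an illustration
of what plain NFKC does to the "Mathematical" characters, not part
of any proposed mapping; the variable names are mine.

```python
import unicodedata

# MATHEMATICAL BOLD SMALL A, B, C (U+1D41A..U+1D41C): hard to type,
# rarely covered by fonts, yet plain NFKC folds them to ASCII letters.
label = "\U0001D41A\U0001D41B\U0001D41C"
mapped = unicodedata.normalize("NFKC", label)

print([f"U+{ord(c):04X}" for c in label])
print(mapped)
```

After such a mapping the result is indistinguishable from a label
that was typed as plain ASCII in the first place.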

(2) NFKC mappings of characters that result in characters in
CONTEXTO or CONTEXTJ.  Unless I missed something in my search,
this is a null set at present.  But I can find no stability rule
that would prevent adding such a character and the same
presentation and ambiguity issues that apply to the listed
CONTEXTx characters would apply to their compatibility
equivalents.

>> (i) A violation of the "inclusion" model of IDNA2008
> 
> Completely agree. However, this whole document is a violation
> of the "no-mapping" model of IDNA2008, so that seems like an
> odd objection.

We are likely to have to agree to disagree about this, but I
believe that "inclusion" and "no mapping" are separate
principles.  The acceptance of mapping in some contexts does not
seem to me to justify, in any way, abandoning "inclusion".  From
that point of view, the argument to abandon inclusion in the
mapping context has to be made separately... and that argument
has not, as far as I can remember, been made yet.

>> (ii) A violation of the closely-related protocol design
>> principle that one should include only those things for which
>> one has both use and understanding because it is easier to add
>> later than it is to remove.
> 
> Implementers of IDNA2008 will understand NFKC as well as
> implementers of IDNA2003.

Which is to say, IMO, not at all.  People understand how to
apply NFKC by invoking an operation or library module, but I
don't believe there is general understanding of NFKC.   As
Martin's note pointed out, NFKC is actually a composite
operation.  For the canonical equivalents processed by NFC/NFD,
there is really only one pair of operations: either composing or
decomposing different representational forms of _the same
character_.  For compatibility equivalents, however, the
characters are neither the same character nor different forms of
the same thing.  They are different characters which may be
equivalent for some purposes (and not for others).  And there are
many
categories of such characters -- width, font, etc. -- with NFKC
and the generic "compatibility" label bypassing those
distinctions.   I suggest that, for IDNs, the distinctions are
important, especially given the inclusion principle, and should
not be lightly discarded because of the way NFKC is defined.
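
As a rough sketch of those distinctions, the Unicode Character
Database tags each compatibility decomposition with its category,
and Python's unicodedata.decomposition() exposes that tag.  The
helper name decomposition_kind is hypothetical, and this is only
one way to look at the data.

```python
import unicodedata

def decomposition_kind(ch: str) -> str:
    """Return 'none', 'canonical', or the compatibility tag (e.g. 'wide')."""
    decomp = unicodedata.decomposition(ch)
    if not decomp:
        return "none"
    first = decomp.split()[0]
    # Compatibility decompositions start with a tag such as <font> or <wide>;
    # canonical decompositions list code points only.
    if first.startswith("<"):
        return first.strip("<>")
    return "canonical"

for ch in ("a", "\u00E9", "\uFF21", "\U0001D41A", "\u2460"):
    print(f"U+{ord(ch):04X}", decomposition_kind(ch))
```

A mapping step could, under this view, accept some tags (say,
width variants) and leave others unmapped, rather than taking
everything NFKC happens to cover.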

>> (iii) An increased risk, however slight, that we will, in the
>> future, get strong demands from some particular community to
>> treat a character classified by Unicode as "compatibility" as
>> a real and distinct character.  If such a character is
>> disallowed by virtue of not being mapped, we will have the
>> difficult problem of changing a disallowed character to a
>> PVALID one. But, if it is mapped to something else, we will
>> have to revisit the very complex discussions that we have had
>> over Eszett and Final Sigma.  We should not incur that risk
>> unless there is a reason to do so.
> 
> There is: increased (but, of course, not full) backwards
> compatibility with the large installed base of IDNA2003.

We've seen no evidence that any of these other categories of
compatibility characters are used --other than in possible
demonstrations or out of malice-- in IDNA2003-compatible
applications, much less enough to constitute a large installed
base.  In this context, the problem with Eszett and Final Sigma
is quite independent of whether and how they are used or how
important they are.  It is that we have no way to know
definitively, by looking at registered labels in zones, whether
those characters were intended or whether what they mapped to
was.  If we see "ss" in a zone, we don't know whether it started
out as Eszett or as "ss" (we may be able to apply heuristics and
make guesses in some cases, but not always).  That gives the
registries problems
--discussed at great length on this list-- if a later decision
is made to treat the original characters as separate.  While
both Eszett and Final Sigma are issues due to Case Folding
rather than Compatibility characters, I believe that similar
problems can be avoided by mapping only what we conclude that we
need to map, i.e., by not assuming that, because IDNA2003
effectively applied NFKC and is widely deployed, every NFKC
mapping is widely deployed and must be preserved for
backward-compatibility reasons.
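
The loss of information can be sketched in a few lines of Python.
This uses str.casefold() as a stand-in for the case-folding step;
it is an assumption that this is close enough to the IDNA2003
behavior for illustration, and the sample labels are made up.

```python
# Two labels a registrant might have intended as distinct.
with_eszett = "stra\u00DFe"   # "strasse" spelled with Eszett (U+00DF)
with_ss = "strasse"

# After case folding, both become the same string, so the zone
# cannot tell us which form was originally requested.
folded_eszett = with_eszett.casefold()
folded_ss = with_ss.casefold()
print(folded_eszett, folded_ss, folded_eszett == folded_ss)

# Final Sigma (U+03C2) folds to the ordinary small sigma (U+03C3).
folded_sigma = "\u03C2".casefold()
print(f"U+{ord(folded_sigma):04X}")
```

Once only the folded form is stored, no later rule change can
recover the original choice from the zone data alone.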

YMMV.

      john