I-D Action:draft-ietf-idnabis-mappings-00.txt

Mon Jun 29 19:55:41 CEST 2009

Mark,

Several comments inline...

--On Sunday, June 28, 2009 21:28 -0700 Mark Davis ⌛
<mark at macchiato.com> wrote:

> Returning to the discussion, now that some of my other
> standards work is under control (RFC4646bis was approved,
> whew!)
>...

> Now, my position is still that the simplest and most
> compatible option open to us is to simply map with NFKC +
> Casefold. 

I continue to believe that CaseFold is a showstopper.  When its
results are not identical to those produced by LowerCase, it
produces results that are astonishing to some users and leads us
into the "is that a separate character or not" trap that we've
seen manifested at least twice.  I note that TUS recommends
against its use for mapping (as distinct from comparison) and
appears to do so for just the reason that it involves too much
information loss.
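
For a concrete sense of where the two operations diverge, here is a
minimal Python sketch.  It assumes that Python's built-in str.lower()
and str.casefold() are acceptable stand-ins for Unicode LowerCase and
CaseFold; the helper name compare_case_mappings is purely illustrative
and does not come from any of the drafts.

```python
import unicodedata


def compare_case_mappings(ch: str) -> tuple[str, str]:
    """Return the (lowercased, casefolded) forms of a character."""
    return ch.lower(), ch.casefold()


# Two characters that are already lowercase but are changed by case folding.
for ch in ("ß", "ς"):
    lowered, folded = compare_case_mappings(ch)
    print(unicodedata.name(ch), "| lower:", lowered, "| casefold:", folded,
          "| differs:", lowered != folded)
```

Both characters are already lowercase, so lowercasing leaves them
alone, while case folding rewrites Sharp S as "ss" and Final Sigma as
ordinary sigma.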

>...
> Proposal: A. Tables document
> 
> Add a new type of character: REMAP. A character is REMAP if it
> meets *all of* the following criteria:
> 
>    1. The character is not PVALID or CONTEXTO
>    2. If remapped by the Unicode property NFKC_Casefold*, then
>       the resulting character(s) are all PVALID or CONTEXTO
>    3. The character is a LetterDigit or Pd
>    4. The character has one of the following
>       Decomposition_Type values: initial, medial, final,
>       isolated, wide, narrow, or compat

I am very concerned that collapsing initial, medial, and final
together may get us into problems with other language
communities similar to those we have gotten into with Final
Sigma, especially when those communities denote word boundaries
by the appearance of final or initial forms and hence would use
those forms in a style similar to the way "BigCompany" or
"big-company" might be used in ASCII.

As I've said several times before, even if we disallow the
NFKC-affected forms of those characters, if a need arises, we can
(painfully) redefine them as PVALID and allow them.  But, if we
map them to something else, we lose all information about what
was intended/desired and end up in precisely the mess we have
with, e.g., Final Sigma and ZWJ/ZWNJ, in which "the right thing
to do" poses enough compatibility problems to cause opposition
to making changes.
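
The information loss can be seen with a small Python sketch.  The
choice of the Arabic letter BEH presentation forms is only an
illustrative assumption, as is the use of Python's unicodedata module
with plain NFKC standing in for the proposed mapping; nothing here is
taken from the drafts.

```python
import unicodedata

# Isolated, final, initial, and medial presentation forms of ARABIC LETTER BEH.
beh_forms = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

for ch in beh_forms:
    # decomposition() shows the Decomposition_Type tag and the target code point.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->",
          unicodedata.decomposition(ch))

# After NFKC, every positional form has become the same character.
mapped = {unicodedata.normalize("NFKC", ch) for ch in beh_forms}
print(len(mapped), [f"U+{ord(c):04X}" for c in mapped])
```

Four distinct inputs collapse to one output, so nothing in the mapped
string records which positional form was originally written.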

>    5. The character does not have the Script value: Hangul
> 
> The REMAP characters are removed from DISALLOWED, so that the
> TABLES values form a partition (all the values are disjoint).

This strikes me as dangerous -- see below.

> B. Protocols document
> 
> Change sections 4.2.1 and 5.3 so as to require:
> 
>    1. Mapping all REMAP characters according to the Unicode
>       property NFKC_Casefold,
>    2. Then normalizing the result according to NFC.

Making this change to 4.2.1 eliminates the requirement that the
registrant understand _exactly_ what is being registered, i.e.,
that the communication path between the registrant and registry
occur only using U-labels and/or A-labels.  My understanding was
that one of the clearer points of consensus we had reached in
these discussions was that the "no mapping on registration"
restriction was appropriate.  Are you proposing to reopen that
question?

> The rest of the tests for U-Label remain unchanged.

I believe that making this change through the kind of modification
to Tables that you recommend requires either a change to the way
the definition of U-label is stated or abandoning the very clear
concept of a U-label that is completely symmetric with an A-label,
with no information loss in either direction.
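
A rough Python sketch of that symmetry and of what mapping does to it
follows.  The function names are hypothetical, Python's "punycode"
codec stands in for the A-label conversion, and approx_remap is only
an approximation of "NFKC_Casefold, then NFC" (the real Unicode
property also removes default-ignorable code points, which this
sketch does not attempt).

```python
import unicodedata


def approx_remap(label: str) -> str:
    """Approximate steps B1 + B2: NFKC_Casefold, then NFC."""
    folded = unicodedata.normalize("NFKC", label).casefold()
    folded = unicodedata.normalize("NFKC", folded)
    return unicodedata.normalize("NFC", folded)


def to_a_label(u_label: str) -> str:
    """Encode a U-label as an A-label (no validity checks)."""
    return "xn--" + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Decode an A-label back to a U-label (no validity checks)."""
    return a_label[len("xn--"):].encode("ascii").decode("punycode")


u_label = "bücher"
a_label = to_a_label(u_label)
print(a_label, to_u_label(a_label) == u_label)  # symmetric round trip

typed = "BÜCHER"              # input that would need mapping first
mapped = approx_remap(typed)
print(mapped == u_label)                          # mapping reaches the U-label
print(to_u_label(to_a_label(mapped)) == typed)    # the typed form is not recoverable
```

The U-label and A-label convert into each other exactly, whereas the
typed string cannot be reconstructed from either one once it has been
mapped.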

There is also a subtle interaction with Section 5.5: if the
mapping is performed by the time Section 5.3 concludes (or,
under special circumstances, not applied at all), then Section
5.5 must also prohibit REMAP.  

> C. Defs document
> 
>    1. Define REMAP
>    2. Define an M-Label to be one which if remapped according
>       to B1+B2, results in a U-Label.

The idea of an M-Label still makes me uncomfortable.  Again, we
have had that discussion before.

regards,
   john