I-D Action:draft-ietf-idnabis-mappings-00.txt

Tue Jun 30 11:07:05 CEST 2009

--On Monday, June 29, 2009 12:46 -0700 Mark Davis ⌛
<mark at macchiato.com> wrote:

>...
>> > Now, my position is still that the simplest and most
>> > compatible option open to us is to simply map with NFKC +
>> > Casefold.
>> 
>> I continue to believe that CaseFold is a showstopper.  When
>> its results are not identical to those produced by LowerCase,
>> it produces results that are astonishing to some users and
>> leads us into the "is that a separate character or not" trap
>> that we've seen manifested at least twice.  I note that TUS
>> recommends against its use for mapping (as distinct from
>> comparison) and appears to do so for just the reason that it
>> involves too much information loss.

> You need to provide actual data behind this. Please list
> exactly the characters that you mean, and why you think they
> are problematic. Note also that the formulation that I gave
> means that any character that is PVALID would automatically be
> excluded, eg if final-sigma is PVALID then it is unaffected.
> And we can certainly introduce other exceptions.

I don't like operating by exception when it can be avoided (an
argument you have made as well).  Getting into situations in
which exceptions are required is not advantageous if we are
trying to be as Unicode version independent as possible.   I
also prefer operations that casual users understand and believe
they do (e.g., a Lower Case operation is fairly comprehensible
to any user of a script that supports case distinctions, while
CaseFolding is dependent on Unicode coding decisions).   Final
Sigma and Eszett are, indeed, the current examples but it
continues to appear that LowerCase is both necessary and
sufficient and that, if transformations it does not cover are
needed, it is they that should be handled by exception.
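
To make the LowerCase/CaseFold difference concrete, here is a minimal sketch in Python. It assumes that Python's built-in str.lower() and str.casefold() are acceptable stand-ins for the Unicode lowercase and full case-folding operations, and that NFKC is applied first, as in the proposal quoted above. The function names lower_map and casefold_map are hypothetical, chosen for illustration only; neither comes from any IDNA draft.

```python
import unicodedata


def lower_map(label: str) -> str:
    """Hypothetical mapping: NFKC, then lowercasing."""
    return unicodedata.normalize("NFKC", label).lower()


def casefold_map(label: str) -> str:
    """Hypothetical mapping: NFKC, then full case folding."""
    return unicodedata.normalize("NFKC", label).casefold()


ESZETT = "\u00df"        # LATIN SMALL LETTER SHARP S
FINAL_SIGMA = "\u03c2"   # GREEK SMALL LETTER FINAL SIGMA
SIGMA = "\u03c3"         # GREEK SMALL LETTER SIGMA

# Lowercasing leaves both characters alone, since they are already lowercase.
print(lower_map("stra" + ESZETT + "e") == "stra" + ESZETT + "e")  # True
print(lower_map(FINAL_SIGMA) == FINAL_SIGMA)                      # True

# Case folding rewrites both of them.
print(casefold_map("stra" + ESZETT + "e"))       # strasse
print(casefold_map(FINAL_SIGMA) == SIGMA)        # True

# Consequence: two labels that a user may consider distinct collapse to one.
print(casefold_map("stra" + ESZETT + "e") == casefold_map("strasse"))  # True
```

For these two characters the lowercase-based mapping is the identity, while the folding-based mapping merges them with other characters, which is the information loss at issue.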

> And I know full well about the issues in TUS, having written
> or participated in the writing of them.

I assumed, given your role in the case-handling material, that
you had written it.  What I'm having trouble understanding is
why, given the perfectly logical (at least to me) explanation of
why CaseFold should be used for matching only that appears
there, you keep wanting to use it as a mandatory mapping
operation here.

A different way of looking at this is that I'm trying to resist
mapping transformations that non-expert users believe lose
significant information unless they can be demonstrated to be
really important.   Whatever can be said for, e.g., the
FinalSigma -> Lower Case Sigma transformation and the tradeoffs
between information-preservation and IDNA2003 compatibility, it
seems to be generally understood that the transformation is
information-losing. From that perspective, and the related
perspective of minimizing complexity by choosing simpler
operations rather than more complicated ones and not performing
mappings that are not justified by real-world usage, it seems to
me that it is you who need to make the case for operations that
lose more information, for more complexity, and for mapping of
more characters.

>> ...
> You make it sound like final sigma, ZWJ/NJ, eszett and the
> other cases under discussion were oversights in the process of
> developing the current IDNA. That wasn't the case; these were
> deliberate choices made at the time. A case mapping is also a
> 'loss of information', but one that people clearly want.

Taking the last as an example, I think "a case mapping" was a
deliberate choice, one that I supported at the time and, given
the assumptions behind IDNA2003, would support again.  I do
not believe it is plausible to argue that a majority of the
participants in the original IDNA WG, much less in the IETF,
understood the implications of the differences between case
folding and lower case mapping well enough to have exercised
informed consent, much less to have made a "deliberate choice".
Instead, they were informed by experts, yourself included, that
toCaseFold was the correct explanation and went along with it
despite some concerns about individual characters (which most of
the participants did not understand either).

Obviously one can have both "not an oversight" and "insufficient
understanding to have informed consent", so we are not
necessarily disagreeing.

>...

>> > The rest of the tests for U-Label remain unchanged.
>> 
>> I believe that doing this by the type of change to Tables that
>> you recommend either requires a change to the way that the
>> definition of U-label is stated or requires us to abandon the
>> very clear concept of a U-label that is completely symmetric,
>> with no information loss in either direction, with an A-label.

> I don't see why you would think that.  A U-Label remains just
> the way it is, and has a 1-1 relation with an A-Label. The
> only difference is that we have an additional category of
> M-Label; one that is not a U-Label but maps to one.

At a minimum, the already-complicated pictures in Defs will
require redrawing, which was not mentioned in your list.  But,
independent of that bit of work, I still wish we could avoid
introducing yet another label category.
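
The symmetry point can be sketched as follows, again in Python. This assumes the standard library's "punycode" codec plus the "xn--" prefix is an adequate stand-in for the A-label conversion; it performs none of the validity tests that a real U-label must pass and does not handle all-ASCII labels. The helper names to_a_label, to_u_label, and map_label are hypothetical, and map_label uses plain lowercasing only as a placeholder for whatever mapping is eventually chosen.

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Encode a (presumed valid) U-label as an A-label. No validation is done."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Decode an A-label back to its U-label. No validation is done."""
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def map_label(label: str) -> str:
    """Placeholder mapping step (lowercasing only), for illustration."""
    return label.lower()


u = "b\u00fccher"
a = to_a_label(u)
print(a)                      # xn--bcher-kva
print(to_u_label(a) == u)     # True: U-label <-> A-label is lossless

# A label that needs mapping (an "M-label" in the terminology above)
# reaches the same A-label but cannot be recovered from it.
m = "B\u00fccher"
print(to_a_label(map_label(m)) == a)              # True
print(to_u_label(to_a_label(map_label(m))) == m)  # False
```

The round trip holds only between the U-label and the A-label; anything that goes through the mapping step is many-to-one, which is why the definitions and pictures would need to distinguish the two cases.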

>...

    john