I-D Action:draft-ietf-idnabis-mappings-00.txt

John C Klensin klensin at jck.com
Sun May 31 06:13:25 CEST 2009



--On Saturday, May 30, 2009 16:02 -0700 Mark Davis
<mark at macchiato.com> wrote:

>...
> How would the example you cite be different from my seeing a
> box standing for a Malayalam character, where I have no
> Malayalam font? I don't think that you've given any concrete
> cases -- so at this point, you have not established that it is
> a practical problem, nor which characters would be at issue,
> nor what the magnitude is. So I think it is, at this point,
> pretty much theoretical.

Mark,

It differs in two ways:

(1) Mapping or no mapping, the IDNA2008 model still says
"inclusion", so I believe that the onus lies on those who think a
particular set of characters should be included to demonstrate
why that is necessary and appropriate.  

(2) The fact that we cannot and should not try to exclude every
problematic case, and every case for which it is unlikely that
some users will have fonts available, doesn't mean that we should
not take out the obvious cases.  In the Malayalam case, the
characters (or at least most of them) are not compatibility
characters, are needed to represent a language, and presumably
are relatively easy to type in areas where that language and
writing system are common and there are keyboards to match.  It
may still be appropriate for user interfaces to do special
things in areas where those things are not the case, but that is
a different matter.

In the case of the Mathematical characters I used as examples,
"Mathematical" is not a language, the characters are
compatibility characters rather than base/target characters,
fonts are often unavailable for those characters even when fonts
for the base characters are available, and they are hard to type
except in very unusual environments.   That combination seems
very different to me.
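
As a concrete illustration of that difference, here is a minimal
sketch in Python using the standard unicodedata module (the
specific code points are just illustrative choices of mine):
MATHEMATICAL BOLD CAPITAL A carries a compatibility decomposition
and NFKC folds it onto plain LATIN CAPITAL LETTER A, while
MALAYALAM LETTER KA has no compatibility decomposition and
survives normalization unchanged.

    import unicodedata

    math_bold_a = "\U0001D400"   # MATHEMATICAL BOLD CAPITAL A
    malayalam_ka = "\u0D15"      # MALAYALAM LETTER KA

    # NFKC folds the compatibility character onto its Latin base...
    print(unicodedata.normalize("NFKC", math_bold_a))                   # 'A'
    # ...but leaves the Malayalam letter alone.
    print(unicodedata.normalize("NFKC", malayalam_ka) == malayalam_ka)  # True

    # The difference is visible in the characters' own decomposition data.
    print(unicodedata.decomposition(math_bold_a))    # '<font> 0041'
    print(unicodedata.decomposition(malayalam_ka))   # ''  (none)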

> It's been on my plate for some time to do a thorough analysis.
> (That plate's been pretty full lately.) I agree that if some
> set of characters simply are not in current use that there is
> no need to map them for compatibility; but we have to be very
> sure of what that set actually is. Your assumption has been
> case mapping + width, but we need actual data.

Actually, "my assumption" is that we've seen logic and strong
arguments based on local usage, as well as data, to support
those two cases.  For the other compatibility characters, I
have so far seen neither, apart from the argument that NFKC does
it.  I also note that Erik's reported data seem to indicate
that, with the possible exception of those two cases, the
tendency is increasingly to see A-labels and U-labels and not
these variant forms.
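
For anyone less familiar with the terminology: a U-label is the
native Unicode form of a label and the A-label is its Punycode
("xn--") form.  A minimal sketch in Python, whose built-in "idna"
codec implements the IDNA2003 rules (the example name is just an
illustration):

    # U-label -> A-label and back, using Python's built-in IDNA2003 codec.
    print("bücher".encode("idna"))           # b'xn--bcher-kva'
    print(b"xn--bcher-kva".decode("idna"))   # 'bücher'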

Finally, I'm concerned about one additional case, even though
the concern is a matter of reasoning, not data.  It appears to
me that, with the exception of the Asian width variations, the
vast majority of compatibility characters are in Unicode because
someone thought they represented an important distinction from
the base character but there was consensus that they were not
very different or not different most of the time.  We've got
"Mathematical" characters because a portion of the mathematical
community insisted that they be given separate code points
because the distinction was important to them, but they are
classified as compatibility characters, with mappings onto the
Latin base characters, because they usually are not different.
But, for the others, suppose someone comes along someday and
makes a convincing argument that the characters in a given
collection really are separate from the base characters, i.e.,
that treating them as compatibility characters is (at least for
IDN purposes) incorrect.

If we have DISALLOWED those characters, then we have the rather
difficult problem of changing their category from DISALLOWED to
PVALID.   We have agreed that is hard and that it should be
hard.  But it is possible.   On the other hand, if we have
mapped the character into something else, we would be faced with
exactly the problem we have with Eszett: inability to tell
whether a registered object started out in base form or one of
the variant alternatives.  Of the two, I think the former is by
far the better approach... but it depends on our not mapping
anything we don't have to or, put differently, on confining the
mappings to the important and obvious cases, rather than setting
it up as an opportunity for a few people to show how far along
they are.
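
The Eszett problem mentioned above can be seen in a couple of
lines.  A minimal sketch in Python, using str.casefold(), which
applies essentially the same fold that the IDNA2003 mapping step
applied to U+00DF (the names are hypothetical): once the mapping
has been applied, a registered label gives no indication of
whether it started life with "ß" or with "ss".

    registered_with_eszett = "straße"    # hypothetical registrant input
    registered_with_ss     = "strasse"   # the base-character spelling

    # Both fold to the same string, so the original form cannot be recovered.
    print(registered_with_eszett.casefold())   # 'strasse'
    print(registered_with_eszett.casefold() ==
          registered_with_ss.casefold())       # True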

   john


