IDNA and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

John C Klensin john-ietf at
Sun Jan 25 05:04:53 CET 2015

(I started to respond to this on Wednesday.  Because of
interruptions from other parts of these issues, it has taken me
until now to finish it.  I've left the parts I wrote four days
ago intact even though people have at least partially responded
to some of them.)

--On Wednesday, January 21, 2015 15:27 -0600 Nico Williams
<nico at> wrote:

> On Wed, Jan 21, 2015 at 03:04:22PM -0500, John C Klensin wrote:
>> (JSON list removed, per Paul Hoffman's request)
> [Let's also drop JSON from the subject while we're at it, and
> add IDNA.]
>> On the other hand, the Unicode Standard justifies all of this
>> on the basis of phonetic differences between U+08A1 and the
>> U+0628 U+0654 sequence.  See the I-D and the sections of the
> Yes, I see now.
> One wonders how Arabic readers and writers handled this
> _before_ modern computing.  My guess is: they determined the
> phonetic difference here from _context_, not from how the
> character was written (otherwise these two would simply be
> different characters).  Which, if I'm right, would then lead
> to asking "why make the distinction that the UC chose to make?"

The various members of the UTC who seem to be responding to this
thread are presumably qualified to answer that question; I am
not.

All I can see is:

(1) Some very strong assertions, backed by examples, that we
should not rely on protocols or procedures that assume NFC (or
other Unicode normalization or stable properties) is sufficient
to identify the relationships among different ways to form
character-image-glyphs (deliberately invented term) that (i) are
the same except for the way they are drawn, i.e., consisting of,
or can be generated by, a base form and one or more combining
characters, or of several combining sequences, and (ii) depend on
base (or precombined) characters from the same script and
combining characters that are either from that script or that
are script-independent.   

(2) The assertion that Unicode Normalization cannot be depended
upon to serve that function is in direct contradiction to claims
that were made during the development of IDNA (and various other
things including, e.g., SASLPrep) that Normalization was both
adequate to that function and that the function of rationalizing
different ways to code "the same character" so they could be
compared was the main purpose of having standardized
normalization mechanisms.
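A quick way to see that contradiction concretely (my illustration,
not part of the original discussion; Python's unicodedata module
implements the standard Unicode normalization forms):

```python
import unicodedata

precomposed = "\u08A1"     # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"  # ARABIC LETTER BEH + COMBINING HAMZA ABOVE

# U+08A1 was assigned no canonical decomposition, so no normalization
# form maps the two spellings onto each other.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    a = unicodedata.normalize(form, precomposed)
    b = unicodedata.normalize(form, sequence)
    print(form, a == b)   # False for every form
```

Every form leaves the two strings unequal, which is exactly the
comparison failure described above.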

(3) It is extremely difficult to find a good (and stable)
vocabulary in which to have this discussion or ones like it.  If
one talks about differences or similarities among visual forms
of characters, someone can always point out that it is possible
to design fonts (or even page layout styles) that will make
visual distinctions that are not typically present (and then to
debate "typical").    If one talks about "abstract" characters
and whether two of them are the same or different, someone can
always point out a different abstraction, or set of rules for
forming abstractions, that makes two sequences "the same" that
other people consider different or two sequences different that
others consider the same (some of the comments in these threads
could be interpreted by a cynic as "two strings are, or are not,
the same abstract character because we say it is (or is not)").
We have already accepted part of that issue when we say "within
the same script" because script boundaries are inevitably a bit
arbitrary (or based on criteria for which there is not universal
agreement) and some characters were copied or inherited between
scripts we now consider separate centuries ago (e.g., from Greek
to Latin more than 2000 years ago) and are therefore abstractly
"the same" except when script is considered.

(4) Some of the notes and suggestions lead directly to
conflating these issues of "same character within a single
script" with "confusable characters" (depending on definitions,
the latter is often extremely subjective and dependent on type
and print styles as well as user expectations).  Basing protocol
requirements (or protocol equivalence) on confusable characters
or anything equivalent to them would appear to depend on
evaluation panels considering Unicode code points one at a time
in comparison with all other Unicode characters or a significant
subset of them.  That approach was rejected by the IETF even
before the work that led to Stringprep (RFC 3454) began.  If
rejecting that approach is actually untenable because
character-by-character evaluation --rather than depending on
Unicode blocks (which we were strongly discouraged from using in
IDNA2008), normalization, and other properties-- is required,
then there is a case to be made that almost all IETF i18n work
that requires comparison among non-ASCII characters is flawed.

(5)  We've been told that these non-decomposing characters are
really not an issue for a number of reasons.  As far as I can
tell, all of those reasons amount to assumptions that readers
and users of the characters will either have knowledge of the
language involved or sufficient context from which language or
equivalent information can be deduced (see below for more on
this).   That is not the case for many (or most) of the
potentially non-ASCII identifiers the IETF has to deal with.
They, and related strings such as passwords, are typically short
and come without language context.  Many are not "words" in any
language that can be looked up in a dictionary or by regular
expression to guess at which characters were intended when
different encodings are possible.  

(6) As pointed out in draft-klensin-idna-5892upd-unicode70 (even
more details in -04 if you want to wait), there are what appear
to be clear rules in The Unicode Standard and various Annexes
and stability statements about when new characters will be
added.  Regardless of the history of changes before those
policies were published (e.g., U+1D92 appears to have been added
with Unicode 4.1 and, although I haven't checked carefully, the
non-decomposing Arabic forms other than U+08A1 are apparently
even older), the addition of new characters (or properties of
new characters) that appear to lie outside those rules on the
basis of Unicode Consortium decisions, perhaps subjective ones,
that rules do not provide for is troubling.  It is troubling for
three separate reasons:

* Before or at the time the stability rules were introduced, the
normalization rules could have been adjusted to make the older
characters conform.  One can make arguments for or against that.

* We were told when IDNA2008 was under consideration that no
such cases existed and that we could depend on normalization to
resolve the issues.  Had we been told that there were some (or
many) of these non-decomposing characters, it would have been
fairly easy to add contextual rules (or an entirely new rule) to
deal with them, even if they had to be listed by code point,
or to beg UTC to add a well-maintained "non-decomposing"
property that could then be used in an appropriate IDNA rule.
Of course, I don't know right now whether a "non-decomposing"
property would be feasible to define adequately and/or to
maintain.
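For what it's worth, a crude approximation of such a property can be
computed today from existing data.  This is only my sketch:
"non-decomposing" is not a defined Unicode property, and the
name-based heuristic below is my own invention:

```python
import unicodedata

def non_decomposing_with_forms(start=0x0600, end=0x08FF):
    # Heuristic: characters whose names read like base-plus-mark forms
    # ("... WITH ...") but that carry no canonical (or compatibility)
    # decomposition at all.
    hits = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")
        if " WITH " in name and unicodedata.decomposition(ch) == "":
            hits.append((f"U+{cp:04X}", name))
    return hits

# Over the Arabic blocks, U+08A1 shows up along with the older
# non-decomposing forms mentioned above; U+0622 and friends, which do
# decompose, are correctly excluded.
for cp, name in non_decomposing_with_forms():
    print(cp, name)
```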

>> Unicode Standard that it cites for more information, but note
>> that most of the reasons for which the IETF is interested in
>> characters in identifiers are not associated with enough
>> information to determine either language or phonetics.
> Indeed.  Though where does that leave speech synthesizers
> trying to pronounce IDNs?  For accessibility we ought to say
> something about that, but what, I don't know.

Having hung around with people doing speech synthesis for some
years, I can say that one has to know the language to do much of
anything competent, even with robotic-sounding voices.  So I think this is
inherently a presentation issue and one that mostly affects text
and not identifiers.

> Where users aren't reading (listening to) the [possibly
> synthesized] pronunciation of these two characters, they
> won't know how to distinguish them visually unless context
> matters, or unless the fonts in use do distinguish them (which
> would seem strange if pre-computers Arabic writers didn't).

Note that the same comments would apply to "ö" in its German
and Swedish uses, whether recorded as a precomposed character or
a combining sequence.  I really don't see how the above helps us
understand this problem.
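To make the contrast concrete (again my illustration, using Python's
unicodedata): the two encodings of "ö" are unified by normalization,
which is why that case is harmless in a way the U+08A1 case is not:

```python
import unicodedata

# Both spellings of "ö" normalize to the same string under NFC ...
assert unicodedata.normalize("NFC", "o\u0308") == "\u00F6"
# ... and under NFD, so comparison after normalization succeeds.
assert unicodedata.normalize("NFD", "\u00F6") == "o\u0308"

# No analogous unification exists for U+08A1 vs. U+0628 U+0654.
assert unicodedata.normalize("NFC", "\u0628\u0654") != "\u08A1"
```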

> Distinguishing these two codepoint sequences as to
> normalization would seem to be harmful if historically their
> "renderings" weren't visually distinct as written.

> And then there's input methods.  How will Arabic writers input
> these two characters??  Will they bother to?  If the input
> methods make no distinction, why should Unicode?  (This could
> be a cart-before-the-horse question, I know; feel free to
> ignore it.)

This is where you need to be extremely careful, again, about the
question you are asking.  If you are not, the question may be
either irrelevant or out of scope.  First, for the particular
Fula case, remember that that language is normally written in Latin
characters, making this a non-issue.  If, by "Arabic writer" you
mean someone who normally writes the Arabic Language, using an
Arabic keyboard and input method, they are unlikely (at best) to
have a precomposed BEH WITH HAMZA ABOVE on that keyboard, so
will enter the combining sequence even when typing Fula.  An
input method, if it knew Fula was involved, might map
what was keyed in to U+08A1, but now we are no longer talking
about what the user/writer is doing.  By contrast, getting that
character/code point entered directly would require not only a
Fula-sensitive input method with knowledge that Fula was being
typed but also a Fula-specific keyboard.    I'm just not sure that
line of discussion is helpful.

