IDNA and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

Wed Jan 21 22:27:05 CET 2015

On Wed, Jan 21, 2015 at 03:04:22PM -0500, John C Klensin wrote:
> (JSON list removed, per Paul Hoffman's request)

[Let's also drop JSON from the subject while we're at it, and add IDNA.]

> On the other hand, the Unicode Standard justifies all of this
> on the basis of phonetic differences between U+08A1 and the
> U+0628 U+0654 sequence.  See the I-D and the sections of the

Yes, I see now.

One wonders how Arabic readers and writers handled this _before_ modern
computing.  My guess is: they determined the phonetic difference here
from _context_, not from how the character was written (otherwise these
two would simply be different characters).  Which, if I'm right, would
then lead to asking "why make the distinction that the UC chose to
make?"

> Unicode Standard that it cites for more information, but note
> that most of the reasons for which the IETF is interested in
> characters in identifiers are not associated with enough
> information to determine either language or phonetics.

Indeed.  Though where does that leave speech synthesizers trying to
pronounce IDNs?  For accessibility we ought to say something about that,
but what, I don't know.

Considering the lack of language information in any IDN, one reasonable
thing for speech synthesizers to do would be to make a best-effort
conversion of the IDN to the script of the user's locale's language
(replacing foreign characters with spellings of their pronunciations)
and then pronounce it, and here the phonetic difference between U+08A1
and U+0628 U+0654 could matter.  Maybe current speech synthesizer tech
would pronounce both identically or nearly so, but maybe not.

But again: if traditionally the difference between these two codepoint
sequences was determined from context, then the speech synthesizer could
do the same.

Where users aren't listening to the [possibly synthesized] pronunciation
of these two characters, they won't know how to distinguish them
visually unless context matters, or unless the fonts in use do
distinguish them (which would seem strange if pre-computer Arabic
writers didn't).

Treating these two codepoint sequences as distinct under normalization
would seem to be harmful if historically their renderings weren't
visually distinct as written.
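
To make the normalization point concrete, here is a minimal Python
sketch (mine, not from the I-D) that checks whether U+08A1 and the
U+0628 U+0654 sequence compare equal under each of the four Unicode
normalization forms.  For contrast it runs the same check on U+0623
versus U+0627 U+0654, which does have a canonical decomposition.  The
helper names are made up for illustration, and the result depends on
the Unicode data bundled with the Python build in use.

```python
import unicodedata

# The pair under discussion: BEH WITH HAMZA ABOVE vs. BEH + HAMZA ABOVE.
BEH_HAMZA_PRECOMPOSED = "\u08A1"
BEH_HAMZA_SEQUENCE = "\u0628\u0654"

# A contrasting pair: ALEF WITH HAMZA ABOVE vs. ALEF + HAMZA ABOVE.
ALEF_HAMZA_PRECOMPOSED = "\u0623"
ALEF_HAMZA_SEQUENCE = "\u0627\u0654"

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def equal_under(form, a, b):
    """Return True if a and b are identical after normalizing to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def compare(a, b):
    """Map each normalization form to whether a and b match under it."""
    return {form: equal_under(form, a, b) for form in FORMS}


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    print("BEH pair: ", compare(BEH_HAMZA_PRECOMPOSED, BEH_HAMZA_SEQUENCE))
    print("ALEF pair:", compare(ALEF_HAMZA_PRECOMPOSED, ALEF_HAMZA_SEQUENCE))
```

The BEH pair stays distinct under every form while the ALEF pair
collapses to the same string, so an identifier comparison that relies
on normalization alone treats the two BEH spellings as different
labels.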

And then there are input methods.  How will Arabic writers input these
two characters?  Will they bother to?  If the input methods make no
distinction, why should Unicode?  (This could be a cart-before-the-horse
question, I know; feel free to ignore it.)

Nico
--