[Json] Json and U+08A1 and related cases

"Martin J. Dürst" duerst at it.aoyama.ac.jp
Fri Jan 23 10:14:16 CET 2015

Hello Asmus,

On 2015/01/22 11:58, Asmus Freytag wrote:

> I would go further, and claim that the notion that "*all homographs are
> the same abstract character*" is *misplaced, if not incorrect*.

That's fine. Nobody would claim that 8 (U+0038) and ৪ (Bengali 4, 
U+09EA) are the same abstract character. (How 'homographic' they look 
will depend on what fonts your mail user agent uses :-)
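
To make that concrete, here is a small Python sketch using the standard
unicodedata module. The helper name describe is made up for this
illustration. It only shows that the two digits are separate code points
with different names and numeric values, whatever the glyphs look like in
a given font.

```python
import unicodedata

def describe(ch):
    """Return (code point, character name, digit value) for one digit character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.digit(ch))

latin_eight = "\u0038"    # 8
bengali_four = "\u09EA"   # Bengali digit four, which can look like a Latin 8

print(describe(latin_eight))
print(describe(bengali_four))
```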

> U+08A1 is not the only character that has a non-decomposable homograph,
> and because the encoding of it wasn't an accident, but follows a
> principle applied by the Unicode Technical Committee, it won't, and
> can't be the last instance of a non-decomposable homograph.
> The "failure of U+08A1 to have a (non-identity) decomposition", while it
> perhaps complicates the design of a system of robust mnemonic
> identifiers (such as IDNs) it appears not to be due to a "breakdown" of
> the encoding process and also does not constitute a break of any
> encoding stability promises by the Unicode Consortium.
> Rather, it represents reasoned, and principled judgment of what is or
> isn't the "same abstract character". That judgment has to be made
> somewhere in the process, and the bodies responsible for character
> encoding get to make the determination.

While I can agree with this characterization, many judgements on
character encoding are by their very nature borderline, and U+08A1
is definitely borderline in many respects. What I hope is that the
Unicode Technical Committee, when making similar decisions in the
future, draws the line a bit more in support of applications such
as identifiers, and a bit less in favor of splitting. I also hope that
it realizes that when principles lead to more and more homograph
encodings, it may very well pay off to reexamine some of these
principles before going further down a slippery slope.
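
For readers who want to see what the missing decomposition means for
identifier comparison, the following is a minimal Python sketch, assuming
a Python build whose unicodedata tables include Unicode 7.0 or later. The
function name unified_by_normalization and the variable names are
invented for this example. It contrasts U+08A1 with U+0623, which does
have a canonical decomposition to U+0627 U+0654.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def unified_by_normalization(a, b):
    """Return the normalization forms under which strings a and b become identical."""
    return [f for f in FORMS
            if unicodedata.normalize(f, a) == unicodedata.normalize(f, b)]

# BEH WITH HAMZA ABOVE as one code point vs. BEH followed by combining HAMZA ABOVE.
beh_hamza = "\u08A1"
beh_plus_hamza = "\u0628\u0654"

# ALEF WITH HAMZA ABOVE as one code point vs. ALEF followed by combining HAMZA ABOVE.
alef_hamza = "\u0623"
alef_plus_hamza = "\u0627\u0654"

# U+08A1 has no decomposition mapping, so no normalization form unifies the pair.
print(repr(unicodedata.decomposition(beh_hamza)))
print(unified_by_normalization(beh_hamza, beh_plus_hamza))

# U+0623 decomposes canonically, so every normalization form unifies the pair.
print(unicodedata.decomposition(alef_hamza))
print(unified_by_normalization(alef_hamza, alef_plus_hamza))
```

An identifier system that relies on normalization alone would therefore
treat the first pair as two different labels, which is why such cases
need separate handling at the identifier level.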

Regards,   Martin.

More information about the Idna-update mailing list