[Json] Json and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

Andrew Sullivan ajs at anvilwalrusden.com
Thu Jan 22 04:37:45 CET 2015


On Wed, Jan 21, 2015 at 06:58:09PM -0800, Asmus Freytag wrote:

> I would go further, and claim that the notion that "all homographs are
> the same abstract character" is misplaced, if not incorrect.

This does not seem false to me, and actually I'm not sure that it
would be problematic for John either.  

> U+08A1 is not the only character that has a non-decomposable
> homograph, and because the encoding of it wasn't an accident, but
> follows a principle applied by the Unicode Technical Committee, it
> won't, and can't be the last instance of a non-decomposable
> homograph.

I also agree with this, but it appears that it may represent something
problematic for IETF identifiers.  Moreover, "non-decomposable
homograph" is not entirely useful here, because it's not merely the
non-decomposability that is at issue.  There's also the fact that this
particular case (and all the other cases I know of so far) are not
susceptible to the other stability tests and are always in the same
script.  It may well be that this merely reveals the extent to
which I have missed important cases; that I will cheerfully concede.  That
hardly suggests that our identifier system is robust, since if engaged
people like me (who are nevertheless admitted amateurs) are missing
chunks of problems for identifiers, we can hardly expect ordinary
operators to get policies right.
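
To make the point concrete: U+08A1 (ARABIC LETTER BEH WITH HAMZA ABOVE) was
encoded without a canonical decomposition, so no Unicode normalization form
will ever unify it with the visually identical sequence U+0628 + U+0654.  A
minimal check (a sketch in Python, using only the standard unicodedata
module) shows the two spellings survive every normalization form as distinct
strings:

```python
import unicodedata

precomposed = "\u08A1"        # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"     # ARABIC LETTER BEH + combining HAMZA ABOVE

# U+08A1 has no canonical decomposition, so normalization never
# maps either spelling onto the other -- they remain homographs
# that compare unequal in every form.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    a = unicodedata.normalize(form, precomposed)
    b = unicodedata.normalize(form, sequence)
    print(form, a == b)   # False in all four forms
```

Any identifier system that relies on normalization alone to collapse
confusable spellings will therefore treat these as two different labels.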

> it appears not to be due to a "breakdown" of the encoding process
> and also does not constitute a break of any encoding stability
> promises by the Unicode Consortium.

I will not speak for anyone else, but any worries I have are not
directed at UTC.  My worry is much worse: that we're asking Unicode to
provide something that nobody can, especially when the full generality
of the goal of Unicode is taken into consideration.

Unicode has a really hard set of problems to solve.  I don't think
anyone is intentionally suggesting, "Oh, those clowns at Unicode laid
an egg."  (If they are, then I'll say I think that's setting the bar
unreasonably high, and is quite unfair.)  But I do think that this new
character highlights a bunch of issues that are super important for
identifiers, especially when those identifiers are wandering around
without locale clues.  I think for the sake of the Internet we must
all worry about the implications of that.


Andrew Sullivan
ajs at anvilwalrusden.com
