[Json] Json and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))
nico at cryptonector.com
Thu Jan 22 05:39:14 CET 2015
We should treat U+08A1 as confusable with U+0628 U+0654, advise
registrars to disallow it, and otherwise let IDNA treat the two as
distinct because Unicode does.
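To make that last point concrete, here is a small Python sketch (using only the standard unicodedata module; it needs a Unicode 7.0-or-later character database, i.e. roughly Python 3.5+) showing that no normalization form unifies U+08A1 with the U+0628 U+0654 sequence, which is why IDNA inherits the distinction:

```python
import unicodedata

# U+08A1 (added in Unicode 7.0) carries no canonical or compatibility
# decomposition, so normalization never unifies it with the visually
# identical two-code-point sequence.
precomposed = "\u08A1"       # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"    # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    same = (unicodedata.normalize(form, precomposed)
            == unicodedata.normalize(form, sequence))
    print(form, same)        # False for every form
```

Since normalization won't collapse the two spellings, any unification has to happen at the policy layer (confusable tables, registry variants), not in the protocol.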
On Wed, Jan 21, 2015 at 06:58:09PM -0800, Asmus Freytag wrote:
> On 1/21/2015 1:31 PM, Nico Williams wrote:
> >On Wed, Jan 21, 2015 at 03:33:12PM -0500, cowan at ccil.org wrote:
> >>John C Klensin scripsit:
> Asserting, to the contrary, that there should be a principle that
> requires that all
> homographs are the same abstract character, would mean to base encoding
No one made that assertion. (I trimmed the quotes, but they're in the
archive; readers can go look for themselves.)
But I am curious as to how people writing in Arabic make this
distinction when writing with pen and paper. And if they don't, why
that distinction should be made in Unicode (I can think of good
reasons). (I'm NOT saying that there shouldn't be such a distinction,
just curious as to why there is one.) Unicode 7.0 doesn't answer this
question. I doubt that many here know, and it will be just fine if I
never get an answer.
> decisions entirely on the shape, or appearance, of characters and code point
> sequences. Under that logic, TAMIL LETTER KA and TAMIL DIGIT ONE would be the
> same abstract character, and a (non-identity) decomposition would be
> That's just not how it works.
Clearly, similar letters from different scripts should get different
code points, confusables be damned. I think no one _today_ will argue otherwise.
> That said, Unicode is generally (and correctly) reluctant to encode
> One of the earliest and most ardently requested changes was the proposed
> separation of "period" and "decimal point". It got rejected, and it was not
> the only one. Where homographs are encoded, they generally follow certain
We have enough periods (and spaces, and...). It's nice to know we have
one fewer than we could have ended up with.
> principles. And while these principles will, over time, lead to the encoding
> of a few more homographs, they, in turn, keep things predictable.
> From my understanding, the case in question fully follows these principles
> as they are applicable to the encoding of characters for the Arabic script.
> >Should we treat all of these as confusables?
> Yes, that's the obvious way to handle them. If you have zones that support
> the concept of (blocked) variants, you can go further and make them that,
> which has the effect of making them confusables that are declared up front
> as such in the policy, not "discovered" in later steps of string
> review and analysis.
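The blocked-variant idea above can be sketched in a few lines. This is a toy model only: the confusable map below is hand-rolled for illustration (a real zone policy would derive it from UTS #39's confusables.txt or from registry variant tables), and the `Registry` class is a hypothetical stand-in for real registration machinery:

```python
import unicodedata

# Illustrative confusable folding, NOT real UTS #39 data: fold the
# precomposed letter onto its homograph sequence so both spellings
# map to one "skeleton" key.
CONFUSABLE_MAP = {
    "\u08A1": "\u0628\u0654",  # BEH WITH HAMZA ABOVE -> BEH + HAMZA ABOVE
}

def skeleton(label: str) -> str:
    """Normalize, then fold each code point through the confusable map."""
    label = unicodedata.normalize("NFC", label)
    return "".join(CONFUSABLE_MAP.get(ch, ch) for ch in label)

class Registry:
    """Hypothetical zone registry that declares variants blocked up front."""

    def __init__(self):
        self._by_skeleton = {}

    def register(self, label: str) -> bool:
        key = skeleton(label)
        if key in self._by_skeleton:
            return False  # blocked variant: collision declared in policy
        self._by_skeleton[key] = label
        return True

r = Registry()
print(r.register("\u0628\u0654"))  # True: first spelling is registered
print(r.register("\u08A1"))        # False: confusable variant is blocked
```

The point of doing this at registration time is exactly what the quoted text says: the collision is declared in the policy up front rather than "discovered" during later string review.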