[Json] Json and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

Thu Jan 22 08:19:30 CET 2015

On 1/21/2015 8:39 PM, Nico Williams wrote:
> We should treat U+08A1 as confusable with U+0628 U+0654, advise
> registrars to disallow it, and otherwise let IDNA treat the two as
> distinct because Unicode does.

Agreed.
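
To make that concrete, here is a minimal sketch, assuming Python's standard
unicodedata module. It checks that no normalization form unifies U+08A1 with
U+0628 U+0654, and shows how a registrar-side confusable mapping could still
fold the two together. The CONFUSABLE_MAP table and the skeleton() helper are
hypothetical names for illustration; they are not part of IDNA or of any
Unicode data file.

```python
import unicodedata

BEH_HAMZA = "\u08A1"             # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_PLUS_HAMZA = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

def normalization_equivalent(a, b):
    """True if any of the four Unicode normalization forms maps a and b together."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return any(
        unicodedata.normalize(f, a) == unicodedata.normalize(f, b) for f in forms
    )

# Hypothetical registrar-side table: fold the precomposed letter to the sequence.
CONFUSABLE_MAP = {BEH_HAMZA: BEH_PLUS_HAMZA}

def skeleton(label):
    """Comparison key for a label: NFD, then apply the confusable table."""
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(CONFUSABLE_MAP.get(ch, ch) for ch in decomposed)

if __name__ == "__main__":
    print(normalization_equivalent(BEH_HAMZA, BEH_PLUS_HAMZA))
    print(skeleton(BEH_HAMZA) == skeleton(BEH_PLUS_HAMZA))
```

The point of the split is that the protocol keeps the two distinct (as
normalization does), while the policy layer treats them as colliding.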
>
> On Wed, Jan 21, 2015 at 06:58:09PM -0800, Asmus Freytag wrote:
>> On 1/21/2015 1:31 PM, Nico Williams wrote:
>>> On Wed, Jan 21, 2015 at 03:33:12PM -0500, cowan at ccil.org wrote:
>>>> John C Klensin scripsit:
>>> [...]
>> Asserting, to the contrary, that there should be a principle that
>> requires that all
>> homographs are the same abstract character, would mean to base encoding
>> [...]
> No one made that assertion.  (I trimmed the quotes, but they're in the
> archive; readers can go look for themselves.)

I overstated that a bit to make a point.
>
> But I am curious as to how people writing in Arabic make this
> distinction when writing with pen and paper.  And if they don't, why
> that distinction should be made in Unicode (I can think of good
> reasons).

How can people distinguish TAMIL LETTER KA and TAMIL DIGIT ONE in
hot-metal typography?

They don't. But to get both number processing and sorting to happen in a 
sane fashion, the encoding has to respect that both are part of their 
own respective sequences that do not normally overlap.

In other words (and that is very much part of what I am driving at) 
display - or identifier uniqueness - is not the only constraint Unicode 
faces in its design.
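
A small illustration of the point, assuming Python's unicodedata module and
taking the two characters to be U+0B95 and U+0BE7: the properties that number
processing relies on differ, even though the glyphs may not. The describe()
helper is just a name picked for this sketch.

```python
import unicodedata

TAMIL_KA = "\u0B95"   # TAMIL LETTER KA
TAMIL_ONE = "\u0BE7"  # TAMIL DIGIT ONE

def describe(ch):
    """Return (name, general category, decimal digit value or None)."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.digit(ch, None),
    )

if __name__ == "__main__":
    for ch in (TAMIL_KA, TAMIL_ONE):
        print(describe(ch))
    # The digit takes part in numeric parsing; the letter does not.
    print(int(TAMIL_ONE * 2))
```

Had the two been unified on shape alone, one of the two property sets would
have had to give way.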

>   (I'm NOT saying that there shouldn't be such a distinction,
> just curious as to why there is one.) Unicode 7.0 doesn't answer this
> question.  I doubt many here might know, and it will be just fine if I
> never get an answer to that question.

I was not involved in the actual decisions on Unicode 7.0.0, so I'm 
sidestepping the reply on the Arabic code point in question. Others so 
involved have written very cogent summaries of the issue from the 
encoding perspective; but perhaps on a different mailing list.
>
>> decisions entirely on the shape, or appearance of characters and code point
>> sequences. Under that logic, Tamil LETTER KA and TAMIL DIGIT 1 would be the
>> same abstract character, and a (non-identity) decomposition would be
>> required.
>>
>> That's just not how it works.
> Clearly similar letters from different scripts should get different
> codepoints, confusables be damned.  I think no one _today_ will argue
> otherwise.

The strict separation between related scripts (as in the case of Latin,
Greek and Cyrillic) works well for some tasks, but by necessity leads to
the existence of a significant number of cross-script homographs,
especially as people continue to "borrow" from one script into another
('q' and 'w' were borrowed from Latin to write Kurdish in Cyrillic, to
give one historically recent example; they are now separately encoded).
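
For the Kurdish example, a rough sketch, assuming the Cyrillic letters meant
are U+051B and U+051D and using Python's unicodedata module. Taking the first
word of the character name as the script is only a shortcut for this
illustration (the real Script property lives in the UCD's Scripts.txt), and
script_prefix() is a hypothetical helper.

```python
import unicodedata

# (Latin letter, separately encoded Cyrillic counterpart)
PAIRS = [("q", "\u051B"), ("w", "\u051D")]

def script_prefix(ch):
    """Crude script guess: the first word of the Unicode character name."""
    return unicodedata.name(ch).split()[0]

def unified_by_normalization(a, b):
    """True if compatibility normalization (NFKC) maps a and b together."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

if __name__ == "__main__":
    for latin, cyrillic in PAIRS:
        print(
            script_prefix(latin),
            script_prefix(cyrillic),
            unified_by_normalization(latin, cyrillic),
        )
```

Each pair stays distinct under normalization, so any handling of the
resemblance has to come from a confusables policy rather than from the
encoding.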

I do think it's the correct decision for a universal encoding standard
(and that's why it has become the accepted solution). However, from a pure
display or unique-identifier perspective, an encoding where identical (not
merely similar) shapes are encoded only once would be appealing in many
ways.

>> That said, Unicode is generally (and correctly) reluctant to encode
>> homographs.
>> One of the earliest and most ardently requested changes was the proposed
>> separation of "period" and "decimal point". It got rejected, and it
>> was not the
>> only one. Where homographs are encoded, they generally follow certain
> We have enough periods (and spaces, and...).  It's nice to know we have
> one fewer than we could have ended up with.

Way more than one fewer. I gave you as example only one of the _earliest_
(and oft-repeated) requests for such disunification.
>
>> principles. And while these principles will, over time, lead to the
>> encoding of
>> a few more homographs, they in turn, keep things predictable.
>>
>> From my understanding, the case in question fully follows these principles
>> as they are applicable to the encoding of characters for the Arabic script.
>>
>>>> [...]
>>> Should we treat all of these as confusables?
>> Yes, that's the obvious way to handle them. If you have zones that support
>> the concept of (blocked) variants, you can go further and make them that,
>> which has the effect of making them confusables that are declared up front
>> as such in the policy, not "discovered" in later steps of string
>> review and analysis.
> Agreed.
>
> Nico
A./
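
P.S. A minimal sketch of the blocked-variant idea mentioned above, in Python.
The Zone class, its method names and the variant table are all hypothetical;
real zones express this in their own policy tables. What it shows is that the
variant set is declared up front, so registering one label blocks its variants
automatically.

```python
BEH_HAMZA = "\u08A1"             # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_PLUS_HAMZA = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

class Zone:
    """Toy registry in which variant labels are blocked once one is registered."""

    def __init__(self, variant_groups):
        # Map every member of a variant group to one representative sequence.
        self._canon = {}
        for group in variant_groups:
            representative = min(group)
            for seq in group:
                self._canon[seq] = representative
        self._taken = {}  # comparison key -> label that was actually registered

    def _key(self, label):
        # Replace longer sequences first so a multi-code-point variant is
        # matched as a unit. Sufficient for this sketch, not a general algorithm.
        key = label
        for seq in sorted(self._canon, key=len, reverse=True):
            key = key.replace(seq, self._canon[seq])
        return key

    def register(self, label):
        """Register label; return False if it or one of its variants is taken."""
        key = self._key(label)
        if key in self._taken:
            return False
        self._taken[key] = label
        return True

if __name__ == "__main__":
    zone = Zone([{BEH_HAMZA, BEH_PLUS_HAMZA}])
    print(zone.register(BEH_HAMZA + "\u0627"))
    print(zone.register(BEH_PLUS_HAMZA + "\u0627"))
```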