[Json] Json and U+08A1 and related cases

Asmus Freytag asmusf at ix.netcom.com
Fri Jan 23 11:17:12 CET 2015


On 1/23/2015 1:14 AM, "Martin J. Dürst" wrote:
> Hello Asmus,
>
> On 2015/01/22 11:58, Asmus Freytag wrote:
>
>> I would go further, and claim that the notion that "*all homographs are
>> the same abstract character*" is *misplaced, if not incorrect*.
>
> That's fine. Nobody would claim that 8 (U+0038) and ৪ (Bengali 4, 
> U+09EA) are the same abstract character. (How 'homographic' they look 
> will depend on what fonts your mail user agent uses :-)

When I use the term homograph, it is with reference to shapes that are
*the same by design*, not some degree of similarity, and certainly not
any degree of "accidental" similarity. For example, the
ideograph for 'one' is not a homograph of the dash or hyphen, even if
both are based on the idea of a single horizontal line - those instances
are merely cases of potentially confusable similarity. For a true
homograph situation, you really have to have a case where two code points
were assigned to the "same thing", or, since the term homograph refers to
the appearance, to two functions of the "same mark on paper".

Had Unicode encoded a base-line decimal point in distinction from the 
period, that would be a case of a homograph relation.

For non-hypothetical homographs look no further than Greek omicron and
Latin o, or the TAMIL KA and TAMIL digit 1.

In all of these examples, while the shape is utterly the same, there is 
a need for a separate, non-normalizable coded representation. And (as 
the hypothetical example of the decimal point shows) it is not usually 
enough for some mark to have different conventions around its use for it 
to be encoded "by function". Many conventions are handled by software as
a matter of context, just as they are by the human reader -- but not all.
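
To make the "non-normalizable" point concrete, here is a minimal sketch using Python's standard unicodedata module. It checks whether any of the four normalization forms maps the two members of a homograph pair to the same string. The function name and the pair list are illustrative choices of mine, not part of any specification.

```python
import unicodedata

# Homograph pairs mentioned above, written as code points.
HOMOGRAPH_PAIRS = [
    ("\u03BF", "\u006F"),  # GREEK SMALL LETTER OMICRON vs LATIN SMALL LETTER O
    ("\u0B95", "\u0BE7"),  # TAMIL LETTER KA vs TAMIL DIGIT ONE
]
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def merged_by_normalization(a, b):
    """Return the normalization forms under which a and b become identical."""
    return [form for form in FORMS
            if unicodedata.normalize(form, a) == unicodedata.normalize(form, b)]


for a, b in HOMOGRAPH_PAIRS:
    forms = merged_by_normalization(a, b)
    print(f"U+{ord(a):04X} vs U+{ord(b):04X}: merged under {forms}")
```

Each pair should report an empty list: no normalization form folds these homographs together, which is why they have to be handled by some other mechanism.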

>
>
>> U+08A1 is not the only character that has a non-decomposable homograph,
>> and because the encoding of it wasn't an accident, but follows a
>> principle applied by the Unicode Technical Committee, it won't, and
>> can't, be the last instance of a non-decomposable homograph.
>>
>> The "failure of U+08A1 to have a (non-identity) decomposition", while it
>> perhaps complicates the design of a system of robust mnemonic
>> identifiers (such as IDNs), appears not to be due to a "breakdown" of
>> the encoding process and also does not constitute a break of any
>> encoding stability promises by the Unicode Consortium.
>>
>> Rather, it represents reasoned and principled judgment of what is or
>> isn't the "same abstract character". That judgment has to be made
>> somewhere in the process, and the bodies responsible for character
>> encoding get to make the determination.
>
> While I can agree with this characterization, many judgements on 
> character encoding are by their very nature borderline, and U+08A1 
> definitely in many aspects is borderline. 

Totally agreed. I would phrase it differently. In character encoding few 
questions are black and white. Most are more akin to dark-gray vs. light 
gray. Some can look like neutral gray all around.

A few issues don't have a good solution, because in every context from
which you choose to view the question, the trade-offs are different.
Whichever solution you pick for one of those issues, some implementation
will be burdened with costs. Luckily, these situations are not that
frequent.

But they do exist, and it is well understood that there is no single set 
of principles that will help you arrive at a "correct" solution in these 
cases. They follow from the universal nature of the universal character 
set. Therefore, it is well understood that some of the principles cannot 
be satisfied simultaneously in such cases.

If that is what you mean by "borderline", I might agree that this could 
be one of those cases.


> What I hope is that the Unicode Technical Committee, when making 
> future, similar decisions, hopefully puts the borderline a bit more in 
> support of applications such as identifiers, and a bit less in favor 
> of splitting. Also, that it realize that when principles lead to more 
> and more homograph encodings, it may very well pay off to reexamine 
> some of these principles before going down a slippery slope.

In this particular case, it looks like the orthography supported is (at 
this point) not even mainstream, because Latin appears to be the main 
script to write the language(s) in question.

In designing repertoire tables for identifiers, one of the easiest ways
to make them more robust is to *remove irrelevant code points*. I've been
engaged in a process designed to create a repertoire for the DNS Root 
Zone. In that process, we rigorously remove code points that are not in 
widespread modern use (where that is measured relative to the community 
affected).
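
As a rough illustration of repertoire-based filtering, the following sketch admits a label only if every code point is in an explicit allow-list. The tiny REPERTOIRE set and the function names are hypothetical and chosen for brevity; a real zone repertoire would be drawn up by the affected community, not hard-coded like this.

```python
# Hypothetical repertoire for illustration only: a-z, 0-9 and hyphen.
REPERTOIRE = (
    set(range(0x0061, 0x007B))    # LATIN SMALL LETTER A..Z
    | set(range(0x0030, 0x003A))  # DIGIT ZERO..NINE
    | {0x002D}                    # HYPHEN-MINUS
)


def rejected_code_points(label):
    """List the code points in label that fall outside the repertoire."""
    return [f"U+{ord(ch):04X}" for ch in label if ord(ch) not in REPERTOIRE]


def is_admissible(label):
    """A label is admissible only if nothing in it was rejected."""
    return not rejected_code_points(label)


print(is_admissible("example-1"))
print(rejected_code_points("exampl\u03BF"))  # label ending in Greek omicron
```

Anything not explicitly listed is rejected, so a homograph that is irrelevant to the zone's community never gets the chance to collide with anything.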

Some other poster likened the case of U+08A1 to cuneiform. Because 
IDNA2008 allows all the historic scripts (like cuneiform), there seems 
to be no principled position to single out this one code point. Instead, 
if it's troublesome, treat it like cuneiform and don't admit it in your 
zone repertoire.

Speaking of cuneiform (and any number of other ancient writing systems,
especially those with more than a few hundred elements), I don't believe
anybody knows how to create a robust system of identifiers that includes
these writing systems, because very few people understand them well
enough to judge whether they harbor issues of homographs or confusables
by similarity, and to what degree.

Compared to those systems then, the way to treat U+08A1 is to not allow 
it in any zone that doesn't explicitly need to cater to those writing 
Fula in Arabic. If that zone should need to support more than the Fula 
language, it still may not need to support the combining hamza at 0654. 
Certainly, the current draft Arabic repertoire for the Root Zone does 
not include that code point (and the group of community members and 
local experts have good reasons for that exclusion).

The whole issue arose because people were looking at it from the
perspective of what can be or should be done at the protocol level, where
it would not be possible to declare 0654 INVALID. While it would have
been nice to remove this issue in the protocol, it's the wrong place to
do so, because it's not possible to tell at the protocol level which of
these code points are in fact irrelevant. (That's totally similar to
other kinds of homographs.)

However, in the final analysis, both 0654 and 08A1 are equally
specialized, and the most natural solution is to support neither, or
only 08A1, in a given zone repertoire, because the necessity of
supporting 0654 in a system of robust identifiers is questionable.
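
A minimal sketch of that situation, again with Python's unicodedata module: U+08A1 has no decomposition, so normalization keeps it distinct from the sequence U+0628 U+0654, while a zone repertoire that omits U+0654 never admits the sequence in the first place. ZONE_REPERTOIRE and the function names are hypothetical, for illustration only.

```python
import unicodedata

BEH_WITH_HAMZA = "\u08A1"        # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_PLUS_HAMZA = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def normalization_keeps_distinct(a, b):
    """True if a and b differ under every Unicode normalization form."""
    return all(unicodedata.normalize(form, a) != unicodedata.normalize(form, b)
               for form in ("NFC", "NFD", "NFKC", "NFKD"))


# Hypothetical zone repertoire: includes U+08A1, omits combining U+0654.
ZONE_REPERTOIRE = {0x0628, 0x08A1}


def admissible(label):
    """Admit a label only if all of its code points are in the repertoire."""
    return all(ord(ch) in ZONE_REPERTOIRE for ch in label)


print(unicodedata.decomposition(BEH_WITH_HAMZA) == "")  # no decomposition
print(normalization_keeps_distinct(BEH_WITH_HAMZA, BEH_PLUS_HAMZA))
print(admissible(BEH_WITH_HAMZA), admissible(BEH_PLUS_HAMZA))
```

The point of the sketch is that the collision is removed by the repertoire choice, not by normalization.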

----

What should be the outcome of this storm in a tea-cup?

  * The full Unicode 7.0 IDNA tables should be released (with 08A1).
  * The mistaken notion that normalization eliminates all homographs
    should be highlighted.
  * A list of known homographs (confusables by design, not accident)
    should be maintained.
  * A recommendation should be made for zone operators to robustly
    handle them by:

    - supporting only one, or
    - supporting both, but implementing the equivalent of a Pauli
      exclusion principle (in other words, making them "blocked
      variants").
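
The "blocked variants" idea can be sketched roughly as follows, assuming a hypothetical homograph table and a toy in-memory zone. Every label is reduced to a key in which each known homograph is replaced by one representative; a registration is refused when another label already holds that key. None of these names come from an actual registry implementation.

```python
# Hypothetical table of known homographs: (variant, representative).
HOMOGRAPH_TABLE = [
    ("\u0628\u0654", "\u08A1"),  # BEH + HAMZA ABOVE -> BEH WITH HAMZA ABOVE
    ("\u03BF", "o"),             # GREEK OMICRON     -> LATIN SMALL LETTER O
]


def variant_key(label):
    """Replace every known homograph in label with its representative."""
    for variant, representative in HOMOGRAPH_TABLE:
        label = label.replace(variant, representative)
    return label


class Zone:
    """Toy zone that allows at most one label per variant key."""

    def __init__(self):
        self._holder_by_key = {}

    def register(self, label):
        key = variant_key(label)
        if key in self._holder_by_key:
            return False  # blocked: the key is already occupied
        self._holder_by_key[key] = label
        return True


zone = Zone()
print(zone.register("\u08A1"))        # first label takes the slot
print(zone.register("\u0628\u0654"))  # its homograph is blocked
```

Both forms remain valid code point sequences; the zone simply guarantees that only one of them can be delegated at a time.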

-----

A./



