[Json] Json and U+08A1 and related cases
asmusf at ix.netcom.com
Fri Jan 23 11:17:12 CET 2015
On 1/23/2015 1:14 AM, "Martin J. Dürst" wrote:
> Hello Asmus,
> On 2015/01/22 11:58, Asmus Freytag wrote:
>> I would go further, and claim that the notion that "*all homographs are
>> the same abstract character*" is *misplaced, if not incorrect*.
> That's fine. Nobody would claim that 8 (U+0038) and ৪ (Bengali 4,
> U+09EA) are the same abstract character. (How 'homographic' they look
> will depend on what fonts your mail user agent uses :-)
When I use the term homograph, it is with reference to shapes that are
*the same by design*, not some degree of similarity, and certainly not
any degree of "accidental" similarity. For example, the
ideograph for 'one' is not a homograph of the dash or hyphen, even if
both are based on the idea of a single horizontal line - those instances
are merely of potential confusable similarity. For a true homograph
situation, you really have to have a case where two code points were
assigned to the "same thing", or, since the term homograph refers to the
appearance, to two functions of the "same mark on paper".
Had Unicode encoded a base-line decimal point in distinction from the
period, that would be a case of a homograph relation.
For non-hypothetical homographs look no further than Greek omicron and
Latin o, or the TAMIL KA and TAMIL digit 1.
In all of these examples, while the shape is utterly the same, there is
a need for a separate, non-normalizable coded representation. And (as
the hypothetical example of the decimal point shows) it is not usually
enough for some mark to have different conventions around its use for it
to be encoded "by function". Many conventions are handled by software as
a matter of context, just as they are by the human reader -- but not all.
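
A minimal sketch in Python, assuming only the standard unicodedata module, of the point that such pairs stay distinct under every normalization form. The pair list and the helper name same_after_normalization are illustrative choices, not part of any specification:

```python
import unicodedata

# Pairs named in the text: same shape by design, distinct code points.
HOMOGRAPH_PAIRS = [
    ("\u03bf", "\u006f"),  # GREEK SMALL LETTER OMICRON vs LATIN SMALL LETTER O
    ("\u0b95", "\u0be7"),  # TAMIL LETTER KA vs TAMIL DIGIT ONE
]

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def same_after_normalization(a: str, b: str, form: str) -> bool:
    """Return True if a and b become identical under the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


for a, b in HOMOGRAPH_PAIRS:
    for form in FORMS:
        # Each line reports whether normalization merged the pair.
        print(form, f"U+{ord(a):04X}", f"U+{ord(b):04X}",
              same_after_normalization(a, b, form))
```

Normalization only folds sequences that Unicode defines as equivalent, so a comparison like this says nothing about visual identity.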
>> U+08A1 is not the only character that has a non-decomposable
>> homograph, and
>> because the encoding of it wasn't an accident, but follows a principle
>> by the Unicode Technical Committee, it won't, and can't be the last
>> instance of
>> a non-decomposable homograph.
>> The "failure of U+08A1 to have a (non-identity) decomposition", while it
>> complicates the design of a system of robust mnemonic identifiers (such
>> as IDNs)
>> it appears not to be due to a "breakdown" of the encoding process and
>> also does
>> not constitute a break of any encoding stability promises by the
>> Rather, it represents reasoned, and principled judgment of what is or
>> isn't the
>> "same abstract character". That judgment has to be made somewhere in the
>> process, and the bodies responsible for character encoding get to
>> make the
> While I can agree with this characterization, many judgements on
> character encoding are by their very nature borderline, and U+08A1
> definitely in many aspects is borderline.
Totally agreed. I would phrase it differently. In character encoding few
questions are black and white. Most are more akin to dark-gray vs. light
gray. Some can look like neutral gray all around.
A few issues don't have a good solution, because with every context you
choose to view the question under, the trade-offs are different.
Whichever solution you pick for one of those issues, some implementation
will be burdened with costs. Luckily, these situations are not that
But they do exist, and it is well understood that there is no single set
of principles that will help you arrive at a "correct" solution in these
cases. They follow from the universal nature of the universal character
set. Therefore, it is well understood that some of the principles cannot
be satisfied simultaneously in such cases.
If that is what you mean by "borderline", I might agree that this could
be one of those cases.
> What I hope is that the Unicode Technical Committee, when making
> future, similar decisions, hopefully puts the borderline a bit more in
> support of applications such as identifiers, and a bit less in favor
> of splitting. Also, that it realize that when principles lead to more
> and more homograph encodings, it may very well pay off to reexamine
> some of these principles before going down a slippery slope.
In this particular case, it looks like the orthography supported is (at
this point) not even mainstream, because Latin appears to be the main
script to write the language(s) in question.
In designing repertoire tables for identifiers, one of the easiest ways
to make them more robust is to *remove irrelevant code points*. I've been
engaged in a process designed to create a repertoire for the DNS Root
Zone. In that process, we rigorously remove code points that are not in
widespread modern use (where that is measured relative to the community
Some other poster likened the case of U+08A1 to cuneiform. Because
IDNA2008 allows all the historic scripts (like cuneiform), there seems
to be no principled position to single out this one code point. Instead,
if it's troublesome, treat it like cuneiform and don't admit it in your
Speaking of cuneiform (and any number of other ancient writing systems,
especially those with more than a few hundred elements): I don't believe
anybody knows how to create a robust system of identifiers that includes
these writing systems because very few people really understand them
well enough to understand whether they harbor issues of homographs or
confusables by similarity and to what degree.
Compared to those systems then, the way to treat U+08A1 is to not allow
it in any zone that doesn't explicitly need to cater to those writing
Fula in Arabic. If that zone should need to support more than the Fula
language, it still may not need to support the combining hamza at 0654.
Certainly, the current draft Arabic repertoire for the Root Zone does
not include that code point (and the group of community members and
local experts have good reasons for that exclusion).
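
As a rough illustration of repertoire-based exclusion, a zone can admit a label only if every code point is in its repertoire. The repertoire below is a hypothetical range chosen for the example, not the actual draft Arabic repertoire for the Root Zone, and the function names are assumptions:

```python
# Hypothetical repertoire for illustration only: a contiguous run of basic
# Arabic letters. It contains neither U+0654 nor U+08A1. A real repertoire
# is decided by the community concerned, code point by code point.
ZONE_REPERTOIRE = frozenset(range(0x0627, 0x064B))


def excluded_code_points(label: str, repertoire=ZONE_REPERTOIRE) -> list:
    """List the code points of label that the repertoire does not admit."""
    return [f"U+{ord(ch):04X}" for ch in label if ord(ch) not in repertoire]


def is_admissible(label: str, repertoire=ZONE_REPERTOIRE) -> bool:
    """A label is admissible only if every code point is in the repertoire."""
    return not excluded_code_points(label, repertoire)


print(is_admissible("\u0628\u0627\u0628"))    # letters only
print(excluded_code_points("\u0628\u0654"))   # beh + combining hamza above
print(excluded_code_points("\u08a1"))         # beh with hamza above
```

With a repertoire like this the homograph question never arises in that zone, because neither form can be registered.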
The whole issue arose because people were looking at it in terms of what can
or should be done at the protocol level, where it would not be possible
to declare 0654 INVALID. While it would have been nice to remove this
issue in the protocol, it's the wrong place to do so, because it's not
possible to tell at the protocol level which of these code points are in
fact irrelevant. (That's totally similar to other kinds of homographs.)
However, in the final analysis, both 0654 and 08A1 are equally
specialized and the most natural solution is to not support either or
only 08A1 in a given zone repertoire, because the necessity to support
0654 for a system of robust identifiers is to be questioned.
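
The relation between the two encodings can be checked with Python's standard unicodedata module. This is a sketch: the constant and function names are illustrative, and the result depends on the Unicode data shipped with the interpreter:

```python
import unicodedata

BEH_HAMZA_SEQUENCE = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE
BEH_WITH_HAMZA = "\u08a1"            # ARABIC LETTER BEH WITH HAMZA ABOVE

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def has_decomposition(ch: str) -> bool:
    """True if the character database lists a decomposition mapping for ch."""
    return unicodedata.decomposition(ch) != ""


def ever_equivalent(a: str, b: str) -> bool:
    """True if any normalization form maps a and b to the same string."""
    return any(unicodedata.normalize(f, a) == unicodedata.normalize(f, b)
               for f in FORMS)


print(has_decomposition(BEH_WITH_HAMZA))
print(ever_equivalent(BEH_HAMZA_SEQUENCE, BEH_WITH_HAMZA))
```

Because no form maps one to the other, a zone that admits both has to handle the collision itself rather than rely on normalization.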
What should be the outcome of this storm in a tea-cup?
* The full Unicode 7.0 IDNA tables should be released (with 08A1).
* The mistaken notion that normalization eliminates all homographs
  should be highlighted.
* A list of known homographs (confusables by design, not accident)
  should be maintained.
* A recommendation should be made for zone operators to robustly
  handle them by:
    - supporting only one
    - supporting both, but implementing the equivalent of a Pauli exclusion
      (in other words, make them "blocked variants")
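
The "blocked variants" option could be sketched as follows in Python. The homograph tables, the Zone class, and variant_key are hypothetical names invented for this illustration; the tables list only the pairs mentioned in this message, and a real registry would work from a maintained list:

```python
# Hypothetical homograph tables: each entry maps one member of a homograph
# class to the representative chosen for that class.
SEQUENCE_REPRESENTATIVE = {"\u0628\u0654": "\u08a1"}  # beh + hamza above
CODE_POINT_REPRESENTATIVE = {
    "\u03bf": "o",        # Greek omicron -> Latin o
    "\u0be7": "\u0b95",   # Tamil digit one -> Tamil letter ka
}


def variant_key(label: str) -> str:
    """Map a label to a key shared by all of its homograph variants."""
    for sequence, representative in SEQUENCE_REPRESENTATIVE.items():
        label = label.replace(sequence, representative)
    return "".join(CODE_POINT_REPRESENTATIVE.get(ch, ch) for ch in label)


class Zone:
    """Toy registry where at most one label per variant set can exist."""

    def __init__(self):
        self._registered = {}  # variant key -> label actually registered

    def register(self, label: str) -> bool:
        key = variant_key(label)
        if key in self._registered:
            return False  # blocked: this label or a variant already exists
        self._registered[key] = label
        return True


zone = Zone()
print(zone.register("\u08a1\u0627"))        # first member of the set
print(zone.register("\u0628\u0654\u0627"))  # its homograph variant
```

Both code points remain valid in the zone, but once one label is registered its homograph variants cannot be.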