[Json] Json and U+08A1 and related cases

Vint Cerf vint at google.com
Sat Jan 24 15:44:49 CET 2015


I have been following this discussion with some interest and have come away
with a thought that some of you may wish to refine or perhaps debate.
Basically, I see the Unicode effort as only partly aligned with the needs
of the Internet's Domain Name System, and the effort to use Unicode
character parameters/descriptors/properties does not always line up with
the properties that are desirable for characters used in the DNS. It seems
to me useful to recall that domain names are identifiers that are not
expected, or even intended, to follow purely linguistic constraints. They
are used to create what are intended to be unique identifiers. Characters
that have a high probability of looking the same but are encoded
differently work against that goal. Of course, I am fully aware of the
confusability of the lower case letter "L" and the digit "ONE" (and of "OH"
and "ZERO"), which is sometimes used as an example of the inconsistent
toleration of confusion in ASCII labels, but I consider this to be an
argument of the form "you allowed a case of confusion, therefore you should
tolerate all confusion".

I do wonder whether it is worth considering an attempt to create a new set
of properties of Unicode characters that are of specific use to the DNS.
The IDNA 2008 work tried to use properties of characters developed for
purposes other than the DNS, and the fit is not always perfect.
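
To make the U+08A1 case from the quoted thread below concrete, here is a
minimal Python sketch, offered only as an illustration and not as anything
taken from IDNA 2008. It uses only the standard unicodedata module; the
helper name same_after_nfc is hypothetical. It contrasts U+0623, which has
a canonical decomposition, with U+08A1, which has none, so normalization
alone does not unify U+08A1 with the look-alike sequence U+0628 U+0654.

```python
import unicodedata

# Precomposed letters and their look-alike "base + combining hamza" sequences.
ALEF_HAMZA = "\u0623"             # ARABIC LETTER ALEF WITH HAMZA ABOVE
ALEF_PLUS_HAMZA = "\u0627\u0654"  # ALEF + combining HAMZA ABOVE
BEH_HAMZA = "\u08A1"              # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_PLUS_HAMZA = "\u0628\u0654"   # BEH + combining HAMZA ABOVE


def same_after_nfc(a: str, b: str) -> bool:
    """Hypothetical helper: do two strings compare equal after NFC?"""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# U+0623 carries a canonical decomposition; U+08A1 carries none.
print(repr(unicodedata.decomposition(ALEF_HAMZA)))  # '0627 0654'
print(repr(unicodedata.decomposition(BEH_HAMZA)))   # ''

# Normalization merges the first pair but leaves the second pair distinct.
print(same_after_nfc(ALEF_HAMZA, ALEF_PLUS_HAMZA))  # True
print(same_after_nfc(BEH_HAMZA, BEH_PLUS_HAMZA))    # False
```

Pairs like the second one are the kind that normalization by itself leaves
as two different identifiers.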

vint


On Fri, Jan 23, 2015 at 4:14 AM, "Martin J. Dürst" <duerst at it.aoyama.ac.jp>
wrote:

> Hello Asmus,
>
> On 2015/01/22 11:58, Asmus Freytag wrote:
>
>> I would go further, and claim that the notion that "*all homographs are
>> the same abstract character*" is *misplaced, if not incorrect*.
>>
>
> That's fine. Nobody would claim that 8 (U+0038) and ৪ (Bengali 4, U+09EA)
> are the same abstract character. (How 'homographic' they look will depend
> on what fonts your mail user agent uses :-)
>
>
>> U+08A1 is not the only character that has a non-decomposable homograph,
>> and because the encoding of it wasn't an accident, but follows a
>> principle applied by the Unicode Technical Committee, it won't, and
>> can't be the last instance of a non-decomposable homograph.
>>
>> The "failure of U+08A1 to have a (non-identity) decomposition", while it
>> perhaps complicates the design of a system of robust mnemonic
>> identifiers (such as IDNs), appears not to be due to a "breakdown" of
>> the encoding process and also does not constitute a break of any
>> encoding stability promises by the Unicode Consortium.
>>
>> Rather, it represents reasoned, and principled judgment of what is or
>> isn't the "same abstract character". That judgment has to be made
>> somewhere in the process, and the bodies responsible for character
>> encoding get to make the determination.
>>
>
> While I can agree with this characterization, many judgements on character
> encoding are by their very nature borderline, and U+08A1 is definitely
> borderline in many respects. What I hope is that the Unicode Technical
> Committee, when making similar decisions in the future, puts the
> borderline a bit more in support of applications such as identifiers, and
> a bit less in favor of splitting. I also hope it realizes that when
> principles lead to more and more homograph encodings, it may very well pay
> off to reexamine some of these principles before going down a slippery
> slope.
>
> Regards,   Martin.
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>