[Json] Json and U+08A1 and related cases

Shawn Steele Shawn.Steele at microsoft.com
Sun Jan 25 02:15:56 CET 2015


As long as we’re being very open about the identifiers, I think that DNS names may have been intended to be unique identifiers, but they have evolved into human-readable (for the most part) identifiers.  If they were “just” unique, a bunch of numbers would have sufficed.  Clearly they are now not just unique identifiers, but also cater to linguistic behavior.

I think that the important part of name resolution isn’t whether or not certain characters are “allowed”, but rather that they resolve to the same thing (e.g., they’re identifiers).  I don’t think it’s important that DNS support all possible combinations, but that where names are resolved, they are resolved consistently.  Currently 5 names can resolve to the same IP, and I don’t see a problem with that.  So I think it should be entirely possible for the “confusable” characters to merely resolve to the same thing, e.g., be bundled.  Sure, people then can’t register some names that use similar letters (or variations), but then it isn’t confusing.  You do also have a round-tripping problem, because if 5 names resolve to the same thing, which do you display?
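To make the bundling idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the tiny CONFUSABLE_FOLD table, the skeleton() helper, and the BundledRegistry class are hypothetical and do not correspond to any existing DNS or registry mechanism. It only shows the shape of the idea: look-alike spellings share one record, a colliding registration is refused, and the display question needs some explicit policy.

```python
# Hypothetical, hand-made fold table: each look-alike maps to one representative.
# A real deployment would need a far larger, carefully reviewed table.
CONFUSABLE_FOLD = {"1": "l", "0": "o"}


def skeleton(label: str) -> str:
    """Fold a label so that look-alike spellings compare equal."""
    return "".join(CONFUSABLE_FOLD.get(ch, ch) for ch in label.lower())


class BundledRegistry:
    """Toy registry in which every confusable variant shares one record."""

    def __init__(self) -> None:
        self._records = {}  # skeleton -> (registered spelling, address)

    def register(self, label: str, address: str) -> None:
        key = skeleton(label)
        if key in self._records:
            # The bundle is taken: similar-looking names cannot be registered.
            taken = self._records[key][0]
            raise ValueError(f"{label!r} collides with {taken!r}")
        self._records[key] = (label, address)

    def resolve(self, label: str) -> str:
        """Return the address shared by every variant in the bundle."""
        return self._records[skeleton(label)][1]

    def display_form(self, label: str) -> str:
        # One possible answer to the round-tripping question (an assumption):
        # always show the spelling that was registered.
        return self._records[skeleton(label)][0]


registry = BundledRegistry()
registry.register("example", "192.0.2.1")  # 192.0.2.0/24 is a documentation range
print(registry.resolve("examp1e"))         # resolves via the same bundle as "example"
print(registry.display_form("examp1e"))    # shows the registered spelling
```

Showing the registered spelling is only one policy; the sketch does not settle which of several equivalent names ought to be displayed.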

-Shawn

From: Idna-update [mailto:idna-update-bounces at alvestrand.no] On Behalf Of Vint Cerf
Sent: Saturday, January 24, 2015 6:45 AM
To: Martin J. Dürst
Cc: John C Klensin; Asmus Freytag; idna-update at alvestrand.no; The IESG
Subject: Re: [Json] Json and U+08A1 and related cases

I have been following this discussion with some interest and have come away with a thought that some of you may wish to refine or perhaps debate. Basically, I see the Unicode effort as only partly aligned to the needs of the Internet's Domain Name System, and the effort to use the Unicode character parameters/descriptors/properties does not always line up with the desirable properties of the use of characters in the DNS. It seems to me useful to recall that domain names are identifiers that are not expected or even intended to follow purely linguistic constraints. They are used to create what are intended to be unique identifiers. Characters that have a high probability of looking the same but are encoded differently work against that goal. Of course, I am fully aware of the confusability of the lower case letter "L" and the digit "ONE" (and "OH" and "ZERO"), which is sometimes used as an example of the inconsistent toleration of confusion in ASCII labels, but I consider this to be an argument of the form "you allowed a case of confusion, therefore you should tolerate all confusion".

I do wonder whether it is worth considering an attempt to create a new set of properties of Unicode characters that are of specific use to the DNS. The IDNA 2008 work tried to use properties of characters developed for purposes other than the DNS, and the fit is not always perfect.

vint


On Fri, Jan 23, 2015 at 4:14 AM, "Martin J. Dürst" <duerst at it.aoyama.ac.jp> wrote:
Hello Asmus,

On 2015/01/22 11:58, Asmus Freytag wrote:
I would go further, and claim that the notion that "*all homographs are the same abstract character*" is *misplaced, if not incorrect*.

That's fine. Nobody would claim that 8 (U+0038) and ৪ (Bengali 4, U+09EA) are the same abstract character. (How 'homographic' they look will depend on what fonts your mail user agent uses :-)

U+08A1 is not the only character that has a non-decomposable homograph, and because the encoding of it wasn't an accident, but follows a principle applied by the Unicode Technical Committee, it won't, and can't be the last instance of a non-decomposable homograph.

The "failure of U+08A1 to have a (non-identity) decomposition", while it perhaps complicates the design of a system of robust mnemonic identifiers (such as IDNs), appears not to be due to a "breakdown" of the encoding process and also does not constitute a break of any encoding stability promises by the Unicode Consortium.

Rather, it represents reasoned, and principled judgment of what is or isn't the "same abstract character". That judgment has to be made somewhere in the process, and the bodies responsible for character encoding get to make the determination.
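
The decomposition behaviour described above can be checked with Python's standard unicodedata module. This is only an illustrative sketch, assuming a Python build whose Unicode data is version 7.0 or later (the version that added U+08A1); the helper names code_points() and same_after_nfc() are made up for this example. U+0623 (ALEF WITH HAMZA ABOVE) is included as a contrasting character that does have a canonical decomposition, and the Bengali digit from the earlier example shows that normalization does not unify look-alike characters.

```python
import unicodedata


def code_points(s: str) -> str:
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


def same_after_nfc(a: str, b: str) -> bool:
    """True if the two strings are identical once both are NFC-normalized."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


beh_hamza = "\u08A1"            # ARABIC LETTER BEH WITH HAMZA ABOVE
beh_sequence = "\u0628\u0654"   # ARABIC LETTER BEH + combining HAMZA ABOVE
alef_hamza = "\u0623"           # ARABIC LETTER ALEF WITH HAMZA ABOVE
alef_sequence = "\u0627\u0654"  # ARABIC LETTER ALEF + combining HAMZA ABOVE

# U+0623 has a canonical decomposition, so NFC unifies the two spellings.
print(repr(unicodedata.decomposition(alef_hamza)))  # '0627 0654'
print(same_after_nfc(alef_hamza, alef_sequence))    # True

# U+08A1 has no decomposition, so the two spellings stay distinct strings.
print(repr(unicodedata.decomposition(beh_hamza)))   # ''
print(same_after_nfc(beh_hamza, beh_sequence))      # False

# Look-alikes across scripts are not unified by normalization either:
# BENGALI DIGIT FOUR stays U+09EA even under NFKC.
print(code_points(unicodedata.normalize("NFKC", "\u09EA")))  # U+09EA
```

Any identifier comparison that relies on normalization alone will therefore treat the two BEH spellings as different labels.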

While I can agree with this characterization, many judgements on character encoding are by their very nature borderline, and U+08A1 is definitely borderline in many aspects. What I hope is that the Unicode Technical Committee, when making similar decisions in the future, puts the borderline a bit more in support of applications such as identifiers, and a bit less in favor of splitting. I also hope it realizes that when principles lead to more and more homograph encodings, it may very well pay off to reexamine some of these principles before going down a slippery slope.

Regards,   Martin.
