[Json] Json and U+08A1 and related cases

Sat Jan 24 17:22:19 CET 2015

On 1/24/2015 6:44 AM, Vint Cerf wrote:
> I have been following this discussion with some interest and have come 
> away with a thought that some of you may wish to refine or perhaps 
> debate. Basically, I see the UNICODE effort as only partly aligned to 
> the needs of the Internet's Domain name System

Agreed, that is so, and by necessity. Unicode, as the *universal*
character set, cannot hope to be aligned perfectly with any single use
case. And the DNS is one particular use case.

> and the effort to use the UNICODE character 
> parameters/descriptors/properties does not always line up with the 
> desirable properties of the use of characters in the DNS.

There is less of a restriction on Unicode properties. In principle, 
properties can be tailored to any problem domain or implementation. In 
fact, PVALID is itself a character property, just one not specified by
the Unicode Consortium.

So, in principle, properties can be defined (whether by the IETF or
Unicode) that accommodate the needs of the DNS.

> It seems to me useful to recall that domain names are identifiers that 
> are not expected or even intended to follow purely linguistic 
> constraints. They are used to create what are intended to be unique 
> identifiers.

...that are reasonably mnemonic.

Without the last qualifier, you'd not need IDNs.

While mnemonics are often based on words or phrases of a given language, 
they are not identical to it, and not all linguistic conventions need 
apply. Definitely agree.

There is, however, a clear pressure to make the system 
non-discriminatory; that is, to support basing mnemonics on all 
languages (or rather writing systems) with something like "equal ease". 
That drags in the full messiness of writing systems by the back door.

> Characters that have a high probability of looking the same but are 
> encoded differently work against that goal. Of course I am fully aware 
> of the confusability of the lower case letter "L" and the digit "ONE" 
> (and "OH" and "ZERO") that is sometimes used as an example of the 
> inconsistent toleration of confusion in the ASCII labels but I 
> consider this to be an argument of the form "you allowed a case of 
> confusion therefore you should tolerate all confusion".

There's accidental confusability and then there's confusability by 
design - and all the shades between them. Accidental confusability 
depends on issues of font size, font design and/or human perception (for 
example, the confusability between "rn" and "m"). Confusability by
design is based on issues of dual encoding, homographs, and character
derivation and borrowing.
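
To make "confusability by design" concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The pair chosen (Latin "a" and Cyrillic "а") and the helper names `describe` and `same_after_normalization` are illustrative assumptions, not something taken from the IDNA specifications. It shows that the two letters are distinct code points that normalization does not unify, whereas a true dual encoding such as precomposed "é" versus "e" plus a combining acute accent is unified by NFC.

```python
import unicodedata

# Illustrative pair (an assumption for this sketch): two letters that most
# fonts render identically but that are encoded separately.
LATIN_A = "\u0061"
CYRILLIC_A = "\u0430"


def describe(ch):
    """Return the code point and Unicode character name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def same_after_normalization(a, b, form="NFKC"):
    """True if the two strings are identical after Unicode normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(describe(LATIN_A))
print(describe(CYRILLIC_A))

# Cross-script homographs are not unified by normalization.
print(same_after_normalization(LATIN_A, CYRILLIC_A))

# A canonical dual encoding is unified by NFC.
print(same_after_normalization("\u00e9", "e\u0301", form="NFC"))
```

Normalization therefore handles only the cases Unicode itself declares equivalent; the rest has to be dealt with by some other mechanism.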

Because of the pressure to allow mnemonics to be usable by users of 
other scripts, you inevitably drag in all the issues for these scripts 
(and, in the case of Latin, or Arabic, the issues that derive from 
having adapted these scripts to a multitude of orthographies).

>
> I do wonder whether it is worth considering an attempt to create a new 
> set of properties of UNICODED characters that are of specific use to 
> the DNS. The IDNA 2008 work tried to use properties of characters 
> developed for purposes other than the DNS and the fit is not always 
> perfect.

In principle the answer to that is yes.

Unicode has discovered that the cleanest way to do many properties is to 
derive any new property from a combination of other properties where 
possible, and where not, to create exception lists. (Where the 
underlying properties are not immutable, the derivation gets checked 
each version, and exception lists can be re-generated to keep the 
derived property immutable. That's still less work than maintaining an
entirely separate property.)

That's more or less the path that's been followed for the
IDNA2008-specific properties.

In that sense, your argument comes down to improving the
IDNA2008-specific properties.
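
The "derive from other properties, plus exception lists" approach described above can be sketched in a few lines of Python. This is only loosely modeled on how the IDNA2008 derived property is built and is not a faithful implementation: the category set, the two exception entries, the status names, and the function names `rule_only`, `derived_status`, and `stabilize` are all assumptions chosen for illustration.

```python
import unicodedata

# Assumed base rule: a code point is allowed if its General_Category is one
# of these (lowercase letters, other letters, modifiers, marks, digits).
BASE_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

# Illustrative exception lists that override the base rule.
FORCE_ALLOWED = {0x3007}     # IDEOGRAPHIC NUMBER ZERO (category Nl)
FORCE_DISALLOWED = {0x0640}  # ARABIC TATWEEL (category Lm)


def rule_only(cp):
    """Status derived purely from the underlying Unicode property."""
    if unicodedata.category(chr(cp)) in BASE_CATEGORIES:
        return "ALLOWED"
    return "DISALLOWED"


def derived_status(cp):
    """Status after applying the exception lists on top of the base rule."""
    if cp in FORCE_ALLOWED:
        return "ALLOWED"
    if cp in FORCE_DISALLOWED:
        return "DISALLOWED"
    return rule_only(cp)


def stabilize(previous, code_points):
    """Find code points whose rule-derived status differs from the status
    frozen in an earlier version; these need new exception entries."""
    keep_allowed, keep_disallowed = set(), set()
    for cp in code_points:
        old = previous.get(cp)
        if old is None or old == rule_only(cp):
            continue
        if old == "ALLOWED":
            keep_allowed.add(cp)
        else:
            keep_disallowed.add(cp)
    return keep_allowed, keep_disallowed


for cp in (0x0061, 0x0041, 0x0640, 0x3007):
    print(f"U+{cp:04X}", rule_only(cp), derived_status(cp))
```

The `stabilize` step is the per-version check: whenever an underlying property changes, the affected code points are added to an exception list so that the derived value stays put.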

I see one practical limitation in the fact that what is good for a 
stable and robust system of universal identifiers will be at odds with 
the desire to provide mnemonics that work according to the expectations 
of specific sets of users (those expectations being based on the writing 
system, and the use thereof, that they are familiar with).

As long as you cater to that on the protocol level, you run into the 
same kinds of "universality constraints" that Unicode runs into: some 
stuff needed for local support doesn't play well globally (and vice versa).

Having just gone through that exercise, we've concluded that only about 
a third of all code points that are PVALID should even be considered for 
the Root Zone. The actual number that will come out of the more detailed 
investigations to follow will be smaller.

In some cases, the restrictions imposed by that limitation will lead to 
exclusions that will look mighty arbitrary if seen through the lens of a 
local writing system. Just as it's not possible to render an English
possessive in the DNS ("Barron's"), in some language we are proposing
not to support the representation of plurals in the root. That's
appropriate for the root, but I wonder very much whether it's 
appropriate to do something that drastic on the protocol level.

And, as long as it isn't, it would represent a constraint on the kinds 
of properties you can design on the protocol level.

In the case where two writing systems have conflicting demands, but 
where you don't want to pick one over the other, you need a different 
mechanism that essentially says: in each zone, you can have either one 
of these, but not both. And you want that mechanism as close to the 
protocol level as you can get.

Having a robust way to define this mutual exclusion in a zone's IDN 
table (and perhaps backed up by an IDNA property that flags a code point 
or sequence as requiring such an exclusion to be defined) would seem to 
be an answer. In the root zone, we will have such a robust exclusion 
mechanism by the use of "blocked" variants.
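
As a rough illustration of such a per-zone exclusion, the following Python sketch treats code points in the same variant set as interchangeable when computing a lookup key, so that once one label is registered, every variant of it is blocked. The `Zone` class, its methods, and the sample variant set are hypothetical and far simpler than a real IDN table; in particular, this sketch handles only single code points, not sequences.

```python
class Zone:
    """Toy registry in which variant labels are mutually exclusive."""

    def __init__(self, variant_sets):
        # Map every code point in a variant set to one representative.
        self._canon = {}
        for group in variant_sets:
            representative = min(group)
            for ch in group:
                self._canon[ch] = representative
        # Index key -> the label actually registered under that key.
        self._allocated = {}

    def _key(self, label):
        """Replace each code point by its variant-set representative."""
        return "".join(self._canon.get(ch, ch) for ch in label)

    def register(self, label):
        """Register a label unless it or one of its variants is taken."""
        key = self._key(label)
        if key in self._allocated:
            return False  # same label, or a blocked variant of it
        self._allocated[key] = label
        return True

    def holder(self, label):
        """Return the registered label that blocks this one, if any."""
        return self._allocated.get(self._key(label))


# Assumed variant set for this sketch: Latin "a" and Cyrillic "а".
zone = Zone([{"a", "\u0430"}])
print(zone.register("data"))            # first registration
print(zone.register("d\u0430t\u0430"))  # variant of an existing label
print(zone.register("date"))            # unrelated label
```

Either member of a variant pair can be used in the zone, but whichever label arrives first excludes the other, which is the "either one of these, but not both" behavior.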

A./