[Json] Json and U+08A1 and related cases
asmusf at ix.netcom.com
Sat Jan 24 17:22:19 CET 2015
On 1/24/2015 6:44 AM, Vint Cerf wrote:
> I have been following this discussion with some interest and have come
> away with a thought that some of you may wish to refine or perhaps
> debate. Basically, I see the UNICODE effort as only partly aligned to
> the needs of the Internet's Domain name System
Agreed, that is so, and by necessity. Unicode, as the *universal*
character set, cannot hope to be aligned perfectly with any single use
case. And the DNS is one particular use case.
> and the effort to use the UNICODE character
> parameters/descriptors/properties does not always line up with the
> desirable properties of the use of characters in the DNS.
There is less of a restriction on Unicode properties. In principle,
properties can be tailored to any problem domain or implementation. In
fact, PVALID is a character property, except one not specified by the
So, it's in principle not the case that no properties can be defined
(whether by IETF or Unicode) that accommodate the needs of the DNS.
> It seems to me useful to recall that domain names are identifiers that
> are not expected or even intended to follow purely linguistic
> constraints. They are used to create what are intended to be unique
...that are reasonably mnemonic.
Without the last qualifier, you'd not need IDNs.
While mnemonics are often based on words or phrases of a given language,
they are not identical to it, and not all linguistic conventions need
apply. Definitely agree.
There is, however, a clear pressure to make the system
non-discriminatory; that is, to support basing mnemonics on all
languages (or rather writing systems) with something like "equal ease".
That drags in the full messiness of writing systems by the back door.
> Characters that have a high probability of looking the same but are
> encoded differently work against that goal. Of course I am fully aware
> of the confusability of the lower case letter "L" and the digit "ONE"
> (and "OH" and "ZERO") that is sometimes used as an example of the
> inconsistent toleration of confusion in the ASCII labels but I
> consider this to be an argument of the form "you allowed a case of
> confusion therefore you should tolerate all confusion".
There's accidental confusability and then there's confusability by
design - and all the shades between them. Accidental confusability
depends on issues of font size, font design and/or human perception (for
example, the confusability between "rn" and "m"). Confusability by
design is based on issues of dual encoding, homographs, and character
derivation and borrowing.
Because of the pressure to allow mnemonics to be usable by users of
other scripts, you inevitably drag in all the issues for these scripts
(and, in the case of Latin, or Arabic, the issues that derive from
having adapted these scripts to a multitude of orthographies).
> I do wonder whether it is worth considering an attempt to create a new
> set of properties of UNICODED characters that are of specific use to
> the DNS. The IDNA 2008 work tried to use properties of characters
> developed for purposes other than the DNS and the fit is not always
In principle the answer to that is yes.
Unicode has discovered that the cleanest way to do many properties is to
derive any new property from a combination of other properties where
possible, and where not, to create exception lists. (Where the
underlying properties are not immutable, the derivation gets checked
each version, and exception lists can be re-generated to keep the
derived property immutable. That's still less work than maintaining an
entirely separate property.)
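To make the derive-plus-exceptions pattern concrete, here is a minimal
sketch in Python. It is not the IDNA2008 derivation: the base rule, the
category set and the two exception entries are assumptions picked only
for illustration, and the names base_rule, derived_allowed and
regenerate_exceptions are hypothetical.

```python
import unicodedata
from typing import Callable, Dict

# Assumed base rule: allow a code point if its General_Category is in this set.
# The set is illustrative only, not the actual IDNA2008 rule.
BASE_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

# Hypothetical exception list: values here override the base rule.
EXCEPTIONS: Dict[int, bool] = {
    0x00B7: True,   # MIDDLE DOT (category Po): allowed by exception
    0x0640: False,  # ARABIC TATWEEL (category Lm): denied by exception
}


def base_rule(cp: int) -> bool:
    """Value derived purely from an underlying Unicode property."""
    return unicodedata.category(chr(cp)) in BASE_CATEGORIES


def derived_allowed(cp: int) -> bool:
    """Derived property: the exception list wins, otherwise the base rule."""
    if cp in EXCEPTIONS:
        return EXCEPTIONS[cp]
    return base_rule(cp)


def regenerate_exceptions(
    frozen: Dict[int, bool], rule: Callable[[int], bool]
) -> Dict[int, bool]:
    """Recompute the exception list after the underlying data has changed.

    `frozen` holds the values already published for the derived property.
    Every code point whose freshly derived value disagrees with its frozen
    value becomes an exception, so the published value never changes.
    """
    return {cp: value for cp, value in frozen.items() if rule(cp) != value}
```

The point of regenerate_exceptions is the per-version check: the
derivation is re-run against the new data, and only the disagreements
need to be carried as exceptions.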
That's more or less the path that's been followed for the IDNA2008
In that sense, your argument comes down to improving the IDNA2008
I see one practical limitation in the fact that what is good for a
stable and robust system of universal identifiers will be at odds with
the desire to provide mnemonics that work according to the expectations
of specific sets of users (those expectations being based on the writing
system, and the use thereof, that they are familiar with).
As long as you cater to that on the protocol level, you run into the
same kinds of "universality constraints" that Unicode runs into: some
stuff needed for local support doesn't play well globally (and vice versa).
Having just gone through that exercise, we've concluded that only about
a third of all code points that are PVALID should even be considered for
the Root Zone. The actual number that will come out of the more detailed
investigations to follow will be smaller.
In some cases, the restrictions imposed by that limitation will lead to
exclusions that will look mighty arbitrary if seen through the lens of a
local writing system. While it's not possible to render an English
possessive in the DNS ("Barron's"), in some language we are proposing to
not support the representation of plurals in the root. That's
appropriate for the root, but I wonder very much whether it's
appropriate to do something that drastic on the protocol level.
And, as long as it isn't, it would represent a constraint on the kinds
of properties you can design on the protocol level.
In the case where two writing systems have conflicting demands, but
where you don't want to pick one over the other, you need a different
mechanism that essentially says: in each zone, you can have either one
of these, but not both. And you want that mechanism as close to the
protocol level as you can get.
Having a robust way to define this mutual exclusion in a zone's IDN
table (and perhaps backed up by an IDNA property that flags a code point
or sequence as requiring such an exclusion to be defined) would seem to
be an answer. In the root zone, we will have such a robust exclusion
mechanism by the use of "blocked" variants.
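As a rough sketch of what "either one of these, but not both" could
look like inside a single zone, the Python below maps every label to an
index form and refuses a registration whose index form is already
taken. This is only an assumed model of blocked variants, not the Root
Zone procedure: the VARIANTS table, the Zone class and index_label are
hypothetical names, and the two table entries are chosen for
illustration.

```python
from typing import Dict, Optional

# Hypothetical variant table: each key is treated as a blocked variant of
# its value. Entries may be single code points or sequences.
VARIANTS: Dict[str, str] = {
    "\u08A1": "\u0628\u0654",  # BEH WITH HAMZA ABOVE vs. BEH + HAMZA ABOVE
    "\u0430": "a",             # Cyrillic small a vs. Latin small a
}


def index_label(label: str) -> str:
    """Replace every variant with its representative to get an index form."""
    for variant, representative in VARIANTS.items():
        label = label.replace(variant, representative)
    return label


class Zone:
    """One zone: whichever variant is registered first blocks the others."""

    def __init__(self) -> None:
        self._by_index: Dict[str, str] = {}

    def register(self, label: str) -> Optional[str]:
        """Return None on success, else the registered label that blocks it."""
        key = index_label(label)
        holder = self._by_index.get(key)
        if holder is not None:
            return holder
        self._by_index[key] = label
        return None
```

Neither form is preferred by the table itself. Two different zones can
make opposite choices, and each zone only guarantees that it never
holds both.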