Update to clarify combining characters

John C Klensin klensin at jck.com
Tue Apr 22 22:15:59 CEST 2014



--On Tuesday, April 22, 2014 09:38 -0700 Eric Brunner-Williams
<ebw at abenaki.wabanaki.net> wrote:

> On 4/22/14 1:10 AM, Cary Karp wrote:
>>> > ... in Abenaki we use several ASCII character sequences
>>> > inter-changeably ("ou", "w" and "8") as well as an "u atop
>>> > o" character defined in one or more extensions to ASCII,
>>> > which typewriters with half-height settings, and the
>>> > character "8" have accommodated over the past century, in
>>> > support of a local (to a zone) semantic, e.g., equivalency
>>> > of two labels, e.g., "ou.example" and "8.example" (or
>>> > "wabanaki.example" and "8abanaki.example" and
>>> > "ouabanaki.example"),
>> Are there similar non-ASCII examples?
>> 
> 
> Values assigned outside the ASCII range for the "u-above-o"
> combined character in the UTC repertoire are  U+0222 and
> U+0223, reflecting the casing of Latin Script.

So, as characters that can be used in labels (see below for some
other issues), there are actually precomposed characters for the
above.  For use of characters with precombined forms in DNS
labels, it is important that the IDNA requirement for NFC be
applied carefully (that requirement essentially eliminates
leading combining characters or marks) but, otherwise, what Eric
describes is an input problem (specifically what the Unicode
community refers to as an "IME" one).  In that regard, if I
understand Eric's description, Wabanaki is no different from the
examples that Cary gives for Swedish except that Wabanaki, to
paraphrase a comment made about another somewhat-endangered
language, has a considerably smaller army and navy than Swedish.
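
A minimal sketch of the NFC point, assuming Python's standard
unicodedata module.  The helper names (is_nfc, starts_with_mark)
are hypothetical and this is not an IDNA implementation; it only
illustrates that U+0222/U+0223 are a case pair already in NFC form
and that a label beginning with a combining mark can be detected.

```python
import unicodedata

OU_UPPER = "\u0222"  # LATIN CAPITAL LETTER OU
OU_LOWER = "\u0223"  # LATIN SMALL LETTER OU


def is_nfc(label: str) -> bool:
    """True if the label is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", label) == label


def starts_with_mark(label: str) -> bool:
    """True if the first code point is a combining mark (category M*)."""
    return bool(label) and unicodedata.category(label[0]).startswith("M")


# Hypothetical label built from the precomposed lowercase character.
label = OU_LOWER + "abanaki"

# A base letter plus a combining acute accent, composed by NFC.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

if __name__ == "__main__":
    print(OU_UPPER.lower() == OU_LOWER)
    print(is_nfc(label), starts_with_mark(label))
    print(is_nfc(decomposed), [hex(ord(c)) for c in composed])
    print(starts_with_mark("\u0301e"))
```

Getting the user from "8" or "ou" on a keyboard to U+0223 in the
first place is the input-method question and is outside what this
sketch covers.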

>>> Obviously, what ICANN gTLD registry operators do is governed
>>> by contracts between them and ICANN, and what ccTLD registry
>>> operators do is also governed, in part, by desires for
>...
> Hmm. The cost of access to the IANA root zone for language
> communities not associated with an ISO-3166-1 assigned code
> point is bounded below by the cost of access to an ICANN new
> gTLD sales event, nominally a one-time fee to ICANN on the
> order of 200,000 USD with annual recurring fees on the order
> of 50,000 USD, in addition to operational costs serving the
> language community (zone file generation and publication), and
> the transactional cost of providing policied create and modify
> access to the underlying database, associating labels and
> resources.

There are problems that the IETF could not solve even if there
were the will to do so.   One involves decisions by the Unicode
community that are unattractive for particular scripts.  In my
experience, while I'd be very interested in counter-examples,
there are few such problems with Latin-based characters unless
one gets to characters that require multiple decorations and
that can potentially be written as a base (i.e., undecorated and
typically ASCII) character plus two (or more) combining
characters or a precombined character plus one (or more) of
them.  Because some of those combinations appear to not be
resolved into a single form by normalization, there might be an
opportunity for "variant" consideration  except that ICANN, in
its wisdom (and unless things have changed recently), decided
that there is no such thing as a variant for Latin-based scripts.
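
To make the multiple-decoration case concrete, here is a small
sketch, again assuming Python's standard unicodedata module.  The
sample characters are chosen for illustration and do not come from
the thread, so they may or may not be the cases meant above.  The
first group is several spellings of "e with circumflex and dot
below", which are canonically equivalent; the second pairs the
precomposed "l with stroke" (U+0142) with "l" followed by a
combining short stroke overlay (U+0335), which look alike but are
not canonically equivalent.  The function name nfc_forms is
hypothetical.

```python
import unicodedata


def nfc_forms(spellings):
    """Return the set of distinct NFC results for the given spellings."""
    return {unicodedata.normalize("NFC", s) for s in spellings}


# Base + two marks, in both orders, and precomposed + one mark.
e_circumflex_dot_below = [
    "e\u0302\u0323",   # e, circumflex, dot below
    "e\u0323\u0302",   # e, dot below, circumflex
    "\u00ea\u0323",    # e-circumflex, dot below
    "\u1eb9\u0302",    # e-dot-below, circumflex
    "\u1ec7",          # fully precomposed
]

# Precomposed l-with-stroke versus l + combining short stroke overlay.
barred_l = ["\u0142", "l\u0335"]

if __name__ == "__main__":
    print(len(nfc_forms(e_circumflex_dot_below)))
    print(len(nfc_forms(barred_l)))
```

Where more than one NFC form survives for strings a reader would
take to be the same, normalization alone does not make the labels
match, and any equivalence has to be handled outside the DNS.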

Variants are also out of IETF scope, at least for IDNA, because
doing anything about them in anything resembling a general case
turns into a set of issues that cannot be handled in the DNS
except by externally treating names as equivalent.  As Andrew
has mentioned, there have been extensive ICANN efforts to deal
with a set of problems they have lumped together under that
term; it may be of note that there does not seem to be a single
person with experience using endangered languages or writing
systems in a DNS context in the relevant decision-making
committees.

A third thing that is out of IETF scope is ICANN's economic/
charging policies for TLD allocations.  As far as I can tell,
the combination of prices that Eric refers to and a number of
primary and second-guessing policies amounts, in practice, to
"ICANN cannot (or will not) say 'no' to anyone with sufficient
economic or political resources".  Conversely, groups
who aren't willing or able to find strong advocates within the
ICANN system or devote significant resources to providing that
advocacy themselves, are likely to lose out.  It does seem to me
that, in the presence of heavily-used search engines, the
strategy that Eric implicitly advocates is just right: the user
of such a search engine really doesn't care how deeply in the
DNS hierarchy a particular label is embedded or, in most cases,
whether that label is in language-appropriate characters or not.


Beyond that, and with the understanding that there seem to be
innumerable better places to discuss the subject than on this
list, if anyone doesn't like the ICANN policies, they should
probably treat it as a reality that nothing internal to ICANN
will change those policies because they very much work in
ICANN's best interests.  If a broader view or set of priorities
is important, then people should find appropriate discussion
forums and pay careful attention both to current oversight
models for ICANN and proposals to change them.   In the context
of this discussion, I note that, while there are apparently a
significant number of minority and endangered languages in
Brazil, there do not seem to be advocates for any of them on the
agenda of the NETMundial meetings this week.

>...
> I can't speak for the cost-benefit analysis of others, but for
> Modern Chinook, a language I've been working in since
> September, with a caseless Latin-based script containing
> combining characters (ch, c'h, kw, k'w, qw, q'w, tL, t'L, ts,
> t's, xw, Xw) (where "L" indicates a "barred-ell" and "X"
> indicates "x-with-dot-under"), each of which functions as a
> single lexical unit, as well as the rarer combining characters
> (dj, dz, zh) which also function as a single lexical unit, the
> subject of this thread, with a user community in the
> North-Eastern Pacific coastal area, as with the Wabanaki
> languages I cited originally (user community in the
> North-Western Atlantic coastal area), I can't identify a
> benefit I think likely to motivate significant de-allocation
> of resources from in-community language programs to an
> external consumer offering only a label and significant
> recurring annual fees and elevated operational cost and a loss
> of some aspects of sovereignty.

For whatever my opinion is worth, that seems to me to be exactly
the right conclusion.  Actual language programs and use are
almost certainly much more important than location of labels in
the DNS (or even use of the DNS itself).  I would add to the
"significant recurring annual fees and elevated operational
cost" concerns the added cost of being regularly represented in
ICANN forums to prevent (or at least warn against) future moves
against community interest (even ones caused by accident or
indifference rather than malice) and the search engine effects
mentioned above... but those considerations just reinforce
Eric's conclusions.

>...

Finally, to respond to Martin's comment about simplified and
traditional Chinese, that problem is very different from those
associated with other, especially "alphabetic-phonetic",
scripts, in part because those of us who did the final editing
on the JET document that put "variants" on the map made a
serious error in terminology.  But, again, it isn't a topic for
this list.

     john



More information about the Idna-update mailing list