Visually confusable characters (2)

John C Klensin klensin at jck.com
Mon Aug 11 14:12:52 CEST 2014



--On Monday, August 11, 2014 01:19 -0700 Asmus Freytag
<asmusf at ix.netcom.com> wrote:

>> Everything else is "less control". Specifically as John says
>> with the non-contracted parties, and even less than that with
>> parties not even participating in the ICANN processes. Like
>> some ccTLDs.
> 
> Patrik,
> 
> no need actually :)
> 
> We all agree that the best that can happen is that any
> successful conclusion of the TLD project will reflect
> expertise that other parties might want to utilize simply by
> copying, or at least emulating.

Actually not, and therein lies another important property of the
distributed hierarchy.   Some examples that have come up in
active discussions (I don't know whether there are labels
corresponding to those discussions or not, but allowing for them
is important):

(1) The framework for the LGR prohibits "archaic" scripts and
writing systems explicitly and prohibits them again because of
the criteria for organizing a generation panel.  But the
discussion around the MUSEUM TLD, among others, pointed out that
use of such scripts in domain names by institutions with
departments devoted to the relevant cultures and societies might
be completely appropriate.   A decision to use an archaic script
for a domain name label in that context would have the effect of
restricting global usability of the relevant domain (note
interaction with my "advice" in another note).  That might or
might not be a desirable tradeoff, but making the tradeoff
decisions is (and quite appropriately should be) up to the
domain administration for the museum or equivalent institution.  

The same issues and considerations would obviously apply to some
booksellers, antiquities dealers, scholarly publications, and so
on.

(2) At least when I was last following the permissible code
point discussion for the LGR process, it appeared that all
CONTEXTx characters would be prohibited in TLD labels because
they are problematic in one way or another.  The issue is often
visual (but not "confusability" in the usual sense): several (or
most) of the characters with those two IDNA derived properties
(CONTEXTJ and CONTEXTO) are known to be hard to handle,
especially for someone who is unfamiliar with the script, is
using a rendering engine that is weak compared to the needs of
the script involved, and/or is trying to enter characters by
picking from a table.  On the
other hand, those characters are required to intelligently write
a large number of strings, even mnemonic ones, in some usages of
some scripts, so, in a subdomain in which the administrator
decided those considerations were more important than universal
usability or rendering, labels containing them might be
perfectly reasonable.

(3) There is a prohibition on digits (at least European digits
coded in ASCII, but I believe it would be harmful for the LGR
process to push that boundary for U-labels) in TLD names that is
almost as old as the DNS.  Labels with digits within TLDs and
below may make perfectly good sense and, indeed, are widely used
(and for more than the standard mentioned below and ENUM
structured labels).
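
As a rough illustration of how the same label can be acceptable
at one level of the tree and not at another, the sketch below
flags the code points discussed in (2) and (3).  The CONTEXTJ and
CONTEXTO sets are the ones listed in RFC 5892; the function names
and the two "policies" are hypothetical, invented for this
example, and are not part of the LGR procedure or of any IDNA
library.  The sketch only reports the presence of such code
points; it does not evaluate the contextual rules themselves.

```python
# Code points with the IDNA2008 derived properties CONTEXTJ and
# CONTEXTO, as listed in RFC 5892.
CONTEXTJ = {0x200C, 0x200D}  # ZERO WIDTH NON-JOINER, ZERO WIDTH JOINER
CONTEXTO = (
    {0x00B7, 0x0375, 0x05F3, 0x05F4, 0x30FB}
    | set(range(0x0660, 0x066A))  # Arabic-Indic digits
    | set(range(0x06F0, 0x06FA))  # Extended Arabic-Indic digits
)


def objections(label, allow_context=False, allow_ascii_digits=False):
    """Return (position, code point, reason) for each flagged character.

    Hypothetical helper: it only detects the characters, it does not
    apply the RFC 5892 contextual rules.
    """
    found = []
    for i, ch in enumerate(label):
        cp = ord(ch)
        if not allow_context and (cp in CONTEXTJ or cp in CONTEXTO):
            found.append((i, f"U+{cp:04X}", "CONTEXTx"))
        elif not allow_ascii_digits and ch in "0123456789":
            found.append((i, f"U+{cp:04X}", "ASCII digit"))
    return found


def root_policy(label):
    """Conservative (hypothetical) policy in the spirit of the root zone."""
    return objections(label)


def local_policy(label):
    """Permissive (hypothetical) policy a lower-level zone might choose."""
    return objections(label, allow_context=True, allow_ascii_digits=True)
```

With these definitions, a node name such as "rtr7" or a label
containing U+200C is flagged by root_policy() but passes
local_policy(), which is the tradeoff a domain administrator,
rather than a root-zone panel, would be making.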

Let me draw on one of your comments to generalize the situation
without making this note long with examples.  Human text, and
even mnemonics that have value in that text, is complicated.
The complications require making decisions about the balance
among various considerations and tradeoffs. Domain names are
intended to be deeply hierarchical (and often used that way) and
not only identify things other than web pages but may be
indicators or suppliers of information not related to objects at
all.   The "mnemonics, not words" issue actually complicates
things further because various obviously-not-word labels may be
appropriate (e.g., there are widely-followed (non-IETF)
standards for naming of certain types of network nodes that just
about require mixtures of letters and digits that could not
possibly be words).  

For TLD labels, global usability, clarity, and lack of ambiguity
should be major concerns.  That implies a minimum of reliance on
complex rendering, on characters that may not appear in every
engine that has to render or support user input for domain
names, etc., as well as adherence to the spirit (as well as the
letter) of rules designed to prevent even a chance of confusion
between domain names and other network names and objects.  Some
of the same considerations suggest that characters newly-added
to Unicode should not be allowed in root zone labels until some
years (and versions) have gone by.   The generation panel model
creates a really strong bias against obscure scripts (not just
archaic ones) and that is probably a good thing.   None of that
is really about visually confusable strings: if they were the
only issue, the fact that ICANN has visibility into the entire
root zone and all new allocation decisions means that it would
be possible to adopt a simple "is this new application
confusable with something that is there already" rule as a
principle and not struggle with attempts to define universal
rules (for all practical purposes, that actually has been the
rule for the last decade or so).
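
A minimal sketch of what such an "is this confusable with
something that is there already" check could look like, assuming
a single party can see every label in the zone.  The mapping
table is a tiny hand-picked illustration, not real confusables
data, and skeleton() and conflicts() are hypothetical names; a
real check would need a vetted table and human review.

```python
import unicodedata

# Illustration only: a few Cyrillic letters mapped to the Latin
# letters they resemble.  Not a complete or authoritative table.
CONFUSABLE_TO_PROTOTYPE = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043E": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}


def skeleton(label):
    """Normalize, lowercase, and replace look-alikes by a prototype."""
    folded = unicodedata.normalize("NFC", label).lower()
    return "".join(CONFUSABLE_TO_PROTOTYPE.get(ch, ch) for ch in folded)


def conflicts(candidate, existing_labels):
    """Return the existing labels whose skeleton matches the candidate's."""
    target = skeleton(candidate)
    return [lab for lab in existing_labels if skeleton(lab) == target]
```

For example, with a hypothetical zone containing "com" and
"pace", conflicts("\u0440\u0430\u0441\u0435", ["com", "pace"])
reports "pace".  The point is that the comparison is made against
what is already delegated, which is only possible where someone
has visibility into the whole zone.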

If the LGR Generation and Integration panels don't understand
those issues and the priority of those considerations in making
tradeoffs we are, IMO, going to be in serious trouble.

By contrast, at lower nodes in the tree, different
considerations apply.  In the most extreme case, labels may not
identify objects that will be used in ways in which mnemonic
labels or the ability to display them are necessary (or even
desirable), possibly even to the point that a label consisting
of a dozen or two randomly-generated octets might be reasonable.
Even within the bounds of IDNA, domain administrators may make
different tradeoff decisions than would likely appeal to a panel
charged with reviewing labels for the root zone: the tradeoffs
may favor allowing labels that differ by the presence or absence
of joiners or non-joiners; abbreviation or numeric markers;
archaic scripts; scripts that don't have large enough (or
economically-well-endowed enough) communities to organize and
staff generation panels; mixed-script labels; and other things
that would be bad ideas for the root.  

Whether to actually use any of the labels that implies is, again,
a tradeoff, but I'm fairly sure that the community doesn't want
the decisions, or even the guidelines, to be applied top-down.
If they were, the first headline would be "ICANN Label
Generation Integration Panel is hostile to endangered languages
and writing systems".  I hope you don't want to go there, but
that is a natural implication of what you are suggesting.

    john



More information about the Idna-update mailing list