IDNs and Language definitions and labeling (was: RE: New version, draft-faltstrom-idnabis-tables-02.txt, available)

Thu Jun 21 17:32:44 CEST 2007

--On Wednesday, 20 June, 2007 10:12 +0100 Debbie Garside
<debbie at ictmarketing.co.uk> wrote:

> Hi 
> 
> I see both sides of this and I think there could be a
> compromise.  I like Patrik's "rules" but I can see that they
> will not work without some human intervention.  Is there a way
> forward that will utilise the rules as a starting point to
> produce a base list which is then revised by UNICODE/script
> experts?
> 
> For me, as Editor of ISO 639-6, I would like to see Unicode
> Codepoints allocated to the language writing system (alpha4)
> code within ISO 639-6 - that's why I put them there! I put
> this to the CLDR group last year.  A lot of work but it would
> be a beautiful result.  Subsets of the codepoints allocated to
> a writing system could be created for IDN purposes.

Debbie,

I'm working on a separate note to address the main thread here,
at least as it appears to me, but let me see if I can help by
isolating what seems to be almost a separate thread.

One of the larger difficulties in many of the recent discussions
of IDNs -- much more so around ICANN than here -- is that people
try to make both policy and technical decisions without a
thorough understanding of the technology itself.  I'd recommend,
to you and others, a decent tutorial on what the DNS is about in
terms of design, operations, and function [1].  One then needs
to understand that IDNs are simply a set of conventions layered
on top of the DNS itself and, at least in general, how that
overlay works [2].  And, to understand this effort, one should
probably start with the summaries of issues that have been found
(or perceived) with the 2003 version of IDNA [3].
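
To make the "overlay" point concrete, here is a minimal sketch
using Python's built-in "idna" codec, which implements the 2003
version of IDNA (RFC 3490).  The helper name to_ace and the
domain "bücher.example" are illustrative assumptions only; the
point is simply that the DNS itself only ever sees an ASCII
string.

```python
# Sketch only: Python's stdlib "idna" codec follows IDNA2003
# (RFC 3490), not any later revision of the protocol.

def to_ace(name: str) -> str:
    """Convert a Unicode domain name to its ASCII-compatible form.

    Each dot-separated label is processed separately; labels that
    are already ASCII pass through unchanged.
    """
    return name.encode("idna").decode("ascii")


def from_ace(name: str) -> str:
    """Convert an ASCII-compatible name back to Unicode."""
    return name.encode("ascii").decode("idna")


if __name__ == "__main__":
    unicode_name = "b\u00fccher.example"   # hypothetical name
    ace_name = to_ace(unicode_name)
    print(ace_name)                         # what the DNS carries
    print(from_ace(ace_name))               # what the user sees
```

Note that nothing in the encoded form records which language the
registrant had in mind; only the characters survive.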

Part of that understanding (but not a quick summary or
substitute for the above) is that, while many of us are
intensely interested in identifier and referencing mechanisms
that are sensitive to language, orthography, and culture at a
level as fine-grained as the user or applications designer
thinks appropriate to his or her needs, the DNS is not a good
vehicle for that sort of work.

Because an application encountering a "DNS name" [4] has no way
to obtain information about the language the registrant had in
mind when registering the mnemonic string, the applicability of
any language-based information is quite limited.  We can use
knowledge of a language to inform choices of the scripts and
characters to be included, but that use does not require either
language tagging or a language taxonomy.
Some registries can, and do, use language information to
restrict the characters that they permit to occur together in a
given label.  Using language (or script) information that way
has become a recommended practice, but it is optional, different
registries can and do handle it differently, and the only use
for language tagging in that context involves communication
between registrant and registrar and between registrar and
registry.  There has been no demonstrated need for a single
international standard in that area and, if there were such a
need, it would be out of the scope of this effort.   
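
As a rough sketch of the kind of registration-time check
described above, a registry might refuse labels that mix
characters from more than one script.  Everything here is an
assumption for illustration: the function names are
hypothetical, and the "script" is guessed from the first word of
each character's Unicode name, which is only a crude proxy for
the real Unicode Script property (not available in Python's
standard library).  Actual registry policies differ and are
usually table-driven.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Return a rough set of scripts used in a single label.

    ASSUMPTION: the first word of the Unicode character name
    (e.g. "LATIN", "CYRILLIC") stands in for the script.  Digits
    and the hyphen are treated as script-neutral and skipped.
    """
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue
        name = unicodedata.name(ch, "UNKNOWN")
        scripts.add(name.split()[0])
    return scripts


def is_single_script(label: str) -> bool:
    """True if the label draws on at most one (proxy) script."""
    return len(scripts_in_label(label)) <= 1


if __name__ == "__main__":
    # "\u0430" is CYRILLIC SMALL LETTER A, visually close to "a".
    for candidate in ["example-1", "ex\u0430mple"]:
        print(candidate, sorted(scripts_in_label(candidate)),
              is_single_script(candidate))
```

A check like this runs only when a label is registered; an
application that later encounters the label has no access to
the policy that was applied.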

However, all of those uses occur at registration time; at the
time of name resolution, or of presentation of information to
the user, there is no language information available at all
except by heuristic on the strings themselves.  Those strings
are typically very short: registrants recognize user distaste
for typing long strings, and the opportunities for bad behavior
when there are typing errors, and so make them as short as they
can.  As a result, heuristics that work very well with
moderate-sized blocks of text will often not work well.  And,
interestingly, one of the heuristics that many people believe
they can make into a firm and useful rule won't work at all in
the general DNS case (see discussion in reference [1]).

One final observation before I encourage you to stop reading
this and start reading the references: A suggestion to base any
of this work on ISO 639-6 runs into an extra problem that you
will need to address.  The IETF has adopted a system for
language tagging that is based on ISO 639-1, 639-2, and 15924
[5].  As you can probably appreciate, we smile at the old saw
that the nice thing about standards is that there are so many of
them, but generally try to avoid standardizing or relying on
redundant, duplicative, or alternate approaches to work that is
considered finished unless there are strong justifications for
doing so.  I suggest -- with the understanding that this is just
my personal opinion -- that, if you want to see 639-6 used in
IETF-based protocols (presumably including but not limited to
IDNA), your first step is to write up a set of discussion notes,
in Internet-Draft form [6], that reviews the differences between
an approach based on 639-6 and one based on a profile of RFC
4646 or its successor and that discusses the circumstances in
which one would be more usefully applicable than the other.

best wishes and happy reading,
    john

-----------

[1] A well-vetted and reasonably balanced tutorial, oriented
toward policy makers rather than deep understanding of the
technology, is a US National Research Council Report, _Signposts
in Cyberspace: The Domain Name System and Internet Navigation_, 
http://books.nap.edu/catalog.php?record_id=11258.  For a deeper
understanding, the core DNS specifications themselves are RFC
1034 and 1035.  (RFCs can be obtained from a number of
locations.  The official location permits retrieving them by
substituting the RFC number for NNNN in
ftp://ftp.rfc-editor.org/in-notes/rfcNNNN.txt)

[2] RFC 3490, 3491, 3492, and 3454; see note [1] for how to
retrieve RFCs.
There are also several tutorials floating around, but they tend
to be addressed to a user-level understanding rather than the
understanding needed to discuss the protocol issues
intelligently.   Slideware for one of them (now somewhat dated)
is at http://ws.edu.isoc.org/workshops/2004/ICANN-KL/

[3] RFC 4690 and
http://www.ietf.org/internet-drafts/draft-klensin-idnabis-issues-01.txt.
These two documents are complementary; neither can be adequately
understood without the other.  The second one is likely to be
replaced in the next week or so with an updated version, which
will have the same URL but with "-02" substituted for "-01". 

[4] As you might have noticed in my exchange with Gervase, I've
concluded that the use of terms like "name" or "word" is just
introducing more confusion.  Many, perhaps most, DNS "names" are
not "words" in the sense of obeying the orthographic or phonetic
rules of any language; perhaps we can reduce the confusion we
are causing ourselves by shifting to "mnemonic", which more
closely describes the actual situation.

[5] RFC 4646 and
http://www.ietf.org/internet-drafts/draft-ietf-ltru-4646bis-06.txt.
For many purposes, these documents are incomplete without
"matching rules", discussed in RFC 4647.

[6] See the discussion at http://www.ietf.org/ID and the links
to information about format and tools leading from that page.