AW: AW: AW: sharp s (Eszett)

John C Klensin klensin at jck.com
Tue Mar 18 00:36:54 CET 2008



--On Thursday, 13 March, 2008 17:33 +0900 Martin Duerst
<duerst at it.aoyama.ac.jp> wrote:

> I think saying that domain names always have been
> case-insensitive is in some way putting the cart before the
> horse.
> 
> What was done with domain names was that it was observed
> that certain groups (pairs, in fact) of letters were used
> virtually interchangeably. It was decided that the system
> would match these for user convenience, while keeping the
> distinction of which variant was registered. It turned out
> that these pairs of letters were related by case, and that
> the shortest way to characterize this behavior was to say
> that domain names are case-independent.

Martin, while this is an interesting, and possibly helpful,
explanation, it is not historically correct.  The DNS "host
name" rules, including case-matching, were derived from earlier
host name rules which were, in turn, based in part on rules from
some other systems.   They came along at a time when
case-distinguished character coding was not yet fully
established: some of the systems that influenced the ARPANET
were single-case-only while others were mixed-case.   The
convention about single-case systems was that all letters were
in upper case; the convention about the dual-case ones was that
lower case was the preferred one.

As far as I know (or recall) ASCII and first-generation EBCDIC,
which were roughly contemporaneous, were the first serious
attempts at general-purpose character sets that could represent
case distinctions.  Both were organized so that a one-bit AND or
OR operation could be used for case-mapping or case-independent
testing. I don't know whether that was an explicit design
decision/goal or just something that happened for other reasons.
Clearly, there would have been advantages to interleaving lower
and upper-case characters, but it wasn't done in either coding
system.  Curiously, EBCDIC collated lower-case characters before
upper-case ones and ASCII did it the other way around.  
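
The one-bit property can be sketched as follows.  This is only an
illustration, not anything taken from the original specifications:
the helper names are hypothetical, Python's "cp037" codec is assumed
as a stand-in for EBCDIC (other EBCDIC variants exist), and the
operations are meaningful only when applied to letters.

```python
import string

ASCII_CASE_BIT = 0x20   # set in ASCII lower-case letters, clear in upper-case
EBCDIC_CASE_BIT = 0x40  # set in EBCDIC (cp037) upper-case letters, clear in lower-case


def ascii_to_upper(code: int) -> int:
    """Clear the case bit with a single AND (valid for ASCII letters only)."""
    return code & ~ASCII_CASE_BIT


def ascii_to_lower(code: int) -> int:
    """Set the case bit with a single OR (valid for ASCII letters only)."""
    return code | ASCII_CASE_BIT


def ebcdic_to_upper(code: int) -> int:
    """In EBCDIC the sense is reversed: OR sets the bit and yields upper case."""
    return code | EBCDIC_CASE_BIT


def ebcdic_to_lower(code: int) -> int:
    """AND clears the bit and yields lower case (valid for EBCDIC letters only)."""
    return code & ~EBCDIC_CASE_BIT


def ascii_letters_match(a: int, b: int) -> bool:
    """Case-independent test for two ASCII letters: force the case bit, then compare."""
    return (a | ASCII_CASE_BIT) == (b | ASCII_CASE_BIT)


if __name__ == "__main__":
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase):
        e_lower = lower.encode("cp037")[0]
        e_upper = upper.encode("cp037")[0]
        assert ascii_to_upper(ord(lower)) == ord(upper)
        assert ebcdic_to_upper(e_lower) == e_upper
        # Collation order differs: ASCII puts upper case first, EBCDIC lower case.
        assert ord(upper) < ord(lower)
        assert e_lower < e_upper
```

Applied to digits or punctuation, the same operations would corrupt
the character, so a real implementation has to check that the code
is a letter first.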

In that context, case-independent matching was a natural
property (probably the only plausible option) for host names and
that convention was carried forward into the DNS.  And in the DNS,
case folding was applied at comparison (lookup) time rather than to
the stored form, because that was the convention from many years
earlier.
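
A minimal sketch of "fold when comparing, keep what was registered"
might look like this.  The HostTable class and fold_ascii helper are
hypothetical names for illustration, the addresses come from the
documentation range, and indexing the table by the folded name is an
implementation shortcut assumed here, not a description of how any
actual DNS server works.

```python
def fold_ascii(name: str) -> str:
    """Lower-case only the letters A-Z; every other character is left alone."""
    return "".join(
        chr(ord(c) | 0x20) if "A" <= c <= "Z" else c
        for c in name
    )


class HostTable:
    """Toy host table: matching ignores ASCII case, storage preserves it."""

    def __init__(self):
        # folded name -> (name exactly as registered, address)
        self._entries = {}

    def register(self, name: str, address: str) -> None:
        self._entries[fold_ascii(name)] = (name, address)

    def lookup(self, query: str):
        """Return (registered name, address), or None if nothing matches."""
        return self._entries.get(fold_ascii(query))


if __name__ == "__main__":
    table = HostTable()
    table.register("Example.COM", "192.0.2.1")
    # The query matches regardless of case, and the answer still
    # carries the spelling that was registered.
    print(table.lookup("example.com"))
```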

> Extending the repertoire for domain names means that decisions
> should not just be based on using the same general concept/
> label/data file as for the basic Latin case.

There were many reasons for this, some of which you have
explained, but two more were that (i) for the basic Latin case,
case mappings to and from lower and upper case are fully
symmetric as long as those operations are considered on a
character by character basis and (ii) the "case mapping"
operation could be performed unambiguously and simply by
single-bit logical operations.   As soon as one moves away from
the basic Latin set, case mappings become idiosyncratic, with
per-character rules needed, and quickly slide into the sorts of
debates about equivalency that have dominated this thread.
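
The loss of symmetry is easy to demonstrate with the default
(locale-independent) Unicode case mappings that Python's str.upper()
and str.lower() apply.  The helper name below is hypothetical, and
the three sample characters are chosen only as illustrations; they
are not an exhaustive list.

```python
import string


def survives_round_trip(ch: str) -> bool:
    """True if mapping a lower-case character up and back down returns it."""
    return ch.upper().lower() == ch


if __name__ == "__main__":
    # Basic Latin: every lower-case letter comes back unchanged.
    assert all(survives_round_trip(c) for c in string.ascii_lowercase)

    samples = {
        "\u00df": "sharp s: upper-cases to 'SS', which lower-cases to 'ss'",
        "\u03c2": "Greek final sigma: shares its upper case with medial sigma",
        "\u0131": "dotless i: upper-cases to 'I', which lower-cases to dotted 'i'",
    }
    for ch, note in samples.items():
        print(repr(ch), repr(ch.upper()), repr(ch.upper().lower()), note)
        assert not survives_round_trip(ch)
```

None of these three can be handled by a single-bit operation; each
needs its own table entry, and for the dotless i the appropriate
mapping additionally depends on the language.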
  
> That may be
> a good starting point, but as I have explained in my mail
> about exceptions, human script usage is very varied.
> Rather than saying "domain names are case insensitive,
> what's the closest we can do for case insensitive for
> the sharp s", we should say "what's the best way
> (within the very general constraints of the DNS) to
> make sure that those characters that users think of
> as pretty fundamentally different are treated as different,
> whereas those characters that users see as virtually the
> same are treated as one".

Agreed.  But we also have to recognize that the answer to "are
these virtually the same" may be "sometimes".  For example,
whether a final or initial-form character is "virtually the
same" as the medial one on which it is based often depends very
sensitively on the purpose for which one is considering them
"equivalent".  They don't look the same.  In principle, all
three forms might exist and have case-pairings although that
doesn't often (ever?) happen in practice.
 
> If the result of that consideration is case-insensitivity
> (as defined by a particular table) for that case, that's
> of course fine, but there may be other results. Imagine
> a script with a case distinction, but where there is a
> very firm tradition to use upper-case for official,
> government-related names, and lower-case for company and
> private names. In such a (hypothetical) case, it would
> probably be better to treat that script as case-sensitive
> in DNS.

Exactly.  And, to make life a little worse, that sort of
distinction is more likely to occur on a language basis than in
all languages that use a particular script.

    john




More information about the Idna-update mailing list