support of metadata

Mon Sep 14 17:40:06 CEST 2009

--On Monday, September 14, 2009 16:42 +0900 "Martin J. Dürst"
<duerst at it.aoyama.ac.jp> wrote:

> Hello John, Jean-Michel, others,
> 
> On 2009/09/14 11:39, John C Klensin wrote:
>> 
>> --On Monday, September 14, 2009 02:11 +0200 jean-michel
>> bernier de portzamparc<jmabdp at gmail.com>  wrote:
>> 
>>> Dear colleagues,
>>> among the points we introduced during the WG/LC that have not
>>> been addressed yet is the end-to-end support of script
>>> oriented metadata (one example being the French majuscules).
>>> Metadata can be supported either:
>>> 
>>> - implicitly through an unlike sequence of PVALID codes (ex:
>>> FE73-0061 ... 007A)
>> 
>> Since there is no prohibition on such strings, nothing
>> prevents you from using them and interpreting them in a
>> special way, assuming that FE73 is not problematic from a
>> Bidi standpoint (while it is identified as an "Arabic"
>> character, the code point does not appear in
>> ArabicShaping.txt, which drives Bidi).
> 
> Where in Bidi does it say so? The Bidi document refers to Bidi
> properties, and these are defined in UnicodeData.txt
> (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt).
> There, U+FE73 is AL (Arabic Letter), which means that the
> above won't work exactly as proposed. Of course, there are
> ample other characters in Unicode which may be suited for
> misuse for the above mentioned purpose.

Sorry.  You are absolutely correct.  Fuzzy thinking on my part
about the Unicode reference -- I looked in UnicodeData and
didn't see what I was looking for (probably just too tired) so
assumed ArabicShaping.  But, as you point out, this is invalid
under Bidi, and hence raises exactly the same issues as an
UNASSIGNED code point.   And, as you also point out, one could
find other characters that do not violate any IDNA-specific rule
to misuse, to which my other comments would apply.
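
For anyone who wants to check the property rather than trust either
of our memories, here is a minimal sketch in Python.  It relies on
the standard unicodedata module, which reports the Bidi class taken
from UnicodeData.txt.  The RTL_ALLOWED set and the helper names are
illustrative assumptions: the set follows the condition for
right-to-left labels as it was later published in RFC 5893, and the
function tests only that one condition, not the whole Bidi rule.
The label "etat" is just a stand-in for the 0061 ... 007A part of
the proposed sequence.

```python
import unicodedata

# Bidi classes allowed in a right-to-left label. Assumption: this set
# follows the RTL-label condition as published in RFC 5893; only that
# single condition is modeled here, not the complete Bidi rule.
RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def bidi_classes(label):
    """Return the Bidi class of each character, per UnicodeData.txt."""
    return [unicodedata.bidirectional(ch) for ch in label]


def rtl_label_has_disallowed(label):
    """True if the label starts with R or AL and contains a class
    that is not permitted in a right-to-left label."""
    classes = bidi_classes(label)
    if not classes or classes[0] not in ("R", "AL"):
        return False  # not an RTL label; this check does not apply
    return any(c not in RTL_ALLOWED for c in classes)


# U+FE73 followed by ASCII letters, as in the proposed marker scheme.
label = "\ufe73" + "etat"
print(bidi_classes(label))
print(rtl_label_has_disallowed(label))
```

Because U+FE73 has class AL, a label that begins with it is treated
as right-to-left, and the ASCII letters that follow (class L) are
then what trips the check.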

>...
>>> If the WG documents remain unchanged in terms of French
>>> majuscules support, the support of the two will be offered as
>>> a response to the "+" entry. Ex. http://+Etat.fr.
> 
> While I'm writing this mail, some comments on majuscules that
> I have been thinking about for quite a while.
> 
> On careful reading, the French article
> http://fr.wikipedia.org/wiki/Majuscule and the English
> counterpart at http://en.wikipedia.org/wiki/Majuscule aren't
> too different at all. Not only French, but a wide range of (if
> not all) European languages know a difference between
> 'majuscules' and 'capitales', and good orthography and
> typography is impossible without these concepts, even if they
> may be less explicitly distinguished in other languages than
> in French.

Of course, if the focus is "good orthography and typography",
the German practice for how nouns are displayed may be an
equally, if not more, significant example.

> The reason why this distinction hasn't made it into character
> encoding is in part historical (fewer computers than
> typewriters), but a big part of it, in my opinion, has to be
> attributed to the fact that a large majority of the population
> everywhere around the world thinks primarily visually. I.e.
> most people everywhere around the world want an upper case
> letter when they want an upper case letter and a lower case
> letter when they want a lower case letter, and on first
> approximation, they don't care whether something is a
> 'majuscule' or a 'capitale' because they both look the same.
> Trying to teach everybody to always be aware of the difference
> and press the right shift key would simply be impossible.
> That's not only the case for this specific difference, but is
> also a widely reported phenomenon on other levels, such as
> document appearance vs. document structure (think nicely
> structured, valid (X)HTML vs. "it has to look the same on
> every browser").

Yes, probably.  To expand a bit on your typewriter observation
(which I've mentioned before in what I believe to be similar
contexts), most attempts to standardize orthography during the
period of typebar-style typewriters and early-generation hot
type machines involved "simplification", i.e., eliminating
actual character distinctions and complete characters, often
in favor of look-alikes or other substitutions.   The
disappearance of a number of digraphs, ligature distinctions,
and medial/final form distinctions from the orthographies of
several Germanic-based languages in which they were once used,
the elimination of several characters from Russian in the first
half of the 20th century, and so on are all examples (perhaps
better examples than the Simplified-Traditional Chinese
distinction where the characters themselves were simplified but
few fundamental form distinctions were actually eliminated).
Some of those characters, which were once considered
orthographically and typographically necessary, have disappeared
to the point that there are no Unicode code point assignments
for them (and the titlecase/capital distinction, a close
relative of the majuscule/capitale one, is actually made for
only a very few characters).

While I've seen typebar-style typewriters with three, and even
four, characters per bar, the mechanical complexities of making
such things work without jamming or other typing impediments,
and practical limits on the number of keys in a typebar-basket
arrangement, are probably far more important than having people
deal with multiple shift characters (after all, we
computer-oriented folks deal with three-shift keyboards every
day and some of us can remember space-cadet keyboards with far
more).  There is at least some reason to believe that those
mechanical difficulties influenced the orthographic reforms.

However, none of this has anything to do with the present
situation.  DNS entries are not words or sentences in a language
that are subject to rules about typographic or orthographic
correctness; they are simply identifiers.   As has been pointed
out on this list several times, a very large fraction of the
labels in the DNS today are not words in any language and in
fact violate orthographic norms for construction of such words
(with embedded digits being a popular example).  The "hostname"
rules were designed in the early 1970s to strike a reasonable
balance between good mnemonics and good identifiers (where
"good" for the latter includes minimizing actual or perceptual
ambiguity).  IDNA has tried to follow that same principle of
balance, including the decision to map out (in IDNA2003) and
prohibit (in IDNA2008) compatibility characters, but the
orthographically-correct representation of every word or string
in every language has never even been a goal, much less an
objective we have somehow failed to meet.

There may be an important message in this discussion, but it
doesn't have anything to do with domain names.  When we actually
label and identify bodies of text with language or script
information for reading and processing --in email messages, for
web pages, etc.-- our current model of identifying a "charset"
and language (via LTRU) may not be sufficient.  While we do have
a comparator registry, it isn't tightly linked to the
"charset/LTRU tag" labels.   Perhaps language coding needs to be
expanded to distinguish texts that should be treated as if
majuscule/capitale distinctions are important from texts in
which they are not.  But, again, that has nothing to do with
domain identifiers or with IDNA in its role of making it
possible to accommodate non-ASCII identifiers in the DNS.

best,
   john