The Two Lookups Approach (was Re: Parsing the issues and finding a middle ground -- another attempt)

Sun Mar 15 00:11:45 CET 2009

John C Klensin wrote:
> The question below seems to have never been answered... sorry.
> 
> --On Saturday, March 07, 2009 15:52 +0900 Martin Duerst
> <duerst at it.aoyama.ac.jp> wrote:
> 
>> At 01:41 09/03/07, John C Klensin wrote:
>>
>>> It is worth stressing that the occurrence of this sort of
>>> problem does not depend on IDNA2008.  Paul's IDNAv2 proposal
>>> would cause it equally well, as would anything else that
>>> provides a change from Unicode 3.2 to Unicode 5.1 and, more
>>> generally, most or all future changes to Unicode that add new
>>> characters to existing scripts to improve the way in which
>>> those scripts can be expressed.
>> To be precise, only characters that interact with others
>> in the script would be problematic, not completely independent
>> characters. Or am I missing something?
> 
> I think that is correct.  But "completely independent
> characters" may be a slightly fuzzy concept in practice.  Three
> examples...
> 
> (1) Adding the Chillu to Malayalam, as Unicode 5.1 does,
> involves completely independent characters, in the sense that
> nothing was there before and they are not treated as
> compositions of characters that were in Unicode earlier.   In
> one way, that makes them "independent" of what has come before,
> but they profoundly change the way the script is coded, so they
> do interact and set up a problematic situation.
> 
> (2) Assume that the Catalan ela geminada were added to Unicode
> as a single character (I understand the odds of this are
> vanishingly small, but perhaps it makes a good illustration).
> I assume that it would go in with an NFKC equivalent of U+006C
> U+00B7 U+006C following the existing model for U+0140 (a
> character that some folks in the Catalan community believe is
> orthographic nonsense), i.e., NFKC(U+0140) -> U+006C U+00B7.  For
> IDNA2008 purposes, the fact that both U+0140 and this
> hypothetical new precomposed ela geminada would turn into other
> things under NFKC would cause them to be disallowed, but, as
> soon as one moves into an environment in which mappings are
> considered part of the standard, there is clearly an
> interaction.  Whether the new character could be considered
> "independent" or not, whether it would interact in ways that
> would present transition issues depends on what else is going on
> with the protocol.
> 
> (3) Unless one adopts a strong "if they look alike, they are the
> same thing" model (often obnoxiously expressed as "Unicode
> really works only for printing"), strong cases can be made that
> the Yiddish character that looks like the Hebrew U+05D0 U+05B7
> should be treated as an entirely separate precomposed character
> with its own code point.  In Yiddish, that is a real character
> with its own phonemic properties that should not, under any
> circumstances, compare equal to U+05D0 alone.  In Hebrew U+05B7
> is a point ("vowel") whose presence is optional (and in most
> modern circumstances, discouraged), so dropping it on a
> comparison or, for the DNS case, even banning U+05D0 U+05B7 from
> being registered, would be reasonable.  The points do not appear
> on most Hebrew keyboards or appear only as non-spacing
> (combining) marks.  The particular character in question appears
> on most Yiddish keyboards.  Now, assume that new character were
> added to Unicode and assume that it does not immediately
> transform itself back into U+05D0 U+05B7 under NFC (which would
> defeat most of the purpose of adding it).  I don't know if it
> makes it an independent character or not, but it would certainly
> have an effect on backward compatibility.
> 
>   --john
> 
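
To make the normalization point in example (2) of the quoted message concrete, here is a minimal sketch using Python's standard unicodedata module. The helper name code_points is only an illustration, and the result reflects whatever Unicode version the Python build ships, not Unicode 3.2 or 5.1 specifically. It shows that U+0140 is left alone by NFC but is replaced by U+006C U+00B7 under NFKC, which is the property that gets it disallowed.

```python
import unicodedata

# U+0140 LATIN SMALL LETTER L WITH MIDDLE DOT, the existing code point
# discussed in example (2) of the quoted message.
L_WITH_MIDDLE_DOT = "\u0140"


def code_points(text):
    """Render a string as a space-separated list of U+XXXX code points.

    Hypothetical helper for display only; not part of any IDNA library.
    """
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# NFC only applies canonical mappings, so the character is unchanged.
nfc = unicodedata.normalize("NFC", L_WITH_MIDDLE_DOT)

# NFKC also applies compatibility mappings, which split the character.
nfkc = unicodedata.normalize("NFKC", L_WITH_MIDDLE_DOT)

print("NFC :", code_points(nfc))
print("NFKC:", code_points(nfkc))

# A character that changes under NFKC is not stable under normalization.
changes_under_nfkc = nfkc != L_WITH_MIDDLE_DOT
print("changes under NFKC:", changes_under_nfkc)
```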

Excuse me for seeing a can of worms.

As a radio amateur practicing Morse code, I have seen that the
international (US English) code matches the Russian (Cyrillic),
the Turkish and both the Japanese katakana and hiragana codes
perfectly. We could communicate.

Well, the Japanese do have some 40 characters (in katakana and
hiragana only) but the Latin character set only has some 25. I
don't know how that fits. But later in the army (Bundeswehr) I
learned the German umlauts because we needed them to copy
Russian characters we did not even know. Much later I learned
they match the French accents too. I still don't know what to do
with those 40 Japanese characters, but I am told that when I hand
what I have copied from the radio to a secret service guy, he is
able to make sense of it.

Latin is a miraculous language although most of us don't speak it.
The French, Italian, Spanish, Catalan, Portuguese, and Rhaeto-Romanic
speakers almost do. When you know at least one of these languages
you most likely can read the others too. I guess it makes sense
to treat the corresponding characters as one character set.

The German umlauts (oe), (ae) and (ue), or HTML &ouml;, &auml; and
&uuml;, look very much like the French trema (e:) or &euml;, but
they mean the opposite. The umlaut means something in between
o and e, for example, but the trema means to pronounce the two
vowels separately, one after the other. It is not a good idea to
put them together into one box, except for the printing guys, and
in a German newspaper you might find them together because Joerg
does drive a Citroen. That is J-&ouml;-rg and Citro-&euml;-n.

Jearg as in yeah, yes would almost be correct but
Citreen is almost an insult. Nonetheless many Germans don't
know and pronounce it that way.
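
For what it is worth, Unicode already puts the umlaut and the trema into one box. The following is a small sketch using Python's standard unicodedata module; the helper name decomposed_names is made up for this illustration. It decomposes the o-umlaut and the e-trema with NFD and lists the names of the resulting code points. Both end in the same combining mark.

```python
import unicodedata


def decomposed_names(ch):
    """Return the Unicode names of the code points in the NFD form of ch.

    Hypothetical helper, used here only to compare two letters.
    """
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]


o_umlaut = "\u00f6"  # as in Joerg
e_trema = "\u00eb"   # as in Citroen

umlaut_parts = decomposed_names(o_umlaut)
trema_parts = decomposed_names(e_trema)

print(umlaut_parts)
print(trema_parts)

# The last element is the diacritic; compare it across both letters.
same_mark = umlaut_parts[-1] == trema_parts[-1]
print("same combining mark:", same_mark)
```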

The (sz) is exactly that: s and z, pronounced ss as in less. If I
am correct, the Swiss don't even have an (sz), and it used
to be nothing but a ligature, a thing only typesetters knew about.
They do have letters, or pieces of lead, for (ff) and (sz) and
many others because that looks better than (f)(f) or (s)(z).

In fixed-width fonts, e.g. Courier, that does not make much sense,
but in proportional fonts like Swiss and Roman it does.

As (sz) really is (s)(z), you are looking in vain for (S)(z),
and (S)(Z) makes no sense at all. The (sz) is something
that belongs in typesetting and nowhere else. The Swiss
were right to abolish it, although replacing (sz) with (s)(s)
was wrong. Maybe now is the right moment to simply get rid
of the (sz).
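
As a side note on how software treats the (sz) today, here is a short sketch using only Python's built-in string methods. It reflects the Unicode default case mappings as Python implements them, which is not necessarily what an IDNA implementation does. Uppercasing and case folding both turn the (sz) into two s letters, while the capital form added in Unicode 5.1 (U+1E9E) lowercases back to the ordinary (sz).

```python
SHARP_S = "\u00df"          # LATIN SMALL LETTER SHARP S, the (sz)
CAPITAL_SHARP_S = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S

# Full uppercasing expands the single letter into two letters.
upper = SHARP_S.upper()

# Case folding, meant for caseless comparison, expands it as well.
folded = SHARP_S.casefold()

# The capital form maps back to the ordinary small letter.
lowered_capital = CAPITAL_SHARP_S.lower()

print(repr(upper))
print(repr(folded))
print(lowered_capital == SHARP_S)

# Consequence: after case folding, "strasse" and its (sz) spelling match.
same_after_folding = "stra\u00dfe".casefold() == "strasse".casefold()
print("equal after folding:", same_after_folding)
```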

On the other hand, capital letters are typesetting in the first
place. You don't pronounce them. Capital letters are something
to aid the eye when reading. Computers are already intelligent
enough for us to simply drop capital letters and let a
grammatical parser display them when needed. That very same
parser could do ligatures as well. Sorry, that is fiction, but
keep it in mind for later :)

Imagine a parser could write the first letter of every noun as
a capital letter for German readers only. English would look a lot
more pleasing for German readers :)

Kind regards
Peter
-- 
Peter and Karin Dambier
Cesidian Root - Radice Cesidiana
Rimbacher Strasse 16
D-69509 Moerlenbach-Bonsweiher
+49(6209)795-816 (Telekom)
+49(6252)750-308 (VoIP: sipgate.de)
mail: peter at peter-dambier.de
http://www.peter-dambier.de/
http://iason.site.voila.fr/
https://sourceforge.net/projects/iason/
ULA= fd80:4ce1:c66a::/48