further feedback from PAN L10n project

Sat Apr 5 14:08:53 CEST 2008

Hello Sarmad,

At 00:04 08/04/05, Sarmad Hussain wrote:

>I also have a question: 
> 
>Should the composed forms be DISALLOWED if the decomposed forms are valid?  It seems to be the case in some languages but not the case in some others.  For example, as indicated in the subject document, some Arabic script letters are PVALID both in their composed and decomposed forms, while in Bengali and some other languages the composed forms are DISALLOWED and decomposed forms of the same are PVALID.  Is this an "inconsistency" or expected behavior?

Only either composed or decomposed forms should be valid.
IDN uses NFKC, so in general, composed forms are used.
But there are some exceptions, which can be found in
http://unicode.org/Public/UNIDATA/CompositionExclusions.txt.
The list includes the Indic (Devanagari, Bengali, Gurmukhi,
and Oriya) letters with Nukta, some Tibetan letters
and some Hebrew letters. For these cases, decomposed forms
are preferred even in NFC/NFKC because that's what most
legacy encodings and current systems use (or so we have
been told when NFC/NFKC was created).

In the particular case of Bengali, please note that the current
status of the IDNAbis tables does NOT disallow the use of these
characters, it just means that they will be represented
internally in decomposed form. The same applies to Tibetan.
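
To make the point about composition exclusions concrete, here is a
small Python sketch using the standard unicodedata module. The helper
name idn_normalize is only for illustration (it is not part of any
IDNA library), and it shows just the NFKC step, not the full IDNA
processing. U+09DC (BENGALI LETTER RRA) and U+0958 (DEVANAGARI LETTER
QA) are both on the exclusion list, so they come out decomposed:

```python
import unicodedata


def idn_normalize(label):
    """Hypothetical helper: apply only the NFKC step discussed above."""
    return unicodedata.normalize("NFKC", label)


# Bengali RRA: precomposed U+09DC vs. DDA (U+09A1) + nukta (U+09BC).
bengali_composed = "\u09DC"
bengali_decomposed = "\u09A1\u09BC"

# Devanagari QA: precomposed U+0958 vs. KA (U+0915) + nukta (U+093C).
devanagari_composed = "\u0958"
devanagari_decomposed = "\u0915\u093C"

# Both spellings end up as the decomposed sequence, because the
# precomposed characters are composition exclusions.
print(idn_normalize(bengali_composed) == bengali_decomposed)
print(idn_normalize(bengali_decomposed) == bengali_decomposed)
print(idn_normalize(devanagari_composed) == devanagari_decomposed)

# Contrast: a letter that is not excluded is composed as usual
# (e + combining acute accent becomes U+00E9).
print(idn_normalize("e\u0301") == "\u00E9")
```

So a user can still enter either spelling; what changes is only which
one remains after normalization.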

As for Arabic, I'm not sure what you mean by composed and
decomposed form. If you refer to Table 3.13, this is indeed
a list of character pairs that are easy to confuse, but this
is not because one of them is composed and the other is
decomposed. I think these pairs should be examined in
great detail by each registry that plans to allow the
use of the Arabic script. To give just one example, U+0661
and U+06F1 are both the digit 1, once in (western) Arabic
form and once in Eastern Arabic form. In the PDF, they
look somewhat different (the Eastern Arabic one looks like
a superscript), but in general, there is no such difference;
they look exactly the same. Because some of the digits in
(western) Arabic and Eastern Arabic look very different
(to the extent that they may not be recognized) (and probably
because these series were already separated in some legacy
encodings used as the base for Unicode), the full series of
digits was encoded twice. It is very obvious that this may
be a big source of spoofing if not addressed properly
(either by allowing only one series, as for example the
Iranian registry did, or potentially by bundling).
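
As a rough illustration of the digit problem and of the two remedies
mentioned above, here is a Python sketch. The function names
(digit_series, mixes_digit_series, bundle_key) are made up for this
example, and the policy they encode is just one possibility, not what
any particular registry does. Note that normalization does not help
here: NFKC leaves both digit series unchanged.

```python
import unicodedata

ARABIC_INDIC = range(0x0660, 0x066A)           # U+0660..U+0669
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)  # U+06F0..U+06F9

# Map each Extended Arabic-Indic digit to its Arabic-Indic twin.
_TO_ARABIC_INDIC = {0x06F0 + i: 0x0660 + i for i in range(10)}


def digit_series(label):
    """Return the set of digit series that occur in the label."""
    series = set()
    for ch in label:
        cp = ord(ch)
        if cp in ARABIC_INDIC:
            series.add("arabic-indic")
        elif cp in EXTENDED_ARABIC_INDIC:
            series.add("extended-arabic-indic")
    return series


def mixes_digit_series(label):
    """True if the label uses digits from both series at once."""
    return len(digit_series(label)) > 1


def bundle_key(label):
    """Hypothetical bundling key: labels with equal keys go together."""
    return label.translate(_TO_ARABIC_INDIC)


one_a = "\u0661"  # ARABIC-INDIC DIGIT ONE
one_b = "\u06F1"  # EXTENDED ARABIC-INDIC DIGIT ONE

# Same numeric value, different code points, untouched by NFKC.
print(unicodedata.digit(one_a), unicodedata.digit(one_b))
print(unicodedata.normalize("NFKC", one_b) == one_b)

# A registry-side check and a bundling comparison.
print(mixes_digit_series(one_a + one_b))
print(bundle_key("\u06F1\u06F2") == bundle_key("\u0661\u0662"))
```

Rejecting mixed labels alone is not enough, since two labels that each
use a single series can still look identical; that is the case the
bundling key (or allowing only one series) is meant to cover.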

Also, on 3.5, it seems that your Mongolian experts went through
all of the Cyrillic alphabet. For such a widely used alphabet,
it should be very clear that there are many letters that are
not used in all languages written with Cyrillic. The table
for Mongolian should therefore in the end list the
characters needed, and in particular those that might currently
be DISALLOWED (currently none). Listing all the Cyrillic
characters that should be disallowed in Mongolian domain
names isn't very helpful for anybody; it just uses a lot of space.
The situation is very similar for Pashto and the Arabic script.
The same applies for some other languages, such as Nepali
written with Devanagari, although it's less of an issue
because the tables are not as big.

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp