UTC Agenda Item: IDNA proposal

Sam Vilain sam.vilain at catalyst.net.nz
Thu Nov 23 01:20:28 CET 2006


Patrik Fältström wrote:
> I have recreated the tables using a new algorithm (based on input 
> from Kenneth mostly).
>
> (1) Use the scripts.txt file for the script definitions, do not use 
> the blocks definitions
>
> (2) Remove codepoints where cp != NFKC(cp)

ʏᴏᴜ ᴋɴᴑᴡ, ɪ ᴛʜıɴĸ ᴛʜᴇʀᴇ ᴍɪɢʜᴛ ʙᴇ ʀᴏᴑᴍ ƒᴏʀ ɪᴍᴘʀᴑᴠᴇᴍᴇɴᴛ ɪɴ ʟɑᴛɪɴ :-)

(decryption code for users of the FreeSerif font: read "M" for "L",
and "K" for "J")

I think the problem is that the NFKC process is not converting these
letters to a correct regular Latin substitute.
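
A rough way to check this, sketched in Python with the standard
`unicodedata` module.  The helper name `survives_steps_2_to_4` is made up
for illustration, and the answers depend on the Unicode version bundled
with the interpreter, which is not necessarily the one the tables were
built from:

```python
import unicodedata

def survives_steps_2_to_4(cp: str) -> bool:
    """True if one code point passes steps (2), (3) and (4) of the quoted algorithm."""
    return (
        unicodedata.normalize("NFKC", cp) == cp   # step 2: cp == NFKC(cp)
        and cp.lower() == cp                      # step 3: cp == lowercase(cp)
        and unicodedata.category(cp) == "Ll"      # step 4: class(cp) == "Ll"
    )

# A few of the small-capital look-alikes used in the sentence above.
lookalikes = ["\u028F", "\u1D0F", "\u1D1C", "\u1D0B"]
for ch in lookalikes:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), survives_steps_2_to_4(ch))
```

If these come back as passing, it is because NFKC leaves them untouched
and they are ordinary lowercase letters as far as the general category
is concerned.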

> (3) Remove codepoints where cp != lowercase(cp)
>
> (4) Remove codepoints where class(cp) != "Ll"
>
> (5) Include codepoints that are part of US-ASCII (0-9, A-Z and a-z)
>
> The result of doing this for U+0000 - U+FFFF can be found as
>
> http://stupid.domain.name/idnabis/table-ll.html

This table completely excludes all of Gujarati and Devanagari.  India
will not be able to register domains in their national language with
this table.  Many other languages have been completely excluded, too -
Mongolian, Khmer, ...

> If I instead in step 4 accept things of class both Ll and Lo, then the 
> result can be found as
>
> http://stupid.domain.name/idnabis/table-lllo.html

This table is slightly better, but still excludes all the vowel signs
from those two languages.  Without vowels they cannot write words.  I
have checked this informally with a local reader of Gujarati and Hindi.
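
The effect can be reproduced with a small sketch of steps (2) to (5),
assuming Python's `unicodedata` module.  Step (1), the scripts.txt
lookup, is not modelled here, and the function name `allowed` and the
sample word are my own choices, not anything from the tables:

```python
import unicodedata

def allowed(cp: str, classes: frozenset) -> bool:
    """Steps (2)-(5) of the quoted algorithm for a single code point."""
    if cp.isascii() and cp.isalnum():             # step 5: 0-9, A-Z, a-z
        return True
    return (
        unicodedata.normalize("NFKC", cp) == cp   # step 2
        and cp.lower() == cp                      # step 3
        and unicodedata.category(cp) in classes   # step 4
    )

LL = frozenset({"Ll"})            # first table
LL_LO = frozenset({"Ll", "Lo"})   # second table

ka = "\u0915"          # DEVANAGARI LETTER KA
sign_i = "\u093F"      # DEVANAGARI VOWEL SIGN I
guj_ka = "\u0A95"      # GUJARATI LETTER KA
guj_sign_i = "\u0ABF"  # GUJARATI VOWEL SIGN I

for ch in (ka, sign_i, guj_ka, guj_sign_i):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch),
          allowed(ch, LL), allowed(ch, LL_LO))

# The word "Hindi" written in Devanagari, checked against the second table.
word = "\u0939\u093F\u0928\u094D\u0926\u0940"
print([allowed(c, LL_LO) for c in word])
```

The consonants are class Lo, so they only appear once Lo is accepted;
the vowel signs and the virama are combining marks (M* classes), so
neither table lets them through.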

Also interesting is that only some of the Devanagari letters are being
decomposed by the NFKC algorithm (eg U+0959 : ख़ into U+0916 U+093C : ख
  ़़, but U+0912 : ऒ is not broken into U+0906 U+0946 : आ   ॆ ), but none
of the Gujarati letters are (eg, U+0A8D : ઍ not broken into U+0A85
U+0AC5 : અ   ૅ).  This seems to be the case with *all* of the
Sanskrit-based languages, even Tibetan.
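
The three examples above can be checked with a short sketch, again
assuming Python's `unicodedata` (the helper `nfkc_codepoints` is a name
I made up, and the result reflects whatever Unicode version the
interpreter ships):

```python
import unicodedata

def nfkc_codepoints(ch: str) -> list:
    """Return the code points of NFKC(ch) as 'U+XXXX' strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKC", ch)]

for ch in ("\u0959", "\u0912", "\u0A8D"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", nfkc_codepoints(ch))
```

As far as I can tell, U+0959 is split because it carries a canonical
decomposition and is on the composition exclusion list, while the other
two simply have no decomposition mapping in the character database.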

There is still the issue of the Sanskrit "OM" letter existing in both the
Devanagari (ॐ) and Gujarati (ૐ) scripts.  These are both marked as
"OK" in the latter table.  But they are the same symbol!
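
A minimal sketch of why both survive, assuming Python's `unicodedata`:
each is a separate code point of class Lo, and NFKC does not map one to
the other.

```python
import unicodedata

oms = ["\u0950", "\u0AD0"]  # the Devanagari and Gujarati OM code points
for ch in oms:
    unchanged = unicodedata.normalize("NFKC", ch) == ch
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          unicodedata.category(ch), "NFKC unchanged:", unchanged)
```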

> Please let me know what you think.

One other question - how does the table for CJK/Han characters compare
with the tables referred to in RFC3743 and RFC4713?

Note that my questions are only on behalf of major New Zealand
languages, which include a few Western European languages, Māori and
some other Pacific Island languages like Samoan and Tongan,
Traditional and Simplified Chinese, Hindi and Gujarati.

I really think that you need to get linguists on the case here from
around the globe, make sure they really understand the homograph
issue, and get them to approve the tables for individual languages,
along with providing a list of example words for that language
demonstrating a representative portion (or, where possible, complete
coverage) of the characters that are necessary.

Ideally this would come down to each country to arrange, but how many
of them are aware of this discussion group?  Ok, it's not being held
four light years away on display in the bottom of a locked filing
cabinet stuck in a disused lavatory with a sign on the door saying
"Beware of the Leopard".  But, considering the language, social and
socio-economic barriers involved, it may as well be.  How will the
innovators in those countries feel when they eventually start looking
at using their own script for domain names, but are told it won't work
with all standards-conformant browsers on the planet, because the
people drafting the standards did not take the time to ask them how
their script works?  This is I18N, not I13N¹...

> I have this comment regarding one entry from class Lm:
>
>>>  | Exclude  | U+02BB | U+02BB | Lm    | MODIFIER LETTER TURNED COMMA |
>>>  | Exclude  | U+02BC | U+02BC | Lm    | MODIFIER LETTER APOSTROPHE   |
>>>
>> As ASCII isn't directly encodable using Punycode, one of these is going
>> to be needed to be allowed for Pacific languages, which use the
>> apostrophe. eg, Hawaiʻi. It is often ignored, but in languages like
>> Tongan it can make a difference.
> I have not taken this into account when creating these tables.

Yeah.  I think this one should be sneaked in, it's also used by at
least one South American language as a consonant, too.  Earlier we got
to the point that it was illegal because it looks like a character
which interferes with parsing, but I don't think that parse-ability
commutes over confuse-ability, such that it follows that it should be
excluded.  Only one should be allowed, certainly - probably the apostrophe
rather than turned comma, because it was a symbol commissioned into
use by people who would have described it as an apostrophe, not even
using the American dialect-specific term "turned comma" :).  But I
guess if someone can find a decent argument to the contrary (other
than the Wikipedia editorial note), then the arbitrary decision could
fall the other way instead.
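
For reference, a small sketch of how the two candidates look to the
character database, assuming Python's `unicodedata` and its built-in
`punycode` codec.  The label `hawai\u02BBi` is only an example, and this
is the bare Punycode step, not a full IDNA ToASCII operation:

```python
import unicodedata

candidates = ["\u02BB", "\u02BC", "\u0027"]  # turned comma, modifier apostrophe, ASCII apostrophe
for ch in candidates:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# Round-trip an example label containing U+02BB through Punycode.
label = "hawai\u02BBi"
encoded = label.encode("punycode")
print(encoded)
print(encoded.decode("punycode") == label)
```

Both modifier letters are class Lm, which is why neither the Ll nor the
Ll+Lo table picks them up, whereas the ASCII apostrophe is punctuation.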
-- 
Sam Vilain, Systems Architect, Catalyst IT (NZ) Ltd.
phone: +64 4 499 2267        PGP ID: 0x66B25843

¹ - Imperiali[sz]ation :)


