UTC Agenda Item: IDNA proposal

Thu Nov 23 22:52:19 CET 2006

Patrik Fältström wrote:
>> This table completely excludes all of Gujarati and Devanagari.  India
>> will not be able to register domains in their national language with
>> this table.  Many other languages have been completely excluded, too -
>> Mongolian, Khmer, ...
>>     
> I need help with what you want me to add to the algorithm...not only  
> "this is wrong", because that will not move us forward.
>
> Is what you try to say is that "please also accept class Mc"? Or is  
> it "please accept class Mc, but only for Gujarati and Devanagari"? If  
> the latter, what to do with Khmer and Mongolian (etc)?
>   

OK, here are the results of my own analysis of the Gujarati script.
I've spoken with literate speakers about the script in general, but this
specific information is currently unconfirmed, so take it with ⅜ tsp. salt.
If you think it is useful to continue down this track, I can contact the
relevant local language schools and try to arrange a meeting with an
expert to confirm the information, rather than relying on casual
conversations with locals.

Maybe also the .in research team could be forwarded this, so they can
comment or complete the information. Who knows, this might even help
them :-).

So, without further ado:

Class "Nd" (decimal digits) should be included. Class "Lo" is required,
except:

U+0A8D : ઍ, which can be represented as U+0A85 U+0AC5 : અૅ
U+0A8F : એ, which can be represented as U+0A85 U+0AC7 : અે
U+0A90 : ઐ, which can be represented as U+0A85 U+0AC8 : અૈ
U+0A91 : ઑ, which can be represented as U+0A86 U+0AC5 : આૅ
U+0A93 : ઓ, which can be represented as U+0A86 U+0AC7 : આે
U+0A94 : ઔ, which can be represented as U+0A86 U+0AC8 : આૈ

(Note: so far these all render as perfect homographs in Prof. Jitendra
Shah's GPL Padmaa 0.5 font, and it looks like they would with the glyphs
on the Unicode.org code charts too.)
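
As a rough sketch of the rule above (not a finished algorithm), the
allowed letters and digits can be derived from the general categories in
Python's unicodedata module. The names EXCLUDED_LO and
allowed_letters_and_digits are made up for illustration, and the exact
result depends on the Unicode version your Python ships with.

```python
import unicodedata

# The six independent vowels proposed for exclusion, each mapped to the
# sequence the message says can represent it.
EXCLUDED_LO = {
    "\u0A8D": "\u0A85\u0AC5",
    "\u0A8F": "\u0A85\u0AC7",
    "\u0A90": "\u0A85\u0AC8",
    "\u0A91": "\u0A86\u0AC5",
    "\u0A93": "\u0A86\u0AC7",
    "\u0A94": "\u0A86\u0AC8",
}


def allowed_letters_and_digits():
    """Gujarati-block code points in Nd or Lo, minus the excluded vowels."""
    allowed = []
    for cp in range(0x0A80, 0x0B00):
        ch = chr(cp)
        if unicodedata.category(ch) in ("Nd", "Lo") and ch not in EXCLUDED_LO:
            allowed.append(ch)
    return allowed


if __name__ == "__main__":
    for ch in allowed_letters_and_digits():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Unassigned code points come back as category "Cn", so they drop out
without any special handling.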

In fact, it looks like U+0A86 : આ is actually U+0A85 U+0ABE : અા, so I
guess there needs to be a Stringprep-like normalisation step for these.
So maybe U+0A86 is not needed either; e.g. U+0A94 : ઔ could be U+0A85
U+0ABE U+0AC8 : અાૈ. This sequence is not a perfect homograph with the
Padmaa font, but it is on the Unicode.org code chart.
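
To make the normalisation idea concrete, here is a minimal Python
sketch of such a folding step. The table FOLD and the functions fold and
nfkc_already_unifies are hypothetical names, and the mappings are only
the unconfirmed ones from this message. As far as I know, the standard
Unicode normalisation forms leave these spellings distinct; the second
helper lets you check that for yourself.

```python
import unicodedata

# Unconfirmed mappings from this message: precomposed vowel -> sequence.
FOLD = {
    "\u0A86": "\u0A85\u0ABE",
    "\u0A8D": "\u0A85\u0AC5",
    "\u0A8F": "\u0A85\u0AC7",
    "\u0A90": "\u0A85\u0AC8",
    "\u0A91": "\u0A86\u0AC5",
    "\u0A93": "\u0A86\u0AC7",
    "\u0A94": "\u0A86\u0AC8",
}


def fold(label):
    """Expand precomposed vowels until nothing changes.

    Repeating the pass matters because some expansions introduce
    U+0A86, which is itself expanded on the next pass.
    """
    previous = None
    while previous != label:
        previous = label
        label = "".join(FOLD.get(ch, ch) for ch in label)
    return label


def nfkc_already_unifies(precomposed, sequence):
    """True if standard NFKC maps both spellings to the same string."""
    return (unicodedata.normalize("NFKC", precomposed)
            == unicodedata.normalize("NFKC", sequence))
```

With this table, fold("\u0A94") gives U+0A85 U+0ABE U+0AC8, the
three-codepoint spelling mentioned above. The loop terminates because
U+0A85 and the vowel signs are never themselves expanded.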

Here are some confusables that I think they'll just have to "live with",
like us Latin folk "live with" l vs 1, 0 vs O, etc.

U+0A8A : ઊ looks like U+0A89 U+0AC0 : ઉી, but I think they are not
conceptually the same, and the sequence doesn't render correctly with
Padmaa. I'd guess U+0A89 U+0AC0 is not a valid combination of codepoints.

U+0A9A : ચ looks a little like U+0AB0 U+0ABE : રા, but is a different
combination.

U+0AAB : ફ almost looks like U+0A95 U+0ACD : ક્

U+0A98 : ઘ looks like U+0AA7 : ધ, and perhaps one of them is equivalent
to or a homograph with U+0AA6 U+0ABE : દા

Signs (that combine, somewhat like accents):

Classes Mn and Mc are required, except:

U+0ABC : ઼ looks too similar to a full stop, and I *think* it is rarely
used; the Gujarati native I asked about it didn't seem to recognise it.
U+0AC9 : ૉ is the combination of U+0ABE U+0AC5 : ાૅ
U+0ACB : ો is the combination of U+0ABE U+0AC7 : ાે
U+0ACC : ૌ is the combination of U+0ABE U+0AC8 : ાૈ
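
The sign exceptions above can be sketched the same way in Python. The
names EXCLUDED_MARKS, allowed_marks and expand_signs are hypothetical,
and the mappings are again only the unconfirmed ones listed here.

```python
import unicodedata

# None marks a sign that is simply dropped from the allowed set; a string
# gives the signs the message says the excluded sign is a combination of.
EXCLUDED_MARKS = {
    "\u0ABC": None,              # looks too much like a full stop
    "\u0AC9": "\u0ABE\u0AC5",
    "\u0ACB": "\u0ABE\u0AC7",
    "\u0ACC": "\u0ABE\u0AC8",
}


def allowed_marks():
    """Gujarati-block code points in Mn or Mc, minus the exceptions."""
    return [
        chr(cp)
        for cp in range(0x0A80, 0x0B00)
        if unicodedata.category(chr(cp)) in ("Mn", "Mc")
        and chr(cp) not in EXCLUDED_MARKS
    ]


def expand_signs(text):
    """Replace each excluded composite sign with its component signs.

    U+0ABC has no replacement, so it is left in place for a later
    validation step to reject.
    """
    return "".join(EXCLUDED_MARKS.get(ch) or ch for ch in text)
```

For example, expand_signs("\u0A95\u0ACB") turns U+0A95 U+0ACB into
U+0A95 U+0ABE U+0AC7, so both spellings of that syllable end up as the
same sequence.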

Again, U+0AD0 : ૐ is a Sanskrit symbol and its duplication at U+0950 : ॐ
is regrettable. Probably the Devanagari version should "win".

-- 
Sam Vilain, Systems Architect, Catalyst IT (NZ) Ltd.
phone: +64 4 499 2267        PGP ID: 0x66B25843