UTC Agenda Item: IDNA proposal

Harald Alvestrand harald at alvestrand.no
Fri Nov 24 10:07:32 CET 2006


Commenting without any knowledge of the script in question, but trying to
shape the answers into something that I know how to translate into
software...

--On 24. november 2006 10:52 +1300 Sam Vilain <sam.vilain at catalyst.net.nz> 
wrote:

> Patrik Fältström wrote:
>>> This table completely excludes all of Gujarati and Devanagari.  India
>>> will not be able to register domains in their national language with
>>> this table.  Many other languages have been completely excluded, too -
>>> Mongolian, Khmer, ...
>>>
>> I need help with what you want me to add to the algorithm...not only
>> "this is wrong", because that will not move us forward.
>>
>> Are you trying to say "please also accept class Mc"? Or is
>> it "please accept class Mc, but only for Gujarati and Devanagari"? If
>> the latter, what do we do with Khmer and Mongolian (etc.)?
>>
>
> Ok, well here are the results of my own analysis of the Gujarati script;
> I've spoken with literate speakers about the script in general, but this
> specific information is currently unconfirmed, so take with ⅜ tsp. salt.
> If you think it is useful to continue down this track, I can contact the
> relevant local language schools and attempt to arrange to meet an expert
> to confirm the information, rather than relying on casual conversations
> with locals.
>
> Maybe also the .in research team could be forwarded this, so they can
> comment or complete the information. Who knows, this might even help
> them :-).
>
> So, without further ado:
>
> Class "Nd" are numbers, and should be included.

If we accept Nd, we accept the decimal digits of all scripts, I think. That
might be OK; I'd like to hear if anyone objects.
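
To make "accept Nd" concrete, here is a rough Python sketch using the
standard unicodedata module. The function name is_decimal_digit is made up
for illustration, and the answers depend on the Unicode version bundled
with the interpreter:

```python
import unicodedata

def is_decimal_digit(ch: str) -> bool:
    """Return True if ch has Unicode general category Nd."""
    return unicodedata.category(ch) == "Nd"

# A rule keyed on Nd alone does not distinguish between scripts:
# ASCII, Devanagari and Gujarati digits all pass the same test.
samples = {
    "ASCII digit five": "5",
    "Devanagari digit zero": "\u0966",
    "Gujarati digit zero": "\u0AE6",
    "Latin letter l": "l",
}

for label, ch in samples.items():
    print(f"U+{ord(ch):04X} {label}: {is_decimal_digit(ch)}")
```

So a per-script restriction, if anyone wants one, would have to be a
separate rule layered on top of the category test.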

> Class "Lo" is required,
> except:
>
> U+0A8D : ઍ, which can be represented as U+0A85 U+0AC5 : અૅ
> U+0A8F : એ, which can be represented as U+0A85 U+0AC7 : અે
> U+0A90 : ઐ, which can be represented as U+0A85 U+0AC8 : અૈ
> U+0A91 : ઑ, which can be represented as U+0A86 U+0AC5 : આૅ
> U+0A93 : ઓ, which can be represented as U+0A86 U+0AC7 : આે
> U+0A94 : ઔ, which can be represented as U+0A86 U+0AC8 : આૈ
>
> (note: these are all so far rendering as perfect homographs using Prof.
> Jitendra Shah's GPL Padmaa 0.5 font, and look like they would using the
> glyphs on the Unicode.org code charts).
>
> In fact, it looks like U+0A86 : આ is actually U+0A85 U+0ABE : અા, so I
> guess there needs to be a Stringprep-like normalisation step for these.
> So maybe U+0A86 is not needed; e.g. U+0A94 : ઔ could be U+0A85 U+0ABE
> U+0AC8 : અાૈ. This is not a perfect homograph with the Padmaa font,
> but it is on the Unicode.org code chart.

Would it be harmful to include those, apart from the confusables problem?
Or do you think that they "should have had" canonical/compatibility
decompositions, so that they would go away under the NFKC rule?
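
For reference, whether NFKC unifies these today can be checked mechanically.
Below is a small Python sketch, assuming the pairs listed in Sam's message;
the names PAIRS and unified_by_nfkc are invented for this example:

```python
import unicodedata

# Independent vowel -> proposed equivalent sequence, as listed in the
# quoted message (unconfirmed by a script expert).
PAIRS = {
    "\u0A8D": "\u0A85\u0AC5",
    "\u0A8F": "\u0A85\u0AC7",
    "\u0A90": "\u0A85\u0AC8",
    "\u0A91": "\u0A86\u0AC5",
    "\u0A93": "\u0A86\u0AC7",
    "\u0A94": "\u0A86\u0AC8",
}

def unified_by_nfkc(a: str, b: str) -> bool:
    """True if a and b become the same string after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

for single, sequence in PAIRS.items():
    codepoints = " ".join(f"U+{ord(c):04X}" for c in sequence)
    print(f"U+{ord(single):04X} vs {codepoints}: {unified_by_nfkc(single, sequence)}")
```

As far as I can tell these Gujarati letters carry no decomposition mappings,
so every pair stays distinct under NFKC, and any unification would need an
extra mapping step outside normalization.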

> Here are some confusables that I think they'll just have to "live with",
> like us Latin folk "live with" l vs 1, 0 vs O, etc.
>
> U+0A8A : ઊ looks like U+0A89 U+0AC0 : ઉી, but they are not
> conceptually the same, I think, and the sequence doesn't render correctly
> with Padmaa. I'd guess U+0A89 U+0AC0 is not a valid combination of
> codepoints.
>
> U+0A9A : ચ looks a little like U+0AB0 U+0ABE : રા, but is a
> different combination.
>
> U+0AAB : ફ almost looks like U+0A95 U+0ACD : ક્
>
> U+0A98 : ઘ looks like U+0AA7 : ધ, and perhaps one of them is
> equivalent to or a homograph with U+0AA6 U+0ABE : દા
>
> Signs (that combine, somewhat like accents):
>
> Class Mn, Mc, are required, except:
>
> U+0ABC : ઼ looks too similar to a full stop, and I *think* it is rarely
> used; the Gujarati native I asked about it didn't seem to recognise it.
> U+0AC9 : ૉ is the combination of U+0ABE U+0AC5 : ાૅ
> U+0ACB : ો is the combination of U+0ABE U+0AC7 : ાે
> U+0ACC : ૌ is the combination of U+0ABE U+0AC8 : ાૈ
>
> Again, U+0AD0 : ૐ is a Sanskrit symbol and its duplication at U+0950 :
> ॐ is regrettable. Probably the Devanagari version should "win".

By "win", do you mean that there should be a canonical decomposition of
U+0AD0 to U+0950?
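
The current state is easy to inspect. This is a minimal Python sketch;
folds_to is a hypothetical helper, and the output reflects whatever Unicode
data the interpreter bundles:

```python
import unicodedata

GUJARATI_OM = "\u0AD0"
DEVANAGARI_OM = "\u0950"

def folds_to(source: str, target: str, form: str = "NFKC") -> bool:
    """True if normalizing source with the given form yields target."""
    return unicodedata.normalize(form, source) == target

# An empty decomposition string means the character has no canonical or
# compatibility mapping in the Unicode Character Database.
print(unicodedata.name(GUJARATI_OM), repr(unicodedata.decomposition(GUJARATI_OM)))
print(unicodedata.name(DEVANAGARI_OM), repr(unicodedata.decomposition(DEVANAGARI_OM)))
print("NFKC folds Gujarati OM to Devanagari OM:", folds_to(GUJARATI_OM, DEVANAGARI_OM))
```

If no such decomposition exists, letting one of the two "win" would have to
be done by a mapping or exclusion rule in the IDNA tables themselves.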

>
> --
> Sam Vilain, Systems Architect, Catalyst IT (NZ) Ltd.
> phone: +64 4 499 2267        PGP ID: 0x66B25843
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>





