FW: [centr-tech] IDNA Redux

Tue Nov 7 01:03:36 CET 2006

John C Klensin wrote:
>> As ASCII isn't directly encodable using Punycode, one of these
>> is going to be needed to be allowed for Pacific languages,
>> which use the apostrophe. eg, Hawaiʻi. It is often ignored,
>> but in languages like Tongan it can make a difference.
>>     
> We are quite aware of this.  The problem goes back to, and was 
> recognized in, the original work on IDNA and earlier -- the 
> character, in many typefaces, looks like the ASCII apostrophe / 
> single quote.  That character is prohibited in DNS names for 
> several reasons, not least of which involves parsing problems in 
> many operating systems as well as the usual "confusable" 
> problem.

Would there really be a confusable problem, if only one type of
quote-like letter was allowed in domain names?

Given that the U+02BB and U+02BC characters are not quotation marks at
all, there should not be parsing problems, unless they somehow
spuriously get converted to an apostrophe or another quote.

Software that has somebody enter "o'hagan.co.nz" into a field expecting
a domain name or IRI could convert the currently illegal ' to ʼ and
resolve oʼhagan.co.nz.  I really don't see that there is an issue with
this character; it might be hard to type in places where an actual
apostrophe would be illegal (eg, html source, the command line), but
then so is 乳酪 hard to type if you are not literate in Chinese.
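
As a rough illustration of the mapping suggested above, here is a minimal
Python sketch.  The function names are hypothetical, and it deliberately
skips Nameprep and every other IDNA validity check: it only swaps the ASCII
apostrophe for U+02BC and Punycode-encodes labels that end up non-ASCII.

```python
APOSTROPHE = "\u0027"            # ASCII apostrophe, not allowed in hostnames
MODIFIER_APOSTROPHE = "\u02bc"   # MODIFIER LETTER APOSTROPHE

def to_ace_label(label):
    """Map ' to U+02BC, then Punycode-encode the label if it is non-ASCII.

    Assumption: no Nameprep or IDNA validation is applied here.
    """
    mapped = label.replace(APOSTROPHE, MODIFIER_APOSTROPHE)
    if mapped.isascii():
        return mapped
    # Python's built-in "punycode" codec implements the RFC 3492 encoding.
    return "xn--" + mapped.encode("punycode").decode("ascii")

def to_ace_domain(domain):
    """Apply to_ace_label to each dot-separated label of a domain name."""
    return ".".join(to_ace_label(label) for label in domain.split("."))

print(to_ace_domain("o'hagan.co.nz"))
```

Whether an application should silently perform this substitution is a
policy question; the sketch only shows that it is mechanically simple.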

>    Suggestions as to how to deal with it -- and avoid or 
> minimize those problems -- would be welcome, but this is one of 
> those cases in which "this is needed to write the language" is 
> unfortunately not sufficient.  In practice, the principle needs 
> to be closer to "any character needed to write the language but 
> consistent with a stable and predictable DNS".
>   

I agree with this, but I think we should be careful not to exclude vast
portions of words in some of the world's major languages, just
because current font rendering software will let you do nonsensical
arrangements of characters.  In particular, the use of combining accents
from one script to another seems an unnecessary flexibility and a source
of many of the exclusions.

One issue important to .nz is that most of the Indic Vowel Signs are
marked "possibly not".  The Gujarati word "ગાંધી" (Gandhi) contains a
character marked as "possibly not".  Even the Gujarati word for
"Gujarati" - ગુજરાતી - contains characters marked this way.  How would we
feel if we could not have certain Latin letters, like "o" ?

These signs look to be very similar to combining accents in Latin
scripts.  I think that someone literate in each of the scripts needs to
sit down and for each of these signs, produce a white list of characters
which they may follow.  For instance, U+0AC7 ("Gujarati Vowel Sign E")
might only follow Gujarati letters other than U+0A8D - U+0A94.
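
A per-sign whitelist of that kind could be checked along these lines.  This
is only a sketch of the example rule above, not a vetted rule for Gujarati:
the helper names are made up, and "Gujarati letter" is approximated by the
Unicode character name starting with "GUJARATI LETTER".

```python
import unicodedata

VOWEL_SIGN_E = "\u0ac7"                # GUJARATI VOWEL SIGN E
EXCLUDED = range(0x0A8D, 0x0A94 + 1)   # the U+0A8D - U+0A94 range from the text

def may_precede_vowel_sign_e(ch):
    """True if ch is a Gujarati letter outside the excluded range.

    Assumption: the character name prefix is a good enough test for
    "Gujarati letter" in this illustration.
    """
    name = unicodedata.name(ch, "")
    return name.startswith("GUJARATI LETTER") and ord(ch) not in EXCLUDED

def vowel_sign_e_ok(label):
    """Check every U+0AC7 in label against the whitelist of predecessors."""
    for i, ch in enumerate(label):
        if ch == VOWEL_SIGN_E:
            if i == 0 or not may_precede_vowel_sign_e(label[i - 1]):
                return False
    return True

print(vowel_sign_e_ok("\u0a97\u0ac7"))   # GA + vowel sign E
print(vowel_sign_e_ok("a\u0ac7"))        # Latin a + vowel sign E
```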

Hmm, did the normalization miss the homographs in Indic scripts?  Eg
"Candra O" can be written as U+0A91 (ઑ) - or as U+0A86, U+0AC5 (આૅ).  There
doesn't seem to be anything in the Unicode database dealing with this...
so I assume Stringprep doesn't try to re-write those characters at the
moment.
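
One way to check this is to run both spellings through the Unicode
normalization forms and compare.  A small Python sketch follows; it uses
whatever Unicode data ships with the interpreter, which is newer than the
version Stringprep is pinned to, so treat the result as indicative only.

```python
import unicodedata

single = "\u0a91"           # GUJARATI VOWEL CANDRA O
sequence = "\u0a86\u0ac5"   # GUJARATI LETTER AA + GUJARATI VOWEL SIGN CANDRA E

def same_after(form, a, b):
    """True if a and b become identical under the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

for form in ("NFC", "NFKC"):
    print(form, same_after(form, single, sequence))

# An empty string here means the character has no decomposition mapping.
print(repr(unicodedata.decomposition(single)))
```

If the two spellings stay distinct under NFKC, which is the form Nameprep
applies, then normalization alone will not fold them together.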

I think that the character-based whitelist is an incomplete approach. 
Maybe it would be better to take a word-based approach as a second
layer.  This should hopefully have the advantage of not breaking
backwards compatibility with software that uses Stringprep for non-DNS
things; a word-based stringprep would just be a supplemental restriction
recommended for applications such as the DNS where avoiding confusion is
more important than representing text correctly.  Then, by default, the
script may not vary within a word - unless the rules for that script
permit it to be mixed with other scripts.  Eg, most scripts will be
quite happy to mix with Arabic numerals.  Then, each script has its own
rules about which sequences are legal, and you won't get people putting
Latin accents on Indic characters.
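
To make the default rule concrete, here is a minimal Python sketch of a
"one script per word" check.  Python's standard library does not expose the
Unicode Script property, so the sketch guesses the script from the first
word of the character name; that shortcut, the COMMON set, and the function
names are all assumptions for illustration.

```python
import unicodedata

# Name prefixes treated as script-neutral in this sketch (ASCII digits, hyphen).
COMMON = {"DIGIT", "HYPHEN-MINUS"}

def script_of(ch):
    """Crude script guess: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def scripts_in_word(word):
    """Set of guessed scripts in word, after NFC and ignoring COMMON."""
    composed = unicodedata.normalize("NFC", word)
    return {script_of(ch) for ch in composed} - COMMON

def is_single_script(word):
    """True if the word draws on at most one guessed script."""
    return len(scripts_in_word(word)) <= 1

print(is_single_script("ગુજરાતી"))       # Gujarati only
print(is_single_script("catalyst2006"))  # Latin plus ASCII digits
print(is_single_script("\u0a97\u0301"))  # Gujarati GA + combining acute accent
```

Because generic combining marks have names beginning with "COMBINING", a
mark that does not compose under NFC counts as a second "script", so an
acute accent on a Gujarati letter is rejected while a precomposed Latin
letter with an accent passes.  A real implementation would use the Script and
Script_Extensions properties plus per-script sequence rules instead.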

As a side note, I notice that both the Gujarati and Devanagari OM
symbols are marked "maybe" - ॐ vs ૐ.  Using the above rules, there would
be no problem with either, so long as the ૐ did not appear on its own as
a word, in which case one (probably the Devanagari) would have to "win"
and be the normalized form.

-- 
Sam Vilain, Systems Architect, Catalyst IT (NZ) Ltd.
phone: +64 4 499 2267        PGP ID: 0x66B25843