Allowed characters (was: Re: Casefolding Sigma (was: Re: IDNAbis Preprocessing Draft))

Wed Mar 26 22:08:41 CET 2008

--On Wednesday, 26 March, 2008 20:00 +0000 Michael Everson
<everson at evertype.com> wrote:

> At 15:47 -0400 2008-03-26, John C Klensin wrote:
> 
>> I think what both Mark and I are saying, albeit in very
>> different ways, is that it just isn't that simple.  Arabic
>> (and any other RTL script) requires consideration of
>> sequences of characters in labels, not just individual Yes/No
>> character lists.
> 
> That doesn't mean I don't need a list.

>> In IDNA2003, there are some canonical form issues and, I
>> believe, some compatibility ones.  In general, for the current
>> state of the IDNA200X proposals, those issues translate into
>> disallowed code points (what you are calling "out", I think).
> 
> I'm interested in the present and future, not 2003
> restrictions.

IDNA2003 is the present.  And, had you said that, you could have
saved everyone a lot of time.

>> If people are interested in Arabic domain names (other uses of
>> Arabic script are not the subject matter of either this
>> mailing list or either set of protocols), you miss a major
>> portion of the picture if you restrict yourself to the Arabic
>> script block or specifically-Arabic letters and decorations.
> 
> Other characters are orthogonal to my need. I need to know if
> anyone has decided to say "No" to some diacritics used only in
> Qur'anic annotation for instance, or if any of the basic
> letters have been excluded. I am interested in the world
> beyond Persian and Arabic and Urdu.

The 10000 meter answer to that question is that, with the
exception of characters that are transformed by NFKC, all
letters, digits, and combining marks are permitted.  That set of
results is based on the Unicode property relationships described
in "tables" and is _exactly_ identical to the rules for every
other script.  Exceptions could be defined on top of those rules
but, so far, there are no specific exceptions for Arabic
characters.  There is, in that regard, nothing special about
Arabic. 
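
As a rough illustration only, the rule of thumb above can be sketched
in Python with the standard unicodedata module.  The function name is
hypothetical, and the sketch assumes that "letters, digits, and
combining marks" means the general categories L*, Nd, Mn, and Mc.  It
ignores case folding, the contextual rules for sequences, and any
exceptions, so it is not a substitute for the derivation in "tables".

```python
import unicodedata


def roughly_permitted(ch: str) -> bool:
    """Approximate the rule of thumb for one code point (hypothetical helper)."""
    # Anything that NFKC would change is excluded.
    if unicodedata.normalize("NFKC", ch) != ch:
        return False
    cat = unicodedata.category(ch)
    # Assumption: letters are L*, digits are Nd, combining marks are Mn/Mc.
    return cat.startswith("L") or cat in ("Nd", "Mn", "Mc")


if __name__ == "__main__":
    for cp in (0x0628, 0xFE8F, 0x064E, 0x06D6, 0x060C):
        ch = chr(cp)
        name = unicodedata.name(ch, "<unnamed>")
        print(f"U+{cp:04X} {name}: {roughly_permitted(ch)}")
```

Under this sketch U+0628 ARABIC LETTER BEH passes, while its isolated
presentation form U+FE8F does not, because NFKC maps it back to
U+0628.  A Qur'anic annotation mark such as U+06D6 passes like any
other combining mark, which matches the point that there are
currently no Arabic-specific exceptions.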

We remain open to a strong argument from the users of the script
that some additional characters should be excluded at the
protocol level, not just the registry one.  For example, I've
seen at least one proposal to prohibit those Qur'anic annotation
characters.  I hope that you and others who are at that meeting
can help to focus on those questions rather than only on
what is possible.  For that purpose, you (or someone else) need
to understand some of the theory behind IDNA (either version,
actually) and the use cases for domain names more generally,
which is the other reason I was pointing you toward documents
rather than lists.

>> So, while we could probably contrive to answer your precise
>> questions above, we would only be misleading you and your
>> audience by doing so.
> 
> No, it would not. I need something indicative.

See above.  But, of course, your desires and expectations may
reasonably differ from mine and/or those of your audience.

>> And, for IDNA200X, some of the characters and relationships
>> are still under active consideration -- consideration in
>> which some of the participants in the meeting for which you
>> presumably want this information are very much participating
>> and very well informed as to the issues.
> 
> I think if you guys can't come up with SOME sort of list
> things are a lot worse than I thought.

For IDNA200X the list is, as Ken pointed out, pretty simple.

   john