[Idna-arabicscript] mapping of Full Stops

Mon Oct 12 15:53:38 CEST 2009

Hi.

I see two issues with this idea.  The second leads to a
suggestion.

(1) While very local mapping of full stops makes perfectly good
sense, any leakage is going to interfere with programs that need
to parse dot-separated-label form into length-label pairs.  Such
programs may exist because they are not IDNA-aware or because
they are trying to resolve names in private name spaces and
follow the RFC 2181 observation that the DNS can accommodate any
string of octets.  We know that things "leak", so these mappings
need to be performed very carefully -- even more carefully than
mappings of characters within labels -- and any document should
reflect that.
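
To make the leakage concern concrete, here is a minimal Python
sketch of the kind of parser I have in mind.  The function name is
hypothetical and the UTF-8 encoding of labels is an assumption made
for illustration (a private name space might use something else);
the point is only that the parser splits on U+002E and nothing else.

```python
def to_length_label_pairs(name: str) -> bytes:
    """Encode a dot-separated name as length-prefixed labels.

    Only U+002E is treated as a separator, as a program that is
    not IDNA-aware would do.  Labels are encoded as UTF-8 here,
    which is an assumption made for this illustration.
    """
    if name.endswith("."):
        name = name[:-1]  # drop one trailing root dot, if present
    wire = bytearray()
    for label in name.split("."):
        octets = label.encode("utf-8")
        if not 1 <= len(octets) <= 63:
            raise ValueError(f"bad label length: {len(octets)}")
        wire.append(len(octets))  # one length octet per label
        wire += octets
    wire.append(0)  # zero-length root label terminates the name
    return bytes(wire)


ascii_form = to_length_label_pairs("example.com")
leaked_form = to_length_label_pairs("example\u3002com")
```

With "example.com" this produces two labels of 7 and 3 octets.  If
an unmapped U+3002 leaks through, the same parser produces a single
13-octet label, which is a different name entirely.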

(2) One of the advantages of the list of characters that are now
discussed in the mapping document is that the list is fairly
stable.  Because it is largely motivated by within-label
IDNA2003 compatibility, there should be little or no need to
expand the list as Unicode evolves.  That is a good thing
because we don't have comprehensive rules to generate the
character and mapping list (even though much of it is either
NFKC or CaseFold) -- the mappings document is essentially
Unicode version-dependent.

By contrast, I think the general rule for candidate alternate
full-stop characters is going to be:

	(i) The traditional, DNS-specified, label separator
	U+002E (ASCII dot) is unnatural or hard to type or
	render in the local environment.

	(ii) There is a character in the local environment and
	script that is in common use, that is an obvious logical
	substitute label separator, and that users will possibly
	expect to be a substitute regardless of what we have
	to say on the subject.

Although I think one can make a slightly stronger argument for
U+06D4 because of bidi implications, I don't personally see a
very strong justification for a recommendation to map one of
these characters and not the others.  Erik and others have made that
point for some specific cases.

The list of alternate full stops (referred to on the IDNA list
for a while as "dot-oids") would be a second list, initialized with
the three East Asian full stops called out in RFC 3490, i.e.,
	 U+3002 (ideographic full stop),
	 U+FF0E (fullwidth full stop), and
	 U+FF61 (halfwidth ideographic full stop)

plus U+06D4 (Arabic full stop) for the reasons identified in
Sarmad's note.
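
As a sketch only, the mapping implied by those four characters could
look like the following in Python.  The names ALTERNATE_FULL_STOPS
and map_label_separators are hypothetical, and the set is just the
candidate list from this note, not an agreed or registered one.

```python
# The four candidate characters named in this note.  This is an
# illustrative set, not an official or registered list.
ALTERNATE_FULL_STOPS = {
    "\u3002",  # IDEOGRAPHIC FULL STOP
    "\uFF0E",  # FULLWIDTH FULL STOP
    "\uFF61",  # HALFWIDTH IDEOGRAPHIC FULL STOP
    "\u06D4",  # ARABIC FULL STOP
}

# Translation table mapping each candidate to U+002E.
_SEPARATOR_TABLE = {ord(ch): "." for ch in ALTERNATE_FULL_STOPS}


def map_label_separators(name: str) -> str:
    """Replace each candidate alternate full stop with U+002E."""
    return name.translate(_SEPARATOR_TABLE)


mapped = map_label_separators("example\u3002com")
```

Keeping the set as data separate from the function reflects the
point that the list may need to grow over time.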

But a search in the UnicodeData file shows 38 characters with
names containing "FULL STOP" and 60 (22 more) containing "STOP".
Some of these are clearly irrelevant: compatibility characters
(e.g., those in the 2488..249B range, which encode numerals
followed by full stop as single code points) or characters used
for some completely different purpose (e.g., a series of Glottal
Stop code points).  We've also been told repeatedly that putting too
much reliance on the assumption that Unicode names encode
characteristic information is a bad idea -- there may well be
characters out there that could be seen locally as reasonable
label separators that are not identified as "Full Stop" in their
names.
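
For anyone who wants to repeat the name search, here is a rough
Python equivalent using the standard unicodedata module.  The helper
name is hypothetical.  The counts it prints depend on the Unicode
version bundled with the interpreter, so they should not be expected
to match the figures above.

```python
import sys
import unicodedata


def chars_with_name_containing(fragment: str) -> list[str]:
    """Return every character whose Unicode name contains fragment."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # name() returns the supplied default for unnamed code points.
        if fragment in unicodedata.name(ch, ""):
            found.append(ch)
    return found


full_stops = chars_with_name_containing("FULL STOP")
stops = chars_with_name_containing("STOP")
print(unicodedata.unidata_version, len(full_stops), len(stops))
```

This only finds candidates by name, so it inherits the weakness
noted above: it picks up the numeral-plus-full-stop compatibility
characters and will miss any local separator whose name does not
say "FULL STOP".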

So building a comprehensive list would require
character-by-character examinations and decisions based on local
contexts and usage.  While we clearly have the expertise
available to make those judgments for East Asian scripts and the
Urdu use of Arabic script, we do not have that knowledge in the
general case.  The list is not closed-ended either.  As new
scripts are added to Unicode, it is safe to assume that at least
some of them will have their own full stop (or equivalent)
characters that will need to be considered for addition to the
list.

Whether publishing a list of recommended mappings to U+002E is a
good idea or not depends on (1) above and causes me to again
recall that only ASCII separators are permitted in IRI and URI
syntax -- separators that raise many of the same issues as the
label separator in domain names.  But, assuming that there is
rough consensus that doing so is appropriate, I suggest that the
right way to proceed is to create yet another document that 

	* focuses on the label-separator mapping alone

	* explains the issues and risks of doing such mappings

	* creates an IANA registry, initially populated by the
	four characters identified above, and with a mechanism for
	adding additional characters to it as the need for them is
	identified.

Since that document would presumably be informational anyway, if
Lisa is agreeable, it could be handled as an AD-sponsored
individual submission, thereby separating it from the WG's
schedule and task list.  I would hope we would not go forward
with it (and that Lisa would not sponsor it) unless there were
rough consensus on this list that having such a document and
list of characters would be wise.  But decoupling it from the
current Mapping document seems to me to be appropriate, if only
because of the expandability and open-endedness of the list of
characters.

best,
    john