Standards and localization (was Dot-mapping)

Thu Dec 13 03:42:42 CET 2007

"The dots" that are relevant are the full stops:

U+002E FULL STOP
U+3002 IDEOGRAPHIC FULL STOP

and, because of fullwidth/halfwidth cloning in East Asian character sets:

U+FF0E FULLWIDTH FULL STOP (explicitly fullwidth version of U+002E)
U+FF61 HALFWIDTH IDEOGRAPHIC FULL STOP (explicitly halfwidth version of
U+3002)

These are exactly the full stops used in the IDNA2003 spec
(http://ietf.org/rfc/rfc3490.txt):

   1) Whenever dots are used as label separators, the following
      characters MUST be recognized as dots: U+002E (full stop), U+3002
      (ideographic full stop), U+FF0E (fullwidth full stop), U+FF61
      (halfwidth ideographic full stop).

That's it. We shouldn't even be talking about "dots" here, because what
is really at issue is these two full stops. They were added to IDNA2003
to make it easier for about a third of the world to enter URLs, because
of the way that input methods work. They are being kept in IDNAbis for
the same reason, plus backward compatibility with IDNA2003.
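
For illustration, here is a minimal Python sketch of what recognizing
the four characters from the RFC 3490 text quoted above as label
separators could look like. The names LABEL_DOTS, split_labels and
map_dots are made up for this example; they are not taken from RFC 3490
or from any IDNA library, and the sketch takes no position on where in
the protocol stack this mapping belongs.

```python
import re

# The four characters RFC 3490 says MUST be recognized as dots.
# (LABEL_DOTS is a hypothetical name chosen for this sketch.)
LABEL_DOTS = re.compile("[\u002e\u3002\uff0e\uff61]")


def split_labels(name):
    """Split a domain name into labels on any of the four dots."""
    return LABEL_DOTS.split(name)


def map_dots(name):
    """Replace the three non-ASCII dots with U+002E FULL STOP."""
    return LABEL_DOTS.sub("\u002e", name)
```

With these definitions, split_labels("www\u3002example\uff0ecom")
returns ["www", "example", "com"], and map_dots turns the same input
into "www.example.com". The set is a fixed list of four code points;
nothing in the sketch is meant to be extended.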

If we had wanted to extend this set to all the NFKC compatibility
variants, then we would also have added the following:

U+2024 ONE DOT LEADER
U+FE12 PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
U+FE52 SMALL FULL STOP

However, there is no need for that at all, since those characters
will not be entered by accident on Chinese and Japanese computers.
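
The compatibility relationship mentioned above can be checked with
Python's standard unicodedata module. This is only a sketch: the helper
name nfkc_target is invented for the example, and the results depend on
the Unicode data shipped with the Python build in use.

```python
import unicodedata


def nfkc_target(ch):
    """Return the NFKC normalization of a single character."""
    return unicodedata.normalize("NFKC", ch)


# The three extra compatibility variants, plus the two width variants
# that IDNA2003 already recognizes.
for ch in ["\u2024", "\ufe12", "\ufe52", "\uff0e", "\uff61"]:
    target = nfkc_target(ch)
    print("U+%04X %s -> U+%04X" % (ord(ch), unicodedata.name(ch), ord(target)))
```

Each of the five characters normalizes to either U+002E or U+3002,
which is the sense in which they are compatibility variants of the two
full stops.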

I'm agnostic about where FULL STOP and IDEOGRAPHIC FULL STOP
equivalence gets handled in the protocol stack, by the way.
I leave that to others to sort out.

While lots of scripts have different kinds of
terminal punctuation, of all shapes, which function somewhat
similarly to FULL STOP in Western punctuation conventions,
they don't look like dots, and as far as I know nobody is advocating
that those start to appear as internet label delimiters.

I just want to emphasize the point that what you do about mapping
full stops shouldn't be colored by the fear that a nonextensible
specification for them will be broken and lead to cultural
and political attacks on the specification.

> If the list is not extensible, we run
> into problems with scripts that have not been coded but whose
> users believe that their dots are equally important.

I will venture to assert that that is the null set.

Unicode 5.1 is adding the Cham, Kayah Li, Lepcha, Ol Chiki,
Rejang, Saurashtra, Sundanese and Vai scripts -- all in modern
use. Many of those scripts have "danda" punctuation, but none
of them adds a baseline dot delimiter FULL STOP to the standard.

Unicode 5.2 will add the Tai Tham and Tai Viet scripts, and
the same statement holds for those.

There are about a dozen more regional current-use scripts
in the pipeline for eventual encoding, many of which have
reasonably complete proposals to hand by now -- and as far
as I know, none of them will add a baseline dot delimiter
FULL STOP to the standard.

As for the various archaic scripts, those aren't going to
be appropriate for IDNs in any case, and don't have "users"
with cultural expectations, even if they did have baseline
delimiter dots.

So we don't need to get distracted by worrying about
extensibility issues for these particular delimiters.

--Ken