FW: UTC Agenda Item: IDNA proposal

Michel Suignard michelsu at windows.microsoft.com
Wed Dec 6 02:35:21 CET 2006


I am forwarding this message that Ken tried to send before he
was subscribed to the list.

Michel
------------- Begin Forwarded Message -------------

Date: Thu, 30 Nov 2006 18:35:00 -0800 (PST)
From: Kenneth Whistler <kenw at sybase.com>
...
Patrick,

Following up on your drafted tables, I have built a utility
that lets me experiment with various criteria, to produce
tables that are easier to manipulate and compare.

For first results, see:

http://www.unicode.org/~whistler/SPLlLoLmMnMcNdStableCaseNFKC.txt

and

http://www.unicode.org/~whistler/SPXIDContStableCaseNFKC.txt

SPLlLoLmMnMcNdStableCaseNFKC.txt, as I hope the name suggests,
consists of all Unicode characters with General_Category in
[Ll Lo Lm Mn Mc Nd], constrained to those code points which
are also stable under lowercasing ( cp = lowercase(cp) ) and
stable under NFKC normalization ( cp = NFKC(cp) ).
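
For anyone who wants to reproduce or vary these lists, here is a
minimal Python sketch of the same filter. Caveats: it relies on
the unicodedata module, whose data tracks whatever Unicode
version your Python build ships (not necessarily the version
behind the files above), and it omits the sc= field, since the
stdlib does not expose the Script property.

    import sys
    import unicodedata

    KEEP = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) not in KEEP:
            continue
        # stable under lowercasing: cp = lowercase(cp)
        if ch.lower() != ch:
            continue
        # stable under NFKC normalization: cp = NFKC(cp)
        if unicodedata.normalize("NFKC", ch) != ch:
            continue
        print("%05X gc=%s %s" % (cp, unicodedata.category(ch),
                                 unicodedata.name(ch, "<unnamed>")))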


SPXIDContStableCaseNFKC.txt repeats the same general scheme,
but starts with all Unicode characters with XID_Continue = True,
then constrains them to those code points which are also stable
under lowercasing and under NFKC normalization.
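
The Python stdlib has no direct XID_Continue accessor, but since
Python 3 identifiers are themselves defined in terms of XID_Start
and XID_Continue, a rough stand-in for this starting set is
possible (an approximation tied to Python's own Unicode version,
not the real property):

    # "a" is XID_Start, so "a" + ch is a valid Python identifier
    # when ch is XID_Continue (approximately -- see caveat above)
    def xid_continue(ch):
        return ("a" + ch).isidentifier()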


These are plain text files, fielded with spaces, to simplify
sorting on various fields with simple sort utilities and
to simplify searching with grep and comparison with diff.

Lines have the form:

000E0 gc=Ll sc=Latn LATIN SMALL LETTER A WITH GRAVE
^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
fixed column fields character name

I've elided all the unified Han characters, which under
any criteria all need to be included. I've also elided all
the Hangul syllables, which likewise all need to be included.
I think we can just take those as given, skip the tens of
thousands of extra lines of redundant material they represent,
and instead focus on the issues for the non-Han and non-Hangul
characters.

I haven't bothered special-casing uppercase Latin A-Z,
as again we all know those are a special case to be
dealt with.

The files are in code point order, and I've used zero-extended
5-digit fields for the code point, to make sorting on the
code point easy. The second field (with General_Category
values) and the third field (with Script values) use
unambiguous prefixed forms (gc=..., sc=...), so it is easy to
use grep or grep -v to pull out a specific subset of records
by attribute, or to exclude one, and examine the results more
carefully.
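
For anyone post-processing the files programmatically rather
than with grep, the same kind of filtering is a few lines of
Python (a sketch; the file names are as above):

    # Keep (or, with exclude=True, drop) records whose sc= field
    # matches the given script code, e.g. "Grek"
    def records(path, script, exclude=False):
        with open(path, encoding="ascii") as f:
            for line in f:
                match = ("sc=" + script) in line.split()
                if match != exclude:
                    yield line.rstrip("\n")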

You can, of course, easily write a transducer for these
files that reformats them into HTML tables and/or converts
each code point to UTF-8, so the actual characters can be
displayed with fonts in a browser.
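
A minimal version of such a transducer, assuming the fixed-field
layout shown above (code point first, then gc=, then sc=, then
the character name):

    import html

    # Turn one record into an HTML table row, using a numeric
    # character reference to render the character itself
    def to_html_row(line):
        fields = line.split()
        cp, name = fields[0], " ".join(fields[3:])
        return "<tr><td>U+%s</td><td>&#x%s;</td><td>%s</td></tr>" % (
            cp, cp, html.escape(name))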

XID_Continue is the Unicode character property that summarizes
the basic recommendation for characters appropriate for
use in identifiers (cf. UAX #31).

If you diff SPLlLoLmMnMcNdStableCaseNFKC.txt and
SPXIDContStableCaseNFKC.txt, you'll find that the former is
a proper subset of the latter -- in other words, using the
criterion General_Category = [Ll Lo Lm Mn Mc Nd] as the
starting point for defining the appropriate set of characters
is *more* restrictive than XID_Continue (a mechanical check of
the subset relation is sketched after the list below). In
particular, XID_Continue also allows the following subtypes
that General_Category = [Ll Lo Lm Mn Mc Nd] omits:

  A. U+00B7 MIDDLE DOT (a special case)
  B. Some connector punctuation (U+005F LOW LINE and a
     few others that are similar in function)
  C. Ethiopic digits (which are gc=No, instead of gc=Nd)
  D. Number letters (gc=Nl), which are letterlike numberforms
     that would be appropriate in identifiers
  E. Two letterlike symbols (gc=So) that are grandfathered
     in to maintain identifier definition stability for
     characters whose General_Category was changed at
     a certain point in the history of the standard.
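
Here is one way to confirm the subset relation mechanically,
comparing just the code point fields of the two files (a sketch;
file names as above):

    def codepoints(path):
        with open(path, encoding="ascii") as f:
            return {line.split()[0] for line in f if line.strip()}

    gc_set  = codepoints("SPLlLoLmMnMcNdStableCaseNFKC.txt")
    xid_set = codepoints("SPXIDContStableCaseNFKC.txt")
    assert gc_set < xid_set          # proper subset
    print(sorted(xid_set - gc_set))  # the extras in A-E above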
     
We could discuss whether including any of these subsets in
the StringPrep output repertoire would be desirable. In
particular, I don't think C and D would hurt anything. But
none of them are really high priority, and many of the
characters in D are for historic scripts only.

Given that the relationship between the General_Category = 
[Ll Lo Lm Mn Mc Nd] criterion and the more lenient
XID_Continue = True criterion can now be quantified exactly
by comparing these two files, I think it would then
be productive to next examine:

SPLlLoLmMnMcNdStableCaseNFKC.txt

to make the case for paring it down further, simply by
omitting characters in it which otherwise seem inappropriate
for domain names (and the similar identifiers that StringPrep
would be used for). In particular, the next chunk that could
easily be eliminated algorithmically is the set of historic-only
scripts (a filter along these lines is sketched after the lists
below). The list could be pared down, for example, by dropping
the cuneiform scripts:

sc=Xsux  (Sumero-Akkadian cuneiform)
sc=Ugar  (Ugaritic cuneiform)
sc=Xpeo  (Old Persian cuneiform)

other archaic alphabets and syllabaries:

sc=Goth  (Gothic alphabet)
sc=Ital  (Old Italic alphabet)
sc=Cprt  (Cypriot syllabary)
sc=Linb  (Linear-B syllabary)
sc=Phnx  (Phoenician alphabet)
sc=Khar  (Kharoshthi abjad)
sc=Phag  (Phags-pa alphabet)
sc=Glag  (Glagolitic alphabet)

and conscripts with no current usage:

sc=Shaw  (Shavian conscript alphabet)
sc=Dsrt  (Deseret conscript alphabet)

I don't think anybody would shed any tears if those weren't
available for domain names, etc.
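
A sketch of that pruning step, driven by a hand-maintained
exclusion list (the script codes are the ones proposed above;
the set would be adjusted as the discussion dictates):

    # Historic-only scripts proposed for exclusion above
    EXCLUDE = {
        "Xsux", "Ugar", "Xpeo",          # cuneiform
        "Goth", "Ital", "Cprt", "Linb",  # archaic alphabets and
        "Phnx", "Khar", "Phag", "Glag",  #   syllabaries
        "Shaw", "Dsrt",                  # conscripts
    }

    # Drop records whose sc= field names an excluded script
    def prune(path):
        with open(path, encoding="ascii") as f:
            for line in f:
                fields = line.split()
                if len(fields) > 2 and fields[2][3:] in EXCLUDE:
                    continue
                yield line.rstrip("\n")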

More controversial ones might be:

sc=Ogam  (Ogham, which has a devoted following in Ireland)
sc=Runr  (Runic, which has much current usage, despite 
            being officially archaic)
sc=Cher  (Cherokee, which has little current use and is
            a problem for confusables, but whose elimination
            could become a cause célèbre and be taken as discriminatory)
            

After that the pickings get slim, and I don't think you can
make a very good case for eliminating any more scripts
qua scripts.

If we could get consensus somewhere along these lines, I think
we could then examine what remains for the next priority
collections of characters to omit systematically. For
example, while many, many combining marks are clearly
required for many languages, there are identifiable
subsets whose usage is restricted and not required for
normal orthography. Examples include the Hebrew annotation
marks and Arabic Koranic annotation marks, which are used
primarily for annotating religious texts for chanting and
singing, as well as combining marks used only in musical
notation. Such characters are harder to identify by Unicode
properties, and would best be handled instead by specifying
a small number of restricted ranges of code points.
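
As a sketch of what that might look like: a short table of
inclusive code point ranges, checked per character. The ranges
below are illustrative guesses at the blocks in question and
would need to be vetted against the actual repertoire before
use:

    # Illustrative restricted ranges (inclusive); NOT a vetted list
    RESTRICTED = [
        (0x0591, 0x05AF),    # Hebrew cantillation/annotation marks
        (0x06D6, 0x06ED),    # Arabic Koranic annotation signs
        (0x1D165, 0x1D1AD),  # combining musical symbols (coarse)
    ]

    def is_restricted(cp):
        return any(lo <= cp <= hi for lo, hi in RESTRICTED)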

Comments?

By the way, if there is additional information or a different
format that folks would find more useful for fiddling with
these data files, just let me know. It is easy to adjust
the output formatting or to list additional properties for
each character, if having them explicitly listed would assist
anyone in making these decisions. At the moment, it seems to
me that General_Category and Script are really the crucial
ones that folks are most concerned with and which seem most
useful as filtering criteria.

Regards,

--Ken

P.S. I am assuming that "idna-update at alvestrand.no" is simply
a mailing list that is set up to automatically distribute
this discussion to the relevant group. If not, I need to
know, so I can manually cc this to the relevant participants.



------------- End Forwarded Message -------------



