ALWAYS/MAYBE and CJK (was: Re: IDNAbis Main Open Issues)

Kenneth Whistler kenw at sybase.com
Wed Jan 23 23:41:23 CET 2008


In his recent post on IDNAbis main open issues, Mark listed:

> >       2. Don't be Eurocentric: ensure that modern scripts are
> > in       ALWAYS (or whatever it is called)
> >       3. Resolve ALWAYS/MAYBE/NEVER problem (see below).

And John responded:

> The explanation of this situation in issues-06 is much better
> than it is in issues-05 (others will have to judge whether it is
> adequate).  However, the ALWAYS problem is that it is not
> necessarily all of a script and the boundaries require
> explicit, IDN-specific, input from the users of that script (as
> we have both noted earlier, that is a tough problem, but let's
> isolate it a bit).

I think this is the crux of the difficulty we are having here.
There is no dispute that in any specific local market
(whether bounded by a country, or otherwise), not all of
the characters of any particular script would be needed,
relevant, or useful -- let alone all the characters of
all the *other* scripts in Unicode.

So in Tonga, for example, at least 96% of the Latin characters in
Unicode would be utterly irrelevant to any Tongan speakers. But if
there is (or would be) a .to zone registry and they decided
they wanted to use IDNs so they could make use of the one
non-ASCII character they *do* care about, the fakau'a
(glottal stop), they would need U+02BC (or U+02BB, which
is what the Hawai'ians use) to be ALWAYS for IDNs.

Now I have no trouble whatever with the notion that a .to
registry might decide to refuse to register any domain
label containing non-Tongan (or most likely also English)
letters -- thus just ASCII + a glottal stop, for example.
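
To make that division of labor concrete, here is a rough Python
sketch of such a registry-side policy check. The repertoire set
and the function name are mine, purely illustrative -- the point
is that the restriction lives in registry policy, not in the
protocol table:

    TONGAN_REPERTOIRE = set("abcdefghijklmnopqrstuvwxyz0123456789-") | {
        "\u02BC",  # MODIFIER LETTER APOSTROPHE (one choice of fakau'a)
        "\u02BB",  # MODIFIER LETTER TURNED COMMA (the Hawaiian 'okina)
    }

    def registry_accepts(label: str) -> bool:
        # Registry policy, applied *after* the protocol-level IDNA
        # check; rejecting here needs no MAYBE in the protocol table.
        return all(ch in TONGAN_REPERTOIRE for ch in label.lower())

    print(registry_accepts("fakau\u02BBa"))  # True
    print(registry_accepts("caf\u00E9"))     # False: valid IDNA, but
                                             # not in the .to policy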

What I do have trouble with is extrapolating from that to the
notion that the "ALWAYS" or "MAYBE" values for Latin
characters in the IDNA tables depend on consulting the Tongan
community and then, sequentially, another 1000+
Latin-script-using communities speaking and writing different
languages, each with its own pattern of which Latin characters
it finds useful, in order to come up with a reliable
determination of the IDNA table content.

At the end of the day, after an indefinite and
indeterminately extended process, you could pretty
reliably predict that the answer for the IDNA table
would be that *ALL* of the Latin letters and marks should
be allowed (unless they have some structural problem
that interferes with their suitability for IDNs -- such
as instability under normalization). After all, the Latin
characters got into the standard in the first place
because some Latin script user someplace was using the
letters to write *something*.
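
For concreteness, here is what such a structural problem looks
like in practice -- a quick Python check on two well-known
normalization-unstable characters:

    import unicodedata

    for ch in ("\u212B",   # ANGSTROM SIGN: NFC maps it to U+00C5
               "\uFB01"):  # LATIN SMALL LIGATURE FI: NFKC yields "fi"
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
        print("  NFC :", unicodedata.normalize("NFC", ch))
        print("  NFKC:", unicodedata.normalize("NFKC", ch))

A character that does not survive normalization intact can never
round-trip in an IDN, no matter which language community wants it.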

Furthermore, having all the Latin characters be ALWAYS
doesn't interfere with the Tongans' ability to appropriately
restrict registrations in *their* registry. So where is
the problem in simply making them all ALWAYS to begin
with, thereby skipping a long, potentially rancorous
and interminable process that wouldn't actually materially
improve the IDNA spec or help in rolling out useful IDNs?
If anything, keeping some otherwise suitable Latin
characters back by labelling them MAYBE, thereby leading
to the kind of uncertainty among major implementers that
Mark has been talking about, could *impede* the acceptance
of the spec.

Now it seems to me that everybody here has more or less
accepted this conclusion for the Latin script, despite the
manifest complexities of the script and the number of odd
Latin characters used only for a single small language, or
perhaps only in historic contexts.

But here is where there is a disconnect. If the Tongans
don't care (and have no reason, really, to care) whether
some other Latin script character which they don't use, such
as, for example, U+02B7 MODIFIER LETTER SMALL W, is used
in Latin domain names for the Blackfoot nation -- then
why should they (or we) care if the Blackfoot nation
also decided to use Canadian Syllabic characters in
domain names?

In other words, I see no principled reason for ruling
the entire Latin script to be ALWAYS in the table, but
then ruling the entire Canadian Syllabic script to be
MAYBE, simply because *we* haven't heard from the Blackfoot
nation about exactly *which* of all the Canadian Syllabic
syllables they would actually find useful for *their*
language.

As with the encoding of the Latin script, the encoding of
any other cosmopolitan script in Unicode (and this
would include the Canadian Syllabics, for example) is
a process of repeatedly checking with *all* the
stakeholders over time, to verify that *all* of the
characters that any of them need end up being included
in the overall encoding. So Canadian Syllabics is intentionally
comprehensive for Cree, Ojibwe, Carrier, Blackfoot,
Slavey, Tanaina, Inuktitut, ... and so on. Any one community
uses only a portion of the syllabics, just as any one language
using the Latin script uses only a small fraction of
all the Latin characters. But *all* of the characters
are useful to *somebody*.

I don't see where it benefits us (or any of the potential
users of IDNs) to sit around for years to come, waiting
to hear once again from the Cree, the Ojibwe, the Carrier,
the Blackfoot, the Slavey, the Tanaina, the Inuktitut, ...
until they can all collectively tell us that, "Yep, we
need those characters that we said we needed when you
last consulted us to help put together the Canadian
National Body contribution that was accepted for encoding
in ISO/IEC 10646 and Unicode."

So, now turning to CJK...

> So, for example, we've got advice from the three important
> pieces of the CJK community about characters that are considered
> appropriate (by one or more of them) for use in IDNs (see
> RFC3743, RFC4713, and the corresponding IANA table
> registrations).  So, because the community has told us what they
> consider safe, appropriate, and sufficiently unambiguous for IDN
> use, those characters go into ALWAYS.  The rest of the CJK
> characters (i.e., the characters of the Han and Han-derived
> script) belong in MAYBE somewhere.

I don't read it that way at all.

I've read the RFCs and examined some of the tables.

The main thrust of the RFCs is dealing with the equivalence
problem for CJK -- particularly simplified versus traditional
forms -- and how that impacts decisions about what to register.

If you actually look at the CNNIC table:

http://www.iana.org/assignments/idn/cn-chinese.html

filed according to RFC 4713 (and based on the structure
in RFC 3743), what's actually listed is LDH, a few
Han characters from Extension A, and then *every* Han
character from the URO, U+4E00..U+9FA5 (20,902 characters).

In effect what the CNNIC did was jump from CP936 as the
basis of their repertoire (which was what the tentative example
in RFC 3743 used), to using GB 18030 as the basis of their
repertoire -- and that, in turn, was based on Unicode.
Off the cuff I can't determine why they chose the particular
52 characters from Extension A that they did, but could
presumably investigate GB 18030 for a while and figure it out.

What the CNNIC determination does *not* actually do
is tell you a) that this list of characters is actually
Chinese -- it isn't, since the URO contains many characters
whose source was in Japanese standards and which aren't
used in China for Chinese; or b) that this list of characters
is actually safe or unambiguous -- since the URO contains
many kinds of duplicates that resulted from the application
of the source separation rule for the original generation
of that repertoire of unified Han characters.

What the CNNIC table *does* do is provide the mapping information
needed to deal with the various duplication and variant
problems inherent in the registration of domain names
using the URO repertoire for CJK characters, including the
simplified/traditional variants.
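
As a toy illustration of that kind of variant handling at
registration time (the two-entry mapping below is hypothetical,
and is neither the actual CNNIC data nor the IANA file format):

    VARIANTS = {
        "\u570B": "\u56FD",  # traditional 國 -> preferred simplified 国
        "\u56FD": "\u56FD",  # 国 maps to itself
    }

    def preferred_form(label: str) -> str:
        # Fold each character to its preferred variant before
        # checking the registry database for collisions, so that
        # 中國 and 中国 cannot go to different registrants.
        return "".join(VARIANTS.get(ch, ch) for ch in label)

    print(preferred_form("\u4E2D\u570B"))  # 中國 -> 中国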

What I think the CNNIC table implies for IDNA is actually that
the correct approach is just to determine that *all*
the URO and Extension A Han characters are appropriate for IDNs based
on their identity and Unicode properties (and stability
under normalization, etc.) -- i.e., the entries in the
IDNA table itself should simply be:

3400..4DB5 ; IDN_Always # Lo  [6582] CJK UNIFIED IDEOGRAPH-3400..4DB5
4E00..9FC3 ; IDN_Always # Lo [20932] CJK UNIFIED IDEOGRAPH-4E00..9FC3

That's simply a determination that, based on the protocol
definition and its associated property table, resolvers
shouldn't do anything special about prohibiting any of
these characters. Trying to do anything more than simply
saying "o.k." when handed one of these Han characters just complicates
the whole process unnecessarily -- both in terms of the
code for the resolvers and more importantly, the IETF
process for updating its own IDNA table.
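
In resolver code, those two table lines amount to nothing more
than a range check. A rough sketch, with the ranges hard-coded
here only for illustration (a real implementation would consult
the full generated table):

    def han_is_idn_always(ch: str) -> bool:
        cp = ord(ch)
        return (0x3400 <= cp <= 0x4DB5       # Extension A
                or 0x4E00 <= cp <= 0x9FC3)   # URO plus later additions

    print(han_is_idn_always("\u4E2D"))  # True: 中
    print(han_is_idn_always("\u3300"))  # False: a compatibility symbol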

Then if the CNNIC wishes to declare, through its table linked
above, that it only accepts 52 of those characters from
Extension A, and only the Unicode 3.2 subrange of the URO
(4E00..9FA5), that's fine. That's all they register, and
clearly any IDN of the form xxx.cn that contains any
other CJK characters simply won't resolve.

And if the CNNIC later updates its table, which it most likely
will, to include the HKSCS characters that got included in
the extensions to the URO *after* Unicode 3.2 (and presuming
they update to IDNAbis from IDNA2003, once we are done
with all this), then that is no problem whatsoever. They
update their table, but we don't have to come back reviewing
and updating the IDNAbis table to try to synch it back to
the new "advice" from CNNIC that "Oh, by the way, we also
need these additional 27 (or 127, or 2227) more Han characters."


> Since they are legitimate
> language characters and the community has not said "these are
> evil" or the equivalent, they cannot be classified as NEVER and
> should not be.

Of course.

> Which of the MAYBE categories those additional
> characters belong in is still something we are trying to sort
> out (and hence another unresolved issue), but they certainly
> don't go to ALWAYS.

And that is precisely where I (and Mark and Michel) disagree.

Compare the discussion of Latin above with the discussion of
CJK here. The approach taken is just inconsistent -- unless
you want to start off Latin by going to that same IANA IDN
language table registry and examining, one by one, the Polish
NIC's tables for the Latin characters it has determined to be
"safe, appropriate, and sufficiently unambiguous" for its
use, one language after another, and then repeating that
process again and again for every NIC and then for every
other "language community" that might have a say in what
Latin characters it thinks it should use.

> I don't have an opinion about it and seek your advice but, to
> the extent to which there are particularly problematic
> characters in Western European scripts, they might be kept out
> of ALWAYS for the present and until more experience and advice
> from the affected parties comes along.

Exactly. But such a process will please no one and never
terminate.

> Such characters might
> include the notorious dotless "i" and perhaps such difficulties
> as Eszett and Final Sigma, although backward-compatibility or
> other considerations might dictate the handling of those
> characters.

It should *not* include those characters, precisely because
they *are* notorious cases, and because we already know
what the answers for them have to be and why. Those are
all common-use characters, and we can't duck having a
clear decision for their ilk.
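
A quick demonstration of why they are notorious -- their case
behavior does not round-trip, as CPython's Unicode case
operations show:

    print("\u00DF".upper())          # ß -> 'SS' (one letter becomes two)
    print("\u0131".upper().lower())  # ı -> 'I' -> 'i' (no round trip)
    print("\u03A3\u03A3".lower())    # ΣΣ -> 'σς' (two lowercase sigmas)
    # True: final and non-final sigma fold together
    print("\u03C2".casefold() == "\u03C3".casefold())

Any case-insensitive matching scheme has to make an explicit,
documented choice about these up front; MAYBE just defers the
choice without making it go away.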

> Note that the model described above involves splitting up the
> characters of scripts in ways that cannot be done by Unicode
> properties alone since judgments specific to IDN usability are
> required.

And that is precisely what the IDNA table itself should not
attempt to do. That process is what should be distributed to
the NICs, each of which can determine what it considers
appropriate to use as a subset for its own purposes, once the
IDNA spec itself puts a stake in the ground regarding which
characters are suitable for use in IDNs on technical grounds
(normalization and casing stability, exclusion of
inappropriate general categories such as symbols and
punctuation, and so on) -- considered not on a
language-by-language, country-by-country basis, but rather in
the generic, universal context of scripts as defined by the
Unicode Standard.
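
A minimal sketch of that kind of property-driven derivation in
Python follows; the category set and the exact stability test
are my illustrative assumptions, not the precise IDNAbis rules:

    import unicodedata

    LETTERISH = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}  # assumed set

    def idn_suitable(cp: int) -> bool:
        ch = chr(cp)
        if unicodedata.category(ch) not in LETTERISH:
            return False   # symbols, punctuation, uppercase are out
        folded = unicodedata.normalize("NFKC", ch).casefold()
        # keep only code points stable under NFKC and case folding
        return unicodedata.normalize("NFKC", folded) == ch

    # The whole URO passes mechanically, with no per-language input:
    print(all(idn_suitable(cp) for cp in range(0x4E00, 0x9FA6)))
    # ß fails *this* sketch because it case-folds to "ss":
    print(idn_suitable(0x00DF))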

*That* is a process that can quickly terminate, and which
*ought* to terminate without too much controversy, because
it doesn't really slam the door on what *anybody* needs.
And it leaves complete room then for the NICs to do what
they need to do to pare down the huge realm of possibilities
to what they see as locally relevant and appropriate for their
own particular domains.

--Ken



