Cyrillic titlo in tables-05

Kenneth Whistler kenw at sybase.com
Tue Mar 11 00:05:36 CET 2008


Patrik,

In reviewing the esszet thread, I happened to spot something
strange in your excerpt of the Category F - Exceptions list,
and that is the Cyrillic titlo.

The tables-04e draft (excerpted from when this was undergoing
intense discussion back on February 6 -- when I missed this)
noted:

> 2.2.2.  Category F - Exceptions
> 
>     F: cp in {00B7, 02B9, 0375, 0483, 05F3, 05F4, 3007, 303B, 30FB}
...
>     02B9; CONTEXTO  # MODIFIER LETTER PRIME
>        # Rule: Permitted only in context in which
>        #       0375 is permitted.
>     0375; CONTEXTO  # GREEK LOWER NUMERAL SIGN (KERAIA)
>        # No rule at present
>     0483; CONTEXTO  # COMBINING CYRILLIC TILTO
>        # No rule at present
...
>     While three of the characters (02B9, 0483 and 0375), plus Geresh and
>     Gershayim, appear to be special rules based on picking characters one
>     at a time, they actually reflect a character property that is not
>     (yet) defined for Unicode.  That character property might be
>     described as "indicates a numeric use in a script for which numbers
>     are represented by treating the letters (in collation order) as
>     digits".  Were that property to be created, these characters could be
>     removed from Category F and assigned to a separate category based on
>     the property.

The tables-05 document has now simplified this to:

    02B9; CONTEXTO  # MODIFIER LETTER PRIME
    0375; CONTEXTO  # GREEK LOWER NUMERAL SIGN (KERAIA)
    0483; CONTEXTO  # COMBINING CYRILLIC TILTO

    The characters 02B9, 0375 and 0483 are used in different scripts to
    indicate that an adjacent letter is being used with a numeric value.
    
Now I have to guess this all occurred in some discussion about
geresh and gershayim being in the exception list, which morphed
to asking what other "prime" lookalikes needed to be in the exception
list as well -- although I seem to have missed it, and still can't
find where this was justified.

But besides the typo in the name of U+0483 (it is "TITLO", not "TILTO"),
there seem to be other mistakes and problems here.

While it is true that geresh and gershayim are used in writing
numbers represented by Hebrew letters, that wasn't the main
rationale for including them in the exceptions list. The reason
for having them there is their common occurrence in Hebrew writing
to indicate initialisms and acronyms of various sorts.

The Greek dexia keraia (U+0374, canonically equivalent to U+02B9) and
aristeri keraia (U+0375) are used in the traditional indication
of numeric values with Greek letters. But unlike the geresh and
gershayim, they have no other use in regular Greek
orthography. Now since U+02B9 is General_Category=Lm,
and hence is automatically PVALID (and ends up in the Unicode
identifier categories), I can see a case for making an exception
to add U+0375, since it is in a paradigmatic set with U+0374
in Greek. And while U+0374 itself is DISALLOWED (because of
the NFKC(cp) != cp rule), U+02B9 would be PVALID -- so at
least some representations of traditional Greek numeric values
would be allowed. On the other hand, I don't see much need for
this -- and the argument for their inclusion based on analogy
to the Hebrew use of geresh and gershayim in numeric notation
is marginal at best. You could just as well go from that to
claiming that U+0023 '#' NUMBER SIGN also needs to be CONTEXTO,
because it is widely used together with digits in numeric
notations: "#2", and so forth.
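The normalization claim above can be checked with Python's standard
unicodedata module. This is only an illustrative sketch: the helper
name is_nfkc_stable is made up here, and it covers just the
NFKC(cp) != cp test, not the full table derivation.

```python
import unicodedata

def is_nfkc_stable(cp: int) -> bool:
    """Return True when NFKC(cp) == cp, i.e. the code point survives NFKC unchanged."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) == ch

# U+0374 is canonically equivalent to U+02B9, so it is not NFKC-stable;
# U+02B9 and U+0375 have no decomposition mapping and are stable.
for cp in (0x0374, 0x02B9, 0x0375):
    print(f"U+{cp:04X} NFKC-stable: {is_nfkc_stable(cp)}")
```
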

By the way, *if* U+0375 stays in the list, the annotation
for it should be changed to "aristeri keraia" (left keraia),
rather than just "KERAIA", because it is U+0374=U+02B9 which
is the ordinary keraia, the one occurring on numbers 1-999,
instead of 1000 and up.

But I don't see any real justification for going even further
and adding U+0483 COMBINING CYRILLIC TITLO to the exceptions
list. First of all, as for U+02B9, adding U+0483 seems to
be twisting the sense of an exceptions list in the first
place. The original exceptions were all characters
that would be DISALLOWED unless there was some exception
in the derivation to allow them. But U+0483 is General_Category=Mn,
and is thus automatically PVALID, because not otherwise
excluded.
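The General_Category values behind this argument can be inspected with
Python's standard unicodedata module. A minimal sketch, assuming the
letter/digit rule admits the General_Category values
{Ll, Lu, Lo, Nd, Lm, Mn, Mc}; the names LETTER_DIGIT_GC and
pvalid_by_category are hypothetical, and the other derivation rules
(NFKC stability, exceptions, and so on) are deliberately ignored.

```python
import unicodedata

# Assumed set of General_Category values admitted by the letter/digit rule.
# This is an assumption of the sketch, not a quotation from the draft.
LETTER_DIGIT_GC = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

def pvalid_by_category(cp: int) -> bool:
    """Return True when the code point's General_Category alone would admit it."""
    return unicodedata.category(chr(cp)) in LETTER_DIGIT_GC

# Compare the prime and titlo with geresh and gershayim.
for cp in (0x02B9, 0x0483, 0x05F3, 0x05F4):
    gc = unicodedata.category(chr(cp))
    print(f"U+{cp:04X} gc={gc} admitted by category: {pvalid_by_category(cp)}")
```

Under that assumption, U+02B9 (Lm) and U+0483 (Mn) pass on category
alone, while geresh and gershayim (both Po) need an exception to be
admitted at all.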

What has happened is that category F has morphed into an
override for the table derivation, rather than something useful
for a set definition. It contains both PVALID and CONTEXTO
as values, and applies both to characters that would otherwise
have been DISALLOWED and to characters that would have been
PVALID. As such, it has gotten quite messy again.

And the argument for including TITLO here seems to come down
to nothing other than the implied assertion that, since the
titlo was also used historically to mark letters as having
numeric values, it also needs to be CONTEXTO for IDN, along
with geresh and gershayim. (Draft -4e even implied
that there ought to be a Unicode character property defined
for this, but that justification got trimmed down to
an assertion of functional similarity in draft -5.) But
as with the geresh, this was only one use of the titlo in
Cyrillic text. And unlike the geresh, the titlo is a non-spacing
combining mark in Cyrillic. Calling this one Cyrillic
non-spacing mark out for special rule behavior in IDNA
doesn't really accomplish anything that I can tell. Why
does this *one* mark need a context rule, when all the
other marks do not, regardless of their semantics?

My recommendation would be to just remove 02B9, 0375, and
0483 from the category F exception list. Keeping them
there just adds complications to the table spec, for
no real apparent gain that I can tell.

--Ken


