Moving Right Along on the Inclusions Table...
Kenneth Whistler
kenw at sybase.com
Wed Dec 20 00:20:24 CET 2006
On Sat, 16 Dec 2006 11:58:43 +0100, Cary asked:
> Perhaps we can now take a look at the way the Hebrew script is being
> handled?
>
> The table currently excludes the U+05F3 HEBREW PUNCTUATION GERESH and
> the U+05F4 HEBREW PUNCTUATION GERSHAYIM. Although it is reassuring to
> note that we recognize the fundamental role they play in Hebrew and
> Ladino orthographies, and the likelihood of their appearing in the
> exception table, I am a bit more concerned about the main table
> permitting the U+059C HEBREW ACCENT GERESH and the U+059E HEBREW ACCENT
> GERSHAYIM. The latter pair, along with the other 28 HEBREW ACCENTs
> strike me as prime examples of what we explicitly need to be excluding.
> Can the rules, or the sequence of their application be modified to
> include the two characters that are missing, or are we stuck with
> allowing the dozens of characters that are not needed for IDN, and
> treating the remaining two as exceptions? One alternative would be
> simply to permit all Hebrew characters in the range 0591..05F4. (At
> least one of the three characters that would thereby be reintroduced,
> U+05BE HEBREW PUNCTUATION MAQAF, can be justified in its own right.)
> This would make nothing substantially worse, and would at least call
> registry attention to the fact that there are different kinds of geresh.
No one has responded on the specifics of these suggestions about
Hebrew, and the thread took off to discuss other, more general
issues that Cary brought up.
Also, no one has responded to the specific suggestions I had
made about possible omissions of some Hebrew accents and
Arabic Koranic annotation marks.
So, to move this along, I will make a large number of very specific
suggestions for omission of combining marks from the inclusion
table (and a few non-combining characters as well). The
results of that can be viewed at:
http://www.unicode.org/~whistler/SPInclusionList061219.txt
That has 127 fewer characters than the December 16 draft.
(The exact details of ranges removed and justification for
each are given below.)
To address Cary's concern about geresh, gershayim, and maqaf in
Hebrew, I have constructed a separate table (with only 3 entries
in it so far), which contains the explicit suggestions for
characters that should be *added* back to the inclusions table,
after having been removed by some more generic rule. In the
case of the 3 Hebrew characters, the fact that they are
listed as gc=Po (Punctuation, Other) in the Unicode Character
Database removed them from the candidate inclusions list very
early on in the rules. But once we do a script-by-script
review to check whether all the omissions make sense, we may
come up with instances, as for these three, where characters
belong in the inclusions list despite their General_Category
values. The results can be seen at:
http://www.unicode.org/~whistler/SPInclusionAdd061219.txt
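A minimal sketch of how such an add-back table could interact with a generic rule is given here. It is an illustration only, not the expression Mark uses to generate the tables: the names ADD_BACK, excluded_by_category, and in_inclusions are hypothetical, and the "generic rule" is reduced to a single stand-in test that drops every punctuation character (General_Category P*). The category values come from whatever Unicode version the local Python ships with.

```python
import unicodedata

# Hypothetical add-back table: the three Hebrew characters named in this
# message (maqaf, geresh, gershayim).
ADD_BACK = {0x05BE, 0x05F3, 0x05F4}


def excluded_by_category(cp: int) -> bool:
    """Stand-in for the generic rule: drop any punctuation (gc = P*)."""
    return unicodedata.category(chr(cp)).startswith("P")


def in_inclusions(cp: int) -> bool:
    """Apply the add-back exceptions first, then the generic rule."""
    if cp in ADD_BACK:
        return True
    return not excluded_by_category(cp)


if __name__ == "__main__":
    for cp in (0x05D0, 0x05C3, 0x05F3):
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {in_inclusions(cp)}")
```

Checking the exceptions before the generic rule keeps the add-back table small and explicit: a character is restored only because it is listed, not because a rule was loosened.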
OK, here are the exact details of what I am suggesting be
omitted next from the inclusions list. (This should then be
implemented as a series of exclusions by rule for the overall
expression that Mark is using to generate tables.)
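As a rough illustration of "exclusions by rule", the omit: lines in the list that follows can be treated as data and turned into a membership test. This is only a sketch under stated assumptions: OMIT_SPECS, parse_omit, is_omitted, and covered_code_points are hypothetical names, and the real table-generation expression is not shown in this message.

```python
# The "omit:" specifications from this message, keyed by their headings.
OMIT_SPECS = {
    "Common Diacritics": "0363..036F",
    "Hebrew": "0591..05AF, 05C4..05C5",
    "Arabic": "0610..0615, 06D6..06ED",
    "Syriac": "0740..074A",
    "Devanagari": "0953..0954",
    "Tibetan": "0F18..0F19, 0F35, 0F37, 0F3E..0F3F, 0FC6",
    "Khmer": "17D3",
    "Mongolian": "180B..180D",
    "Balinese": "1B6B..1B73",
    "Combining Diacritical Marks Supplement": "1DC0..1DC1, 1DC3",
    "CJK Symbols and Punctuation": "302A..302F, 3031..3035, 303B..303C",
    "Combining Half Marks": "FE20..FE23",
    "Arabic Presentation Forms-B": "FE73",
}


def parse_omit(spec: str) -> list[tuple[int, int]]:
    """Parse 'XXXX..YYYY, ZZZZ' into inclusive (low, high) pairs."""
    ranges = []
    for part in spec.split(","):
        low, _, high = part.strip().partition("..")
        ranges.append((int(low, 16), int(high or low, 16)))
    return ranges


OMIT_RANGES = [r for spec in OMIT_SPECS.values() for r in parse_omit(spec)]


def is_omitted(cp: int) -> bool:
    """True if the code point falls in any listed omit range."""
    return any(low <= cp <= high for low, high in OMIT_RANGES)


def covered_code_points() -> int:
    """Raw number of code points spanned by the omit ranges."""
    return sum(high - low + 1 for low, high in OMIT_RANGES)
```

Summing the ranges this way gives 130 code points. That is a raw range count, not the same measurement as the 127-character difference from the December 16 draft.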
Common Diacritics
omit: 0363..036F
reason: These combining Latin small letters (written above a base
letter) are specialist medievalist usage for manuscripts, and
are not a part of regular orthographies. They would also be
quite confusing for internet identifiers.
Hebrew
omit: 0591..05AF, 05C4..05C5
reason: 0591..05AF are the Hebrew accent marks Cary was talking about;
their major function is as cantillation marks, to help
in the chanting and singing of sacred texts. 05C4..05C5
are more marks used in the annotation of Biblical text,
and are not part of the regular pointing system for vowels.
Arabic
omit: 0610..0615, 06D6..06ED
reason: 0610..0615 are honorific annotations added to names
in text. 06D6..06ED are annotation marks used in Koranic
text, again mostly for guidance in chanting and singing
sacred text. None of these are part of regular orthographies,
and should not be confused with the harakat used for
indicating vowels in Arabic.
Syriac
omit: 0740..074A
reason: Again, these are marks used in annotating text, and need
to be distinguished from the regular vowel marks needed
for the orthography. There is no need for these annotation marks
for internet identifiers.
Devanagari
omit: 0953..0954
reason: These are the dubious clones of acute and grave accent
marks included in the Devanagari block. While not formally
deprecated, there is no obvious function for them in
Devanagari, and they are otherwise easily confused with
the common diacritic acute and grave accent marks.
Tibetan
omit: 0F18..0F19, 0F35, 0F37, 0F3E..0F3F, 0FC6
reason: Some of these are astrological signs, only used for special
purpose markup of digits (or occasionally other signs) in
Tibetan astrology. 0F35 and 0F37 are text highlighting
marks; they are used like underlining. 0FC6 is a
symbol diacritic, not used with regular Tibetan text.
Khmer
omit: 17D3
reason: This is a deprecated character originally intended as
part of the formation of lunar date symbols. It is not
used in regular text.
Mongolian
omit: 180B..180D
reason: These are the Mongolian-specific variation selectors.
They get automatically removed (by an earlier rule),
because they are Default_Ignorable_Code_Point. I am
just cleaning up my list here to match the rules to
date.
Balinese
omit: 1B6B..1B73
reason: These are combining marks only used in Balinese musical
notation, rather than in regular text.
Combining Diacritical Marks Supplement
omit: 1DC0..1DC1, 1DC3
reason: 1DC0..1DC1 are editorial signs for Ancient Greek, used only
in academic annotation. 1DC3 is a combining mark for
Glagolitic, a historic script already omitted from the list.
CJK Symbols and Punctuation
omit: 302A..302F
reason: These are tone mark annotations only used in nonstandard
annotations of Han characters or Hangul. They are not
part of either standard CJK orthographies or the commonly
encountered Latin transliterations for Chinese or Korean.
omit: 3031..3035, 303B..303C
reason: While these are not combining marks, they should also be
omitted from the inclusions list. 3031..3035 are special
character forms only appropriate for vertically-rendered
text and inappropriate for internet identifiers. 303B
is another vertical rendering form. And 303C is an
abbreviatory symbol that happens to equate to "masu"
in Japanese, but is not a part of the regular orthography
of Japanese.
Combining Half Marks
omit: FE20..FE23
reason: These are compatibility half forms, used only in the
mapping of certain legacy bibliographic character encodings.
They are not appropriate for normal Unicode text representation.
Arabic Presentation Forms-B
omit: FE73
reason: This is another oddball compatibility character, encoded only
for transcoding to some old IBM code pages. It has no
compatibility decomposition mapping, and so it was not
filtered by the NFKC(cp) != cp criterion. It should
simply be omitted by exception here because it is
inappropriate for use in internet identifiers.
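The NFKC(cp) != cp criterion mentioned for FE73 can be checked with Python's standard unicodedata module. The sketch below is only an illustration of that one test, with survives_nfkc as a hypothetical helper name; results depend on the Unicode version bundled with the local Python.

```python
import unicodedata


def survives_nfkc(cp: int) -> bool:
    """True if NFKC leaves the code point unchanged (not filtered)."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) == ch


if __name__ == "__main__":
    # FE70 has a compatibility decomposition; FE73 does not.
    for cp in (0xFE70, 0xFE73):
        ch = chr(cp)
        print(
            f"U+{cp:04X} {unicodedata.name(ch)}: "
            f"decomposition={unicodedata.decomposition(ch)!r}, "
            f"survives NFKC={survives_nfkc(cp)}"
        )
```

A neighbour such as FE70 is changed by NFKC and so is caught by the criterion, while FE73 passes through untouched, which is why it has to be listed by exception.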