Moving Right Along on the Inclusions Table...

Wed Dec 20 00:20:24 CET 2006

On Sat, 16 Dec 2006 11:58:43 +0100, Cary asked:

> Perhaps we can now take a look at the way the Hebrew script is being
> handled?
> 
> The table currently excludes the U+05F3 HEBREW PUNCTUATION GERESH and
> the U+05F4 HEBREW PUNCTUATION GERSHAYIM. Although it is reassuring to
> note that we recognize the fundamental role they play in Hebrew and
> Ladino orthographies, and the likelihood of their appearing in the
> exception table, I am a bit more concerned about the main table
> permitting the U+059C HEBREW ACCENT GERESH and the U+059E HEBREW ACCENT
> GERSHAYIM. The latter pair, along with the other 28 HEBREW ACCENTs
> strike me as prime examples of what we explicitly need to be excluding.

> Can the rules, or the sequence of their application be modified to
> include the two characters that are missing, or are we stuck with
> allowing the dozens of characters that are not needed for IDN, and
> treating the remaining two as exceptions?  One alternative would be
> simply to permit all Hebrew characters in the range 0591..05F4. (At
> least one of the three characters that would thereby be reintroduced,
> U+05BE HEBREW PUNCTUATION MAQAF, can be justified in its own right.)
> This would make nothing substantially worse, and would at least call
> registry attention to the fact that there are different kinds of geresh.

No one has responded on the specifics of these suggestions about
Hebrew, and the thread took off to discuss other more general
issues that Cary brought up.

Also, no one has responded to the specific suggestions I had
made about possible omissions of some Hebrew accents and
Arabic Koranic annotation marks.

So, to move this along, I will make a large number of very specific
suggestions for omission of combining marks from the inclusion
table (and a few non-combining characters as well). The
results of that can be viewed at:

http://www.unicode.org/~whistler/SPInclusionList061219.txt

That has 127 fewer characters than the December 16 draft.
(The exact details of ranges removed and justification for
each are given below.)

To address Cary's concern about geresh, gershayim, and maqaf in
Hebrew, I have constructed a separate table (with only 3 entries
in it so far), which contains the explicit suggestions for
characters that should be *added* back to the inclusions table,
after having been removed by some more generic rule. In the
case of the 3 Hebrew characters, the fact that they are
listed as gc=Po (Punctuation, Other) in the Unicode Character
Database removed them from the candidate inclusions list very
early on in the rules. But once we do a script-by-script
review to check whether all the omissions make sense, we may
come up with instances, as for these three, where characters
belong in the inclusions list despite their General_Category
values. The results can be seen at:

http://www.unicode.org/~whistler/SPInclusionAdd061219.txt
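
To make the interaction between a generic rule and the add-back table
concrete, here is a minimal sketch in Python. It is not the expression
actually used to generate the tables. The names ADD_BACK,
passes_generic_rules, and is_included are hypothetical, and the generic
rule is reduced to "reject all punctuation" as a stand-in. The gc value
that Python reports depends on the Unicode version it bundles, which is
why the sketch tests the whole punctuation class rather than Po alone.

```python
import unicodedata

# Hypothetical add-back table: the three Hebrew characters discussed above.
ADD_BACK = {0x05BE, 0x05F3, 0x05F4}

def passes_generic_rules(cp: int) -> bool:
    """Stand-in for the generic rules: reject any punctuation (gc=P*)."""
    return not unicodedata.category(chr(cp)).startswith("P")

def is_included(cp: int) -> bool:
    """The add-back table is checked first, so it overrides the generic rule."""
    return cp in ADD_BACK or passes_generic_rules(cp)

if __name__ == "__main__":
    for cp in sorted(ADD_BACK):
        ch = chr(cp)
        print(f"U+{cp:04X} {unicodedata.name(ch)} "
              f"gc={unicodedata.category(ch)} "
              f"generic={passes_generic_rules(cp)} final={is_included(cp)}")
```

Checking the add-back table first keeps the generic rules unchanged,
so each script-by-script exception stays visible as an explicit entry.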

OK, here are the exact details of what I am suggesting be omitted
next from the inclusions list. (These should then be implemented as
a series of exclusions by rule in the overall expression that Mark
is using to generate the tables.)
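
As a rough illustration of what "exclusions by rule" could look like,
the following Python sketch collects the ranges given in the sections
of this message into one table and filters a candidate list against it.
The names OMIT_RANGES, is_omitted, and apply_omissions are hypothetical,
and this is only one possible encoding, not the form of the expression
Mark is using.

```python
# Inclusive (first, last) ranges transcribed from the sections of this message.
OMIT_RANGES = [
    (0x0363, 0x036F),                                      # Common Diacritics
    (0x0591, 0x05AF), (0x05C4, 0x05C5),                    # Hebrew
    (0x0610, 0x0615), (0x06D6, 0x06ED),                    # Arabic
    (0x0740, 0x074A),                                      # Syriac
    (0x0953, 0x0954),                                      # Devanagari
    (0x0F18, 0x0F19), (0x0F35, 0x0F35), (0x0F37, 0x0F37),  # Tibetan
    (0x0F3E, 0x0F3F), (0x0FC6, 0x0FC6),                    # Tibetan
    (0x17D3, 0x17D3),                                      # Khmer
    (0x180B, 0x180D),                                      # Mongolian
    (0x1B6B, 0x1B73),                                      # Balinese
    (0x1DC0, 0x1DC1), (0x1DC3, 0x1DC3),                    # Diacritics Suppl.
    (0x302A, 0x302F), (0x3031, 0x3035), (0x303B, 0x303C),  # CJK Symbols
    (0xFE20, 0xFE23),                                      # Combining Half Marks
    (0xFE73, 0xFE73),                                      # Arabic Pres. Forms-B
]

def is_omitted(cp: int) -> bool:
    """True if the code point falls in any of the omitted ranges."""
    return any(first <= cp <= last for first, last in OMIT_RANGES)

def apply_omissions(candidates):
    """Drop omitted code points from an iterable of candidate code points."""
    return [cp for cp in candidates if not is_omitted(cp)]
```

For example, apply_omissions([0x05B0, 0x0591]) keeps only 0x05B0, since
0591 falls in the omitted Hebrew accent range and 05B0 does not.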

Common Diacritics

  omit:    0363..036F

  reason:  These combining Latin small letters, which are placed above
           a base letter, are for specialist medievalist use in
           transcribing manuscripts, and are not a part of regular
           orthographies. They would also be quite confusing in
           internet identifiers.

Hebrew

  omit:    0591..05AF, 05C4..05C5

  reason:  0591..05AF are the Hebrew accent marks Cary was talking about;
           their major function is as cantillation marks, to help
           in the chanting and singing of sacred texts. 05C4..05C5
           are more marks used in the annotation of Biblical text,
           and are not part of the regular pointing system for vowels.

Arabic

  omit:    0610..0615, 06D6..06ED

  reason:  0610..0615 are honorific annotations added to names
           in text. 06D6..06ED are annotation marks used in Koranic
           text, again mostly for guidance in chanting and singing
           sacred text. None of these are part of regular orthographies,
           and they should not be confused with the harakat used for
           indicating vowels in Arabic.

Syriac

  omit:    0740..074A

  reason:  Again, these are marks used in annotating text, and need
           to be distinguished from the regular vowel marks needed
           for the orthography. These annotation marks are not needed
           in internet identifiers.

Devanagari

  omit:    0953..0954

  reason:  These are the dubious clones of acute and grave accent
           marks included in the Devanagari block. While not formally
           deprecated, there is no obvious function for them in
           Devanagari, and they are otherwise easily confused with
           the common diacritic acute and grave accent marks.

Tibetan

  omit:    0F18..0F19, 0F35, 0F37, 0F3E..0F3F, 0FC6

  reason:  0F18..0F19 and 0F3E..0F3F are astrological signs, only used
           for special-purpose markup of digits (or occasionally other
           signs) in Tibetan astrology. 0F35 and 0F37 are text
           highlighting marks; they are used like underlining. 0FC6 is
           a symbol diacritic, not used with regular Tibetan text.

Khmer

  omit:    17D3

  reason:  This is a deprecated character originally intended as
           part of the formation of lunar date symbols. It is not
           used in regular text.

Mongolian

  omit:    180B..180D

  reason:  These are the Mongolian-specific variation selectors.
           They get automatically removed (by an earlier rule),
           because they have the Default_Ignorable_Code_Point
           property. I am just cleaning up my list here to match
           the rules to date.

Balinese

  omit:    1B6B..1B73

  reason:  These are combining marks only used in Balinese musical
           notation, rather than in regular text.

Combining Diacritical Marks Supplement

  omit:    1DC0..1DC1, 1DC3

  reason:  1DC0..1DC1 are editorial signs for Ancient Greek, used only
           in academic annotation. 1DC3 is a combining mark for
           Glagolitic, a historic script already omitted from the list.

CJK Symbols and Punctuation

  omit:    302A..302F

  reason:  These are tone mark annotations only used in nonstandard
           annotations of Han characters or Hangul. They are not
           part of either standard CJK orthographies or the commonly
           encountered Latin transliterations for Chinese or Korean.

  omit:    3031..3035, 303B..303C

  reason:  While these are not combining marks, they should also be
           omitted from the inclusions list. 3031..3035 are special
           character forms only appropriate for vertically-rendered
           text and inappropriate for internet identifiers. 303B
           is another vertical rendering form. And 303C is an
           abbreviatory symbol that happens to equate to "masu"
           in Japanese, but is not a part of the regular orthography
           of Japanese.

Combining Half Marks

  omit:    FE20..FE23

  reason:  These are compatibility half forms, used only in the
           mapping of certain legacy bibliographic character encodings.
           They are not appropriate for normal Unicode text representation.

Arabic Presentation Forms-B

  omit:    FE73

  reason:  This is another oddball compatibility character, encoded only
           for transcoding to some old IBM code pages. It has no
           compatibility decomposition mapping, and so was not filtered
           out by the NFKC(cp) != cp criterion. It should simply be
           omitted by exception here, because it is inappropriate for
           use in internet identifiers.
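
The NFKC(cp) != cp criterion mentioned for FE73 can be checked with
Python's standard unicodedata module. This is a minimal sketch: the
names filtered_by_nfkc, EXCEPTION_OMIT, and is_omitted_compat are
hypothetical and are not part of the rule set as actually implemented.
FE70 is used for contrast because it does have a compatibility
decomposition.

```python
import unicodedata

def filtered_by_nfkc(cp: int) -> bool:
    """True if NFKC changes the character, i.e. NFKC(cp) != cp."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) != ch

# Hypothetical exception table for characters that the NFKC test
# cannot catch because they have no decomposition mapping.
EXCEPTION_OMIT = {0xFE73}

def is_omitted_compat(cp: int) -> bool:
    """Omit if caught by the NFKC criterion or listed as an exception."""
    return filtered_by_nfkc(cp) or cp in EXCEPTION_OMIT

if __name__ == "__main__":
    for cp in (0xFE70, 0xFE73):
        print(f"U+{cp:04X} nfkc_changes={filtered_by_nfkc(cp)} "
              f"omitted={is_omitted_compat(cp)}")
```

FE73 passes through the NFKC test unchanged, so only the explicit
exception entry removes it.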