Moving Right Along on the Inclusions Table...

Mark Davis mark.davis at
Wed Dec 20 00:45:05 CET 2006

Those are all reasonable changes.

   - We should also add the Joiner/NonJoiner. As discussed, however, they
   would be restricted to very specific contexts by additional clauses
   (like the current bidi restrictions).
   - I think those are the only ones from that
   we need to consider, but others may have more information.

(That also reminded me that is an example,
for those not deeply familiar with Unicode properties, of how derived
properties are stabilized.)


On 12/19/06, Kenneth Whistler <kenw at> wrote:
> On Sat, 16 Dec 2006 11:58:43 +0100, Cary asked:
> > Perhaps we can now take a look at the way the Hebrew script is being
> > handled?
> >
> > The table currently excludes the U+05F3 HEBREW PUNCTUATION GERESH and
> > the U+05F4 HEBREW PUNCTUATION GERSHAYIM. Although it is reassuring to
> > note that we recognize the fundamental role they play in Hebrew and
> > Ladino orthographies, and the likelihood of their appearing in the
> > exception table, I am a bit more concerned about the main table
> > permitting the U+059C HEBREW ACCENT GERESH and the U+059E HEBREW ACCENT
> > GERSHAYIM. The latter pair, along with the other 28 HEBREW ACCENTs,
> > strike me as prime examples of what we explicitly need to be excluding.
> > Can the rules, or the sequence of their application be modified to
> > include the two characters that are missing, or are we stuck with
> > allowing the dozens of characters that are not needed for IDN, and
> > treating the remaining two as exceptions?  One alternative would be
> > simply to permit all Hebrew characters in the range 0591..05F4. (At
> > least one of the three characters that would thereby be reintroduced,
> > U+05BE HEBREW PUNCTUATION MAQAF, can be justified in its own right.)
> > This would make nothing substantially worse, and would at least call
> > registry attention to the fact that there are different kinds of geresh.
> No one has responded on the specifics of these suggestions about
> Hebrew, and the thread took off to discuss other more general
> issues that Cary brought up.
> Also, no one has responded regarding the specific suggestions I
> had made regarding possible omissions of some Hebrew accents and
> Arabic Koranic annotation marks.
> So, to move this along, I will make a large number of very specific
> suggestions for omission of combining marks from the inclusion
> table (and a few non-combining characters as well). The
> results of that can be viewed at:
> That has 127 fewer characters than the December 16 draft.
> (The exact details of ranges removed and justification for
> each are given below.)
> To address Cary's concern about geresh, gershayim, and maqaf in
> Hebrew, I have constructed a separate table (with only 3 entries
> in it so far), which contains the explicit suggestions for
> characters that should be *added* back to the inclusions table,
> after having been removed by some more generic rule. In the
> case of the 3 Hebrew characters, the fact that they are
> listed as gc=Po (Punctuation, Other) in the Unicode Character
> Database removed them from the candidate inclusions list very
> early on in the rules. But once we do a script-by-script
> review to check whether all the omissions make sense, we may
> come up with instances, as for these three, where characters
> belong in the inclusions list despite their General_Category
> values. The results can be seen at:
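
The "remove by a generic rule, then add back by exception" pattern described above can be sketched roughly as follows. This is only an illustration, not the expression actually used to generate the tables: the names ADD_BACK, excluded_by_generic_rule, and in_inclusions are hypothetical, and the generic rule is reduced to "drop all punctuation". The General_Category values come from whatever Unicode version the local Python ships, so the exact subcategory printed may differ from the values in effect when this was written.

```python
import unicodedata

# Hypothetical add-back table: the three Hebrew characters discussed above.
ADD_BACK = {0x05BE, 0x05F3, 0x05F4}  # maqaf, geresh, gershayim


def excluded_by_generic_rule(cp: int) -> bool:
    """Simplified stand-in for the generic rules: drop all punctuation (gc=P*)."""
    return unicodedata.category(chr(cp)).startswith("P")


def in_inclusions(cp: int) -> bool:
    """Apply the generic rule first, then restore explicit add-backs."""
    return cp in ADD_BACK or not excluded_by_generic_rule(cp)


# Show the category and the final decision for each add-back character.
for cp in sorted(ADD_BACK):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)} gc={unicodedata.category(ch)} "
          f"included={in_inclusions(cp)}")
```

Keeping the add-backs in a separate table, applied after the generic rules, means each exception stays visible and reviewable script by script.
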
> OK, here are the exact details of what I am suggesting be
> next omitted from the inclusions list. (This should then be
> implemented as a series of exclusions by rule for the overall
> expression that Mark is using to generate tables.)
> Common Diacritics
>   omit:    0363..036F
>   reason:  These combining Latin letters (placed above a base
>            character) are specialist medievalist usage for
>            manuscripts, and are not a part of regular orthographies.
>            They would also be quite confusing for internet identifiers.
> Hebrew
>   omit:    0591..05AF, 05C4..05C5
>   reason:  0591..05AF are the Hebrew accent marks Cary was talking about;
>            their major function is as cantillation marks, to help
>            in the chanting and singing of sacred texts. 05C4..05C5
>            are more marks used in the annotation of Biblical text,
>            and are not part of the regular pointing system for vowels.
> Arabic
>   omit:    0610..0615, 06D6..06ED
>   reason:  0610..0615 are honorific annotations added to names
>            in text. 06D6..06ED are annotation marks used in Koranic
>            text, again mostly for guidance in chanting and singing
>            sacred text. None of these are part of regular orthographies,
>            and should not be confused with the harakat used for
>            indicating vowels in Arabic.
> Syriac
>   omit:    0740..074A
>   reason:  Again, these are marks used in annotating text, and need
>            to be distinguished from the regular vowel marks needed
>            for the orthography. There is no need for these annotation
>            marks for internet identifiers.
> Devanagari
>   omit:    0953..0954
>   reason:  These are the dubious clones of acute and grave accent
>            marks included in the Devanagari block. While not formally
>            deprecated, there is no obvious function for them in
>            Devanagari, and they are otherwise easily confused with
>            the common diacritic acute and grave accent marks.
> Tibetan
>   omit:    0F18..0F19, 0F35, 0F37, 0F3E..0F3F, 0FC6
>   reason:  Some of these are astrological signs, only used for special
>            purpose markup of digits (or occasionally other signs) in
>            Tibetan astrology. 0F35 and 0F37 are text highlighting
>            marks; they are used like underlining. 0FC6 is a
>            symbol diacritic, not used with regular Tibetan text.
> Khmer
>   omit:    17D3
>   reason:  This is a deprecated character originally intended as
>            part of the formation of lunar date symbols. It is not
>            used in regular text.
> Mongolian
>   omit:    180B..180D
>   reason:  These are the Mongolian-specific variation selectors.
>            They get automatically removed (by an earlier rule),
>            because they are Default_Ignorable_Code_Point. I am
>            just cleaning up my list here to match the rules to
>            date.
> Balinese
>   omit:    1B6B..1B73
>   reason:  These are combining marks only used in Balinese musical
>            notation, rather than in regular text.
> Combining Diacritical Marks Supplement
>   omit:    1DC0..1DC1, 1DC3
>   reason:  1DC0..1DC1 are editorial signs for Ancient Greek, used only
>            in academic annotation. 1DC3 is a combining mark for
>            Glagolitic, a historic script already omitted from the list.
> CJK Symbols and Punctuation
>   omit:    302A..302F
>   reason:  These are tone mark annotations only used in nonstandard
>            annotations of Han characters or Hangul. They are not
>            part of either standard CJK orthographies or the commonly
>            encountered Latin transliterations for Chinese or Korean.
>   omit:    3031..3035, 303B..303C
>   reason:  While these are not combining marks, they should also be
>            omitted from the inclusions list. 3031..3035 are special
>            character forms only appropriate for vertically-rendered
>            text and inappropriate for internet identifiers. 303B
>            is another vertical rendering form. And 303C is an
>            abbreviatory symbol that happens to equate to "masu"
>            in Japanese, but is not a part of the regular orthography
>            of Japanese.
> Combining Half Marks
>   omit:    FE20..FE23
>   reason:  These are compatibility half forms, used only in the
>            mapping of certain legacy bibliographic character encodings.
>            They are not appropriate for normal Unicode text
>            representation.
> Arabic Presentation Forms-B
>   omit:    FE73
>   reason:  This is another oddball compatibility character, encoded only
>            for transcoding to some old IBM code pages. It has no
>            compatibility decomposition mapping, so it was not filtered
>            by the NFKC(cp) != cp criterion. It should simply be omitted
>            by exception here because it is inappropriate for use in
>            internet identifiers.
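
For anyone who wants to check the list mechanically, here is a minimal sketch that transcribes the omission ranges above into a lookup and also illustrates the NFKC(cp) != cp criterion mentioned for FE73. The names OMIT_RANGES, is_omitted, omitted_code_point_count, and survives_nfkc are hypothetical, and this is not the rule expression used to build the tables. The NFKC results depend on the Unicode version bundled with the local Python.

```python
import unicodedata

# Inclusive ranges transcribed from the list above (single code points as N..N).
OMIT_RANGES = [
    (0x0363, 0x036F),                                      # Common Diacritics
    (0x0591, 0x05AF), (0x05C4, 0x05C5),                    # Hebrew
    (0x0610, 0x0615), (0x06D6, 0x06ED),                    # Arabic
    (0x0740, 0x074A),                                      # Syriac
    (0x0953, 0x0954),                                      # Devanagari
    (0x0F18, 0x0F19), (0x0F35, 0x0F35), (0x0F37, 0x0F37),  # Tibetan
    (0x0F3E, 0x0F3F), (0x0FC6, 0x0FC6),
    (0x17D3, 0x17D3),                                      # Khmer
    (0x180B, 0x180D),                                      # Mongolian
    (0x1B6B, 0x1B73),                                      # Balinese
    (0x1DC0, 0x1DC1), (0x1DC3, 0x1DC3),                    # Diacriticals Suppl.
    (0x302A, 0x302F), (0x3031, 0x3035), (0x303B, 0x303C),  # CJK Symbols
    (0xFE20, 0xFE23),                                      # Combining Half Marks
    (0xFE73, 0xFE73),                                      # Arabic Pres. Forms-B
]


def is_omitted(cp: int) -> bool:
    """True if cp falls inside any of the suggested omission ranges."""
    return any(lo <= cp <= hi for lo, hi in OMIT_RANGES)


def omitted_code_point_count() -> int:
    """Number of code point positions covered by the ranges."""
    return sum(hi - lo + 1 for lo, hi in OMIT_RANGES)


def survives_nfkc(cp: int) -> bool:
    """True if NFKC(cp) == cp, i.e. the NFKC(cp) != cp filter does not catch it."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) == ch


print(omitted_code_point_count())
print(is_omitted(0x0591), is_omitted(0x05B0))
print(survives_nfkc(0xFE73), survives_nfkc(0xFE70))
```

The ranges cover 130 code point positions. Setting aside the three Mongolian variation selectors, which an earlier rule already removes, leaves 127, which appears consistent with the count quoted above, assuming that is how it was tallied.
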