Moving Right Along on the Inclusions Table...

Mark Davis mark.davis at icu-project.org
Wed Dec 20 00:45:05 CET 2006


Those are all reasonable changes.


   - We should also add the Joiner/NonJoiner. As discussed, however, they
   would be restricted to very specific contexts by additional clauses
   (like the current bidi restrictions).
   - I think those are the only ones from
   http://unicode.org/reports/tr31/#Specific_Character_Adjustments that
   we need to consider, but others may have more information.


(That also reminded me that
http://www.unicode.org/reports/tr31/#Backward_Compatibility is an example,
for those not deeply familiar with Unicode properties, of how derived
properties are stabilized.)

Mark

On 12/19/06, Kenneth Whistler <kenw at sybase.com> wrote:
>
> On Sat, 16 Dec 2006 11:58:43 +0100, Cary asked:
>
> > Perhaps we can now take a look at the way the Hebrew script is being
> > handled?
> >
> > The table currently excludes the U+05F3 HEBREW PUNCTUATION GERESH and
> > the U+05F4 HEBREW PUNCTUATION GERSHAYIM. Although it is reassuring to
> > note that we recognize the fundamental role they play in Hebrew and
> > Ladino orthographies, and the likelihood of their appearing in the
> > exception table, I am a bit more concerned about the main table
> > permitting the U+059C HEBREW ACCENT GERESH and the U+059E HEBREW ACCENT
> > GERSHAYIM. The latter pair, along with the other 28 HEBREW ACCENTs,
> > strike me as prime examples of what we explicitly need to be excluding.
>
> > Can the rules, or the sequence of their application, be modified to
> > include the two characters that are missing, or are we stuck with
> > allowing the dozens of characters that are not needed for IDN, and
> > treating the remaining two as exceptions?  One alternative would be
> > simply to permit all Hebrew characters in the range 0591..05F4. (At
> > least one of the three characters that would thereby be reintroduced,
> > U+05BE HEBREW PUNCTUATION MAQAF, can be justified in its own right.)
> > This would make nothing substantially worse, and would at least call
> > registry attention to the fact that there are different kinds of geresh.
>
> No one has responded on the specifics of these suggestions about
> Hebrew, and the thread took off to discuss other more general
> issues that Cary brought up.
>
> Also, no one has responded regarding the specific suggestions I
> had made regarding possible omissions of some Hebrew accents and
> Arabic Koranic annotation marks.
>
> So, to move this along, I will make a large number of very specific
> suggestions for omission of combining marks from the inclusion
> table (and a few non-combining characters as well). The
> results of that can be viewed at:
>
> http://www.unicode.org/~whistler/SPInclusionList061219.txt
>
> That has 127 fewer characters than the December 16 draft.
> (The exact details of ranges removed and justification for
> each are given below.)
>
> To address Cary's concern about geresh, gershayim, and maqaf in
> Hebrew, I have constructed a separate table (with only 3 entries
> in it so far), which contains the explicit suggestions for
> characters that should be *added* back to the inclusions table,
> after having been removed by some more generic rule. In the
> case of the 3 Hebrew characters, the fact that they are
> listed as gc=Po (Punctuation, Other) in the Unicode Character
> Database removed them from the candidate inclusions list very
> early on in the rules. But once we do a script-by-script
> review to check whether all the omissions make sense, we may
> come up with instances, as for these three, where characters
> belong in the inclusions list despite their General_Category
> values. The results can be seen at:
>
> http://www.unicode.org/~whistler/SPInclusionAdd061219.txt
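
[Editorial sketch] The two-step idea described above (a generic rule
removes characters by General_Category, then a small add-back table
restores specific exceptions) might look roughly like the following.
This is only an illustration under stated assumptions: the function
names are hypothetical, the "generic rule" is a simplified stand-in for
the real rule set, and Python's unicodedata module reflects a later
Unicode version than the one in use in 2006, so individual
General_Category values may differ from those cited in the message.

```python
import unicodedata

# The three Hebrew code points named in the message; everything else
# here (names, structure) is an illustrative assumption.
ADD_BACK = {0x05BE, 0x05F3, 0x05F4}  # MAQAF, GERESH, GERSHAYIM


def generic_rule_keeps(cp: int) -> bool:
    """Simplified stand-in for the generic rules: drop unassigned
    code points and anything whose General_Category is punctuation."""
    gc = unicodedata.category(chr(cp))
    return gc != "Cn" and not gc.startswith("P")


def inclusions(candidates, add_back=ADD_BACK):
    """Apply the generic rule, then restore the explicit exceptions."""
    kept = {cp for cp in candidates if generic_rule_keeps(cp)}
    return kept | set(add_back)


hebrew_block = range(0x0590, 0x0600)
with_exceptions = inclusions(hebrew_block)
without_exceptions = inclusions(hebrew_block, add_back=set())
```

Keeping the add-back table separate from the generic rules means a
script-by-script review only has to touch that one small table.
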
>
> OK, here are the exact details of what I am suggesting be
> next omitted from the inclusions list. (This should then be
> implemented as a series of exclusions by rule for the overall
> expression that Mark is using to generate tables.)
>
> Common Diacritics
>
>   omit:    0363..036F
>
>   reason:  These combining Latin letters, written above the base
>            letter, are specialist medievalist usage for manuscripts,
>            and are not a part of regular orthographies. They would
>            also be quite confusing for internet identifiers.
>
> Hebrew
>
>   omit:    0591..05AF, 05C4..05C5
>
>   reason:  0591..05AF are the Hebrew accent marks Cary was talking about;
>            their major function is as cantillation marks, to help
>            in the chanting and singing of sacred texts. 05C4..05C5
>            are more marks used in the annotation of Biblical text,
>            and are not part of the regular pointing system for vowels.
>
> Arabic
>
>   omit:    0610..0615, 06D6..06ED
>
>   reason:  0610..0615 are honorific annotations added to names
>            in text. 06D6..06ED are annotation marks used in Koranic
>            text, again mostly for guidance in chanting and singing
>            sacred text. None of these are part of regular orthographies,
>            and should not be confused with the harakat used for
>            indicating vowels in Arabic.
>
> Syriac
>
>   omit:    0740..074A
>
>   reason:  Again, these are marks used in annotating text, and need
>            to be distinguished from the regular vowel marks needed
>            for the orthography. There is no need for these annotation
>            marks for internet identifiers.
>
> Devanagari
>
>   omit:    0953..0954
>
>   reason:  These are the dubious clones of acute and grave accent
>            marks included in the Devanagari block. While not formally
>            deprecated, there is no obvious function for them in
>            Devanagari, and they are otherwise easily confused with
>            the common diacritic acute and grave accent marks.
>
> Tibetan
>
>   omit:    0F18..0F19, 0F35, 0F37, 0F3E..0F3F, 0FC6
>
>   reason:  Some of these are astrological signs, only used for special
>            purpose markup of digits (or occasionally other signs) in
>            Tibetan astrology. 0F35 and 0F37 are text highlighting
>            marks; they are used like underlining. 0FC6 is a
>            symbol diacritic, not used with regular Tibetan text.
>
> Khmer
>
>   omit:    17D3
>
>   reason:  This is a deprecated character originally intended as
>            part of the formation of lunar date symbols. It is not
>            used in regular text.
>
> Mongolian
>
>   omit:    180B..180D
>
>   reason:  These are the Mongolian-specific variation selectors.
>            They get automatically removed (by an earlier rule),
>            because they are Default_Ignorable_Code_Point. I am
>            just cleaning up my list here to match the rules to
>            date.
>
> Balinese
>
>   omit:    1B6B..1B73
>
>   reason:  These are combining marks only used in Balinese musical
>            notation, rather than in regular text.
>
> Combining Diacritical Marks Supplement
>
>   omit:    1DC0..1DC1, 1DC3
>
>   reason:  1DC0..1DC1 are editorial signs for Ancient Greek, used only
>            in academic annotation. 1DC3 is a combining mark for
>            Glagolitic, a historic script already omitted from the list.
>
> CJK Symbols and Punctuation
>
>   omit:    302A..302F
>
>   reason:  These are tone mark annotations only used in nonstandard
>            annotations of Han characters or Hangul. They are not
>            part of either standard CJK orthographies or the commonly
>            encountered Latin transliterations for Chinese or Korean.
>
>   omit:    3031..3035, 303B..303C
>
>   reason:  While these are not combining marks, they should also be
>            omitted from the inclusions list. 3031..3035 are special
>            character forms only appropriate for vertically-rendered
>            text and inappropriate for internet identifiers. 303B
>            is another vertical rendering form. And 303C is an
>            abbreviatory symbol that happens to equate to "masu"
>            in Japanese, but is not a part of the regular orthography
>            of Japanese.
>
> Combining Half Marks
>
>   omit:    FE20..FE23
>
>   reason:  These are compatibility half forms, used only in the
>            mapping of certain legacy bibliographic character encodings.
>            They are not appropriate for normal Unicode text
>            representation.
>
> Arabic Presentation Forms-B
>
>   omit:    FE73
>
>   reason:  This is another oddball compatibility character, encoded only
>            for transcoding to some old IBM code pages, but which doesn't
>            have any compatibility decomposition mapping, and so which
>            didn't get filtered by the NFKC(cp) != cp criterion. It should
>            simply be omitted by exception here because it is
>            inappropriate for use in internet identifiers.
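
[Editorial sketch] The NFKC(cp) != cp criterion mentioned for U+FE73
can be checked with Python's standard unicodedata module. The helper
names and the EXCEPTION_OMIT set below are hypothetical; only the
criterion itself and the code point U+FE73 come from the message. This
is a sketch of the idea, not the expression actually used to generate
the tables.

```python
import unicodedata

# Characters omitted "by exception" because no generic rule catches them.
# Only U+FE73 is taken from the message; the set name is an assumption.
EXCEPTION_OMIT = {0xFE73}  # ARABIC TAIL FRAGMENT


def survives_nfkc_filter(cp: int) -> bool:
    """True if the character is unchanged by NFKC, i.e. the
    NFKC(cp) != cp criterion would NOT filter it out."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) == ch


def allowed(cp: int) -> bool:
    """NFKC filter first, then the explicit exception list."""
    return survives_nfkc_filter(cp) and cp not in EXCEPTION_OMIT
```

A neighbouring presentation form such as U+FE70 does have a
compatibility decomposition and is caught by the filter alone, whereas
U+FE73 passes the filter and has to be listed explicitly.
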
>
>
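
[Editorial sketch] The "series of exclusions by rule" could be kept as
a plain table of inclusive code point ranges, keyed by the headings
used above. The ranges are copied from the message; the data layout and
function names are illustrative assumptions, not the actual table
generator.

```python
# Inclusive (first, last) code point ranges proposed for omission.
OMIT_RANGES = {
    "Common Diacritics": [(0x0363, 0x036F)],
    "Hebrew": [(0x0591, 0x05AF), (0x05C4, 0x05C5)],
    "Arabic": [(0x0610, 0x0615), (0x06D6, 0x06ED)],
    "Syriac": [(0x0740, 0x074A)],
    "Devanagari": [(0x0953, 0x0954)],
    "Tibetan": [(0x0F18, 0x0F19), (0x0F35, 0x0F35), (0x0F37, 0x0F37),
                (0x0F3E, 0x0F3F), (0x0FC6, 0x0FC6)],
    "Khmer": [(0x17D3, 0x17D3)],
    "Mongolian": [(0x180B, 0x180D)],
    "Balinese": [(0x1B6B, 0x1B73)],
    "Combining Diacritical Marks Supplement": [(0x1DC0, 0x1DC1),
                                               (0x1DC3, 0x1DC3)],
    "CJK Symbols and Punctuation": [(0x302A, 0x302F), (0x3031, 0x3035),
                                    (0x303B, 0x303C)],
    "Combining Half Marks": [(0xFE20, 0xFE23)],
    "Arabic Presentation Forms-B": [(0xFE73, 0xFE73)],
}


def omitted_by(cp):
    """Return the heading whose range covers cp, or None if cp is kept."""
    for heading, ranges in OMIT_RANGES.items():
        if any(first <= cp <= last for first, last in ranges):
            return heading
    return None


def apply_omissions(candidates):
    """Drop every candidate code point covered by an omission range."""
    return {cp for cp in candidates if omitted_by(cp) is None}
```

The number of code points spanned by these ranges need not equal the
127 quoted above, since some of them (the Mongolian variation
selectors, for example) had already been removed by earlier rules.
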