Moving Right Along on the Inclusions Table...
mark.davis at icu-project.org
Wed Dec 20 00:45:05 CET 2006
Those are all reasonable changes.
- We should also add the Joiner/NonJoiner. As discussed, however, they
would be restricted to very specific contexts by additional clauses
(like the current bidi restrictions).
- I think those are the only ones from
we need to consider, but others may have more information.
(That also reminded me that
http://www.unicode.org/reports/tr31/#Backward_Compatibility is an example,
for those not deeply familiar with Unicode properties, of how derived
properties are stabilized.)
On 12/19/06, Kenneth Whistler <kenw at sybase.com> wrote:
> > Date: Sat, 16 Dec 2006 11:58:43 +0100
> Cary asked:
> > Perhaps we can now take a look at the way the Hebrew script is being
> > handled?
> > The table currently excludes the U+05F3 HEBREW PUNCTUATION GERESH and
> > the U+05F4 HEBREW PUNCTUATION GERSHAYIM. Although it is reassuring to
> > note that we recognize the fundamental role they play in Hebrew and
> > Ladino orthographies, and the likelihood of their appearing in the
> > exception table, I am a bit more concerned about the main table
> > permitting the U+059C HEBREW ACCENT GERESH and the U+059E HEBREW ACCENT
> > GERSHAYIM. The latter pair, along with the other 28 HEBREW ACCENTs,
> > strikes me as a prime example of what we explicitly need to be excluding.
> > Can the rules, or the sequence of their application, be modified to
> > include the two characters that are missing, or are we stuck with
> > allowing the dozens of characters that are not needed for IDN, and
> > treating the remaining two as exceptions? One alternative would be
> > simply to permit all Hebrew characters in the range 0591..05F4. (At
> > least one of the three characters that would thereby be reintroduced,
> > U+05BE HEBREW PUNCTUATION MAQAF, can be justified in its own right.)
> > This would make nothing substantially worse, and would at least call
> > registry attention to the fact that there are different kinds of geresh.
> No one has responded on the specifics of these suggestions about
> Hebrew, and the thread took off to discuss other more general
> issues that Cary brought up.
> Also, no one has responded regarding the specific suggestions I
> had made regarding possible omissions of some Hebrew accents and
> Arabic Koranic annotation marks.
> So, to move this along, I will make a large number of very specific
> suggestions for omission of combining marks from the inclusion
> table (and a few non-combining characters as well). The
> results of that can be viewed at:
> That has 127 fewer characters than the December 16 draft.
> (The exact details of ranges removed and justification for
> each are given below.)
> To address Cary's concern about geresh, gershayim, and maqaf in
> Hebrew, I have constructed a separate table (with only 3 entries
> in it so far), which contains the explicit suggestions for
> characters that should be *added* back to the inclusions table,
> after having been removed by some more generic rule. In the
> case of the 3 Hebrew characters, the fact that they are
> listed as gc=Po (Punctuation, Other) in the Unicode Character
> Database removed them from the candidate inclusions list very
> early on in the rules. But once we do a script-by-script
> review to check whether all the omissions make sense, we may
> come up with instances, as for these three, where characters
> belong in the inclusions list despite their General_Category
> values. The results can be seen at:
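The interaction between a generic General_Category rule and an explicit add-back table can be sketched roughly as follows. This is only an illustration: the rule "drop every punctuation category" and the names `ADD_BACK`, `excluded_by_category`, and `in_inclusions` are assumptions made for the example, not the expression actually used to generate the tables, and Python's `unicodedata` reflects a newer Unicode version than the 2006 data under discussion.

```python
import unicodedata

# Hypothetical add-back table holding the three Hebrew characters
# named in the message (maqaf, geresh, gershayim).
ADD_BACK = {0x05BE, 0x05F3, 0x05F4}


def excluded_by_category(cp: int) -> bool:
    """Assumed generic rule: drop anything whose General_Category is punctuation."""
    return unicodedata.category(chr(cp)).startswith("P")


def in_inclusions(cp: int) -> bool:
    """A code point survives if it is added back explicitly or passes the generic rule."""
    return cp in ADD_BACK or not excluded_by_category(cp)


if __name__ == "__main__":
    for cp in (0x05D0, 0x05BE, 0x05F3, 0x05F4, 0x002E):
        print(f"U+{cp:04X}", unicodedata.category(chr(cp)), in_inclusions(cp))
```

Keeping the add-back set separate from the generic rule means the script-by-script review only ever touches a small exception list, which seems to be the intent of the separate table.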
> O.k. here are the exact details of what I am suggesting be
> next omitted from the inclusions list. (This should then be
> implemented as a series of exclusions by rule for the overall
> expression that Mark is using to generate tables.)
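A series of exclusions by range could be written down as plain data, for instance as in the sketch below. The ranges are copied from this message; the names `OMIT_RANGES`, `is_omitted`, and `apply_omissions` are hypothetical and are not taken from the tool that generates the tables.

```python
# Omission ranges proposed in this message (inclusive, hexadecimal).
OMIT_RANGES = [
    (0x0363, 0x036F),
    (0x0591, 0x05AF), (0x05C4, 0x05C5),
    (0x0610, 0x0615), (0x06D6, 0x06ED),
    (0x0740, 0x074A),
    (0x0953, 0x0954),
    (0x0F18, 0x0F19), (0x0F35, 0x0F35), (0x0F37, 0x0F37),
    (0x0F3E, 0x0F3F), (0x0FC6, 0x0FC6),
    (0x17D3, 0x17D3),
    (0x180B, 0x180D),
    (0x1B6B, 0x1B73),
    (0x1DC0, 0x1DC1), (0x1DC3, 0x1DC3),
    (0x302A, 0x302F),
    (0x3031, 0x3035), (0x303B, 0x303C),
    (0xFE20, 0xFE23),
    (0xFE73, 0xFE73),
]


def is_omitted(cp: int) -> bool:
    """True if the code point falls in any proposed omission range."""
    return any(lo <= cp <= hi for lo, hi in OMIT_RANGES)


def apply_omissions(candidates):
    """Filter an iterable of candidate code points, keeping the input order."""
    return [cp for cp in candidates if not is_omitted(cp)]


# Number of code points covered by the ranges, counted naively.
total = sum(hi - lo + 1 for lo, hi in OMIT_RANGES)

if __name__ == "__main__":
    print(total)
```

Counted this way the ranges span 130 code points. Leaving out 180B..180D, which the message says an earlier rule already removes, gives 127, which agrees with the figure quoted above; that agreement is only an observation from the arithmetic, not a statement about how the count was produced.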
> Common Diacritics
> omit: 0363..036F
> reason: These Latin letters above are specialist medievalist
> usage for manuscripts, and are not a part of regular
> orthographies. They would also be quite confusing
> for internet identifiers.
> omit: 0591..05AF, 05C4..05C5
> reason: 0591..05AF are the Hebrew accent marks Cary was talking about;
> their major function is as cantillation marks, to help
> in the chanting and singing of sacred texts. 05C4..05C5
> are more marks used in the annotation of Biblical text,
> and are not part of the regular pointing system for vowels.
> omit: 0610..0615, 06D6..06ED
> reason: 0610..0615 are honorific annotations added to names
> in text. 06D6..06ED are annotation marks used in Koranic
> text, again mostly for guidance in chanting and singing
> sacred text. None of these are part of regular orthographies,
> and should not be confused with the harakat used for
> indicating vowels in Arabic.
> omit: 0740..074A
> reason: Again, these are marks used in annotating text, and need
> to be distinguished from the regular vowel marks needed
> for the orthography. There is no need for these annotations
> in internet identifiers.
> omit: 0953..0954
> reason: These are the dubious clones of acute and grave accent
> marks included in the Devanagari block. While not formally
> deprecated, there is no obvious function for them in
> Devanagari, and they are otherwise easily confused with
> the common diacritic acute and grave accent marks.
> omit: 0F18..0F19, 0F35, 0F37, 0F3E..0F3F, 0FC6
> reason: Some of these are astrological signs, only used for special
> purpose markup of digits (or occasionally other signs) in
> Tibetan astrology. 0F35 and 0F37 are text highlighting
> marks; they are used like underlining. 0FC6 is a
> symbol diacritic, not used with regular Tibetan text.
> omit: 17D3
> reason: This is a deprecated character originally intended as
> part of the formation of lunar date symbols. It is not
> used in regular text.
> omit: 180B..180D
> reason: These are the Mongolian-specific variation selectors.
> They get automatically removed (by an earlier rule),
> because they are Default_Ignorable_Code_Point. I am
> just cleaning up my list here to match the rules to
> omit: 1B6B..1B73
> reason: These are combining marks only used in Balinese musical
> notation, rather than in regular text.
> Combining Diacritical Marks Supplement
> omit: 1DC0..1DC1, 1DC3
> reason: 1DC0..1DC1 are editorial signs for Ancient Greek, used only
> in academic annotation. 1DC3 is a combining mark for
> Glagolitic, a historic script already omitted from the list.
> CJK Symbols and Punctuation
> omit: 302A..302F
> reason: These are tone mark annotations only used in nonstandard
> annotations of Han characters or Hangul. They are not
> part of either standard CJK orthographies or the commonly
> encountered Latin transliterations for Chinese or Korean.
> omit: 3031..3035, 303B..303C
> reason: While these are not combining marks, they should also be
> omitted from the inclusions list. 3031..3035 are special
> character forms only appropriate for vertically-rendered
> text and inappropriate for internet identifiers. 303B
> is another vertical rendering form. And 303C is an
> abbreviatory symbol that happens to equate to "masu"
> in Japanese, but is not a part of the regular orthography
> of Japanese.
> Combining Half Marks
> omit: FE20..FE23
> reason: These are compatibility half forms, used only in the
> mapping of certain legacy bibliographic character encodings.
> They are not appropriate for normal Unicode text.
> Arabic Presentation Forms-B
> omit: FE73
> reason: This is another oddball compatibility character, encoded only
> for transcoding to some old IBM code pages. It has no
> compatibility decomposition mapping, so it was not filtered
> by the NFKC(cp) != cp criterion. It should simply be omitted
> by exception here because it is inappropriate for use in
> internet identifiers.
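The NFKC(cp) != cp criterion mentioned for FE73 can be checked with Python's standard `unicodedata` module. The sketch below is illustrative only: `EXTRA_OMIT` and `keep` are hypothetical names, and the module's Unicode version is newer than the data discussed in this thread.

```python
import unicodedata


def changed_by_nfkc(cp: int) -> bool:
    """True if NFKC normalization maps the character to something else."""
    s = chr(cp)
    return unicodedata.normalize("NFKC", s) != s


# Hypothetical by-exception list for characters the NFKC test does not catch.
EXTRA_OMIT = {0xFE73}


def keep(cp: int) -> bool:
    """Keep a code point only if it is NFKC-stable and not omitted by exception."""
    return not changed_by_nfkc(cp) and cp not in EXTRA_OMIT


if __name__ == "__main__":
    for cp in (0xFE70, 0xFE73, 0x0061):
        print(f"U+{cp:04X}", changed_by_nfkc(cp), keep(cp))
```

U+FE70 has a compatibility decomposition and is caught by the NFKC test, whereas U+FE73 is unchanged by NFKC and so is only dropped through the explicit exception.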