Those are all reasonable changes. <ul><li>We should also add the Joiner/NonJoiner characters. As discussed, however, they would be restricted to very specific contexts by additional clauses (like the current bidi restrictions).

</li><li>I think those are the only ones from <a href="http://unicode.org/reports/tr31/#Specific_Character_Adjustments">http://unicode.org/reports/tr31/#Specific_Character_Adjustments</a> that we need to consider, but others may have more information. 

</li></ul><br>(That also reminded me that <a href="http://www.unicode.org/reports/tr31/#Backward_Compatibility">http://www.unicode.org/reports/tr31/#Backward_Compatibility</a> is an example, for those not deeply familiar with Unicode properties, of how derived properties are stabilized.)

<br><br>Mark<br><br><div><span class="gmail_quote">On 12/19/06, <b class="gmail_sendername">Kenneth Whistler</b> &lt;<a href="mailto:kenw@sybase.com">kenw@sybase.com</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

On<br><br>&gt; Date: Sat, 16 Dec 2006 11:58:43 +0100<br><br>Cary asked:<br><br>&gt; Perhaps we can now take a look at the way the Hebrew script is being<br>&gt; handled?<br>&gt;<br>&gt; The table currently excludes the U+05F3 HEBREW PUNCTUATION GERESH and

<br>&gt; the U+05F4 HEBREW PUNCTUATION GERSHAYIM. Although it is reassuring to<br>&gt; note that we recognize the fundamental role they play in Hebrew and<br>&gt; Ladino orthographies, and the likelihood of their appearing in the

<br>&gt; exception table, I am a bit more concerned about the main table<br>&gt; permitting the U+059C HEBREW ACCENT GERESH and the U+059E HEBREW ACCENT<br>&gt; GERSHAYIM. The latter pair, along with the other 28 HEBREW ACCENTs

<br>&gt; strike me as prime examples of what we explicitly need to be excluding.<br><br>&gt; Can the rules, or the sequence of their application be modified to<br>&gt; include the two characters that are missing, or are we stuck with

<br>&gt; allowing the dozens of characters that are not needed for IDN, and<br>&gt; treating the remaining two as exceptions?&nbsp;&nbsp;One alternative would be<br>&gt; simply to permit all Hebrew characters in the range 0591..05F4. (At

<br>&gt; least one of the three characters that would thereby be reintroduced,<br>&gt; U+05BE HEBREW PUNCTUATION MAQAF, can be justified in its own right.)<br>&gt; This would make nothing substantially worse, and would at least call

<br>&gt; registry attention to the fact that there are different kinds of geresh.<br><br>No one has responded on the specifics of these suggestions about<br>Hebrew, and the thread took off to discuss other more general<br>

issues that Cary brought up.<br><br>Also, no one has responded regarding the specific suggestions I<br>had made regarding possible omissions of some Hebrew accents and<br>Arabic Koranic annotation marks.<br><br>So, to move this along, I will make a large number of very specific

<br>suggestions for omission of combining marks from the inclusion<br>table (and a few non-combining characters as well). The<br>results of that can be viewed at:<br><br><a href="http://www.unicode.org/~whistler/SPInclusionList061219.txt">

http://www.unicode.org/~whistler/SPInclusionList061219.txt</a><br><br>That has 127 fewer characters than the December 16 draft.<br>(The exact details of ranges removed and justification for<br>each are given below.)<br><br>

To address Cary&#39;s concern about geresh, gershayim, and maqaf in Hebrew, I have constructed a separate table (with only 3 entries in it so far), which contains the explicit suggestions for characters that should be *added* back to the inclusions table,

<br>after having been removed by some more generic rule. In the<br>case of the 3 Hebrew characters, the fact that they are<br>listed as gc=Po (Punctuation, Other) in the Unicode Character<br>Database removed them from the candidate inclusions list very

<br>early on in the rules. But once we do a script-by-script<br>review to check whether all the omissions make sense, we may<br>come up with instances, as for these three, where characters<br>belong in the inclusions list despite their General_Category

<br>values. The results can be seen at:<br><br><a href="http://www.unicode.org/~whistler/SPInclusionAdd061219.txt">http://www.unicode.org/~whistler/SPInclusionAdd061219.txt</a><br><br>OK, here are the exact details of what I am suggesting be

<br>next omitted from the inclusions list. (This should then be<br>implemented as a series of exclusions by rule for the overall<br>expression that Mark is using to generate tables.)<br><br>Common Diacritics<br><br>&nbsp;&nbsp;omit:&nbsp;&nbsp;&nbsp;&nbsp;0363..036F

<br><br>&nbsp;&nbsp;reason:&nbsp;&nbsp;These Latin letters above are specialist medievalist<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; usage for manuscripts, and are not a part of regular<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; orthographies. They would also be quite confusing<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for internet identifiers.

<br><br>Hebrew<br><br>&nbsp;&nbsp;omit:&nbsp;&nbsp;&nbsp;&nbsp;0591..05AF, 05C4..05C5<br><br>&nbsp;&nbsp;reason:&nbsp;&nbsp;0591..05AF are the Hebrew accent marks Cary was talking about;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; their major function is as cantillation marks, to help<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; in the chanting and singing of sacred texts. 05C4..05C5

<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; are more marks used in the annotation of Biblical text,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; and are not part of the regular pointing system for vowels.<br><br>Arabic<br><br>&nbsp;&nbsp;omit:&nbsp;&nbsp;&nbsp;&nbsp;0610..0615, 06D6..06ED<br><br>&nbsp;&nbsp;reason:&nbsp;&nbsp;0610..0615 are honorific annotations added to names

<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; in text. 06D6..06ED are annotation marks used in Koranic<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; text, again mostly for guidance in chanting and singing<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sacred text. None of these are part of regular orthographies,<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; and should not be confused with the harakat used for<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; indicating vowels in Arabic.<br><br>Syriac<br><br>&nbsp;&nbsp;omit:&nbsp;&nbsp;&nbsp;&nbsp;0740..074A<br><br>&nbsp;&nbsp;reason:&nbsp;&nbsp;Again, these are marks used in annotating text, and need

<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; to be distinguished from the regular vowel marks needed<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for the orthography. There is no need for these annotation marks<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for internet identifiers.<br><br>Devanagari<br><br>&nbsp;&nbsp;omit:&nbsp;&nbsp;&nbsp;&nbsp;0953..0954

<br><br>&nbsp;&nbsp;reason:&nbsp;&nbsp;These are the dubious clones of acute and grave accent<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; marks included in the Devanagari block. While not formally<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; deprecated, there is no obvious function for them in<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Devanagari, and they are otherwise easily confused with

<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; the common diacritic acute and grave accent marks.<br><br>Tibetan<br><br>&nbsp;&nbsp;omit:&nbsp;&nbsp;&nbsp;&nbsp;0F18..0F19, 0F35, 0F37, 0F3E..0F3F, 0FC6<br><br>&nbsp;&nbsp;reason:&nbsp;&nbsp;Some of these are astrological signs, only used for special<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; purpose markup of digits (or occasionally other signs) in<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Tibetan astrology. 0F35 and 0F37 are text highlighting<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; marks; they are used like underlining. 0FC6 is a<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; symbol diacritic, not used with regular Tibetan text.

<br><br>Khmer<br><br>&nbsp;&nbsp;omit:&nbsp;&nbsp;&nbsp;&nbsp;17D3<br><br>&nbsp;&nbsp;reason:&nbsp;&nbsp;This is a deprecated character originally intended as<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; part of the formation of lunar date symbols. It is not<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; used in regular text.<br><br>

Mongolian<br><br>&nbsp;&nbsp;omit:&nbsp;&nbsp;&nbsp;&nbsp;180B..180D<br><br>&nbsp;&nbsp;reason:&nbsp;&nbsp;These are the Mongolian-specific variation selectors.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; They get automatically removed (by an earlier rule),<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; because they are Default_Ignorable_Code_Point. I am

<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; just cleaning up my list here to match the rules to<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; date.<br><br>Balinese<br><br>&nbsp;&nbsp;omit:&nbsp;&nbsp;&nbsp;&nbsp;1B6B..1B73<br><br>&nbsp;&nbsp;reason:&nbsp;&nbsp;These are combining marks only used in Balinese musical<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; notation, rather than in regular text.

<br><br>Combining Diacritical Marks Supplement<br><br>&nbsp;&nbsp;omit:&nbsp;&nbsp;&nbsp;&nbsp;1DC0..1DC1, 1DC3<br><br>&nbsp;&nbsp;reason:&nbsp;&nbsp;1DC0..1DC1 are editorial signs for Ancient Greek, used only<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; in academic annotation. 1DC3 is a combining mark for

<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Glagolitic, a historic script already omitted from the list.<br><br>CJK Symbols and Punctuation<br><br>&nbsp;&nbsp;omit:&nbsp;&nbsp;&nbsp;&nbsp;302A..302F<br><br>&nbsp;&nbsp;reason:&nbsp;&nbsp;These are tone mark annotations only used in nonstandard<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; annotations of Han characters or Hangul. They are not

<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; part of either standard CJK orthographies or the commonly<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; encountered Latin transliterations for Chinese or Korean.<br><br>&nbsp;&nbsp;omit:&nbsp;&nbsp;&nbsp;&nbsp;3031..3035, 303B..303C<br><br>&nbsp;&nbsp;reason:&nbsp;&nbsp;While these are not combining marks, they should also be

<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; omitted from the inclusions list. 3031..3035 are special<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; character forms only appropriate for vertically-rendered<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; text and inappropriate for internet identifiers. 303B<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; is another vertical rendering form. And 303C is an

<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; abbreviatory symbol that happens to equate to &quot;masu&quot;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; in Japanese, but is not a part of the regular orthography<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; of Japanese.<br><br>Combining Half Marks<br><br>&nbsp;&nbsp;omit:&nbsp;&nbsp;&nbsp;&nbsp;FE20..FE23

<br><br>&nbsp;&nbsp;reason:&nbsp;&nbsp;These are compatibility half forms, used only in the<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mapping of certain legacy bibliographic character encodings.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; They are not appropriate for normal Unicode text representation.

<br><br>Arabic Presentation Forms-B<br><br>&nbsp;&nbsp;omit:&nbsp;&nbsp;&nbsp;&nbsp;FE73<br><br>&nbsp;&nbsp;reason:&nbsp;&nbsp;This is another oddball compatibility character, encoded only<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for transcoding to some old IBM code pages, but which doesn&#39;t<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; have any compatibility decomposition mapping, and so which<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; didn&#39;t get filtered by the NFKC(cp) != cp criterion. It should<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; simply be omitted by exception here because it is<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inappropriate for use in internet identifiers.
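<br><br>As a rough illustration only, the omissions listed above and the separate add-back table could be combined as plain set operations over code points. This is a sketch, not the actual expression used to generate the tables: the names OMIT_RANGES, ADD_BACK, omitted_code_points, and apply_rules are hypothetical, the candidate set passed in is a stand-in for whatever the earlier rules produce, and the three add-back entries are assumed to be the geresh, gershayim, and maqaf named earlier in this message.<br><br>

```python
# Hypothetical sketch: apply the proposed omissions, then the add-back table.
# The ranges are copied from the message above; all names are illustrative.

OMIT_RANGES = [
    (0x0363, 0x036F),                      # Common Diacritics
    (0x0591, 0x05AF), (0x05C4, 0x05C5),    # Hebrew
    (0x0610, 0x0615), (0x06D6, 0x06ED),    # Arabic
    (0x0740, 0x074A),                      # Syriac
    (0x0953, 0x0954),                      # Devanagari
    (0x0F18, 0x0F19), (0x0F35, 0x0F35),    # Tibetan
    (0x0F37, 0x0F37), (0x0F3E, 0x0F3F),
    (0x0FC6, 0x0FC6),
    (0x17D3, 0x17D3),                      # Khmer
    (0x180B, 0x180D),                      # Mongolian
    (0x1B6B, 0x1B73),                      # Balinese
    (0x1DC0, 0x1DC1), (0x1DC3, 0x1DC3),    # Combining Diacritical Marks Supplement
    (0x302A, 0x302F), (0x3031, 0x3035),    # CJK Symbols and Punctuation
    (0x303B, 0x303C),
    (0xFE20, 0xFE23),                      # Combining Half Marks
    (0xFE73, 0xFE73),                      # Arabic Presentation Forms-B
]

# Assumed contents of the add-back table: maqaf, geresh, gershayim.
ADD_BACK = {0x05BE, 0x05F3, 0x05F4}


def omitted_code_points():
    """Expand the inclusive (first, last) ranges into a set of code points."""
    out = set()
    for first, last in OMIT_RANGES:
        out.update(range(first, last + 1))
    return out


def apply_rules(candidates):
    """Remove the omitted code points, then add the exceptions back.

    The add-back step runs last so that a specific exception always wins
    over a more generic exclusion rule.
    """
    return (set(candidates) - omitted_code_points()) | ADD_BACK


if __name__ == "__main__":
    # Stand-in candidate set: HEBREW LETTER ALEF plus two code points
    # that fall inside the omitted ranges.
    result = apply_rules({0x05D0, 0x0591, 0xFE73})
    print(sorted(f"U+{cp:04X}" for cp in result))
```

Applying the add-back set after the exclusions mirrors the order described in this message: characters are first removed by generic rules and then restored by explicit exception.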

<br><br><br>_______________________________________________<br>Idna-update mailing list<br><a href="mailto:Idna-update@alvestrand.no">Idna-update@alvestrand.no</a><br><a href="http://www.alvestrand.no/mailman/listinfo/idna-update">

http://www.alvestrand.no/mailman/listinfo/idna-update</a><br></blockquote></div><br>
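The two property-based filters mentioned in the quoted message, General_Category and the NFKC(cp) != cp criterion, can be inspected with Python's standard unicodedata module. This is only a sketch under assumptions: the helper names general_category and changed_by_nfkc are hypothetical, U+FE70 is included purely as a contrasting example of a character that does have a compatibility decomposition, and the property values reported depend on the Unicode version bundled with the Python interpreter, which may differ from the version under discussion in 2006.<br><br>

```python
import unicodedata


def general_category(cp):
    """Return the General_Category abbreviation (e.g. 'Po') for a code point."""
    return unicodedata.category(chr(cp))


def changed_by_nfkc(cp):
    """True if NFKC normalization alters the character, i.e. NFKC(cp) != cp."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) != ch


if __name__ == "__main__":
    # The three Hebrew punctuation characters proposed for the add-back table.
    for cp in (0x05BE, 0x05F3, 0x05F4):
        print(f"U+{cp:04X} gc={general_category(cp)}")

    # U+FE73 has no compatibility decomposition, so the NFKC test does not
    # catch it; U+FE70 does have one and is shown only for contrast.
    for cp in (0xFE73, 0xFE70):
        print(f"U+{cp:04X} changed_by_nfkc={changed_by_nfkc(cp)}")
```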