comments on IDNAbis: draft-faltstrom-idnabis-tables-04.txt Arabic block

Patrik Fältström patrik at frobbit.se
Mon Feb 18 20:00:16 CET 2008


On 18 feb 2008, at 18.21, Sarmad Hussain wrote:

> What is the definition of Cf?  Could not find it in the document.
> Could you please elaborate.  0600..0603 are symbols, neither letters
> nor digits.  Thus, should be DISALLOWED.

The following are Cf (from UnicodeData.txt):

00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;
0600;ARABIC NUMBER SIGN;Cf;0;AL;;;;;N;;;;;
0601;ARABIC SIGN SANAH;Cf;0;AL;;;;;N;;;;;
0602;ARABIC FOOTNOTE MARKER;Cf;0;AL;;;;;N;;;;;
0603;ARABIC SIGN SAFHA;Cf;0;AL;;;;;N;;;;;
06DD;ARABIC END OF AYAH;Cf;0;AL;;;;;N;;;;;
070F;SYRIAC ABBREVIATION MARK;Cf;0;BN;;;;;N;;;;;
17B4;KHMER VOWEL INHERENT AQ;Cf;0;L;;;;;N;;*;;;
17B5;KHMER VOWEL INHERENT AA;Cf;0;L;;;;;N;;*;;;
200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
200C;ZERO WIDTH NON-JOINER;Cf;0;BN;;;;;N;;;;;
200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;;
200E;LEFT-TO-RIGHT MARK;Cf;0;L;;;;;N;;;;;
200F;RIGHT-TO-LEFT MARK;Cf;0;R;;;;;N;;;;;
202A;LEFT-TO-RIGHT EMBEDDING;Cf;0;LRE;;;;;N;;;;;
202B;RIGHT-TO-LEFT EMBEDDING;Cf;0;RLE;;;;;N;;;;;
202C;POP DIRECTIONAL FORMATTING;Cf;0;PDF;;;;;N;;;;;
202D;LEFT-TO-RIGHT OVERRIDE;Cf;0;LRO;;;;;N;;;;;
202E;RIGHT-TO-LEFT OVERRIDE;Cf;0;RLO;;;;;N;;;;;
2060;WORD JOINER;Cf;0;BN;;;;;N;;;;;
2061;FUNCTION APPLICATION;Cf;0;BN;;;;;N;;;;;
2062;INVISIBLE TIMES;Cf;0;BN;;;;;N;;;;;
2063;INVISIBLE SEPARATOR;Cf;0;BN;;;;;N;;;;;
206A;INHIBIT SYMMETRIC SWAPPING;Cf;0;BN;;;;;N;;;;;
206B;ACTIVATE SYMMETRIC SWAPPING;Cf;0;BN;;;;;N;;;;;
206C;INHIBIT ARABIC FORM SHAPING;Cf;0;BN;;;;;N;;;;;
206D;ACTIVATE ARABIC FORM SHAPING;Cf;0;BN;;;;;N;;;;;
206E;NATIONAL DIGIT SHAPES;Cf;0;BN;;;;;N;;;;;
206F;NOMINAL DIGIT SHAPES;Cf;0;BN;;;;;N;;;;;
FEFF;ZERO WIDTH NO-BREAK SPACE;Cf;0;BN;;;;;N;BYTE ORDER MARK;;;;
FFF9;INTERLINEAR ANNOTATION ANCHOR;Cf;0;ON;;;;;N;;;;;
FFFA;INTERLINEAR ANNOTATION SEPARATOR;Cf;0;ON;;;;;N;;;;;
FFFB;INTERLINEAR ANNOTATION TERMINATOR;Cf;0;ON;;;;;N;;;;;
1D173;MUSICAL SYMBOL BEGIN BEAM;Cf;0;BN;;;;;N;;;;;
1D174;MUSICAL SYMBOL END BEAM;Cf;0;BN;;;;;N;;;;;
1D175;MUSICAL SYMBOL BEGIN TIE;Cf;0;BN;;;;;N;;;;;
1D176;MUSICAL SYMBOL END TIE;Cf;0;BN;;;;;N;;;;;
1D177;MUSICAL SYMBOL BEGIN SLUR;Cf;0;BN;;;;;N;;;;;
1D178;MUSICAL SYMBOL END SLUR;Cf;0;BN;;;;;N;;;;;
1D179;MUSICAL SYMBOL BEGIN PHRASE;Cf;0;BN;;;;;N;;;;;
1D17A;MUSICAL SYMBOL END PHRASE;Cf;0;BN;;;;;N;;;;;
E0001;LANGUAGE TAG;Cf;0;BN;;;;;N;;;;;
E0020;TAG SPACE;Cf;0;BN;;;;;N;;;;;
E0021;TAG EXCLAMATION MARK;Cf;0;BN;;;;;N;;;;;
E0022;TAG QUOTATION MARK;Cf;0;BN;;;;;N;;;;;
E0023;TAG NUMBER SIGN;Cf;0;BN;;;;;N;;;;;
E0024;TAG DOLLAR SIGN;Cf;0;BN;;;;;N;;;;;
E0025;TAG PERCENT SIGN;Cf;0;BN;;;;;N;;;;;
E0026;TAG AMPERSAND;Cf;0;BN;;;;;N;;;;;
E0027;TAG APOSTROPHE;Cf;0;BN;;;;;N;;;;;
E0028;TAG LEFT PARENTHESIS;Cf;0;BN;;;;;N;;;;;
E0029;TAG RIGHT PARENTHESIS;Cf;0;BN;;;;;N;;;;;
E002A;TAG ASTERISK;Cf;0;BN;;;;;N;;;;;
E002B;TAG PLUS SIGN;Cf;0;BN;;;;;N;;;;;
E002C;TAG COMMA;Cf;0;BN;;;;;N;;;;;
E002D;TAG HYPHEN-MINUS;Cf;0;BN;;;;;N;;;;;
E002E;TAG FULL STOP;Cf;0;BN;;;;;N;;;;;
E002F;TAG SOLIDUS;Cf;0;BN;;;;;N;;;;;
E0030;TAG DIGIT ZERO;Cf;0;BN;;;;;N;;;;;
E0031;TAG DIGIT ONE;Cf;0;BN;;;;;N;;;;;
E0032;TAG DIGIT TWO;Cf;0;BN;;;;;N;;;;;
E0033;TAG DIGIT THREE;Cf;0;BN;;;;;N;;;;;
E0034;TAG DIGIT FOUR;Cf;0;BN;;;;;N;;;;;
E0035;TAG DIGIT FIVE;Cf;0;BN;;;;;N;;;;;
E0036;TAG DIGIT SIX;Cf;0;BN;;;;;N;;;;;
E0037;TAG DIGIT SEVEN;Cf;0;BN;;;;;N;;;;;
E0038;TAG DIGIT EIGHT;Cf;0;BN;;;;;N;;;;;
E0039;TAG DIGIT NINE;Cf;0;BN;;;;;N;;;;;
E003A;TAG COLON;Cf;0;BN;;;;;N;;;;;
E003B;TAG SEMICOLON;Cf;0;BN;;;;;N;;;;;
E003C;TAG LESS-THAN SIGN;Cf;0;BN;;;;;N;;;;;
E003D;TAG EQUALS SIGN;Cf;0;BN;;;;;N;;;;;
E003E;TAG GREATER-THAN SIGN;Cf;0;BN;;;;;N;;;;;
E003F;TAG QUESTION MARK;Cf;0;BN;;;;;N;;;;;
E0040;TAG COMMERCIAL AT;Cf;0;BN;;;;;N;;;;;
E0041;TAG LATIN CAPITAL LETTER A;Cf;0;BN;;;;;N;;;;;
E0042;TAG LATIN CAPITAL LETTER B;Cf;0;BN;;;;;N;;;;;
E0043;TAG LATIN CAPITAL LETTER C;Cf;0;BN;;;;;N;;;;;
E0044;TAG LATIN CAPITAL LETTER D;Cf;0;BN;;;;;N;;;;;
E0045;TAG LATIN CAPITAL LETTER E;Cf;0;BN;;;;;N;;;;;
E0046;TAG LATIN CAPITAL LETTER F;Cf;0;BN;;;;;N;;;;;
E0047;TAG LATIN CAPITAL LETTER G;Cf;0;BN;;;;;N;;;;;
E0048;TAG LATIN CAPITAL LETTER H;Cf;0;BN;;;;;N;;;;;
E0049;TAG LATIN CAPITAL LETTER I;Cf;0;BN;;;;;N;;;;;
E004A;TAG LATIN CAPITAL LETTER J;Cf;0;BN;;;;;N;;;;;
E004B;TAG LATIN CAPITAL LETTER K;Cf;0;BN;;;;;N;;;;;
E004C;TAG LATIN CAPITAL LETTER L;Cf;0;BN;;;;;N;;;;;
E004D;TAG LATIN CAPITAL LETTER M;Cf;0;BN;;;;;N;;;;;
E004E;TAG LATIN CAPITAL LETTER N;Cf;0;BN;;;;;N;;;;;
E004F;TAG LATIN CAPITAL LETTER O;Cf;0;BN;;;;;N;;;;;
E0050;TAG LATIN CAPITAL LETTER P;Cf;0;BN;;;;;N;;;;;
E0051;TAG LATIN CAPITAL LETTER Q;Cf;0;BN;;;;;N;;;;;
E0052;TAG LATIN CAPITAL LETTER R;Cf;0;BN;;;;;N;;;;;
E0053;TAG LATIN CAPITAL LETTER S;Cf;0;BN;;;;;N;;;;;
E0054;TAG LATIN CAPITAL LETTER T;Cf;0;BN;;;;;N;;;;;
E0055;TAG LATIN CAPITAL LETTER U;Cf;0;BN;;;;;N;;;;;
E0056;TAG LATIN CAPITAL LETTER V;Cf;0;BN;;;;;N;;;;;
E0057;TAG LATIN CAPITAL LETTER W;Cf;0;BN;;;;;N;;;;;
E0058;TAG LATIN CAPITAL LETTER X;Cf;0;BN;;;;;N;;;;;
E0059;TAG LATIN CAPITAL LETTER Y;Cf;0;BN;;;;;N;;;;;
E005A;TAG LATIN CAPITAL LETTER Z;Cf;0;BN;;;;;N;;;;;
E005B;TAG LEFT SQUARE BRACKET;Cf;0;BN;;;;;N;;;;;
E005C;TAG REVERSE SOLIDUS;Cf;0;BN;;;;;N;;;;;
E005D;TAG RIGHT SQUARE BRACKET;Cf;0;BN;;;;;N;;;;;
E005E;TAG CIRCUMFLEX ACCENT;Cf;0;BN;;;;;N;;;;;
E005F;TAG LOW LINE;Cf;0;BN;;;;;N;;;;;
E0060;TAG GRAVE ACCENT;Cf;0;BN;;;;;N;;;;;
E0061;TAG LATIN SMALL LETTER A;Cf;0;BN;;;;;N;;;;;
E0062;TAG LATIN SMALL LETTER B;Cf;0;BN;;;;;N;;;;;
E0063;TAG LATIN SMALL LETTER C;Cf;0;BN;;;;;N;;;;;
E0064;TAG LATIN SMALL LETTER D;Cf;0;BN;;;;;N;;;;;
E0065;TAG LATIN SMALL LETTER E;Cf;0;BN;;;;;N;;;;;
E0066;TAG LATIN SMALL LETTER F;Cf;0;BN;;;;;N;;;;;
E0067;TAG LATIN SMALL LETTER G;Cf;0;BN;;;;;N;;;;;
E0068;TAG LATIN SMALL LETTER H;Cf;0;BN;;;;;N;;;;;
E0069;TAG LATIN SMALL LETTER I;Cf;0;BN;;;;;N;;;;;
E006A;TAG LATIN SMALL LETTER J;Cf;0;BN;;;;;N;;;;;
E006B;TAG LATIN SMALL LETTER K;Cf;0;BN;;;;;N;;;;;
E006C;TAG LATIN SMALL LETTER L;Cf;0;BN;;;;;N;;;;;
E006D;TAG LATIN SMALL LETTER M;Cf;0;BN;;;;;N;;;;;
E006E;TAG LATIN SMALL LETTER N;Cf;0;BN;;;;;N;;;;;
E006F;TAG LATIN SMALL LETTER O;Cf;0;BN;;;;;N;;;;;
E0070;TAG LATIN SMALL LETTER P;Cf;0;BN;;;;;N;;;;;
E0071;TAG LATIN SMALL LETTER Q;Cf;0;BN;;;;;N;;;;;
E0072;TAG LATIN SMALL LETTER R;Cf;0;BN;;;;;N;;;;;
E0073;TAG LATIN SMALL LETTER S;Cf;0;BN;;;;;N;;;;;
E0074;TAG LATIN SMALL LETTER T;Cf;0;BN;;;;;N;;;;;
E0075;TAG LATIN SMALL LETTER U;Cf;0;BN;;;;;N;;;;;
E0076;TAG LATIN SMALL LETTER V;Cf;0;BN;;;;;N;;;;;
E0077;TAG LATIN SMALL LETTER W;Cf;0;BN;;;;;N;;;;;
E0078;TAG LATIN SMALL LETTER X;Cf;0;BN;;;;;N;;;;;
E0079;TAG LATIN SMALL LETTER Y;Cf;0;BN;;;;;N;;;;;
E007A;TAG LATIN SMALL LETTER Z;Cf;0;BN;;;;;N;;;;;
E007B;TAG LEFT CURLY BRACKET;Cf;0;BN;;;;;N;;;;;
E007C;TAG VERTICAL LINE;Cf;0;BN;;;;;N;;;;;
E007D;TAG RIGHT CURLY BRACKET;Cf;0;BN;;;;;N;;;;;
E007E;TAG TILDE;Cf;0;BN;;;;;N;;;;;
E007F;CANCEL TAG;Cf;0;BN;;;;;N;;;;;
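
A list like the one above can be reproduced by filtering UnicodeData.txt
on its third field, which holds the General Category. The following is a
minimal sketch, not the script used to build the tables document; the
function name and the inline sample are illustrative assumptions, and
the range entries that UnicodeData.txt uses for large blocks are not
handled.

```python
def codepoints_with_category(lines, category="Cf"):
    """Yield (codepoint, name) for UnicodeData.txt lines in the given category."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        # Field 0: code point (hex), field 1: name, field 2: General Category.
        if len(fields) > 2 and fields[2] == category:
            yield int(fields[0], 16), fields[1]


# A few lines copied from this message; a real run would read the full file.
sample = """\
00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;
0600;ARABIC NUMBER SIGN;Cf;0;AL;;;;;N;;;;;
0615;ARABIC SMALL HIGH TAH;Mn;230;NSM;;;;;N;;;;;
06DD;ARABIC END OF AYAH;Cf;0;AL;;;;;N;;;;;
""".splitlines()

for cp, name in codepoints_with_category(sample):
    print(f"{cp:04X} {name}")
```

Passing an open file object for UnicodeData.txt instead of the sample
works the same way, since the function only iterates over lines.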

>>>>>>> 0615 should be DISALLOWED as it is a punctuation, to mark a  
>>>>>>> pause
>>
>> 0615;ARABIC SMALL HIGH TAH;Mn;230;NSM;;;;;N;;;;;
>>
>> It is of GeneralCategory Mn, and those are allowed, i.e. it is not
>> classified as punctuation. Are you suggesting that it should be
>> added as an exception and DISALLOWED?
>>
>
> Yes, if that is the case, 0615 should be added to the exception list
> and DISALLOWED.  Please note that 0615 + 066E may be confused with
> 0679.  Thus, 066E should also be disallowed (if it is not a character
> in any language) as it may cause security problems.  Also see comment
> for 066E below.

I will let others comment on this.

>>> 0640..065E  ; PVALID     # ARABIC TATWEEL..ARABIC FATHA WITH TWO DOTS
>>>
>>>
>>>>>>> 0640 should be DISALLOWED as it will create significant security
>>>>>>> problems (kashida only causes stylistic (not shape) variation of
>>>>>>> characters)
>>
>> 0640;ARABIC TATWEEL;Lm;0;AL;;;;;N;;;;;
>>
>> Codepoints of GeneralCategory Lm are allowed (matches category A).
>> Same question as for 0615.
>>
>
> Yes, 0640 should be DISALLOWED, by adding it to the exception list.
> It has no linguistic significance and may cause security problems if
> allowed.

I will let others comment on this.

>>> 066E..0674  ; PVALID     # ARABIC LETTER DOTLESS BEH..ARABIC LETTER
>>> HIGH
>>>
>>>>>>> agreed; though reservations with 066E..066F (as Unicode standard
>>>>>>> does not mention if they are actually used in any language; if
>>>>>>> not part of any language, their inclusion may only contribute to
>>>>>>> security problems)
>>
>> 066E;ARABIC LETTER DOTLESS BEH;Lo;0;AL;;;;;N;;;;;
>> 066F;ARABIC LETTER DOTLESS QAF;Lo;0;AL;;;;;N;;;;;
>>
>> GeneralCategory Lo, and because of that PVALID.
>
> If they do not belong to any language, they should be DISALLOWED as
> they may cause security problems, e.g. see comment on 0615 above, and
> 066E + 065C may be confusable with 0628.

They belong to the Arabic script. I cannot personally say whether
they belong to some language that uses the Arabic script or not.

>>> 06D4        ; DISALLOWED # ARABIC FULL STOP
>>>
>>>>>>> should be allowed as a delimiter for Urdu, like the dot in the
>>>>>>> domain name (should be mapped onto a dot automatically at client
>>>>>>> layer);  As internationalized domain names deal with the end
>>>>>>> user layer (application layer), they need to be a bit more
>>>>>>> sensitive to user needs.  This delimiter, as specified in
>>>>>>> Unicode, is only required for Urdu.  However, Urdu writing does
>>>>>>> not have a dot and dot is also not present on Urdu keyboards.
>>>>>>> If the delimiter is not allowed (and then mapped to dot), the
>>>>>>> user will get confused and also will not be able to type the dot
>>>>>>> without having an English keyboard installed and without
>>>>>>> switching to English keyboard 2-3 times within writing a single
>>>>>>> domain name in Urdu (once to-english-and-back-to-Urdu between
>>>>>>> each level of TLD).  Standard should include this as a
>>>>>>> recommendation for applications.
>>
>> 06D4;ARABIC FULL STOP;Po;0;AL;;;;;N;ARABIC PERIOD;;;;
>>
>> This is DISALLOWED as it is of GeneralCategory Po. I will let others
>> discuss the issues with full stop.
>
>
> Recommendation for application layer and map to dot/full stop within
> the standard would help, if not allowed otherwise.  Again, this is a
> requirement for Urdu as described earlier.

That is one solution. How applications handle various codepoints
is not part of IDNA200X.

>>> 06DD        ; CONTEXTO   # ARABIC END OF AYAH
>>>
>>>>>>> should be DISALLOWED
>>
>> 06DD;ARABIC END OF AYAH;Cf;0;AL;;;;;N;;;;;
>>
>> GeneralCategory Cf. See above on 0600..0603.
>
> 06DD should be DISALLOWED as it marks end of phrase/sentence like  
> full stop
> in English.

Ok.

>>> 06FD..06FE  ; DISALLOWED # ARABIC SIGN SINDHI AMPERSAND..ARABIC SIGN
>>> SIND
>>>
>>>>>>> need time to consult and comment on this.
>>
>> 06FD;ARABIC SIGN SINDHI AMPERSAND;So;0;AL;;;;;N;;;;;
>> 06FE;ARABIC SIGN SINDHI POSTPOSITION MEN;So;0;AL;;;;;N;;;;;
>>
>>>>>>>
>>
>> DISALLOWED because GeneralCategory So.
>>
>
> Could not find So in the document.  Please elaborate.

So stands for the General Category Other_Symbol. Sometimes it is  
written as "Symbol, Other". Sometimes as "So", sometimes as  
"Other_Symbol".

> Just consulted with Dr. Qasim Bughio on telephone.  He was the
> Chairman of Sindhi Language Authority in Pakistan and is currently
> Professor and Dean of Faculty of Arts at University of Sindh in
> Jamshoro, Pakistan (see http://arts.usindh.edu.pk/).  According to him
> 06FD is the word "and" in Sindhi and has no replacement.  Similarly
> 06FE is the word "in" in Sindhi which also has no replacement.  Both
> are used very frequently in the language.
>
> Thus, both 06FD and 06FE MUST be considered PVALID and allowed in
> internationalized domain names for Sindhi.

I will let others comment on this.
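
The rule applied throughout this thread (look up the General Category,
then let an exception list override the result) can be sketched as
follows. This is a simplified illustration, not the algorithm of the
tables document: it covers only the categories named in this message
(Mn, Lm, Lo, So, Po), the names `classify` and `exceptions` are
assumptions, and the categories come from whatever Unicode version the
local Python ships with.

```python
import unicodedata

# Assumption: only the categories discussed in this thread are covered.
PVALID_CATEGORIES = {"Mn", "Lm", "Lo"}
DISALLOWED_CATEGORIES = {"So", "Po"}


def classify(cp, exceptions=None):
    """Return a status for code point cp; exceptions override the category rule."""
    exceptions = exceptions or {}
    if cp in exceptions:
        return exceptions[cp]
    gc = unicodedata.category(chr(cp))
    if gc in PVALID_CATEGORIES:
        return "PVALID"
    if gc in DISALLOWED_CATEGORIES:
        return "DISALLOWED"
    # Everything else (e.g. Cf) needs rules not modelled in this sketch.
    return "UNDECIDED"


# Hypothetical exception list reflecting the requests made in this thread.
requested = {0x0615: "DISALLOWED", 0x0640: "DISALLOWED",
             0x06FD: "PVALID", 0x06FE: "PVALID"}

for cp in (0x0615, 0x0640, 0x066E, 0x06D4, 0x06FD):
    print(f"{cp:04X} {classify(cp)} -> {classify(cp, requested)}")
```

Keeping the exceptions in a separate mapping makes it visible which
results come from the General Category and which from a deliberate
decision.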

>>> In addition, in Urdu we also would have a problem for not allowing
>>> space as we do not have use of ZWNJ in Pakistan.  Urdu users in
>>> Pakistan type space whether it is required to shape letter within a
>>> word or at the end of it.  It is not possible to train all users to
>>> distinguish between space and ZWNJ (especially as the latter is not
>>> a linguistic entity in the language and users are never taught its
>>> concept, but a computational engineering solution from the
>>> perspective of Urdu).  As the domain name standard has to deal with
>>> applications with which users will be directly interacting, it may
>>> also be included as a recommendation (at least for Urdu) that the
>>> users may be allowed to type it and it may be automatically be
>>> converted to ZWNJ (and could follow same rules as ZWNJ after such
>>> conversion).
>>
>> There is a separate discussion on ZWJ and ZWNJ and space.
>
> Space should be allowed at user end applications, and collapsed to
> ZWNJ during pre-processing, at least for Urdu and some other languages
> spoken in Pakistan.  Such recommendations could be added to these
> drafts.

Understood.
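
For illustration only, the two client-side mappings requested in this
thread (ARABIC FULL STOP to dot, and space to ZWNJ) could look like the
sketch below in an application's input handling. This is a hypothetical
pre-processing step, not something specified by IDNA200X or the tables
document; the function name is an assumption, and whether the resulting
ZWNJ is acceptable in a label is a separate question.

```python
ARABIC_FULL_STOP = "\u06D4"
ZWNJ = "\u200C"


def preprocess_urdu_input(text):
    """Map U+06D4 to '.' and collapse whitespace inside each label to ZWNJ."""
    text = text.strip().replace(ARABIC_FULL_STOP, ".")
    labels = text.split(".")
    # split() without arguments drops leading/trailing whitespace and
    # treats any run of whitespace inside the label as one separator.
    return ".".join(ZWNJ.join(label.split()) for label in labels)


# Latin placeholders are used so the result is easy to inspect.
print(ascii(preprocess_urdu_input("ab cd\u06D4ef")))
```

Whitespace at the start or end of a label is dropped rather than turned
into ZWNJ, which is a design choice of this sketch and not taken from
the thread.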

> Thanks for considering these requirements and for your response.

Thanks again.

FYI: version -05 of the tables document will be released shortly. It
does not include any changes regarding, for example, the "Cf" category
or anything else discussed in this email. That does NOT preclude changes
being made for -06 (or later versions).

    Patrik
