NSM flaw?

Fri Sep 18 10:41:20 CEST 2009

Hello Abdulrahman,

On 2009/09/15 17:40, Abdulrahman I. ALGhadir wrote:
> Thank you for the reply,
> But as I see it, the protocol now does fix some problems that are contextual in nature rather than treating labels as plain Unicode (e.g. allowing ZWJ/ZWNJ, disallowing leading digits in U-labels, not mixing scripts, etc.). All of these issues are contextual, and based on what you said they should be handled at the browser level (or some other level) and not in the protocol itself.
> As I see it, the protocol at its current stage sits in between, fixing some problems and declining to fix others. I know it is hard to govern all the languages of the world and to fix every contextual problem that may lead to spoofing attempts, but the protocol should follow a clear path: either support them (that is, fix them all), or treat these labels as plain sequences of Unicode and leave it to other levels to fix these kinds of problems.

I think Harald already explained some of the rationale for the decisions 
taken, but here is some more.

The basic idea is to just say which characters are allowed and which 
characters are not allowed. We started with the extension of the LDH 
(letters, digits, hyphen) rule to non-ASCII, and made as few tweaks as 
we could.

The two additional restrictions that you mention, bidi rules and 
zwj/zwnj, can be motivated as follows:

For the bidi rule, there is a need to define some restriction to avoid 
really bad user surprises. The problems, and the solution, are based 
only on bidi properties, and are therefore essentially language 
independent. The restrictions are per label, but they are designed in a 
way so as to "do the right thing" (i.e. prevent additional damage) even 
across labels, to the extent possible. If every registry made up 
restrictions on their own, these would differ and not fit together.
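
As a rough illustration of how such a per-label check depends only on bidi 
properties, here is a minimal Python sketch of the two conditions from 
bidi-05 that are quoted further down in this thread (allowed classes in an 
RTL label, and the required ending). It is not the complete BIDI rule, and 
the function name check_rtl_label is made up for this example.

```python
import unicodedata

# Bidi classes permitted anywhere in an RTL label (condition 2 as quoted
# in this thread from draft-ietf-idnabis-bidi-05).
ALLOWED_IN_RTL = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
# Bidi classes permitted for the last non-NSM character (condition 3).
RTL_END = {"R", "AL", "EN", "AN"}


def check_rtl_label(label: str) -> bool:
    """Check only the two quoted conditions; hypothetical helper, not the full rule."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not classes or any(c not in ALLOWED_IN_RTL for c in classes):
        return False
    # Skip the "zero or more" trailing NSM characters.
    end = len(classes)
    while end > 0 and classes[end - 1] == "NSM":
        end -= 1
    return end > 0 and classes[end - 1] in RTL_END


# The label from this thread with two fathas (U+064E) at the end.
double_fatha = "\u0633\u064A\u0627\u0631\u0629\u064E\u064E"
print(check_rtl_label(double_fatha))      # the quoted conditions do not object
print(check_rtl_label("\u0633\u064Aa"))   # Latin "a" has bidi class L
```

Note that the check never looks at which language or script the label is in, 
and that the doubled vowel mark passes, which is why it has to be caught 
elsewhere.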

For the zwj/zwnj case, based on the LDH rule, these would have been out. 
But it turned out that completely excluding them would affect certain 
scripts and languages in a needlessly bad way, and completely allowing 
them would open a very wide door for essentially limitless spoofing, so 
they are handled with great care.

In the case of vowel marks for Arabic, indeed common sense makes it easy 
to see that only one of these per base letter is useful. So that's why 
some rendering engines apparently take a shortcut and just "overpaint" a 
subsequent vowel mark, rather than showing that there is a special 
situation. On the other hand, the fact that there is a lot of common 
sense also makes it easier to delegate this and similar cases to 
registries. In my opinion, a registry/registrar accepting Arabic script 
labels shouldn't reject double vowel marks because of some potential 
display problems, but simply because they don't make any sense. Indeed, 
I can imagine that in some places, even single vowel marks would not be 
allowed, because this would eliminate a lot of problems with 
bundling etc., would make the names easier to type for the users, and so 
on. Of course, this would heavily depend on the language, as far as I 
understand.
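
A registry-side check of this kind could look roughly like the following 
Python sketch. It assumes the policy is simply "reject a label in which the 
same nonspacing mark appears twice in a row"; the function name 
has_repeated_mark is invented here, and a real registry policy would 
likely be stricter and language-specific.

```python
import unicodedata


def has_repeated_mark(label: str) -> bool:
    """Return True if a nonspacing mark (category Mn) is immediately repeated.

    Hypothetical registry policy, not something the protocol requires.
    """
    prev = ""
    for ch in label:
        if ch == prev and unicodedata.category(ch) == "Mn":
            return True
        prev = ch
    return False


# The two labels from this thread: one fatha (U+064E) versus two.
single = "\u0633\u064A\u0627\u0631\u0629\u064E"
double = single + "\u064E"

for label in (single, double):
    verdict = "reject" if has_repeated_mark(label) else "accept"
    print([f"U+{ord(c):04X}" for c in label], verdict)
```

The check only catches identical marks in succession. Different marks on 
the same base letter are left alone, since whether those make sense depends 
on the language.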

For scripts other than Arabic, there is likewise a wide range of choices 
on what specific combining marks (and how many) to allow on what base 
letters. Because each registry has to take these decisions anyway, and 
because they are heavily influenced by language and culture, putting 
restrictions into the protocol that on one hand might turn out to be too 
restrictive, but on the other hand may suggest to registries that they 
don't have a job to do in this area, seemed clearly to be the wrong choice.

As for mixing scripts (in the general case), there is no such 
restriction in the protocol, but we strongly suggest that registries 
impose such restrictions except in cases where this isn't appropriate 
(e.g. mixing Kanji, Hiragana, and Katakana scripts is essential for 
Japanese).
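
To make the idea concrete, here is a small Python sketch of a registry-level 
script-mixing check with a Japanese exception. Python's standard library 
does not expose the Unicode Script property, so the sketch uses the first 
word of each character name as a crude stand-in. That shortcut, the set 
JAPANESE, and the function names are all assumptions made for illustration 
only.

```python
import unicodedata

# Script groups that a registry might allow to be mixed (assumed policy).
JAPANESE = {"CJK", "HIRAGANA", "KATAKANA"}


def label_scripts(label: str) -> set:
    """Approximate each character's script by the first word of its Unicode name.

    This is a rough approximation, not the real Script property;
    unicodedata.name() raises ValueError for characters without a name.
    """
    return {unicodedata.name(ch).split()[0] for ch in label}


def mixing_allowed(label: str) -> bool:
    """Allow single-script labels and the Kanji/Hiragana/Katakana mix."""
    scripts = label_scripts(label)
    return len(scripts) <= 1 or scripts <= JAPANESE


print(mixing_allowed("日本のドメイン"))   # Kanji + Hiragana + Katakana
print(mixing_allowed("\u043Faypal"))      # Cyrillic pe followed by Latin letters
```

A production check would use proper Script data and would also need a 
policy for digits and hyphens, which this sketch would count as separate 
"scripts".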

> I know I am a bit late in raising things like this, but given the importance of the problem I had to do it. Sorry.

I think you did a very good job of raising awareness of this problem.

Thanks and regards,    Martin.

> Thank you,
> Abdulrahman.
>
> -----Original Message-----
> From: Vint Cerf [mailto:vint at google.com]
> Sent: 14/Sep/2009 4:56 PM
> To: "Martin J. Dürst"
> Cc: Abdulrahman I. ALGhadir; idna-update at alvestrand.no; Arabic Scripts IDNA
> Subject: Re: NSM flaw?
>
> Martin, Abdulrahman,
>
> thanks for this contribution. I think Martin's points are very well
> taken.
> Finding protocol level rules for situations like that is not only very
> hard but probably not productive because of all the various potential
> hazards that exist with the introduction of so many new scripts.
> Programs that implement very complex rules run a higher
> risk of bugs and thus of incompatibilities between
> implementations. The observation about varying treatment at
> browser level underscores some of the hazards.
>
> Vint
>
>
> On Sep 14, 2009, at 7:16 AM, Martin J. Dürst wrote:
>
>> Hello Abdulrahman,
>>
>> Are you saying that there is a problem with two successive (identical)
>> vowel marks (such as fatha, damma, kasra) because display engines will
>> ignore the second one (because essentially, there is no point in
>> indicating the vowel twice)?
>>
>> First, my mail agent (Thunderbird) displays both U+064E characters in
>> your examples below (the second one above a circle, which may be dotted,
>> although I can't see the actual dots). But there may well be display
>> engines that do what I think you say, so this may indeed be a problem.
>>
>> Second, while such NSM combinations (as well as much more far-fetched
>> combinations of NSMs, or letters and NSMs, or letters and letters) are
>> all allowed in the protocol, registries can (and in the case you point
>> out most probably should) reject them. Because of the complexity of
>> languages and scripts around the world, it wasn't possible to
>> incorporate such restrictions (except for a few extremely crucial ones)
>> into the protocol, but it would definitely be good if the cases you
>> point out are documented by the group working on Arabic domain names. I
>> have cc'ed the Arabic Scripts IDNA mailing list, maybe they are already
>> aware of this and related issues.
>>
>> Regards,   Martin.
>>
>> On 2009/09/14 19:21, Abdulrahman I. ALGhadir wrote:
>>> Hello,
>>>
>>> I am Abdulrahman I. Al-Ghadir from SaudiNIC. I am new to IDNA and
>>> joined the mailing list lately.
>>>
>>> While reviewing and reading the latest drafts I found something which
>>> may be a flaw in the protocol, in draft ftp://ftp.ietf.org/internet-drafts/draft-ietf-idnabis-bidi-05.txt
>>>
>>>
>>>
>>> In bidi-05 “2.  The BIDI Rule”:
>>>
>>> “  2.  In an RTL label, only characters with the BIDI properties R, AL,
>>>         AN, EN, ES, CS, ET, ON, BN and NSM are allowed.
>>>
>>>     3.  In an RTL label, the end of the label must be a character with
>>>         BIDI property R, AL, EN or AN, followed by zero or more
>>>         characters with BIDI property NSM.”
>>>
>>> A sequence of NSMs can appear in a label, and this may cause a
>>> problem at the display level for the label.
>>>
>>>
>>> Assume these two labels (image attached for both words):
>>>
>>>
>>>
>>> سيارةَ    ->   \u0633\u064A\u0627\u0631\u0629\u064E         ->   xn--mgbexg9i1a
>>>
>>> سيارةََ    ->   \u0633\u064A\u0627\u0631\u0629\u064E\u064E   ->   xn--mgbexg9i1aa
>>>
>>>
>>>
>>> http://unicode.org/cldr/utility/idna.jsp?a=%D8%B3%D9%8A%D8%A7%D8%B1%D8%A9%D9%8E%0D%0A%D8%B3%D9%8A%D8%A7%D8%B1%D8%A9%D9%8E%D9%8E&f=%5b%C3%9F+%CF%82+%5b%3AJoin_C%3A
>>>
>>>
>>> As you can see, both words display identically but have different code
>>> points, which might lead to problems. As you know, an NSM is displayed
>>> at a fixed position, and any NSMs that follow it are displayed at that
>>> same position too (so NSMs after the first displayed one will be
>>> invisible). The same goes for other NSMs that behave in the same
>>> way. (Am I right?)
>>>
>>>
>>>
>>> I know it is a bit late to raise things like this, but might it be a
>>> problem?
>>>
>>>
>>>
>>> Thank you,
>>>
>>> Abdulrahman.
>>>
>>>
>>>
>>>
>>> -----------------------------------------------------------------------
>>> Notice:
>>> This message and its attachments (if any) constitute a confidential
>>> document that may contain information enjoying legal protection and
>>> privilege. If you are not the intended recipient of this message, you
>>> must notify the sender that it reached you in error and delete the
>>> message and its attachments (if any) from your computer. You may not
>>> copy this message or its attachments (if any) or any part of them,
>>> disclose their contents to any person, or use them for any purpose.
>>> Note that the statements and opinions contained in this message express
>>> only the view of the sender and not necessarily the view of the
>>> Communications and Information Technology Commission, and the
>>> Commission bears no responsibility for damage resulting from this
>>> email.
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Idna-update mailing list
>>> Idna-update at alvestrand.no
>>> http://www.alvestrand.no/mailman/listinfo/idna-update
>> --
>> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
>> #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp
>
>

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp