Follow-up from Tuesday's discussion of digits in the

Martin Duerst duerst at it.aoyama.ac.jp
Thu Dec 4 02:37:15 CET 2008


I agree with Shawn. Mixing digits from different digit series
is extremely similar to mixing characters from different scripts.
Until this issue with Arabic and Eastern Arabic digits came up,
we hadn't given much attention to digit series, but now
we know about them, and can document them wherever we
document recommendations to registries.

Looking carefully at the text and the PDF attached to Vint's
mail of Date: Mon, 24 Nov 2008 16:57:13 -0500, with
Message-Id: <7EC56FDF-286D-4B65-8D58-88C5F817D072 at google.com>
(shameless plug: I wish the IETF archives would support Archived-At:
(http://www.ietf.org/rfc/rfc5064.txt)), the confusability of
the two series of Arabic digits isn't the only issue.

There is a second issue, which has been alluded to but never
described on the list itself: on many systems, a
user may input, or see, one series of digits, but they may be
represented internally as another series. The abovementioned
document proposes to use mapping to deal with this problem.
If we think we need mapping, then that becomes a protocol
issue, but then again, this problem can also be solved without mapping:

The registries just have to register two or three variants
with the digits mapped to different digit series.

There is no combinatorial explosion as long as there is no
mixture of digits from different series. Excluding mixtures makes
sense because the registries should make such a restriction anyway
and because it should be extremely rare that different digits
in a single label are input piecemeal, resulting in different
internal representations.
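
To make the variant idea concrete, here is a minimal sketch in Python.
It is only an illustration, not a proposal for registry software: the
function names are made up for this example, and the choice of exactly
three digit series (ASCII, Arabic-Indic U+0660..U+0669, Extended
Arabic-Indic U+06F0..U+06F9) is an assumption. A label whose digits all
come from one series yields at most three variants, however many digits
it contains; a label that mixes series is simply refused.

```python
# Zero code point of each digit series considered here (assumed set).
SERIES_ZERO = {
    "ascii": 0x0030,                  # U+0030..U+0039
    "arabic-indic": 0x0660,           # U+0660..U+0669
    "extended-arabic-indic": 0x06F0,  # U+06F0..U+06F9
}


def digit_series(ch):
    """Return the name of the series a digit belongs to, or None."""
    cp = ord(ch)
    for name, zero in SERIES_ZERO.items():
        if zero <= cp <= zero + 9:
            return name
    return None


def is_mixed(label):
    """True if the label contains digits from more than one series."""
    used = {digit_series(ch) for ch in label} - {None}
    return len(used) > 1


def remap(label, target_zero):
    """Map every known digit in the label to the series starting at target_zero."""
    out = []
    for ch in label:
        name = digit_series(ch)
        if name is None:
            out.append(ch)  # not a digit from a known series: keep as-is
        else:
            out.append(chr(target_zero + ord(ch) - SERIES_ZERO[name]))
    return "".join(out)


def variants(label):
    """Return the label rendered in each digit series, without duplicates."""
    if is_mixed(label):
        raise ValueError("label mixes digits from different series")
    candidates = (remap(label, zero) for zero in SERIES_ZERO.values())
    return list(dict.fromkeys(candidates))  # dedupe, keep order


if __name__ == "__main__":
    for v in variants("abc123"):
        print(v)
```

If mixing were allowed, a label with n digits would need up to 3^n
variants; with the no-mixing restriction the count stays at three
(or one, for a label without digits).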


In summary, there are two problems, both of which can be
addressed by protocol restrictions, but neither of which actually
needs protocol restrictions.

Regards,    Martin.


At 03:44 08/12/04, Shawn Steele wrote:
>I don't like the proposal because it causes extra effort and confusion and 
>I don't see a real benefit.
>
>> Since the IDNA2008 effort long ago decided to ban symbols, including
>> the
>> 11 "heart" symbols in Unicode (all of which are class "So"), fairness
>> would dictate that we give no special consideration to use of numbers
>> as
>> symbols outside their linguistic context.
>
>That makes sense, but the intent is to prohibit mixed digits, then all of 
>them should be prohibited from being mixed, not just these 3.  Why allow 
>mixing of the indic digits for example?  And then if we go that far, why 
>allow mixing of scripts at all?
>
>The intent is to prohibit homographs, as is the restriction of symbols, but 
>with all the Unicode characters out there, homographs can't be avoided.  
>Even within a script, such as rnicrosoft.com, there are confusables.  In 
>CJK its even worse since most fonts have a limited space to render very 
>complex ideographs.  Even if a character isn't a strict homograph, it can 
>still be easily confused with the glyph a reader expects.
>
>So I don't see this proposal as adding security to IDNA2008 as a whole.  It 
>*may* reduce some confusables in some cases, but mixed scripts are already 
>warned about in modern browsers.
>
>
>- Shawn


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp     


