markdavis at google.com
Mon Nov 27 19:48:29 CET 2006
I'm a bit confused. What we are starting with is the published RFCs,
deployed now for several years, and seeing yet wider deployment because of
IE7. Any removal of characters that are currently allowed by those RFCs is a
backwards-incompatible change. And any backwards-incompatible change surely
needs good, solid justification.
Some cases are clear enough that they don't need much evidence: the
characters (1) are not part of normal words, and (2) have clear and present
spoofing problems. Fraction slash is one example.
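To make the fraction slash case concrete, here is a minimal Python sketch. The host name "example.com⁄login.evil.test" is a made-up illustration, not a reported spoof; the point is only that U+2044 is a different code point from the URL path separator U+002F, so a string that looks like "host/path" can in fact be a single host name.

```python
import unicodedata

FRACTION_SLASH = "\u2044"  # renders much like "/" in many fonts
SOLIDUS = "/"              # the real URL path separator

# Hypothetical spoofed host name (assumption, for illustration only).
# Everything in this string is ONE host name, because U+2044 does not
# terminate the host part the way U+002F would.
spoofed_host = "example.com" + FRACTION_SLASH + "login.evil.test"

def describe(ch):
    """Return code point, Unicode name and general category of ch."""
    return "U+%04X %s (%s)" % (ord(ch), unicodedata.name(ch), unicodedata.category(ch))

print(describe(SOLIDUS))
print(describe(FRACTION_SLASH))

# The rightmost labels decide who controls the name.
labels = spoofed_host.split(".")
print(labels[-2:])
```

A reader who takes the first part for the host would be looking at the wrong domain; the last two labels are what actually matter.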
Others are not so clear, such as the combining marks. Removal of these would
severely handicap many languages (I gave the vowel analogy for English), so
they need to be carefully assessed:
- Is there any evidence of known spoofs with these characters?
- Do the techniques already in use (e.g., in Firefox and IE7), or
recommended in http://www.unicode.org/reports/tr36/, handle known or
prospective spoof attempts?
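As a rough illustration of why a blanket ban on combining marks would hit some languages much harder than others, the Python sketch below lists the non-spacing marks (general category Mn) that remain in a label after NFC normalization. The helper name nonspacing_marks and the two sample labels are my own choices for the example, not anything taken from the RFCs or from TR36.

```python
import unicodedata

def nonspacing_marks(label):
    """Return the non-spacing marks (category Mn) left after NFC."""
    normalized = unicodedata.normalize("NFC", label)
    return [ch for ch in normalized if unicodedata.category(ch) == "Mn"]

# "e" + COMBINING ACUTE ACCENT: NFC composes the pair into U+00E9,
# so no separate mark survives normalization.
latin = "cafe\u0301"

# The word "Hindi" in Devanagari. It contains VIRAMA (U+094D), a
# non-spacing mark with no precomposed alternative.
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"

for label in (latin, hindi):
    marks = nonspacing_marks(label)
    print(label, ["U+%04X" % ord(m) for m in marks])
```

For the Latin sample, normalization absorbs the mark into a precomposed letter. The Devanagari sample cannot be written without a non-spacing mark at all, so removing such marks would rule out ordinary words.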
On 11/27/06, Vint Cerf <vint at google.com> wrote:
> Taking this from the other direction, one might start with a pretty
> limited set(s) of characters (but far more than present use of LDH) that are
> believed to be "safe" and then try to find ways to expand the set(s) within
> the tolerance of safety risk. Plainly there will be differences of opinion
> as to what is "safe enough" - the expressiveness of the characters permitted
> in IDNs should not, in my opinion, be required to have the same degree of
> expressiveness as one would expect in natural written languages. These are,
> after all, computer-based identifiers, technically speaking. Plainly we want
> them to have some linguistic value in the sense that they are memorable, but
> the presence of search, cut/paste, and directories suggests that perfect
> memorability is less critical than, say, global interoperability.
> I hope no one reads this and thinks I am deliberately short-changing the
> expressiveness side of the equation, but I am deeply concerned that we
> appreciate the intended utility of IDNs compared to general multilingual
> Vinton G Cerf
> Chief Internet Evangelist
> Regus Suite 384
> 13800 Coppermine Road
> Herndon, VA 20171
> +1 703 234-1823
> +1 703-234-5822 (f)
> vint at google.com
> *From:* idna-update-bounces at alvestrand.no *On Behalf Of* Mark Davis
> *Sent:* Monday, November 27, 2006 12:19 PM
> *To:* idna-update at alvestrand.no
> *Subject:* IDNAbis Goals
> In order to assess the advantages and disadvantages of any approach, we
> need to have a good idea of the goals and the weights attached to them. Here
> is an initial take on some of the issues so far discussed, divided into
> A. Loosen some restrictions on IDNA. The goal is to allow, *where
> feasible*, the same kind of expressive capability in other languages that
> is now provided for in English. It should be recognized that not all
> reasonable words of every language will qualify: even in English the lack of
> spaces and other punctuation forces compromises: words like "can't" are
> Here is what I've heard so far:
> 1. Allow Unicode 5.0 characters
> 2. Provide for some mechanism for more quickly updating to
> successive Unicode versions.
> 3. Allow for combining marks at the end of bidi fields
> 4. Allow for ZWJ/ZWNJ in limited contexts (see a previous message).
> Except for #4, which probably most people haven't looked through yet, it
> appears that these are mostly uncontroversial.
> B. Tighten some restrictions on IDNA. The purpose of this appears to be to
> reduce the opportunity for spoofing. Thus any proposed restrictions should
> be assessed against that metric. That is: (a) does the restriction reduce
> spoofing significantly? (b) are there no other reasonable mechanisms for
> doing so?
> Here is what I've heard so far:
> 1. Remove (or discourage) symbols and (most) punctuation.
> - This appears to be mostly uncontroversial. While the vast
> majority of symbols and punctuation do not cause spoofing problems (I♥NY.com
> is not a problem, for example), there is not enough value to having them to
> be worth the effort.
> 2. Remove (or discourage) non-spacing marks.
> - This is quite controversial. These marks are needed by many
> languages; excluding them is like removing vowels from English: "microsoft.com"
> becoming "mcrsft.cm".
> - A very good case has to be made that they (a) cause
> problems, and (b) those problems can't feasibly be handled with other
> 3. Remove (or discourage) archaic / technical characters (characters
> not in common modern use)
> - Unicode supplies a proposed list of such characters, in
> However, it is recognized that any such list will need refinement and
> extension in the future.
> - Certain scripts are quite clearly archaic, and could be
> easily removed or discouraged.
> - Judging whether a character in a modern script is archaic,
> especially those in broad usage such as Latin, Arabic, and Cyrillic, can be
> quite difficult -- often these characters are pressed into use in minority
> A major issue is the choice between removal and discouragement. Removal
> has the very significant cost of breaking backwards compatibility, so a
> clear case has to be made that there is no feasible alternative to handle
> spoofing problems that would otherwise occur.
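For readers who want to see which characters proposals B.1 and B.2 would touch, here is a minimal Python sketch based on Unicode general categories. The function names and the returned bucket names are assumptions made for this example; the real criteria would have to come from whatever the working group agrees on. Proposal B.3 (archaic or technical characters) cannot be derived from the general category alone, so it is not covered here.

```python
import unicodedata

def classify(ch):
    """Sort a character into a rough bucket (illustrative only)."""
    # ASCII letters, digits and hyphen (LDH) stay allowed in any case,
    # even though "-" itself has a punctuation category (Pd).
    if ch == "-" or (ord(ch) < 128 and ch.isalnum()):
        return "ldh"
    cat = unicodedata.category(ch)
    if cat[0] in ("S", "P"):
        return "symbol-or-punctuation"   # candidates under B.1
    if cat == "Mn":
        return "non-spacing-mark"        # candidates under B.2
    return "other"

def flagged(label):
    """Return (character, bucket) pairs that B.1 or B.2 would affect."""
    return [(ch, classify(ch)) for ch in label
            if classify(ch) not in ("ldh", "other")]

# U+2665 is the heart from the "I♥NY.com" example.
print(flagged("I\u2665NY"))
# "e" followed by a combining acute accent, not normalized.
print(flagged("cafe\u0301"))
```

Whether a flagged character is then removed or merely discouraged is exactly the policy question raised above; the sketch only identifies the candidates.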