Fwd: Criteria for exceptional characters

Mark Davis mark.davis at icu-project.org
Sun Dec 17 04:26:17 CET 2006


Forgot to "Reply to all"

---------- Forwarded message ----------
From: Mark Davis <mark.davis at icu-project.org>
Date: Dec 16, 2006 6:49 PM
Subject: Re: Criteria for exceptional characters
To: Michael Everson <everson at evertype.com>

Thanks, comments below.

On 12/16/06, Michael Everson <everson at evertype.com> wrote:
>
> At 15:58 -0800 2006-12-16, Mark Davis wrote:
>
> >The major problems I see with the current system* are:
> >
> >1. It does not allow Unicode 5.0 characters.
>
> To be honest, we MUST refer to Unicode 5.1. Of
> course, all characters in Unicode 5.0 are
> important, but if Unicode 5.1 is not taken as the
> benchmark, the Myanmar (Burmese) script will be
> left out, and that is simply not something that
> can be countenanced.


According to John, we can't wait that long. What I should have added is

1a. some kind of process that makes it easy to update to successive versions
of Unicode. [Having the kind of property-based approach that we are
developing is a solid step in that direction.]

> >2. It restricts some combinations that are required for certain languages.
> >   a) Mn at the end of BIDI fields
>
> This prevents Thaana from being used, as well as
> Yiddish, and probably a number of languages which
> use the Arabic script.


right
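
To make the Thaana case concrete, here is a minimal sketch of what the trailing-mark restriction amounts to (my illustration, not the rule's actual wording):

import unicodedata

def ends_with_combining_mark(label):
    # True if the label's final code point is a nonspacing mark (Mn).
    return bool(label) and unicodedata.category(label[-1]) == 'Mn'

# 'Dhivehi' in Thaana: every consonant carries a vowel sign or sukun,
# all category Mn, so ordinary words end in a combining mark.
dhivehi = '\u078b\u07a8\u0788\u07ac\u0780\u07a8'
print(ends_with_combining_mark(dhivehi))  # True -> the rule rejects it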

> >   b) ZWJ/ZWNJ in limited contexts
>
> A problem for some Brahmic scripts, at least some
> of the major scripts of India.


right

> >3. There are concerns about the stability of normalization
>
> Are they valid? What are they, specifically?


See http://www.unicode.org/reports/tr15/#Versioning
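
One mitigation along these lines (a sketch of the general idea, not something TR15 itself prescribes) is to require that labels already be normalized at registration, so that re-normalizing under a later Unicode version is a no-op for conforming labels:

import unicodedata

def is_normalization_safe(label):
    # Require the label to already be in NFKC form (the form nameprep
    # applies), so later re-normalization cannot silently change it.
    return unicodedata.is_normalized('NFKC', label)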

> >4. There are opportunities for spoofing. This
> >breaks down into a number of sub-problems, of
> >which the major ones are:
> >   a) non-letter confusables, like fraction slash
> >in amazon.com/badguy.com
> >   b) confusable letters/numbers within mixtures
> >of scripts, like cyrillic 'a' in paypal.
>
> I thought we agreed to ban this kind of mixing long ago.


That is not in any of the proposals on the table (eg in internet drafts or
our rule development), as far as I know. It is among the Unicode
recommendations, but nobody has proposed it for the protocol. One has to be
a bit careful to get the right level, given that certain orthographies (eg
Japanese) use multiple scripts. For that reason, it is unclear whether this
should be baked into the protocol, or left up to more flexible mechanisms, like
the user-agents.

See also:

http://www.unicode.org/reports/tr36/#Security_Levels_and_Alerts
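
For illustration, here is a rough sketch of the per-label script check a user-agent might run. The range table is deliberately tiny; a real implementation would use the full Scripts.txt property data and would whitelist legitimate combinations such as Japanese (Han + Hiragana + Katakana):

# Illustrative ranges only; not complete script data.
SCRIPT_RANGES = [
    (0x0041, 0x024F, 'Latin'),
    (0x0370, 0x03FF, 'Greek'),
    (0x0400, 0x04FF, 'Cyrillic'),
    (0x3040, 0x309F, 'Hiragana'),
    (0x30A0, 0x30FF, 'Katakana'),
    (0x4E00, 0x9FFF, 'Han'),
]

def scripts_of(label):
    found = set()
    for ch in label:
        for lo, hi, name in SCRIPT_RANGES:
            if lo <= ord(ch) <= hi:
                found.add(name)
    return found

print(scripts_of('paypal'))       # {'Latin'}
print(scripts_of('p\u0430ypal'))  # {'Latin', 'Cyrillic'}: Cyrillic 'a'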

> One ramification of this would be that it would no
> longer be possible to say that Kurdish Ww was
> LATIN W instead of CYRILLIC WE. We would be
> obliged, for security's sake, to encode CYRILLIC
> WE. There would be no disadvantage here. In fact,
> it would be better for Kurdish. Consider a
> glossary of Kurdish words in its three
> orthographies, Arabic, Latin, and Cyrillic. If
> LATIN W and CYRILLIC WE are encoded separately,
> it is possible to correctly sort (or search, à la
> Google) the multi-script list. If they are
> unified with LATIN W (as at present), there is no
> solution.
>
> >   c) confusable letters in same script, like inte1.com
> >[There is a finer breakdown in
> >http://unicode.org/reports/tr36/ ]
>
> Well worth reading.
>
> >The reason I say "system*" is that the options
> >for solutions can be at different points:
> >
> >A. What should the protocol allow?
> >B. What should a registry allow?
> >C. What should the user agent block (or flag, eg with raw punycode)?
> >
> >For example, nobody has yet proposed that the
> >protocol disallow mixtures of scripts, even
> >though that represents by far the largest
> >opportunity for spoofing.
>
> This is **NOT** correct. I have advocated this
> more or less loudly since September 2005, when I
> discussed the question at length with Cary Karp
> when I was at the Sophia Antipolis meeting of WG2
> and advised him on the draft recommendations he
> was writing. I continue to favour this
> anti-spoofing solution, and if, as has been
> suggested, Unicode script properties of
> characters can be used to ensure that scripts are
> not mixed or mixable (modulo Jpan for instance)
> then there should be no problem with this.
>
> (The only problem I could see is that UTC would
> have to accept CYRILLIC WE, and possibly LATIN
> SOFT SIGN, LATIN THETA as characters used for
> specialist purposes. We are talking less than two
> dozen characters here, and I'm being pretty
> generous in my estimate. A small price to pay for
> security.)


see above

> >Instead, it appears that the solutions taken by
> >the user agents are sufficient there: while the
> >"paypal.com" case got a lot
> >of attention, when you look at the actual,
> >practical impact in terms of real, reported
> >security problems, it is not in practice
> >significant. I have no doubt that the
> >user-agents will continue to refine and improve
> >their approaches.
>
> I don't understand how you can say that the paypal.com case is
> "insignificant".


What I said was: "when you look at the actual, practical impact in terms of
real, reported security problems, it is not in practice significant". And
this is, I believe, because of steps taken in the browsers to alert users to
this. Such cases are quite easy to detect in the user-agent.
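
One such user-agent tactic, mentioned above as "flag, eg with raw punycode", is to display the ASCII-compatible form for anything suspicious. A minimal sketch (the "any non-ASCII" trigger here is a placeholder; a real browser would first apply script-mixing and confusable checks, and the full IDNA ToASCII steps are omitted):

def display_form(label):
    # Show suspicious labels in their ASCII-compatible encoding
    # rather than in their Unicode form.
    if label.isascii():
        return label
    return 'xn--' + label.encode('punycode').decode('ascii')

print(display_form('p\u0430ypal'))  # xn--pypal-4ve (Cyrillic 'a' inside)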

Listings of actual reported fraud using this technique, and their impact,
would be useful.

> >So, what progress are we making?
> >
> >1. Looks like we have a solution
>
> *If* Unicode 5.1, and *if* IETF bites the bullet
> and realizes that there will be a Unicode 6, and
> 7, and 8, which may have needed characters. (No,
> Vint, I'm not talking about non-essential
> characters.)


see above

> >2a. Also looks like we have a solution
>
> If we change the rule.


Everything I have to say is conditional on successful completion. What I
mean by "we have a solution" is that it looks like we have consensus on an
approach.

> >2b. Not yet consensus on this
>
> Going to have to bite this bullet for some
> scripts, but if script properties are accessed,
> and the use of the joiners is restricted to
> certain script (or even certain character)
> environments, this may not be a problem.


See also http://www.unicode.org/review/pr-96.html
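
PR-96 concerns contextual rules for the joiners. As a sketch of the shape such a rule takes (this mirrors one proposed context, ZWNJ immediately after a virama, and is not the full rule set):

import unicodedata

ZWNJ = '\u200c'  # ZERO WIDTH NON-JOINER

def zwnj_allowed_at(label, i):
    # Permit ZWNJ only directly after a virama, i.e. a character with
    # canonical combining class 9, as in Indic consonant clusters.
    return (0 < i < len(label) and label[i] == ZWNJ
            and unicodedata.combining(label[i - 1]) == 9)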

> >3. Looks like we have a solution (restrict the
> >sequences that could change between 3.2 and 5.0;
> >the Unicode consortium is tightening stability
> >to disallow further changes)
>
> Please create a separate thread to discuss this particular issue.


If and when it requires further discussion.

> >4a. Our proposed rules fix this. By tossing out
> >all non-LMN, we remove the bad cases. Although
> >the problematic characters are a small fraction
> >of the few thousand characters in question, most
> >of which are not problematic, there is general
> >agreement that as a class these are not needed,
> >and we are not worried about tossing out any
> >babies with the bathwater.
>
> The list should be reviewed. I'm not saying you
> haven't done a good job, but I haven't reviewed
> it, and I don't know if anyone else has either.


The lists are there, and Patrik, Ken, myself and others have been working on
them. If you are going to review them, you can start anytime ;-)
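
The class-based rule in 4a is easy to state as code. A minimal sketch (hyphen kept as the usual LDH exception; the actual rules under development add further exceptions):

import unicodedata

def label_ok(label):
    # Allow only Letters (L*), Marks (M*), Numbers (N*), and hyphen.
    return all(ch == '-' or unicodedata.category(ch)[0] in 'LMN'
               for ch in label)

print(label_ok('amazon'))           # True
print(label_ok('amazon\u2044com'))  # False: U+2044 FRACTION SLASH is Sm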

> >4b. We're not tackling this in the protocol,
> >leaving it up to user agents (and to some degree
> >registries).
>
> I think it is NOT A GOOD IDEA not to tackle this
> in the protocol. I think IT WOULD BE A VERY GOOD
> IDEA for this to be dealt with in the protocol.
> It would be far safer for the end user, because
> there would be no danger of error (intentional or
> unintentional) on the part of agents or
> registries. We *should* police this because we
> can.


I have no strong feeling either way. It is not difficult to do this in the
user-agent.

> >4c. Here I also suspect that the principal
> >solution is in the user agents, but what we can
> >do at the protocol level is to make some
> >exclusions where there are clear cases that we
> >can handle via well-established properties, or
> >particular exception cases where we add or
> >remove particular characters(s). What we have
> >done so far is to toss out certain classes of
> >characters that are clearly not needed for
> >modern languages (historic scripts). [Here
> >again, frankly, their removal doesn't
> >fundamentally reduce spoofability, but it does
> >little harm to remove them. But because there is
> >not much benefit to their removal, we don't
> >really need to argue whether there is a real
> >need for ones like Runic, because there aren't
> >really demonstrable problems with allowing it,
> >given solutions in (4b).]
>
> For this I suspect that the best we can do is
> make recommendations. STRONG recommendations
> based on real linguistic knowledge and data.
> Recommendations so strong that a given registry
> should have to give reasons for deviating from
> them.


We're only discussing here the protocol. The question of who can force
registries to "give reasons" is not one I want to get into here.

> >(4c) is where your current question falls. These
> >are characters that are not covered by the rules
> >we have developed so far. My suggestion for
> >criteria are:
> >
> >A. If there is a clearly defined class of
> >characters that are clearly never needed in
> >modern languages (in this case Hebrew/Yiddish),
> >we can exclude them.
>
> "In this case"? But I agree, linguistic expertise
> can help weed out characters which are really not
> needed.


"in this case": Cary was raising this issue with regard to certain Hebrew
characters.

> >B. If there are particular characters that may
> >be used as a normal part of the language that we
> >want to consider including or excluding, then we
> >consider two factors in weighing the question:
> >
> >   B1. Can this character cause a spoofing
> >problem in a monoscript string, and if so, how
> >severe is the problem?
> >
> >   B2. Is this character used in the regular
> >orthography of a modern language, and if so, how
> >essential is it?
>
> Good questions, requiring linguistic expertise.
> This would be a "white list" sort of thing, not
> something that could be done algorithmically.
>
> >We want to keep the exceptional characters
> >(included or excluded) that are not covered by
> >the normal rules we've developed so far to a
> >minimum, so only those with a large negative
> >weight should get exceptionally excluded, and
> >only those with a high positive weight get
> >exceptionally included.
>
> Agreed.
>
> >For example, a character that looks like a
> >period or a slash (important syntax characters
> >in URLs), and is optional in the language (eg
> >used in abbreviations, but not regular words)
> >gets a large negative weight. A character that
> >doesn't look like a syntax character or another
> >Hebrew character, and is required by common
> >Hebrew or Yiddish words would get a high
> >positive weight.
>
> Well, the Ethiopic wordspace looks like a colon
> to readers of Latin script, and from a distance,
> though its dots are square and not round.
> However, it can ONLY occur between two ethiopic
> SYLLABLEs, and (obviously) if it were entered
> accidentally inside "http://" it would cause no
> difficulty, because that would be no different
> from entering "http$//" -- it would have no
> effect because it is not a protocol element.


This is where one has to have a more thorough knowledge of the syntax, which
the DNS honchos here can obviously supply. For example, a URL can contain a
colon in several other positions,
such as:

http://<user>:<password>@<host>:<port>/<url-path>
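
The standard library shows the point directly; a colon can separate the scheme, the userinfo, and the port (illustrative values, of course):

from urllib.parse import urlsplit

u = urlsplit('http://user:password@example.com:8080/path')
print(u.username, u.password, u.hostname, u.port)
# user password example.com 8080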

From your end, it would be useful to identify any characters that could cause
problems. That is, they are typically letters that resemble
either other letters, or the ASCII syntax characters (dot, colon, slash,
...)

A good place to start is the data table in
http://www.unicode.org/reports/tr39/#Confusable_Detection

We have some mappings there, but we can definitely add more. Eg for colon we
have currently:

FF1A ;	003A ;	SA	#* ( ： → : ) FULLWIDTH COLON → COLON	# {nfkc:65307}
0589 ;	003A ;	SA	#* ( ։ → : ) ARMENIAN FULL STOP → COLON	# {source:12}
FE30 ;	003A ;	SA	#* ( ︰ → : ) PRESENTATION FORM FOR VERTICAL TWO DOT LEADER → COLON	# {source:3328}
05C3 ;	003A ;	SA	#* ( ׃ → : ) HEBREW PUNCTUATION SOF PASUQ → COLON	# {source:13}
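
For completeness, here is a sketch of how those mappings get used: TR39's "skeleton" transform maps each character through the table and compares the results. Seeded here with just the four colon entries above (the real table has thousands of entries):

import unicodedata

CONFUSABLE_MAP = {
    '\uff1a': ':',  # FULLWIDTH COLON
    '\u0589': ':',  # ARMENIAN FULL STOP
    '\ufe30': ':',  # PRESENTATION FORM FOR VERTICAL TWO DOT LEADER
    '\u05c3': ':',  # HEBREW PUNCTUATION SOF PASUQ
}

def skeleton(s):
    # Per TR39: NFD, map each character through the table, NFD again.
    s = unicodedata.normalize('NFD', s)
    s = ''.join(CONFUSABLE_MAP.get(ch, ch) for ch in s)
    return unicodedata.normalize('NFD', s)

print(skeleton('http\u05c3//') == skeleton('http://'))  # True: confusable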


> I think we are making progress, and I hope my comments are helpful.

Yes, thanks.

> --
> Michael Everson * http://www.evertype.com