Criteria for exceptional characters

Michael Everson everson at evertype.com
Sun Dec 17 01:37:25 CET 2006


At 15:58 -0800 2006-12-16, Mark Davis wrote:

>The major problems I see with the current system* are:
>
>1. It does not allow Unicode 5.0 characters.

To be honest, we MUST refer to Unicode 5.1. Of 
course, all characters in Unicode 5.0 are 
important, but if Unicode 5.1 is not taken as the 
benchmark, the Myanmar (Burmese) script will be 
left out, and that is simply not something that 
can be countenanced.

>2. It restricts some combinations that are required for certain languages.
>   a) Mn at the end of BIDI fields

This prevents Thaana from being used, as well as 
Yiddish, and probably a number of languages which 
use the Arabic script.
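
For concreteness, here is a minimal sketch of the restriction as I read it, using Python's unicodedata; the Thaana code points are only an illustration of a perfectly ordinary word-final vowel sign that the rule would reject:

    import unicodedata

    def ends_in_nonspacing_mark(label: str) -> bool:
        # The rule under discussion: no nonspacing mark (Mn) as the
        # final code point of a label in a bidi context.
        return bool(label) and unicodedata.category(label[-1]) == 'Mn'

    # THAANA LETTER HAA + THAANA ABAFILI (a vowel sign, category Mn):
    print(ends_in_nonspacing_mark('\u0780\u07a6'))  # True -> rejected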

>   b) ZWJ/NJ in limited contexts

A problem for some Brahmic scripts, at least some 
of the major scripts of India.

>3. There are concerns about the stability of normalization

Are they valid? What are they, specifically?

>4. There are opportunities for spoofing. This 
>breaks down into a number of sub-problems, of 
>which the major ones are:
>   a) non-letter confusables, like fraction slash 
>in amazon.com/badguy.com
>   b) confusable letters/numbers within mixtures 
>of scripts, like cyrillic 'a' in paypal.

I thought we agreed to ban this kind of mixing long ago.

One ramification of this would be that it would 
no longer be possible to say that Kurdish Ww was 
LATIN W instead of CYRILLIC WE. We would be 
obliged, for security's sake, to encode CYRILLIC 
WE. There would be no disadvantage here. In fact, 
it would be better for Kurdish. Consider a 
glossary of Kurdish words in its three 
orthographies, Arabic, Latin, and Cyrillic. If 
LATIN W and CYRILLIC WE are encoded separately, 
it is possible to correctly sort (or search, à la 
Google) the multi-script list. If they are 
unified with LATIN W (as at present), there is no 
solution.
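
To make the sorting/searching point concrete: since CYRILLIC WE is not yet encoded, the Python sketch below uses Latin 'a' (U+0061) and Cyrillic 'а' (U+0430), which ARE encoded separately, as stand-ins for the pair; the words are invented placeholders:

    # A glossary entry spelled with a unified letter is irreducibly
    # mixed-script, so no script-based sort or filter can place it.
    glossary = [
        'war',                  # Latin orthography
        'w\u0430\u0440',        # Cyrillic orthography forced to borrow 'w'
        '\u0432\u0430\u0440',   # fully Cyrillic spelling, for comparison
    ]

    def script_of(ch: str) -> str:
        # Tiny stand-in for the Unicode Script property.
        if '\u0400' <= ch <= '\u04ff':
            return 'Cyrillic'
        if ch.isascii() and ch.isalpha():
            return 'Latin'
        return 'Other'

    for word in glossary:
        print(word, {script_of(c) for c in word})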

>   c) confusable letters in same script, like inte1.com
>[There is a finer breakdown in 
>http://unicode.org/reports/tr36/]

Well worth reading.

>The reason I say "system*" is that the options 
>for solutions can be at different points:
>
>A. What should the protocol allow?
>B. What should a registry allow?
>C. What should the user agent block (or flag, eg with raw punycode)?
>
>For example, nobody has yet proposed that the 
>protocol disallow mixtures of scripts, even 
>though that represents by far the largest 
>opportunity for spoofing.

This is *NOT* correct. I have advocated this 
more or less loudly since September 2005, when I 
discussed the question at length with Cary Karp 
when I was at the Sophia Antipolis meeting of WG2 
and advised him on the draft recommendations he 
was writing. I continue to favour this 
anti-spoofing solution, and if, as has been 
suggested, Unicode script properties of 
characters can be used to ensure that scripts are 
not mixed or mixable (modulo Jpan, for instance), 
then there should be no problem with this.
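
A sketch of what I have in mind, assuming a script-property lookup is available. Python's standard library does not expose the Script property, so the table below is a tiny illustrative subset, not real UCD data:

    # Reject labels that mix scripts, except whitelisted combinations
    # such as Japanese (Han + Hiragana + Katakana).
    SCRIPT_RANGES = [
        (0x0041, 0x005a, 'Latin'), (0x0061, 0x007a, 'Latin'),
        (0x0400, 0x04ff, 'Cyrillic'),
        (0x3040, 0x309f, 'Hiragana'), (0x30a0, 0x30ff, 'Katakana'),
        (0x4e00, 0x9fff, 'Han'),
    ]
    ALLOWED_MIXTURES = [{'Han', 'Hiragana', 'Katakana'}]  # "modulo Jpan"

    def scripts_in(label):
        return {name for ch in label
                for lo, hi, name in SCRIPT_RANGES if lo <= ord(ch) <= hi}

    def is_mixed_script(label):
        found = scripts_in(label)
        return len(found) > 1 and not any(
            found <= ok for ok in ALLOWED_MIXTURES)

    print(is_mixed_script('p\u0430ypal'))  # Cyrillic 'а' amid Latin -> True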

(The only problem I could see is that UTC would 
have to accept CYRILLIC WE, and possibly LATIN 
SOFT SIGN and LATIN THETA, as characters used for 
specialist purposes. We are talking less than two 
dozen characters here, and I'm being pretty 
generous in my estimate. A small price to pay for 
security.)

>Instead, it appears that the solutions taken by 
>the user agents are sufficient there: while the 
>" <http://paypal.com>paypal.com" case got a lot 
>of attention, when you look at the actual, 
>practical impact in terms of real, reported 
>security problems, it is not in practice 
>significant. I have no doubt that the 
>user-agents will continue to refine and improve 
>their approaches.

I don't understand how you can say that the paypal.com case is "insignificant".
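
Still, to give the user-agent approach its due, the fallback Mark describes (flagging with raw punycode) is easy enough to sketch. Python's built-in 'idna' codec implements the older IDNA2003 mapping, but it serves to show the mechanism:

    # A label that mixes Cyrillic 'а' into an otherwise Latin name:
    suspicious = 'p\u0430ypal.com'
    # The raw ACE (xn--...) form a cautious browser could display
    # instead of the pretty Unicode form:
    print(suspicious.encode('idna'))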

>So, what progress are we making?
>
>1. Looks like we have a solution

*If* Unicode 5.1, and *if* IETF bites the bullet 
and realizes that there will be a Unicode 6, and 
7, and 8, each of which may add needed characters. 
(No, Vint, I'm not talking about non-essential 
characters.)

>2a. Also looks like we have a solution

If we change the rule.

>2b. Not yet consensus on this

Going to have to bite this bullet for some 
scripts, but if script properties are accessed, 
and the use of the joiners is restricted to 
certain script (or even certain character) 
environments, this may not be a problem.
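
One way to pin down "certain character environments" for ZWNJ: tolerate U+200C only immediately after a virama (canonical combining class 9), which covers the Brahmic conjunct-breaking use. A minimal sketch of my simplification, not an agreed rule:

    import unicodedata

    ZWNJ = '\u200c'

    def zwnj_ok(label: str) -> bool:
        for i, ch in enumerate(label):
            if ch == ZWNJ:
                # combining() returns the canonical combining class;
                # viramas have class 9.
                if i == 0 or unicodedata.combining(label[i - 1]) != 9:
                    return False
        return True

    # DEVANAGARI KA + VIRAMA + ZWNJ + SSA, a legitimate sequence:
    print(zwnj_ok('\u0915\u094d\u200c\u0937'))  # True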

>3. Looks like we have a solution (restrict the 
>sequences that could change between 3.2 and 5.0; 
>the Unicode consortium is tightening stability 
>to disallow further changes)

Please create a separate thread to discuss this particular issue.
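
(For what it is worth, one concrete reading of that restriction is simply to require labels to be in NFC under the current tables and lean on the consortium's stability guarantees. A minimal sketch; is_normalized needs Python 3.8+:

    import unicodedata

    def label_normalization_ok(label: str) -> bool:
        # Accept only labels already in NFC; the stability policy is
        # what makes this check meaningful across Unicode versions.
        return unicodedata.is_normalized('NFC', label)

But let us take that up in the separate thread.)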

>4a. Our proposed rules fix this. By tossing out 
>all non-LMN, we remove the bad cases. Although 
>the problematic characters are only a small 
>fraction of the few thousand characters in 
>question, there is general 
>agreement that as a class these are not needed, 
>and we are not worried about tossing out any 
>babies with the bathwater.

The list should be reviewed. I'm not saying you 
haven't done a good job, but I haven't reviewed 
it, and I don't know if anyone else has either.
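
Meanwhile, the rule itself is easy to state mechanically. A minimal sketch of "toss out all non-LMN" as I understand it; note that even the ASCII hyphen (category Pd) would need its own exception:

    import unicodedata

    def only_lmn(label: str) -> bool:
        # General_Category must start with L (letter), M (mark),
        # or N (number) for every code point.
        return all(unicodedata.category(ch)[0] in 'LMN' for ch in label)

    print(only_lmn('amazon'))        # True
    print(only_lmn('amazon\u2044'))  # FRACTION SLASH is Sm -> False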

>4b. We're not tackling this in the protocol, 
>leaving it up to user agents (and to some degree 
>registries).

I think it is NOT A GOOD IDEA to leave this 
untackled. IT WOULD BE A VERY GOOD IDEA for this 
to be dealt with in the protocol itself. It would 
be far safer for the end user, because there 
would be no danger of error (intentional or 
unintentional) on the part of agents or 
registries. We *should* police this because we 
can.

>4c. Here I also suspect that the principal 
>solution is in the user agents, but what we can 
>do at the protocol level is to make some 
>exclusions where there are clear cases that we 
>can handle via well-established properties, or 
>particular exception cases where we add or 
>remove particular character(s). What we have 
>done so far is to toss out certain classes of 
>characters that are clearly not needed for 
>modern languages (historic scripts). [Here 
>again, frankly, their removal doesn't 
>fundamentally reduce spoofability, but it does 
>little harm to remove them. But because there is 
>not much benefit to their removal, we don't 
>really need to argue whether there is a real 
>need for ones like Runic, because there aren't 
>really demonstrable problems with allowing it, 
>given solutions in (4b).]

For this I suspect that the best we can do is 
make recommendations. STRONG recommendations 
based on real linguistic knowledge and data. 
Recommendations so strong that a given registry 
should have to give reasons for deviating from 
them.

>(4c) is where your current question falls. These 
>are characters that are not covered by the rules 
>we have developed so far. My suggested 
>criteria are:
>
>A. If there is a clearly defined class of 
>characters that are clearly never needed in 
>modern languages (in this case Hebrew/Yiddish), 
>we can exclude them.

"In this case"? But I agree, linguistic expertise 
can help weed out characters which are really not 
needed.

>B. If there are particular characters that may 
>be used as a normal part of the language that we 
>want to consider including or excluding, then we 
>consider two factors in weighing the question:
>
>   B1. Can this character cause a spoofing 
>problem in a monoscript string, and if so, how 
>severe is the problem?
>
>   B2. Is this character used in the regular 
>orthography of a modern language, and if so, how 
>essential is it?

Good questions, requiring linguistic expertise. 
This would be a "white list" sort of thing, not 
something that could be done algorithmically.
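
If it helps, such a white list could be layered over the algorithmic rule as a small table of per-character overrides recording the B1/B2 judgements. The entries below are invented placeholders, not proposals:

    # Hypothetical overrides; 'include'/'exclude' record the outcome of
    # weighing B1 (spoofing risk) against B2 (orthographic necessity).
    EXCEPTIONS = {
        '\u05f4': 'exclude',  # HEBREW PUNCTUATION GERSHAYIM (illustrative)
        '\u0f0b': 'include',  # TIBETAN MARK INTERSYLLABIC TSHEG (illustrative)
    }

    def permitted(ch: str, algorithmic_ok: bool) -> bool:
        override = EXCEPTIONS.get(ch)
        if override is not None:
            return override == 'include'
        return algorithmic_ok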

>We want to keep the exceptional characters 
>(included or excluded) that are not covered by 
>the normal rules we've developed so far to a 
>minimum, so only those with a large negative 
>weight should get exceptionally excluded, and 
>only those with a high positive weight get 
>exceptionally included.

Agreed.

>For example, a character that looks like a 
>period or a slash (important syntax characters 
>in URLs), and is optional in the language (eg 
>used in abbreviations, but not regular words) 
>gets a large negative weight. A character that 
>doesn't look like a syntax character or another 
>Hebrew character, and is required by common 
>Hebrew or Yiddish words would get a high 
>positive weight.

Well, the Ethiopic wordspace looks like a colon 
to readers of Latin script, at least from a 
distance, though its dots are square rather than 
round. However, it can ONLY occur between two 
Ethiopic syllables, and (obviously) if it were 
entered accidentally inside "http://" it would 
cause no difficulty, because that would be no 
different from entering "http$//" -- it would 
have no effect, because it is not a protocol 
element.
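
In code terms, the wordspace would already fall to the non-LMN rule (it is General_Category Po); and if it were ever wanted, a contextual rule like the following sketch, with an illustrative syllable range, would confine it to the harmless position I describe:

    import unicodedata

    def ethiopic_syllable(ch: str) -> bool:
        # Illustrative range for the core Ethiopic syllabary.
        return '\u1200' <= ch <= '\u135a'

    def wordspace_ok(label: str) -> bool:
        for i, ch in enumerate(label):
            if ch == '\u1361':  # ETHIOPIC WORDSPACE
                if not (0 < i < len(label) - 1
                        and ethiopic_syllable(label[i - 1])
                        and ethiopic_syllable(label[i + 1])):
                    return False
        return True

    print(unicodedata.category('\u1361'))  # 'Po'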

I think we are making progress, and I hope my comments are helpful.
-- 
Michael Everson * http://www.evertype.com

