Criteria for exceptional characters
Michael Everson
everson at evertype.com
Sun Dec 17 01:37:25 CET 2006
At 15:58 -0800 2006-12-16, Mark Davis wrote:
>The major problems I see with the current system* are:
>
>1. It does not allow Unicode 5.0 characters.
To be honest, we MUST refer to Unicode 5.1. Of
course, all characters in Unicode 5.0 are
important, but if Unicode 5.1 is not taken as the
benchmark, the Myanmar (Burmese) script will be
left out, and that is simply not something that
can be countenanced.
>2. It restricts some combinations that are required for certain languages.
> a) Mn at the end of BIDI fields
This prevents Thaana from being used, as well as
Yiddish, and probably a number of languages which
use the Arabic script.
> b) ZWJ/ZWNJ in limited contexts
A problem for some Brahmic scripts, at least some
of the major scripts of India.
>3. There are concerns about the stability of normalization
Are they valid? What are they, specifically?
>4. There are opportunities for spoofing. This
>breaks down into a number of sub-problems, of
>which the major ones are:
> a) non-letter confusables, like fraction slash
>in
>amazon.com/badguy.com
> b) confusable letters/numbers within mixtures
>of scripts, like cyrillic 'a' in paypal.
I thought we agreed to ban this kind of mixing long ago.
One ramification of this would be that it would
no longer be possible to say that Kurdish Ww was
LATIN W instead of CYRILLIC WE. We would be
obliged, for security's sake, to encode CYRILLIC
WE. There would be no disadvantage here. In fact,
it would be better for Kurdish. Consider a
glossary of Kurdish words in its three
orthographies, Arabic, Latin, and Cyrillic. If
LATIN W and CYRILLIC WE are encoded separately,
it is possible to correctly sort (or search, à la
Google) the multi-script list. If they are
unified with LATIN W (as at present), there is no
solution.
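The glossary argument above can be sketched in code. This is my illustration, not anything from the discussion itself: it assumes a crude per-character script classifier (the code point ranges are rough stand-ins for the real Unicode Script property) and shows that once Latin and Cyrillic letters occupy distinct code points, a multi-script word list can be grouped and sorted by script.

```python
# A minimal sketch (illustrative only) of sorting a multi-script glossary.
# The script ranges below are rough stand-ins for the Unicode Script property.

def script_of(ch):
    """Very rough script classifier for the sake of the example."""
    cp = ord(ch)
    if 0x0041 <= cp <= 0x024F:      # Basic Latin .. Latin Extended-B
        return "Latin"
    if 0x0400 <= cp <= 0x052F:      # Cyrillic .. Cyrillic Supplement
        return "Cyrillic"
    if 0x0600 <= cp <= 0x06FF:      # Arabic
        return "Arabic"
    return "Other"

def glossary_key(word):
    # Sort first by the script of the word, then by code point within it.
    return (script_of(word[0]), word)

# With distinct code points, a Latin entry and a look-alike Cyrillic entry
# land in different script groups instead of interleaving unpredictably.
entries = ["wer", "\u0432\u0435\u0440"]   # Latin "wer", Cyrillic "вер"
print(sorted(entries, key=glossary_key))
```

If the look-alike letters were unified under one code point, `script_of` could not tell the orthographies apart, which is exactly the problem described above.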
>c) confusable letters in same script, like inte1.com
>[There is a finer breakdown in
>http://unicode.org/reports/tr36/]
Well worth reading.
>The reason I say "system*" is that the options
>for solutions can be at different points:
>
>A. What should the protocol allow?
>B. What should a registry allow?
>C. What should the user agent block (or flag, eg with raw punycode)?
>
>For example, nobody has yet proposed that the
>protocol disallow mixtures of scripts, even
>though that represents by far the largest
>opportunity for spoofing.
This is *NOT* correct. I have advocated this
more or less loudly since September 2005, when I
discussed the question at length with Cary Karp
when I was at the Sophia Antipolis meeting of WG2
and advised him on the draft recommendations he
was writing. I continue to favour this
anti-spoofing solution, and if, as has been
suggested, Unicode script properties of
characters can be used to ensure that scripts are
not mixed or mixable (modulo Jpan for instance)
then there should be no problem with this.
(The only problem I could see is that UTC would
have to accept CYRILLIC WE, and possibly LATIN
SOFT SIGN, LATIN THETA as characters used for
specialist purposes. We are talking about fewer
than two dozen characters here, and I'm being pretty
generous in my estimate. A small price to pay for
security.)
>Instead, it appears that the solutions taken by
>the user agents are sufficient there: while the
>"paypal.com" case got a lot
>of attention, when you look at the actual,
>practical impact in terms of real, reported
>security problems, it is not in practice
>significant. I have no doubt that the
>user-agents will continue to refine and improve
>their approaches.
I don't understand how you can say that the paypal.com case is "insignificant".
>So, what progress are we making?
>
>1. Looks like we have a solution
*If* Unicode 5.1, and *if* IETF bites the bullet
and realizes that there will be a Unicode 6, and
7, and 8, which may have needed characters. (No,
Vint, I'm not talking about non-essential
characters.)
>2a. Also looks like we have a solution
If we change the rule.
>2b. Not yet consensus on this
Going to have to bite this bullet for some
scripts, but if script properties are accessed,
and the use of the joiners is restricted to
certain script (or even certain character)
environments, this may not be a problem.
>3. Looks like we have a solution (restrict the
>sequences that could change between 3.2 and 5.0;
>the Unicode consortium is tightening stability
>to disallow further changes)
Please create a separate thread to discuss this particular issue.
>4a. Our proposed rules fix this. By tossing out
>all non-LMN, we remove the bad cases. Although
>the problematic characters are a small fraction
>of the few thousand characters in question, most
>of which are not problematic, there is general
>agreement that as a class these are not needed,
>and we are not worried about tossing out any
>babies with the bathwater.
The list should be reviewed. I'm not saying you
haven't done a good job, but I haven't reviewed
it, and I don't know if anyone else has either.
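For review purposes, the quoted "toss out all non-LMN" rule is easy to state mechanically. The sketch below is my illustration of what that class boundary means, using the Unicode general category: keep only Letters (L*), Marks (M*), and Numbers (N*). Real IDNA rules would still need exceptions such as the hyphen.

```python
# A sketch of the "LMN only" filter quoted above: a character survives
# only if its Unicode general category starts with L, M, or N.
# (Real IDNA rules would carve out exceptions, e.g. for the hyphen.)
import unicodedata

def is_lmn(ch):
    return unicodedata.category(ch)[0] in ("L", "M", "N")

def survivors(label):
    """Characters of `label` that pass the L/M/N filter."""
    return [ch for ch in label if is_lmn(ch)]

# U+2044 FRACTION SLASH has category Sm (Symbol, math), so the spoofing
# character from the amazon.com/badguy.com example is rejected.
print(survivors("bad\u2044guy"))
```

Reviewing the exclusion list then amounts to checking which characters this mechanical rule keeps or drops against real orthographic need, which is exactly the linguistic review being asked for.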
>4b. We're not tackling this in the protocol,
>leaving it up to user agents (and to some degree
>registries).
I think it is NOT A GOOD IDEA to leave this
untackled in the protocol; it WOULD BE A VERY
GOOD IDEA for this to be dealt with there. It
would be far safer for the end user, because
there would be no danger of error (intentional
or unintentional) on the part of agents or
registries. We *should* police this because we
can.
>4c. Here I also suspect that the principal
>solution is in the user agents, but what we can
>do at the protocol level is to make some
>exclusions where there are clear cases that we
>can handle via well-established properties, or
>particular exception cases where we add or
>remove particular character(s). What we have
>done so far is to toss out certain classes of
>characters that are clearly not needed for
>modern languages (historic scripts). [Here
>again, frankly, their removal doesn't
>fundamentally reduce spoofability, but it does
>little harm to remove them. But because there is
>not much benefit to their removal, we don't
>really need to argue whether there is a real
>need for ones like Runic, because there aren't
>really demonstrable problems with allowing it,
>given solutions in (4b).]
For this I suspect that the best we can do is
make recommendations. STRONG recommendations
based on real linguistic knowledge and data.
Recommendations so strong that a given registry
should have to give reasons for deviating from
them.
>(4c) is where your current question falls. These
>are characters that are not covered by the rules
>we have developed so far. My suggestions for
>criteria are:
>
>A. If there is a clearly defined class of
>characters that are clearly never needed in
>modern languages (in this case Hebrew/Yiddish),
>we can exclude them.
"In this case"? But I agree, linguistic expertise
can help weed out characters which are really not
needed.
>B. If there are particular characters that may
>be used as a normal part of the language that we
>want to consider including or excluding, then we
>consider two factors in weighing the question:
>
> B1. Can this character cause a spoofing
>problem in a monoscript string, and if so, how
>severe is the problem?
>
> B2. Is this character used in the regular
>orthography of a modern language, and if so, how
>essential is it?
Good questions, requiring linguistic expertise.
This would be a "white list" sort of thing, not
something that could be done algorithmically.
>We want to keep the exceptional characters
>(included or excluded) that are not covered by
>the normal rules we've developed so far to a
>minimum, so only those with a large negative
>weight should get exceptionally excluded, and
>only those with a high positive weight get
>exceptionally included.
Agreed.
>For example, a character that looks like a
>period or a slash (important syntax characters
>in URLs), and is optional in the language (eg
>used in abbreviations, but not regular words)
>gets a large negative weight. A character that
>doesn't look like a syntax character or another
>Hebrew character, and is required by common
>Hebrew or Yiddish words would get a high
>positive weight.
Well, the Ethiopic wordspace looks like a colon
to readers of Latin script, at least from a
distance, though its dots are square and not
round. However, it can ONLY occur between two
Ethiopic SYLLABLEs, and (obviously) if it were
entered accidentally inside "http://" it would
cause no difficulty, because that would be no
different from entering "http$//" -- it would
have no effect because it is not a protocol
element.
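The contextual restriction described above is the kind of rule that can be checked mechanically. The sketch below is my own illustration (not a normative rule): it permits ETHIOPIC WORDSPACE (U+1361) only when both neighbours fall in the Ethiopic block.

```python
# A sketch (illustrative, not normative) of the contextual restriction
# described above: ETHIOPIC WORDSPACE (U+1361) is permitted only between
# two Ethiopic characters.

def is_ethiopic(ch):
    return 0x1200 <= ord(ch) <= 0x137F   # Ethiopic block (rough bound)

def wordspace_ok(text):
    for i, ch in enumerate(text):
        if ch == "\u1361":
            if i == 0 or i == len(text) - 1:
                return False
            if not (is_ethiopic(text[i - 1]) and is_ethiopic(text[i + 1])):
                return False
    return True

print(wordspace_ok("\u1230\u1361\u1230"))  # between two Ethiopic syllables
print(wordspace_ok("http\u1361//"))        # not an Ethiopic context
```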
I think we are making progress, and I hope my comments are helpful.
--
Michael Everson * http://www.evertype.com
More information about the Idna-update
mailing list