Criteria for exceptional characters

Mark Davis mark.davis at icu-project.org
Sun Dec 17 00:58:32 CET 2006


When we set out to deal with an engineering problem, it helps to describe
carefully the problem, including giving scenarios which (a) clearly exhibit
the problem, and (b) are weighted by importance. One can then assess all of
the options for solutions, judging them against the statement of the
problem, and assessing based upon their ROI: what is their cost, what is the
importance of the scenarios they solve, and what are their interactions with
other options. Moreover, you keep one eye on what is happening out in the
world, to see if what you thought was a problem ends up not being as
important as you thought originally.

We don't really yet have a clear statement of the problem, with a list of
scenarios demonstrating each of the issues, which is part of the reason that
I think we may end up diverging on the means. So I'll try to set out a
strawman.

The major problems I see with the current system* are:

1. It does not allow Unicode 5.0 characters.
2. It restricts some combinations that are required for certain languages.
  a) Mn at the end of BIDI fields
  b) ZWJ/ZWNJ in limited contexts
3. There are concerns about the stability of normalization.
4. There are opportunities for spoofing. This breaks down into a number of
sub-problems, of which the major ones are:
  a) non-letter confusables, like fraction slash in amazon.com/badguy.com
  b) confusable letters/numbers within mixtures of scripts, like cyrillic
'a' in paypal.
  c) confusable letters in same script, like inte1.com
[There is a finer breakdown in http://unicode.org/reports/tr36/]
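To make case (a) concrete, here is a minimal Python sketch (using only the standard unicodedata module; the strings are illustrative, not taken from any real incident report). It shows why U+2044 FRACTION SLASH is dangerous in a label: it looks like the ASCII solidus but is not one, so the whole string remains a single host name.

```python
import unicodedata

# U+2044 FRACTION SLASH visually resembles the ASCII solidus U+002F,
# but inside a hostname it is just an ordinary character, not a
# path separator.
spoof = "amazon.com\u2044badguy.com"
real = "amazon.com/badguy.com"

print(unicodedata.name("\u2044"))  # FRACTION SLASH
print(spoof == real)               # False: different code points

# The spoof string contains no ASCII '/', so the whole thing would be
# treated as a single host label rather than host + path.
print("/" in spoof)                # False
```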

The reason I say "system*" is that the options for solutions can be at
different points:

A. What should the protocol allow?
B. What should a registry allow?
C. What should the user agent block (or flag, eg with raw punycode)?

For example, nobody has yet proposed that the protocol disallow mixtures of
scripts, even though that represents by far the largest opportunity for
spoofing. Instead, it appears that the solutions taken by the user agents
are sufficient there: while the "paypal.com" case got a lot of attention,
when you look at the actual, practical impact in terms of real, reported
security problems, it is not in practice significant. I have no doubt that
the user-agents will continue to refine and improve their approaches.

So, what progress are we making?

1. Looks like we have a solution
2a. Also looks like we have a solution
2b. Not yet consensus on this
3. Looks like we have a solution (restrict the sequences that could change
between 3.2 and 5.0; the Unicode consortium is tightening stability to
disallow further changes)
4a. Our proposed rules fix this. By tossing out all non-LMN characters, we
remove the bad cases. Although the problematic characters are only a small
fraction of the few thousand characters in question, there is general
agreement that as a class these are not needed, and we are not worried about
tossing out any babies with the bathwater.
4b. We're not tackling this in the protocol, leaving it up to user agents
(and to some degree registries).
4c. Here I also suspect that the principal solution is in the user agents,
but what we can do at the protocol level is to make some exclusions where
there are clear cases that we can handle via well-established properties, or
particular exception cases where we add or remove particular character(s).
What we have done so far is to toss out certain classes of characters that
are clearly not needed for modern languages (historic scripts). [Here again,
frankly, their removal doesn't fundamentally reduce spoofability, but it
does little harm to remove them. But because there is not much benefit to
their removal, we don't really need to argue whether there is a real need
for scripts like Runic, because there aren't really demonstrable problems
with allowing them, given the solutions in (4b).]
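The "toss out all non-LMN" rule in (4a) can be sketched in a few lines of Python. This is an illustration of the General_Category test only, not the actual protocol rule (the real derivation involves many more conditions, and the function names here are my own):

```python
import unicodedata

def is_lmn(ch: str) -> bool:
    """True if ch is a Letter, Mark, or Number by General_Category."""
    return unicodedata.category(ch)[0] in ("L", "M", "N")

def allowed_by_lmn_rule(label: str) -> bool:
    # The ASCII hyphen is traditionally permitted (the old LDH rule);
    # everything else must be a Letter, Mark, or Number.
    return all(ch == "-" or is_lmn(ch) for ch in label)

print(allowed_by_lmn_rule("example"))       # True: all letters
print(allowed_by_lmn_rule("bad\u2044guy"))  # False: FRACTION SLASH is Sm
```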

(4c) is where your current question falls. These are characters that are not
covered by the rules we have developed so far. My suggested criteria are:

A. If there is a clearly defined class of characters that is never needed in
modern languages (in this case Hebrew/Yiddish), we can exclude them.

B. If there are particular characters that may be used as a normal part of
the language that we want to consider including or excluding, then we
consider two factors in weighing the question:

  B1. Can this character cause a spoofing problem in a monoscript string,
and if so, how severe is the problem?

  B2. Is this character used in the regular orthography of a modern
language, and if so, how essential is it?

We want to keep the exceptional characters (included or excluded) that are
not covered by the normal rules we've developed so far to a minimum, so only
those with a large negative weight should get exceptionally excluded, and
only those with a high positive weight should get exceptionally included.

For example, a character that looks like a period or a slash (important
syntax characters in URLs), and is optional in the language (eg used in
abbreviations, but not regular words) gets a large negative weight. A
character that doesn't look like a syntax character or another Hebrew
character, and is required by common Hebrew or Yiddish words would get a
high positive weight.
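The geresh/gershayim case in Cary's message below shows why these weights matter. A quick check of the General_Category values (standard unicodedata module; output verifiable against the UCD) shows that a bare Letter/Mark/Number rule admits the cantillation accents but excludes the punctuation forms of the same name:

```python
import unicodedata

# The two look-alike pairs: punctuation geresh/gershayim are
# General_Category Po (excluded by an LMN rule), while the cantillation
# accents of the same name are Mn (admitted by an LMN rule).
for cp in (0x05F3, 0x05F4, 0x059C, 0x059E):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")
```

This is exactly the kind of exception case that criteria B1/B2 are meant to weigh.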

Mark

On 12/16/06, Cary Karp <ck at nic.museum> wrote:
>
> Perhaps we can now take a look at the way the Hebrew script is being
> handled?
>
> The table currently excludes the U+05F3 HEBREW PUNCTUATION GERESH and
> the U+05F4 HEBREW PUNCTUATION GERSHAYIM. Although it is reassuring to
> note that we recognize the fundamental role they play in Hebrew and
> Ladino orthographies, and the likelihood of their appearing in the
> exception table, I am a bit more concerned about the main table
> permitting the U+059C HEBREW ACCENT GERESH and the U+059E HEBREW ACCENT
> GERSHAYIM. The latter pair, along with the other 28 HEBREW ACCENTs
> strike me as prime examples of what we explicitly need to be excluding.
>
> We also all seem to recognize that a healthy amount of language-based
> tweaking is going to be done at the registry level (please note the
> distinction between "registry" and "registrar"), but that protocol level
> constraint is needed on the scope of the policies that can be
> implemented. (Why else are we having this discussion?) Given the
> nomenclatural and graphic similarities between the two forms of GERESH
> and GERSHAYIM, there is an obvious risk that our permitting the wrong
> ones, but excluding the right ones, will be taken as a reasoned
> statement of what registry practice should be.
>
> Can the rules, or the sequence of their application be modified to
> include the two characters that are missing, or are we stuck with
> allowing the dozens of characters that are not needed for IDN, and
> treating the remaining two as exceptions?  One alternative would be
> simply to permit all Hebrew characters in the range 0591..05F4. (At
> least one of the three characters that would thereby be reintroduced,
> U+05BE HEBREW PUNCTUATION MAQAF, can be justified in its own right.)
> This would make nothing substantially worse, and would at least call
> registry attention to the fact that there are different kinds of geresh.
>
> Better still would be to enable registries only to accept the characters
> for which warrant can be demonstrated, and to block all the others.
> Finally, given the disparate opinions about the IPA extensions already
> voiced on this list, I don't see how the registries can be adequately
> supported without our making a similarly detailed enforced inventory of
> the IPA need'ems and don'ts.
>
> /Cary
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>