Unicode versions (Re: Criteria for exceptional characters)

Mark Davis mark.davis at icu-project.org
Tue Dec 19 23:25:55 CET 2006


> Many, including Arabic, Sanskrit and Dhivehi. Possibly Hebrew too. But
> "leaving out" may be an underspecified term here - see next comment.

Your statement pretty much floored me. Before we take the ability to use
domain names away from billions of people, it'd be good to have solid,
defensible reasons for doing so.

I really would like to get back to my original message, which was to try to
get a solid problem statement so that we can assess what we are doing
against that. John hasn't replied to Ken's suggestions for changes to
http://www.ietf.org/internet-drafts/draft-klensin-idnabis-issues-00.txt, and
frankly even with those changes the document does not yet provide sufficient
rationale for the steps it proposes.

As I said, when we go out to fix an engineering problem, we need to have a
clear statement of the problems, with example scenarios for each. Without
the scenarios, you often don't get a clear idea from all parties as to what
the issues really are, and why fixes need to be made. You can then assess
the options based on how well they handle the problems, and you have some
concrete cases to look at, instead of misty, inchoate anxieties.

So I'd really like feedback on the problem statement, included below. I made
an update to #1 as per your message. I don't think at all that #2 and #4 are
irrelevant -- you may have read them over too quickly. We need a good
statement as to the problems that removing Arabic, for example, would fix
AND why we can't solve the problem without removing Arabic.

Here is a restatement of what I see as the problems with the current IDNA.

1. It is bound to a specific version of Unicode, and therefore does not
allow the adoption of new scripts over time; in particular, it does not
allow Unicode 5.0 characters. Examples: see the updated section "## Show a
list of all the characters not currently allowed" of
http://www.macchiato.com/idn/UnicodePropertyResults.html . This takes the
current proposal given by the rules we're working on, and shows the
characters not permitted by the current IDNA. It currently amounts to 956
characters. (You may have to refresh your browser to see it.) Now, many of
these are historic characters, and won't much matter to modern users, but
many are in current, modern usage, and required by well-populated languages.

2. It restricts some combinations that are required for certain languages.
  a) nonspacing marks (Mn) at the end of BIDI fields, as in Dhivehi; see
http://www.ietf.org/internet-drafts/draft-alvestrand-idna-bidi-00.txt
  b) ZWJ/ZWNJ in limited contexts; see
http://www.unicode.org/review/pr-96.html

3. There are concerns about the stability of normalization
(discussed elsewhere)

4. There are opportunities for spoofing. This breaks down into a number of
sub-problems, of which the major ones are:
  a) non-letter confusables, like the fraction slash in amazon.com/badguy.com
  b) confusable letters/numbers within mixtures of scripts, like Cyrillic
'a' in paypal.com
  c) confusable characters in the same script, like inte1.com (digit 1 in
place of the letter l)
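
To make problems 1 and 4 a bit more concrete, here is a rough Python sketch.
It is only an illustration, not part of any proposal. It assumes that the
Unicode 3.2 snapshot in Python's unicodedata module is a fair stand-in for
the version the current IDNA is tied to, and it approximates a character's
script by the first word of its character name, since the standard library
does not expose the Script property. The helper names are made up for this
example.

```python
import unicodedata

# Unicode 3.2 snapshot shipped with Python, used here as a stand-in for the
# Unicode version that the current IDNA is bound to.
UCD_3_2 = unicodedata.ucd_3_2_0


def assigned_in_unicode_3_2(ch: str) -> bool:
    """Return True if the code point was assigned as of Unicode 3.2."""
    return UCD_3_2.category(ch) != "Cn"


def rough_script(ch: str) -> str:
    """Approximate the script by the first word of the character name.

    Assumption: a crude substitute for the real Script property.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def letter_scripts(label: str) -> set:
    """Collect the approximate scripts of all letters in a label."""
    return {
        rough_script(ch)
        for ch in label
        if unicodedata.category(ch).startswith("L")
    }


# Problem 1: THAANA LETTER HAA predates Unicode 3.2, while NKO LETTER A
# was added in Unicode 5.0 and is unassigned in the 3.2 data.
print(assigned_in_unicode_3_2("\u0780"), assigned_in_unicode_3_2("\u07ca"))

# Problem 4a: FRACTION SLASH is a symbol, not a letter.
print(unicodedata.category("\u2044"))

# Problem 4b: a Cyrillic 'a' makes the label mix two scripts.
print(letter_scripts("paypal"), letter_scripts("p\u0430ypal"))

# Problem 4c: a single-script check sees nothing wrong with this one.
print(letter_scripts("inte1"))
```

Note that the last case comes back as plain Latin: same-script confusables
cannot be caught by script mixing tests at all, which is one reason the
sub-problems in #4 need to be assessed separately.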

If there are other problems beyond these, I'd really like to know about
them. Otherwise I can just foresee continuing confusion.

Mark

On 12/19/06, Harald Alvestrand < harald at alvestrand.no> wrote:
>
> Mark Davis wrote:
> >
> >     If that is accepted as the problem definition, it is reasonable to
> >     assume that a solution does NOT lock us again into a fixed set of
> >     scripts, but rather allows scripts to be added in an incremental
> >     fashion.
> >     And if that is accepted, the option of disallowing a script "until
> >     we have sorted out the identified issues" becomes far less of an
> >     issue than it seems to be regarded by Mark/Ken/Michel today
> >     (apologies if I have mischaracterized a position here).
> >
> >
> >
> > I think the "until we have sorted out the identified issues" is too
> > vague to be a useful criterion. There is general consensus that there
> > isn't any problem with leaving out the historic scripts (although, as
> > I said, frankly it doesn't buy much in terms of reducing spoofing).
> > But which other scripts did you have in mind omitting, and on what
> > grounds?
> Many, including Arabic, Sanskrit and Dhivehi. Possibly Hebrew too. But
> "leaving out" may be an underspecified term here - see next comment.
> >
> > There is also a big difference between the flexibility in the protocol
> > vs that available to registries and user-agents. Suppose that in the
> > protocol we allow Hebrew, but recommend against (for some reason)
> > final forms of letters. Registries and user-agents can then start by
> > following those recommendations, but if it turns out to be necessary
> > to allow them in (either fully or in limited circumstances), it is
> > relatively easy for them to do so. Baking a prohibition against
> > final-forms of letters into the protocol is a much different matter --
> > it takes quite a while for everyone to update to a new version. (And
> > during that time, I have no doubt that we will hear charges of
> > discrimination...)
> >
> You may want to review draft-klensin-idnabis-issues again, and see at
> which steps of the protocol we are thinking of switching to an
> inclusion-based model that starts off with a sharply limited set.
>
> I think we are best served if we install the maximum amount of
> restrictions initially in section 2.1.3 "Registration of IDNs -
> Permitted Character Identification" (and therefore also in section
> 2.1.5), while we should install a minimum set of restrictions in section
> 2.2.3 "Domain Name Resolution - Pre-Nameprep Validation and Character
> List Testing".
> >
> >     I think.
> >
> >
> >
> > Since you didn't comment on any of the other issues I wrote, does that
> > mean that you agree with them, or that you just hadn't gotten to them?
> > ;-)
> It meant that I regarded them as irrelevant until we get this one
> settled, I think.
>
>