IDNAbis Main Open Issues

Sat Jan 19 20:33:51 CET 2008

Here's what I think are the main issues.

(live document at http://docs.google.com/Doc?id=dfqr8rd5_50hdnzwmdh)

   1. Settle on character repertoire. Basic formulation is ok, but...
      1. Add extensions for stability
      2. Don't be Eurocentric: ensure that modern scripts are in
      ALWAYS (or whatever it is called)
      3. Resolve ALWAYS/MAYBE/NEVER problem (see below).

There are other smaller issues, wording of the text, continuing to make
progress on BIDI, and so on, but I think the above are the chief remaining
issues to get consensus on.

In particular, it sounds like people would not be averse to having a
separate preprocessing document, which we think is required; as long as
we do that, I'm not including it here. I have a draft at Draft IDNA
Preprocessing <http://docs.google.com/Doc?id=dfqr8rd5_51c3nrskcx> for
discussion. Of course, that doesn't mean that we agree on the details of
that yet!

ALWAYS/MAYBE/NEVER problem

If we take the very strong approach that Patrik has currently, where only
Latin, Cyrillic, and Greek are really guaranteed, then over 90 thousand
characters are no longer guaranteed to be part of IDNA. I use the phrase
"not guaranteed" deliberately. In Google, for example, we want to be able
to look at a URL and say that it is either compliant with IDNAbis or not.
And we don't want its compliance to change from TRUE to FALSE depending on
the browser, or over time. It's ok for it to change from FALSE to TRUE in
the future, but not the reverse.

The operational difference between MAYBE and ALWAYS is a problem.

Let's take a look at document authoring. If HTML document authors/generators
use hyperlinks with IRIs that contain characters in the MAYBE category
(since registries are allowed to register them, even if it's not
recommended), then those links would break if any of those characters became
NEVERs (assuming that browsers obey the rule that NEVERs must not be looked
up). That makes perfectly conformant pages suddenly become non-conformant.

The structure right now in tables/protocol gives user-agents (including
browsers, but also search engines like Google's) a choice of the frying pan
or the fire:

   1. If we want stability, only accept ALWAYS. That's untenable, since
   we couldn't handle most of the world.
   2. If we want to serve our customers, accept ALWAYS+MAYBE. That is
   unstable, since MAYBE could change to NEVER at any time.
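
To make the two choices concrete, here is a minimal Python sketch. The
category tables, the sample character, and the names accepts, STRICT, and
PERMISSIVE are all hypothetical, made up for illustration; they are not
taken from the actual IDNAbis tables.

```python
from enum import Enum


class Category(Enum):
    ALWAYS = "ALWAYS"
    MAYBE = "MAYBE"
    NEVER = "NEVER"


# Hypothetical assignments for illustration only; not the real tables.
TABLE_V1 = {
    "a": Category.ALWAYS,
    "\u0561": Category.MAYBE,  # ARMENIAN SMALL LETTER AYB, put in MAYBE for the example
}

STRICT = {Category.ALWAYS}                      # choice 1: stability
PERMISSIVE = {Category.ALWAYS, Category.MAYBE}  # choice 2: serve customers


def accepts(label, table, allowed):
    """True if every character of label has a category in `allowed`.

    Assumption: characters missing from the table count as NEVER.
    """
    return all(table.get(ch, Category.NEVER) in allowed for ch in label)


label = "a\u0561"
strict_now = accepts(label, TABLE_V1, STRICT)          # rejected under choice 1
permissive_now = accepts(label, TABLE_V1, PERMISSIVE)  # accepted under choice 2

# A later (hypothetical) table that demotes the MAYBE character to NEVER.
TABLE_V2 = dict(TABLE_V1)
TABLE_V2["\u0561"] = Category.NEVER
permissive_later = accepts(label, TABLE_V2, PERMISSIVE)  # same label, now rejected
```

Under these assumptions the strict policy rejects the label from the start,
while the permissive policy accepts it and then stops accepting it once the
table changes.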

Programs always have to be prepared for characters becoming valid that were
not in previous versions. With any new version of Unicode, a company like
Apple, Google, or Microsoft updates its software to use that version, and
characters become acceptable that were not previously. Everyone who deals
with Unicode needs to be prepared for that. (This situation is a bit like
language/country codes, where new ones arrive on our doorstep -- but
mechanisms like BCP 47 ensure that the old ones never become invalid.) The
key requirement for stability is that characters that were acceptable in
IDNAbis don't suddenly become invalid. If a character (or script) moves from
(MAYBE + NEVER) into ALWAYS in the future, it is not a problem for
implementers. Moving from (MAYBE + ALWAYS) to NEVER is a serious problem.

My recommendation is and has been: permit all characters from all modern
scripts. Those are easily identified, and do not disadvantage any modern
language group. It does not require an elaborate -- and probably unworkable
-- process for getting buy-in. It would be acceptable to have historic
scripts in MAYBE, on the off chance that there is a successful revival,
because it doesn't put us in the frying-pan-or-fire position above, since
all modern scripts would be allowed.

This protocol is the wrong place to be making fine-grained linguistic
determinations in any event. Restrictions can be imposed by registries or
other parties, and by user-agents where needed. Such restrictions are an
exceedingly small problem compared to handling the issues raised by spoofs
like "paypal.com".

If this approach is argued against, it should be with concrete examples that
can be reviewed and assessed. And the bad cases have to be sufficient in
number to warrant the complexity of the ALWAYS, MAYBE, NEVER process.