IDNAbis Main Open Issues

Sun Jan 20 06:22:26 CET 2008

--On Saturday, 19 January, 2008 11:33 -0800 Mark Davis
<mark.davis at icu-project.org> wrote:

> Here's what I think are the main issues.

Mark,

I'm going to try to comment on a few of these in the hope of
making a bit more progress before you get on the plane (or at
least of putting any differences of opinion in sharper focus).
Please note that these are quick personal impressions only.

> (live document at
> http://docs.google.com/Doc?id=dfqr8rd5_50hdnzwmdh)
>
>
>    1. Settle on character repertoire. Basic formulation is ok,
>       but...
>       1. Add extensions for stability
>       2. Don't be Eurocentric: ensure that modern scripts are
>          in ALWAYS (or whatever it is called)
>       3. Resolve ALWAYS/MAYBE/NEVER problem (see below).

The explanation of this situation in issues-06 is much better
than it is in issues-05 (others will have to judge whether it is
adequate).  However, the ALWAYS problem is that the category does
not necessarily cover all of a script, and the boundaries require
explicit, IDN-specific, input from the users of that script (as
we have both noted earlier, that is a tough problem, but let's
isolate it a bit).

So, for example, we've got advice from the three important
pieces of the CJK community about characters that are considered
appropriate (by one or more of them) for use in IDNs (see
RFC3743, RFC4713, and the corresponding IANA table
registrations).  So, because the community has told us what they
consider safe, appropriate, and sufficiently unambiguous for IDN
use, those characters go into ALWAYS.  The rest of the CJK
characters (i.e., the characters of the Han and Han-derived
script) belong in MAYBE somewhere.  Since they are legitimate
language characters and the community has not said "these are
evil" or the equivalent, they cannot be classified as NEVER and
should not be.   Which of the MAYBE categories those additional
characters belong in is still something we are trying to sort
out (and hence another unresolved issue), but they certainly
don't go to ALWAYS.

I don't have an opinion about it and seek your advice but, to
the extent to which there are particularly problematic
characters in Western European scripts, they might be kept out
of ALWAYS for the present and until more experience and advice
from the affected parties comes along.   Such characters might
include the notorious dotless "i" and perhaps such difficulties
as Eszett and Final Sigma, although backward-compatibility or
other considerations might dictate the handling of those
characters.

Note that the model described above involves splitting up the
characters of scripts in ways that cannot be done by Unicode
properties alone since judgments specific to IDN usability are
required.

> There are other smaller issues, wording of the text,
> continuing to make progress on BIDI, and so on, but I think
> the above are the chief remaining issues to get consensus on.
>
>  In particular, it sounds like people would not be averse to
> having a separate preprocessing document, which we think is
> required, so as long as we do that I'm not including it here.
> I have a draft at Draft IDNA Preprocessing
> <http://docs.google.com/Doc?id=dfqr8rd5_51c3nrskcx> for
> discussion. Of course, that doesn't mean that we agree on the
> details for that yet!

There is considerable text on the preprocessing issue in
issues-06 (see my previous note) although I don't think nearly
as extensive as what you have done.   I will try to study your
draft in the next few days and see if I can identify any
important differences of perspective.

> ALWAYS/MAYBE/NEVER problem
>
> If we take the very strong approach
> that Patrik has currently, where only Latin, Cyrillic, and
> Greek are really guaranteed, then over 90 thousand characters
> are no longer guaranteed to be part of IDNA. I use the word
> "not guaranteed" specifically. In Google, for example, we want
> to be able to look at a URL and say that it is either
> compliant to IDNAbis or not. And we don't want its compliance
> to change from TRUE to FALSE according to browser, or in the
> future. It's ok for it to change from FALSE to TRUE in the
> future, but not the reverse.

I think we have a conceptual problem here even given your 
explanation below.  I'll try to address it in more detail 
separately but, with a relatively constrained set of exceptions 
(for IDNs, some are in IDNA2003 and the NEVER category obviously 
makes a much longer list), the only way to determine whether a 
domain name is valid is to look it up in the DNS.  In the past, 
local checks for validity caused us a lot of problems because 
applications "knew" that top-level domain names came in only 
three lengths -- two characters for ccTLDs, three characters for 
gTLDs, and four characters for ARPA.   That knowledge was used, 
not just to reject putatively-invalid names but in all sorts of 
heuristics to determine, e.g., the difference between an FQDN 
and a somewhat more localized name.

So the presumption here is that NEVER means what it says but
that, unless a string

    (i) contains a character that is NEVER in your current
    implementation (obviously characters cannot migrate out of
    NEVER and into a valid category), or

    (ii) violates some fundamental syntactic rule (which is why
    it is important that we get the "dot" issue straightened
    out), or

    (iii) contains RtL characters but violates a bidi rule,

then you need to assume it is valid and look it up.  Because of
the IDNA2003-compatibility part of the preprocessing issue,
things are a little more complex than the above -- you
essentially have to apply IDNA2003-like string preparation to
the domain name before applying those tests and may need to
convert IRIs to URIs -- but the basic principles are the same.
And, obviously, as the network evolves to only using
IDNA200X-conformant U-labels, A-labels, and LDH-labels rather
than strings that one hopes will map into them, the easier it
gets to make those tests that can be made.
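
The three tests above can be sketched roughly as follows.  This is a
minimal illustration only, not the protocol: the NEVER set, the
function names, and the syntactic test are all hypothetical
placeholders, and the bidi test borrows the IDNA2003-era rule from
RFC 3454 section 6 (via Python's stringprep tables) as a stand-in for
whatever bidi rule is finally adopted.  The label is assumed to have
been through preprocessing already.

```python
import stringprep

# Hypothetical stand-in for an implementation's current NEVER table.
NEVER = frozenset({"\u2665"})  # a symbol, used purely for illustration


def violates_syntax(label: str) -> bool:
    """Simplified placeholder for test (ii); the real rules differ."""
    return (label == "" or "." in label
            or label.startswith("-") or label.endswith("-"))


def violates_bidi(label: str) -> bool:
    """Test (iii), using the RFC 3454 section 6 rule as a stand-in."""
    if not any(stringprep.in_table_d1(ch) for ch in label):
        return False  # no RtL characters, so nothing to check
    has_ltr = any(stringprep.in_table_d2(ch) for ch in label)
    ends_rtl = (stringprep.in_table_d1(label[0])
                and stringprep.in_table_d1(label[-1]))
    return has_ltr or not ends_rtl


def should_look_up(label: str) -> bool:
    """Reject only on tests (i)-(iii); otherwise assume valid."""
    if any(ch in NEVER for ch in label):   # test (i)
        return False
    if violates_syntax(label):             # test (ii)
        return False
    if violates_bidi(label):               # test (iii)
        return False
    return True
```

Note that nothing in the sketch distinguishes ALWAYS from MAYBE:
anything that survives the three tests goes to the DNS.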

> The operational difference between MAYBE and ALWAYS is a
> problem.

Only to a registrar or registry.   It may also be an issue to an
application author who wants to issue cautionary notices (a
situation I don't believe Google is in, and one that may not be
advisable for anyone).  Otherwise, there is no operational
difference.  I thought that was clear in issues-05 (and even -03
and -04) but will go back and look at the text again.

> Let's take a look at document authoring. If HTML document
> authors/generators use hyperlinks with IRIs that contain
> characters in the MAYBE category (since registries are allowed
> to register them, even if it's not recommended), then those
> links would break if any of those characters became NEVERs
> (assuming that browsers obey the rule that NEVERs must not be
> looked up). That makes perfectly conformant pages suddenly
> become non-conformant.

I'd dispute "suddenly", but let's go on.

> The structure right now in tables/protocol gives user-agents
> (including browsers, but also search engines like Google's) a
> choice of the frying pan or the fire:
>
>    1. If we want stability, only accept ALWAYS. That's
>       untenable, since we couldn't handle most of the world.
>    2. If we want to serve our customers, accept ALWAYS+MAYBE.
>       That is unstable, since MAYBE could change to NEVER at
>       any time.

Let's look at this a different way, keeping in mind that URLs 
become invalid all the time, for all sorts of reasons.  The most 
common reason is that the relevant page is moved or simply 
disappears.  So, absent a location that has made a commitment to 
stability of reference, and has whatever it takes to back that 
up, nothing you do really gives you "stability" in the 
permanent, archival, sense.   If I were such a location, I'd 
stick to ALWAYS (or possibly not use IDNs at all) for my URLs. 
Note that this has nothing to do with what Google or the 
browsers do, but with decisions on the part of the 
content-owner.

What you do is to follow the spec, which says that, on lookup, 
you treat ALWAYS and MAYBE in exactly the same way.

Now, the one precaution I would take if I were a browser vendor 
(and maybe if I were Google, but I'm a little less sure about 
that) would be to adopt a slow-transition model on migration 
from MAYBE to NEVER.  There will not be many such migrations 
since most of what will end up in NEVER will get there because 
of Unicode properties that so classify it the very first time it 
is coded (with symbols being the obvious example here).  But, 
when such a migration does occur, it might be sensible for you 
to go slow about enforcing the newly-applied NEVER category and 
rule, perhaps waiting some months, or even a year or two, before 
making the exclusion.    While the time scale is longer, the 
analogy between this and the "slow and soft update" model used 
by the DNS should be obvious -- there are really no facilities 
to be sure that all cached versions of a record are updated 
simultaneously and systems that rely on the DNS (the possibility 
of a zero TTL with dynamic update notwithstanding) had better 
not be designed on the assumption of instant synchronization (or 
records that are always perfectly consistent across all 
authoritative and non-authoritative servers, which is the same 
condition in a different form).
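
A slow-transition policy of this kind might look something like the
sketch below.  Everything in it is an assumption made for
illustration: the table of migrated characters, the choice of Final
Sigma (U+03C2) as the example, the dates, and the one-year grace
period are all hypothetical and do not reflect any actual
classification.

```python
from datetime import date, timedelta

# Hypothetical: code points reclassified from MAYBE to NEVER, with the
# date of the table revision that moved them.  Not a real classification.
MOVED_TO_NEVER = {"\u03c2": date(2009, 1, 1)}

# Hypothetical grace period before the new NEVER status is enforced.
GRACE = timedelta(days=365)


def enforce_never(ch: str, today: date) -> bool:
    """True once the grace period for a migrated character has expired."""
    moved = MOVED_TO_NEVER.get(ch)
    return moved is not None and today - moved >= GRACE


def label_blocked(label: str, today: date) -> bool:
    """True if any migrated character in the label is now enforced."""
    return any(enforce_never(ch, today) for ch in label)
```

Characters that were NEVER from the day they were coded would be
handled by the ordinary NEVER table and are not covered here.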

> Programs always have to be prepared for characters becoming
> valid that were not in previous versions. With any new version
> of Unicode, a company like Apple, Google, or Microsoft updates
> its software to use that version, and characters become
> acceptable that were not previously. Everyone who deals with
> Unicode needs to be prepared for that. (This situation is a
> bit like language/country codes, where new ones arrive on our
> doorstep -- but mechanisms like BCP 47 ensure that the old
> ones never become invalid.) The key requirement for stability
> is that characters that were acceptable in IDNAbis don't
> suddenly become invalid. If a character (or script) moves from
> (MAYBE + NEVER) into ALWAYS in the future, it is not a problem
> for implementers. Moving from (MAYBE + ALWAYS) to NEVER is a
> serious problem.

It really isn't, especially at the end of the system that you 
cite.  It is more serious for registries and registrars, but 
there are bits of "if you need a guarantee of stability and 
global accessibility, don't use labels containing MAYBE 
characters in your URL" and even some "registrant beware" 
properties in this.
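
For anyone who wants to see which labels are exposed to this, the
transition under discussion is easy to detect mechanically when a new
table version arrives.  The sketch below assumes, hypothetically, that
each table version is available as a mapping from code point to one of
the strings "ALWAYS", "MAYBE", or "NEVER"; that representation and the
function name are illustrative, not anything defined in the drafts.

```python
def regressions(old: dict, new: dict) -> list:
    """Code points that were ALWAYS or MAYBE in `old` but NEVER in `new`."""
    acceptable = {"ALWAYS", "MAYBE"}
    return sorted(
        cp for cp, category in new.items()
        if category == "NEVER" and old.get(cp) in acceptable
    )


# Toy tables, purely illustrative.
old_table = {0x0061: "ALWAYS", 0x03C2: "MAYBE", 0x2665: "NEVER"}
new_table = {0x0061: "ALWAYS", 0x03C2: "NEVER", 0x2665: "NEVER"}

moved = regressions(old_table, new_table)  # [0x03C2]
```

Movement in the other direction (into ALWAYS) is deliberately not
reported, since nobody is arguing that it causes a problem.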

> My recommendation is and has been: permit all characters from
> all modern scripts. Those are easily identified, and do not
> disadvantage any modern language group. It does not require an
> elaborate -- and probably unworkable -- process for getting
> buy-in. It would be acceptable to have historic scripts in
> MAYBE, on the off chance that there is a successful revival,
> because it doesn't put us in the frying-pan-or-fire position
> above, since all modern scripts would be allowed.

The difficulty, as I tried to explain above, isn't whether to 
permit the _script_ or not.  It is that "all characters" part 
when the communities of users of languages that depend on the 
script are telling us that some characters are problematic, 
either in the absence of additional conditions or entirely.

> This protocol is the wrong place to be making fine-grained
> linguistic determinations in any event. Restrictions can be
> imposed by registries or other parties, and user-agents where
> needed. Such restrictions are an exceedingly small problem
> compared to handling the issues raised by spoofs like
> "paypal.com", and pale in comparison.

First, the issue isn't fine-grained linguistic determinations. 
It is, first, characters that made perfectly good sense to put 
into Unicode (because they are used with other characters of the 
script) but may not make sense to have in IDNs because they are 
problematic within the languages of that script.  Second, it 
arises when we have the rather difficult situation of the 
community of users of a language --sometimes with the sanction 
of a recognized authority such as a language academy or 
institute-- claiming that Unicode's handling of the script, and 
the conventions required, are simply unsuitable for DNS use 
(whether it may be suitable, with appropriate adjustments, for 
other uses or not).

This is not an attempt to prevent spoofing or phishing.  If it 
has any effect on either, that is an attractive side-effect.

> If this approach is argued against, it should be with concrete
> examples that can be reviewed and assessed. And the bad cases
> have to be sufficient in number to warrant the complexity of
> the ALWAYS, MAYBE, NEVER process.

The NEVER part of the process is amply justified by the need to 
be sure that seriously problematic characters do not resolve (or 
do not resolve regularly and reliably) because that is the only 
way to prevent rogue registries from registering them.

And the ALWAYS / MAYBE cases, as outlined above, involve whole 
scripts, at least temporarily, not just a few characters.

But there is another issue here, and maybe it lies at the root 
of our disagreement.  You, and others close to the Unicode 
Consortium, have tended to make decisions on the assumption that 
you know all that one would ever need to know about a character 
or script at the time it is assigned a codepoint.   With regard 
to the form and common use of the character itself, I assume you 
are always (or at least almost always) correct despite 
occasional loud claims from a few language and script 
communities to the contrary.  For more complex properties, you 
have very carefully-worded statements about, and definitions of, 
stability that give you some "wiggle room" if things are not 
quite right the first time.

Our situation is a little different.  The first of our problems 
is illustrated by the chapter of TUS that discusses why 
language information isn't needed as often as people would 
expect (my apologies but I'm away from home and don't have it in 
front of me).  IDNs don't obey the constraints nor provide the 
hints that material suggests: labels are almost always short 
relative to sentences or any more significant body of text and 
do not have the orthographic constraints associated with "words" 
in any language (consider embedded digits, strange case mixtures 
(for ASCII labels), final-form characters in the middle of 
strings (quite plausible for IDNs formed by simple catenation 
of words), and so on).  So we are in a situation in which "is this 
ok?" or its opposite "will this cause a mess?" may be fairly 
IDN-specific and tied to IDN use of characters.    Again, this 
is not about subtle linguistic distinctions or spoofing.   We 
have also dealt with some of the relevant oddities even in 
IDNA2003.  For example, our friend Sharp S (U+00DF) is mapped to 
"ss" as a specific Nameprep action, not as a consequence of NFKC 
application.  For IDNA200X, we can

    * Treat it as a special case and map it to "ss" in the
    protocol

    * Make appropriate recommendations for preprocessing and
    then treat it in tables just as it would have been treated
    if it had an NFKC mapping to something else, i.e., classify
    it as NEVER.

    * Reverse the IDNA2003 decision and treat it as an ordinary
    character, for which there is probably a strong argument
    today and for which there would be a much stronger argument
    should the proposed upper case form ever enter Unicode.

For the specific case of Eszett, that third option is probably 
impractical for compatibility reasons.
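
The IDNA2003 behavior described above for Sharp S can be observed
with Python's standard library, whose "idna" codec and stringprep
module implement the IDNA2003-era specifications (RFC 3490 and
RFC 3454).  This is only a demonstration of the existing behavior,
not a proposal for IDNA200X; the variable names are illustrative.

```python
import stringprep
import unicodedata

SHARP_S = "\u00df"

# NFKC alone leaves Sharp S untouched: it has no compatibility mapping.
nfkc = unicodedata.normalize("NFKC", SHARP_S)

# The stringprep case-folding table (B.2) used by Nameprep maps it to "ss".
folded = stringprep.map_table_b2(SHARP_S)

# The same mapping is visible end to end in the IDNA2003 codec.
label = "stra\u00dfe".encode("idna")

print(nfkc == SHARP_S)  # True
print(folded)           # ss
print(label)            # b'strasse'
```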

However, similar conditions arise for final-form characters (in 
European scripts and elsewhere) and their relationships to the 
base characters.  Unicode properties are of no help with this 
since, in ordinary textual applications, there is no question 
but that they are separate characters that should have been 
assigned their own code points (and not either compatibility 
characters for the base forms or something tricky).  But IDNs 
may be different, and that may specifically justify MAYBE until 
the relevant community figures out what they want, followed by 
migration to NEVER.

This case, incidentally, illustrates my point above about the 
problem (if any) being one for the registrant and not for Google 
or the browser implementers.  It is very easy to explain that, 
if one puts a final-form character in the middle of a label, or 
relies on a final-form character being different from the 
corresponding base one, one should keep one's expectations 
about stability very minimal... and that is true whether the 
long-term solution is prohibition, some sort of mapping 
activity, a character variant (JET-like) approach at 
registration time, or something else.

     best,
       john