IDNAbis Main Open Issues
John C Klensin
klensin at jck.com
Sun Jan 20 06:22:26 CET 2008
--On Saturday, 19 January, 2008 11:33 -0800 Mark Davis
<mark.davis at icu-project.org> wrote:
> Here's what I think are the main issues.
I'm going to try to comment on a few of these in the hope of
making a bit more progress before you get on the plane (or at
least of putting any differences of opinion in sharper focus).
Please note that these are quick personal impressions only.
> (live document at
> 1. Settle on character repertoire. Basic formulation is ok,
> 1. Add extensions for stability
> 2. Don't be Eurocentric: ensure that modern scripts are
> in ALWAYS (or whatever it is called)
> 3. Resolve ALWAYS/MAYBE/NEVER problem (see below).
The explanation of this situation in issues-06 is much better
than it is in issues-05 (others will have to judge whether it is
adequate). However, the problem with ALWAYS is that it does not
necessarily cover all of a script, and its boundaries require
explicit, IDN-specific input from the users of that script (as
we have both noted earlier, that is a tough problem, but let's
isolate it a bit).
So, for example, we've got advice from the three important
pieces of the CJK community about characters that are considered
appropriate (by one or more of them) for use in IDNs (see
RFC3743, RFC4713, and the corresponding IANA table
registrations). So, because the community has told us what they
consider safe, appropriate, and sufficiently unambiguous for IDN
use, those characters go into ALWAYS. The rest of the CJK
characters (i.e., the characters of the Han and Han-derived
script) belong in MAYBE somewhere. Since they are legitimate
language characters and the community has not said "these are
evil" or the equivalent, they cannot be classified as NEVER and
should not be. Which of the MAYBE categories those additional
characters belong in is still something we are trying to sort
out (and hence another unresolved issue), but they certainly
don't go to ALWAYS.
I don't have an opinion about this and seek your advice but, to
the extent that there are particularly problematic characters in
Western European scripts, they might be kept out of ALWAYS for
the present, until more experience and advice from the affected
parties comes along. Such characters might include the notorious
dotless "i" and perhaps such difficult cases as Eszett and Final
Sigma, although backward-compatibility or other considerations
might dictate the handling of those characters.
Note that the model described above involves splitting up the
characters of scripts in ways that cannot be done by Unicode
properties alone, since judgments specific to IDN usability are
required.
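To make that concrete, such a classification amounts to a per-codepoint
table maintained from community input rather than derived from Unicode
properties. A minimal sketch, in which the table entries are purely
illustrative assumptions (not actual IDNAbis data):

```python
from enum import Enum

class Category(Enum):
    ALWAYS = "always"
    MAYBE = "maybe"
    NEVER = "never"

# Purely illustrative entries: real IDNAbis categories would come from
# IDN-specific community input (e.g. the JET registrations for CJK),
# not from Unicode properties alone.
CATEGORY_TABLE = {
    0x4E2D: Category.ALWAYS,   # a Han character in a registered repertoire
    0x3220: Category.NEVER,    # a parenthesized-ideograph symbol
}

def category(cp):
    # Characters with no explicit community judgment default to MAYBE
    # rather than NEVER, per the reasoning above.
    return CATEGORY_TABLE.get(cp, Category.MAYBE)
```

The design point is the default: absent an explicit judgment, a
legitimate language character lands in MAYBE, never in NEVER.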
> There are other smaller issues, wording of the text,
> continuing to make progress on BIDI, and so on, but I think
> the above are the chief remaining issues to get consensus on.
> In particular, it sounds like people would not be averse to
> having a separate preprocessing document, which we think is
> required, so as long as we do that I'm not including it here.
> I have a draft at Draft IDNA Preprocessing
> <http://docs.google.com/Doc?id=dfqr8rd5_51c3nrskcx> for
> discussion. Of course, that doesn't mean that we agree on the
> details for that yet!
There is considerable text on the preprocessing issue in
issues-06 (see my previous note) although I don't think nearly
as extensive as what you have done. I will try to study your
draft in the next few days and see if I can identify any
important differences of perspective.
> ALWAYS/MAYBE/NEVER problem: If we take the very strong approach
> that Patrik has currently, where only Latin, Cyrillic, and
> Greek are really guaranteed, then over 90 thousand characters
> are no longer guaranteed to be part of IDNA. I use the word
> "not guaranteed" specifically. In Google, for example, we want
> to be able to look at a URL and say that it is either
> compliant to IDNAbis or not. And we don't want its compliance
> to change from TRUE to FALSE according to browser, or in the
> future. It's ok for it to change from FALSE to TRUE in the
> future, but not the reverse.
I think we have a conceptual problem here even given your
explanation below. I'll try to address it in more detail
separately but, with a relatively constrained set of exceptions
(for IDNs, some are in IDNA2003 and the NEVER category obviously
makes a much longer list), the only way to determine whether a
domain name is valid is to look it up in the DNS. In the past,
local checks for validity caused us a lot of problems because
applications "knew" that top-level domain names came in only
three lengths -- two characters for ccTLDs, three characters for
gTLDs, and four characters for ARPA. That knowledge was used,
not just to reject putatively-invalid names but in all sorts of
heuristics to determine, e.g., the difference between an FQDN
and a somewhat more localized name.
So the presumption here is that NEVER means what it says but
that, unless a string
(i) contains a character that is NEVER in your current
implementation (obviously, characters cannot migrate out of
NEVER and into a valid category),
(ii) violates some fundamental syntactic rule (which is why it
is important that we get the "dot" issue straightened out), or
(iii) contains RtL characters but violates a bidi rule,
then you need to assume it is valid and look it up. Because of
the IDNA2003-compatibility part of the preprocessing issue,
things are a little more complex than the above -- you
essentially have to apply IDNA2003-like string preparation to
the domain name before applying those tests and may need to
convert IRIs to URIs -- but the basic principles are the same.
And, obviously, as the network evolves toward using only
IDNA200X-conformant U-labels, A-labels, and LDH-labels, rather
than strings that one hopes will map into them, the tests that
can be made locally become easier to make.
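The lookup-side logic above can be sketched as follows. This is a
hedged illustration only: the NEVER set and the bidi check are
stand-ins I invented for the sketch, not actual IDNAbis data or rules.

```python
import unicodedata

# Illustrative NEVER sample: one symbol-like code point.
NEVER = {0x3220}

def _contains_rtl(label):
    # RtL detection via Unicode bidirectional classes.
    return any(unicodedata.bidirectional(ch) in ("R", "AL", "AN")
               for ch in label)

def may_look_up(label):
    # (i) reject anything containing a NEVER character
    if any(ord(ch) in NEVER for ch in label):
        return False
    # (ii) fundamental syntax: labels are non-empty and contain no dots
    if not label or "." in label:
        return False
    # (iii) crude stand-in for the bidi rule: reject an RtL label that
    # ends with a digit (the real rule is more elaborate)
    if _contains_rtl(label) and label[-1].isdigit():
        return False
    # Otherwise assume validity and hand the label to DNS resolution.
    return True
```

Note what is absent: there is no affirmative "is this registered?"
check, because only the DNS itself can answer that.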
> The operational difference between MAYBE and ALWAYS is a
Only to a registrar or registry. It may also be an issue for an
application author who wants to issue cautionary notices (a
situation I don't believe Google is in, and one that may not be
advisable for anyone); otherwise, there is no operational
difference. I thought that was clear in issues-05 (and even in
-03 and -04) but will go back and look at the text again.
> Let's take a look at document authoring. If HTML document
> authors/generators use hyperlinks with IRIs that contain
> characters in the MAYBE category (since registries are allowed
> to register them, even if it's not recommended), then those
> links would break if any of those characters became NEVERs
> (assuming that browsers obey the rule that NEVERs must not be
> looked up). That makes perfectly conformant pages suddenly
> become non-conformant.
I'd dispute "suddenly", but let's go on.
> The structure right now in tables/protocol gives user-agents
> (including browsers, but also search engines like Google's) a
> choice of the frying pan or the fire:
> 1. If we want stability, only accept ALWAYS. That's
> untenable, since we couldn't handle most of the world.
> 2. If we want to serve our customers, accept ALWAYS+MAYBE.
> That is unstable, since MAYBE could change to NEVER at any
> time.
Let's look at this a different way, keeping in mind that URLs
become invalid all the time, for all sorts of reasons. The most
common reason is that the relevant page is moved or simply
disappears. So, absent a location that has made a commitment to
stability of reference, and has whatever it takes to back that
up, nothing you do really gives you "stability" in the
permanent, archival, sense. If I were such a location, I'd
stick to ALWAYS (or possibly not use IDNs at all) for my URLs.
Note that this has nothing to do with what Google or the
browsers do, but with decisions on the part of the registrant.
What you do is to follow the spec, which says that, on lookup,
you treat ALWAYS and MAYBE in exactly the same way.
Now, the one precaution I would take if I were a browser vendor
(and maybe if I were Google, but I'm a little less sure about
that) would be to adopt a slow-transition model on migration
from MAYBE to NEVER. There will not be many such migrations
since most of what will end up in NEVER will get there because
of Unicode properties that so classify it the very first time it
is coded (with symbols being the obvious example here). But,
when such a migration does occur, it might be sensible for you
to go slow about enforcing the newly-applied NEVER category and
rule, perhaps waiting some months, or even a year or two, before
making the exclusion. While the time scale is longer, the
analogy between this and the "slow and soft update" model used
by the DNS should be obvious -- there are really no facilities
to be sure that all cached versions of a record are updated
simultaneously and systems that rely on the DNS (the possibility
of a zero TTL with dynamic update notwithstanding) had better
not be designed on the assumption of instant synchronization (or
records that are always perfectly consistent across all
authoritative and non-authoritative servers, which is the same
condition in a different form).
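The slow-transition idea reduces to a small amount of bookkeeping: record
when a code point was reclassified and only enforce the exclusion after a
grace period. A hedged sketch, with a made-up table and an assumed
one-year period (nothing here comes from any specification):

```python
import time

GRACE_SECONDS = 365 * 24 * 3600    # assumed grace period: about a year

# Hypothetical record of when each code point was reclassified NEVER;
# the entry and timestamp below are invented for illustration.
NEVER_SINCE = {0x0131: 1_200_000_000.0}   # dotless i, made-up timestamp

def enforce_never(cp, now=None):
    """True once the NEVER exclusion should actually be enforced for cp."""
    if now is None:
        now = time.time()
    since = NEVER_SINCE.get(cp)
    if since is None:
        return False               # not classified NEVER at all
    return (now - since) >= GRACE_SECONDS
```

This mirrors the DNS caching analogy: enforcement converges over time
rather than flipping everywhere at once.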
> Programs always have to be prepared for characters becoming
> valid that were not in previous versions. With any new version
> of Unicode, a company like Apple, Google, or Microsoft updates
> its software to use that version, and characters become
> acceptable that were not previously. Everyone who deals with
> Unicode needs to be prepared for that. (This situation is a
> bit like language/country codes, where new ones arrive on our
> doorstep -- but mechanisms like BCP 47 ensure that the old
> ones never become invalid.) The key requirement for stability
> is that characters that were acceptable in IDNAbis don't
> suddenly become invalid. If a character (or script) moves from
> (MAYBE + NEVER) into ALWAYS in the future, it is not a problem
> for implementers. Moving from (MAYBE + ALWAYS) to NEVER is a
> serious problem.
It really isn't, especially at the end of the system that you
cite. It is more serious for registries and registrars, but
there are bits of "if you need a guarantee of stability and
global accessibility, don't use labels containing MAYBE
characters in your URL" and even some "registrant beware"
properties in this.
> My recommendation is and has been: permit all characters from
> all modern scripts. Those are easily identified, and do not
> disadvantage any modern language group. It does not require an
> elaborate -- and probably unworkable -- process for getting
> buy-in. It would be acceptable to have historic scripts in
> MAYBE, on the off chance that there is a successful revival,
> because it doesn't put us in the frying-pan-or-fire position
> above, since all modern scripts would be allowed.
The difficulty, as I tried to explain above, isn't whether to
permit the _script_ or not. It is that "all characters" part
when the communities of users of languages that depend on the
script are telling us that some characters are problematic,
either in the absence of additional conditions or entirely.
> This protocol is the wrong place to be making fine-grained
> linguistic determinations in any event. Restrictions can be
> imposed by registries or other parties, and user-agents where
> needed. Such restrictions are an exceedingly small problem
> compared to handling the issues raised by spoofs like
> "paypal.com", and pale in comparison.
First, the issue isn't fine-grained linguistic determinations.
It is, first, characters that made perfectly good sense to put
into Unicode (because they are used with other characters of the
script) but may not make sense to have in IDNs because they are
problematic within the languages of that script. Second, it
arises when we have the rather difficult situation of the
community of users of a language --sometimes with the sanction
of a recognized authority such as a language academy or
institute-- claiming that Unicode's handling of the script, and
the conventions required, are simply unsuitable for DNS use
(whether or not it may be suitable, with appropriate
adjustments, for other uses).
This is not an attempt to prevent spoofing or phishing. If it
has any effect on either, that is an attractive side-effect.
> If this approach is argued against, it should be with concrete
> examples that can be reviewed and assessed. And the bad cases
> have to be sufficient in number to warrant the complexity of
> the ALWAYS, MAYBE, NEVER process.
The NEVER part of the process is amply justified by the need to
be sure that seriously problematic characters do not resolve (or
do not resolve regularly and reliably) because that is the only
way to prevent rogue registries from registering them.
And the ALWAYS / MAYBE cases, as outlined above, involve whole
scripts, at least temporarily, not just a few characters.
But there is another issue here, and maybe it lies at the root
of our disagreement. You, and others close to the Unicode
Consortium, have tended to make decisions on the assumption that
you know all that one would ever need to know about a character
or script at the time it is assigned a codepoint. With regard
to the form and common use of the character itself, I assume you
are always (or at least almost always) correct despite
occasional loud claims from a few language and script
communities to the contrary. For more complex properties, you
have very carefully-worded statements about, and definitions of,
stability that give you some "wiggle room" if things are not
quite right the first time.
Our situation is a little different. The first of our problems
is illustrated by the chapter of TUS that discusses why
language information isn't needed as often as people would
expect (my apologies but I'm away from home and don't have it in
front of me). IDNs don't obey the constraints nor provide the
hints that material suggests: labels are almost always short
relative to sentences or any more significant body of text and
do not have the orthographic constraints associated with "words"
in any language: consider embedded digits, strange case
mixtures (for ASCII labels), final-form characters in the middle
of strings (quite plausible for IDNs formed by simple catenation
of words), and so on. So we are in a situation in which "is this
ok?" or its opposite "will this cause a mess?" may be fairly
IDN-specific and tied to IDN use of characters. Again, this
is not about subtle linguistic distinctions or spoofing. We
have also dealt with some of the relevant oddities even in
IDNA2003. For example, our friend Sharp S (U+00DF) is mapped to
"ss" as a specific Nameprep action, not as a consequence of NFKC
application. For IDNA200X, we can
* Treat it as a special case and map it to "ss" in the protocol
itself.
* Make appropriate recommendations for preprocessing and
then treat it in tables just as it would have been treated
if it had an NFKC mapping to something else, i.e., classify
it as NEVER.
* Reverse the IDNA2003 decision and treat it as an ordinary
character, for which there is probably a strong argument
today and for which there would be a much stronger argument
should the proposed upper case form ever enter Unicode.
For the specific case of Eszett, that third option is probably
impractical for compatibility reasons.
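The point that the "ss" mapping is a Nameprep action rather than an NFKC
consequence is easy to observe in Python, whose built-in "idna" codec
implements IDNA2003 (ToASCII with Nameprep):

```python
import unicodedata

# NFKC alone leaves U+00DF untouched: it has no compatibility
# decomposition, so the "ss" mapping cannot come from normalization.
assert unicodedata.normalize("NFKC", "stra\u00dfe") == "stra\u00dfe"

# Nameprep's case-folding step, by contrast, maps U+00DF to "ss",
# so IDNA2003 ToASCII yields an ordinary LDH label.
assert "stra\u00dfe".encode("idna") == b"strasse"
```

This is exactly why reversing the decision now raises compatibility
concerns: existing lookups of labels containing Eszett already resolve
to the "ss" form.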
However, similar conditions arise for final-form characters (in
European scripts and elsewhere) and their relationships to the
base characters. Unicode properties are of no help with this
since, in ordinary textual applications, there is no question
but that they are separate characters that should have been
assigned their own code points (and not either compatibility
characters for the base forms or something tricky). But IDNs
may be different, and that may specifically justify MAYBE until
the relevant community figures out what they want, followed by
migration to NEVER.
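Greek Final Sigma makes the final-form situation concrete: the final and
base forms are distinct code points that normalization does not merge,
while IDNA2003's Nameprep (Python's built-in "idna" codec) happens to
fold them together via its case mapping:

```python
import unicodedata

# U+03C2 (final sigma) and U+03C3 (sigma) are separate code points
# with no compatibility decomposition between them, so Unicode
# properties alone don't say whether IDN labels should distinguish them.
assert unicodedata.normalize("NFKC", "\u03c2") != "\u03c3"

# IDNA2003's Nameprep case-folds the final form to the base form,
# so labels differing only in that respect map to the same A-label.
assert "\u03bf\u03c2".encode("idna") == "\u03bf\u03c3".encode("idna")
```

Whether IDNA200X should preserve that folding, prohibit the final form,
or treat the two as distinct is precisely the kind of question MAYBE is
meant to hold open.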
This case, incidentally, illustrates my point above about the
problem (if any) being one for the registrant and not for Google
or the browser implementers. It is very easy to explain that,
if one puts a final-form character in the middle of a label, or
relies on a final-form character being different from the
corresponding base one, one should keep one's expectations
about stability very minimal... and that is true whether the
long-term solution is prohibition, some sort of mapping
activity, a character variant (JET-like) approach at
registration time, or something else.