Table-building

Kenneth Whistler kenw at sybase.com
Fri Feb 2 01:34:30 CET 2007


> --On Thursday, 01 February, 2007 13:49 -0800 Erik van der Poel
> <erikv at google.com> wrote:
> 
> >> It is clearly stupid of any registry to allow the
> >> registration of such characters, given that the property MAY
> >> end up false. It is equally stupid of any application
> >> developer to deny the attempt to lookup such characters,
> >> given that the property MAY end up true.
> > 
> > Somewhat related to this, MSIE7 currently does not allow the
> > lookup of
> > a URL containing U+03F7 or U+03F8. Firefox 1.5, on the other
> > hand,
> > will cheerfully lookup xn--mza or xn--nza, respectively, even
> > though
> > U+03F7 has a lower-case mapping to U+03F8 in Unicode 4.0.
> > 
> > I think I prefer MSIE7's behavior.
> 
> This is _exactly_ the reason why we are arguing that unassigned
> code points should not be looked up -- thanks for the excellent
> example.  (To save people the checking I just did, these two
> code points are unassigned in Unicode 3.2, the relevant version
> for the current IDNA standard, but were added in 4.0). 

And for reference:

Unicode 3.2:

   NFKC(U+03F7) = U+03F7
   NFKC(U+03F8) = U+03F8
   casefold(U+03F7) = U+03F7
   casefold(U+03F8) = U+03F8
   
Unicode 4.0 (and subsequent):

   NFKC(U+03F7) = U+03F7
   NFKC(U+03F8) = U+03F8
   casefold(U+03F7) = U+03F8  <-- that changes
   casefold(U+03F8) = U+03F8
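For anyone who wants to check these values, here is a quick sketch
(my addition, not part of the original exchange) using Python 3,
whose unicodedata module tracks a post-4.0 UCD, so it shows the
Unicode 4.0+ behavior rather than the 3.2 behavior:

```python
import unicodedata

SHO_UPPER = "\u03f7"  # GREEK CAPITAL LETTER SHO, added in Unicode 4.0
SHO_LOWER = "\u03f8"  # GREEK SMALL LETTER SHO, added in Unicode 4.0

# NFKC leaves both code points alone, in 3.2 and in 4.0+:
assert unicodedata.normalize("NFKC", SHO_UPPER) == SHO_UPPER
assert unicodedata.normalize("NFKC", SHO_LOWER) == SHO_LOWER

# Since Unicode 4.0, the capital case-folds to the small letter:
assert SHO_UPPER.casefold() == SHO_LOWER
assert SHO_LOWER.casefold() == SHO_LOWER

# These are also the characters behind the xn--mza / xn--nza labels
# mentioned above: "xn--" plus the Punycode encoding of each.
assert SHO_UPPER.encode("punycode") == b"mza"
assert SHO_LOWER.encode("punycode") == b"nza"
```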
  
> 
> Let's walk through this:
> 
> * The IDNA2003 requirement is that putative labels containing
> unassigned code points are looked up.  So Firefox is behaving
> according to the standard and MSIE7 is not in conformance.  
> 
> * However, were we to upgrade IDNA2003 to the current version of
> Unicode without making any other changes, U+03E7 would become
> invalid for actual lookup because we would presumably expect it
> to be mapped to U+03E8 in Stringprep's case-mapping function.

U+03F7 and U+03F8 (and similar typos below), but yes.

> 
> Now let's assume that a registry, following IDNA2003bis,
> registers a label containing U+03E8.  Assume a user then types
> in a domain name that contains U+03E7 to her favorite browser.
> We then have:
> 
> MSIE7.0: will not look it up and resolve it.  Presumably, it
> will tell the user that the label is invalid, _not_ that it is
> not found.  That distinction is very important.   Armed with the
> knowledge (perhaps after a discussion with the owner of the
> name) the user will start clamoring for MSIE7.1.
> 
> MSIE7.1, which was upgraded to IDNA2003bis, will map U+03E8 to
> U+03E7, which will be looked up successfully.

Actually, the reverse: it will map U+03F7 to U+03F8, but otherwise, yes.

> 
> Firefox 1.5, which is presumably stuck forever on IDNA2003 and
> Unicode 3.2, will look up the Punycode-converted version of the
> label containing U+03E8.  It will get an authoritative "not
> found" since only the label containing U+03E7 is in the DNS.
> That false negative is _very_ bad news.

I don't see why. Undesirable, but hardly catastrophic.

> 
> Firefox 2.something, which was upgraded to IDNA2003bis, will
> work exactly the way MSIE7.1 works, which I think is the desired
> behavior.
> 
> My conclusion: one dare not look up a character whose status is
> "unassigned" in whatever version of Unicode underpins the
> libraries one is using.   One can't know if such a character
> will turn out to case-map to something else. 

Correct. But if you only allow lowercase in the DNS, that is
a nonissue.

Situation *before* the uppercase of the character pair Cu/Cl is added
to the standard:

   Cu cannot legally exist in a registry.
   Cl cannot legally exist in a registry.
   
   Attempt to resolve Cu will pass Cu to the registry, and
   the registry will say, "not here, boss".
   
Situation *after* the uppercase of the pair of Cu/Cl is added
to the standard:

   Cu cannot legally exist in a registry.
   Cl can legally exist in a registry.
   
   Attempt to resolve Cu in non-updated resolver will pass
   Cu to the registry, and the registry will say, "not here,
   boss".
   
   Updated application will casefold Cu -> Cl. Attempt to resolve
   Cl in non-updated resolver will pass Cl to the registry,
   and the registry will say, "here it is."
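The after-the-update situation can be spelled out as a toy model
(my sketch, not from the thread; the registry dict, the function
names, and the address are all hypothetical), using U+03F7/U+03F8
as the concrete Cu/Cl pair:

```python
# Only the lowercase Cl (U+03F8, Punycode "nza") is registered:
REGISTRY = {"\u03f8".encode("punycode").decode(): "192.0.2.1"}

def lookup_old(label: str):
    """Unupdated resolver: does not case-fold the (then-unassigned) pair,
    so Cu is passed through unchanged."""
    return REGISTRY.get(label.encode("punycode").decode())

def lookup_new(label: str):
    """Updated resolver: case-folds first, so Cu maps to Cl."""
    return REGISTRY.get(label.casefold().encode("punycode").decode())

print(lookup_old("\u03f7"))  # None: "not here, boss" (false negative)
print(lookup_new("\u03f7"))  # 192.0.2.1: found after Cu -> Cl folding
```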

Yes, there is a transitional stage during which labels using
newly encoded characters can end up in a registry that an
unupdated application cannot find. But expecting to set
this up so that an unupdated application can *always* find
any future encoded character in a registry is a
will-o'-the-wisp, I think. You could only enable that by
imposing impossible conditions on any future character
encodings in Unicode and 10646.

The important thing is that this scenario updates gracefully,
and the worst condition is some false negatives for unupdated
applications, rather than false *positive* matches, which
*would* be catastrophic.

Also, if the false negatives of this sort are constrained
to potential future case pairs, then the actual impact
on DNS is microscopic. All important case pairs for the
bicameral scripts were added years and years ago.

You do realize, of course, that U+03F7 and U+03F8 are
encoded for *Bactrian*, for gawdsake, which went extinct
by the 9th century.

We could eliminate all issues of this sort by hunting and
prowling through Latin, Greek, and Cyrillic to do a search
and destroy on historic letters -- it just seems like a
waste of time to have to argue historic letters one by
one for DNS, when it won't make any difference to systems upgrading
to Unicode 5.0, anyway.   
   
> One also can't
> know that it won't have an NFKC mapping to something else

That is a nonissue if cp != NFKC(cp) characters are precluded
from the inclusion table in the first place. Don't know how
many times I have to say that.
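To make that concrete (my sketch, with a hypothetical helper name):
the cp != NFKC(cp) precondition is mechanically checkable against
the UCD, e.g. with Python's unicodedata:

```python
import unicodedata

def nfkc_stable(cp: int) -> bool:
    """True if the code point normalizes to itself under NFKC."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) == ch

# The Sho pair is NFKC-stable, so it passes this screen:
assert nfkc_stable(0x03F7) and nfkc_stable(0x03F8)

# A compatibility character like U+00BD VULGAR FRACTION ONE HALF
# (NFKC maps it to "1" + FRACTION SLASH + "2") would be screened out:
assert not nfkc_stable(0x00BD)
```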

> or
> properties that require special handling prior to lookup.

That's fearing what you don't know, I think. Give me a
specific example where it makes a difference.

ZWJ and ZWNJ we already know about, and are proposed to
be *in* the inclusions table, but certain strings would
be disallowed containing them. Adding new characters won't
change that.

Adding a new combining character would result in disallowing
it as the initial character in a string to be resolved. But
then it couldn't have been in a registry before it was
added, anyway.
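That rule, too, is trivially checkable (again my sketch, with a
hypothetical function name), since combining marks are identifiable
by their General_Category:

```python
import unicodedata

def starts_with_combining_mark(label: str) -> bool:
    """True if the label's first character is a combining mark
    (General_Category Mn, Mc, or Me)."""
    return bool(label) and unicodedata.category(label[0]).startswith("M")

assert starts_with_combining_mark("\u0301abc")   # leading COMBINING ACUTE
assert not starts_with_combining_mark("abc")
```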

Adding a new Arabic character would create a new character
that could be added to a registry, and which would be
constrained to occur only in RtoL runs for domains. But
then it would be encoded at a code point that already
defaults to bc=AL for its bidirectional property, so
even if you were testing an unassigned code point in a
string, you'd end up with the same results for the bidirectional
contextual test.
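One caveat for anyone verifying this from script (my aside, not in
the thread): the bc=AL value is visible for assigned Arabic letters
via Python's unicodedata, but the *default* AL for unassigned code
points in the Arabic ranges comes from DerivedBidiClass.txt in the
UCD, which that API does not expose:

```python
import unicodedata

# Assigned Arabic letters report bc=AL directly:
assert unicodedata.bidirectional("\u0627") == "AL"  # U+0627 ARABIC LETTER ALEF

# Unassigned code points report "" from this API; their default
# bc=AL within the Arabic ranges is specified by DerivedBidiClass.txt.
```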

>   Some
> of those cases can be dealt with simply by saying "it won't be
> registered and therefore there is no problem" (assuming all
> registries follow the rules), but some, including case-mapping,
> cannot.

Some, including *only* case-folding, as far as I can tell.

There is one other false-negative match scenario, involving
a sequence of combining marks of different combining classes,
as Michel has pointed out, but that is even more marginal
than the case-folding one.
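For the record, here is why combining classes matter (an
illustration of mine): canonical ordering sorts a run of combining
marks by ascending combining class, so normalization can reorder
marks, and a newly encoded mark with a new class could change how
older sequences normalize:

```python
import unicodedata

acute = "\u0301"        # COMBINING ACUTE ACCENT, ccc = 230
grave_below = "\u0316"  # COMBINING GRAVE ACCENT BELOW, ccc = 220

assert unicodedata.combining(acute) == 230
assert unicodedata.combining(grave_below) == 220

# Under canonical decomposition, the lower-class mark sorts first,
# so the acute-then-grave-below input is reordered:
assert unicodedata.normalize("NFD", "q" + acute + grave_below) == \
       "q" + grave_below + acute
```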

>   And, in practice, even viewing most or all NFKC and
> case mappings as external to the protocol (watch for
> ...idnabis-issues-01 early next week) doesn't change this
> situation.

The situation is exaggerated, I'm afraid.

And if you think you can create a specification that will
allow asynchronous updates to registry rules, asynchronous
updates to resolvers, asynchronous updates to applications,
asynchronous updates to application libraries, and
at the same time:

   1. obtain no false negative matches ever
   2. obtain no false positive matches ever
   3. have no impossible constraints on future Unicode additions
   
then you have simply overconstrained the engineering
requirements, I believe, and have effectively mandated
the construction of a perpetual motion machine.

I think #2 is the most important, and *is* obtainable.

I think you can only obtain #1 by giving up #3, and trying
to impose impossible constraints on future Unicode additions
is, well, not to mince words, impossible.

The only alternative I see is to go back to the same problem
you already have in IDNA2003 -- give up adaptability to
future versions, and simply lock down IDNAbis to the Unicode 5.0
repertoire forever.

But if the *real* fear here is that locking down IDNA to some
given repertoire makes IETF a target for vocal complainers
who feel (for whatever reason) that a particular decision
shafted their language because it was left out, then you
have simply invited that problem back into your laps.

--Ken

> 
>     john
> 
> 
> 


