Additional prefixes (was: Re: Final Sigma (was: RE: Esszett, Final Sigma, ZWJ and ZWNJ))

John C Klensin klensin at jck.com
Fri Feb 27 05:44:46 CET 2009



--On Friday, February 27, 2009 02:22 +0100 JFC Morfin
<jefsey at jefsey.com> wrote:

> John, just a minute. This is not TC37. This is IETF. We do not
> discuss languages. We care about typographic formats.

I am not sure I would agree that we care about typographic
formats.  We had better not care about them because Unicode
quite explicitly does not support typographic format differences
except via additional (and unspecified) metadata.

> We have a problem: how to revive on the end side a U-label we
> killed and buried in an A-label on the origin side. This is a
> negentropy issue. Either we transmit the U-label end to end
> and punycode it each time an A-label is needed, or we transmit
> it as an A-label together with the metainformation to properly
> reverse the punycoding.  The cases we know are limited to the
> way to restore or not some characters in some positions. Do
> you mean you can see more than 35 (x0-- to xz--) or 1295
> (x00-- to xzz--) different formats through variations in the
> punycode algorithm?

Well, yes, I can.  

There are two possible ways to handle the problem of recovering
the information that goes into an A-label encoding such that the
same U-label comes out that went in.  One is to restrict the
characters that can appear in a U-label so that what comes out
is exactly identical to what goes in.  That is the premise on
which IDNA2008 was constructed, but it is one that bans
upper-case characters (along with all Unicode compatibility
mappings, etc.), rather than worrying about what lower-case
forms they might assume.   The other is to use some sort of
encoding profile that preserves the context into which the
characters can be reverse-mapped.  
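
A minimal sketch of the round-trip problem described above. It assumes
Python's built-in "idna" codec, which implements the older IDNA2003 rules
(mapping characters before encoding) rather than IDNA2008; it is used here
only because its mappings make the information loss easy to see. The
function name round_trips is hypothetical.

```python
# Illustration only: the built-in "idna" codec follows IDNA2003 (RFC 3490
# with Nameprep), so it case-folds and maps characters before encoding.

def round_trips(u_label: str) -> bool:
    """Return True if label -> A-label -> label gives back the same string."""
    a_label = u_label.encode("idna")          # e.g. "bücher" -> b"xn--bcher-kva"
    return a_label.decode("idna") == u_label  # compare with what went in


# A label already in the restricted (lower-case, unmapped) form survives.
assert round_trips("bücher")

# An upper-case letter is folded away before encoding, so it cannot come back.
assert not round_trips("Bücher")

# Under IDNA2003 mapping, Eszett becomes "ss" and is likewise unrecoverable.
assert "straße".encode("idna") == b"strasse"
```

Restricting the input, as in the first approach, amounts to accepting only
labels for which such a check succeeds.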

If one were not worried about maximum label lengths, the latter
could be easily (and globally) accomplished by having a DNS
label encode, not just ACE(U-label) but some catenation of
ACE(U-label) with ACE(preferred-display-form).  Not rocket
science, just space-consuming and, for many applications,
totally unnecessary because the relevant context can take care
of the problem.  For example, consider
     <a href="A-label1.A-label2">preferred-display-form</a>

Of course, if not used with care and a complete absence of
malice, this could create security risks (one phisher trick uses
just this technique) and general user confusion.
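
The space cost of the catenation idea mentioned above can be sketched as
follows. This is an illustration under stated assumptions, not a proposed
format: the helper names (ace, catenated, fits) are hypothetical, the two
parts are simply joined with no delimiter (how a decoder would split them is
left open), and raw Punycode with an "xn--" prefix stands in for ACE(...).
The 63-octet limit on a DNS label comes from RFC 1035.

```python
MAX_LABEL_OCTETS = 63  # upper bound on a single DNS label (RFC 1035)


def ace(label: str) -> bytes:
    """ACE-style form of a label that contains at least one non-ASCII character.

    Raw Punycode keeps ASCII letters as written, so an upper-case display
    form survives this step (no IDNA mapping is applied here).
    """
    return b"xn--" + label.encode("punycode")


def catenated(u_label: str, display_form: str) -> bytes:
    """Hypothetical combined label: ACE(U-label) followed by ACE(display form)."""
    return ace(u_label) + ace(display_form)


def fits(u_label: str, display_form: str) -> bool:
    """True if the combined label still fits into one DNS label."""
    return len(catenated(u_label, display_form)) <= MAX_LABEL_OCTETS


# A short label has room for both forms.
assert fits("bücher", "Bücher")

# A longer label fits on its own but not together with its display form.
long_label = "a" * 40 + "ü"
assert len(ace(long_label)) <= MAX_LABEL_OCTETS
assert not fits(long_label, "A" * 40 + "ü")
```

Since the combined form is roughly twice as long as the A-label alone, only
labels under about half the usual maximum would remain usable.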

But, if one did try to do it on a locale-specific basis rather
than simply preserving the original, then, yes, I can easily see
a need for more than 35 locales (even if all labels with "x" and
"--" in the first, third, and fourth positions respectively were
candidates for an encoding trigger, which is not possible).  And
"x00--" would not be recognized as a reserved string type at all.

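The prefix arithmetic above can be checked with a short sketch. It assumes
that the figure of 35 comes from the 36 letters and digits available in the
second position, minus the "n" already taken by "xn--"; the function names
are hypothetical.

```python
import string


def has_reserved_hyphens(label: str) -> bool:
    """True if "--" occupies the third and fourth positions of the label."""
    return len(label) >= 4 and label[2:4] == "--"


def candidate_prefixes() -> list:
    """All "x?--" prefixes with ? in a-z or 0-9, leaving out "xn--"."""
    alphabet = string.ascii_lowercase + string.digits
    return [f"x{c}--" for c in alphabet if c != "n"]


# The existing ACE prefix has its hyphens in positions three and four.
assert has_reserved_hyphens("xn--bcher-kva")

# In a five-character prefix the hyphens sit in positions four and five,
# so a test on positions three and four does not match.
assert not has_reserved_hyphens("x00--abc")

# Single-character variations give at most 35 additional prefixes.
assert len(candidate_prefixes()) == 35
```
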
     john






More information about the Idna-update mailing list