Q3: What characters should be allowed in a revised IDNA2008 specification?

Thu Apr 2 03:01:30 CEST 2009

Mark

On Tue, Mar 31, 2009 at 22:59, John C Klensin <klensin at jck.com> wrote:
>
>
> --On Tuesday, March 31, 2009 12:14 -0400 Vint Cerf
> <vint at google.com> wrote:
>
>> IDNA2008 currently allows a more restricted set of characters
>> to be used in domain name labels than IDNA2003 does.
>>
>> Does the working group agree that the more restricted set of
>> the current IDNA2008 Tables document should apply once
>> IDNA2008 is adopted?
>
> I believe that this question has been answered in the
> affirmative several times and that, in the absence of strong new
> evidence or arguments, we should not need to revisit it again.

I agree, with the one caveat that we might end up tweaking Exceptions
or ContextO (the context tables are in bad shape, and we've never
reviewed them in any detail).

That is, the A-Label and U-Label should remain substantially as they
are. This is a different question than whether the M-Labels are a
non-empty set.

>
>> What should be done about registrations
>> that use characters that would not be allowed under
>> IDNA2008?
>
> These registrations have always violated established advice --
> from the IESG as well as from ICANN and others -- against
> registering labels containing characters that are not used to
> write the words of at least some language.

ICANN may have such advice, but that would be for top-level registries
only. I don't know the proportions, but I would not be at all
surprised if the majority of registered domain names are *not* handled
by top-level registries. And the IESG advice on registration is hardly
prominent.

> Long-term support
> for them simply encourages more such registrations, some of
> which can be problematic (punctuation or symbol characters that
> appear to be slashes or non-label-separating dots being the most
> obvious examples).

Once IDNA2008 is in place, conformant registries would disallow new
registrations with symbols such as the heart. And I'd expect the
existing ones to gradually wither away.

As I've said before, it is not a big problem to exclude most symbols
and punctuation. However, it does not help matters to keep overstating
the case for it. All good client software in this day and age would
warn against the small handful of such problematic symbols and
punctuation, as Gervase points out. And there are so many more
confusability problems remaining in IDNA2008 that removing that
handful of characters will have no appreciable benefit.

If you are going to keep repeating these statements, and expect them
to have any force (except among those who are not aware of the
magnitudes), at some point or other you need to supply some data to
back them up. There are demonstrably thousands of confusable
characters left in IDNA2008, and tens of thousands of characters with
possibly ambiguous names. So far, I've heard from you only a handful of
symbols/punctuation that are visually confusable (fraction slash), and
a handful of symbols/punctuation with possibly ambiguous names (hollow
heart vs black heart). Before you leap to conclusions again,
let's see some numbers to back those conclusions up.

>
>> Should there be a transitional period of finite
>> duration after which these registrations will become
>> invalid?
>
> The period in which IDNA2003 lookup implementations are
> gradually replaced by IDNA2008 ones should provide a more than
> adequate transition period without taking any special measures.
> Registries and zones that have created and installed such labels
> should certainly work out transition strategies, but the exact
> nature of those strategies is beyond the scope of this WG.

That is focusing on the registration side, and for that matter only on
the top levels. There are vastly more desktop applications and servers
that have to interoperate with one another. I know that you just don't
care about such interoperation, but it is important. And during some
period, it will be important for products such as Google search to be
able to support both implementations, until most of the client
software is revved. That can take years.

>
>> Should they be grandfathered somehow? If we believe
>> all future registrations should be restricted, how would such
>> grandfathered registrations be found if the IDNA2008 rules
>> would reject lookups of the disallowed characters?
>
> That is exactly the problem.  If these strings are grandfathered
> and guaranteed to be looked up, then we would effectively have
> to abandon substantially all lookup-time checking.  One could
> argue that even code points that were UNASSIGNED at the time of
> IDNA2003 (i.e., in Unicode 3.2) would have to be looked up
> because it is not clear that a registry installing a label that
> uses a code point that first appeared in, e.g., Unicode 4.0 is a
> more severe violation of the IDNA2003 standard and associated
> registration recommendations than simply installing a label that
> contains symbol or punctuation characters.

This is a red herring. There isn't any compelling reason at all for
any lookup implementation to support U3.2 UNASSIGNED characters, and
nobody has seriously suggested it. No conformant IDNA2003 registry
would support them, and the data is clear that the occurrence of such
URLs is in the noise (the same as random garbage data).

>
>> A two-lookup scheme might solve this problem:
>>
>> 1. lookup according to IDNA2008 rules (if disallowed
>> characters are present, go to step 2); if domain name record
>> is found, return the information. If not, go to step 2
>> 2. lookup according to IDNA2003 rules (permitting a broader
>> range of characters in the lookup process). If domain record
>> is found, return it, if not return "no such domain name"
>
> If continued for any length of time, this approach (which
> appears to be equivalent to the one I suggested in the Appendix
> to Protocol-11 without fully understanding its implications)
> would effectively redefine all characters that are present in
> Nameprep/Stringprep as PVALID, even if IDNA2008 had intended to
> DISALLOW them.

What was suggested in the appendix was indeed flawed, because it
doesn't distinguish between the Forbidden characters and the Mapped
characters. For the former (hearts and fraction slashes) we can have a
transitional stage, while the latter should be mapped indefinitely.
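
To make the distinction concrete, here is a rough sketch in Python. It is
only an illustration under stated assumptions: TRANSITIONAL_SYMBOLS,
map_label, lookup and the zone dictionary are hypothetical stand-ins, the
mapping is a toy (NFKC plus lowercasing, not the real Nameprep tables),
and a real implementation would query the DNS rather than a dictionary.

```python
import unicodedata

# Illustrative samples only (black heart suit, fraction slash); the real
# set would be derived from the IDNA2008 Tables document.
TRANSITIONAL_SYMBOLS = {"\u2665", "\u2044"}


def map_label(label):
    """Toy stand-in for IDNA2003-style mapping: NFKC plus lowercasing."""
    return unicodedata.normalize("NFKC", label).lower()


def lookup(label, zone, transitional=True):
    """Look up a label in `zone`, a dict standing in for a DNS query.

    Mapped characters are always mapped. Labels containing forbidden
    symbols are only looked up while `transitional` is True.
    """
    mapped = map_label(label)
    if set(mapped) & TRANSITIONAL_SYMBOLS and not transitional:
        return None  # after the transition: reject symbol labels outright
    return zone.get(mapped)  # None plays the role of "no such domain name"


zone = {"example": "192.0.2.1", "i\u2665ny": "192.0.2.2"}
print(lookup("EXAMPLE", zone))                         # mapping always applies
print(lookup("I\u2665NY", zone))                       # found during transition
print(lookup("I\u2665NY", zone, transitional=False))   # rejected afterwards
```

Turning off the transitional flag affects only the symbol labels; the
case and compatibility mappings keep working either way.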

>
> It seems to me that, if we are going to perform any sort of
> compatibility mapping, we will need to create a new
> Stringprep-like table by filtering out any mappings whose target
> is a DISALLOWED or CONTEXT character under the IDNA2008 rules
> and then use a table no larger than that one for the mapping
> function.
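
The filtering described above could look roughly like the following
Python sketch. Everything in it is an assumption made for illustration:
filter_mappings and toy_category are hypothetical names, the sample
mapping entries are illustrative rather than taken from the Stringprep
tables, and the category function is a crude letters-and-digits test,
not the IDNA2008 derivation.

```python
def toy_category(ch):
    """Hypothetical classifier: letters and digits are PVALID, all else DISALLOWED."""
    return "PVALID" if ch.isalnum() else "DISALLOWED"


def filter_mappings(mappings, category_of):
    """Keep only mappings whose target consists entirely of PVALID characters.

    A mapping to the empty string is kept, since it has no offending target.
    """
    return {
        src: tgt
        for src, tgt in mappings.items()
        if all(category_of(ch) == "PVALID" for ch in tgt)
    }


# Illustrative entries: uppercase A, fullwidth A, and ACCOUNT OF (U+2100),
# whose compatibility form contains a slash.
sample = {"A": "a", "\uff21": "a", "\u2100": "a/c"}
filtered = filter_mappings(sample, toy_category)
print(sorted(filtered))  # the ACCOUNT OF entry is dropped
```

The result is never larger than the input table, which matches the
"no larger than that one" constraint.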