Making progress on the mapping question

Tue Mar 31 23:20:22 CEST 2009

On 3/30/09 at 7:41 AM -0400, Vint Cerf wrote:

>There has not been any significant objection to the proposals made 
>during the IETF 74 meeting to apply some form of mapping during 
>lookup. The two questions outstanding are:
>
>1. what mapping function should be used?
>2. how should it be used?

I want to back up a step and take a look at this architecturally. I 
don't think the following conclusion is going to make anybody happy. 
However, nobody being happy is often a sign that one has gotten the 
answer correct. Of course, I'm happy to be argued out of it:

My position is that architecturally, doing *any* DNS lookup (whether 
a normal DNS lookup or a 2003 or 2008 IDN lookup) is a series of 5 
steps:

1. User input
2. Normalization
3. Syntactic validation
4. Encoding
5. Lookup

That is, the user inputs (via keyboard, voice, writing, or otherwise) 
a series of characters which are then normalized for whatever 
purposes may be desired (e.g., if in a handwriting input method a 
user wrote a lowercase "a" and then put an acute accent on it, or 
wrote them in the reverse order, the input method may rightfully 
normalize that to a lowercase-a-with-acute-accent). Only *after* 
input-method idiosyncrasies are dealt with by normalization 
(resulting in a string of characters) is syntactic validation done. 
In particular, it doesn't make any sense to do syntactic validation 
before normalization as you wouldn't be accounting for how the user 
happened to enter the text. Following syntactic validation, the 
string of characters is encoded in whatever the appropriate wire 
format for lookups might be, and then a DNS query is constructed to 
do the lookup.
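
To make the ordering concrete, here is a minimal Python sketch of steps 2 through 4. The function names are made up for illustration, NFC stands in for whatever normalization an input method might apply, and the validation rule is a deliberately crude placeholder rather than the IDNA2008 rules. Step 5 is left out because it needs a resolver.

```python
import unicodedata

def normalize(raw: str) -> str:
    # Step 2: fold input-method idiosyncrasies (NFC is an assumption here).
    return unicodedata.normalize("NFC", raw)

def is_valid(label: str) -> bool:
    # Step 3: placeholder rule, NOT the real IDNA2008 validation:
    # non-empty, letters/digits/hyphens only, no hyphen at either end.
    if not label:
        return False
    if label.startswith("-") or label.endswith("-"):
        return False
    return all(ch == "-" or ch.isalnum() for ch in label)

def encode(label: str) -> str:
    # Step 4: ASCII labels go out as-is; others get the ACE prefix + Punycode.
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

def prepare_label(raw: str) -> str:
    # Normalize first, validate second, encode last.
    label = normalize(raw)
    if not is_valid(label):
        raise ValueError(f"invalid label: {label!r}")
    return encode(label)
```

With the decomposed input "mu\u0308nchen" (a "u" followed by a combining diaeresis), is_valid() on the raw string fails because the combining mark is not a letter, whereas prepare_label() normalizes first and returns "xn--mnchen-3ya".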

Some of these steps are odd depending on whether we're talking about 
old DNS, IDNA2003, or IDNA2008. For instance, in good-old ASCII DNS 
lookup, it's unlikely that much in the way of normalization is done 
from the user input (though I'll bet some old Japanese terminal 
shells normalized double-wide-ASCII to 7-bit-ASCII), and although 
some applications validate strings to make sure they conform to LDH, 
others do not. And it's been a long time since anyone had to worry 
about encoding to US-ASCII from, say, EBCDIC.
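
As an illustration of the plain-ASCII case, the following Python sketch folds double-wide ASCII to 7-bit ASCII and then checks LDH conformance. Using NFKC for the folding is an assumption (it also folds other compatibility characters), and the function names are hypothetical.

```python
import re
import unicodedata

# LDH: letters, digits, hyphen; no leading or trailing hyphen; 1 to 63 characters.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def fold_fullwidth(text: str) -> str:
    # NFKC maps the fullwidth ASCII variants (U+FF01..U+FF5E) to plain ASCII.
    return unicodedata.normalize("NFKC", text)

def is_ldh_label(label: str) -> bool:
    # True only if the whole label matches the LDH pattern.
    return LDH_LABEL.fullmatch(label) is not None
```

A fullwidth rendering of "example" fails is_ldh_label() as typed but passes once fold_fullwidth() has been applied.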

IDNA2003 conflates a few of the steps, and IDNA2008 attempts (for the 
better, IMO) to clearly separate them. Some of the conflation we're 
going to continue to be stuck with: By their very nature, user input 
methods do some encoding (most hand their output over in UTF-8, which 
would have to be decoded to "see it as characters" if we were going 
to be serious about the architectural steps) and most input methods 
do some kinds of normalization whether you ask them to or not. 
However, we'd like to keep these steps as architecturally separate as 
we can. In particular, by design, IDNA2008 wants a strong line 
between 2 and 3.

As far as I'm concerned, IDNA2003's mapping step is simply a 
particular kind of normalization. Because of this, I feel perfectly 
comfortable doing an IDNA2003-like mapping as "user input 
normalization" in IDNA2008. Because of that, I also think it (like 
other normalizations) should be done prior to IDNA2008 validation. 
However, given that it is a normalization step and given that 
IDNA2008 does not conflate normalization with validation anymore, I'd 
prefer this to be something done outside of the IDNA2008 spec itself.
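
A rough Python sketch of what such an "IDNA2003-like mapping" could look like as a standalone pre-validation step. It leans on the RFC 3454 tables exposed by Python's stringprep module (B.1 "mapped to nothing", B.2 case folding) followed by NFKC. The function name is hypothetical, and this covers only the mapping part of Nameprep, not its prohibited-character or bidi checks.

```python
import stringprep
import unicodedata

def idna2003_style_map(label: str) -> str:
    mapped = []
    for ch in label:
        # Table B.1: characters that are simply dropped (e.g. soft hyphen).
        if stringprep.in_table_b1(ch):
            continue
        # Table B.2: case folding for use with NFKC.
        mapped.append(stringprep.map_table_b2(ch))
    # Assumption: current-Unicode NFKC rather than the Unicode 3.2
    # tables that IDNA2003 itself is tied to.
    return unicodedata.normalize("NFKC", "".join(mapped))
```

The output of this function is what would then be handed to validation; nothing in it needs to know what the validation rules are.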

And please note that *all* of this (input, normalization, validation) 
comes *before* encoding, which in our case involves encoding into 
UTF-8 and then further encoding in Punycode. All of the normalization 
and validation must act on characters, not encoded characters.
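
A small Python sketch of why that matters: two UTF-8 byte strings can differ octet for octet and still be the same label once decoded and normalized. NFC is again only an assumed normalization, and the function name is made up.

```python
import unicodedata

def normalize_wire_input(raw: bytes) -> str:
    # Decode first so that normalization sees characters, not octets.
    return unicodedata.normalize("NFC", raw.decode("utf-8"))

decomposed = "mu\u0308nchen".encode("utf-8")   # u + combining diaeresis
precomposed = "m\u00fcnchen".encode("utf-8")   # single u-with-diaeresis
```

Comparing decomposed and precomposed as bytes says they differ; comparing the results of normalize_wire_input() says they are the same string.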

Does this do anything for anyone?

pr
--
Pete Resnick <http://www.qualcomm.com/~presnick/>
Qualcomm Incorporated - Direct phone: (858)651-4478, Fax: (858)651-1102