Making progress on the mapping question
Pete Resnick
presnick at qualcomm.com
Tue Mar 31 23:20:22 CEST 2009
On 3/30/09 at 7:41 AM -0400, Vint Cerf wrote:
>There has not been any significant objection to the proposals made
>during the IETF 74 meeting to apply some form of mapping during
>lookup. The two questions outstanding are:
>
>1. what mapping function should be used?
>2. how should it be used
I want to back up a step and take a look at this architecturally. I
don't think the following conclusion is going to make anybody happy.
However, nobody being happy is often a sign that one has gotten the
answer correct. Of course, I'm happy to be argued out of it:
My position is that architecturally, doing *any* DNS lookup (whether
a normal DNS lookup or a 2003 or 2008 IDN lookup) is a series of 5
steps:
1. User input
2. Normalization
3. Syntactic validation
4. Encoding
5. Lookup
That is, the user inputs (via keyboard, voice, writing, or otherwise)
a series of characters which are then normalized for whatever
purposes may be desired (e.g., whether in a handwriting input
method a user wrote a lowercase "a" and then put an acute accent on
it, or wrote the two in reverse order, the input method may rightfully
normalize either sequence to a single lowercase-a-with-acute-accent). Only *after*
input-method idiosyncrasies are dealt with by normalization
(resulting in a string of characters) is syntactic validation done.
In particular, it doesn't make any sense to do syntactic validation
before normalization as you wouldn't be accounting for how the user
happened to enter the text. Following syntactic validation, the
string of characters is encoded into whatever the appropriate wire
format for lookups might be, and then a DNS query is constructed to
do the lookup.
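The five steps can be sketched in code. The following is a minimal Python illustration of the architecture, not any spec's algorithm: NFC stands in for whatever normalization the input method performs, the label check is a toy stand-in for real syntactic validation, and Python's built-in "idna" codec (an IDNA2003 implementation) supplies the wire encoding.

```python
import unicodedata

def lookup_name(user_input: str) -> bytes:
    # Step 1 is the user_input argument itself (keyboard, voice, ...).

    # Step 2: normalization. NFC here stands in for whatever the
    # input method does (e.g. composing "a" + combining acute into
    # a single lowercase-a-with-acute-accent).
    normalized = unicodedata.normalize("NFC", user_input)

    # Step 3: syntactic validation, on characters, *after*
    # normalization. A toy rule standing in for LDH/IDNA checks.
    for label in normalized.split("."):
        if not label or len(label) > 63:
            raise ValueError(f"bad label: {label!r}")

    # Step 4: encoding to wire format. Python's "idna" codec
    # (IDNA2003) produces the ACE form; plain ASCII passes through.
    wire = normalized.encode("idna")

    # Step 5: a DNS query would now be constructed from `wire`;
    # the actual lookup is omitted here.
    return wire
```

For example, lookup_name("bücher.example") yields the ACE form b"xn--bcher-kva.example", while plain ASCII names pass through unchanged.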
Some of these steps may seem odd depending on whether we're talking about
old DNS, IDNA2003, or IDNA2008. For instance, in good-old ASCII DNS
lookup, it's unlikely that much in the way of normalization is done
from the user input (though I'll bet some old Japanese terminal
shells normalized double-wide-ASCII to 7-bit-ASCII), and although
some applications validate strings to make sure they conform to LDH,
others do not. And it's been a long time since anyone had to worry
about encoding to US-ASCII from, say, EBCDIC.
IDNA2003 conflates a few of the steps, and IDNA2008 attempts (for the
better, IMO) to clearly separate them. Some of the conflation we're
going to continue to be stuck with: By their very nature, user input
methods do some encoding (most hand their output over in UTF-8, which
would have to be decoded to "see it as characters" if we were going
to be serious about the architectural steps) and most input methods
do some kinds of normalization whether you ask them to or not.
However, we'd like to keep these steps as architecturally separate as
we can. In particular, by design, IDNA2008 wants a strong line
between 2 and 3.
As far as I'm concerned, IDNA2003's mapping step is simply a
particular kind of normalization. Because of this, I feel perfectly
comfortable doing an IDNA2003-like mapping as "user input
normalization" in IDNA2008. Because of that, I also think it (like
other normalizations) should be done prior to IDNA2008 validation.
However, given that it is a normalization step and given that
IDNA2008 does not conflate normalization with validation anymore, I'd
prefer this to be something done outside of the IDNA2008 spec itself.
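To make that position concrete, here is a hypothetical sketch (mine, not anything from the drafts) of an IDNA2003-like mapping done purely as user-input normalization, ahead of and separate from strict validation. Casefold-plus-NFC stands in for the 2003 mapping tables, and the character check is a toy stand-in for the IDNA2008 protocol rules.

```python
import unicodedata

def map_then_validate(label: str) -> str:
    # IDNA2003-style mapping treated purely as input normalization:
    # case folding (so "Bücher" and "bücher" end up identical) plus
    # canonical composition, done before any validation.
    mapped = unicodedata.normalize("NFC", label.casefold())

    # Only afterwards does strict, IDNA2008-style validation run on
    # the normalized characters. A toy check stands in for the real
    # protocol-valid/disallowed tables.
    if any(ch.isspace() or unicodedata.category(ch).startswith("C")
           for ch in mapped):
        raise ValueError(f"disallowed character in {mapped!r}")
    return mapped
```

Because the mapping step lives entirely outside the validator, it could just as well be specified in a separate document.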
And please note that *all* of this (input, normalization, validation)
comes *before* encoding, which in our case involves encoding into
UTF-8 and then further encoding into Punycode. All of the normalization
and validation must act on characters, not on their encoded forms.
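A small Python demonstration of that ordering (assuming NFC as the normalization): two different input sequences for the same label agree at the character level after normalization, and only then does the Punycode/ACE encoding happen.

```python
import unicodedata

# Two input-method renderings of the same label: precomposed
# U+00E9 versus "e" + U+0301 (combining acute accent).
precomposed = "caf\u00e9"
combining = "cafe\u0301"

# Normalization acts on characters and makes the two identical...
norm1 = unicodedata.normalize("NFC", precomposed)
norm2 = unicodedata.normalize("NFC", combining)
assert norm1 == norm2  # both are now "café"

# ...and only afterwards is the string encoded for the wire:
# UTF-8 is what most input methods hand us; Punycode yields the
# ACE label that goes into the actual DNS query.
ace = "xn--" + norm1.encode("punycode").decode("ascii")
print(ace)  # -> xn--caf-dma
```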
Does this do anything for anyone?
pr
--
Pete Resnick <http://www.qualcomm.com/~presnick/>
Qualcomm Incorporated - Direct phone: (858)651-4478, Fax: (858)651-1102