Making progress on the mapping question
John C Klensin
klensin at jck.com
Tue Mar 31 23:36:42 CEST 2009
To answer your last question first, yes, I find it helpful...
and find evidence in some of the discussions during the last
week that it is important. More below.
--On Tuesday, March 31, 2009 16:20 -0500 Pete Resnick
<presnick at qualcomm.com> wrote:
> I want to back up a step and take a look at this
> architecturally. I don't think the following conclusion is
> going to make anybody happy. However, nobody being happy is
> often a sign that one has gotten the answer correct. Of
> course, I'm happy to be argued out of it:
> My position is that architecturally, doing *any* DNS lookup
> (whether a normal DNS lookup or a 2003 or 2008 IDN lookup) is
> a series of 5 steps:
> 1. User input
> 2. Normalization
> 3. Syntactic validation
> 4. Encoding
> 5. Lookup
> That is, the user inputs (via keyboard, voice, writing, or
> otherwise) a series of characters which are then normalized
> for whatever purposes may be desired (i.e., just because in a
> handwriting input method a user wrote a lowercase "a" and
> then put an acute accent on it, or wrote them in reverse, an
> input-method may rightfully normalize that to a
> lowercase-a-with-acute-accent). Only *after* input-method
> idiosyncrasies are dealt with by normalization (resulting in
> a string of characters) is syntactic validation done. In
> particular, it doesn't make any sense to do syntactic
> validation before normalization as you wouldn't be accounting
> for how the user happened to enter the text. Following
> syntactic validation, the string of characters is encoded
> into whatever the appropriate wire format for lookups might
> be, and then a DNS query is constructed to do the lookup.
> Some of these steps are odd depending on whether we're talking
> about old DNS, IDNA2003, or IDNA2008. For instance, in
> good-old ASCII DNS lookup, it's unlikely that much in the way
> of normalization is done from the user input (though I'll bet
> some old Japanese terminal shells normalized
> double-wide-ASCII to 7-bit-ASCII), and although some
> applications validate strings to make sure they conform to
> LDH, others do not. And it's been a long time since anyone
> had to worry about encoding to US-ASCII from, say, EBCDIC.
> IDNA2003 conflates a few of the steps, and IDNA2008 attempts
> (for the better, IMO) to clearly separate them. Some of the
> conflation we're going to continue to be stuck with: By their
> very nature, user input methods do some encoding (most hand
> their output over in UTF-8, which would have to be decoded to
> "see it as characters" if we were going to be serious about
> the architectural steps) and most input methods do some kinds
> of normalization whether you ask them to or not.
Actually, some Unicode-based operating systems seem to keep
information internally in UTF-16 (and maybe UTF-32), even though
they export and import it in UTF-8. That difference just
strengthens your point, of course.
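For concreteness, the five steps above can be sketched roughly as
follows. The specific choices here -- plain NFC for normalization,
a couple of LDH-style length and hyphen checks, and Python's
"punycode" codec for the ACE form -- are illustrative assumptions,
not the actual IDNA2003 or IDNA2008 rules:

```python
# A minimal sketch of the five-step pipeline (input, normalization,
# validation, encoding, lookup). The normalization, validation, and
# encoding choices are illustrative stand-ins, not the IDNA tables.
import unicodedata

def prepare_label(user_input: str) -> bytes:
    # Step 2: normalization -- plain NFC here, as a stand-in for
    # whatever input-method cleanup is appropriate.
    normalized = unicodedata.normalize("NFC", user_input)

    # Step 3: syntactic validation on *characters*, before encoding.
    if not normalized or len(normalized) > 63:
        raise ValueError("label length out of range")
    if normalized.startswith("-") or normalized.endswith("-"):
        raise ValueError("leading/trailing hyphen")

    # Step 4: encoding to the wire form -- the ACE prefix plus
    # Punycode for non-ASCII labels, plain ASCII otherwise.
    if normalized.isascii():
        return normalized.encode("ascii")
    return b"xn--" + normalized.encode("punycode")

# Step 5 (lookup) would hand the resulting bytes to a DNS resolver.
```

Note that the point about validation coming after normalization falls
out naturally: "a" followed by a combining acute and a precomposed
a-with-acute reach the validation step as the same string.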
> we'd like to keep these steps as architecturally separate as
> we can. In particular, by design, IDNA2008 wants a strong line
> between 2 and 3.
> As far as I'm concerned, IDNA2003's mapping step is simply a
> particular kind of normalization. Because of this, I feel
> perfectly comfortable doing an IDNA2003-like mapping as "user
> input normalization" in IDNA2008. Because of that, I also
> think it (like other normalizations) should be done prior to
> IDNA2008 validation. However, given that it is a
> normalization step and given that IDNA2008 does not conflate
> normalization with validation anymore, I'd prefer this to be
> something done outside of the IDNA2008 spec itself.
Me too. However, please note that the
NFKC(CaseFold(NFKC(string))) model that underlies
Nameprep/Stringprep mapping is a fairly drastic normalization,
one that some people might think loses significant information.
In particular, to the extent to which one wants to preserve
usability of some of the characters that process takes out
(Eszett comes to mind here) or to prohibit characters that
process might map out (e.g., some Jamo combinations), it is
appropriate for the base specification to impose some
restrictions on whatever normalization is performed, even if it
is performed outside the spec.
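A rough illustration of how drastic that model is (this approximates
the Stringprep mapping with Unicode case folding plus NFKC; the real
tables differ in details):

```python
# Distinct inputs collapse to the same output under an
# NFKC(CaseFold(NFKC(string)))-style mapping, so the mapping is
# lossy and cannot be reversed. This is an approximation using
# Python's casefold(), not the actual Stringprep tables.
import unicodedata

def nameprep_like(s: str) -> str:
    folded = unicodedata.normalize("NFKC", s).casefold()
    return unicodedata.normalize("NFKC", folded)

# Eszett is folded to "ss", so these two distinct names collapse:
print(nameprep_like("strasse"))      # strasse
print(nameprep_like("stra\u00dfe"))  # strasse -- the Eszett is gone

# Fullwidth (double-wide) ASCII is also mapped away by NFKC:
print(nameprep_like("\uff21\uff22\uff23"))  # abc
```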
> And please note that *all* of this (input, normalization,
> validation) comes *before* encoding, which in our case
> involved encoding into UTF-8 and then further encoding in
> punycode. All of the normalization and validation must act on
> characters, not encoded characters.
Actually, that is "encoding into an acceptable Unicode form" --
there is nothing about the Punycode conversion algorithm that
requires (or even favors) UTF-8 input. But that is a nit.
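To illustrate the point: Punycode operates on a sequence of code
points, not on any particular encoding form. In the sketch below
(using Python, where a str is already a sequence of code points),
the same label carried as UTF-8 or as UTF-16 bytes yields exactly
the same Punycode output:

```python
# Punycode depends only on the code points, not on the Unicode
# encoding form (UTF-8, UTF-16, ...) that happened to carry them.
label = "b\u00fccher"  # "bücher"

via_utf8 = label.encode("utf-8").decode("utf-8").encode("punycode")
via_utf16 = label.encode("utf-16-le").decode("utf-16-le").encode("punycode")

print(via_utf8)   # b'bcher-kva'
print(via_utf16)  # b'bcher-kva'
```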