Making progress on the mapping question

John C Klensin klensin at jck.com
Tue Mar 31 23:36:42 CEST 2009


Pete,

To answer your last question first, yes, I find it helpful...
and find evidence in some of the discussions during the last
week that it is important.  More below.

--On Tuesday, March 31, 2009 16:20 -0500 Pete Resnick
<presnick at qualcomm.com> wrote:

>...
> I want to back up a step and take a look at this
> architecturally. I don't think the following conclusion is
> going to make anybody happy.  However, nobody being happy is
> often a sign that one has gotten the answer correct. Of
> course, I'm happy to be argued out of it:
> 
> My position is that architecturally, doing *any* DNS lookup
> (whether a normal DNS lookup or a 2003 or 2008 IDN lookup) is
> a series of 5 steps:
> 
> 1. User input
> 2. Normalization
> 3. Syntactic validation
> 4. Encoding
> 5. Lookup
> 
> That is, the user inputs (via keyboard, voice, writing, or
> otherwise) a series of characters which are then normalized
> for whatever purposes may be desired (e.g., just because in a
> handwriting input method a user wrote a lowercase "a" and
> then put an acute accent on it, or wrote them in reverse, an
> input method may rightfully normalize that to a
> lowercase-a-with-acute-accent). Only *after* input-method
> idiosyncrasies are dealt with by normalization (resulting in
> a string of characters) is syntactic validation done.  In
> particular, it doesn't make any sense to do syntactic
> validation before normalization, as you wouldn't be accounting
> for how the user happened to enter the text. Following
> syntactic validation, the string of characters is encoded in
> whatever the appropriate wire format for lookups might be,
> and then a DNS query is constructed to do the lookup.
> 
> Some of these steps are odd depending on whether we're talking
> about old DNS, IDNA2003, or IDNA2008. For instance, in
> good-old ASCII DNS lookup, it's unlikely that much in the way
> of normalization is done from the user input (though I'll bet
> some old Japanese terminal shells normalized
> double-wide-ASCII to 7-bit-ASCII), and although some
> applications validate strings to make sure they conform to
> LDH, others do not. And it's been a long time since anyone
> had to worry about encoding to US-ASCII from, say, EBCDIC.
> 
> IDNA2003 conflates a few of the steps, and IDNA2008 attempts
> (for the better, IMO) to clearly separate them. Some of the
> conflation we're going to continue to be stuck with: By their
> very nature, user input methods do some encoding (most hand
> their output over in UTF-8, which would have to be decoded to
> "see it as characters" if we were going to be serious about
> the architectural steps) and most input methods do some kinds
> of normalization whether you ask them to or not.

Actually, some Unicode-based operating systems seem to keep
information internally in UTF-16 (and maybe UTF-32), even though
they export and import it in UTF-8.  That difference just
strengthens your point, of course.
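
To make the separation concrete, here is a minimal sketch of
those five steps as separate stages (Python; the helper names,
the plain LDH check for the all-ASCII case, and the choice of
NFC for the normalization step are all illustrative, not taken
from any of the specs):

import re
import socket
import unicodedata

LDH = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?$")

def normalize(s):
    # Step 2: undo input-method idiosyncrasies, e.g. compose
    # "a" + combining acute accent into one code point.
    return unicodedata.normalize("NFC", s)

def validate(label):
    # Step 3: syntactic validation on characters; here, the
    # plain letter-digit-hyphen rule of good-old ASCII DNS.
    if not LDH.match(label):
        raise ValueError("not an LDH label: %r" % label)
    return label

def encode(label):
    # Step 4: produce the wire form.  For an IDN label this
    # would be the ACE form rather than plain ASCII.
    return label.encode("ascii")

def lookup(name):
    # Step 5: the DNS query itself.
    return socket.getaddrinfo(name, None)

# Step 1 is the user typing "example.com" into something.
labels = [validate(normalize(lab)) for lab in "example.com".split(".")]
name = ".".join(encode(lab).decode("ascii") for lab in labels)
addrs = lookup(name)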

> However,
> we'd like to keep these steps as architecturally separate as
> we can. In particular, by design, IDNA2008 wants a strong line
> between steps 2 and 3.
> 
> As far as I'm concerned, IDNA2003's mapping step is simply a
> particular kind of normalization. Because of this, I feel
> perfectly comfortable doing an IDNA2003-like mapping as "user
> input normalization" in IDNA2008. Because of that, I also
> think it (like other normalizations) should be done prior to
> IDNA2008 validation.  However, given that it is a
> normalization step and given that IDNA2008 does not conflate
> normalization with validation anymore, I'd prefer this to be
> something done outside of the IDNA2008 spec itself.

Me too.  However, please note that the
NFKC(CaseFold(NFKC(string))) model that underlies
Nameprep/Stringprep mapping is a fairly drastic normalization,
one that some people might think loses significant information.
In particular, to the extent to which one wants to preserve
usability of some of the characters that process takes out
(Eszett comes to mind here) or to prohibit characters that
process might map out (e.g., some Jamo combinations), it is
appropriate for the base specification to impose some
restrictions on whatever normalization is performed, even if it
is performed outside the spec.
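
As a quick illustration of how much that model throws away
(a sketch using Python's unicodedata; str.casefold() only
approximates the Stringprep case-folding tables, so this is
not the exact Nameprep profile):

import unicodedata

def nameprep_like(s):
    # NFKC(CaseFold(NFKC(string))), approximately; the real
    # Nameprep/Stringprep profile uses its own mapping tables.
    folded = unicodedata.normalize("NFKC", s).casefold()
    return unicodedata.normalize("NFKC", folded)

print(nameprep_like("Straße"))  # 'strasse' -- Eszett is gone
print(nameprep_like("Ⅻ"))       # 'xii' -- compatibility form folded

There is no way to get the Eszett back from "strasse", which is
exactly the kind of information loss at issue.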

> And please note that *all* of this (input, normalization,
> validation) comes *before* encoding, which in our case
> involves encoding into UTF-8 and then further encoding in
> Punycode. All of the normalization and validation must act on
> characters, not encoded characters.

Actually, that is "encoding into an acceptable Unicode form" --
there is nothing about the Punycode conversion algorithm that
requires (or even favors) UTF-8 input.  But that is a nit at
best.
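
For instance, Python's punycode codec (just a convenient
illustration, nothing normative about it) converts directly
between code points and the encoded form, with no UTF-8 step
anywhere:

label = "bücher"
print(label.encode("punycode"))         # b'bcher-kva'
print(label.encode("idna"))             # b'xn--bcher-kva' (IDNA2003 ToASCII)
print(b"xn--bcher-kva".decode("idna"))  # 'bücher'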

      john




