exact match vs mapping

Erik van der Poel erikv at google.com
Tue Mar 31 00:09:58 CEST 2009


On Mon, Mar 30, 2009 at 2:00 PM, John C Klensin <klensin at jck.com> wrote:
> --On Tuesday, March 24, 2009 17:17 -0700 Erik van der Poel <erikv at google.com> wrote:
>> Thanks for the meetings. After the 2nd meeting, Pete Resnick
>> and I discussed a "layer" model that is probably already
>> familiar to most of us, but it raises an interesting question
>> that I thought I'd pose to the mailing list.
>>
>> We have often talked about protocol "stacks" where e.g. HTTP
>> sits on top of TCP, which sits on top of IP, and so on. In our
>> IDNA discussions, we have often talked about the HTML stack
>> and the email stack. If we take these stacks to their logical
>> extreme, they would include the human user at the top:
>>
>> human user
>> email app
>> message body
>> 822 header
>> SMTP envelope
>> TCP
>> IP
>>
>> This is a very rough description of the stack, and I realize
>> that SMTP goes back and forth between client and server, but
>> I hope you get the general idea. So far, my assumption has
>> been that SMTP extensions would probably want to use U-labels.
>> I have no idea what people are thinking for the 822 header.
>> (John?)
>
> The base SMTP and 822 protocols don't allow non-ASCII
> characters, so the answer is "A-labels".  The internationalized
> extension work is still experimental and hence very subject to
> change as far as the domain-part is concerned (the local-part
> follows precedent by being exact-match).

That makes sense, thanks.
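
To make the A-label/U-label distinction concrete, here is a minimal sketch using Python's built-in "idna" codec. Note the assumptions: that codec implements IDNA2003 (RFC 3490), not the IDNA2008 drafts under discussion, and the label "bücher" is only an illustrative example, not something taken from this thread.

```python
# U-label: the Unicode form a user reads and types.
u_label = "bücher"

# A-label: the ASCII-compatible "xn--" form that base SMTP and 822 can carry.
# The stdlib "idna" codec follows IDNA2003, so this is only an approximation
# of what an IDNA2008 implementation would do.
a_label = u_label.encode("idna")

# Converting back recovers the U-label.
round_trip = a_label.decode("idna")

print(a_label)     # b'xn--bcher-kva'
print(round_trip)  # bücher
```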

>> The Web stack might look like this:
>>
>> human user
>> Web app
>> HTML/IRI/IDNA2003
>
> In terms of standards (no matter how much they are ignored in
> some quarters), HTML prior to the still-under-development HTML5
> requires URIs (and hence A-labels).

Yes, I realize that a strict reading of HTML4 and IDNA2003 could lead
to that conclusion, but it seems you also realize that the
implementations themselves have changed since then.

>> HTTP/URI/DNS
>> TCP/UDP
>> IP
>>
>> Now, one of the issues with IDNA2008 is whether or not to
>> include mapping as a MUST. Of course, one way to do this is to
>> have a separate RFC for mapping, and have the main IDNA
>> protocol refer to the mapping spec, saying that the mapping
>> must occur "somewhere" in the stack above. It sounds like some
>> of the WG members would like to "push" the mapping all the way
>> up the stack to the app (in the UI, immediately after keyboard
>> or other entry).
>
> A variation, which I'm finding increasingly interesting for
> other reasons, is to consider mapping part of the IRI/URI
> boundary, thereby permitting it to be different for different
> protocols if that is useful (and it may be).

Until recently, I had been thinking that having different mappings for
different protocols (e.g. HTTP and email) could lead to
incompatibilities between the domain names used in the different
protocol stacks, somewhat akin to the "balkanization" problem that
people have mentioned whenever someone appears to want to do something
differently for a single language (such as a European Latin-based
language).

But perhaps the protocol stacks are different enough in nature that
having (slightly?) different mapping rules might still be OK. Would we
extend that to the prohibition tables themselves, though? I.e. would
HTTP have a different set of PVALID characters than email? (Note that
I am not actually pushing for this at the moment. I am just exploring
this idea, as a kind of logical extension of the idea of different
mappings that you mentioned.)

>> But we have also talked about "getting the user used to
>> lower-case in the DNS" (by displaying in lower-case, etc). So
>> my question is: What is the goal of IDNA? Is it a goal to have
>> software map non-ASCII characters to lower-case to simulate
>> traditional DNS behavior with ASCII strings? Or is it a goal
>> to teach the user to enter lower-case in the first place
>> (effectively pushing the lower-case mapping all the way up to
>> the human brain)?
>
> I don't think either of those is the goal.  I think the goal is
> to permit useful mnemonics for network resources in a wide range
> of scripts.  To me, "useful mnemonics" means as much flexibility
> as possible without compromising the utility or integrity of
> identifiers or the contexts in which they are embedded.  And I
> consider things that create confusion among users --including
> having things floating around that seem to match but don't and
> vice versa-- to compromise the utility and integrity of those
> identifiers.

Yes, I understand, but we currently have upper and lower case ASCII
floating around, so a user might be confused if upper-case non-ASCII
did not "work".

> The questions you raise above are, IMO, about tradeoffs for
> realizing that goal, not goals in themselves.  Remember too
> that, if the best thing for the user is that anything she
> expects to match should match, then we really need mapping rules
> that cause decorated versions of some characters to match the
> undecorated versions, at least sometimes (and a way to evaluate
> and determine "sometimes"), not just case mapping to
> most-nearly-related character.

Yes, it is a question of drawing a line between "reasonable" mappings
and "unreasonable" ones. The German umlauted characters are often
considered equivalent to ae/oe/ue, but that is not true of the
Scandinavian decorated characters, as you have said a number of times.
It is unfortunate that we keep getting drawn into
character-by-character discussions and decisions, when we are aiming
not to do that. Perhaps that is just the nature of the beast, since we
are trying to use a world-wide standard (Unicode) in a system that is
being used in different ways in different countries (DNS).
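
As an illustration of why this ends up character-by-character, here is a small sketch. GERMAN_MAP and german_map are hypothetical names made up for this example; no such table exists in Unicode or in any IDNA draft, which is rather the point: generic Unicode normalization does not equate the umlauted characters with ae/oe/ue, so the equivalence would have to come from a language-specific table.

```python
import unicodedata

# Hypothetical German-only mapping table (an assumption for illustration).
GERMAN_MAP = {"ä": "ae", "ö": "oe", "ü": "ue"}

def german_map(label: str) -> str:
    """Lower-case the label, then apply the hypothetical German table."""
    return "".join(GERMAN_MAP.get(ch, ch) for ch in label.lower())

# Generic Unicode normalization leaves the umlaut alone.
print(unicodedata.normalize("NFKC", "müller") == "mueller")  # False

# The German table gives the result a German user might expect...
print(german_map("Müller"))  # mueller

# ...but applied to a Swedish name it produces a form that is not
# considered equivalent in that language.
print(german_map("Malmö"))   # malmoe
```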

Erik

