Standardizing on IDNA 2003 in the URL Standard

John C Klensin klensin at jck.com
Thu Jan 16 20:04:21 CET 2014



--On Thursday, January 16, 2014 12:55 -0500 John Cowan
<cowan at mercury.ccil.org> wrote:

> John C Klensin scripsit:
> 
>> The distinction between mapping for something typed or
>> otherwise specified directly by the user and a mapping
>> requirement for domains or URLs/URIs stored in documents,
>> search or DNS examination programs, and the like keeps
>> getting lost in this set of discussions, but is really,
>> seriously, important.
> 
> I'm not so sure.  In the end, URLs in documents tend to be
> typed by the user too, it's just a different kind of user.

But there is always something of a transformation process to get
it into the document.  For example, users don't type UTF-8, they
type stuff that gets mapped via various procedures into UTF-8 or
something else.

> You could argue that document editors should do the mapping
> themselves, but then you're back to the old stand.

Maybe I am "back to the old stand" -- I'm just trying to explain
a perspective that has some history of being useful.  That
history, for me and even for i18n issues specifically, extends
back to the late 60s, which is, indeed, very "old stand".
However, I think there are ultimately two cases as far as
document editors are concerned:

(1) The mapping that might be used is trivial -- either the
ASCII cases or things like full-width East Asian characters (many
ASCII characters fall into this category only if one is willing
to assume that, e.g., "A" always means/maps to "a" rather than
any of the decorated lower-case forms that, in various localized
writing system contexts, lose their decorations when being
mapped to upper case).  For most or all of these cases, it ought
to be trivial for document editors to simply enter the canonical
forms.  If there is some reason why they don't and mapping is
needed, that is ok too.
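
To make case (1) concrete, here is a minimal Python sketch of what a
"trivial" mapping might look like.  The function name trivial_map is
hypothetical, and NFKC plus lower-casing is only a stand-in for the
idea; it is not the mapping table of IDNA 2003, UTS46, or any other
specification.

```python
import unicodedata


def trivial_map(label: str) -> str:
    """Hypothetical helper: fold compatibility forms, then lower-case.

    NFKC turns full-width Latin letters into their ASCII equivalents;
    lower() then handles the plain ASCII case mapping.
    """
    return unicodedata.normalize("NFKC", label).lower()


# Plain ASCII upper case maps to lower case.
print(trivial_map("EXAMPLE"))

# Full-width "EXAMPLE" (U+FF25 U+FF38 U+FF21 U+FF2D U+FF30 U+FF2C U+FF25)
# maps to the same ASCII string.
print(trivial_map("\uff25\uff38\uff21\uff2d\uff30\uff2c\uff25"))
```

Both calls yield the same lower-case ASCII label, which is also what a
document editor could have typed directly in the first place.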

(2) The more complex cases in which mappings can turn a
character into a non-obvious alternative.  For these cases, the
document author/editor had better know what she is doing.  The
reality is that those mappings may be done or not done,
unpredictably and depending on environment and circumstances and
the decisions may have inadvertent blocking side-effects.  If,
for example, a label that contains ZWNJ is registered and (as
UTS46 and other things recommend as a reasonable option) the
same string without ZWNJ is blocked, then an IDN resolving engine
that maps ZWNJ to nothing prevents use of the name (similarly
for sharp-S, etc.).   For these cases, if the document editor
knows what is going on, then specifying exactly what is intended
(in A-label or at least U-label form) is the best and least
risky thing she can do.  We betray the trust that implies
--trust that she is, in fact, smart enough to know what she is
doing-- if we second-guess her canonical strings by mapping
them to something else.   Conversely, if the document editor
doesn't have a clue, it is not clear to me that we are doing
either him or his users/readers a favor by encouraging ambiguity
in an identifier that they have been told is not ambiguous.
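
The ZWNJ and sharp-S cases can be illustrated with a small Python
sketch.  It assumes Python's built-in "idna" codec, which follows the
IDNA 2003 rules (RFC 3490 with Nameprep); the helper name
idna2003_ascii and the example.* names are made up for illustration.

```python
ZWNJ = "\u200c"      # ZERO WIDTH NON-JOINER
SHARP_S = "\u00df"   # LATIN SMALL LETTER SHARP S


def idna2003_ascii(domain: str) -> bytes:
    """Hypothetical helper: apply IDNA 2003 processing to a domain name
    using Python's built-in 'idna' codec."""
    return domain.encode("idna")


# Nameprep maps ZWNJ to nothing, so the label with ZWNJ and the label
# without it become the same lookup string.
with_zwnj = idna2003_ascii("a" + ZWNJ + "b.example")
without_zwnj = idna2003_ascii("ab.example")
print(with_zwnj == without_zwnj)

# Sharp-S is mapped to "ss", so a name containing it can no longer be
# distinguished from the "ss" spelling.
print(idna2003_ascii("fa" + SHARP_S + ".example"))
```

Once the mapping has been applied, a resolver has no way to reach a
name that was registered with ZWNJ or sharp-S if the mapped form is
blocked or belongs to someone else.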

At least that is what it looks like from here.  YMMV.

   john





More information about the Idna-update mailing list