Parsing the issues and finding a middle ground -- another attempt

Wed Mar 4 00:54:09 CET 2009

(Message rearranged ... easy part first)

--On Tuesday, March 03, 2009 15:05 -0800 Erik van der Poel
<erikv at google.com> wrote:

>...
> I can live with Eszett, Final Sigma, ZWJ and ZWNJ under xn--,
> but I'd be happier if we heard some kind of confirmation or
> approval from the German, Greek and Iranian registries,
> respectively.

We already have "confirmation" from the German registry.  I
think the notes that Marcos posted months ago were completely
clear.  If I correctly understand the position of the Greek
registry, they would prefer the IDNA2003 mappings for Final
Sigma and would also prefer additional mappings to cover the
Tonos cases.  If we adopted this model, they could effectively
get the behavior they want for Final Sigma by declining to
register any labels that actually contained that character.  I
don't know whether they would be better or worse off with the
character itself as PVALID or banned, but my instinct says it
would be better treated as PVALID in the protocol.

For ZWJ/ZWNJ, there are actually a fairly significant number of
registries and scripts involved.  The characters are as
important, or more so, in Pakistan than they are in Iran and may
be more important for Devanagari and some other Indic scripts
than they are for anything Arabic-script-based.  Over the years,
we've gotten significant input from the Indian registry that
they are needed (although no longer, with Unicode 5.1, for
Malayalam).
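
For anyone who wants to see the IDNA2003 behavior being compared
against, here is a rough sketch.  It assumes that Python's
standard-library "idna" codec (which implements IDNA2003 ToASCII
with nameprep) is an acceptable stand-in for a 2003-style
client.  The helper name to_ascii_2003 and the sample labels are
invented for illustration, and none of this says what IDNA2008
should do.

```python
def to_ascii_2003(label: str) -> str:
    """Hypothetical helper: IDNA2003 ToASCII via Python's built-in codec."""
    return label.encode("idna").decode("ascii")

# Eszett is mapped to "ss", so the label never needs an xn-- form.
print(to_ascii_2003("stra\u00dfe"))      # prints: strasse

# ZWNJ (U+200C) and ZWJ (U+200D) are "mapped to nothing".
print(to_ascii_2003("a\u200cb"))         # prints: ab
print(to_ascii_2003("a\u200db"))         # prints: ab

# Final Sigma (U+03C2) is mapped to ordinary sigma (U+03C3), so the
# two spellings below end up as the same A-label.
with_final = to_ascii_2003("\u03bb\u03cc\u03b3\u03bf\u03c2")
with_plain = to_ascii_2003("\u03bb\u03cc\u03b3\u03bf\u03c3")
print(with_final == with_plain)          # prints: True
```

Under the model being discussed, a registry that wants the old
Final Sigma behavior would get the same effect by never
registering labels containing U+03C2.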

>...
> I am sympathetic with John's concern that implementations
> ought to be able to perform local mappings for Turkish, and
> still claim compliance with the protocol.
> 
> I am also sympathetic with John's concern that other
> protocols, such as email, ought to be able to mandate U-labels
> or A-labels (without mapping).
>...

I've been hinting about what is about to follow but seem to get
sidetracked every time I try to construct a note that works its
way through all of the cases and issues.  Just to mention
another alternative before we go too far down one path or
the other...

Independent of what various implementations have done and gotten
away with often enough to establish a frequent practice
(remembering that a small fraction of the web pages in the world
is still a very large number), the intended (by the IETF, at
least) reading of the URI spec is that IDNs in the domain name
field must be in A-label form.  There is no real provision in
that interpretation of that spec for %-escaped UTF-8: one can
write them, but it isn't an IDN slot so one should really expect
that they would be looked up in the DNS with the percent signs
and digits, not converted to A-labels.

IRIs, obviously, can contain UTF-8 strings in the domain part.

Just as an idea, rather than as a proposal, one possibility
would be to more clearly specify that the IRI -> URI mapping was
required to convert the domain-specific field into A-label form
(rather than converting to %-escape form) _and_ to apply the
IDNA2003 mappings (or whatever mappings one could get consensus
on) there, possibly with the mappings specified on a
per-protocol basis.
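
To make the two candidate conversions concrete, here is a small
sketch for a single made-up host name.  It again assumes
Python's built-in "idna" codec (IDNA2003) as the mapping step;
the variable names are mine and carry no special meaning.

```python
from urllib.parse import quote

# Hypothetical domain part of an IRI, written in Unicode.
host = "b\u00fccher.example"

# Conversion 1: %-escape the UTF-8 octets.  Read strictly, the
# result is not an IDN and would be looked up with the percent
# signs and digits in it.
percent_form = quote(host)
print(percent_form)   # prints: b%C3%BCcher.example

# Conversion 2: map and convert each label to A-label form.
alabel_form = host.encode("idna").decode("ascii")
print(alabel_form)    # prints: xn--bcher-kva.example
```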

That would:

	* Isolate the mappings completely from IDNA, while
	providing a logical place to apply them.

	* Permit isolating concerns about the web generally, and
	HTML/HTTP in particular, from other protocols and
	concerns.

	* Be consistent with current extended use of URIs if one
	adopted some changes in terminology (but with no effect
	on practice).  Specifically, since %-escapes are valid
	in IRIs (although generally silly), we could describe
	the extensions that use non-ASCII strings in URI slots
	as extensions that permit IRIs in those contexts rather
	than the standard-required URIs.  That still doesn't
	make it a good idea (that is a separate discussion), but
	it would decouple the issue of non-ASCII characters, and
	even %-escapes for domain slots, in URIs from the URI
	spec and make them issues of where IRIs are
	permitted.  And, for protocols where mapping was
	appropriate, it would permit the mappings to be applied
	regardless of whether the domain name was written in
	UTF-8, in %-escaped UTF-8 or, in principle, in some
	other character set or in A-labels -- the decisions
	would be matters for the IRI spec, not IDNA.
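
As a sketch of that last point, the function below accepts the
domain part in UTF-8, in %-escaped UTF-8, or already in A-label
form and produces the same A-labels in each case.  The name
domain_to_alabels is hypothetical, the mapping step is assumed
to be IDNA2003 as implemented by Python's "idna" codec, and a
real IRI -> URI converter would also have to locate the domain
field within the IRI, which is not shown.

```python
from urllib.parse import unquote

def domain_to_alabels(domain: str) -> str:
    """Hypothetical IRI -> URI step for the domain field only."""
    # Undo any %-escaped UTF-8 first; plain text passes through as is.
    decoded = unquote(domain)
    # Apply the mappings and convert to A-labels (IDNA2003 assumed).
    # All-ASCII input, including existing A-labels, is returned
    # unchanged by the codec.
    return decoded.encode("idna").decode("ascii")

for written_as in ("b\u00fccher.example",       # UTF-8
                   "b%C3%BCcher.example",       # %-escaped UTF-8
                   "B\u00dcCHER.example",       # needs case mapping
                   "xn--bcher-kva.example"):    # already A-labels
    print(domain_to_alabels(written_as))        # prints: xn--bcher-kva.example
```

A per-protocol variant could swap in a different mapping step,
or none at all, without touching IDNA itself.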

I don't know if this is a good idea or not.  And, obviously, the
above is just an outline of a description rather than what I had
intended to write.  But it is a different logical possibility for
dealing with the mapping issues and one that would make most of
the transition issues into web issues rather than IDNA issues.

     john