IDNA applications (was: RE: sharp s (Eszett))

Fri Mar 7 22:45:11 CET 2008

--On Friday, 07 March, 2008 13:03 -0800 Paul Hoffman
<phoffman at imc.org> wrote:

> Just to be clear: this is not just browsers. All applications
> that use IDNA2003 (mail programs, IM clients, and so on) will
> need to have the self-update property described above,
> assuming that we adopt the "mapping is done in the
> application" idea.

Paul,

One needs this even without the "mapping is done in the
application" rule.  

What requires it is getting rid of the rule that unassigned code
points are looked up.  Looking those code points up and having
that be stable across Unicode versions essentially requires the
assumption that no character will be added to Unicode in the
future that would have either a canonical or compatibility
composition or that would be transformed by casemapping
(essentially that no more upper case characters would be added).
That requirement is untenable, as evidenced by both Erik's
Bactrian examples and the recent introduction of an upper-case
Eszett that was part of the beginning of this thread.

As long as one permits looking up unassigned characters, all
three possible cases (other than "freeze at 3.2 forever") get us
into different types of trouble:

	* Simply upgrading Nameprep/Stringprep to Unicode 5.0
	or 5.1 without making any of the other IDNA200X changes
	would create an ambiguity.  As I think Erik pointed out,
	an IDNA2003-conforming implementation would encode
	U+03F7 into Punycode as itself (as an unknown code
	point) and try to look it up, while an updated
	application would presumably casefold it to U+03F8 and
	look that up.

	* An IDNA200X-conforming implementation without the "no
	mappings" rule (i.e., with "look up unassigned code
	points" and with mappings similar to Stringprep) would
	behave exactly as above for U+03F7.  However, an
	implementation or standards model consistent with
	Unicode 5.0 would run into substantially the same
	problem when compared to Unicode 5.1 and the inclusion
	of upper-case Eszett with the latter.   Without looking
	up unassigned code points, the Unicode 5.0 version would
	be unable to access the code point that Unicode 5.1
	assigns to upper-case Eszett; once that code point was
	accessible (i.e., after upgrading to Unicode 5.1), it
	would casefold properly.

	* An IDNA200X-conforming implementation without mappings
	would reject U+03F7 entirely for Unicode 5.0 and 5.1 (no
	mappings) and would reject the codepoint for upper-case
	Eszett entirely for 5.1 (again, no mappings) but would
	also reject that codepoint for 5.0 because it was
	unassigned.
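
To make the cases above concrete, here is a rough Python sketch.  It
is an illustration under stated assumptions, not an implementation of
any specification: the function names are hypothetical, Python's
bundled Unicode 3.2 tables (unicodedata.ucd_3_2_0) stand in for an
implementation frozen at the IDNA2003 version of Unicode, the running
interpreter's tables stand in for an updated one, and plain
casefolding stands in for the full Nameprep mapping.

```python
import unicodedata

# Unicode 3.2 tables bundled with Python; a stand-in (assumption) for
# an implementation frozen at the IDNA2003 version of Unicode.
OLD_UCD = unicodedata.ucd_3_2_0


def _to_label(mapped: str) -> str:
    """Punycode-encode a mapped label, leaving all-ASCII results as-is."""
    if mapped.isascii():
        return mapped
    return "xn--" + mapped.encode("punycode").decode("ascii")


def old_lookup(label: str) -> str:
    """Old behavior: code points unassigned in 3.2 are looked up unmapped."""
    mapped = "".join(
        ch if OLD_UCD.category(ch) == "Cn" else ch.casefold() for ch in label
    )
    return _to_label(mapped)


def updated_lookup(label: str) -> str:
    """Updated behavior: casefold with the interpreter's newer tables."""
    return _to_label(label.casefold())


def strict_lookup(label: str, ucd=unicodedata) -> str:
    """No mappings and no unassigned code points: reject instead."""
    for ch in label:
        if ucd.category(ch) == "Cn":
            raise ValueError(f"U+{ord(ch):04X} is unassigned")
        if ch.casefold() != ch:
            raise ValueError(f"U+{ord(ch):04X} would need mapping")
    return _to_label(label)


if __name__ == "__main__":
    sho = "\u03f7"
    # The two lookups disagree: the old one keeps U+03F7, the updated
    # one folds it to U+03F8 first.
    print(old_lookup(sho), updated_lookup(sho))
```

With these stand-ins, old_lookup and updated_lookup produce different
labels for U+03F7, which is the ambiguity in the first case, while
strict_lookup raises for both U+03F7 and upper-case Eszett (U+1E9E)
whichever table is passed as ucd: as unassigned under the old table,
as requiring a mapping under the new one.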

We have the "no mapping" rule in the proposed document because
we believe that it is much clearer for both end users and
registrants and because it appears to offer better stability in
the types of relationships above.   But explicit upgrading is
required to gain access to newly-coded characters under any
scenario that doesn't involve looking up unassigned code points.
And looking
up those code points, _especially_ if mapping is included in the
protocol, leads to inconsistent results.

     john