Bundling

John C Klensin klensin at jck.com
Mon Dec 7 19:20:04 CET 2009



--On Monday, December 07, 2009 21:02 +0900 "Martin J. Dürst"
<duerst at it.aoyama.ac.jp> wrote:

> Hello Shawn,
> 
> I think with respect to bundling, ß and ς are quite
> different, as follows:
> 
> ς:
> 1) ς/σ distinction (virtually?) never a distinction of
> meaning, only  contextual.

I hope that is true, but, given the tendency to create domain
labels (rather than words) by mashing words together, I don't
know of any way to be completely sure without a really extensive
knowledge of Greek.

> 2) Need for bundling limited to registries/zone operators
> allowing Greek. [3) Potentially needed soon for Cypriot IDN
> TLD]

> ß:
> 1) ß/ss distinction actually significant to distinguish
> between certain  words (and especially names)
> 2) "ss" substring essentially used/usable in every
> registry/zone around  the world.
> 
> [I hope somebody else can provide details on ZWJ/ZWNJ for
> point 1); it's  clear that for point 2), they are more like ς
> than like ß.]

We've been told by folks with a great deal of knowledge that
their presence or absence can change one word to another in
several languages.   That makes them more like "ß/ss" than like
"ς/σ".  I don't know what we are trying to reopen here (Vint
has already indicated that ZWJ/ZWNJ are settled issues and off
the table, which I believe should be correct), but I also note
that, absent contextual rules or in the presence of any "map to
nothing" convention, ZWJ or ZWNJ could be inserted into any
string in any script in the world... and would cause
native-character comparisons to fail.
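
To make the comparison failure concrete, here is a minimal Python
sketch.  The label "example" and the helper strip_joiners are
hypothetical, chosen only for illustration; the helper stands in for
a "map to nothing" convention and is not taken from any IDNA
specification text.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def strip_joiners(label: str) -> str:
    """Hypothetical 'map to nothing' step: drop ZWJ/ZWNJ wherever they occur."""
    return label.replace(ZWNJ, "").replace(ZWJ, "")


plain = "example"
with_zwnj = "exam" + ZWNJ + "ple"  # typically displays the same as "example"

# A native-character (code point) comparison sees two different strings.
raw_equal = plain == with_zwnj

# After a map-to-nothing step, both collapse into the same label.
mapped_equal = strip_joiners(plain) == strip_joiners(with_zwnj)

print(raw_equal, mapped_equal)
```

The point is only that the two strings differ by an invisible code
point; whether a given registry or application would map, reject, or
accept it depends on the contextual rules in force.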

I'd also question the "essentially ... every registry in the
world" assertion wrt "ss".  While it may be true today, many of
those proposing IDN TLDs intend to keep those domains (top to
bottom) single-script.   One can speculate on how realistic they
are being, but the intent is clear.

> This suggests to me that for ς, we can go with IDNA 2008 and
> bundling  immediately, without the need for TR46. (Even in the
> long term, we may  not get rid of bundling because Greeks seem
> to care a lot about all-uppercase.)

I'm not making any predictions here, but I can imagine the same
forces that drove German interests to push for, and get, upper
case Eszett to eventually lead to a Greek demand for an
upper-case Sigma that would differ from the normal one only by
virtue of mapping unambiguously to and from [lower case] final
Sigma.  These decisions we make about computer coding and use
(distinctions that are not necessary when writing or typing) can
lead to other decisions, sometimes ones we would not predict, to
make things work predictably as expected by end users.
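
The reason all-uppercase matters for ς/σ can be sketched in a few
lines of Python, relying only on the interpreter's built-in Unicode
case mapping.  The sample string is an arbitrary illustration, and the
lowercasing behaviour shown is that of CPython's str.lower(), which
picks final sigma by context; other libraries may behave differently.

```python
FINAL_SIGMA = "\u03c2"    # ς
SIGMA = "\u03c3"          # σ
CAPITAL_SIGMA = "\u03a3"  # Σ

stem = "\u03bf\u03b4\u03bf"       # οδο (arbitrary sample letters)
with_final = stem + FINAL_SIGMA   # ends in ς
with_medial = stem + SIGMA        # ends in σ

# Both lowercase forms map to the same uppercase string.
upper_final = with_final.upper()
upper_medial = with_medial.upper()
same_upper = upper_final == upper_medial

# Lowercasing the uppercase form has to guess from context which
# sigma was meant; CPython chooses ς at the end of a word.
round_trip = upper_final.lower()

print(same_upper, round_trip == with_final, round_trip == with_medial)
```

Uppercasing loses the distinction, so the mapping back is a contextual
guess rather than an unambiguous inverse.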

> It suggests that ß is much tougher, because we essentially
> have a choice  between giving up and staying with the
> half-baked situation that we have  now, and doing the right
> thing in the long run. Both of these choices  are clearly
> suboptimal.

I'd describe that differently because I think three choices are
being discussed:

	(a) Treat it as PVALID and deal with a transition (I
	think that is your "doing the right thing in the long
	run").
	
	(b) Treat it as DISALLOWED, ban the character, and hope
	that no one forces us to change that decision.  Some
	people will map it to "ss" and some won't.
	
	(c) Continue to map as a requirement of the protocol. 

While the first two involve obvious tradeoffs, the third is,
IMO, harmful because it breaks the U-label <-> A-label
relationship with all of the sweeping costs of that decision
(see Patrik's several notes on the subject).   I agree that
there are no ideal solutions that don't involve turning back the
clock.  
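
What "breaks the U-label <-> A-label relationship" means in practice
can be seen with Python's built-in "idna" codec, which implements the
IDNA2003 mapping rather than IDNA2008.  This is a rough sketch under
that assumption; the labels and the helper round_trips are
hypothetical examples, not anything from the drafts.

```python
def round_trips(u_label: str) -> bool:
    """True if encoding to the ASCII form and decoding returns the input."""
    a_label = u_label.encode("idna")
    return a_label.decode("idna") == u_label


eszett_label = "fa\u00df"  # "faß"

# Under the IDNA2003 mapping, ß is folded to "ss" before encoding.
a_label = eszett_label.encode("idna")
back = a_label.decode("idna")

print(a_label, back, round_trips(eszett_label))

# A label that needs no mapping survives the round trip unchanged.
print(round_trips("b\u00fccher"))
```

Once the mapping is part of the protocol, the string the user typed
cannot be recovered from what is stored in the DNS, which is the cost
being described for option (c).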

Perhaps more important, as Cary and others have pointed out, we
may be exaggerating the transition difficulties.  While the
"ß/ss" relationship exists as the result of decisions made by
Unicode and IDNA2003 and the "ö/oe" relationship does not, from the
standpoint of a registry considering permitting registration of
labels based on German, they are almost the same: a new
character is being introduced that was formerly commonly
represented in a different way, the old form could appear in any
registry that supports Latin characters, there are many
situations in which the old form cannot safely be converted to
the new one even though the new one can almost always
(correctness of spelling aside) be represented by the old one,
and so on.  "Bundling" and other variations of the JET
Variant approach are certainly possible mechanisms, but so are a
whole collection of "sunrise" or other privileged registration
processes.  The latter are _lots_ easier and cheaper for the
registry and may be equally satisfactory in the long term.  They
might even work better in some situations.   But, either way, we
have lots of experience with them and the level of pain didn't
kill anyone.

regards,
   john





More information about the Idna-update mailing list