Bundling

"Martin J. Dürst" duerst at it.aoyama.ac.jp
Tue Dec 8 09:37:14 CET 2009


Hello John,

On 2009/12/08 3:20, John C Klensin wrote:
>
>
> --On Monday, December 07, 2009 21:02 +0900 "Martin J. Dürst"
> <duerst at it.aoyama.ac.jp> wrote:
>
>> Hello Shawn,
>>
>> I think with respect to bundling, ß and ς are quite
>> different, as follows:
>>
>> ς:
>> 1) ς/σ distinction (virtually?) never a distinction of
>> meaning, only  contextual.
>
> I hope that is true, but, given the tendency to create domain
> labels (rather than words) by mashing words together, I don't
> know of any way to be completely sure without a really extensive
> knowledge of Greek.

I'm not sure either. There is definitely the potential of a long word 
(with a σ in the middle) and a combination of two words (with a ς at the 
end of the first word, such as Peter's μεγεθοςσοφια) to be otherwise 
identical. But I think this is highly unlikely, orders of magnitude less 
likely than conflicts in German (where names are the big source, and we 
don't have to string words together to try to find issues). Of course, 
I'd like to hear from Greek experts on this issue.
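
To make the potential collision concrete, here is a minimal Python sketch. 
The helper bundle_key is hypothetical (it is not defined by IDNA 2008 or 
TR46); it only illustrates what a registry-side bundling rule for ς/σ 
might compare on.

```python
FINAL_SIGMA = "\u03c2"  # ς
SIGMA = "\u03c3"        # σ


def bundle_key(label: str) -> str:
    """Hypothetical bundling key: treat final and non-final sigma as one letter."""
    return label.replace(FINAL_SIGMA, SIGMA)


# Two words mashed together, with ς at the end of the first word.
two_words = "μεγεθοςσοφια"
# The same letters spelled as one long word, with σ in the middle.
one_word = two_words.replace(FINAL_SIGMA, SIGMA)

# As code point sequences the two labels are distinct...
assert two_words != one_word
# ...but they fall into the same bundle under the hypothetical key,
assert bundle_key(two_words) == bundle_key(one_word)
# and the distinction also disappears in all-uppercase.
assert two_words.upper() == one_word.upper()
```

A registry that bundles on such a key would delegate both spellings to the 
same registrant, or block one of them.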

>> 2) Need for bundling limited to registries/zone operators
>> allowing Greek. [3) Potentially needed soon for Cypriot IDN
>> TLD]
>
>> ß:
>> 1) ß/ss distinction actually significant to distinguish
>> between certain  words (and especially names)
>> 2) "ss" substring essentially used/usable in every
>> registry/zone around  the world.
>>
>> [I hope somebody else can provide details on ZWJ/ZWNJ for
>> point 1); it's  clear that for point 2), they are more like ς
>> than like ß.]
>
> We've been told by folks with a great deal of knowledge that
> their presence or absence can change one word to another in
> several languages.   That makes them more like "ß/ss" than like
> "ς/σ".  I don't know what we are trying to reopen here --Vint
> has already indicated that ZWJ/ZWNJ are settled issues and off
> the table, which I believe should be correct--

In one sense, that's fine. But I can't shake the impression that we 
aren't talking much about these simply because we are more familiar 
with the other two examples. I definitely have to admit that this is the 
case for me. Anyway, currently, TR46 treats all four cases the same, and 
if we are going to introduce something like TRANSITIONAL, we had better 
make sure it works for all four, or be pretty sure that it's not needed 
for some of them.


> but I also note
> that, absent contextual rules or in the presence of any "map to
> nothing" convention, ZWJ or ZWNJ could be inserted into any
> string in any script in the world... and would cause
> native-character comparisons to fail.
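
The comparison failure described above is easy to reproduce. A minimal 
Python sketch follows; strip_joiners is a hypothetical "map to nothing" 
helper, and the ASCII label is chosen only for readability (it is not a 
case where a joiner would be linguistically meaningful).

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def strip_joiners(label: str) -> str:
    """Hypothetical 'map to nothing' step: drop ZWJ and ZWNJ from a label."""
    return label.replace(ZWNJ, "").replace(ZWJ, "")


plain = "example"
with_joiner = "exam" + ZWNJ + "ple"

# The two labels typically render the same but differ as code point sequences,
# so a native-character comparison fails.
assert plain != with_joiner
assert len(with_joiner) == len(plain) + 1

# Only after a "map to nothing" step do they compare equal.
assert strip_joiners(with_joiner) == plain
```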



> I'd also question the "essentially ... every registry in the
> world" assertion wrt "ss".  While it may be true today, many of
> those proposing IDN TLDs intend to keep those domains (top to
> bottom) single-script.   One can speculate on how realistic they
> are being, but the intent is clear.

I agree that I don't have any guarantee that just because I have used 
maße.xy in the past, I can actually get maße.xy from the xy registry. 
Actually, my maße.xy was completely fictitious and exaggerated. It might 
be a good idea to get an idea from Mark or Erik about how many links 
with "ß" in domain names they found with prefixes for which they didn't 
find any with ä/ö/ü, which would be those prefixes that don't seem to 
address German.


>> This suggests to me that for ς, we can go with IDNA 2008 and
>> bundling  immediately, without the need for TR46. (Even in the
>> long term, we may  not get rid of bundling because Greeks seem
>> to care a lot about all-uppercase.)
>
> I'm not making any predictions here, but I can imagine the same
> forces that drove German interests to push for, and get, upper
> case Eszett to eventually lead to a Greek demand for an
> upper-case Sigma that would differ from the normal one only by
> virtue of mapping unambiguously to and from [lower case] final
> Sigma.  These decisions we make about computer coding and use --
> distinctions that are not necessary when writing or typing-- can
> lead to other decisions, sometimes ones we would not predict, to
> make things work predictably as expected by end users.

Something like this might indeed happen, but I think the probability is 
much, much lower than for German. The reason for this is that ς is 
essentially a typographic variant of σ, whereas ß is an orthographic 
variant of ss.

>> It suggests that ß is much tougher, because we essentially
>> have a choice  between giving up and staying with the
>> half-baked situation that we have  now, and doing the right
>> thing in the long run. Both of these choices  are clearly
>> suboptimal.
>
> I'd describe that differently because I think three choices are
> being discussed:
>
> 	(a) Treat it as PVALID and deal with a transition (I
> 	think that is your "doing the right thing in the long
> 	run").

Yes.

> 	(b) Treat it as DISALLOWED, ban the character, and hope
> 	that no one forces us to change that decision.  Some
> 	people will map it to "ss" and some won't.
> 	
> 	(c) Continue to map as a requirement of the protocol.
>
> While the first two involve obvious tradeoffs, the third is,
> IMO, harmful because it breaks the U-label <-> A-label
> relationship with all of the sweeping costs of that decision
> (see Patrik's several notes on the subject).

Also, having to map exactly one character (or maybe two or four) as part 
of the protocol seems a bad idea in and of itself.

> I agree that
> there are no ideal solutions that don't involve turning back the
> clock.
>
> Perhaps more important, as Cary and others have pointed out, we
> may be exaggerating the transition difficulties.  While the
> "ß/ss" exists as the result of decisions made by Unicode and
> IDNA2003 and the "ö/oe" relationship does not, from the
> standpoint of a registry considering permitting registration of
> labels based on German, they are almost the same: a new
> character is being introduced that was formerly commonly
> represented in a different way, the old form could appear in any
> registry that supports Latin characters, there are many
> situations in which the old form cannot safely be converted to
> the new one even though the new one can almost always
> (correctness of spelling aside) be represented by the old one,
> and so on.  "Bundling" or some other variation of the JET
> Variant approach are certainly possible mechanisms, but so are a
> whole collection of "sunrise" or other privileged registration
> processes.  The latter are _lots_ easier and cheaper for the
> registry and may be equally satisfactory in the long term.  They
> might even work better in some situations.   But, either way, we
> have lots of experience with them and the level of pain didn't
> kill anyone.

Well, yes, but there is an important difference between bundling or 
sunrising or whatever for a truly new character (for example a character 
that's in Unicode 5.2 but not in Unicode 3.2) and doing the same thing 
for a character that is completely legal and has a well-defined mapping 
in IDNA 2003.
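
For readers who want to see the existing IDNA 2003 behaviour, here is a 
small sketch using Python's built-in "idna" codec, which implements 
RFC 3490 (IDNA 2003), not IDNA 2008 or TR46. The helper name 
idna2003_roundtrip is my own, and the labels are illustrative only.

```python
def idna2003_roundtrip(label: str) -> str:
    """Encode with Python's built-in IDNA 2003 codec, then decode again."""
    return label.encode("idna").decode("idna")


# ß is legal input in IDNA 2003, but it is mapped to "ss" before lookup,
# so the label never reaches the wire as anything other than "masse".
assert "maße".encode("idna") == b"masse"

# The mapping is lossy: the original spelling cannot be recovered.
assert idna2003_roundtrip("maße") == "masse"

# Final sigma is likewise mapped to ordinary sigma.
assert "μεγεθος".encode("idna") == "μεγεθοσ".encode("idna")
assert idna2003_roundtrip("μεγεθος") == "μεγεθοσ"
```

This is what makes the transition different from that of a truly new 
character: names written with ß already resolve today, to whatever is 
registered under the "ss" spelling.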

Regards,    Martin.


-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp

