Mapping and Variants

Mon Mar 9 14:51:07 CET 2009

Vint,

I believe John was making a different point. He used the words
"tradeoff" and "conflict" to describe the situation with mapping and
variants.

Until now, we have often talked about bundling as a way of handling
characters that look similar. A very commonly cited piece of work is
the East Asian JET RFC, since there are so many Han characters, and
many of them look similar.

However, the Greek small alpha and Latin small a do not look similar,
so you wouldn't normally think of bundling them. Their capitals do
look alike, though, and since upper-to-lower case mapping is performed
by a certain number of clients, the registrant must "all of a sudden"
consider bundling Greek small alpha and Latin small a.

So this could be viewed as a "tradeoff" because the advantages of
mapping lead to the disadvantage of having to bundle.

It could even be viewed as a "conflict" because the mapping leads to
the need to bundle, even if there might be two different registrants,
one who really wants the Greek small alpha and the other who really
wants the Latin small a.
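
As a minimal illustration of the case-mapping point, assuming Python 3
and its built-in Unicode case mapping (str.lower) as a stand-in for
whatever mapping a client applies: the two capitals are hard to tell
apart in most fonts, yet they lower-case to different code points.

```python
import unicodedata

# Two capitals that look alike in most fonts.
LATIN_CAPITAL_A = "\u0041"
GREEK_CAPITAL_ALPHA = "\u0391"


def describe_lowercasing(ch: str) -> str:
    """Return a one-line description of what lower-casing does to ch."""
    lower = ch.lower()  # Python's built-in Unicode case mapping
    return (
        f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
        f"U+{ord(lower):04X} {unicodedata.name(lower)}"
    )


for ch in (LATIN_CAPITAL_A, GREEK_CAPITAL_ALPHA):
    print(describe_lowercasing(ch))
# U+0041 LATIN CAPITAL LETTER A -> U+0061 LATIN SMALL LETTER A
# U+0391 GREEK CAPITAL LETTER ALPHA -> U+03B1 GREEK SMALL LETTER ALPHA
```

A user who types the wrong capital therefore ends up at a different
label after mapping, which is what pushes the registrant toward
bundling the two lower-case forms.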

Personally, instead of saying "tradeoff" or "conflict", I would have
used the word "exacerbate". I.e. the large number of characters in the
Unicode standard is both a blessing and a curse. It is a blessing
because everybody in the world gets to have their own language, even
if some characters look almost the same. It is a curse because of the
security issue in DNS. This security issue has led to the idea of
bundling, but bundling is a pain in the behind. (The Greeks have
complained about the pain of bundling via DNAME, which is not a
complete solution.)

So the existence of upper-to-lower case mapping has increased the need
to bundle, thereby exacerbating the server-side bundling situation.
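
A rough sketch of how the bundle grows, assuming Python 3.9+ and a
hypothetical variant table (VARIANTS below is illustrative only, not
any registry's actual table) that lists just the Latin a / Greek alpha
pair: every position with a variant doubles the number of labels the
registrant would have to cover.

```python
from itertools import product

# Hypothetical variant table: each code point maps to the code points a
# registry might treat as interchangeable with it. Only the Latin a /
# Greek alpha pair is listed here.
VARIANTS = {
    "a": ("a", "\u03b1"),
    "\u03b1": ("a", "\u03b1"),
}


def bundle(label: str) -> list[str]:
    """Enumerate every label obtained by swapping in listed variants."""
    choices = [VARIANTS.get(ch, (ch,)) for ch in label]
    return sorted({"".join(combo) for combo in product(*choices)})


print(len(bundle("aa")))      # 4
print(len(bundle("banana")))  # 8
```

With k such positions in a label the bundle has 2**k members, and each
additional confusable pair in the table adds more positions.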

This is yet another nail in the coffin of client-side global mapping.
Local mapping can solve the problem, but you'd have to be very
careful. I.e. "did you really mean to type a Greek A when the rest of
the label is Latin? For heaven's sake, please use lower-case, which is
less confusing."
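
A minimal sketch of such a local check, assuming Python 3.9+. The
script guess here is a crude heuristic (the first word of each letter's
Unicode character name), not the Unicode Script property, and the
function names are made up for illustration.

```python
import unicodedata
from typing import Optional


def scripts_in(label: str) -> set[str]:
    """Crude script guess: first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()  # digits and hyphens carry no script information
    }


def mixed_script_warning(label: str) -> Optional[str]:
    """Return a warning string if the label mixes scripts, else None."""
    scripts = scripts_in(label)
    if len(scripts) > 1:
        return f"label mixes scripts: {', '.join(sorted(scripts))}"
    return None


print(mixed_script_warning("\u0391pple"))  # Greek capital Alpha + Latin
print(mixed_script_warning("apple"))
```

A client could show the warning before doing any mapping and let the
user retype the label, rather than silently mapping it.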

Erik

On Mon, Mar 9, 2009 at 6:25 AM, Vint Cerf <vint at google.com> wrote:
> Erik, et al,
>
> this is plainly a "side of the bus" problem. Each argument that opens up
> another portion of the Unicode glyph space to use with IDNs increases the
> combinatoric implications for bundling or for abusive registrations.
>
> Martin,
>
> rather than focusing solely on the example that John used, I think it is
> probably more useful to think about the evident side-effects of
> incorporating IPA characters as PVALID under IDNA rules. I am not arguing
> here that they should be excluded but only that if they are included, we
> must think how best to deal with the kinds of confusion that Erik and others
> have described.
>
> I think we all understand that we cannot avoid all forms of confusion by
> relying on protocol-level constraints alone. We already know about the
> zero/one "oh"/"ell" confusion even with the LDH constrained set, for
> example, and with the inclusion of the new Unicode characters, the
> opportunities for confusing registrations are vastly larger.
>
> If you buy the argument that we can't solve this problem entirely with
> protocol rules, then we have to rely on educating
> registries/registrars/registrants using all levels of the hierarchical DNS
> that these problems exist. Of course, there will be those who will exploit
> any opportunity to use PVALID characters to create misleading domain names.
>
> However, it does seem useful to make sure that inclusion of a potentially
> confusing block of Unicode characters is explicitly considered.
>
> In the case of IPA, despite the ample and clear potential for confusion, it
> is my understanding that Mark Davis has pointed out that some (many?) of
> these characters in the International Phonetic Alphabet are used in written
> African (others?) languages. If it were the case that these glyphs were used
> ONLY for phonetic representations, I would argue against their inclusion in
> the PVALID set of IDNA characters. But if it is correct that they are or are
> expected to be used in written languages, one can understand an argument for
> their inclusion. What is painful is the combinatoric effect these
> characters produce if one is to try to counter their abuse through treatment
> as variants (i.e. bundling, or other restrictive registration policies).
> Perhaps that is a price we have to pay for attempting to be open to
> including written languages not yet a part of the Unicode system?
>
> vint
>
> On Mar 9, 2009, at 8:57 AM, Erik van der Poel wrote:
>
>> I'm not sure why John hasn't responded to this, but let me give my own
>> reason for agreeing that this is an issue. Note that John said that
>> Greek small alpha and Latin small a must be treated as variants (i.e.
>> bundling), not mapping.
>>
>> John didn't mention keyboard input explicitly, but that is what I
>> thought of when I agreed. I.e. a user might accidentally type a Greek
>> A where a Latin A was "supposed" to be, and if the registrant wants
>> all users to reach their site no matter what keyboard accidents they
>> might make, then the registrant must perform a bundling operation to
>> make that work.
>>
>> My keyboard example may be a little contrived, but not outrageous, in
>> my opinion. John may have a different point of view or a different
>> reason for suggesting the bundling.
>>
>> Erik
>>
>> On Mon, Mar 9, 2009 at 1:25 AM, Martin Duerst <duerst at it.aoyama.ac.jp>
>> wrote:
>>>
>>> John said in an earlier mail
>>> (http://www.alvestrand.no/pipermail/idna-update/2009-March/003751.html,
>>> second to last paragraph) that he thinks that if we do mapping,
>>> we have to map all of upper and lower case Latin a and Greek alpha
>>> to the same thing.
>>>
>>> The only thing I want is to very, very strongly question the above.
>>>
>>> Of course, somebody will register AΑ, where the first is Latin
>>> and the second is Greek, e.g. on a third or fourth level, just
>>> because they can, but what I'm trying to say is that this is not
>>> a typical use case, and not one that we have to design mapping for
>>> (independent of whether mapping is part of the protocol
>>> (most probably not) or otherwise).
>>>
>>> Regards,   Martin.
>>>
>>>
>>> At 15:08 09/03/09, Patrik Fältström wrote:
>>>>> [...] μvolt. Can you give an example that makes a bit more
>>>>> sense than just "AA"?
>>>>
>>>> Martin, people will most certainly register this, "just because they
>>>> can". Because of this, I think the example is valid.
>>>>
>>>> You also have to remember that people do have an interest in mixing
>>>> scripts, for example various scripts and Latin.
>>>>
>>>> To limit the problems, we have two things in IDNA2008 that protect
>>>> against them:
>>>>
>>>> - We have defined what is a U-label and A-label, and because of this,
>>>> it is a very very clear signal what codepoints should be used. If we
>>>> also have mappings, fine, but it is clear that those characters are in
>>>> the gray area whether they should be used for example in publications.
>>>>
>>>> - For the most problematic situations, we have regular expressions that
>>>> limit the use of some codepoints that create real problems if they are
>>>> used in an unintended context.
>>>>
>>>> What more do you want? Do you want more regular expressions? Do you want
>>>> to reopen the discussion on mixing scripts again?
>>>>
>>>>   Patrik
>>>>
>>>>
>>>> _______________________________________________
>>>> Idna-update mailing list
>>>> Idna-update at alvestrand.no
>>>> http://www.alvestrand.no/mailman/listinfo/idna-update
>>>
>>>
>>> #-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
>>> #-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp
>>>
>
>