Update of RFC 2606 based on the recent ICANN changes?
lyman at acm.org
Thu Jul 3 22:05:14 CEST 2008
>> Is "сом" identical to "com"? (the first of these is U+0441
>> U+043E U+043C)
> The current principle is that it should be a "confusing string",
> which is vague enough to cover the case above (but perhaps not able to
> cover .co)
"Similarity" can be defined and tested, by setting thresholds and the
like, but "confusing" refers to a state of mind - something is
"confusing" if the people who are likely to encounter it consider it
to be confusing. There's no way to objectively define or test for
"confusing" similarity without reference to how actual people respond
to a particular string. That means either mining data collected from
circumstances in which people have mistaken one string for another
(perhaps a history of Google searches), or consulting a panel of real
people whenever it is necessary to decide whether or not two strings
are "confusingly" similar.
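
To illustrate the distinction, here is a minimal Python sketch of the kind of "similarity" test that can be defined with a threshold. Everything in it is an assumption made for illustration: the three-entry LOOKALIKES table, the skeleton() and similarity() helpers, and the 0.8 cut-off are hypothetical and not part of any ICANN procedure. It shows that "сом" and "com" are different code point sequences yet score as identical once look-alikes are folded, and that the score says nothing about whether real people would be confused.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical, hand-picked table mapping a few Cyrillic letters to the
# Latin letters they resemble; a real test would need a far fuller table.
LOOKALIKES = {"\u0441": "c", "\u043e": "o", "\u043c": "m"}

def skeleton(label: str) -> str:
    """Replace each known look-alike with its Latin counterpart."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in label)

def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1] comparing the skeletons of two labels."""
    return SequenceMatcher(None, skeleton(a), skeleton(b)).ratio()

THRESHOLD = 0.8  # assumed cut-off, chosen only for illustration

cyrillic = "\u0441\u043e\u043c"  # U+0441 U+043E U+043C

# The code points are Cyrillic, so the strings are not identical...
print([unicodedata.name(ch) for ch in cyrillic])
print(cyrillic == "com")

# ...but a threshold test on the folded forms treats them as similar.
print(similarity(cyrillic, "com") >= THRESHOLD)

# Whether ".co" is "similar" to ".com" depends entirely on the cut-off.
print(similarity("co", "com"), THRESHOLD)
```

A test like this is repeatable, but the choice of table and threshold is arbitrary; only data about how people actually respond could turn it into a test for "confusing".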
>>> (b) be identical to a Reserved Name;
>>> (c) consist of a single character;
>> I've heard it argued repeatedly that this is an unreasonable
>> rule for ideographic characters. I don't have an opinion, but
>> hope that ICANN has considered that case in full detail.
> This is where we dive into a discussion of what a "character" is. In
> ideograph-based languages, there isn't a concept of a "word".
> For example, Chinese, Japanese and Korean are actually "phonetic
> languages", and ideographic characters are used to express the
> phonetics. A "word", or more accurately a "morpheme", can be expressed in
> one or more ideographs. A single Latin character is unlikely to be
> useful by itself (except for "a" and "i"), but that's not the case in CJK.
> If the condition is "no single ASCII character", I may be neutral
> about it (since a single ideograph would never translate to a single
> ASCII character in the zonefile, due to the xn-- prefix) but if the
> "character" is defined more broadly to cover "U-label" character, then
> I would have strong objections.
At the moment, the condition is "no single Unicode code point." To
the extent that a single CJK ideograph can be expressed using a
single Unicode code point, this would represent the situation to
which you say you would object. I will dig through my notes to find
out why the "single character" condition was adopted -
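
For concreteness, the difference between the two readings of "single character" can be sketched in Python. This is only an illustration under stated assumptions: is_single_code_point() and to_a_label() are hypothetical helper names, and to_a_label() applies only the Punycode step (Python's built-in "punycode" codec, RFC 3492) plus the "xn--" prefix to a non-ASCII label, without the other IDNA processing steps.

```python
def is_single_code_point(u_label: str) -> bool:
    """True if the label is exactly one Unicode code point.

    Python 3 strings are sequences of code points, so len() counts them.
    """
    return len(u_label) == 1

def to_a_label(u_label: str) -> str:
    """Punycode-encode a non-ASCII label and prepend the ACE prefix.

    Simplified: no mapping or validity checks are performed, and the
    input is assumed to contain at least one non-ASCII character.
    """
    return "xn--" + u_label.encode("punycode").decode("ascii")

ideograph = "\u4e2d"  # one CJK ideograph, chosen as an example

# Under a "no single Unicode code point" rule this label is excluded...
print(is_single_code_point(ideograph))

# ...even though its form in the zone file is several ASCII characters.
a_label = to_a_label(ideograph)
print(a_label, len(a_label))
```

A rule phrased as "no single ASCII character" would be checked against the second form and would never exclude an ideograph; a rule phrased in terms of code points is checked against the first form and always would.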