display of RightToLeft chars in localparts and hostnames

John C Klensin klensin at jck.com
Thu Dec 7 23:11:59 CET 2006



--On Friday, 08 December, 2006 06:54 +0900 Soobok Lee
<lsb at lsb.org> wrote:

> On Thu, Dec 07, 2006 at 02:20:49PM -0500, John C Klensin wrote:
>> Hi.
>> 
>> I've had it pointed out to me that I got confused as a
>> consequence of very similar discussions on two separate
>> mailing lists and replied to this list with a comment that
>> should have been directed to the other.   
> 
> The "[EAI]" in the subject in this thread might have confused
> you. I am sorry for that. :-0 

Yes, that was part of it.   No long-term harm done.

>...
>> (1) We do as little mapping as possible.  NFC-type mapping is
>> unavoidable to make different representations of the same
>> character compare equal.  Case mapping is unavoidable to
>> prevent astonishment between the way IDNs are handled and the
>> way basic ASCII domain name labels are handled.  
>> Anything else, especially anything that involves a
>> compatibility mapping, is prohibited (i.e., the character
>> that would map to another one is prohibited entirely, since
>> it would not appear when reverse-mapped from the DNS storage
>> form back to a conventional Unicode sequence).
> 
> But compatibility mappings sometimes map FullWidthChar ->
> HalfWidthChar, both of which have the same basic glyph.
> For example, U+FF21 (FullWidth A) ==> A (U+0041) under NFKC
> and NFKD, while NFC and NFD leave it unchanged.

That is correct.    Having them assigned separate code points
unfortunately violates the basic Unicode principle that, if the
character is the same, and even has the same glyph, it should
get only one code point.  Unfortunately, I gather it is
consistent with another Unicode principle of trying to follow
the conventions in existing national character sets.  Life
becomes hard when one tries to balance such often-contradictory
principles.
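
The fullwidth example quoted above can be checked directly. This is a minimal sketch, assuming Python and its standard unicodedata module (chosen here only for illustration; nothing in IDNA depends on it):

```python
import unicodedata

fullwidth_a = "\uff21"  # FULLWIDTH LATIN CAPITAL LETTER A

# Apply each normalization form and show the resulting code points.
# Only the compatibility forms (NFKC, NFKD) fold U+FF21 to U+0041.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    result = unicodedata.normalize(form, fullwidth_a)
    print(form, [f"U+{ord(ch):04X}" for ch in result])
```

The canonical forms leave the fullwidth character alone, which is why a "compatibility mapping" is a separate decision from NFC-type mapping.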

However, for IDN purposes, it has seemed to us that there is
another useful principle.  If ToUnicode(ToASCII(codepoint)) is
not equal to the codepoint itself, we have a real, and serious,
opportunity for confusion, whether that confusion leads directly
to spoofing opportunities or not.
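
The round-trip test can be sketched as follows. This assumes Python's built-in "idna" codec, which implements the IDNA2003 ToASCII and ToUnicode operations (including nameprep); the helper name round_trips is hypothetical and exists only for this illustration:

```python
def round_trips(label: str) -> bool:
    """True if ToUnicode(ToASCII(label)) gives back the original label."""
    try:
        ascii_form = label.encode("idna")   # ToASCII, nameprep included
        return ascii_form.decode("idna") == label
    except UnicodeError:
        return False  # not encodable at all, so no round trip either

# A lowercase, NFC-normalized label survives; the fullwidth letter
# comes back as a different character.
for label in ("b\u00fccher", "\uff21"):
    print(repr(label), round_trips(label))
```

Under the principle described above, a code point for which such a check fails is a candidate for prohibition rather than for silent mapping inside the protocol.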

As I have said in more general terms, a user interface that
deals with half width and full width characters separately may
well want to map them to whatever IDNA accepts.  If they are
normally considered equivalent in that operating system or
environment, to do otherwise would, IMO, represent bad judgment.
But IDNA should accept _only_ whatever will actually get mapped
into DNS storage and mapped back out again.

>> And "map to nothing" cases are prohibited
>> because they, well, map to nothing and don't reverse-map
>> either. Either the characters / code points are nothing, and
>> so we don't need them, or they carry information, such as
>> impacting presentation, so discarding them is dangerous.
> 
> Do you mean that past "map to nothing" candidates are to be
> "prohibited" or "allowed"?

If a character is a candidate for mapping to nothing, then it
should be, IMO, prohibited.   If it actually carries information
that significantly impacts presentation, then we should see if
there is a way to accommodate it (hence the discussion about
zero-width breaks) but should, under no circumstances, map it to
nothing.
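
To make the "map to nothing" behavior of IDNA2003 concrete, here is a small sketch. It assumes Python's standard stringprep module (which exposes the RFC 3454 tables) and the built-in "idna" codec; the sample label is made up for the illustration:

```python
import stringprep

soft_hyphen = "\u00ad"        # SOFT HYPHEN
zero_width_joiner = "\u200d"  # ZERO WIDTH JOINER

# Table B.1 of RFC 3454 lists the characters "commonly mapped to nothing".
print(stringprep.in_table_b1(soft_hyphen))
print(stringprep.in_table_b1(zero_width_joiner))

# Under IDNA2003 the soft hyphen silently disappears, so the label
# that comes back out is not the label that went in.
label = "ab" + soft_hyphen + "cd"
encoded = label.encode("idna")
print(encoded, encoded.decode("idna") == label)
```

The information carried by the discarded character is unrecoverable from the stored form, which is the danger the paragraph above describes.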

>> (2) We try to keep this as simple as possible.  One of the
>> frequently-repeated complaints about IDNA(2003) is that no one
>> can predict what it will actually permit or not from
>> principles -- one must either hand-execute the algorithms or
>> use a computer program to do that.  Not a good basis for a
>> standard.  It also leads to the belief that IDNA is just too
>> complicated and should be replaced by, e.g., UTF-8 in the DNS
>> (which would turn out not to be any less complicated, because
>> these rules are about character acceptability and mapping,
>> not about the final coding/decoding stage)
 
> Right. The previously proposed UTF-8 DNS and IDNA share the
> same "stringprep"ed Unicode string output. The choice between
> UTF-8 and Punycode is merely an encoding issue with respect to
> backward compatibility for various RFC protocols.

Exactly.
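
As an illustration of that point, the same prepared string can be carried in either encoding and recovered unchanged. This is a sketch using Python's built-in "utf-8" and "punycode" codecs; the sample label is an assumption and is taken to be already stringprep'ed:

```python
prepared = "b\u00fccher"  # assumed to be the output of stringprep

as_utf8 = prepared.encode("utf-8")
as_punycode = prepared.encode("punycode")

# Both encodings decode back to the identical Unicode string, so the
# acceptability and mapping rules are needed regardless of which is used.
from_utf8 = as_utf8.decode("utf-8")
from_punycode = as_punycode.decode("punycode")

print(as_utf8, b"xn--" + as_punycode)
print(from_utf8 == from_punycode == prepared)
```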

>> Now, as we have discussed in the email i18n context, a user
>> interface may well have reason to do mappings of various sorts
>> prior to getting near IDNA.  That may be sensible and we
>> certainly expect it in some cases.  But keeping it out of the
>> protocol makes the protocol less complex and makes it much
>> more clear what forms of a domain name can be incorporated
>> into URLs (and URIs and IRIs generally) and passed between
>> systems.
> 
> For example, (RL*) (LR*) peeling off can remain out of the
> protocol. The main point is that whether or not we enforce a
> single strict display order between (bidi) IDN labels *MUST*
> be addressed clearly in the new IDNA200x.

I don't see how any IETF protocol can "enforce" anything about
user interfaces and therefore display order.  We can only
specify and require network order.  Anything else... well, we
can make recommendations, but that is about all.

> (network order), (input time order), and (display order) are
> different from one another in the case of
> bidiLocal@bidiIDN.bidiIDN.com, as I posted in previous mails
> to this list.

Yes, I think it is generally understood there may be differences.
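
A rough sketch of why the orders can differ: the string is stored and transmitted in network (logical) order, while the bidirectional class of each character is what a display engine uses to reorder it. The address below is hypothetical (Hebrew letters written as escapes so the source itself is not reordered), and the helper names are made up for the illustration; only Python's standard unicodedata module is assumed:

```python
import unicodedata

# Hypothetical bidi address in network order: local@label.label.com
address = "\u05d0\u05d1@\u05d2\u05d3.\u05d4\u05d5.com"

def bidi_classes(text):
    """Return (code point, bidi class) pairs in network order."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]

def has_rtl(label):
    """True if the label contains any right-to-left (R or AL) character."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in label)

local, _, domain = address.partition("@")
labels = domain.split(".")

print(bidi_classes(address))
print([has_rtl(part) for part in [local] + labels])
```

The "@" and "." separators have neutral or weak classes, so their displayed position depends on the surrounding strong characters; that is a property of the rendering environment, not something the network order records.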

regards,
    john


