[lsb@lsb.org: [EAI] (summary) display of RightToLeft chars in localparts and hostnames]

Thu Dec 7 20:20:49 CET 2006

Hi.

It has been pointed out to me that I confused two very similar
discussions on two separate mailing lists and replied to this
list with a comment that should have gone to the other.  The
conclusion is the same for both -- "the idea is bad news" -- but
the reasoning for this list should have been different.  The
reply for this list is the one I started to write, then erased
and replaced with the other one (I got _really_ confused), so
the intended response is below.  Please ignore the note in this
thread sent at 12:51 -0500.

--On Thursday, 07 December, 2006 09:36 +0100 Harald Alvestrand
<harald at alvestrand.no> wrote:

> --On 7. desember 2006 13:01 +0900 Soobok Lee <lsb at lsb.org>
> wrote:
>
>> I found this section in stringprep2003:
>> 
>> <quote from section 5.8>
>>  5.8 Change display properties or are deprecated
>> 
>>    The following characters can cause changes in display or
>>    the order in which characters appear when rendered, or are
>>    deprecated in Unicode.
>> 
>>    200E; LEFT-TO-RIGHT MARK
>>    200F; RIGHT-TO-LEFT MARK
>>    202A; LEFT-TO-RIGHT EMBEDDING
>>    202B; RIGHT-TO-LEFT EMBEDDING
>>    202C; POP DIRECTIONAL FORMATTING
>>    202D; LEFT-TO-RIGHT OVERRIDE
>>    202E; RIGHT-TO-LEFT OVERRIDE
>>    206A; INHIBIT SYMMETRIC SWAPPING
>>    206B; ACTIVATE SYMMETRIC SWAPPING
>>    206C; INHIBIT ARABIC FORM SHAPING
>>    206D; ACTIVATE ARABIC FORM SHAPING
>> </quote>
>> 
>> My suggestion for the new stringprep200x is to move these chars
>>   to the "mapped to nothing" list.  That is, how about silently
>>   deleting them instead of prohibiting them and returning an
>>   error?
> 
> Any string that contains them will (one assumes) depend on
> their correct interpretation for correct display.
> 
> Mapping them out and letting people use the resulting string
> powerfully violates the principle of least astonishment; if I,
> for reasons of my own, choose to send in the string (in
> network order) <RLO> D N A R T S E V L A <RLO>, expecting to
> see the display ALVESTRAND, I will be astonished if the result
> is DNARTSEVLA.
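
Harald's example can be sketched in Python.  This is only an
illustration, not anything from stringprep itself: it assumes the
standard library's stringprep module (whose in_table_c8() tests
membership in the table quoted above), and the helper names
strip_c8 and prohibit_c8 are hypothetical.  The string is built
exactly as written in the example, in network order.

```python
import stringprep

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE, one of the characters quoted above


def strip_c8(text):
    """Hypothetical 'map to nothing' handling: silently drop the characters."""
    return "".join(ch for ch in text if not stringprep.in_table_c8(ch))


def prohibit_c8(text):
    """Prohibition: reject the whole string instead of altering it."""
    for ch in text:
        if stringprep.in_table_c8(ch):
            raise ValueError("prohibited character U+%04X" % ord(ch))
    return text


# The string from the example, in network (storage) order.
label = RLO + "DNARTSEVLA" + RLO

# Silent deletion leaves the letters in storage order, so a display
# that honoured the override no longer matches what comes back.
stripped = strip_c8(label)
print(repr(stripped))

# Prohibition reports the problem instead of changing the string.
try:
    prohibit_c8(label)
except ValueError as err:
    print("rejected:", err)
```

With deletion the sender gets no signal that the string changed;
with prohibition the failure is at least visible.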

In addition to Harald's answer, mapping these out violates two
principles we have been trying to use in sorting through which
characters are to be included and how they are handled.  Those
principles are debatable, but, if they are not clear in
"issues", I'd appreciate suggestions of specific text to make
them clear.

(1) We do as little mapping as possible.  NFC-type mapping is
unavoidable to make different representations of the same
character compare equal.  Case mapping is unavoidable to prevent
astonishment between the way IDNs are handled and the way basic
ASCII domain name labels are handled.  Anything else, especially
anything that involves a compatibility mapping, is prohibited
(i.e., the character that would map to another one is
prohibited entirely since it would not reappear when
reverse-mapped from the DNS storage form back to a conventional
Unicode sequence).  And "map to nothing" cases are prohibited
Unicode sequence).  And "map to nothing" cases are prohibited
because they, well, map to nothing and don't reverse-map either.
Either the characters / code points are nothing, and so we don't
need them, or they carry information, such as impacting
presentation, so discarding them is dangerous.
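
A rough Python sketch of the distinction, for illustration only:
it uses the standard unicodedata and stringprep modules, the
helper name minimal_map is hypothetical, and str.lower() stands
in for whatever case mapping the protocol actually specifies.

```python
import stringprep
import unicodedata


def minimal_map(label):
    """Hypothetical 'as little as possible' mapping: lowercase, then NFC."""
    return unicodedata.normalize("NFC", label.lower())


# NFC: two representations of the same character compare equal.
composed = "\u00e9"      # e with acute, precomposed
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
nfc_equal = minimal_map(composed) == minimal_map(decomposed)

# Case mapping: same behaviour as ASCII domain name labels.
case_equal = minimal_map("EXAMPLE") == minimal_map("example")

# Compatibility mapping: the fi ligature (U+FB01) collapses to "fi",
# and nothing in the result says which form the user typed.
ligature = "\ufb01"
compat = unicodedata.normalize("NFKC", ligature)
compat_lossy = compat == "fi" and compat != ligature

# "Map to nothing": ZERO WIDTH JOINER (U+200D) is in stringprep
# Table B.1; once dropped it cannot be reverse-mapped.
zwj_mapped_out = stringprep.in_table_b1("\u200d")

print(nfc_equal, case_equal, compat_lossy, zwj_mapped_out)
```

The first two mappings only merge forms that users already
consider identical; the last two discard something that cannot
be recovered from the stored form.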

(2) We try to keep this as simple as possible.  One of the
frequently-repeated complaints about IDNA(2003) is that no one
can predict what it will actually permit or not from principles
-- one must either hand-execute the algorithms or use a computer
program to do that.  Not a good basis for a standard.  It also
leads to the belief that IDNA is just too complicated and should
be replaced by, e.g., UTF-8 in the DNS (which would turn out not
to be any less complicated, because these rules are about
character acceptability and mapping, not about the final
coding/decoding stage).

Now, as we have discussed in the email i18n context, a user
interface may well have reason to do mappings of various sorts
prior to getting near IDNA.  That may be sensible and we
certainly expect it in some cases.  But keeping it out of the
protocol makes the protocol less complex and makes it much
clearer what forms of a domain name can be incorporated into URLs
(and URIs and IRIs generally) and passed between systems.

Again, if more text to the effect of the above needs to be added
to "issues" I'd welcome text or at least specific comments.

    john