Nameprep and NFKC

Wed Oct 13 06:34:24 CEST 2010

So, as you mentioned, the new IDNA uses NFC rather than NFKC, right?
While I was testing some "precomposed forms" against their
"decomposition forms", I found this case:

http://unicode.org/cldr/utility/idna.jsp?a=\u0627\u0654%0D%0A\u0623%0D%0A

The upper label is supposed to be the decomposed form of the lower one, as
mentioned here:

http://unicode.org/charts/PDF/U0600.pdf

"0623 ARABIC LETTER ALEF WITH HAMZA ABOVE = 0627 0654"

To double-check, I looked up the character properties here:

http://unicode.org/cldr/utility/character.jsp?a=0623&B1=Show

and it showed "isNFC" flagged as "Yes".

Yet the two labels produced different Punycode under IDNA2008, while under
IDNA2003 they produced the same Punycode.
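
For reference, the same comparison can be sketched locally. This is only a
rough check, assuming Python 3.8 or later with its standard unicodedata
module and built-in "punycode" codec; it is not the tool behind the links
above, and the variable names are made up for illustration.

```python
import unicodedata

# The two labels from the test case (names are illustrative only).
decomposed = "\u0627\u0654"   # ARABIC LETTER ALEF + ARABIC HAMZA ABOVE
precomposed = "\u0623"        # ARABIC LETTER ALEF WITH HAMZA ABOVE

# Without any normalization the strings differ, so their Punycode differs too.
raw_differs = decomposed.encode("punycode") != precomposed.encode("punycode")

# The "isNFC" quick check: is each string already in NFC?
precomposed_is_nfc = unicodedata.is_normalized("NFC", precomposed)
decomposed_is_nfc = unicodedata.is_normalized("NFC", decomposed)

# After applying NFC, the decomposed label becomes the precomposed one.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

print(raw_differs, precomposed_is_nfc, decomposed_is_nfc, same_after_nfc)
```

If the raw strings differ but agree after NFC, that would be consistent with
a tool that normalizes its input for IDNA2003 but leaves it untouched for
IDNA2008.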

Am I missing something here?

AbdulRahman,

P.S.: The disclaimer isn't supposed to show up here anymore, but if it did,
I am really sorry about it; I guess I'll have to contact the mail server
admin again =\ .

-----Original Message-----
From: John C Klensin [mailto:klensin at jck.com] 
Sent: 10/Oct/2010 9:35 PM
To: Abdulrahman I. ALGhadir; idna-update at alvestrand.no
Subject: Re: Nameprep and NFKC

--On Sunday, October 10, 2010 3:50 PM +0300 "Abdulrahman I.
ALGhadir" <aghadir at citc.gov.sa> wrote:

> As mentioned, the Nameprep profile will use the NFKC form
> of the string, right? (as mentioned in section 4)

Section 4 of what?  Nameprep is now completely obsolete; there
is no dependency on it at all in IDNA2008.

> I am not sure if this is a valid example to begin with.
> 
> Imagine two chars, 'a' and '*', that yield a unique display
> char when they appear in sequence. If those two didn't have a
> normalized char (is that possible?) and someone used the
> domain a*, what will happen if UNICODE releases a normalized
> form later on?

First of all, there are rather strict Unicode rules against most
or all changes that would affect normalization. I'll let someone
more expert than I am comment on the relationship between those
rules and your example.

I think by "normalized form", you mean what Unicode calls a
"precomposed form", i.e., a single code point that represents
the combination of 'a' and '*' ("a+*" below).  My understanding is
that such new code points are now discouraged entirely and
that, if they are added, NFC (see below) is not changed to
reflect the mapping you might expect.  Instead, NFC is changed
to _decompose_ the new "a+*" codepoint into the "a" and "*"
combining sequence.   This is one of the few advantages of using
NFD over NFC -- the behavior of NFD should always be predictable
(guessable) without one's needing to know the sequence in which
characters were added to the standard.  Again, someone more
expert than I am may want to confirm or correct my understanding.
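
As a rough illustration of that behavior (a sketch only, assuming Python's
standard unicodedata module): U+2ADC FORKING is, as far as I know, one real
precomposed code point that is excluded from composition in this way. It is
my choice of example, not something from the question, and the helper name
code_points is made up.

```python
import unicodedata

def code_points(s):
    """Return the code points of s as 'U+XXXX' strings (illustrative helper)."""
    return [f"U+{ord(c):04X}" for c in s]

forking = "\u2adc"  # FORKING, a single precomposed code point

# NFC does not keep the precomposed form here; it yields the
# two-code-point sequence U+2ADD U+0338 instead.
nfc = unicodedata.normalize("NFC", forking)

# NFD yields the same sequence, as it would for any precomposed character.
nfd = unicodedata.normalize("NFD", forking)

print(code_points(nfc))
print(code_points(nfd))
```

For a character like this, NFC and NFD agree, whereas for older precomposed
characters NFC keeps the single code point and only NFD decomposes it.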

As far as IDNA is concerned, scenarios like the one you describe
are among the reasons why the Standard uses the much less
drastic NFC rather than NFKC and why it requires that input
strings be in NFC-compliant form, rather than doing its own
normalization.  

    john

p.s. Please try to not send messages containing confidentiality
statements to IETF mailing lists.  They violate IETF rules and
may either be ignored or discarded as a result.
