Nameprep and NFKC
patrik at frobbit.se
Wed Oct 13 07:43:38 CEST 2010
On 13 Oct 2010, at 06:34, Abdulrahman I. ALGhadir wrote:
> So as you mentioned new IDNA is using NFC over NFKC right?
IDNA2008 requires the string to be in NFC form, and then there are rules that say whether each code point is valid or not.
Please see RFC 5892.
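
As a rough illustration of why NFC versus NFKC matters, here is a minimal sketch using Python's standard unicodedata module. The choice of the "fi" ligature (U+FB01) as a test character is mine, not something from RFC 5892; it is just a convenient compatibility character that NFC leaves alone and NFKC maps away.

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI: a compatibility character picked
# purely for illustration.
ligature = "\ufb01"

# Normalize the same input under both forms.
forms = {form: unicodedata.normalize(form, ligature) for form in ("NFC", "NFKC")}

# NFC applies only canonical mappings, so the ligature is unchanged.
print(forms["NFC"] == ligature)
# NFKC also applies compatibility mappings, so it becomes plain "fi".
print(forms["NFKC"] == "fi")
```

Note that this only shows the normalization step. Whether a given code point is allowed in a label at all is decided separately by the rules in RFC 5892.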
> Well while I was at it testing some "precomposed forms" with their
> "decomposition forms"
> I found this case .
> the upper label supposed to be decomposition form of the lower one as it
> mentioned here :
> "0623 <Unicode> ARABIC LETTER ALEF WITH HAMZA ABOVE = 0627 <Unicode>
> 0654 <Unicode>"
> And to double check it I looked for the character properties here :
> and it showed "isNFC" flagged as "Yes"
> yet the two labels produced different Punycode under IDNA2008, BUT under
> IDNA2003 they produced the same code.
> Am I missing something here ?
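
The case above can be checked with a short sketch like the following, using only Python's standard library. The helper name is_nfc is hypothetical, and the built-in "punycode" codec does the raw encoding only; it performs no IDNA processing of any kind. One plausible reading of the result, which I have not verified against the tool used above, is that the decomposed label is simply not in NFC, so an IDNA2008 tool that does not normalize encodes it as-is, while IDNA2003 (Nameprep) normalized it first.

```python
import unicodedata

precomposed = "\u0623"       # ARABIC LETTER ALEF WITH HAMZA ABOVE
decomposed = "\u0627\u0654"  # ARABIC LETTER ALEF + ARABIC HAMZA ABOVE

def is_nfc(s):
    """Hypothetical helper: True if s is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", s) == s

# The single code point is already in NFC; the two-code-point sequence is not.
print(is_nfc(precomposed), is_nfc(decomposed))

# NFC composes the sequence into the single code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)

# Encoded without normalization, the two strings give different Punycode.
raw_a = precomposed.encode("punycode")
raw_b = decomposed.encode("punycode")
print(raw_a != raw_b)

# After NFC normalization both encode identically.
norm_b = unicodedata.normalize("NFC", decomposed).encode("punycode")
print(raw_a == norm_b)
```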
> P.S: the disclaimer isn't supposed to show-up anymore here BUT if it did
> I am really sorry about it and I'll have to contact the mail server
> admin again I guess =\ .
> -----Original Message-----
> From: John C Klensin [mailto:klensin at jck.com]
> Sent: 10/Oct/2010 9:35 PM
> To: Abdulrahman I. ALGhadir; idna-update at alvestrand.no
> Subject: Re: Nameprep and NFKC
> --On Sunday, October 10, 2010 3:50 PM +0300 "Abdulrahman I.
> ALGhadir" <aghadir at citc.gov.sa> wrote:
>> It is mentioned that the Nameprep profile will use the NFKC form
>> of the string, right? (as mentioned in section 4)
> Section 4 of what? Nameprep is now completely obsolete; there
> is no dependency on it at all in IDNA2008.
>> I am not sure if this is a valid example to begin with.
>> Imagine these two chars 'a' and '*': if they appeared in
>> sequence they would yield a unique display char. If those
>> two didn't have a normalized char (is it possible?) and
>> someone used the domain a*, what will happen if UNICODE
>> releases a normalized form later on?
> First of all, there are rather strict Unicode rules against most
> or all changes that would affect normalization. I'll let someone
> more expert than I am comment on the relationship between those
> rules and your example.
> I think by "normalized form", you mean what Unicode calls a
> "precomposed form", i.e., a single code point that represents
> the combination of 'a' and '*' ("a+*" below). My understanding
> is that such new code points are now discouraged entirely and
> that, if they are added, NFC (see below) is not changed to
> reflect the mapping you might expect. Instead, NFC is changed
> to _decompose_ the new "a+*" codepoint into the "a" and "*"
> combining sequence. This is one of the few advantages of using
> NFD over NFC -- the behavior of NFD should always be predictable
> (guessable) without one's needing to know the sequence in which
> characters were added to the standard. Again, someone more
> expert than I am may want to confirm or correct my understanding.
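
The behavior described above appears to correspond to what Unicode calls composition exclusions. A minimal sketch with Python's unicodedata module follows; the choice of U+0958 (DEVANAGARI LETTER QA) is my own example of an excluded character, not one taken from this thread, and it was excluded for script-specific reasons rather than because it was added late. It still shows the effect: NFC yields the combining sequence rather than the single code point.

```python
import unicodedata

qa = "\u0958"              # DEVANAGARI LETTER QA (a composition exclusion)
sequence = "\u0915\u093c"  # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA

# NFC does not keep the precomposed code point; it yields the sequence.
nfc_of_qa = unicodedata.normalize("NFC", qa)
print(nfc_of_qa == sequence)

# NFC does not compose the sequence back into the single code point either.
nfc_of_sequence = unicodedata.normalize("NFC", sequence)
print(nfc_of_sequence == sequence)

# NFD gives the same fully decomposed result.
nfd_of_qa = unicodedata.normalize("NFD", qa)
print(nfd_of_qa == sequence)
```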
> As far as IDNA is concerned, scenarios like the one you describe
> are among the reasons why the Standard uses the much less
> drastic NFC rather than NFKC and why it requires that input
> strings be in NFC-compliant form, rather than doing its own
> p.s. Please try to not send messages containing confidentiality
> statements to IETF mailing lists. They violate IETF rules and
> may either be ignored or discarded as a result.