Lookup & NFC

John C Klensin klensin at jck.com
Fri Mar 28 13:24:11 CET 2008

--On Friday, 28 March, 2008 14:22 +0900 Martin Duerst
<duerst at it.aoyama.ac.jp> wrote:

> At 03:10 08/03/28, John C Klensin wrote:
>> The more important answer is that the intent of the spec is
>> "if you need this mapping, it is your job to apply it before
>> you invoke IDNA".  Taking NFC as an example, let's assume we
>> have two operating systems,
>>       * One of them gets strings into NFC form as soon as they
>>       are typed and verifies that (and corrects them if
>>       necessary) any time they are loaded or otherwise
>>       examined.
>>       * The other lets users type strings and carries them
>>       around in whatever form they are typed, presumably
>>       unnormalized.
> This is in essence correct, but it implies that things mainly
> depend on how the user types them. This is very much NOT the
> case. Whether the user types some accents with modifier keys
> (in some cases called dead keys), some shift combination, or
> a predefined key for that accented character is independent
> of whether these characters enter the system as precomposed
> or depomposed. Microsoft and most Unix/Linux systems use
> precomposed characters, so that's what an application gets
> from the keyboard driver and related machinery. The Mac
> uses decomposed, so there, that's what you get.

Sorry.  I had inferred from Shawn's note that the Mac did not
normalize at all.  It makes much more sense that it adjusts
terminal input to decomposed form.  
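[Editor's illustration: a short Python sketch, using the standard-library
unicodedata module, of the distinction discussed above. The same visible
string can arrive precomposed (as on Windows and most Unix/Linux systems)
or decomposed (as on the Mac), and a raw code-point comparison treats the
two as different strings.]

```python
import unicodedata

# "cafe" with an acute accent, entered two ways:
#   precomposed: a single code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
#   decomposed:  U+0065 (e) followed by U+0301 (COMBINING ACUTE ACCENT)
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# The strings render identically but compare unequal code point by code point.
print(precomposed == decomposed)  # False

# Normalizing the decomposed form to NFC recovers the precomposed string.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```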

> Also, as far as I know (the Mac may be an exception), the
> data is not usually normalized or checked for normalization.
> In general, that is not necessary, because the keyboard driver
> already takes care of this. But if the user e.g. enters
> some non-normalized characters from a character picker or the like,
> then these enter the datastream as they are, unchecked.

It seems to me that this doesn't change my main point.  It also
explains some problems I think I've seen but have never been
able to figure out.   In deference to my co-author on the
recently-published RFC 5198, the community understood this
problem forty years ago. If there is going to be more than one
way to represent a given string --and one has that problem as
soon as more than one code point can be used to assemble a
character graphic, regardless of the method for doing so-- then
operating systems must have very clear rules about what the form
looks like _and_ must ensure that strings are forced into that
form as early as possible and again before any operation that
might involve a comparison or the equivalent.   NFC and NFD are
perfectly well-defined standard forms.  It is unfortunate that
there are two of them, but I understand, sympathize with, and
accept the reasons.  
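[Editor's illustration: NFC and NFD, as noted above, are both well-defined
normal forms, and either can serve as a canonical representation as long as
one is chosen consistently. A small Python sketch using the standard-library
unicodedata module; the sample word is the editor's, not from the thread.]

```python
import unicodedata

s = "\u00c5ngstr\u00f6m"  # "Angstrom" with ring and diaeresis; form on arrival may vary

nfc = unicodedata.normalize("NFC", s)  # composed: accented letters as single code points
nfd = unicodedata.normalize("NFD", s)  # decomposed: base letters plus combining marks

# NFD is longer in code points, since each combining mark is separate.
print(len(nfd) > len(nfc))  # True

# The two forms are deterministic and interconvertible, which is why
# either works as a canonical form -- provided everyone picks the same one.
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```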

But the bottom line, at least IMO, is that the protocol needs to
say "by the time the string gets here, it must be in NFC form,
because that is the only way comparisons will work".  The
particular implementation of an API on the relevant system has
to take whatever responsibility is necessary to be sure things
are in that form.  It can do that by knowing that something else
has already done it, or by checking that the string is in that
form, or by forcing the string into that form.  But, if it does
none of those things and the string is in some other form, the
lookup will almost certainly fail.
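
[Editor's illustration: the rule in the paragraph above, sketched in Python
with the standard-library unicodedata module. The registry dictionary and
lookup function are hypothetical stand-ins for whatever comparison the real
protocol performs; the point is only that forcing the string into NFC at the
API boundary makes the lookup succeed regardless of the form it arrived in.]

```python
import unicodedata

# Hypothetical registry whose keys are stored in NFC form, as the
# protocol requires ("by the time the string gets here, it must be
# in NFC form").
registry = {unicodedata.normalize("NFC", "b\u00fccher"): "192.0.2.1"}

def lookup(label):
    # Force the query into NFC before comparing.  Without this step, a
    # decomposed query would miss the NFC-keyed entry and the lookup
    # would fail even though the strings render identically.
    return registry.get(unicodedata.normalize("NFC", label))

# A decomposed query, as a Mac's input machinery might produce
# ("u" + COMBINING DIAERESIS), still matches the NFC-keyed entry:
print(lookup("bu\u0308cher"))  # 192.0.2.1
```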


More information about the Idna-update mailing list