U-labels, NFC, and symmetry

Ken Whistler kenw at sybase.com
Fri Apr 15 21:26:27 CEST 2011


On 4/15/2011 9:01 AM, John C Klensin wrote:
> I do remember Mark Davis telling the IDNABIS
> WG that testing for NFC conformance was appreciably less costly
> than converting to NFC.  I don't know if the same relationship
> would hold for NFD but, again, if the string to be compared to a
> stored one comes from a keyboard, it is more likely to be in NFC
> form than in NFD form.

Elaborating a bit on John's observation...

Yes, it is possible to implement a very efficient quick test to determine
whether a string is already in NFC form, without actually normalizing it.
And any Unicode normalization implementation worth its salt does so,
because the average performance gain over simply normalizing every
string to NFC is considerable.

Actually, the quick test is the opposite: it is a quick check to determine
whether a string is *not* in NFC (or potentially not in NFC) and thus
requires normalization to ensure that it *is* in NFC. The test works by
verifying that the string contains no characters with an NFC_Quick_Check
property value of "No" or "Maybe" and no combining marks with non-zero
combining classes. If the string contains any character that meets either
of those criteria, the check fails, and you proceed to normalize the
string. But such characters are rare in actual data and nearly
non-existent in data derived from legacy character sets.
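
As a rough illustration, here is a minimal Python sketch of that
conservative check. The NFC_QC_NO_OR_MAYBE set is a placeholder to be
populated from the DerivedNormalizationProps.txt file listed below;
unicodedata.combining() supplies the canonical combining class.

import unicodedata

# Placeholder set: code points whose NFC_Quick_Check value is "No" or
# "Maybe", as parsed out of DerivedNormalizationProps.txt (URL below).
NFC_QC_NO_OR_MAYBE = frozenset()

def may_need_nfc(s):
    # Conservative: True means the string *might* not be in NFC and
    # should be normalized; False means it is definitely in NFC.
    for ch in s:
        if ord(ch) in NFC_QC_NO_OR_MAYBE:
            return True
        if unicodedata.combining(ch) != 0:  # non-zero combining class
            return True
    return False

def to_nfc(s):
    # Pay for full normalization only when the quick check fails.
    return unicodedata.normalize("NFC", s) if may_need_nfc(s) else s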

The test can be very fast (and compact, too) if the relevant values,
which can be extracted from two UCD data files:

http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
http://www.unicode.org/Public/UNIDATA/extracted/DerivedCombiningClass.txt

are crunched down into a trie access table that needs to store only a
single bit per Unicode code point. (Actually less than a bit on average,
because such tables can be compressed by exploiting the fact that huge
swaths of the code space, the ideographic ranges for example, all share
the same property value for this test.)
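
A sketch of one way to lay out such a table, using a simple two-stage
lookup rather than a full trie (the block size and names are
illustrative, not drawn from any particular library). Identical
256-code-point blocks, such as the uniformly quick-check-"Yes"
ideographic ranges, share a single bitmap, which is where the
compression comes from.

BLOCK_SHIFT = 8                      # 256 code points per block
BLOCK_MASK = (1 << BLOCK_SHIFT) - 1

def build_tables(failing):
    # "failing" is the set of code points that fail the quick check
    # (NFC_QC of No/Maybe, or a non-zero combining class).
    stage1, stage2, seen = [], [], {}
    for block in range(0x110000 >> BLOCK_SHIFT):
        bits = bytearray(1 << (BLOCK_SHIFT - 3))  # one bit per code point
        for lo in range(1 << BLOCK_SHIFT):
            if ((block << BLOCK_SHIFT) | lo) in failing:
                bits[lo >> 3] |= 1 << (lo & 7)
        key = bytes(bits)
        if key not in seen:                       # dedupe identical blocks
            seen[key] = len(stage2)
            stage2.append(key)
        stage1.append(seen[key])
    return stage1, stage2

def fails_quick_check(cp, stage1, stage2):
    bits = stage2[stage1[cp >> BLOCK_SHIFT]]
    return bool(bits[(cp & BLOCK_MASK) >> 3] & (1 << (cp & 7)))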

ASCII data is always in NFC. Latin-1 data is also always in NFC, as
is data derived from most (but not all) of the ISO 8859 8-bit character
sets. Any Chinese data from GB 2312 or Big 5, and almost any Japanese
data from Shift-JIS implementations is also in NFC, and so on.
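
Python 3.8+ exposes exactly this kind of check in the standard library,
which makes the point easy to verify for, say, Latin-1:

import unicodedata

latin1 = bytes(range(0xA0, 0x100)).decode("latin-1")
print(unicodedata.is_normalized("NFC", latin1))  # True: already in NFC
print(unicodedata.is_normalized("NFD", latin1))  # False: e.g. é decomposes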

So the performance win from testing *whether* a string is already in
NFC, compared to actually normalizing every string, is very significant.
It can also save substantial overhead in allocating and/or storing
additional buffers to hold strings converted to NFC for transient
comparisons.

It is possible to build a comparable quick test to check whether strings
are *not* in NFD form (or potentially not in NFD form), and thus need to
be normalized, but the relative performance gains for this are very much
less than for the NFC test, because many, many more strings will fail
that quick test and require normalization anyway to ensure they are in NFD.
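
A concrete example of that asymmetry, again using Python's built-in
check: typical keyboard input arrives precomposed, so it passes the NFC
quick check but fails the NFD one.

import unicodedata

s = "caf\u00e9"                                # "café" as typed (U+00E9)
print(unicodedata.is_normalized("NFC", s))     # True: no work needed
print(unicodedata.is_normalized("NFD", s))     # False: must decompose
print(ascii(unicodedata.normalize("NFD", s)))  # 'caf' + 'e' + '\u0301'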

--Ken


