U-labels, NFC, and symmetry

Mark Davis ☕ mark at macchiato.com
Sat Apr 16 00:08:14 CEST 2011

You could do isNFD with a DFA (too many
there) without much problem. It'd look something like this (I haven't
checked it in detail, tho’):

start + disallowed => FAIL
start + allowedStarter => start
start + EOS => SUCCESS

// for each of the distinct n values of CCC* (< 255), have the following
start + CCCn => state_n
state_n + disallowed => FAIL
state_n + allowedStarter => start
state_n + CCCm => state_m // for all m >= n
state_n + CCCm => FAIL // for all other m
state_n + EOS => SUCCESS

You could do a DFA for toNFC, but the table of transitions would be huge,
since you would have to split every set of transitions based on every class
of non-starter you were dealing with. For example:

<a CCC1 CCC1 CCC1 .... CCC55 ... umlaut> is not valid, but
<b CCC1 CCC1 CCC1 .... CCC55 ... umlaut> is valid.

So while theoretically feasible, in practice not. It is easier, faster, and
smaller to do this in code.


*— Il meglio è l’inimico del bene —*

On Fri, Apr 15, 2011 at 13:47, Bjoern Hoehrmann <derhoermi at gmx.net> wrote:

> * Mark Davis wrote:
> >NFD is a better internal format for some forms of processing, because it
> has
> >a smaller data-table footprint for the implementation, and is a bit
> faster.
> >Here is a simple example for an ASCII string (on my laptop, in Java, so
> >caveats apply). Where you are converting an NFD string to NFC, or vice
> >versa, then the times for that conversion go up.
> >
> >   isNFC: 50ns
> >
> >toNFC: 84ns
> >
> >isNFD: 21ns (-57.8%)
> >
> >toNFD: 84ns (+0%)
> The set of strings that are in NFC, and the set of strings that are in
> NFD, are both regular languages, so you can implement isNFC and isNFD as
> deterministic finite automata which would be pretty much optimal in in-
> struction count the only performance differences would come from cache
> and memory performance (you can do little tricks beyond that but you'd
> be in hardware-dependent optimization land in any case), so I think the
> figures are a bit misleading.
> I don't think this is particularily relevant to Peter Saint-Andre's con-
> cern though, if you are primarily interested in comparisons of few and
> short strings that repeat a lot, as you would be when, say, routing, it
> may be wiser to simply cache frequently used strings and their properly
> normalized forms, or XMPP could mandate a particular form, like, always
> send the U-Label, only do binary comparisons, which would remove all the
> pressure from the wire protocol implementations.
> I suppose it would be easier for us to make recommendations if we knew
> more about the underlying performance and interoperability concerns in-
> stead of discussing the properties of NFC and NFD without a good under-
> standing of those concerns.
> --
> Björn Höhrmann · mailto:bjoern at hoehrmann.de · http://bjoern.hoehrmann.de
> Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
> 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.alvestrand.no/pipermail/idna-update/attachments/20110415/8994c79e/attachment-0001.html>

More information about the Idna-update mailing list