Change of the algorithm

Mark Davis mark.davis at icu-project.org
Thu Mar 20 10:10:46 CET 2008


1. I just did a test of the 05 tables (against Unicode 5.0).

Other than the Cf issue, I found one other thing. There are <reserved>
characters (that is, General_Category=Cn) that show up as DISALLOWED when
they shouldn't.

2064..2069  ; DISALLOWED  # <reserved>..<reserved>
...

I believe the reason is that they are Default_Ignorable. But
General_Category=Cn should take precedence.


2. In addition, this thread reinforces my opinion that the current use of
lettered categories is needlessly obscure. A reader can't make heads or
tails of the rules without carefully deciphering the cryptic Category A, B,
..., J, leafing back and forth in the document. As remarked before, there is
no need to use FORTRAN-style single letters when we can use whole-word
labels for clarity.

That is, I think the rules should be something understandable. We can get
the same results by casting the rules into the following form:

CONTEXTJ = Join_Controls

CONTEXTO = Context_Exceptions

UNASSIGNED = Unassigned_Code_Points

PVALID = Letters_Marks_Numbers
       - Not_Stable_Under_NFKC_Case_Folding
       - Default_Ignorables
       - Block_Exceptions
       - CONTEXTJ
       - CONTEXTO
       + PValid_Exceptions
       - Disallowed_Exceptions

DISALLOWED = <everything else>

This gives the same results as the 05 formulation (with the exceptions given
at the top: the unassigned 2064..2069,...), and is much more understandable.
You can see at a glance where each category is making a contribution,
instead of needing to follow a chain of logic. The necessary changes to the
05 text are not that large:

1. Renaming your categories:

A => Letters_Marks_Numbers
B => Not_Stable_Under_NFKC_Case_Folding
C => Default_Ignorables
D => Block_Exceptions
E+F.1+G => PValid_Exceptions
F.2 => Context_Exceptions
H => Join_Controls
I => (not necessary, as per discussion of Cf)
J => Unassigned_Code_Points

2. Making certain changes to the text in the category sections.

   - Split F into two parts, depending on where the characters go.
   - Merge E, part of F and G into PValid_Exceptions. If you really
   wanted to keep E and G separate you could, as PValid_ASCII_Exceptions and
   PValid_Compatibility_Exceptions respectively.
   - Also remove noncharacters and whitespace from C. Since those can
   never be in Letters_Marks_Numbers anyway, they don't need to be subtracted.
   - Add a new section Disallowed_Exceptions. Currently empty, but could
   have exceptions (such as grandfathered characters) in the future.
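
The word-label formulation above can be sketched as plain set arithmetic.
This is only an illustration under stated assumptions: the set memberships
below are toy placeholders (a handful of code points, not the 05 tables),
the Python names are hypothetical, and the order of the checks in
derived_property() is my assumption, chosen so that an unassigned code
point that is also Default_Ignorable (such as one in 2064..2069 under
Unicode 5.0) comes out UNASSIGNED rather than DISALLOWED.

```python
# Toy stand-ins for the named sets. Memberships are illustrative only
# and are NOT the actual tables from the draft.
JOIN_CONTROLS = {0x200C, 0x200D}            # ZWNJ, ZWJ
CONTEXT_EXCEPTIONS = {0x00B7}               # placeholder member
UNASSIGNED_CODE_POINTS = {0x2065}           # reserved (Cn) placeholder
LETTERS_MARKS_NUMBERS = {0x0041, 0x0061, 0x0031}
NOT_STABLE_UNDER_NFKC_CASE_FOLDING = {0x0041}
DEFAULT_IGNORABLES = {0x00AD, 0x200C, 0x200D, 0x2065}
BLOCK_EXCEPTIONS = set()
PVALID_EXCEPTIONS = {0x002D}                # placeholder member
DISALLOWED_EXCEPTIONS = set()               # currently empty

CONTEXTJ = JOIN_CONTROLS
CONTEXTO = CONTEXT_EXCEPTIONS
UNASSIGNED = UNASSIGNED_CODE_POINTS

# PVALID = Letters_Marks_Numbers minus the subtracted sets,
# plus PValid_Exceptions, minus Disallowed_Exceptions.
PVALID = (
    (
        LETTERS_MARKS_NUMBERS
        - NOT_STABLE_UNDER_NFKC_CASE_FOLDING
        - DEFAULT_IGNORABLES
        - BLOCK_EXCEPTIONS
        - CONTEXTJ
        - CONTEXTO
    )
    | PVALID_EXCEPTIONS
) - DISALLOWED_EXCEPTIONS


def derived_property(cp):
    """Return the derived property value for one code point."""
    if cp in CONTEXTJ:
        return "CONTEXTJ"
    if cp in CONTEXTO:
        return "CONTEXTO"
    if cp in UNASSIGNED:
        return "UNASSIGNED"
    if cp in PVALID:
        return "PVALID"
    return "DISALLOWED"  # everything else
```

With these toy sets, each named set contributes in exactly one visible
place, so changing (say) Disallowed_Exceptions only requires adding
members to one set.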

Mark


On Wed, Mar 19, 2008 at 10:55 AM, Paul Hoffman <phoffman at imc.org> wrote:

> In summary:
>
> a) The desired change was "Make sure everything in {Cf}, category I, is
> disallowed".
>
> b) We discovered that what was really wanted was "Make sure everything in
> {Cf}, category I, other than those with property Join_Control, category H,
> is disallowed". That leaves ZWNJ and ZWJ allowed with contextual rules.
>
> Patrik's first pass at the change did (a), not (b). That proposed change
> was:
>
> At 5:02 PM -0400 3/15/08, Patrik Fältström wrote:
> >OLD:
> >  o  If the codepoint is in Category H (Section 2.2.4), the value is
> >     CONTEXTJ.
> >  o  If the codepoint is in Category I (Section 2.2.5), the value is
> >     CONTEXTO.
> >  o  If the codepoint is in Category B (Section 2.1.2), the value is
> >     DISALLOWED.
> >  o  If the codepoint is in Category C (Section 2.1.3), the value is
> >     DISALLOWED.
> >  o  If the codepoint is in Category D (Section 2.1.4), the value is
> >     DISALLOWED.
> >
> >NEW:
> >  The algorithm to calculate the value of the derived property is as
> >  follows.
> >  o  If the codepoint is in Category B (Section 2.1.2), the value is
> >     DISALLOWED.
> >  o  If the codepoint is in Category C (Section 2.1.3), the value is
> >     DISALLOWED.
> >  o  If the codepoint is in Category D (Section 2.1.4), the value is
> >     DISALLOWED.
> >  o  If the codepoint is in Category H (Section 2.2.4), the value is
> >     CONTEXTJ.
> >  o  If the codepoint is in Category I (Section 2.2.5), the value is
> >     CONTEXTO.
>
> A different fix, one that achieves (b), would instead be:
>
> NEWER-YET:
>  The algorithm to calculate the value of the derived property is as
>  follows.
>  o  If the codepoint is in Category H (Section 2.2.4), the value is
>     CONTEXTJ.
>  o  If the codepoint is in Category B (Section 2.1.2), the value is
>     DISALLOWED.
>  o  If the codepoint is in Category C (Section 2.1.3), the value is
>     DISALLOWED.
>  o  If the codepoint is in Category D (Section 2.1.4), the value is
>     DISALLOWED.
> That is, check H before B, C, and D. Also, note that we no longer need
> category I at all, because it is a subset of C. Thus, the entire category I
> can be removed.
>
> Can someone else check my work here?
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>
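
The ordering point in the quoted message (check H before B, C, and D) can
be illustrated with a first-match-wins rule list. This is a rough sketch,
not the draft's algorithm: the category memberships are toy placeholders,
classify() and the rule lists are hypothetical names, and the rest of the
rules are omitted (code points that match none of these rules fall through
to a marker value).

```python
# Toy category memberships; illustrative only, not the draft's tables.
CATEGORY_H = {0x200C, 0x200D}           # Join_Control: ZWNJ, ZWJ
CATEGORY_B = {0x0041}                   # placeholder member
CATEGORY_C = {0x00AD, 0x200C, 0x200D}   # placeholder members
CATEGORY_D = set()                      # placeholder: empty


def classify(cp, rules):
    """Return the value of the first rule whose set contains cp."""
    for members, value in rules:
        if cp in members:
            return value
    return "NOT_DECIDED_HERE"  # later rules of the algorithm are omitted


# Order from the first proposed change: B, C, D are checked before H.
RULES_BCD_FIRST = [
    (CATEGORY_B, "DISALLOWED"),
    (CATEGORY_C, "DISALLOWED"),
    (CATEGORY_D, "DISALLOWED"),
    (CATEGORY_H, "CONTEXTJ"),
]

# Order from the alternative fix: H is checked before B, C, D.
RULES_H_FIRST = [
    (CATEGORY_H, "CONTEXTJ"),
    (CATEGORY_B, "DISALLOWED"),
    (CATEGORY_C, "DISALLOWED"),
    (CATEGORY_D, "DISALLOWED"),
]

zwnj_bcd_first = classify(0x200C, RULES_BCD_FIRST)
zwnj_h_first = classify(0x200C, RULES_H_FIRST)
```

Because ZWNJ sits in both the toy H and the toy C, the first ordering
classifies it as DISALLOWED and the second as CONTEXTJ, which is the
difference between (a) and (b) in the quoted summary.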



-- 
Mark