Reserved general punctuation

Mark Davis mark.davis at icu-project.org
Wed Apr 30 23:33:05 CEST 2008


comments below

On Wed, Apr 30, 2008 at 1:53 PM, Paul Hoffman <phoffman at imc.org> wrote:

> At 1:38 PM -0700 4/30/08, Mark Davis wrote:
>
> > It *is* related to Noncharacters. Default_Ignorable_Code_Point is a
> > derived property. The code points that are unassigned (gc=Cn) but that
> > should be DISALLOWED are all and only the Noncharacters.
> >
>
> Then I'm really confused. From the new draft:
>
> 2.1.3.  IgnorableProperties (C)
>
>   C: property(cp) is in {Default_Ignorable_Code_Point, White_Space,
>                          Noncharacter_Code_Point}
>
>   This category is used to group codepoints that are not recommended
>   for use in identifiers.  In general, these codepoints are not
>   suitable for use for IDN.
>
>   The definition for Default_Ignorable_Code_Point can be found in
>   DerivedCoreProperties.txt [1] (and erratum of 2007-January-25 [2])
>   and is
>
>   Other_Default_Ignorable_Code_Point + Cf + Cc + Cs
>   + Noncharacter_Code_Point + Variation_Selector
>   - White_Space - FFF9..FFFB (Annotation Characters)
>

That text has not been updated to U5.1. As I said earlier:

"Note that there was a one-time cleanup of the Default Ignorable Code Point
values in Unicode 5.1.0, specifically to get it into good shape for IDNA (
http://www.unicode.org/versions/Unicode5.1.0/ - see "Rendering Default
Ignorable Code Points" and the section following). This changed the
composition, so if noncharacters are to be DISALLOWED, then they need to be
specifically mentioned. Functionally, it doesn't make a lot of difference,
since the Noncharacter_Code_Point values are immutable, and will always be
unassigned (gc=Cn), so they will never be part of valid labels. But they can
be specifically excluded by making Noncharacter_Code_Point be specifically
DISALLOWED, and for consistency I'd recommend doing that in the tables
document. BTW, here are the code points:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]"

The U5.1 definition is (
http://unicode.org/Public/UNIDATA/DerivedCoreProperties.txt):

# Derived Property: Default_Ignorable_Code_Point
#  Generated from
#    Other_Default_Ignorable_Code_Point
#  + Cf (Format characters)
#  + Variation_Selector
#  - White_Space
#  - FFF9..FFFB (Annotation Characters)
#  - 0600..0603, 06DD, 070F (exceptional Cf characters that should be visible)

Because the other changes (Cs, Cc, some Cf) are already excluded due to
other Categories, Noncharacters are the only change relevant to IDNAbis.
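
To make the point about Noncharacters concrete, here is a minimal sketch in Python. It assumes the standard-library unicodedata module and hard-codes the noncharacter ranges (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes). The names noncharacters and NONCHARS are illustrative only. It checks that every noncharacter reports gc=Cn.

```python
import unicodedata

def noncharacters():
    """Return the Noncharacter_Code_Point values as a sorted list."""
    cps = list(range(0xFDD0, 0xFDF0))           # U+FDD0..U+FDEF
    for plane in range(0x11):                    # planes 0 through 16
        base = plane << 16
        cps.extend((base | 0xFFFE, base | 0xFFFF))
    return sorted(cps)

NONCHARS = noncharacters()

# Noncharacters are permanently unassigned, so each should report gc=Cn.
all_cn = all(unicodedata.category(chr(cp)) == "Cn" for cp in NONCHARS)
print(len(NONCHARS), all_cn)
```

Because the set is immutable, a hard-coded list like this should not need updating for later Unicode versions.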

>
> Why have the whole list of things for "Default_Ignorable_Code_Point" if
> all we want is Noncharacter_Code_Point, which is already in the list for C?
> Why not have it at all?


It is not the only thing. Some of them are redundant (already put in
DISALLOWED via other Categories); the key ones are the Variation_Selector
characters.
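
One way to see why the Variation_Selector characters are not caught by a general-category rule is to look at their gc values. The sketch below assumes Python's unicodedata module and hard-codes the Variation_Selector ranges as they stood in Unicode 5.1 (U+180B..U+180D, U+FE00..U+FE0F, U+E0100..U+E01EF). VS_RANGES and variation_selectors are hypothetical names for illustration.

```python
import unicodedata

# Assumption: Variation_Selector ranges as of Unicode 5.1, hard-coded here.
VS_RANGES = [(0x180B, 0x180D), (0xFE00, 0xFE0F), (0xE0100, 0xE01EF)]

def variation_selectors():
    """Expand the inclusive ranges into a flat list of code points."""
    return [cp for lo, hi in VS_RANGES for cp in range(lo, hi + 1)]

VS = variation_selectors()

# Collect the distinct general categories; these are nonspacing marks (Mn),
# not Cf, Cc, Cs, or Cn.
categories = {unicodedata.category(chr(cp)) for cp in VS}
print(len(VS), categories)
```

Since these code points are Mn, a rule that only excludes control, format, surrogate, or unassigned code points would leave them in, which is consistent with listing the property explicitly.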

Does that help make things any clearer?

-- 
Mark