Reserved general punctuation

Mark Davis mark.davis at icu-project.org
Wed Apr 30 18:16:37 CEST 2008


The formal name is Noncharacter_Code_Point.

Note that there was a one-time cleanup of the Default Ignorable Code Point
values in Unicode 5.1.0, specifically to get it into good shape for IDNA (
http://www.unicode.org/versions/Unicode5.1.0/ - see "Rendering Default
Ignorable Code Points" and the section following). This changed the
composition, so if noncharacters are to be DISALLOWED, then they need to be
specifically mentioned. Functionally, it doesn't make a lot of difference,
since the Noncharacter_Code_Point values are immutable, and will always be
unassigned (gc=Cn), so they will never be part of valid labels. But they can
be excluded explicitly by making Noncharacter_Code_Point DISALLOWED, and for
consistency I'd recommend doing that in the tables
document. BTW, here are the code points:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
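
For anyone who wants to check this locally, here is a minimal Python sketch of
the set. It relies on the fact that the noncharacters are U+FDD0..U+FDEF plus
the last two code points of each of the 17 planes (66 in all). The helper name
status_hint and its return values are only an illustration of the
recommendation above, not the rule text of the tables document, and the
UNASSIGNED answers depend on the Unicode version of the Python build.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF


def is_noncharacter(cp: int) -> bool:
    """Return True if cp has Noncharacter_Code_Point=True."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


NONCHARACTERS = [cp for cp in range(MAX_CODE_POINT + 1) if is_noncharacter(cp)]


def status_hint(cp: int):
    """Hypothetical helper (an assumption, not from the tables document)."""
    if is_noncharacter(cp):
        return "DISALLOWED"   # immutable, so it can never become valid
    if unicodedata.category(chr(cp)) == "Cn":
        return "UNASSIGNED"   # may be assigned in a later Unicode version
    return None               # assigned: left to the other rules


if __name__ == "__main__":
    print(len(NONCHARACTERS))      # 66
    print(status_hint(0xFFFF))     # DISALLOWED
```

Checking noncharacters before the gc=Cn test matters, because every
noncharacter is also gc=Cn and would otherwise fall into UNASSIGNED.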

That immutability is part of the Unicode stability policies: see
http://www.unicode.org/policies/stability_policy.html#Property_Value. Note
that there were some additional constraints on property value stability
added in conjunction with Unicode 5.1, largely for IDNA. Note, however, that
the property value table is organized not by when each policy became
effective, but by the earliest version for which the policy holds. That is,
if a policy was imposed in the Unicode 5.0 timeframe, but actually held
for every version at or after Unicode 3.0, then it is listed under 3.0+.

Note that additional code points were also made Deprecated: see
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Deprecated=True:]
(Deprecated does not mean removed -- characters are never removed or moved
-- but means that their use is strongly discouraged.) None of the additions
affect the tables document.

Mark

On Wed, Apr 30, 2008 at 8:45 AM, Erik van der Poel <erikv at google.com> wrote:

> The trouble with that 2nd definition is that UnicodeData.txt does not
> contain "noncharacters" such as U+FFFF. I would prefer IDNA's
> "UNASSIGNED" to exclude Unicode "noncharacters", since they will never
> be reassigned to a different meaning.
>
> I suspect Ken would know how to state this properly. (Thanks in
> advance for any input you may provide.)
>
> Erik
>
> On Wed, Apr 30, 2008 at 7:50 AM, Paul Hoffman <phoffman at imc.org> wrote:
> > At 12:08 PM +0200 4/30/08, Patrik Fältström wrote:
> >
> > > On 28 apr 2008, at 16.21, Paul Hoffman wrote:
> > >
> > >
> > > > I'm not suggesting changing the defined marks; just making
> > > > 2064..2069 UNASSIGNED.
> > > >
> > >
> > > One view could be that as the block 2065..2069 is defined as
> > > Other_Default_Ignorable_Code_Point, why would it not be DISALLOWED?
> > > Because when the codepoint is assigned, this might change?
> > >
> > > Another view that all unassigned codepoints (as defined by not being
> > > defined in UnicodeData.txt) are UNASSIGNED.
> > >
> > > What do you all on this list want? Today we are implementing the
> > > first.
> > >
> >
> >  The danger with implementing the first is that the Unicode Consortium
> >  folks can easily change the boundaries of
> >  Other_Default_Ignorable_Code_Point if they really want a non-ignorable
> >  code point to be at a certain position for some bureaucratic or
> >  aesthetic reason. We in the IETF do that in some of our IANA registries.
> >
> >  I think the second may be safer.
> >
> >
> >  _______________________________________________
> >  Idna-update mailing list
> >  Idna-update at alvestrand.no
> >  http://www.alvestrand.no/mailman/listinfo/idna-update
> >



-- 
Mark