Combining mark vs combining character?

Kenneth Whistler kenw at sybase.com
Wed Jan 5 21:54:29 CET 2011


Simon asked:

> I need a clarification regarding this paragraph in section 4.2.3.2 of
> RFC 5891:
> 
>    The Unicode string MUST NOT begin with a combining mark or combining
>    character (see The Unicode Standard, Section 2.11 [Unicode] for an
>    exact definition).

Mark Davis suggested that this would better read:

     The Unicode string MUST NOT begin with a character having a General
     Category property value of Mark (M).
     
and I concur that that would be more precise.
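
As a rough illustration of the reworded rule (a sketch only, not text
from the RFC or the standard), the check could be written in Python with
the standard library's unicodedata module. The function name
begins_with_mark is hypothetical, and the result depends on the Unicode
version bundled with the interpreter:

```python
import unicodedata

def begins_with_mark(s):
    """Return True if the first code point of s has General Category M
    (Mn, Mc, or Me). The name is hypothetical, chosen for illustration."""
    if not s:
        # An empty string has no first character, so the rule cannot fire.
        return False
    # unicodedata.category() returns the two-letter General Category value;
    # every Mark value starts with "M".
    return unicodedata.category(s[0]).startswith("M")

# U+0301 COMBINING ACUTE ACCENT leading the string, then in second position.
print(begins_with_mark("\u0301a"))
print(begins_with_mark("a\u0301"))
```

A label that fails this check would be rejected under the suggested
wording; a label that merely contains a mark after its first character
would not.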

And to add to Mark Davis' clarification and respond further
to one of Simon's questions:

> There is one section 3.6 on "Combination" that gives the precise
> definition of a "Combining character":
> 
>    Combining character: A character with the General Category of
>    Combining Mark (M).

> 3) What is the precise definition of a "combining mark"?

In the Unicode Standard, "combining character" is the term
of art. That is the general term which is used throughout
the standard for referring to characters which combine.
Hence the normative definition D52 for "Combining character"
which Simon quotes from Section 3.6 of Unicode 5.0.

"Mark" is a property value alias which refers to any of the
three possible General Category values for a combining character
in the standard. This is defined by the following entries in
the UCD data file, PropertyValueAliases.txt:

gc ; M         ; Mark                             # Mc | Me | Mn
gc ; Mc        ; Spacing_Mark
gc ; Me        ; Enclosing_Mark
gc ; Mn        ; Nonspacing_Mark
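
To make the grouping concrete, here is a small Python sketch that prints
the General Category of one sample character for each of the three
values. The sample characters, the MARK_CATEGORIES set, and the is_mark
helper are my own choices for illustration and do not come from the UCD
file:

```python
import unicodedata

# The three General Category values that the alias M (Mark) groups together.
MARK_CATEGORIES = {"Mn", "Mc", "Me"}

# One sample character per value, picked for illustration only.
SAMPLES = ["\u0301", "\u0903", "\u20DD"]

def is_mark(ch):
    """Return True if the single character ch has gc=Mn, Mc, or Me."""
    return unicodedata.category(ch) in MARK_CATEGORIES

for ch in SAMPLES:
    # Print code point, character name, and General Category value.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: gc={unicodedata.category(ch)}")
```

Testing membership in the set {Mn, Mc, Me} is equivalent to testing
gc=M, since M is defined as exactly that union.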

"Combining mark" is an unofficial synonym for "combining character".
It occurs occasionally in Unicode-related documents, including
the text of the standard itself, because Unicode implementers
often talk about "spacing marks" and "nonspacing marks" and
"enclosing marks" and then treat the union of all those as
"combining marks" by force of habit in talking about "marks".

My advice for external standards referring to the Unicode
Standard would be to stick to "combining character", which is
and will remain the term with the normative definition in
the Unicode Standard. The best point of reference is
Section 3.6, "Combination", where this term and related
terms have their normative definitions in the standard.

--Ken



More information about the Idna-update mailing list