Apostrophes in non-ASCII names (was: A proposed solution for
descriptions)
Karl Ove Hufthammer
karl at huftis.org
Wed Jun 28 12:25:45 CEST 2006
On Tuesday 27 June 2006 at 20:37, Ciarán Ó Duibhín wrote:
>Agreed, names are only informative. But the basic fact is that there is a
>processing need for two characters for this mark, one when it stands for
>elision (or possession), and another when it terminates a quote. These
>functions are normally called "apostrophe" and "right single quote" and it
>is fortunate (and unlikely to be coincidental) that Unicode contains
>distinct non-decomposable characters with those names. It will not do to
>encode both functions as the same character, as we (amongst others) are in
>danger of doing.
Yes, it will. U+2019 *is* the preferred character for the apostrophe used for
contraction and possession. Let me quote the relevant sections of the Unicode
Standard 4.1.0 "http://www.unicode.org/versions/Unicode4.0.0/ch06.pdf":
Encoding Characters with Multiple Semantic Values.
Some ASCII characters have multiple uses, either through ambiguity in the
original standards or through accumulated reinterpretations of a limited code
set. For example, 27₁₆ is defined in ANSI X3.4 as apostrophe (closing single
quotation mark; acute accent), and 2D₁₆ is defined as hyphen-minus. In
general, the Unicode Standard provides the same interpretation for the
equivalent code points, without adding to or subtracting from their
semantics. The Unicode Standard supplies unambiguous codes elsewhere for the
most useful particular interpretations of these ASCII values; the
corresponding unambiguous characters are cross-referenced in the character
names list for this block. For a complete list of space characters and dash
characters in the Unicode Standard, see “General Punctuation” later in this
section.
For historical reasons, U+0027 is a particularly overloaded character. In
ASCII, it is used to represent a punctuation mark (such as right single
quotation mark, left single quotation mark, apostrophe punctuation, vertical
line, or prime) or a modifier letter (such as apostrophe modifier or acute
accent). Punctuation marks generally break words; modifier letters generally
are considered part of a word. The preferred character for apostrophe is
U+2019, but U+0027 is commonly present on keyboards. In modern software, it
is therefore common to substitute U+0027 by the appropriate character in
input. In these systems, a U+0027 in the data stream is always represented as
a straight vertical line and can never represent a curly apostrophe or a
right quotation mark. For more information, see “Apostrophes” later in this
section.
Apostrophes
U+0027 APOSTROPHE is the most commonly used character for apostrophe. However,
it has ambiguous semantics and direction. When text is set, U+2019 RIGHT
SINGLE QUOTATION MARK is preferred as apostrophe. Word processors commonly
offer a facility for automatically converting the U+0027 APOSTROPHE to a
contextually selected curly quotation glyph.
Letter Apostrophe.
U+02BC MODIFIER LETTER APOSTROPHE is preferred where the apostrophe is to
represent a modifier letter (for example, in transliterations to indicate a
glottal stop). In the latter case, it is also referred to as a letter
apostrophe.
Punctuation Apostrophe.
U+2019 RIGHT SINGLE QUOTATION MARK is preferred where the character is to
represent a punctuation mark, as for contractions: “We’ve been here before.”
In this latter case, U+2019 is also referred to as a punctuation apostrophe.
An implementation cannot assume that users’ text always adheres to the
distinction between these characters. The text may come from different
sources, including mapping from other character sets that do not make this
distinction between the letter apostrophe and the punctuation
apostrophe/right single quotation mark. In that case, all of them will
generally be represented by U+2019.
The semantics of U+2019 are therefore context-dependent. For example, if
surrounded by letters or digits on both sides, it behaves as an in-text
punctuation character and does not separate words or lines.
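
The last rule quoted above (a mark with letters or digits on both sides acts
as in-text punctuation) can be illustrated with a minimal Python sketch. This
is not taken from the Unicode Standard. The function names are hypothetical,
and str.isalnum() is only an approximation of "letters or digits".

```python
APOSTROPHE = "\u0027"  # typewriter apostrophe
RSQUO = "\u2019"       # RIGHT SINGLE QUOTATION MARK


def is_in_word(text: str, i: int) -> bool:
    """Return True if text[i] has a letter or digit on both sides."""
    return (
        0 < i < len(text) - 1
        and text[i - 1].isalnum()
        and text[i + 1].isalnum()
    )


def curl_in_word_apostrophes(text: str) -> str:
    """Replace U+0027 by U+2019 only where it sits between letters/digits.

    Marks at a word boundary are left alone, because without more context
    they could be quotation marks rather than apostrophes.
    """
    return "".join(
        RSQUO if ch == APOSTROPHE and is_in_word(text, i) else ch
        for i, ch in enumerate(text)
    )
```

For example, curl_in_word_apostrophes("We've been 'here'") changes only the
apostrophe in "We've". The two marks around "here" touch a word boundary and
stay as U+0027, since the sketch does not try to guess quotation direction.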
--
Karl Ove Hufthammer