Apostrophes in non-ASCII names (was: A proposed solution for descriptions)

Karl Ove Hufthammer karl at huftis.org
Wed Jun 28 12:25:45 CEST 2006


On Tuesday 27 June 2006 at 20:37, Ciarán Ó Duibhín wrote:

>Agreed, names are only informative.  But the basic fact is that there is a
>processing need for two characters for this mark, one when it stands for
>elision (or possession), and another when it terminates a quote.  These
>functions are normally called "apostrophe" and "right single quote" and it
>is fortunate (and unlikely to be coincidental) that Unicode contains
>distinct non-decomposable characters with those names.  It will not do to
>encode both functions as the same character, as we (amongst others) are in
>danger of doing.

Yes, it will. U+2019 *is* the preferred character for the apostrophe used for 
contraction and possession. Let me quote the relevant sections of the Unicode 
Standard 4.1.0 "http://www.unicode.org/versions/Unicode4.0.0/ch06.pdf":

Encoding Characters with Multiple Semantic Values.

Some ASCII characters have multiple uses, either through ambiguity in the 
original standards or through accumulated reinterpretations of a limited code 
set. For example, 27₁₆ is defined in ANSI X3.4 as apostrophe (closing single 
quotation mark; acute accent), and 2D₁₆ is defined as hyphen-minus. In 
general, the Unicode Standard provides the same interpretation for the 
equivalent code points, without adding to or subtracting from their 
semantics. The Unicode Standard supplies unambiguous codes elsewhere for the 
most useful particular interpretations of these ASCII values; the 
corresponding unambiguous characters are cross-referenced in the character 
names list for this block. For a complete list of space characters and dash 
characters in the Unicode Standard, see “General Punctuation” later in this 
section.

For historical reasons, U+0027 is a particularly overloaded character. In 
ASCII, it is used to represent a punctuation mark (such as right single 
quotation mark, left single quotation mark, apostrophe punctuation, vertical 
line, or prime) or a modifier letter (such as apostrophe modifier or acute 
accent). Punctuation marks generally break words; modifier letters generally 
are considered part of a word. The preferred character for apostrophe is 
U+2019, but U+0027 is commonly present on keyboards. In modern software, it 
is therefore common to substitute U+0027 by the appropriate character in 
input. In these systems, a U+0027 in the data stream is always represented as 
a straight vertical line and can never represent a curly apostrophe or a 
right quotation mark. For more information, see “Apostrophes” later in this 
section.

Apostrophes

U+0027 APOSTROPHE is the most commonly used character for apostrophe. However, 
it has ambiguous semantics and direction. When text is set, U+2019 RIGHT 
SINGLE QUOTATION MARK is preferred as apostrophe. Word processors commonly 
offer a facility for automatically converting the U+0027 APOSTROPHE to a 
contextually selected curly quotation glyph.
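
As an aside to the quoted text, the kind of automatic conversion it mentions can be sketched roughly as follows. This is only a naive heuristic of my own for illustration, not what any particular word processor does; the function name curl_apostrophes is hypothetical. It assumes a U+0027 at the start of the text, after whitespace, or after an opening bracket is an opening quote, and that every other U+0027 becomes U+2019.

```python
APOSTROPHE = "\u0027"          # U+0027 APOSTROPHE (typewriter form)
LEFT_SINGLE_QUOTE = "\u2018"   # U+2018 LEFT SINGLE QUOTATION MARK
RIGHT_SINGLE_QUOTE = "\u2019"  # U+2019 RIGHT SINGLE QUOTATION MARK


def curl_apostrophes(text: str) -> str:
    """Replace each U+0027 with a contextually chosen curly glyph.

    Hypothetical helper using a deliberately simple rule:
    opening position -> U+2018, anything else -> U+2019.
    """
    out = []
    for i, ch in enumerate(text):
        if ch != APOSTROPHE:
            out.append(ch)
            continue
        prev = text[i - 1] if i > 0 else ""
        # Treat start of text, whitespace, or an opening bracket as "opening".
        if prev == "" or prev.isspace() or prev in "([{":
            out.append(LEFT_SINGLE_QUOTE)
        else:
            out.append(RIGHT_SINGLE_QUOTE)
    return "".join(out)
```

A rule this simple gets word-initial elisions wrong (the apostrophe in 'tis would come out as U+2018), which is one reason real implementations need more context than the preceding character.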

Letter Apostrophe.
U+02BC MODIFIER LETTER APOSTROPHE is preferred where the apostrophe is to 
represent a modifier letter (for example, in transliterations to indicate a 
glottal stop). In the latter case, it is also referred to as a letter 
apostrophe.

Punctuation Apostrophe.
U+2019 RIGHT SINGLE QUOTATION MARK is preferred where the character is to 
represent a punctuation mark, as for contractions: “We’ve been here before.” 
In this latter case, U+2019 is also referred to as a punctuation apostrophe. 
An implementation cannot assume that users’ text always adheres to the 
distinction between these characters. The text may come from different 
sources, including mapping from other character sets that do not make this 
distinction between the letter apostrophe and the punctuation 
apostrophe/right single quotation mark. In that case, all of them will
generally be represented by U+2019.
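
To make the difference between the three candidates concrete, here is a small sketch that looks up their names and general categories with Python's standard unicodedata module. The helper name describe is my own; the exact data depends on the Unicode version bundled with the interpreter, though these three characters have been stable for a long time.

```python
import unicodedata

# The three characters discussed in the quoted passages.
CANDIDATES = ["\u0027", "\u2019", "\u02BC"]


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, character name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in CANDIDATES:
    print(*describe(ch))
```

U+0027 and U+2019 both fall in punctuation categories (Po and Pf), while U+02BC is category Lm, a modifier letter, which is why it counts as part of a word.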

The semantics of U+2019 are therefore context-dependent. For example, if 
surrounded by letters or digits on both sides, it behaves as an in-text 
punctuation character and does not separate words or lines.
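
The context rule in that last paragraph could be checked along these lines. This is a minimal sketch under my own assumptions: the function name is hypothetical, and str.isalnum() is used as a stand-in for "letters or digits", which is somewhat broader than what a real word-breaking algorithm would apply.

```python
RIGHT_SINGLE_QUOTE = "\u2019"  # U+2019 RIGHT SINGLE QUOTATION MARK


def is_in_text_apostrophe(text: str, i: int) -> bool:
    """Return True if text[i] is U+2019 with a letter or digit on both sides.

    Hypothetical helper: approximates "letters or digits" with str.isalnum().
    """
    if text[i] != RIGHT_SINGLE_QUOTE:
        return False
    if i == 0 or i == len(text) - 1:
        # No neighbour on one side, so it cannot be inside a word.
        return False
    return text[i - 1].isalnum() and text[i + 1].isalnum()
```

With this check, the U+2019 in “We’ve” counts as in-text punctuation, while the one in “users’ text” does not, because it is followed by a space.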

-- 
Karl Ove Hufthammer

