Apostrophe (was Re: Names)

Kenneth Whistler kenw at sybase.com
Tue Mar 24 20:51:00 CET 2009


Patrik asked:

[Please see the end of this note, if you want to
skip the explanations, as I express an opinion
about whether the change should be made to the table.]

> Then I must ask myself, why does it have the properties in Unicode it  
> has?

I'm not quite sure whether that was intended as a serious
question or as rhetorical.

At any rate, the Unicode Standard has long had an explicit
discussion about apostrophes, precisely because of the
complications in encoding and function for them. See
TUS 5.0, pp. 211-212.

U+2019 is particularly complicated, because it is both the
recommended character for an apostrophe in Latin text
(often now converted to that on input in word processors
when a user presses the "'" key on a keyboard), *and* it
is one of the pair of directional single quotation marks.

At any rate, the relevant characters and their properties
are:

U+0027 APOSTROPHE

       gc=Po, bc=ON, lb=QU, wb=MidNumLet
       
U+2019 RIGHT SINGLE QUOTATION MARK

       gc=Pf, bc=ON, lb=QU, wb=MidNumLet
       
U+02BC MODIFIER LETTER APOSTROPHE

       gc=Lm, bc=L,  lb=AL, wb=ALetter
       
U+02BC is the "letter apostrophe" -- that is the character
you use when you want a *real* letter, as for an orthography
that uses an apostrophe-like letter shape for a glottal
stop, for example. It has the General_Category of a modifier
letter, is strongly directional for bidi, and line breaks
and word breaks like an ordinary letter.
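
The gc and bc values listed above can be spot-checked with a short
script. This is only a sketch, assuming Python's standard unicodedata
module; that module does not expose the lb or wb properties, so those
two columns are not covered, and the helper name describe() is made up
for illustration.

```python
import unicodedata

# The three apostrophe-like characters discussed above.
CANDIDATES = ["\u0027", "\u2019", "\u02BC"]

def describe(ch):
    """Return (code point, name, General_Category, Bidi_Class) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),       # gc
        unicodedata.bidirectional(ch),  # bc
    )

for ch in CANDIDATES:
    print(*describe(ch))
```

The values reported depend on the version of the Unicode Character
Database bundled with the Python build in use.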

U+0027 and U+2019 are the "punctuation apostrophes". They
have the General_Category of punctuation -- the difference
being that U+0027 isn't formally paired, whereas U+2019
has the punctuation property of similarly paired quotation
marks. Both are neutral for bidi, and they line break like
quotation marks. For word break, they are given the
MidNumLet property, because you get better default behavior
for word breaking if you assume that internal use of
apostrophes to indicate contraction, elision, or liaison
doesn't indicate word boundaries. Both U+0027 and U+2019
get used this way, and are often mixed up.
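
The effect of the MidNumLet assignment can be illustrated with a toy
word splitter. This is a deliberately simplified sketch, not the full
default word-break algorithm: it only implements the idea that an
apostrophe sitting between two letters does not end the word. The
function name words() and the sample strings are made up for
illustration, and the digit-related behavior of MidNumLet is left out.

```python
# Assumption: only the two punctuation apostrophes are treated as
# MidNumLet here; the real property covers more characters.
MID_NUM_LET = {"\u0027", "\u2019"}

def words(text):
    """Split text into runs of letters, keeping an apostrophe that
    sits between two letters inside the word."""
    out, current = [], ""
    for i, ch in enumerate(text):
        if ch.isalpha():
            current += ch
        elif (ch in MID_NUM_LET and current
              and i + 1 < len(text) and text[i + 1].isalpha()):
            # Letter on both sides: no word boundary here.
            current += ch
        else:
            if current:
                out.append(current)
            current = ""
    if current:
        out.append(current)
    return out

print(words("l\u2019homme n'est pas"))
print(words("'quoted'"))
```

With this rule, the contractions stay whole, while an apostrophe used
as an opening or closing quotation mark is dropped from the word.
U+02BC needs no special handling in the sketch, because str.isalpha()
already treats a modifier letter as a letter.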

U+0027 is special, because it is the formal syntax character
in many formal languages for a single quotation mark,
whereas U+2019 is not. And, of course, it is in the
ASCII subset, whereas U+2019 is not.

> I.e. I really do not like having the IETF "overriding" various  
> definitions Unicode Consortium has decided upon,

Nor do I.

> because there must be  
> a reason why the codepoint has the classification it has.

Of course it is often more complicated than that. There are
multiple reasons in such cases, often crosscutting and
conflicting, between actual character functions, choices
made for encoding, ambiguity in implementations, and
procrustean fits for categories that don't necessarily
match exactly the distinctions to be made.

It would be wonderful if there were just a "truth" to be
discovered about character properties, and the Unicode
Standard defined it. But, alas, character properties are
human inventions, applied by humans, to more human inventions,
the writing systems themselves.

> If it is "used as a letter in a name", then it should be 
> one of the letter variants? Or?

That's for U+02BC, and *not* for U+2019.

U+2019 is *not* "used as a letter in a name" -- Mark just
said it was "a character used in many languages, in many
surnames". Which is true. You could restate this as:
the punctuation apostrophe (the one indicating contractions,
elisions, or liaison) is a commonly occurring punctuation
mark in English and French (and numerous other written
languages) and *DOES NOT INDICATE A WORD BOUNDARY* -- in
other words, it is considered "part of a word" in those
orthographies, including common use in personal and
place names.

But then we are right back around to IDNA first principles
again. We aren't attempting to write a standard for defining
usable words in all writing systems -- we are attempting to
update IDNA2003's attempt to update the definition of
domain name labels from LDH to the much broader repertoire
of Unicode characters. Those aren't words -- they are labels,
and not being able to express some words in domain name
labels is o.k., it really is.

> I am just trying to understand, and will of course given consensus in  
> this wg change (potentially) exceptions accordingly.

The issue here comes down to this:

Neither IDNA2003 nor IDNA2008 allows U+0027 in domain
name labels. Nobody wants to change that, *regardless*
of the fact that U+0027 commonly appears in English
and French words.

U+2019 is *allowed* in domain name labels in IDNA2003.
It is not allowed in domain name labels in IDNA2008,
by the current table, which classifies it as DISALLOWED,
along with most of the rest of the punctuation and
symbols that were allowed in IDNA2003.

Mark asked that U+2019 be added to Exceptions (F) as
PVALID. He has evidence that it is already used in
labels based on IDNA2003, so allowing it as PVALID in IDNA2008
would marginally improve compatibility with IDNA2003. It
is also something that people would generally like to
work, since apostrophes in personal and place names are
so common.
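
What an Exceptions entry would do can be sketched as follows. This is
a crude stand-in for the table derivation, not the actual IDNA2008
rules: it assumes a character is PVALID when its General_Category is a
lowercase, other, or modifier letter, a nonspacing or spacing mark, or
a decimal digit, and DISALLOWED otherwise, with an exceptions mapping
consulted first. The real derivation has many more rules; classify()
and the category set are illustrative names only.

```python
import unicodedata

# Assumption: a simplified letter/digit test. Uppercase letters are
# left out of the set on purpose, since they do not survive case folding.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

def classify(ch, exceptions=None):
    """Return 'PVALID' or 'DISALLOWED' for one character.

    `exceptions` is a hypothetical mapping from a character to a
    property value that overrides the category-based default.
    """
    if exceptions and ch in exceptions:
        return exceptions[ch]
    if unicodedata.category(ch) in LETTER_DIGIT_CATEGORIES:
        return "PVALID"
    return "DISALLOWED"

# Default outcome: the punctuation apostrophes fall out as punctuation.
print(classify("\u0027"), classify("\u2019"), classify("\u02BC"))

# The requested change, modeled as a single exception entry.
print(classify("\u2019", exceptions={"\u2019": "PVALID"}))
```

In this model the request amounts to one extra entry for U+2019, while
U+0027 stays DISALLOWED whatever happens to U+2019.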

My opinion is that we are already in for the big
transition by eliminating punctuation and symbols
from domain name labels, and I don't think making
an exception for apostrophe is worth it. In particular,
my biggest concern is that U+0027 and U+2019 are
not clearly enough distinguished, either by glyph,
usage, or implementations. U+0027 cannot become
PVALID, clearly, so making U+2019 PVALID has the potential
for confusion and mischief-making. 
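
The confusion risk can be made concrete with two made-up labels that
differ only in which apostrophe they use. This is a sketch, assuming
Python's standard unicodedata module; the label strings and the helper
same_after_nfkc() are hypothetical. It shows that the two strings are
distinct, and that NFKC normalization does not fold U+2019 into U+0027.

```python
import unicodedata

ascii_label = "o'neil"        # uses U+0027 APOSTROPHE
curly_label = "o\u2019neil"   # uses U+2019 RIGHT SINGLE QUOTATION MARK

def same_after_nfkc(a, b):
    """Compare two strings after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

# The strings look nearly identical in many fonts but are different
# code point sequences, before and after normalization.
print(ascii_label == curly_label)
print(same_after_nfkc(ascii_label, curly_label))
```

Both comparisons come out false, so normalization alone would not
reconcile a label typed with one apostrophe against a label typed with
the other.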

> Where does this slippery slope end?

When IDNA2008 is finished, perhaps.

--Ken



