Apostrophes in non-ASCII names (was: A proposed solution for descriptions)

Tue Jun 27 20:37:18 CEST 2006

I will risk further sickening those who are tired of this topic by
responding to recent posts which refer to one of mine.  In doing so, I am
not challenging the consensus which clearly exists on the list on the
encoding of apostrophes, and which is embodied in Doug's latest proposals.
If that means I shouldn't post this here, I apologize, but I would be
unhappy to leave the discussion as it presently stands.

I wrote:
> In the case of "Côte d'Ivoire" there is no such doubt.  The mark functions
> and should be processed as an apostrophe (&#x0027;), not as a
> right single quote (&#x2019;).

and Kent Karlsson replied:
> The preferred character for a punctuation apostrophe is U+2019.
> U+0027 is a typewriterish apostrophe (it has a symmetric, and usually
> rather ugly, glyph). The names of characters don't tell the full story.

Agreed, names are only informative.  But the basic fact is that there is a
processing need for two characters for this mark, one when it stands for
elision (or possession), and another when it terminates a quote.  These
functions are normally called "apostrophe" and "right single quote" and it
is fortunate (and unlikely to be coincidental) that Unicode contains
distinct non-decomposable characters with those names.  It will not do to
encode both functions as the same character, as we (amongst others) are in
danger of doing.

**Skeptics of the requirement for two characters are invited to visit my
temporary webpage http://www.smo.uhi.ac.uk/~oduibhin/apostrophe.htm, and try
the challenge there, off-list.**

Jon Hanna said:

> U+0027 is useful as an input method to get U+2019 (when some cleverness
> is used to determine it should be that rather than U+2018, U+2032 or any
> of various other possibilities), for programming, and to meet historical
> requirements, or because one's editor doesn't do the aforementioned
> cleverness, but it's an uncouth character, not really suited to polite
> company.

Actually, U+2018 and U+2019 can be keyed directly, using free input method
layouts. This is a safe way to do it, and works system-wide, under any
application.  By contrast, in the case of MS Word's "smart quotes" at least,
the cleverness which converts U+0027 to one or other of the foregoing does
pretty well on the whole, but is not fully adequate even for English, let
alone other languages (try it on "'twas brillig" or "go get 'em", where a
word-initial elision apostrophe is liable to be taken for an opening
quote).  Other
systems may do better, but it seems questionable whether a perfect
conversion is possible without fuller natural language understanding.
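
To make the difficulty concrete, here is a minimal Python sketch of a purely
positional conversion rule.  It is a hypothetical toy: the function name
naive_smart_quotes and the rule itself are assumptions made for
illustration, and it is not a description of how MS Word or any other
product actually works.

```python
def naive_smart_quotes(text: str) -> str:
    """Toy rule (an assumption, not any real product's algorithm):
    a U+0027 at the start of the text or after whitespace becomes
    U+2018 (opening quote); every other U+0027 becomes U+2019."""
    out = []
    for i, ch in enumerate(text):
        if ch == "'":
            opens = i == 0 or text[i - 1].isspace()
            out.append("\u2018" if opens else "\u2019")
        else:
            out.append(ch)
    return "".join(out)


for sample in ["d'Ivoire", "'twas brillig", "go get 'em"]:
    converted = naive_smart_quotes(sample)
    # Show the code point chosen for each converted mark.
    marks = [f"U+{ord(c):04X}" for c in converted if c in "\u2018\u2019"]
    print(sample, "->", converted, marks)
```

Under this rule the mark in "d'Ivoire" comes out as U+2019, but the elision
marks in "'twas" and "'em" come out as U+2018, because position alone
cannot tell a word-initial elision from an opening quote.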

I wrote:
> ... I have come to the view that "N&#x2019;Ko" looks like a
> meaningless character string, and I would not support introducing it as a
> description.

and Markus Scherer replied:
> It does look like a meaningless string, but only when viewed in this
> raw form, and only because the registry has decided to use an ASCII
> encoding with HTML Numeric Character References rather than UTF-8.
> When viewed in any kind of end-user application, or on a web page, or
> converted to UTF-8 text, it will look just fine as N'Ko.

Agreed, but this was not what I meant.  I meant that it is meaningless as a
string of four Unicode characters.  The reason I say it is meaningless for
processing is that it appears to encode an unmatched quotation mark within a
word.
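
For readers who want to see the four characters in question, the following
Python sketch decodes the raw form and lists each code point with its
Unicode name.  It uses only the standard library modules html and
unicodedata; the helper name describe is an assumption for illustration.

```python
import html
import unicodedata


def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode character name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


raw = "N&#x2019;Ko"               # the raw form quoted above
decoded = html.unescape(raw)      # resolves the numeric character reference
print(len(decoded))
for code_point, name in describe(decoded):
    print(code_point, name)
```

The second of the four characters is reported as RIGHT SINGLE QUOTATION
MARK, which is the basis of the objection: read as data, the string
contains a closing quote with no opening partner.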

I wrote:
>If no one in this forum knows how the mark in "N'Ko" ought to be processed,
>and therefore what the most appropriate character to encode the mark would
>be

and Michael Everson replied:
>But we do.

This is by far the most important point, and I hesitated to include anything
else in this message for fear of distracting from it.  We can really hope to
make some progress if you (singular or plural) would be good enough to tell
us a few things about this.

• In tokenizing a string containing "N'Ko", is it desirable to retain or to
drop the mark?

• In tokenizing a string containing "N'Ko", is it desirable to break the
string at the mark, and if so should the break be before, at, or after the
mark?

• It would also be very helpful to know the answers to the same two
questions in relation to a string containing "Gwich'in".

For comparison, the answers for "d'Ivoire" are (as I hope everyone can agree
is obvious): the mark should be retained.  Depending on purpose, the string
may be broken after the mark, or not broken at all.
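
The two acceptable behaviours for "d'Ivoire" can be sketched in Python as
follows.  This is only an illustration under stated assumptions: the
function name tokenize, the break_after_mark flag, and the choice to treat
both U+0027 and U+2019 as the mark are hypothetical, and nothing here
presumes the answers for "N'Ko" or "Gwich'in".

```python
import re

# One token is a run of non-mark characters plus an optional trailing mark,
# or a lone mark.  The mark is always retained, never dropped.
_PIECE = re.compile("[^'\u2019]+['\u2019]?|['\u2019]")


def tokenize(text: str, break_after_mark: bool = False) -> list[str]:
    """Split on whitespace; optionally also break each word after the mark."""
    words = text.split()
    if not break_after_mark:
        return words
    tokens = []
    for word in words:
        tokens.extend(_PIECE.findall(word))
    return tokens


print(tokenize("Côte d'Ivoire"))
print(tokenize("Côte d'Ivoire", break_after_mark=True))
```

With the flag off, "d'Ivoire" stays as one token; with it on, the word is
broken after the mark, giving "d'" and "Ivoire".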

In reply to Mark Crispin, Michael Everson said:
>Pshaw. APOSTROPHE is a coding weirdness deriving from typewriter
>technology. It may be useful as a delimiter to programmers. Correct
>spelling and typography uses the traditional curly quote.

There is a battalion of straw men appearing here.

I don't know that anyone said that the apostrophe character should be used
in the delimiter function.  I said that it should not be.

Also, the various Unicode characters displayed as a "raised 9-comma" are not
a convenience for programmers; they are there for the use of linguists
encoding text corpora, who expect their encoding to result in sensible
processing behaviour.

I am not arguing against typographical accuracy — why on earth would I?  I
am arguing against achieving typographical accuracy at the expense of
encoding accuracy, which would be a kludge.  We need to find ways of
achieving both kinds of accuracy together.

In that spirit, I am happy to shift my ground a little.  What matters for
processing is that apostrophes which are not quotes (elision apostrophes
etc) are not encoded as U+2019.  I have been suggesting U+0027, which I said
should be displayed as "raised 9-comma".  I admit that last bit may be
impractical — at any rate, the discussion has ignored it and people have
argued instead that U+0027 "looks wrong", which is true in practice if not
in principle but is a red herring from my point of view.  So let me abandon
U+0027 and consider instead the range of other Unicode characters which
display (in practice even) as "raised 9-comma".  Let us then jointly try to
see which of them has a processing semantics which matches the answers to
the questions posed above, which I sincerely hope will be forthcoming.  Does
that seem reasonable as a way forward?
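
As one possible starting point for such a comparison, the Python sketch
below prints the Unicode general category of three characters and shows how
one common tool, the \w class of Python's re module, treats each of them
inside "N?Ko".  The inclusion of U+02BC is an assumption made here for
illustration; the list of candidates is not settled above, and the
behaviour of one regex engine is not a statement of what the correct
processing semantics should be.

```python
import re
import unicodedata

# Candidate marks; U+02BC is included only as an illustrative assumption.
CANDIDATES = ["\u0027", "\u2019", "\u02BC"]


def properties(ch: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, general category) for a character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


def word_pieces(text: str) -> list[str]:
    """Runs of 'word' characters as defined by Python's re module."""
    return re.findall(r"\w+", text)


for mark in CANDIDATES:
    print(properties(mark), word_pieces(f"N{mark}Ko"))
```

U+0027 and U+2019 are both classed as punctuation (Po and Pf), so this
engine breaks the string at the mark and drops it; U+02BC is classed as a
modifier letter (Lm), so the string stays whole with the mark retained.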

Ciarán Ó Duibhín