Last call for ISO 15924-based updates
Doug Ewell
doug at ewellic.org
Wed Mar 18 04:38:45 CET 2009
CE Whitehead <cewcathar at hotmail dot com> wrote:
> However, I do question "Not used to tag documents" I am still totally
> lost (in spite of Peter's great explanations, below). What exactly
> does "Not used to tag documents" mean? Does it mean not used in the
> language tag indicating the overall document language, but possibly
> used somewhere in the document to indicate a diacritic mark on a
> character (where the display of the diacritic mark depends on the
> script/character)
>
> (Sorry to ask a dumb question & I know this is long but I like lucid
> explanations that make sense to the uninitiated.)
Here is a longer example that I hope is sufficiently lucid to answer CE
Whitehead's lingering question.
Take an ordinary English sentence containing a loan word (borrowed from
French) written with two diacritical marks:
Send us your résumé.
(Note that this word is often written in English without the accents.
Bear with me; this is just an example.)
Suppose this message is encoded in Unicode or ISO 10646 (any format),
and suppose the letter "e with acute" is represented by two characters,
U+0065 LATIN SMALL LETTER E and U+0301 COMBINING ACUTE ACCENT. (In
Unicode this is referred to as Normalization Form D, but it is also a
valid representation in ISO 10646 using implementation level 3.) To
avoid display problems that might distract from this example, I will use
the string "[U+0301]" to represent the combining character:
Send us your re[U+0301]sume[U+0301].
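For anyone who wants to reproduce the decomposed form, here is a minimal sketch in Python using the standard unicodedata module. The helper name show_combining is my own invention for this illustration; it simply replaces each combining character with the bracketed notation used above.

```python
import unicodedata

# The example sentence, with each "e with acute" as the single
# precomposed character U+00E9.
text = "Send us your r\u00e9sum\u00e9."

# Normalization Form D splits U+00E9 into U+0065 followed by U+0301.
decomposed = unicodedata.normalize("NFD", text)

def show_combining(s):
    """Hypothetical helper: render combining characters as [U+XXXX]."""
    out = []
    for ch in s:
        if unicodedata.combining(ch):
            # Non-zero combining class: show the code point instead.
            out.append("[U+%04X]" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)

print(show_combining(decomposed))
# Send us your re[U+0301]sume[U+0301].
```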
In Unicode, every coded character has a script property, and in
particular, general-purpose combining characters like U+0301 have a
script property of "inherited," which means that they temporarily assume
the script property of the base character that precedes them. In this
example, the acute accent inherits the script property of "Latin" from
the base letter U+0065. The combining acute accent is not permanently
assigned to the Latin script because it could reasonably be used in a
different context with some other script, such as Cyrillic.
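The inheritance rule can be sketched in a few lines of Python. Python's standard library does not expose the Unicode script property, so the SCRIPT table below is a tiny hand-written excerpt that I am assuming for the sake of the example, and resolved_scripts is a hypothetical helper, not part of any library. It is also deliberately simplistic: it only looks at the immediately preceding character.

```python
# Assumed, hand-written excerpt of Unicode script property values.
# A real implementation would read these from the Unicode Character
# Database (Scripts.txt) rather than hard-coding them.
SCRIPT = {
    "\u0065": "Latin",      # LATIN SMALL LETTER E
    "\u0435": "Cyrillic",   # CYRILLIC SMALL LETTER IE
    "\u0301": "Inherited",  # COMBINING ACUTE ACCENT
}

def resolved_scripts(text):
    """Hypothetical helper: replace 'Inherited' with the script of
    the character that precedes it."""
    result = []
    previous = "Unknown"
    for ch in text:
        script = SCRIPT.get(ch, "Unknown")
        if script == "Inherited":
            # The combining mark takes on the preceding script.
            script = previous
        result.append(script)
        previous = script
    return result

print(resolved_scripts("e\u0301"))       # ['Latin', 'Latin']
print(resolved_scripts("\u0435\u0301"))  # ['Cyrillic', 'Cyrillic']
```

The same U+0301 comes out as Latin in one string and Cyrillic in the other, which is the whole point of not assigning it permanently to either script.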
Now, I'd like someone to explain to me the benefit of writing something
like this, in a context such as HTML where BCP 47 language tags can be
used to tag arbitrary sections of text:
<span lang="en">
Send us your re<span lang="en-Zinh">[U+0301]</span>sume<span
lang="en-Zinh">[U+0301]</span>.
</span>
Silly, right? This is why we say that the subtag 'Zinh' will not be
useful for tagging the sort of content for which BCP 47 is intended.
At the same time, I personally don't understand why we spend so much
time on this list and LTRU trying to prevent users from doing stupid and
obscure things like tagging content as "inherited script," when we
continue to have larger problems with users making up their own language
subtags, specifying correct but unnecessary region subtags, and worrying
about Suppress-Script being normative and comprehensive.
--
Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages