Last call for ISO 15924-based updates

Doug Ewell doug at ewellic.org
Wed Mar 18 04:38:45 CET 2009


CE Whitehead <cewcathar at hotmail dot com> wrote:

> However, I do question "Not used to tag documents" I am still totally 
> lost (in spite of Peter's great explanations, below).  What exactly 
> does "Not used to tag documents" mean?  Does it mean not used in the 
> language tag indicating the overall document language, but possibly 
> used somewhere in the document to indicate a diacritic mark on a 
> character (where the display of the diacritic mark depends on the 
> script/character)
>
> (Sorry to ask a dumb question & I know this is long but I like lucid 
> explanations that make sense to the uninitiated.)

Here is a longer example that I hope is sufficiently lucid to answer CE 
Whitehead's lingering question.

Take an ordinary English sentence containing a loan word (borrowed from 
French) written with two diacritical marks:

    Send us your résumé.

(Note that this word is often written in English without the accents. 
Bear with me; this is just an example.)

Suppose this message is encoded in Unicode or ISO 10646 (any format), 
and suppose the letter "e with acute" is represented by two characters, 
U+0065 LATIN SMALL LETTER E and U+0301 COMBINING ACUTE ACCENT.  (In 
Unicode this is referred to as Normalization Form D, but it is also a 
valid representation in ISO 10646 using implementation level 3.)  To 
avoid display problems that might distract from this example, I will use 
the string "[U+0301]" to represent the combining character:

    Send us your re[U+0301]sume[U+0301].
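For anyone who wants to reproduce this decomposed form, here is a minimal Python sketch. It assumes only the standard unicodedata module; the helper name show_combining is made up for this illustration and is not part of any library.

```python
import unicodedata

def show_combining(text: str) -> str:
    """Decompose text to NFD and spell out each combining mark as [U+XXXX]."""
    decomposed = unicodedata.normalize("NFD", text)
    parts = []
    for ch in decomposed:
        # combining() returns a nonzero combining class for marks like U+0301
        if unicodedata.combining(ch):
            parts.append(f"[U+{ord(ch):04X}]")
        else:
            parts.append(ch)
    return "".join(parts)

# The input uses the precomposed letter U+00E9 (e with acute).
print(show_combining("Send us your r\u00e9sum\u00e9."))
# prints: Send us your re[U+0301]sume[U+0301].
```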

In Unicode, every coded character has a script property, and in 
particular, general-purpose combining characters like U+0301 have a 
script property of "inherited," which means that they temporarily assume 
the script property of the base character that precedes them.  In this 
example, the acute accent inherits the script property of "Latin" from 
the base letter U+0065.  The combining acute accent is not permanently 
assigned to the Latin script because it could reasonably be used in a 
different context with some other script, such as Cyrillic.
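The inheritance rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SCRIPT table is a tiny hand-written excerpt (the authoritative values are in the Unicode Character Database file Scripts.txt), and the function name resolved_scripts is hypothetical.

```python
# Hand-written excerpt for illustration only; real data comes from Scripts.txt.
SCRIPT = {
    "\u0065": "Latin",      # LATIN SMALL LETTER E
    "\u0435": "Cyrillic",   # CYRILLIC SMALL LETTER IE
    "\u0301": "Inherited",  # COMBINING ACUTE ACCENT
}

def resolved_scripts(text):
    """Pair each character with its script; a character whose script is
    Inherited takes the resolved script of the character before it."""
    result = []
    previous = "Unknown"
    for ch in text:
        script = SCRIPT.get(ch, "Unknown")
        if script == "Inherited":
            script = previous
        result.append((f"U+{ord(ch):04X}", script))
        previous = script
    return result

print(resolved_scripts("e\u0301"))       # Latin base letter, then the accent
print(resolved_scripts("\u0435\u0301"))  # Cyrillic base letter, then the accent
```

The same U+0301 comes out as Latin in the first call and Cyrillic in the second, which is why it cannot be assigned permanently to either script.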

Now, I'd like someone to explain to me the benefit of writing something 
like this, in a context such as HTML where BCP 47 language tags can be 
used to tag arbitrary sections of text:

<span lang="en">
Send us your re<span lang="en-Zinh">[U+0301]</span>sume<span lang="en-Zinh">[U+0301]</span>.
</span>

Silly, right?  This is why we say that the subtag 'Zinh' will not be 
useful for tagging the sort of content for which BCP 47 is intended.

At the same time, I personally don't understand why we spend so much 
time on this list and LTRU trying to prevent users from doing stupid and 
obscure things like tagging content as "inherited script," when we 
continue to have larger problems with users making up their own language 
subtags, specifying correct but unnecessary region subtags, and worrying 
about Suppress-Script being normative and comprehensive.

--
Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages


