Subtag for indicating "marked" text?
sascha at brawer.ch
Wed Jul 6 12:22:20 CEST 2016
What would you think of registering an IETF language variant subtag to
denote text with marks for tones, gemination, vowel length, vowel quality,
etc. in languages where such marks are not part of the regular spelling?
For example, Arabic and Hebrew usually do not write short vowels. However,
optional marks can be used to indicate the vowels. Without a variant
subtag, we cannot give a BCP47 language code to corpora of text written in
“Arabic with vowel markers”.
Another example is Lingala, where optional marks are used to indicate
tones. In the Unicode UDHR project, we have Lingala text once with and once
without tones. However, currently we cannot express this distinction with
BCP47 language tags:
(Apart from tones, the two texts should be identical. Currently they
aren’t, but that’s an unrelated problem.)
Another example is Cherokee, where optional marks can be used to indicate
Another example is Amharic (and all other Ethiopic languages), where
optional marks are used to indicate syllables with geminated (=long)
consonants, and/or long vowels.
In all these examples, the markers are usually not written in regular text.
But in children’s books, teaching material for language learners, religious
texts, etc., the markers would be written to indicate the otherwise
ambiguous pronunciation. Also, there are specialized applications (e.g.
corpora for speech applications) that explicitly collect texts with such
markers attached. To identify marked text, it would be useful to have a
variant subtag.
An alternative to registering a general "marked" subtag might be different
subtags for "vowelmarked", "geminationmarked", "tonemarked", etc. Seems a
bit complicated, and those tags would have to be shortened to fit into the
eight-character limit for variant subtags.
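
To make the length concern concrete, here is a rough Python sketch that checks the candidate names against the variant subtag syntax in RFC 5646 (5 to 8 alphanumerics, or 4 characters starting with a digit). The function name is_wellformed_variant is made up for illustration, and none of the candidate subtags are registered; this only tests the syntax, not the registry.

```python
import re

# Variant subtag syntax per RFC 5646:
#   variant = 5*8alphanum / (DIGIT 3alphanum)
VARIANT_RE = re.compile(r"(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})")


def is_wellformed_variant(subtag: str) -> bool:
    """Return True if subtag matches the variant syntax (hypothetical helper)."""
    return VARIANT_RE.fullmatch(subtag) is not None


# Candidate names taken from this message; none of them is registered.
for candidate in ("marked", "vowelmarked", "geminationmarked", "tonemarked"):
    verdict = "well-formed" if is_wellformed_variant(candidate) else "not well-formed"
    print(f"{candidate} ({len(candidate)} characters): {verdict}")
```

Only "marked" passes as written; the three longer names would need abbreviating before they could be proposed.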
What do you think?