Progressing beyond borders—making subtags inclusive
Nicholas Shanks
contact at nickshanks.com
Fri Jan 4 16:45:25 CET 2008
On 3 Jan 2008, at 19:33, Karen Broome wrote:
> Is it simply up to the user to decide whether to use regional or
> variant tagging? Or should some guidelines be written to indicate a
> preference for variant tagging over regional tagging if both exist?
I'd like to second the call for some guidelines to be widely
disseminated. I am a web developer and would like to see all of the
web tagged (correctly!) with language data.
My own opinion is that using country codes to define dialects is
flawed. When borders change (Czechoslovakia splits in two, Germany
reunifies, etc.), the old country codes become obsolete even though
linguistically nothing has changed. When populations are displaced,
they take their language with them.
I feel that all dialects should have their own subtags, not just the
ones that partisan individuals propose. As a great example, there's a
subtag for en-scouse but not one for Yorkshire, Geordie or Brummie,
because the guy that submitted the scouse request has a vested
interest in his own dialect, and nobody has bothered to register the
others. The distinction between en-US and en-GB is mainly an
orthographic one. I say this because en-US represents a cluster of
dialects and accents, with a unified orthography, and en-GB represents
a cluster of accents and dialects (some overlapping with en-US), but a
different orthography. Thus en-GB/US is pretty useless to people who
are tagging audio data, but quite useful to those tagging written data.
I believe that getting a subtag registered is at present too difficult
(a requirement for dictionaries!? What if it's mostly just an accent,
with only phonemic changes relative to surrounding accents?). A
relaxation of the barriers would lead to more de facto recognised
dialects being available to choose from.
As an example, the supposedly "British English" speech synthesizer
voices on my computer (which the OS processes using the tag "en_GB"
from the voice's property list) sound nothing like most of the accents
of the United Kingdom. They would be better marked as "en-received" or
similar.
Consider if you will a speech synthesizer trying to render a website
with the following:
<dialog>
<dt>George Bush
<dd lang="en-US-cowboy">Now that's what I call a stonkin' good supper!
<dt>British Ambassador
<dd lang="en-GB-received">Yes, indeed sir. That would appear to be the
case.
</dialog>
If the synth has available half a dozen male voices variously described
as "en-US" and "en-GB", it would probably not render the dialogue
close to the author's intentions. But if those voice descriptions
could be "en-general", "en-cowboy", "en-drawl", "en-received",
"en-westcountry" and "en-estuary", then the synth would have far more
freedom to select an appropriate voice to use.
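
To make the voice-selection idea concrete, here is a minimal sketch in
Python of how a synth might match a requested tag against its voice
descriptions. The function names and the scoring rule (count shared
subtags after the primary language) are my own assumptions for
illustration, not how any existing synthesizer works, and the dialect
subtags are the hypothetical ones from the example above.

```python
def subtags(tag):
    """Split a tag such as 'en-GB-received' or 'en_GB' into lowercase subtags."""
    return tag.replace("_", "-").lower().split("-")


def pick_voice(requested, voices):
    """Return the voice tag sharing the most subtags with `requested`.

    Hypothetical scoring rule: the primary language must match, and each
    further subtag in common (region or dialect) scores one point.
    Ties go to the earliest voice in the list. Returns None when no
    voice has the same primary language.
    """
    want = subtags(requested)
    best, best_score = None, -1
    for voice in voices:
        have = subtags(voice)
        if have[0] != want[0]:
            continue  # different language, never a candidate
        score = len(set(want[1:]) & set(have[1:]))
        if score > best_score:
            best, best_score = voice, score
    return best


# Voices described only by region, as at present.
region_only = ["en-US", "en-GB"]
# Voices described by the hypothetical dialect subtags.
by_dialect = ["en-general", "en-cowboy", "en-drawl",
              "en-received", "en-westcountry", "en-estuary"]

print(pick_voice("en-US-cowboy", region_only))    # en-US
print(pick_voice("en-US-cowboy", by_dialect))     # en-cowboy
print(pick_voice("en-GB-received", by_dialect))   # en-received
```

With region-only descriptions the best the synth can do is pick some
"en-US" voice; with dialect descriptions it can pick the voice the
author actually asked for.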
I'm sure we can all agree on commonly recognised dialects for English,
as it is a first language for many people on this list, and familiar
for many others. For other languages, compiling a list might involve
asking a scholar for suggestions.
Footnote:
It occurred to me while writing this that perhaps a good solution
would be to use country codes for written content that uses the
national orthography, and dialect tags when transcribing spoken
content or for audio data. You would only combine the two if you were
transcribing the speech of someone with that dialect into the
orthography of a country (maybe not the country of the speaker).
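
As a rough sketch of that rule of thumb in Python (the helper name
`suggest_tag` and the three "medium" values are hypothetical, made up
purely to illustrate the footnote, and are not part of any standard):

```python
def suggest_tag(language, medium, region=None, dialect=None):
    """Build a language tag following the footnote's rule of thumb.

    medium is one of:
      "written"    - text in a national orthography: language + region
      "audio"      - spoken content: language + dialect (region ignored)
      "transcript" - transcribed speech: language + dialect, with the
                     region added only if a national orthography is used
    """
    lang = language.lower()
    if medium == "written":
        if region is None:
            raise ValueError("written content needs an orthography region")
        return f"{lang}-{region.upper()}"
    if medium in ("audio", "transcript"):
        if dialect is None:
            raise ValueError("spoken content needs a dialect subtag")
        if medium == "transcript" and region is not None:
            # Region comes before the dialect (variant) subtag.
            return f"{lang}-{region.upper()}-{dialect.lower()}"
        return f"{lang}-{dialect.lower()}"
    raise ValueError(f"unknown medium: {medium!r}")


print(suggest_tag("en", "written", region="GB"))
print(suggest_tag("en", "audio", dialect="scouse"))
# A Scouse speaker transcribed into US orthography.
print(suggest_tag("en", "transcript", region="US", dialect="scouse"))
```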
- Nicholas Shanks.