Progressing beyond borders—making subtags inclusive

Fri Jan 4 16:45:25 CET 2008

On 3 Jan 2008, at 19:33, Karen Broome wrote:

> Is it simply up to the user to decide whether to use regional or  
> variant tagging? Or should some guidelines be written to indicate a  
> preference for variant tagging over regional tagging if both exist?

I'd like to second the call for some guidelines to be widely  
disseminated. I am a web developer and would like to see all of the  
web tagged (correctly!) with language data.

My own opinion is that using country codes to define dialects is
flawed. When borders change (Czechoslovakia splits in two, Germany
reunifies, etc.), all the old country codes become obsolete even
though linguistically nothing has changed. When populations are
displaced they take their language with them.

I feel that all dialects should have their own subtags, not just the  
ones that partisan individuals propose. As a great example, there's a
subtag for en-scouse but not one for Yorkshire, Geordie or Brummie,
because the guy that submitted the scouse request has a vested  
interest in his own dialect, and nobody has bothered to register the  
others. The distinction between en-US and en-GB is mainly an  
orthographic one. I say this because en-US represents a cluster of  
dialects and accents, with a unified orthography, and en-GB represents  
a cluster of accents and dialects (some overlapping with en-US), but a  
different orthography. Thus en-GB/US is pretty useless to people who  
are tagging audio data, but quite useful to those tagging written data.

I believe that having a subtag registered is at present too difficult
(requirement for dictionaries!? what if it's mostly just an accent  
with only phonemic changes relative to surrounding accents). A  
relaxation of the barriers would lead to more de facto recognised  
dialects being available to choose from.

As an example, the supposedly "British English" speech synthesizer
voices on my computer (which the OS processes using the tag "en_GB"
from the voice's property list) sound nothing like most of the
accents of the United Kingdom. They would be better marked as
"en-received" or similar.

Consider if you will a speech synthesizer trying to render a website  
with the following:
<dialog>
<dt>George Bush
<dd lang="en-US-cowboy">Now that's what I call a stonkin' good supper!
<dt>British Ambassador
<dd lang="en-GB-received">Yes, indeed sir. That would appear to be the  
case.
</dialog>

If the synth has available half a dozen male voices variously
described as "en-US" and "en-GB", it would probably not render the
dialogue close to the author's intentions. But if those voice
descriptions could be "en-general", "en-cowboy", "en-drawl",
"en-received", "en-westcountry" and "en-estuary", then the synth would
have far more freedom to select an appropriate voice to use.
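
To make the voice-selection idea concrete, here is a rough sketch in
Python of how a synth might choose among its voices. The function names
and the scoring rule (count the subtags shared after the primary
language) are my own assumptions for illustration, not any real
synthesizer's API or a matching scheme from a standard.

```python
def subtags(tag):
    """Split a language tag into lowercase subtags, accepting '-' or '_'."""
    return tag.replace("_", "-").lower().split("-")


def pick_voice(requested, voices):
    """Return the voice tag that best matches the requested tag (hypothetical rule).

    A voice is a candidate only if its primary language subtag matches.
    Candidates are ranked by how many of the remaining subtags they share
    with the request; ties go to the voice listed first. Returns None if
    no voice shares the primary language.
    """
    want = subtags(requested)
    best, best_score = None, -1
    for voice in voices:
        have = subtags(voice)
        if have[0] != want[0]:
            continue
        score = len(set(have[1:]) & set(want[1:]))
        if score > best_score:
            best, best_score = voice, score
    return best


dialect_voices = ["en-general", "en-cowboy", "en-drawl",
                  "en-received", "en-westcountry", "en-estuary"]
country_voices = ["en-US", "en-GB"]

print(pick_voice("en-US-cowboy", dialect_voices))
print(pick_voice("en-GB-received", dialect_voices))
print(pick_voice("en-US-cowboy", country_voices))
```

With dialect-labelled voices the two speakers get distinct voices that
match the author's markup; with only "en-US" and "en-GB" the best the
synth can do is fall back on the country.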

I'm sure we can all agree on commonly recognised dialects for English,  
as it is a first language for many people on this list, and familiar
for many others. For other languages compiling a list might involve  
asking a scholar for suggestions.

Footnote:
It occurred to me while writing this that perhaps a good solution  
would be to use country codes for written content that uses the  
national orthography, and dialect tags when transcribing spoken  
content or for audio data. You would only combine the two if you were  
transcribing the speech of someone with that dialect into the  
orthography of a country (maybe not the country of the speaker).
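
The rule in this footnote could be written down roughly as follows.
This is only a sketch of the convention I am suggesting: the function
name, the "medium" values and the subtag order are assumptions made for
the example, and the dialect names need not be registered subtags.

```python
def suggest_tag(language, medium, country=None, dialect=None):
    """Build a tag following the footnote's proposed convention (hypothetical).

    written    -> language + country (national orthography)
    audio      -> language + dialect (what is actually spoken)
    transcript -> language + dialect, plus the country whose orthography
                  the transcription uses, if one is given
    """
    if medium == "written":
        parts = [language, country]
    elif medium == "audio":
        parts = [language, dialect]
    elif medium == "transcript":
        parts = [language, country, dialect]
    else:
        raise ValueError("medium must be 'written', 'audio' or 'transcript'")
    # Leave out any subtag that was not supplied.
    return "-".join(p for p in parts if p)


print(suggest_tag("en", "written", country="GB"))
print(suggest_tag("en", "audio", country="GB", dialect="scouse"))
print(suggest_tag("en", "transcript", country="US", dialect="scouse"))
```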

- Nicholas Shanks.