RE: Progressing beyond borders—making subtags inclusive

CE Whitehead
Sat Jan 5 19:43:26 CET 2008

OK, but of course many people's dialects are mixed or multiple.
(Mine is, which yields multiple pronunciations of some words; these might or might not be register-dependent.)
But yes, sure, the more subtags the better: storage keeps improving, so we can store all this information. Sometimes, though, it's impossible to identify a particular dialect.
--C. E. Whitehead
cewcathar at
> On 3 Jan 2008, at 19:33, Karen Broome wrote:
>
> > Is it simply up to the user to decide whether to use regional or
> > variant tagging? Or should some guidelines be written to indicate a
> > preference for variant tagging over regional tagging if both exist?
>
> I'd like to second the call for some guidelines to be widely
> disseminated. I am a web developer and would like to see all of the
> web tagged (correctly!) with language data.
>
> My own opinion is that using country codes to define dialects is
> flawed. When borders change, Czechoslovakia splits in two, Germany
> reunifies, etc., then all the old country codes become obsolete even
> though linguistically nothing has changed. When populations are
> displaced they take their language with them.
>
> I feel that all dialects should have their own subtags, not just the
> ones that partisan individuals propose. As a great example, there's a
> subtag for en-scouse but not one for yorkshire, geordie or brummie,
> because the guy that submitted the scouse request has a vested
> interest in his own dialect, and nobody has bothered to register the
> others. The distinction between en-US and en-GB is mainly an
> orthographic one. I say this because en-US represents a cluster of
> dialects and accents, with a unified orthography, and en-GB represents
> a cluster of accents and dialects (some overlapping with en-US), but a
> different orthography. Thus en-GB/US is pretty useless to people who
> are tagging audio data, but quite useful to those tagging written data.
>
> I believe that having a subtag registered is at present too difficult
> (requirement for dictionaries!? what if it's mostly just an accent
> with only phonemic changes relative to surrounding accents). A
> relaxation of the barriers would lead to more de facto recognised
> dialects being available to choose from.
>
> As an example, things like the supposedly "British English" speech
> synthesizer voices on my computer (which the OS processes using the
> tag "en_GB" from the voice's property list) sound nothing like most of
> the accents of the United Kingdom; they would be better marked as
> "en-received" or similar.
>
> Consider if you will a speech synthesizer trying to render a website
> with the following:
>
> <dialog>
> <dt>George Bush
> <dd lang="en-US-cowboy">Now that's what I call a stonkin' good supper!
> <dt>British Ambassador
> <dd lang="en-GB-received">Yes, indeed sir. That would appear to be the
> case.
> </dialog>
>
> The synth has available half a dozen male voices variously described
> as "en-US" and "en-GB", so it would probably not render the dialogue
> closely to the author's intentions, but if those voice descriptions
> could be "en-general", "en-cowboy", "en-drawl", "en-received",
> "en-westcountry" and "en-estuary", then the synth would have far more
> freedom to select an appropriate voice to use.
>
> I'm sure we can all agree on commonly recognised dialects for English,
> as it is a first language for many people on this list, and familiar
> for many others. For other languages compiling a list might involve
> asking a scholar for suggestions.
>
> Footnote:
> It occurred to me while writing this that perhaps a good solution
> would be to use country codes for written content that uses the
> national orthography, and dialect tags when transcribing spoken
> content or for audio data. You would only combine the two if you were
> transcribing the speech of someone with that dialect into the
> orthography of a country (maybe not the country of the speaker).
>
> - Nicholas Shanks.
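
For what it's worth, the voice-selection scenario in the quoted message can be sketched roughly as below. This is only an illustration under stated assumptions: the function name lookup_voice and the VOICES inventory are made up, the "cowboy", "received" and "estuary" subtags are the hypothetical ones from the example rather than registered variants, and the fallback is a simplified truncation in the spirit of RFC 4647 lookup, not a full implementation of it.

```python
def lookup_voice(requested_tag, voices):
    """Return the voice whose tag best matches requested_tag, or None.

    Simplified fallback: try the full tag first, then drop trailing
    subtags one at a time until something matches.
    """
    # Language tags are case-insensitive, so compare in lowercase.
    available = {tag.lower(): name for tag, name in voices.items()}
    # Accept the underscore form ("en_GB") that some systems use.
    subtags = requested_tag.lower().replace("_", "-").split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in available:
            return available[candidate]
        subtags.pop()  # fall back to a less specific tag
    return None


# Hypothetical voice inventory; tags and names are illustrative only.
VOICES = {
    "en-US": "General American",
    "en-US-cowboy": "Cowboy",
    "en-GB": "Generic British",
    "en-GB-received": "Received Pronunciation",
}

print(lookup_voice("en-GB-received", VOICES))  # Received Pronunciation
print(lookup_voice("en-GB-estuary", VOICES))   # Generic British
print(lookup_voice("fr-CA", VOICES))           # None
```

The point of the fallback is that a finer-grained tag costs nothing when no matching voice exists: the synth simply ends up with the same generic "en-GB" voice it would have used anyway.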
