Phonetic orthographies

John Cowan cowan at
Sat Nov 11 20:16:06 CET 2006

Gerard Meijssen scripsit:

> To make things more interesting, how would you indicate the dialects
> of languages like Cantonese or Min-Nan?

With variant subtags.  It is precisely the function of this list to
consider, process, and approve proposals for variant subtags, which
may (as I posted a moment ago) represent orthographical, dialectal,
sociolectal, diatypical, and temporal varieties without limitation.

> The problem of the RFC 4646 is imho, that it does not appreciate
> that ISO-639-3 makes what are considered languages under the previous
> codes macro-languages.

RFC 4646 does not take macrolanguages or ISO 639-3 into account except
in limited and unsystematic ways.  RFC 4646bis will do so.

> Chinese is not the only macro language

Quite so.  All macrolanguages will be treated identically in the proposed
RFC 4646bis.

> There are also codes like bat (Baltic (Other)) in the ISO-639-2 that
> are not part of the ISO-639-3. The consequence is that for instance
> people in the Wikimedia Foundation start a project and call their
> language bat-ltg because they do not want to be associated with Latvian,
> the language that it is a dialect of according to Ethnologue.

No one contends that Ethnologue and ISO 639-3 are either complete or
perfect.  Nor is there such a thing as a compelling objective definition
of "language" (as distinct from "variant") that will be equally useful
for all purposes.  Nevertheless, the ISO standards are what we have got.
Several courses of action are open:

1) Petition the 639-3/RA, with supporting evidence, to recognize Latgalian
as a distinct language.

2) Register "latgalia" as a variant subtag under "lv" using this list.

3) Petition this list to use its extraordinary powers to register
"latgalia" as a language subtag.  This would require compelling evidence
that neither #1 nor #2 is possible.

4) Continue to remain noncompliant; the only penalties for noncompliance
are social ones.
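
To illustrate option 2, here is a minimal sketch in Python of how a tag
carrying a variant subtag can be split by subtag shape alone. It assumes
the simplified shapes from the RFC 4646 ABNF (2-3 letter language, 4-letter
script, 2-letter or 3-digit region, 5-8 character or digit-plus-3 variant)
and deliberately ignores extended language subtags, extensions, private
use, and grandfathered tags. The function name split_tag is my own, and
"latgalia" is only the hypothetical variant discussed above; shape checking
says nothing about whether a subtag is actually registered.

```python
import re

# Subtag shapes, simplified from the RFC 4646 ABNF. This sketch ignores
# extended language subtags, extensions, private use, and grandfathered tags.
LANGUAGE = re.compile(r"^[a-z]{2,3}$")
SCRIPT = re.compile(r"^[a-z]{4}$")
REGION = re.compile(r"^(?:[a-z]{2}|[0-9]{3})$")
VARIANT = re.compile(r"^(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3})$")


def split_tag(tag):
    """Split a tag into language, script, region, and variants by shape only."""
    subtags = tag.lower().split("-")
    if not LANGUAGE.match(subtags[0]):
        raise ValueError("unsupported primary language subtag: " + subtags[0])
    result = {"language": subtags[0], "script": None, "region": None, "variants": []}
    rest = subtags[1:]
    # Script and region are optional and, when present, come in that order.
    if rest and SCRIPT.match(rest[0]):
        result["script"] = rest.pop(0)
    if rest and REGION.match(rest[0]):
        result["region"] = rest.pop(0)
    # Everything left over must look like a variant subtag.
    for subtag in rest:
        if not VARIANT.match(subtag):
            raise ValueError("not variant-shaped: " + subtag)
        result["variants"].append(subtag)
    return result


print(split_tag("lv-latgalia"))
print(split_tag("de-1996"))
```

A real implementation would also look each subtag up in the Language
Subtag Registry; this sketch only shows where a variant sits in the tag.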

> My appreciation of the RFC 4646 is very much that it aims to preserve
> backwards compatibility.

Just so.  Any tag that has been valid at any time since RFC 1766 was
written is still valid today and always will be.  We consider it essential
to maintain the validity of existing data.

> They have been ditched with reason in the ISO-639-3 and the insistence
> to preserve the outdated codes will imho prove to be more of a hindrance
> than of a benefit when you want to make the Internet more multi lingual.

No tags have been ditched, with or without reason.  Macrolanguage code
elements are still present in 639-3.  Language-collection code elements
are part of 639-2, but are excluded from 639-3 as out of scope.

> The RFC 4646 indicates that specific indications of languages are also
> needed for things like spell checking, maybe even CAT or Computer
> Aided Translation programs. To make this function there is a need
> to build upon the existing standardisation work because how do you
> safely indicate dialects, orthographies? There are no public lists
> I know of that help indicate what possibilities are recognised,
> let alone exist for what languages.

The Language Subtag Registry is the registry for currently recognized
variants.  It is known to be very incomplete, so if you have proposals,
make them on this mailing list.

> Indicating orthographies by date is not safe because in Dutch for
> instance we have an official orthography, "het groene boekje" and an
> unofficial one, "de witte lijst", they are both from 2006 and both
> have powerful factions using them. Similar situations exist for several
> other languages that I know of.

Excellent!  Grab a copy of 4646, copy out the language-variant proposal
form, post it here, and make your case.

> One further argument I would like to add, for languages that have not
> such a rich history on the Internet: people will use the ISO-639-3
> code that is specific to their language. Using the logic of the RFC
> 4646 they should however use a different code. Something that will be
> and has been roundly rejected by the people who want to use their code
> for their language.

ISO 639-3 is not yet final, and people who use it, use it at their
own risk.  As soon as it is final (and allowing for the delays of the
IETF process), RFC 4646bis will incorporate it.  All languages in 639-3
will be permitted in 4646bis language tags, either as single language
subtags or as a macrolanguage-encompassed language pair:  thus "ar-arz"
for Egyptian Arabic (for backward compatibility with systems that expect
"ar" to mean any sort of Arabic), but "grt" for Garo.
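
As a rough sketch of the rule just described, in Python: a language that is
encompassed by a macrolanguage gets the macrolanguage subtag as a prefix,
and any other language stands alone. The table MACROLANGUAGE_OF and the
function language_prefix are hypothetical names for illustration, and the
table is a two-entry excerpt, not the full ISO 639-3 mapping.

```python
# Hypothetical excerpt of the ISO 639-3 macrolanguage mapping; only the
# entries needed for the examples are listed.
MACROLANGUAGE_OF = {
    "arz": "ar",  # Egyptian Arabic is encompassed by Arabic
    "yue": "zh",  # Cantonese is encompassed by Chinese
}


def language_prefix(code):
    """Return the language part of a tag under the scheme described above."""
    macro = MACROLANGUAGE_OF.get(code)
    # Encompassed languages are written as a macrolanguage-language pair;
    # all others are a single language subtag.
    return macro + "-" + code if macro else code


print(language_prefix("arz"))  # ar-arz
print(language_prefix("grt"))  # grt
```

The prefix is what lets a system that only knows "ar" keep matching
Egyptian Arabic content.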

> Practically many of the applications that make use of content of the
> Internet already have to check what language the content of a website
> is in because the information is often incorrectly attributed according
> to RFC 4646 or its predecessors.

The most usual reason for incorrect attribution is mere sloppiness or
ignorance, of course, not the limitations of RFC 4646.

John Cowan    cowan at
Half the lies they tell about me are true.
-- Tallulah Bankhead, American actress
