Encoding scripts in tags: evil or just unpleasant?

John Cowan cowan at mercury.ccil.org
Fri May 23 09:02:37 CEST 2003


Michael Everson scripsit:

> >at http://www.sil.org/silewp/2000/001/SILEWP2000-001.pdf

BTW, this is the wrong URL:  I meant
http://www.sil.org/silewp/2002/SILEWP2002-003.pdf .  Apparently
it's my morning for blunders.

> >1) If a language not yet in ISO 639 is requested, register it using an
> >   ISO 639 tag qualified by an Ethnologue tag.
> 
> Or some other tag. Ethnologue doesn't cover everything.

Granted.

> >2) If a language is written in multiple scripts, register each script
> >   using the tag for the language qualified by an ISO 15924 tag for
> >   the script.  Evidence should be demanded showing that the language
> >   is indeed written in that script.
> 
> Each? So we have to register yi-Hebr? ga-Latn and ga-Ogam? pt-Arab and 
> pt-Latn?

Indeed.  Getting the wrong script can be worse than getting the wrong
language: if you read only ga-latn, then which would you be happier
with, ga-ogam or gd-latn?

> What about the DUPLICATION OF CODES issue? Isn't no/ny/nb a problem as 
> well?

I absolutely disagree that yi-hebr and yi constitute a duplication; they
have different semantics.  One means "Yiddish in the Hebrew script only",
and the other means "Yiddish".  As for the Norwegian problem, it is probably 
unique and ad hoc, and we should leave it the @#$ alone.
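
To make the difference in semantics concrete, here is a minimal sketch of
prefix matching in the style of RFC 3066 language-ranges: a range matches a
tag if the two are equal or if the range is a prefix of the tag followed by
"-".  The function name `matches` is an assumption for illustration, not
part of any registry or library.

```python
def matches(language_range: str, tag: str) -> bool:
    """Return True if language_range matches tag by RFC 3066-style prefix matching.

    Comparison is case-insensitive; "*" matches any tag.
    """
    r = language_range.lower()
    t = tag.lower()
    # Match on equality, or on a prefix that ends exactly at a "-" boundary.
    return r == "*" or t == r or t.startswith(r + "-")


# "yi" covers Yiddish in any script; "yi-hebr" covers only the Hebrew script.
print(matches("yi", "yi-hebr"))       # a reader of "yi" accepts yi-hebr
print(matches("yi-hebr", "yi"))       # a reader of "yi-hebr" does not accept bare yi
print(matches("ga-latn", "ga-ogam"))  # wrong script
print(matches("ga-latn", "gd-latn"))  # wrong language
```

Under this rule a reader who asks for ga-latn gets neither ga-ogam nor
gd-latn, so any preference between the two has to come from outside plain
prefix matching.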

> This is within one particular script? Examples other than the German ones?

This is the conventional way in which en-us, en-gb, en-ie etc. have been
used, and I believe rightly so.  zh-tw OTOH represents a difference
in writing system (script, considered broadly) and ought to be deprecated.

> >4) If a language is written in multiple scripts *and* has multiple
> >   spelling systems etc. etc., register each spelling system in use for
> >   each script using the tag for the script of the language qualified
> >   by an ISO 3166-1 etc. etc.  Evidence should be demanded etc.
> 
> Each?

Each, yes.  Tags don't cost that much!
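
As a rough illustration of how many tags rule 4 could produce, the sketch
below enumerates language-script-region combinations.  The helper name
`candidate_tags` and the codes "xx", "AA" and "BB" are placeholders, not
real registrations; only combinations backed by evidence of actual use
would be registered.

```python
from itertools import product


def candidate_tags(language, scripts, regions):
    """List every language-script-region combination as a tag string.

    This enumerates candidates only; it says nothing about which
    combinations are actually in use.
    """
    return [
        f"{language.lower()}-{script.title()}-{region.upper()}"
        for script, region in product(scripts, regions)
    ]


# Placeholder language "xx", two scripts, two placeholder regions.
for tag in candidate_tags("xx", ["latn", "cyrl"], ["AA", "BB"]):
    print(tag)
```

The count grows as scripts times spelling systems, which stays small for
any single language.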

> Is it being nasty to backtrack and wonder why <script="Latn"> isn't a 
> cleaner solution than rolling all this into the <lang="yi"> tag? (I 
> suppose this had better be asked again at this stage.)

I think it is simply too late to be purist about what "language tag"
means: they clearly encode more than language proper, but in an ad hoc
and confusing fashion.  Peter Constable's model is, I believe, our best
hope for bringing order into the system.  If it sometimes overprescribes,
that is far better than the current underprescription, which leads to tags
being applied inconsistently (especially the ones that require no
registration).

-- 
One Word to write them all,             John Cowan <jcowan at reutershealth.com>
  One Access to find them,              http://www.reutershealth.com
One Excel to count them all,            http://www.ccil.org/~cowan
  And thus to Windows bind them.                --Mike Champion

