My old paper, feedback on registration proposals

Thu May 29 17:50:31 CEST 2003

Peter Edberg scripsit:

> Okay, I was finally able to work through the maze of Apple process to
> get my old paper on RFC3066 extensions posted. This is from February,
> but is updated to include some annotations from e-mail discussions in
> March with Peter Constable, Michael Everson, and others. The URL is:

Thanks for going to this trouble.

> I also support the need to have tags for all of the particular cases
> requested by Mark Davis:

Yet another stone in the arch of consensus.

> My only questions are about the specific form of the tags used for the
> Chinese cases. These registrations are intended to be steps on the path
> to a productive model using ISO 15924 script tags. However, "Hans" and
> "Hant" are not currently in ISO 15924; is there a commitment to add them?

I believe that Michael (qua ISO 15924 RA rather than qua IETF language
tag reviewer) has said that he will register them as soon as the RA is
properly launched.

Comment, Michael?

> The same scheme could be used to eliminate the other variants in
> ISO 15924, thus giving "en-Latn-Fraktur" -> "en-Fraktur" instead of
> "en-Latf", etc. It could also be used to cover other cases, such as
> polytonic/monotonic modern Greek: "el-Grek-poly" -> "el-poly".

I would weakly favor adding "Grep" and "Grem" as tags for polytoniko and
monotoniko Greek, but I am willing to be talked out of this.

> - For Mongolian, we need mn-Cyrl and mn-Mong.
> - For Malay, we need ms-Latn and ms-Arab.
> - For Tatar, we will probably need tt-Latn and tt-Cyrl.

I think these are no more controversial than Mark Davis's proposals.
I urge you to go ahead and file the RFC 3066 forms.

> - For Irish, we need a way to indicate the old orthography (with dots
> above) and the modern orthography (using h instead): e.g. "ga-Latn-dots"
> -> "ga-dots" ??

I'm not clear on what "Latg" means; whether it implies the special shapes of
t, g, etc. as well as the dots, or can be applied when the dots are present
but the glyph shapes are those of Latn.

Comment, Michael?

> - For English, we need a way to indicate English written in an
> orthography that is restricted to the ASCII subset of characters only,
> versus the full range of possible characters (curly quotes, em-dashes,
> etc). This is actually one of the localizations that we support. Perhaps
> "en-Latn-ASCII" -> "en-ASCII" ??

Hmm.  This seems more problematic to me, and I'd like to hear what others think.

Here are my specific comments on your excellent paper:

ISO 639-3 (p. 4):  As I understand it, the Ethnologue codes will be
changed so that 1-1 matches with 639-2 will use the same 3-letter code
elements.  As an interim measure, I would favor using the Ethnologue
code as the second subtag following an ISO 639 tag: this happens to not
conflict with any existing registration.

ISO 3166 (p. 5):  The 2-letter codes of ISO 3166-1 are as unstable as
the codes of ISO 639 have historically been (but I believe that the ISO
639 folks have been beaten into submission in this respect, and won't be
changing codes any more :-)).  The U.N. Statistical Office defines 3-digit
codes for the large-scale regions of the Earth, which are non-overlapping
with the 3-digit codes of ISO 3166.  A 3-digit second subtag could not
be a date, and could be uniformly understood as a region code.
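
As a rough illustration of that last point, here is a minimal Python sketch
of telling the two kinds of second subtag apart by shape alone.  The function
name and the category labels are my own inventions for this example, not
anything defined by RFC 3066 or by the paper, and the sample tag es-419
(419 being the U.N. code for Latin America and the Caribbean) is likewise
only an illustration.

```python
import re

def classify_second_subtag(tag):
    """Classify the second subtag of a language tag by its shape only.

    Hypothetical helper: the category labels are illustrative and do
    not come from RFC 3066.
    """
    parts = tag.split("-")
    if len(parts) < 2:
        return "none"
    second = parts[1]
    # Three ASCII digits: read as a U.N. region code under the proposal.
    if re.fullmatch(r"[0-9]{3}", second):
        return "un-region"
    # Two ASCII letters: an ISO 3166 country code, as in RFC 3066.
    if re.fullmatch(r"[A-Za-z]{2}", second):
        return "iso3166-country"
    return "other"

print(classify_second_subtag("es-419"))
print(classify_second_subtag("en-GB"))
```

Because the test is purely on shape, no lookup table is needed to decide
which kind of code is present, which is the attraction of the scheme.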

zh (p. 6): I agree with Peter that zh means "Sinitic languages called
'Chinese'" or something similar: it's too late to use it as a synonym for
"Mandarin".

en-gb-us (p. 9):  I have actually seen books published in en-us-gb; that
is, the spelling and punctuation of an American book were altered by the
British publisher, but the diction, syntax, and idiom were left alone.
Nevertheless, it may not be necessary to allow this possibility as a
productive one in RFC 3066bis.

default writing systems (p. 12):  I think it more perspicuous, when a
writing system component is omitted, to understand not a default script
but an unspecified one.  Thus a sound recording in English can be tagged
simply en without implying the Latin script, and rather than solemnly
declaring en to be the same as en-latn, we simply allow en to represent
any writing system, and assume that Latin will prevail.  Because of the
existence of Braille, almost any language that is written at all has at
least two writing systems.  Still, an RFC 3066bis might wish to inquire
about usual writing systems in the registration form.
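
To make the "unspecified, not default" reading concrete, here is a small
Python sketch of how a matcher might treat an omitted script as a wildcard.
Both function names are hypothetical, and the sketch assumes (purely for
illustration) that any 4-letter alphabetic subtag is an ISO 15924 script
code; "Brai" is the 15924 code for Braille.

```python
def script_of(tag):
    """Return the 4-letter script subtag in lowercase, or None if omitted.

    Assumption for this sketch: any 4-letter alphabetic subtag after the
    first is a script code.
    """
    for sub in tag.split("-")[1:]:
        if len(sub) == 4 and sub.isalpha():
            return sub.lower()
    return None

def matches(content_tag, requested_tag):
    """True if the tags agree, treating an omitted script as unspecified."""
    content_lang = content_tag.split("-")[0].lower()
    requested_lang = requested_tag.split("-")[0].lower()
    if content_lang != requested_lang:
        return False
    content_script = script_of(content_tag)
    requested_script = script_of(requested_tag)
    # An omitted script matches any script; it is not read as a default.
    if content_script is None or requested_script is None:
        return True
    return content_script == requested_script

print(matches("en", "en-Latn"))
print(matches("en-Brai", "en-Latn"))
```

Under this reading a bare en matches both en-Latn and en-Brai, whereas
declaring en identical to en-Latn would wrongly exclude the Braille case.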

-- 
There are three kinds of people in the world:   John Cowan
those who can count,                            http://www.reutershealth.com
and those who can't.                            jcowan at reutershealth.com