RFC3066bis: looking ahead

Tue Jan 20 22:27:22 CET 2004

> From: Addison Phillips [wM] [mailto:aphillips at webmethods.com]

> I didn't (and don't) understand how your response fixed this issue: two
> kinds of alpha-3 code can appear in the second position, which is ambiguous
> given that certainly SOME of these codes will be the same three letters.

The only ways different kinds of alpha-3 IDs can appear in second
position are:

- registrations like i-ami

- tags that used ISO 3166-2 alpha-3 region IDs

- registrations no-bok, no-nyn, zh-gan, zh-min, zh-min-nan, zh-wuu,
zh-yue

- tags along the lines I'm suggesting, that pair an ISO 639
macrolanguage ID with an ISO 639 individual-language ID, like the
example I gave, zh-yue

The first group aren't a problem because an initial subtag of "i-" can
only be used for tags registered per RFC 1766 or RFC 3066. The second
group we haven't supported up to now (except potentially via
registration, but none have been registered so far).

The third group are a mixed bunch: no-bok and no-nyn have already been
deprecated, and zh-min-nan is obviously a special case requiring either
deprecation or grandfathering. The rest, though, fit exactly with what I
am proposing (assuming gan, min, wuu and yue are IDs used in ISO 639-3).

> I agree that we should be forward looking. As I understand it, though,
> ISO639-3 would not displace ISO639-2 or -1, so we need to have ways to
> express all three kinds of tags, right?

You can't say "all three kinds of tags" since ISO 639-2 and ISO 639-3
belong to a single codespace and are mutually compatible. In fact, it
might be feasible to replace ISO 639-2 with ISO 639-3 in a future RFC
since the latter will contain what is useful from the former. (ISO 639-3
will not include IDs for collections, but it's not particularly useful
IMO to tag content as something like "Indo-European"; but if collections
are still needed, then ISO 639-2 can be kept, or ISO 639-5 can be used
if and when it's published.)

The only problem I foresee is the relationship between macrolanguages
and individual languages that will exist in ISO 639, and I'm suggesting
that we can easily deal with that, in a way that turns out to have
precedent in registrations such as "zh-yue". The only other problem that
might arise would be if we used ISO 3166-2 alpha-3 IDs, and as John
suggested, we should be able to choose not to use those.

> That's "sit-Latn-MY-jingpho-y2008"... the order is constant :-)
>
> It would be poor practice to use such heavyweight tags, but not illegal, and
> the fact that each subtag has intrinsic meaning would allow for matching
> where it makes sense.

Yes, but there wouldn't be any way to match "sit-Latn-MY-jingpho-y2008"
with "sit-jingpho" given the existing language-range mechanism (and it
wouldn't be a simple revision to language-range to support matching in
such a situation). The tag "sit-Latn-MY-jingpho" would match with "sit",
"sit-Latn" or "sit-Latn-MY", all of which are pretty much useless.
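
The language-range rule in RFC 3066 is plain prefix matching: a range
matches a tag if it equals the tag, or equals a prefix of the tag that is
immediately followed by "-". A minimal Python sketch of that rule (the
function name range_matches is mine, purely for illustration) shows why
"sit-jingpho" cannot match the longer tag:

```python
def range_matches(language_range: str, tag: str) -> bool:
    """Prefix matching as described in RFC 3066, section 2.5.

    Comparison is case-insensitive; "*" matches any tag.
    """
    r = language_range.lower()
    t = tag.lower()
    if r == "*":
        return True
    # Exact match, or a prefix that ends exactly at a subtag boundary.
    return t == r or t.startswith(r + "-")


tag = "sit-Latn-MY-jingpho-y2008"
for candidate in ("sit", "sit-Latn", "sit-Latn-MY", "sit-jingpho"):
    print(candidate, range_matches(candidate, tag))
```

Only the first three ranges match. "sit-jingpho" fails because "Latn" and
"MY" sit between the two subtags it names, and prefix matching has no way
to skip over them.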

> If what you're saying is that the ISO639-3 tag doesn't need the ISO639-1/-2
> introduction, then why not just make it the first tag?

Normally, that would be the case. I'm only trying to consider what
should be done in situations when we have something like "yue" (not a
perfect example, given the registered tag), for which existing content
may already be tagged as "zh". That is the *only* situation in which I'm
suggesting a two-part language ID be used.
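
To make the rule concrete, here is a rough Python sketch. The table
MACROLANGUAGE_OF and the function language_subtags are hypothetical names
of mine, and the table entries are assumptions, since ISO 639-3 is not yet
published. An individual-language ID gets a macrolanguage prefix only when
the table lists one; otherwise it stands alone as the first subtag.

```python
# Hypothetical mapping from individual-language ID to the macrolanguage ID
# already in use. The entries are illustrative assumptions only.
MACROLANGUAGE_OF = {
    "yue": "zh",
    "gan": "zh",
    "wuu": "zh",
}


def language_subtags(individual_id: str) -> str:
    """Return the language portion of a tag for an individual-language ID."""
    macro = MACROLANGUAGE_OF.get(individual_id)
    if macro is None:
        # The normal case: the ID is simply the first subtag.
        return individual_id
    # The special case: pair the macrolanguage ID with the individual ID.
    return f"{macro}-{individual_id}"


print(language_subtags("yue"))  # two-part language ID
print(language_subtags("haw"))  # not in the table, so a single subtag
```

A tag such as "zh-yue" built this way still matches a language-range of
"zh" under RFC 3066 prefix matching, so it coexists with content already
tagged "zh".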

> The whole tag is 'language information'. The primary language is the first
> tag, followed by other distinguishing information. Dialects and orthographic
> variations are distinguishing information, by that definition. If what
> you're saying is "dialect trumps script", then we need a structure for that.

That's not what I'm saying because I'm not talking about dialects. I
still think region IDs and perhaps variants might be useful portions for
indicating dialect distinctions.

> But...
> 
> I think what you're saying is that the ISO639-3 codes are really fully
> formed language codes on their own and thus "should go first"

Yes.

> and that at least a subset of these codes have the obvious but inconvenient
> property of identifying closely related languages (which gets us into the
> complex swamp of deciding the difference between a language, a dialect, and
> so forth) and thus merit greater structure in their corresponding RFC3066bis
> tags.

Yes, except that I do not see a complex swamp of deciding differences
between languages, dialects, etc. All we will have to sort through are
the entities listed collectively in the various parts of ISO 639, and
there will just happen to be a small set of cases in which there are
one-to-many mappings between things that have all been considered
individual languages (not collections). In most of these cases, the
entity with more-inclusive scope has already been part of ISO 639-1/-2
and so may already be in use. It is simply that small set of cases
I'm trying to deal with.

> If so, we should, IMO, try to follow the design goals Mark and I had, which
> include unambiguous parsing. The "i-" prefix could be mandated (required) to
> come second...

I'm suggesting that we do not need to introduce something like "i-" to
get unambiguous parsing. The only obstacle to that is ISO 3166-2,
which we haven't used thus far, and can easily avoid in the future.
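
As a rough illustration of why no extra marker is needed, here is a Python
sketch of how a parser might classify the second subtag under this
proposal. It assumes ISO 3166 alpha-3 IDs are never used in tags; the names
GRANDFATHERED and classify_second_subtag are hypothetical, and the
grandfathered set lists only the special cases mentioned in this message.

```python
import re

# Whole tags that need special handling (deprecated or grandfathered).
GRANDFATHERED = {"no-bok", "no-nyn", "zh-min-nan"}


def classify_second_subtag(tag: str) -> str:
    """Classify the second subtag of a tag, assuming no ISO 3166 alpha-3 IDs."""
    lowered = tag.lower()
    if lowered in GRANDFATHERED:
        return "grandfathered or deprecated registration"
    subtags = lowered.split("-")
    if subtags[0] == "i":
        # "i-" tags exist only by registration per RFC 1766 / RFC 3066.
        return "registered tag"
    if len(subtags) < 2:
        return "no second subtag"
    second = subtags[1]
    if re.fullmatch(r"[a-z]{2}", second):
        return "ISO 3166 alpha-2 country code"
    if re.fullmatch(r"[a-z]{3}", second):
        # With ISO 3166 alpha-3 excluded, only one reading remains.
        return "ISO 639 individual-language ID"
    return "other subtag"


for example in ("i-ami", "zh-min-nan", "zh-yue", "en-US"):
    print(example, "->", classify_second_subtag(example))
```

Every three-letter second subtag falls into exactly one branch, which is
the sense in which parsing stays unambiguous.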

> > We could do that, but I was trying to consider the possibility that
> > there is existing content in something like Yue that is already tagged
> > using "zh".
>
> It'll still be tagged 'zh' the day after 'zh-i-yue' (or whatever) is
> allowed. Converting the language tags means converting the language tags,
> regardless of how they are formed.

But I'm not imagining that existing data will need any converting of
language tags at all. I did not bring up any question of conversion.

> > Well, obviously we can't revise RFC 3066 to incorporate ISO 639-3 until
> > the latter is published, which is probably about a year away.
>
> As you say, we can make allowances now, but Mark and I had specific design
> goals, one of which was unambiguous parsing.

And I'm raising the matter so that unambiguous parsing can be
implemented now in a way that will allow us to incorporate ISO 639-3
smoothly a year or so from now.

Peter

Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division