[SPAM] RE: Points 3, 4 and 2 [RE: About: Tags for Identifying Languages (draft-phillips-langtags-01)]

Tue Mar 9 20:28:07 CET 2004

Hi John,

Thanks again for your incisive analysis. More commentary below.

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: John Clews [mailto:scripts20 at uk2.net]
> Sent: mardi 9 mars 2004 04:46
> To: aphillips at webmethods.com
> Cc: John Clews; ietf-languages at alvestrand.no
> Subject: [SPAM] RE: Points 3, 4 and 2 [RE: About: Tags for Identifying
> Languages (draft-phillips-langtags-01)]
>
>
> Hello Addison - thanks for your reply. I still think that there are some
> holes or workarounds that will leave the 3066bis a bit messier than the
> current 3066.

I don't agree, obviously. What, specifically, is messier?

> >
> > We didn't consider LOCODEs in the design of draft-01. I haven't looked
> > that closely at them.
>
> Probably you should. They are widely used in many ICT applications.

I've looked at them. I'd like to hear more from the community rather than
just adding lots and lots of different kinds of subtags because they might
be useful to one user or another. What makes the most sense for identifying
geographical variation in language tags, keeping compatibility with RFC3066?
>
> > The M49 materials cover the immediate needs that Mark and
> > I were dealing with.
>
> But they don't help with more specific descriptions of language, such as
> the example above, and indeed others that have been discussed on this list
> over previous months/years.

I haven't seen a consensus or a demand for LOCODE. Maybe if more folks
chimed in...

More specific isn't necessarily better, if it makes implementation more
complex. It isn't necessarily a bad thing either. I observe that RFC3066
doesn't provide for the additional divisions in LOCODE and that RFC3066:bis
doesn't add or detract from that. RFC3066:bis does give you a mechanism to
use LOCODEs (extensions) if your application needs them. And you can
register specific ones as variants.

>
> > I don't personally care for LOCODE, which is a bit too specific for my
> > tastes. It also incorporates all the problems that ISO3166 has (WRT
> > stability and ambiguity) since it uses ISO3166 as a basis.
>
> The rather stupid reassignment by ISO 3166/MA doesn't invalidate the rest
> of ISO 3166, which earlier RFCs use normatively.

No, it doesn't invalidate 3166 and you shouldn't read 3066:bis or my
comments as implying that. All I'm pointing out is that we didn't
incorporate LOCODE as a new mechanism because it didn't solve the problem we
were looking to solve. Yes, it describes geo location in more (greater)
detail, but that detail isn't necessary to any of my applications and would
(in my personal view) complicate implementation quite a bit. Without any
benefit to my needs, I don't feel a particular urgency to assign language
codes for particular subnational regions.

I should also note that the subdivisions that LOCODE identifies do not
necessarily describe the borders between languages and dialects any better
or worse than countries do. In some cases yes and in other cases not at all.

I should also add that, while I consider ISO3166/MA's choices with regard to
'CS' as unfortunate, I also recognize the problem they face: ISO3166 is
supposed to assign both an alpha2 and alpha3 to every country. If they
cannot reuse, then I can see that they will, one way or another, run out of
alpha2 codes eventually. The current system isn't necessarily evil, but it
relies on the set of countries being more-or-less static over time. A look
at historical maps over any 50-year duration you care to select shows the
fallacy of that assumption.

>
> Locodes has its own problems caused by the YU/CS cockup. But one option
> that they won't have is of using 3-digit codes.

That's an orthogonal problem for LOCODE to deal with.
>
> In addition, some of the software you refer to yourself below also won't
> have the option of using 3-digit codes.

No, but as I pointed out, that software cannot accept any other solution
either.
>
> > Not unless ISO639-3 advances before the draft does. It isn't realistic
> > to incorporate ISO639-3 normatively before it exists...
>
> Agreed entirely, but you should keep a watching brief on ISO 639-3 in my
> view. I would expect that from its current stage, ISO 639-3 might advance
> quite quickly (though Peter Constable's in a better position to comment on
> that than me).

Okay, ISO639-3: it is supposed to be a strict superset of ISO639-2. If that
holds true (and I'm sure Peter and others will ensure that it does so), then
the change to RFC3066bis will be a search-and-replace of the ISO639-2
references with ISO639-3 references.

Peter also pointed out the development of language "family" relationships,
for which we explicitly added the 'extlang' subtags with a note that the
exact usage would be determined as ISO639-3 evolves. I hope Peter will give
us feedback on that mechanism. There is no incompatibility.

>
> > All we can do now is provide support for it (which is what the
> > whole -s-extlang stuff is about). When ISO639-3 is done or nearing
> > completion, RFC3066:bis can itself be revised to incorporate
> > ISO639-3 normatively.
>
> In that case, why not work in tandem with ISO 639-3 development, otherwise
> there will be the danger of even more inconsistencies being added? No
> point in revising it twice.

I don't want to wait several years: I need these changes sooner rather than
later. RFC3066bis was designed to be forward compatible and will be directly
amenable to ISO639-3 when it comes out. Mark and I are not ignoring 639-3.
If there are potential inconsistencies, people should sing out now. But
frankly I don't think there are any such.

>
> > I certainly hope we're not still working on various 3066:bis
> > drafts in a year when that takes place!
>
> Why not, if it's possible that working in tandem may solve many of the
> problems? Planned delays can avoid incompatibilities later. On a much
> larger scale, see the way ISO/IEC 10646 and Unicode came together - that
> wasn't always the case, and there could have been two conflicting codes if
> heads weren't knocked together.

Unicode/10646 is not the same problem. Those groups were in competition.
3066bis and 639-3 are not in competition and will not be incompatible
anyway.

> >
> > I don't see why M49's utility is reduced by that.
>
> I do: special cases tend to require a later need for new rules, or new
> registers, to work around several special cases, later on, typically.

Then you should provide comments on how to specifically resolve
reassignments if you would do it differently. M49 is only for that purpose.
The draft makes it clear.

>
> >> 2. It covers the CS/CS problem well in dealing with ISO 3166 codes
> >> (though naturally it would be better if the ISO 3166/MA didn't do
> >> such stupid things - has anybody heard of top-level actions
> >> regarding the allocation of the CS code in ISO 3166?)
>
> Has anybody heard of any action on that? I've heard of no comments from
> anyone on that.

Yes: CS is Serbia and Montenegro, that is, nothing has changed.

>
> >
> > You can use YU if you want to. It might not be a good idea to do so (you
> > may be offending someone), but it is permitted by rfc3066:bis.
> > 'YU' is a code for a (defunct) country.
>
> But your text uses the specific example:
>
>              cs-CS (Czech for Czechoslovakia)
>
> 'CS' is a code for a (defunct) country.
>
> I don't see why you make a difference between CS and YU.
> Why is CS valid, and YU not?

One more time: both cs-CS and sr-YU are valid under RFC3066:bis. But you
might not want to use either of them any more because those countries are
defunct. NOWHERE does it say in the draft anything about 'YU'. The
question for Serbians and Montenegrins is how to represent their current
country. And I'm suggesting that they might not want to use 'YU' for that
purpose. Since 'CS' is ambiguous, using 891 makes it clear.

>
> > Numeric subtags from M49 are an option only in the case where ISO3166
> > assigns a country an alpha-2 code that was previously assigned to another
> > country. M49 is advertised as being stable and consistent, therefore Mark
> > and I (with support from others on this list) incorporated it as a way of
> > tagging content for countries that have the misfortune to get assigned a
> > secondhand ID.
>
> And what happens if, for example, Serbia and Montenegro split into two
> sovereign states at some point in the future - not altogether unlikely.

The draft says: let ISO3166 and the UN assign codes to the resulting
countries. Their choices will determine how to construct language tags. For
example, if ISO3166 assigns each new country a different, previously unused
code, then the ISO3166 alpha2 codes apply to the new countries. If
(say) Serbia kept 'CS', then presumably the UN number for Serbia would
(continue to) be used (it might be 891 or it might be a different number).
Either way, language tags built on RFC3066:bis remain unambiguous.
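
To make the rule concrete, here is a rough sketch in Python of how an
application might pick the region subtag. It is only an illustration under
my own assumptions: the names REASSIGNED_ALPHA2, region_subtag and
language_tag are made up for this example, the set of reassigned codes is
hard-coded (in practice it would come from the informative registrations in
the IANA registry), and the sketch always prefers the number for a
reassigned code even though the draft only makes that an option.

```python
# Hypothetical sketch: choose between an ISO 3166 alpha-2 code and a
# UN M49 numeric code when building a language tag.

# Assumption: alpha-2 codes that were reassigned to a different country.
# 'CS' is the one case mentioned in this thread.
REASSIGNED_ALPHA2 = {"CS"}


def region_subtag(alpha2, m49):
    """Return the M49 number for a reassigned alpha-2 code, else the alpha-2."""
    code = alpha2.upper()
    if code in REASSIGNED_ALPHA2:
        return m49
    return code


def language_tag(lang, alpha2, m49):
    """Join a language subtag and the chosen region subtag with a hyphen."""
    return f"{lang.lower()}-{region_subtag(alpha2, m49)}"


print(language_tag("sr", "CS", "891"))  # reassigned code: number is used
print(language_tag("fr", "FR", "250"))  # ordinary code: alpha-2 is used
```

With these inputs the first call yields sr-891 and the second fr-FR. If the
country splits later, only the inputs change; the rule stays the same.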
>
> Digits don't solve that problem or other potential future problems; they
> merely disguise it for the present.

I think my explanation above should make it clear that I think it does solve
the problem. I'll agree that the solution isn't as pretty as ISO3166/MA just
making a policy of never reusing codes (or greatly extending the rest period
before reuse). But that's the problem with basing a standard on other
standards, eh?
>
> >> And how would people know when to use digit codes rather than
> >> 2-letter codes?
> >
> > It's very clear in the draft: there will be an informative registration in
> > the IANA registry.
>
> So we have an IANA registry which will have to keep in step with the ISO
> registries? Yet another problem area of versioning.

As opposed to.... please suggest a stability/ambiguity alternative if you
have one.
>
>
> >> And is there any software etc which specifies using 2-letter codes,
> >> which would invalidate use of 3-digit codes?
>
> > There is plenty of software that assumes that region codes all take the
> > alpha2 form. This software will not be able to store the 3-digit code. But
> > then, these applications won't be able to deal with reassignment of codes
> > very well either (the reason for moving to M49).
>
> So that could be a significant body of software which would not be able to
> deal with part of the proposed new RFC3066bis?

There is a significant body of software that cannot deal with the problem
that 3066bis is also trying to deal with. Why is this RFC3066:bis's problem?

Let me illustrate it another way: if the solution were to be that IANA would
keep a registry of alpha2 codes for countries that ISO3166MA assigns
ambiguous codes to, what would happen when ISO3166MA assigns one of those
values to another country? Ick.
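
For what it's worth, the difference between software that assumes alpha2
and software that can take either form is small. The sketch below is
illustrative only and assumes, as discussed in this thread, that a region
subtag is either two letters (ISO3166 alpha2) or three digits (UN M49). The
function names are hypothetical and nothing here is taken from the draft
text.

```python
import re

# Assumption: two letters = ISO 3166 alpha-2, three digits = UN M49.
ALPHA2 = re.compile(r"[A-Za-z]{2}")
M49 = re.compile(r"[0-9]{3}")


def region_kind(subtag):
    """Classify a candidate region subtag as 'alpha2', 'm49', or None."""
    if ALPHA2.fullmatch(subtag):
        return "alpha2"
    if M49.fullmatch(subtag):
        return "m49"
    return None


def alpha2_only_accepts(subtag):
    """Model of software that assumes every region code is two letters."""
    return ALPHA2.fullmatch(subtag) is not None


for candidate in ("CS", "891", "CS2003"):
    print(candidate, region_kind(candidate), alpha2_only_accepts(candidate))
```

The alpha2-only check rejects both '891' and a registered value such as
'CS2003', which is the point made above: that software cannot accept any
disambiguating scheme, numeric or otherwise.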

>
> Rather a problem, it seems to me. A simpler solution involving 2-letter
> codes would be better.

Where do the 2-letter codes come from? A look at ISO3166 shows that there
aren't a lot of codes to choose from and it merely exacerbates the problem.

>
> > Really, RFC3066:bis makes all of this quite a bit easier to deal with.
>
> But it adds further problems too.
>
> > In the past it was possible for there to be registrations (with
> > lengths != 2) with regional meanings. And you can have (as with the sign
> > language codes) two or more subtags with some kind of regional meaning.
>
> Sorry - you lost me there. What do you mean? I hadn't spotted those
> problems, but I'm happy to be enlightened with an example or two.

You can register a code that contains a subtag that describes a regional
variation of a language. By definition, the subtag cannot be 2 characters
long. Thus you could register a subtag for a region and that subtag would
not work with software that wants an alpha2 code. Michael, being a smart
guy, has not permitted any such to be created--other than the sign language
tags. But that doesn't mean that one could not register one under RFC3066
next week.

RFC3066bis directly prohibits this, incidentally.

>
> > RFC3066:bis does away with that. There are ISO3166 alpha2 codes and, in
> > isolated cases, UN M49 codes. Registrations can be made that have
> > "regional" meanings, but these will be limited to the "variant" slot in
> > the tag.
>
> Again a simpler solution would be better. Why not normatively specify ISO
> 3166 at a certain date, before the YU/CS cockup by the ISO 3166/MA?

And new countries do what? Where do the new codes come from? Basically what
we've done is what you suggest, except that we let ISO3166/MA do their work
and only make IANA deal with problems like YU/CS as they arise (which we
hope they won't) and we specify the mechanism for doing so (so that there is
no question about it). Previous versions of the draft required IANA to
register some funky code like CS2003 and for Michael and IANA to keep track
of everything 3166MARA does, as well as the fate of various internal
registrations. Our M49 mechanism makes the UN an "oversight" body for 3166
(how is '891' worse/different than 'CS2003' as a representation of Serbia
and Montenegro??) and M49 has a policy of not reassigning values to
different geographical bodies.

Granted 891 also represented YU and might represent either a "Serbia" or a
"Montenegro" if sometime in the future the current 'CS' breaks up. But this
isn't so bad. Under the current draft, as far as I know, there is exactly
ONE case where a number would get used. It is to deal with YU/CS. Hopefully
this will remain the only case on record.

I think you are making too much of something small. But if you disagree still,
please suggest alternatives that we might adopt.
>
> John
>
> --
> John Clews,
> Keytempo Limited (Information Management),
> 8 Avenue Rd, Harrogate,
> HG2 7PG
> Tel: +44 1423 888 432 (landline)
> Tel: +44 7766 711 395 (mobile)
> Email: scripts20 at uk2.net
>