On draft-phillips-langtags-01

Tue Mar 9 19:19:02 CET 2004

Hi Jeremy,

Thanks for the comments. Some inter-linear responses below.

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: Jeremy Carroll [mailto:jjc at hplb.hpl.hp.com]
> Sent: mardi 9 mars 2004 00:28
> To: aphillips at webmethods.com
> Cc: ietf-languages at alvestrand.no
> Subject: On draft-phillips-langtags-01
>
>
>
> I have some comments on this draft - particularly section 2.3!
>
>
> 1) I like the generativity
>
> 2) I am uncomfortable with the defaults
>      section 2.3
>     particularly point 2
>
>     I worry that the lang tag 'fr' becomes ambiguous with these rules -
> between 'fr-FR' and 'fr' with location undefined or unknown or
> unimportant.

This has, as you note, always been a problem. The defaulting pattern means
that information is lost on each level. While some languages have a sort of
equivalence as you work up the ladder, others are problematic. In fact, the
text in 2.4 says:

  "This relationship is not guaranteed in all cases: specifically, languages
that begin with the same sequence of subtags are NOT guaranteed to be
mutually intelligible, although they may be."

Consider the relationships between:

  zh, zh-Hant, zh-Hans, zh-CN, zh-TW, zh-HK

As software folks we would like there to be a deterministic relationship
between the tags, but human languages don't work that way and the language
tagging scheme is, at best, an *approximation* of an ontology. RFC 3066bis
gives us more tools to work with, but it is still only a model of how
languages work and not a particularly accurate one in many cases. In fact,
you can have quite the amusing debate about what the word "Chinese" (for
example) actually means... ;-).
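
To make the information loss concrete, here is a rough sketch of the
truncation-style defaulting discussed above. The function name
fallback_chain is made up for illustration and is not taken from the
draft; it simply drops one trailing subtag at a time.

```python
def fallback_chain(tag):
    """Return the tag, then each shorter prefix, dropping one subtag at a time.

    Hypothetical helper for illustration only; not defined by the draft.
    """
    subtags = tag.split("-")
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]


tags = ["zh", "zh-Hant", "zh-Hans", "zh-CN", "zh-TW", "zh-HK"]
for tag in tags:
    print(tag, "->", fallback_chain(tag))

# Every tag in the list ends its chain at the same last resort.
last_resorts = {fallback_chain(tag)[-1] for tag in tags}
print(last_resorts)
```

All six tags bottom out at plain 'zh', which is exactly the point: the
truncation is deterministic, but it says nothing about whether content
under zh-Hans and zh-HK is mutually intelligible.
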
>
>     I also suspect that many british english speakers could
> interpret this
> to mean 'en' == 'en-GB' while over the pond 'en' == 'en-US'
>
>    IIRC this problem has been there since 1766. As 3066bis is a move
> forward, and I don't have a solution, I am not particularly
> seeking any change.
>
>   I think we already have the situation where different people mean
> different things with the same tag - not good.
>
> 3) 2.3 point 5
>     Hmmm,  how about 'und-latn': I can probably write a simple program to
> determine the script of a string, and it is probably useful in some cases
> to know that the script is at least something you can read (probably more
> so with pictograms). An alternative would be to allow the primary
> subtag to
> be omitted e.g. allow 'latn' as a full tag,

I think you're thinking set theory--the set of all content written in Latin
script or the set of all content from China. That's a useful idea for an
application. However, it suggests that we know, or can determine,
something about the language of some arbitrary content. That is: it's pretty
easy to determine that the text of this email is written in Latin script and
it might be possible to infer that it was written in the USA (although my
email header will confuse that) and it might be possible to infer that it is
written in English. So I can make a language tag of en-Latn-US for it,
right?

But really this isn't how language tagging should work. Undetermined
language content should be labeled with the empty string, since, presumably,
one doesn't have any positive information about the content. What one infers
may be valid or it may just be happenstance. For example, a small segment of
text might be in Latin script, but the language might be Japanese. Or a word
list might be constructed that is both valid French and valid English.
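
A quick sketch of why the inference is shaky, assuming a naive script
check based on Unicode character names. The helper is_latin_script and
the sample strings are mine, purely for illustration.

```python
import unicodedata


def is_latin_script(text):
    """Naive check: every alphabetic character has a Unicode name starting with LATIN.

    Hypothetical helper; a real detector would use script property data.
    """
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("LATIN") for ch in letters
    )


samples = {
    "romanized Japanese": "sayonara",
    "French or English word list": "table nation place",
    "Japanese in its usual scripts": "日本語",
}
for label, text in samples.items():
    print(label, "->", is_latin_script(text))
```

The first two samples both come back as Latin script, yet the script
alone tells you neither that the first is Japanese nor which language the
second belongs to.
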

>
> 4) 2.3 point 7a
>     The use of surrogates may be necessary. It might be worth reserving
> some of the private use space, e.g. the example uses qx, which
> has earlier
> been described as one of the 'user-assigned codes' (section 2.2).
> Or simply
> noting, in 2.2, that some provisions of 3066bis might actually assign for
> public use some of the private use codes. (I am thinking of the poor user
> who was making genuinely private use of 'qx' before it was taken up as a
> surrogate, in the hypothetical example).

The example is unfortunate. I used a private use code as a hypothetical
example "as if it were a non-private use code". A better example might be
'aa' or 'zz'... I just didn't want to use a real language tag (real now or
in the future).

>
> Jeremy
>
>