On draft-phillips-langtags-01

Wed Mar 10 10:50:02 CET 2004

A bit more on one of the points ...

>>3) 2.3 point 5
>>    Hmmm,  how about 'und-latn': I can probably write a simple program to
>>determine the script of a string, and it is probably useful in some cases
>>to know that the script is at least something you can read (probably more
>>so with pictograms). An alternative would be to allow the primary subtag
>>to be omitted, e.g. allow 'latn' as a full tag,
>>
> 
> I think you're thinking set theory--the set of all content written in Latin
> script or the set of all content from China. That's a useful idea for an
> application. However, it suggests that we know or can determine
> something about the language of some arbitrary content. That is: it's pretty
> easy to determine that the text of this email is written in Latin script and
> it might be possible to infer that it was written in the USA (although my
> email header will confuse that) and it might be possible to infer that it is
> written in English. So I can make a language tag of en-Latn-US for it,
> right?
> 
> But really this isn't how language tagging should work. Undetermined
> language content should be labeled with the empty string, since, presumably,
> one doesn't have any positive information about the content. What one infers
> may be valid or it may just be happenstance. For example, a small segment of
> text might be in Latin script, but the language might be Japanese. Or a word
> list might be constructed that is both valid French and valid English.
> 

Surely if I have a string and all the code points are in a particular 
script it is correct to say that the string is in that script?

[[The simplest response is simply to tell me that I have misunderstood the 
relationship between script codes and character codes]]

Assuming this, then adding script codes to RFC3066 for the first time 
permits automatic addition of correct subtags.

Now a tag like 'und-hant' could be added on the basis of the code points 
used; the instruction that UND SHOULD NOT be used might or might not apply.
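
To make the "simple program" concrete, here is a minimal sketch in Python. 
The range table, the function names and the choice of returning the empty 
string when no single script is found are all my own assumptions for 
illustration; a real implementation would use the full Unicode script 
property rather than a handful of hard-coded ranges.

```python
# Hypothetical and deliberately incomplete: a few code point ranges per
# ISO 15924 script code. A real tool would use the Unicode Scripts data.
SCRIPT_RANGES = {
    "Latn": [(0x0041, 0x005A), (0x0061, 0x007A),
             (0x00C0, 0x00D6), (0x00D8, 0x00F6), (0x00F8, 0x024F)],
    "Grek": [(0x0391, 0x03A9), (0x03B1, 0x03C9)],
    "Cyrl": [(0x0410, 0x044F)],
}


def script_of(ch):
    """Return the script code for one character, or None if not in the table."""
    cp = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return script
    return None


def guess_script_tag(text):
    """Return 'und-<script>' if every letter is in one known script, else ''."""
    # Only letters are considered; digits, spaces and punctuation are
    # shared between scripts and say nothing about which one is in use.
    scripts = {script_of(ch) for ch in text if ch.isalpha()}
    if len(scripts) == 1 and None not in scripts:
        return "und-" + scripts.pop().lower()
    return ""  # mixed, unknown, or no letters at all: omit the tag


print(guess_script_tag("Hello, world"))
print(guess_script_tag("Hello Привет"))
```

The first call yields 'und-latn'; the second mixes two scripts and yields 
the empty string. Note that this says nothing about the language, which is 
exactly why the primary subtag has to be 'und'.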

If it is applicable, then language tags MUST NOT/SHOULD NOT be used for 
script information alone; if it is not applicable, then 'und' MAY be used in 
language tags in which a subtag is known but the primary subtag is 
undetermined.

Given that script subtags are a new feature of 3066bis, it is arguable that 
the text should rule one way or the other on the use of 'und-hant' etc.

e.g.

    5.  You SHOULD NOT use the UND (Undetermined) code unless the
        protocol in use forces you to give a value for the language tag,
        even if the language is unknown. Omitting the tag is preferred.
[[
In particular, you SHOULD NOT use the UND code to construct a
language tag such as 'und-hant'; script information SHOULD be omitted
when the language is unknown.
]]
or
[[
In particular, you MAY use the UND code to construct a
language tag such as 'und-hant', when the
appropriate script code is known and the language is unknown.
]]

The point being that RFC3066bis itself describes a
'protocol [that] forces you to give a value for the language tag, even if 
the language is unknown'
in the case where you wish to convey a script code.

Jeremy