On draft-phillips-langtags-01

Wed Mar 10 22:40:35 CET 2004

We can't lose sight of the fact that tags are used in two very different ways:
identification, and requests.

For identification, the main purpose is to distinguish data that would otherwise
be misinterpreted. So I could tag a word "chat" as being French to prevent it
being interpreted as English. For such cases, over-identification is a clear
mistake. Tagging it as French-Belgian would generally be a mistake, unless I
wanted to distinguish the particular usage from other French. Thus I might tag
"vatche" or "spene" as fr_BE. (If I really cared about nuances, it would
probably be something even finer grained, such as fr_BE-x-central-walloon.) So
whereas it would be possible to tag "chat" as en, en-US, en-GB, ... fr,
fr_FR, ..., one should really only tag it with enough information to
discriminate to the degree necessary. It would also be possible to tag it with
fr_Latn, but in practice that is unnecessary.

It would also be possible to tag with und-Latn, which is true if you give the
most natural interpretation of that sequence, but of limited usefulness. The
only case I can think of where it may be useful to tag with a script alone is
where you have a sequence of text containing characters of ambiguous script
(e.g. punctuation), where the choice of script may be useful in some processes
such as rendering, but where you don't otherwise know enough about the language
to be able to say anything. But for that purpose, I think "und-Latn" would be
reasonable.
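
As a rough illustration of script-only identification, here is a minimal Python
sketch that labels a string "und-Latn" when every letter in it is Latin. The
function name script_only_tag is hypothetical, and the check on Unicode
character names is only a stand-in for real script property data, which
Python's standard library does not expose. Text made up solely of punctuation
or digits gets no tag, since its script is ambiguous.

```python
import unicodedata

def script_only_tag(text):
    """Return 'und-Latn' if every letter in text is Latin, otherwise None.

    Assumption: a character counts as Latin when its Unicode name starts
    with "LATIN". This is a crude approximation of the Script property.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        # Only punctuation, digits or spaces: the script cannot be inferred.
        return None
    for ch in letters:
        if not unicodedata.name(ch, "").startswith("LATIN"):
            return None
    return "und-Latn"

print(script_only_tag("vatche, spene"))
```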

For requests, it is a bit different. Typically it is something like the
following:
   - given a list of <x, y, z> possibilities, give me the best match you have.
A common example is a web page, where we will serve up the best match given what
the user has specified. Here I can't think of any reasonable scenario where
"und-Latn" would have much purpose. However, it is well-formed, and I would
interpret it as a request to provide back information in *some*
written-in-Latin-script language. That is, in such a case, it is acting like a
wildcard (a star, "*").
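
To make the request case concrete, here is a minimal Python sketch of "best
match" selection in which a primary subtag of 'und' in the request is treated
as "any". The function names (parse, matches, best_match) and the matching rule
are assumptions for illustration only; this is not the matching algorithm of
RFC 3066 or of the draft.

```python
def parse(tag):
    """Split a tag into lowercase subtags, accepting '-' or '_' separators."""
    return tag.replace("_", "-").lower().split("-")

def matches(request, content):
    """Return True if the content tag satisfies the request.

    Assumption: 'und' as the requested primary subtag matches any language,
    and every other requested subtag must appear in the content tag.
    """
    req = parse(request)
    con = parse(content)
    if req[0] != "und" and req[0] != con[0]:
        return False
    return all(sub in con[1:] for sub in req[1:])

def best_match(requests, available):
    """Return the first available tag satisfying the earliest request."""
    for request in requests:
        for content in available:
            if matches(request, content):
                return content
    return None

# A request for "some language written in Latin script".
print(best_match(["und-Latn"], ["ja-Jpan", "fr-Latn-BE"]))
```

Under these assumptions, content that omits the script subtag (plain "fr", for
instance) does not satisfy "und-Latn"; whether it should is a design decision.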

What I'm getting to is that in the context of identification, 'und' has a
reasonable interpretation as "indeterminate" or "unknown"; while in the context
of requesting it has a reasonable interpretation of "any". An equivalent code
is not really needed for any other field. ISO 15924 does have Zyyy, which means
"Code for undetermined script", but simply omitting the script from an
RFC3066bis ID serves essentially the same purpose: for identification, being
"indeterminate"; for requesting, being "any".

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

----- Original Message ----- 
From: "Addison Phillips [wM]" <aphillips at webmethods.com>
To: "Jeremy Carroll" <jjc at hplb.hpl.hp.com>
Cc: <ietf-languages at alvestrand.no>
Sent: Wed, 2004 Mar 10 09:35
Subject: RE: On draft-phillips-langtags-01

> Hi Jeremy,
>
> As Mark pointed out, I don't think that UND is illegal, but it may be ill
> advised. The problem is that it looks like a value. I'd almost prefer
> permitting a range-like value (xml:lang="*-hant" or maybe xml:lang="?-hant")
> for wildcarding. This would break so many extant rules that I think the idea
> is a non-starter, though.
>
>
> I agree with your statement:
>
> > Surely if I have a string and all the code points are in a particular
> > script it is correct to say that the string is in that script?
>
> The question isn't whether a particular statement about the script of a text
> is true, but whether once one such statement is true all others must
> necessarily be false.
>
> So I agree with the point raised by Jon Hanna, who captured what I was trying
> to say more succinctly. Using Han as an example, thanks to Han unification
> it is entirely possible that a Japanese text is entirely composed of
> "Traditional Chinese" characters (in fact, it may be entirely composed of
> Simplified characters too). A subset of the tags with which such a text
> could correctly be tagged includes:
>
>   ja-Hant
>   ja-Hans
>   ja-Hani
>   ja-Jpan
>
> This level of script introspection will be difficult to do, compared with,
> say, spotting Cyrillic or Latin texts.
>
> Other examples would include things like a Russian mathematics text that
> includes Greek and Math symbols, Latin math symbols (like plus, minus,
> solidus, etc.) and so on. I think it would be valid to say that that text is
> written in 'ru-Cyrl-RU', in spite of the leavings of other writing systems.
>
> RFC3066:bis didn't invent this problem, of course. It exists with the basic
> 'xx-yy' syntax of the old generative mechanism. There are plenty of texts
> where you know more about the locational aspects of the text ("this document
> is Canadian") than the language (or in which the language is mixed). RFC3066
> says (and RFC3066:bis parrots) that labeling texts with UND and MUL is
> permitted, but probably it is better to say nothing (the empty string),
> since UND and MUL suggest something about the default matching algorithm
> that is false.
>
> That is:
>
>   A request of "en-Latn-US" matches content marked "" but not "UND-Latn"
>
> UND only matches content with requests that specifically ask for content
> which is tagged as UND (unless we are to change the fallback mechanism).
>
> IOW, it might be better to think more deeply about partial tagging (how it
> is permissible, what form it takes, and what makes sense).
>
> Best Regards,
>
> Addison
>
> Addison P. Phillips
> Director, Globalization Architecture
> webMethods | Delivering Global Business Visibility
> http://www.webMethods.com
> Chair, W3C Internationalization (I18N) Working Group
> Chair, W3C-I18N-WG, Web Services Task Force
> http://www.w3.org/International
>
> Internationalization is an architecture.
> It is not a feature.
>
> > -----Original Message-----
> > From: Jeremy Carroll [mailto:jjc at hplb.hpl.hp.com]
> > Sent: mercredi 10 mars 2004 01:50
> > To: aphillips at webmethods.com
> > Cc: ietf-languages at alvestrand.no
> > Subject: Re: On draft-phillips-langtags-01
> >
> >
> >
> > A bit more on one of the points ...
> >
> >
> > >>3) 2.3 point 5
> > >>    Hmmm,  how about 'und-latn': I can probably write a simple program to
> > >>determine the script of a string, and it is probably useful in some cases
> > >>to know that the script is at least something you can read (probably more
> > >>so with pictograms). An alternative would be to allow the primary subtag to
> > >>be omitted e.g. allow 'latn' as a full tag,
> > >>
> > >
> > > I think you're thinking set theory--the set of all content written in Latin
> > > script or the set of all content from China. That's a useful idea for an
> > > application. However, it suggests that we know or can determine
> > > something about the language of some arbitrary content. That is: it's pretty
> > > easy to determine that the text of this email is written in Latin script and
> > > it might be possible to infer that it was written in the USA (although my
> > > email header will confuse that) and it might be possible to infer that it is
> > > written in English. So I can make a language tag of en-Latn-US for it,
> > > right?
> > >
> > > But really this isn't how language tagging should work. Undetermined
> > > language content should be labeled with the empty string, since, presumably,
> > > one doesn't have any positive information about the content. What one infers
> > > may be valid or it may just be happenstance. For example, a small segment of
> > > text might be in Latin script, but the language might be Japanese. Or a word
> > > list might be constructed that is both valid French and valid English.
> > >
> >
> >
> > Surely if I have a string and all the code points are in a particular
> > script it is correct to say that the string is in that script?
> >
> > [[The simplest response is simply to tell me that I have misunderstood the
> > relationship between script codes and character codes]]
> >
> > Assuming this, then adding script codes to RFC3066 for the first time
> > permits automatic addition of correct subtags.
> >
> > Now a tag like und-hant could be added on the basis of the codepoints used,
> > the instruction that und SHOULD not be used, might be applicable or not.
> >
> > If it is applicable then the language tags MUST/SHOULD not be used for
> > script information alone, if it is not applicable then und MAY be used in
> > language tags in which a subtag is known but the primary tag is
> > undetermined.
> >
> >
> > Given that script subtags are a new feature of 3066bis, it is arguable that
> > the text should rule one way or the other on the use of 'und-hant' etc.
> >
> > e.g.
> >
> >
> >     5.  You SHOULD NOT use the UND (Undetermined) code unless the
> >         protocol in use forces you to give a value for the language tag,
> >         even if the language is unknown. Omitting the tag is preferred.
> > [[
> > In particular, you SHOULD NOT use the UND code to construct a
> > language tag such as 'und-hant'; script information SHOULD be omitted
> > when the language is unknown.
> > ]]
> > or
> > [[
> > In particular, you MAY use the UND code to construct a
> > language tag such as 'und-hant', when the
> > appropriate script code is known and the language is unknown.
> > ]]
> >
> > The point being that RFC3066bis describes a
> > 'protocol [that] forces you to give a value for the language tag, even if
> > the language is unknown'
> > in the case when you wish to convey a script code
> >
> > Jeremy