On draft-phillips-langtags-01

Wed Mar 10 18:35:36 CET 2004

Hi Jeremy,

As Mark pointed out, I don't think that UND is illegal, but it may be
ill-advised. The problem is that it looks like a value. I'd almost prefer
permitting a range-like value (xml:lang="*-hant" or maybe xml:lang="?-hant")
for wildcarding. This would break so many extant rules that I think the idea
is a non-starter, though.

I agree with your statement:

> Surely if I have a string and all the code points are in a particular
> script it is correct to say that the string is in that script?

The question isn't whether a particular statement about the script of a text
is true, but whether once one such statement is true all others must
necessarily be false.

So I agree with the point raised by Jon Hanna, who captured what I was trying
to say more succinctly. Using Han as an example, thanks to Han unification
it is entirely possible that a Japanese text is entirely composed of
"Traditional Chinese" characters (in fact, it may be entirely composed of
Simplified characters too). A subset of the tags with which such a text could
be correctly tagged includes:

  ja-Hant
  ja-Hans
  ja-Hani
  ja-Jpan

This level of script introspection will be difficult to do, compared with,
say, spotting Cyrillic or Latin texts.

Other examples would include things like a Russian mathematics text that
includes Greek letters and math symbols, Latin math symbols (like plus, minus,
solidus, etc.) and so on. I think it would be valid to say that that text is
written in 'ru-Cyrl-RU', in spite of the leavings of other writing systems.
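
To make the point concrete, here is a rough sketch of what code-point-based
script spotting can and cannot tell you. It is an approximation only: Python's
standard library does not expose the Unicode Script property, so the
hypothetical helper script_counts() below guesses a script from the first word
of each character's Unicode name, which is an assumption of this sketch and
not how a production detector would work.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count alphabetic characters per (approximate) script.

    The "script" is taken to be the first word of the character's Unicode
    name, e.g. LATIN, CYRILLIC, GREEK, CJK. This is a crude stand-in for
    the real Unicode Script property.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, math operators, spaces
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts
```

Run on a Russian sentence containing the letters alpha and beta, this reports
both CYRILLIC and GREEK, so the tagger still has to decide that the text is
"in" Cyrillic. Run on a string of Han ideographs it reports only CJK, which
says nothing about Hant versus Hans, let alone the language.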

RFC3066:bis didn't invent this problem, of course. It exists with the basic
'xx-yy' syntax of the old generative mechanism. There are plenty of texts
where you know more about the locational aspects of the text ("this document
is Canadian") than the language (or in which the language is mixed). RFC3066
says (and RFC3066:bis parrots) that labeling texts with UND and MUL is
permitted, but probably it is better to say nothing (the empty string),
since UND and MUL suggest something about the default matching algorithm
that is false.

That is:

  A request of "en-Latn-US" matches content marked "" but not "UND-Latn"

Content tagged UND matches only requests that specifically ask for content
tagged as UND (unless we are to change the fallback mechanism).
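
A minimal sketch of that fallback behaviour, assuming the default matching
simply truncates the request one subtag at a time and finally falls back to
untagged content. The function names fallback_chain() and matches() are made
up for illustration and are not taken from either RFC.

```python
def fallback_chain(request):
    """Yield the request, then each truncation, then the empty tag."""
    subtags = request.lower().split("-") if request else []
    while subtags:
        yield "-".join(subtags)
        subtags.pop()  # drop the rightmost subtag and try again
    yield ""  # untagged content is the last resort


def matches(request, content_tag):
    """True if content_tag is reachable by truncating the request."""
    return content_tag.lower() in fallback_chain(request)


print(matches("en-Latn-US", ""))          # True
print(matches("en-Latn-US", "UND-Latn"))  # False
print(matches("und-Latn", "UND-Latn"))    # True
```

Under these assumptions no truncation of "en-Latn-US" ever produces a tag
starting with "und", which is why the UND-Latn content is never reached.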

IOW, it might be better to think more deeply about partial tagging (how it
is permissible, what form it takes, and what makes sense).

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: Jeremy Carroll [mailto:jjc at hplb.hpl.hp.com]
> Sent: mercredi 10 mars 2004 01:50
> To: aphillips at webmethods.com
> Cc: ietf-languages at alvestrand.no
> Subject: Re: On draft-phillips-langtags-01
>
>
>
> A bit more on one of the points ...
>
>
> >>3) 2.3 point 5
> >>    Hmmm,  how about 'und-latn': I can probably write a simple program to
> >>determine the script of a string, and it is probably useful in some cases
> >>to know that the script is at least something you can read (probably more
> >>so with pictograms). An alternative would be to allow the primary subtag to
> >>be omitted e.g. allow 'latn' as a full tag,
> >>
> >
> > I think you're thinking set theory--the set of all content written in Latin
> > script or the set of all content from China. That's a useful idea for an
> > application. However, it suggests that we know or can determine
> > something about the language of some arbitrary content. That is: it's pretty
> > easy to determine that the text of this email is written in Latin script and
> > it might be possible to infer that it was written in the USA (although my
> > email header will confuse that) and it might be possible to infer that it is
> > written in English. So I can make a language tag of en-Latn-US for it,
> > right?
> >
> > But really this isn't how language tagging should work. Undetermined
> > language content should be labeled with the empty string, since, presumably,
> > one doesn't have any positive information about the content. What one infers
> > may be valid or it may just be happenstance. For example, a small segment of
> > text might be in Latin script, but the language might be Japanese. Or a word
> > list might be constructed that is both valid French and valid English.
> >
>
>
> Surely if I have a string and all the code points are in a particular
> script it is correct to say that the string is in that script?
>
> [[The simplest response is simply to tell me that I have misunderstood the
> relationship between script codes and character codes]]
>
> Assuming this, then adding script codes to RFC3066 for the first time
> permits automatic addition of correct subtags.
>
> Now a tag like und-hant could be added on the basis of the codepoints used,
> the instruction that und SHOULD not be used, might be applicable or not.
>
> If it is applicable then the language tags MUST/SHOULD not be used for
> script information alone, if it is not applicable then und MAY be used in
> language tags in which a subtag is known but the primary tag is
> undetermined.
>
>
> Given that script subtags are a new feature of 3066bis, it is arguable that
> the text should rule one way or the other on the use of 'und-hant' etc.
>
> e.g.
>
>
>     5.  You SHOULD NOT use the UND (Undetermined) code unless the
>         protocol in use forces you to give a value for the language tag,
>         even if the language is unknown. Omitting the tag is preferred.
> [[
> In particular, you SHOULD NOT use the UND code to construct a
> language tag such as 'und-hant'; script information SHOULD be omitted
> when the language is unknown.
> ]]
> or
> [[
> In particular, you MAY use the UND code to construct a
> language tag such as 'und-hant', when the
> appropriate script code is known and the language is unknown.
> ]]
>
> The point being that RFC3066bis describes a
> 'protocol [that] forces you to give a value for the language tag, even if
> the language is unknown'
> in the case when you wish to convey a script code
>
> Jeremy