draft-phillips-langtags-08, process, specifications, and extensions

Bruce Lilly blilly at erols.com
Wed Jan 5 09:50:01 CET 2005


>  Date: 2005-01-02 19:47
>  From: "Addison Phillips [wM]" <aphillips at webmethods.com>

> > > It would be entirely possible for "en-Latn-US-boont" to be 
> > registered under the terms of RFC 3066.
> > 
> > But it hasn't been. No RFC 3066 parser will therefore find
> > that complete tag in its list of IANA registered tags, nor
> > will it be able to interpret "Latn" as an ISO 3166 2-letter
> > country code.
> 
> RFC 3066 parsers already should not interpret "Latn" as an ISO 3166 region code. It isn't two letters long.

Correct. The point is that "en-Latn-US-boont" is neither a
registered IANA tag nor a tag whose first two subtags are as
specified by RFC 3066, and is therefore not a *valid* language
tag.  If I use such a string in place of a language-tag in an
RFC 2047 encoded-word-like construct (sketched below) and feed
it to a validating parser, I am informed that:
1. there is no valid language-tag
2. a language tag having 3-8 characters in the second subtag
   would have to be registered
3. RFC 1958 section 3.12 specifies use of registered names
4. RFC 2047 requires that a sequence beginning with =? and
   ending with ?= be a valid encoded-word (which is not the
   case here, due to an invalid language-tag-like string
   where a valid language-tag is supposed to appear)
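
To make the construct concrete, here is a rough sketch in Python
(purely illustrative; the parsing is deliberately naive).  The
encoded-word below follows the RFC 2231 extension that carries a
language tag after "*" in the charset field, and the regular
expression mirrors the RFC 3066 ABNF, so it checks only
well-formedness, not validity:

    import re

    # Encoded-word carrying a language tag per the RFC 2231 extension.
    ENCODED_WORD = "=?US-ASCII*en-Latn-US-boont?Q?Hello?="

    # RFC 3066 ABNF: 1-8 ALPHA primary subtag, then 1-8 alphanumeric
    # subtags; this is well-formedness only.
    LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")

    lang = ENCODED_WORD.split("?")[1].split("*")[1]
    print(LANGUAGE_TAG.fullmatch(lang) is not None)  # True: well-formed,
    # but "en-Latn-US-boont" is neither registered nor language plus
    # country, so a validating parser still reports points 1-4 above.
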
Now you might say that such an interpretation of the respective
RFCs is pedantic, and you'd be right -- that is the point of a
validating parser.  You might then ask "what about a
non-validating parser?"  My brief answer is that results would
in general be unpredictable (which is to say that there is a
high risk of failure to interoperate).

More specifically, there are a number of things that an
individual implementation might or might not do.  It might or
might not try to decode the alleged encoded-word for
presentation (bear in mind, here and in the following
discussion, that "presentation" might include a screen reader
(text-to-speech) for the visually impaired).  If it does not,
the raw characters comprising the string will be presented, and
they will not necessarily be intelligible, particularly to a
layman who lacks detailed knowledge of RFC 2047 and of
language-tags (which Peter has told us are not meant to be seen
by mere mortals).

If it elects to attempt presentation, it may need to decide what
language to use (particularly, as noted, for screen readers).
It might in that case use the longest left-most portion which is
recognizable as a comprehensible (i.e. having a defined meaning)
language tag, which in this case is simply "en" (remember, we
are talking about RFC 3066 parsers, and "en-Latn" is neither
registered nor comprised of language code plus country code).
I will leave to your judgment whether something in
en-[Latn-]US-boont is likely to be intelligible to a listener
when presented as if it were generic en, noting that we have
already had a discussion about the directionality of specificity
of language tags -- and in this case, if the listener has any
indication of the specified language, it will be what the parser
can determine, viz. (plain) English.
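
To make the fallback concrete, here is a rough sketch in Python;
the set of "known" tags below is a made-up stand-in for whatever
registered tags and code lists a particular implementation
happens to ship with, not any real parser's data:

    # Hypothetical snapshot of tags an implementation "knows"
    # (lower-cased so comparisons can be case-insensitive).
    KNOWN_TAGS = {"en", "en-us", "sl-rozaj"}

    def fallback_tag(tag):
        """Return the longest left-most portion of `tag` recognizable
        to this implementation, or None if even the first subtag is
        unknown."""
        subtags = tag.split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate.lower() in KNOWN_TAGS:
                return candidate
            subtags.pop()        # remove-from-right truncation
        return None

    print(fallback_tag("en-Latn-US-boont"))   # -> "en" (generic English)
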

> As for RFC 3066 parsers being unable to interpret the tag, what do you think happens now? New tags are registered all the time and these don't appear in the putative list of tags inside extant RFC 3066 parsers. The parsers don't know what the tag means, but that doesn't invalidate its use for content in that language or by end users, now does it?
> 
> For a concrete example, think about "sl-rozaj", just over a year old. None of the browsers in my browser collection, not even Firefox, knows what that tag means, but all of them accept it and emit it in my Accept-Language header and no web sites have complained about it. Okay, I'm not getting any Resian content back (but then it isn't first in my A-L list either).
[...]
> An RFC 3066 parser has no way of recognizing a tag registered after the parser's list of tags was created. Therefore RFC 3066 parsers do not, as a rule, reject unknown tags. Making sense of a tag is subjective in the case of generative tags today in any event. The level of sense required of an RFC 3066 parser is generally that it be able to use the remove-from-right matching rule on ranges and tags until it finds a value it "knows".
[...]
> No, a strict RFC 3066 parser has to have an up-to-the-second list of registered tags. Unless you've just written an implementation that foolishly does it, no implementations reject unknown tags as long as the tags fit the ABNF requirements of RFC 3066. Draft-langtags utilizes this fact to its advantage and actually tidies things up a bit.

Tags are registered relatively infrequently; none have been
added in the last 6 months.  You are quite correct that there
is an issue regarding updates to a registry, however:
1. that applies also to your proposal to register subtags; new
   entries won't be known to validating parsers until said
   parsers are updated (see the sketch after this list)
2. it is not unique to language-tags; for example, MIME
   application subtypes seem to be added at a fast and furious
   pace, certainly much more frequently than language-tags
3. as use (at least theoretically) does not begin until after
   registration, the issue under the current arrangement isn't
   so bad (and wouldn't be so bad under the proposal but for
   the existence of an installed base that has no built-in
   knowledge of the "4-characters means script", etc. rules
   which are not present in RFC 3066 or its predecessor).
4. because there is a need to be able to validate and use
   registered tags when off-line, there seems to be no
   general solution to the problem, particularly as the
   number of items in the registry would vastly increase
   under the proposed draft
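
To illustrate point 1, a trivial sketch (the snapshot contents
are made up):

    # Purely illustrative: a validator can only answer from whatever
    # local snapshot of the registry it was built or last updated
    # with; this made-up snapshot predates the "sl-rozaj" registration
    # mentioned above.
    REGISTERED_TAGS = {"en-boont", "en-scouse", "zh-hakka"}

    def is_registered(tag):
        return tag.lower() in REGISTERED_TAGS

    # Reported unknown until the implementation is updated; the same
    # lag would apply to any registry of subtags.
    print(is_registered("sl-rozaj"))    # -> False
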

> > Language is not exclusively associated with text.  It is also a
> > characteristic of spoken (sung, etc.) material (but script is
> > not).
> 
> Yes, I agree. Script is important to textual applications of language tags, though. The fact that it is not applicable to aural or otherwise signed representations of language has nothing to do with whether scripts might need to be indicated on content that is written.
> 
> > Note my use of "or" not "and".  I certainly did not state that the
> > information could be obtained from charset alone in all cases.
> 
> Groping the text is a very poor mechanism for determining the writing system used. Your suggestion is that we *should* be *forced* to grope the text. It also appears to be your position that we should *not* be given a mechanism whereby users can indicate a script preference when selecting language content.

Let me be clear: I am not in any way suggesting that indication
of script or other characteristics peculiar to written material
should be prohibited, nor am I suggesting that applications
"*should* be *forced* to grope the text" for such an indication.
Rather, I am suggesting that an orthogonal mechanism might be
used, such that it can be applied to written text without
interfering with non-textual media; I have unofficially
suggested a hypothetical Content-Script field as one possible
approach, and have noted that the existing mechanisms for
indicating charset may be adequate in many cases for determining
script (e.g. text with a charset of ANSI_X3.4-1968 is certainly
Latin script, not Cyrillic, and the reverse is true of text with
a charset of INIS-cyrillic).

While "groping the text" would certainly be a poor choice for
large texts (e.g. a message body), it might be appropriate in
circumstances where the amount of text is strictly limited to a
small chunk and the real estate for a language-tag is also
strictly limited; the poster child would seem to be an IDN
label, which (prefix, tag, plus Cuisinart-processed UTF-8 name)
has to fit into 63 octets.
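
A minimal sketch of the sort of charset-based inference I have
in mind (Python; the mapping is illustrative and nowhere near
complete):

    # Charsets whose repertoire pins down the script; many charsets
    # (UTF-8 among them) determine nothing.  Illustrative only.
    CHARSET_SCRIPT = {
        "ansi_x3.4-1968": "Latn",   # US-ASCII: Latin script only
        "us-ascii":       "Latn",
        "inis-cyrillic":  "Cyrl",
    }

    def script_from_charset(charset):
        """Return the script implied by the charset, or None when the
        charset does not determine it and some other indication (a
        hypothetical Content-Script field, or groping a small chunk
        of text) would be needed."""
        return CHARSET_SCRIPT.get(charset.lower())

    print(script_from_charset("ANSI_X3.4-1968"))   # -> "Latn"
    print(script_from_charset("UTF-8"))            # -> None
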

> > The analogous way to handle that in Internet protocols would be
> > via Content-Script and Accept-Script where relevant (which they
> > would not be for audio media).
> 
> I think that's an awful idea. Why should users have to set two headers to get one result?

Users typically don't set header fields in an HTTP transaction
(for example) any more than they set bits in an IP packet header;
that's handled by user agents or other protocol-handling entities
as a matter of communicating information between agents according
to a protocol.  Well-designed protocols transfer pieces of
orthogonal information by mechanisms which provide for handling
those pieces of information individually and which therefore do
not burden entities with having to process unnecessary data.
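
For illustration only, here is the kind of thing a user agent
might emit if the two pieces of information were carried
separately; recall that Accept-Script is the hypothetical field
under discussion, not an existing header:

    # Hypothetical request headers; "Accept-Script" does not exist in
    # any current protocol.
    request_headers = {
        "Accept-Language": "sr-CS, sr;q=0.8",
        "Accept-Script": "Latn",
    }

    # An entity that cares only about language reads one field and can
    # ignore the other entirely.
    language_preference = request_headers["Accept-Language"]
    print(language_preference)    # -> "sr-CS, sr;q=0.8"
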

As an example of an application where having the information
separate is important, consider a web search where one is
willing to accept results in the Serbian language as used in
Serbia and Montenegro in any medium (video+audio, audio only,
text), but where one wishes to restrict text results to Latin
script only.  Specifying language and script separately permits
return of audio and video content matching "sr-CS" as well as
text which matches both the language and the script as
specified.  Specifying instead that results must match both
language and script according to the proposed draft syntax and
matching rules (viz. "sr-Latn-CS") would necessarily exclude
non-textual media unless such media were inappropriately labeled
with script (we are agreed that such labeling would be
inappropriate).
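
A rough sketch of that comparison (Python; the catalogue, its
labels, and the matching function are all illustrative, assuming
RFC 3066 section 2.5 style prefix matching):

    catalogue = [
        {"kind": "audio", "lang": "sr-CS", "script": None},
        {"kind": "text",  "lang": "sr-CS", "script": "Cyrl"},
        {"kind": "text",  "lang": "sr-CS", "script": "Latn"},
    ]

    def combined_tag(item):
        # Tag the item would carry under the draft's combined scheme:
        # script inserted for text, omitted (as agreed) for audio/video.
        if item["script"]:
            lang, region = item["lang"].split("-")
            return "-".join([lang, item["script"], region])
        return item["lang"]

    def matches_range(tag, range_):
        # The range matches if it equals the tag or is a prefix of it
        # ending at a "-" boundary (RFC 3066 section 2.5).
        tag, range_ = tag.lower(), range_.lower()
        return tag == range_ or tag.startswith(range_ + "-")

    # Orthogonal criteria: language "sr-CS" for everything, script
    # "Latn" applied only to textual items.
    separate = [i for i in catalogue
                if matches_range(i["lang"], "sr-CS")
                and (i["kind"] != "text" or i["script"] == "Latn")]

    # Single combined range "sr-Latn-CS": the audio item, correctly
    # labelled without a script, no longer matches at all.
    combined = [i for i in catalogue
                if matches_range(combined_tag(i), "sr-Latn-CS")]

    print(len(separate))   # -> 2 (audio plus Latin-script text)
    print(len(combined))   # -> 1 (Latin-script text only)
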

