comments on the draft - 2

Tue Jun 8 09:22:34 CEST 2004

Some further comments:

Section 2.3:

Item 3 after par. 3:

<quote>
   3.  When a language has both an ISO 639-1 2-character code and an ISO
       639-2 3-character code, you MUST use the ISO 639-1 2-character
       code.
</quote>

I may have suggested this before, but we should enumerate, in an
appendix, the precise, fixed list of 2-character ISO 639-1 IDs that are
allowed. These would consist of those that exist at present. This would
remove any concern that a 2-character ID might at some point be added to
ISO 639-1 for a language that previously had only a 3-character ID in
ISO 639-2. There has been reference to a "freeze", but I do not think it
is a good idea to make the stability of this protocol depend on another
standard being unnecessarily constrained, and it is inappropriate to
expect that standard to be limited in its ability to meet the needs of
some users because of concerns that lie within a different, consuming
protocol.

This would mean removing the length NOTE after point 6. (I realize that
NOTE was carried forward from RFC 3066, but IMO it is problematic in
that it is presented as a quotation yet has no source reference.)
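
To make the enumerated-list proposal concrete, here is a minimal sketch
of how a consumer might select a language subtag against a frozen list.
The function name and the set contents are illustrative assumptions (a
real appendix would list every ISO 639-1 code existing at publication
time), and "hw" is a made-up code standing in for a hypothetical later
addition to ISO 639-1.

```python
# Illustrative subset only (assumption): the real appendix would contain
# every 2-character ISO 639-1 code that exists when the RFC is published.
FROZEN_ALPHA2 = frozenset({"en", "fr", "de", "zh"})


def choose_language_subtag(alpha3, alpha2=None):
    """Pick the subtag to use for a language.

    The 2-character code is used only if it appears in the frozen list;
    otherwise the 3-character ISO 639-2 code is used, even if ISO 639-1
    has since gained a 2-character code for the language.
    """
    if alpha2 is not None and alpha2.lower() in FROZEN_ALPHA2:
        return alpha2.lower()
    return alpha3.lower()


if __name__ == "__main__":
    print(choose_language_subtag("eng", "en"))  # code is in the frozen list
    print(choose_language_subtag("haw"))        # no 2-character code at all
    print(choose_language_subtag("haw", "hw"))  # hypothetical later addition
```

With a list like this, a later addition to ISO 639-1 does not change
which tag is valid for existing data, and ISO 639-1 itself does not need
to be frozen.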

Item 4 after par 3:

<quote>
       NOTE: At present
       all languages that have both kinds of 3-character code also are
       assigned a 2-character code, and the displeasure of developers
       about the existence of two different code sets has been
       adequately communicated to ISO. So this situation will hopefully
       not arise.
</quote>

This is a provocative comment that isn't really necessary. The 22 cases
of differences have a history; the members of the ISO 639 Joint Advisory
Committee have for some time been aware of the undesirability of such
differences, and did not need to be informed by the authors or anyone
else of the displeasure of developers to determine that they do not
want to create any new such cases. I would simply say,

<suggested>
NOTE: At present, all languages that have distinct "B" and "T" 
identifiers in ISO 639-2 are also assigned a 2-character identifier
in ISO 639-1. It is unlikely that a situation will arise in which 
distinct "T" and "B" ISO 639-2 identifiers exist but no 2-character
identifier exists, but should such a situation arise, it will be
clear which must be used.
</suggested>

Section 2.4, par 4: There appears to be ambiguous usage of "tag" between
the meanings 'a symbolic identifier as defined in this specification'
and 'a declaration of linguistic properties of information objects'.
Specifically, some of the bullet points seem to refer to multiple values
as a "tag":

<quote>
   The relationship between the tag and the information it relates to...

   o  For a single information object, it could be taken as the set of
      languages...

   o  For an aggregation of information objects, it should be taken as
      the set of languages...

   o  For information objects whose purpose is to provide alternatives,
      the set of tags associated with it should be regarded as a hint
      that the content is provided in several languages, and that one
      has to inspect each of the alternatives in order to find its
      language or languages. In this case, a tag with multiple
      languages...
</quote>

A tag as defined in this RFC cannot denote multiple languages (unless it
uses a collective ID from ISO 639-2 -- but I don't think that's what was
in mind).

Again, I know this was carried forward from the previous RFC (so I
should have caught it when I was reviewing the draft for that five years
ago).

Section 2.4.1, par 1: Wording could be tightened up (is a language range
a set or a symbol?). 

<quote>
   A Language Range is a set of languages whose tags all begin with the
   same sequence of subtags. The following definition of language-range
   is derived from HTTP/1.1 [14].

      language-range = language-tag / "*"
</quote>

What's given in the rule is not a definition but a grammar. The opening
sentence contains a definition, but the definition describes something
other than the thing produced by the grammar (one's a set of languages,
the other is a set of formal-language sentences). Here's a suggested
revision:

<suggestion>
   A Language Range is a set of languages whose tags all begin with the
   same sequence of subtags. A given language range can be represented
   by the sequence of subtags that is common to the languages in the
   given set. The specification for language-range tags is as follows,
   taken from HTTP/1.1 [14].

      language-range = language-tag / "*"
</suggestion>
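
To show the difference between the range as a symbol and the set of tags
it stands for, here is a minimal sketch of a matcher. It assumes the
prefix-matching rule of RFC 3066 (a range matches a tag if it equals the
tag, or equals a prefix of the tag that is immediately followed by "-")
and case-insensitive comparison; the function name is my own.

```python
def range_matches(language_range, tag):
    """Return True if `tag` falls within `language_range`.

    Assumes RFC 3066 style matching: "*" matches any tag; otherwise the
    range must equal the tag or be a prefix ending at a subtag boundary.
    """
    if language_range == "*":
        return True
    r = language_range.lower()
    t = tag.lower()
    return t == r or t.startswith(r + "-")


if __name__ == "__main__":
    for candidate in ("en", "en-US", "eng", "fr"):
        print(candidate, range_matches("en", candidate))
```

The string "en" here is the symbol; the set of tags for which the
function returns True is what the symbol represents.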

2.4.3: Is this saying that extensions should be put into alphabetical
order when *generating* tags, or when *comparing* tags? 

Also, in par 3, it says "... is correctly ordered...": is "correctly"
the appropriate word here, or is "canonically" better? The bottom line
is *can* I tag data or send a request using (e.g.)
"en-B-ext3-ext2-A-ext1"? Does the RFC permit me to do so or not? 
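
To make the generating-versus-comparing distinction concrete, here is a
minimal sketch under one possible reading: unordered extensions are
permitted on input, and ordering by singleton is a canonicalization step
applied before comparison. Whether the draft actually intends this is
exactly my question. The function names are my own, and the sketch
assumes extensions are introduced by single-character subtags, with "x"
starting a private-use sequence that is left untouched.

```python
def canonicalize_extensions(tag):
    """Reorder extension sequences by their singleton, case-insensitively.

    Assumptions: the first subtag is never an extension singleton; any
    later single-character subtag starts an extension, except "x", which
    starts private use and is copied through unchanged.
    """
    subtags = tag.split("-")
    head = [subtags[0]]
    i = 1
    # Everything before the first single-character subtag stays in place.
    while i < len(subtags) and len(subtags[i]) != 1:
        head.append(subtags[i])
        i += 1

    extensions = []   # each entry: [singleton, subtag, subtag, ...]
    private_use = []
    while i < len(subtags):
        subtag = subtags[i]
        if len(subtag) == 1:
            if subtag.lower() == "x":
                private_use = subtags[i:]
                break
            extensions.append([subtag.lower()])
        else:
            extensions[-1].append(subtag)
        i += 1

    # Order the sequences by singleton; order inside a sequence is kept.
    extensions.sort(key=lambda ext: ext[0])
    ordered = [s for ext in extensions for s in ext]
    return "-".join(head + ordered + private_use)


def same_tag(a, b):
    """Compare two tags after canonicalizing extension order and case."""
    return canonicalize_extensions(a).lower() == canonicalize_extensions(b).lower()


if __name__ == "__main__":
    print(canonicalize_extensions("en-B-ext3-ext2-A-ext1"))
```

Under this reading, "en-B-ext3-ext2-A-ext1" is acceptable input and
compares equal to "en-a-ext1-b-ext3-ext2". If the draft instead means
that only the ordered form is valid, a generator would have to apply the
same reordering before emitting the tag.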

3.1: Re the registration form: Is there some IETF policy that restricts
us to asking only for the native name of a language *transcribed into
ASCII*?

3.2, par 4: Will tags like "zh-Hant" and "en-boont" be marked as
*obsoleted* or *superseded*? Here you say "obsoleted"; back in section
2.2.1 you said "superceded".

All for now.

Peter Constable