[Ltru] Re: "mis" update review request

Peter Constable petercon at microsoft.com
Wed Apr 18 02:33:27 CEST 2007


On 1: I disagree: taking "other" out of mis is a categorical change - it creates a completely different concept, because the heart of the concept of mis is "other".

On 1b ("language" vs. "languages"): I disagree: while the content tagged is in a single language, the concept that the ID represents is a collection of languages. The ID represents that concept, not the content; we associate the ID with the content to indicate an association of the concept with the content.

On 4: Again, I disagree. This is like saying, "It's out of scope, mostly but not completely." Either it's in scope or it's out of scope.


Peter

From: Kent Karlsson [mailto:kent.karlsson14 at comhem.se]
Sent: Tuesday, April 17, 2007 12:10 PM
To: Peter Constable; ietf-languages at iana.org; ltru at lists.ietf.org
Subject: RE: [Ltru] Re: "mis" update review request

on 1:

I don't see why 'mis' would have to be an exception when making the semantic change of removing (implicit or explicit) "other" from various language codes. Doing so is just as much a semantic change for 'tai' (or any other "other" collection), and of exactly the same kind, so if it is not ok for 'mis' it would not be ok for 'tai' either. (If you prefer another acronym, say 'any' instead of 'mis', that is another ball game.)

Furthermore, since 'mul' is the only code intended for multiple languages (when it is not practical to list which languages, per fragment of the document preferably), all of the "languages" codes should instead refer to "language" in singular. This would not be a semantic change, just referring to each of the items that may be tagged, not a set of items [book shelf...] so tagged.

on 4:

Programming languages of various sorts are out of scope (like 'zxx', but unlike 'art'), but I may agree that they are out of scope in a different way than 'zxx'. Perhaps "formal language" ('for'), with no further subdivision (they are still out of scope).

        /kent k

________________________________
From: Peter Constable [mailto:petercon at microsoft.com]
Sent: Tuesday, April 17, 2007 2:19 AM
To: ietf-languages at iana.org; ltru at lists.ietf.org
Subject: RE: [Ltru] Re: "mis" update review request
Re 1: Yes, be careful: (a) the majority of existing legacy usage of mis is bound to be in MARC, and (b) any existing usage would assume the context of ISO 639-2 (i.e. mis in existing usage is the exception list for ISO 639-2).

Re 2: The mis collection is inherently unstable - unavoidably so. Prior to 2005-08-16, an implementation of ISO 639-2 would have tagged Ainu content as mis; after that date, an implementation of ISO 639-2 would have tagged Ainu content as ain; existing content tagged before that date would not get retrieved by request for ain, and it would be conformant to suppose that requests for mis would not return Ainu content. The mis collection is ugly, pure and simple. So, I don't see what the point is of getting worried over whether we're making mis unstable: it's been that way for some time.

(Note: mis is badly defined from a stability perspective, though I don't think there's much question of how it's defined.)
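
As a rough illustration of the Ainu case in Re 2, here is a minimal Python sketch. The function names and the sample document records are hypothetical, and treating 2005-08-16 itself as the first day on which ain is used is an assumption; none of this is defined by ISO 639-2 or BCP 47.

```python
from datetime import date

# Date given in the message above as the point where tagging of Ainu changed.
# Assumption: the date itself is treated as the first day 'ain' is used.
AIN_AVAILABLE = date(2005, 8, 16)


def tag_for_ainu(tagged_on: date) -> str:
    """Tag an ISO 639-2 implementation would assign to Ainu content on a given day."""
    return "ain" if tagged_on >= AIN_AVAILABLE else "mis"


def retrieve(documents, requested_tag):
    """Return titles of documents whose stored tag exactly matches the request."""
    return [d["title"] for d in documents if d["tag"] == requested_tag]


# Two hypothetical Ainu documents, tagged before and after the change.
docs = [
    {"title": "Ainu text A", "tag": tag_for_ainu(date(2004, 1, 1))},
    {"title": "Ainu text B", "tag": tag_for_ainu(date(2006, 1, 1))},
]

found_for_ain = retrieve(docs, "ain")  # misses the older document
found_for_mis = retrieve(docs, "mis")  # still returns Ainu content
```

Both documents are in Ainu, yet a request for ain finds only the later one, and a request for mis still returns the earlier one.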

Re 3(b): "There are times when detection can only determine that it looks like there is some linguistic content -- it is not just binary data -- but current detection can't really determine what it might be. That is, a code that means "according to our best available detection methods this doesn't look like it is zxx"." If you want to use mis for that, I would argue that that is significantly changing the semantics of mis. (Even though mis is unstable, it is unstable on a qualitative level; this is a categorical change.) I definitely oppose that. If you want an ID for "undetermined human language", then that should be proposed. We should not usurp an existing ID for that purpose.

Re 4: I don't see how your example differs from this: "Nous avons une phrase en français (but this is in English)". The fact that the parenthetical text is in English doesn't change the fact that the other text is in French. Similarly, in your example, the fact that there is a comment in English does not change the fact that the rest of the text is not in a human language. Do we create tags for "French with embedded bits of English"?


Peter

From: mark.edward.davis at gmail.com [mailto:mark.edward.davis at gmail.com] On Behalf Of Mark Davis
Sent: Monday, April 16, 2007 3:49 PM
To: Peter Constable
Cc: ietf-languages at iana.org; ltru at lists.ietf.org
Subject: Re: [Ltru] Re: "mis" update review request

1. I think we have to be very careful here. The meaning of a standard like ISO 639-2 is established not by what we wish it would have said, nor by what we would find out if we were able to read Peter's mind. It is established by the wording in the standard, and how reasonable people could interpret it. The fact that "mis" was incorporated in order to account for MARC codes is interesting, but is not in the text of the standard. We can't expect users of BCP 47 to all be able to read Peter's mind before tagging.

2. When we are looking at stability, that is very important: our goal is that once content is correctly tagged, people can depend on the fact that we will not change the meaning of a tag out from under them. So clarifications that we add in future versions of 4646 or the registry are fine, as long as they do not narrow the range of reasonable interpretations. We can broaden them. So in the case of "mis", a proposed narrowing to include just the MARC codes is clearly disallowed, since it was nowhere stated in ISO 639-2 at the time that "mis" was added to the language registry (the BCP 47 semantics are established at the time we add the code). One of the key principles of BCP 47 is to isolate us where necessary from instabilities in the source standards.

(The one exception we might be able to make is where something is so badly defined that most reasonable people couldn't come up with any consistent definition for it.)

3. Now, I think there are steps that can be taken to make the above moot. I think Peter's suggestion for ISO 639-X of broadening all of the Collections to remove the (Other) is exactly the right strategy, and if this can be done before 4646bis is issued, all the better. So having

 *   aus    Australian languages means any of the languages on http://www.ethnologue.com/show_family.asp?subid=90498
 *   bat    Baltic (Other) => Baltic languages, means any of the languages on http://www.ethnologue.com/show_family.asp?subid=90207
 *   mis    Miscellaneous languages, essentially the root for http://www.ethnologue.com/family_index.asp
and so on. This is useful on a number of levels; it resolves a number of problems in the interpretation of language codes, and makes the source standards themselves more stable. (In the ideal case, we would have codes for each of the possible "decision points" in the language tree. That is, if we look at any language code such as http://www.ethnologue.com/show_lang_family.asp?code=eng we'd have codes for each of the parent groupings, not just some of them, like "Australian languages".)

3(b). Randy raised the issue as to whether "mis" in the broad sense is useful (as something that has linguistic content, but I don't know what it is). It very much follows the model in #3. There are times when detection can only determine that it looks like there is some linguistic content -- it is not just binary data -- but current detection can't really determine what it might be. That is, a code that means "according to our best available detection methods this doesn't look like it is zxx".

4. I'm leery of using zxx for programming languages, instead of just binary. There is clearly some linguistic content in "if (content == null) { /* remove the item in the lookup table */ ...}". Maybe we need another code for this, something different than either 'art' or 'zxx'.

Mark
On 4/14/07, Peter Constable <petercon at microsoft.com> wrote:
From: Randy Presuhn [mailto:randy_presuhn at mindspring.com]


> I find it very hard to believe that a reasonable analysis
> (whether done by human or machine) would classify a text as
> being "mis" without being able to recognize which of the
> languages in that grouping the text belonged to.  I can
> believe someone could look at text and say "it's a Slavic
> language, but I'm not sure which one."  Do we really think
> someone or something would look at some text and say "it's
> Ainu, Andamanese, or Etruscan, but I can't tell which, so
> I'll tag it 'mis'"?

If someone were so tempted, I would argue that would be inappropriate use of mis. Since they do not know what it is, their declaration is that the language identity is not determined, and the appropriate tag for that is und. Appropriate use of mis does not require that one know the language of the content; it does, however, require that one know it is *not* a language covered by any of the available tags.
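
The distinction between und and mis drawn above can be sketched as a small decision rule. This is only an illustration under stated assumptions: the function name and its two parameters are hypothetical and are not part of any standard or library.

```python
from typing import Optional


def choose_tag(own_tag: Optional[str], known_uncovered: bool) -> str:
    """Pick a tag following the reasoning in the paragraph above.

    own_tag: the specific tag for the language, if the tagger has identified
        the language and a tag for it exists; otherwise None.
    known_uncovered: True only if the tagger positively knows the language is
        not covered by any available tag (the identity may still be unknown).
    """
    if own_tag is not None:
        return own_tag   # a specific tag exists, so use it
    if known_uncovered:
        return "mis"     # known to fall outside every available tag
    return "und"         # identity simply not determined


tag_identified = choose_tag("ain", False)
tag_uncovered = choose_tag(None, True)
tag_unknown = choose_tag(None, False)
```

The point of the second parameter is that mis is a positive claim about coverage, while und makes no claim at all.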



Peter




--
Mark