[Ltru] Re: "mis" update review request

Mark Davis mark.davis at icu-project.org
Tue Apr 17 00:49:16 CEST 2007


1. I think we have to be very careful here. The meaning of a standard like
ISO 639-2 is established not by what we wish it would have said, nor by what
we would find out if we were able to read Peter's mind. It is established by
the wording in the standard, and how reasonable people could interpret it.
The fact that "mis" was incorporated in order to account for MARC codes is
interesting, but is not in the text of the standard. We can't expect users
of BCP 47 to all be able to read Peter's mind before tagging.

2. When we are looking at stability, that is very important: our goal is
that once content is correctly tagged, people can depend on the fact that we
will not change the meaning of a tag out from under them. So clarifications
that we add in future versions of 4646 or the registry are fine, as long as
they do not narrow the range of reasonable interpretations. We can broaden
them. So in the case of "mis", a proposed narrowing to include just the MARC
codes is clearly disallowed, since it was nowhere stated in ISO 639-2 at the
time that "mis" was added to the language registry (the BCP 47 semantics are
established at the time we add the code). One of the key principles of
BCP 47 is to isolate us where necessary from instabilities in the source
standards.

(The one exception we might be able to make is where something is so badly
defined that most reasonable people couldn't come up with any consistent
definition for it.)

3. Now, I think there are steps that can be taken to make the above moot. I
think Peter's suggestion for ISO 639-X of broadening all of the Collections
to remove the (Other) is exactly the right strategy, and if this can be done
before 4646bis is issued, all the better. So we would have

   - aus    Australian languages means any of the languages on
   http://www.ethnologue.com/show_family.asp?subid=90498
   - bat    Baltic (Other) => Baltic languages, means any of the
   languages on http://www.ethnologue.com/show_family.asp?subid=90207
   - mis    Miscellaneous languages, essentially the root for
   http://www.ethnologue.com/family_index.asp

and so on. This is useful on a number of levels; it resolves a number of
problems in the interpretation of language codes, and makes the source
standards themselves more stable. (In the ideal case, we would have codes
for each of the possible "decision points" in the language tree. That is, if
we look at any language code such as
http://www.ethnologue.com/show_lang_family.asp?code=eng we'd have codes for
each of the parent groupings, not just some of them, like "Australian
languages".)
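
As a rough sketch of the "decision points" idea, the broadened collection
codes can be modeled as parent links in a tree rooted at "mis". Everything
in the block below is an illustrative assumption: the PARENT table is a
small hand-picked subset rather than registry data, and ancestors() and
covers() are hypothetical helper names.

```python
from typing import Dict, List

# Hypothetical parent links under the broadened reading, where each
# collection code covers everything beneath it and "mis" acts as the root.
# Illustrative subset only; this is not registry data.
PARENT: Dict[str, str] = {
    "eng": "gem",  # English -> Germanic languages
    "lav": "bat",  # Latvian -> Baltic languages
    "gem": "ine",  # Germanic -> Indo-European languages
    "bat": "ine",  # Baltic -> Indo-European languages
    "ine": "mis",  # Indo-European -> root
    "aus": "mis",  # Australian languages -> root
}


def ancestors(code: str) -> List[str]:
    """Return the parent groupings of code, nearest first."""
    chain: List[str] = []
    seen = {code}
    while code in PARENT:
        code = PARENT[code]
        if code in seen:
            raise ValueError("cycle in parent table at " + code)
        seen.add(code)
        chain.append(code)
    return chain


def covers(group: str, code: str) -> bool:
    """True if group is code itself or one of its parent groupings."""
    return group == code or group in ancestors(code)
```

With this table, covers("bat", "lav") holds and covers("bat", "eng") does
not. The chain for "eng" also skips intermediate groupings that have no
code in the table, which is the gap the ideal case would close.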

4. Randy raised the issue as to whether "mis" in the broad sense is useful
(as something that has linguistic content, but I don't know what it is). It
very much follows the model in #3. There are times when detection can only
determine that it looks like there is some linguistic content -- it is not
just binary data -- but current detection can't really determine what it
might be. That is, "mis" would be a code that means "according to our best
available detection methods this doesn't look like it is zxx".

5. I'm leery of using zxx for programming languages, instead of just binary.
There is clearly some linguistic content in "if (content == null) { /*
remove the item in the lookup table */ ...}". Maybe we need another code for
this, something different from either 'art' or 'zxx'.
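
The detection case and the zxx boundary discussed in the last two points
can be sketched as a small tagging decision. This is only an illustration
under stated assumptions: Detection, looks_linguistic, and choose_tag are
hypothetical names, and the fallback to "mis" reflects the broad reading
argued for here. Under the reading in Peter's quoted message below, that
branch would return "und" instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Detection:
    """Hypothetical result of a language detector."""
    looks_linguistic: bool          # False for content such as raw binary data
    language: Optional[str] = None  # a specific language code, if one was identified


def choose_tag(result: Detection) -> str:
    """Pick a tag for detected content under the broad reading of "mis"."""
    if not result.looks_linguistic:
        return "zxx"            # no linguistic content
    if result.language is not None:
        return result.language  # a specific language was identified
    # Linguistic content of some kind, but the detector cannot say which.
    return "mis"
```

Whether source code with natural-language comments should set
looks_linguistic is exactly the open question in the last point; the
sketch takes that flag as an input and does not try to settle it.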

Mark

On 4/14/07, Peter Constable <petercon at microsoft.com> wrote:
>
> From: Randy Presuhn [mailto:randy_presuhn at mindspring.com]
>
>
> > I find it very hard to believe that a reasonable analysis
> > (whether done by human or machine) would classify a text as
> > being "mis" without being able to recognize which of the
> > languages in that grouping the text belonged to.  I can
> > believe someone could look at text and say "it's a Slavic
> > language, but I'm not sure which one."  Do we really think
> > someone or something would look at some text and say "it's
> > Ainu, Andamanese, or Etruscan, but I can't tell which, so
> > I'll tag it 'mis'"?
>
> If someone were so tempted, I would argue that would be inappropriate use
> of mis. Since they do not know what it is, their declaration is that the
> language identity is not determined, and the appropriate tag for that is
> und. Appropriate use of mis does not require that one know the language of
> the content; it does, however, require that one know it is *not* a language
> covered by any of the available tags.
>
>
>
> Peter
>
>



-- 
Mark