<br><br><div><span class="gmail_quote">On 4/16/07, <b class="gmail_sendername">Peter Constable</b> &lt;<a href="mailto:petercon@microsoft.com">petercon@microsoft.com</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div link="blue" vlink="purple" lang="EN-US">

<div>

<p><span style="font-size: 11pt; color: rgb(31, 73, 125);">Re 1: Yes, be careful: (a) the majority of existing legacy usage

of mis is bound to be in MARC, and (b) any existing usage would assume the

context of ISO 639-2 (i.e. mis in existing usage is the exception list for ISO

639-2).</span></p>

<p><span style="font-size: 11pt; color: rgb(31, 73, 125);">&nbsp;</span></p>

<p><span style="font-size: 11pt; color: rgb(31, 73, 125);">Re 2: The mis collection is inherently unstable – unavoidably

so. Prior to 2005-08-16, an implementation of ISO 639-2 would have tagged Ainu

content as mis; after that date, an implementation of ISO 639-2 would have

tagged Ainu content as ain; existing content tagged before that date would not

get retrieved by request for ain, and it would be conformant to suppose that

requests for mis would not return Ainu content. The mis collection is ugly,

pure and simple. So, I don't see what the point is of getting worried

over whether we're making mis unstable: it's been that way for some

time.</span></p></div></div></blockquote><div><br>What I&#39;m saying is that <br><ol><li>Right now in ISO 639-2, we have a number of collections defined by exclusion, where XXX (Other) means any XXX that is not already defined. Thus &quot;bat&quot; means &quot;Any Baltic language that doesn&#39;t already have a code&quot;.

<br></li><li>Those collections are inherently unstable in ISO 639-2, since they contract each time a new XXX language is added.</li><li>The way to make a collection code XXX stable is to stop defining it by exclusion: remove the (Other). [your proposal]

<br></li><li>Then XXX is stable into the future, since adding a new language of the type XXX doesn&#39;t affect it.</li><li>Thus if we change &quot;bat&quot; from &quot;Baltic (Other)&quot; into &quot;Baltic&quot;, meaning any of the Baltic languages, it becomes stable.

</li><li>Such a change, being a broadening, can be carried into BCP 47.<br></li><li>We can apply the same methodology to &quot;mis&quot;. That would change it from the fairly useless -- and unstable -- &quot;Any Language not otherwise encoded&quot;, into &quot;Any language&quot;.

</li><li>It then becomes stable, and useful.<br></li></ol></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div link="blue" vlink="purple" lang="EN-US">

<div><p><span style="font-size: 11pt; color: rgb(31, 73, 125);">(Note: mis is badly defined from a stability perspective, though

I don't think there's much question of how it's defined.)</span></p></div></div></blockquote><div><br>I agree that that is not the current definition of &quot;mis&quot; (see below). <br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div link="blue" vlink="purple" lang="EN-US"><div><p><span style="font-size: 11pt; color: rgb(31, 73, 125);">Re 3(b): "</span>There are times when detection can only

determine that it looks like there is some linguistic content -- it is not just

binary data -- but current detection can&#39;t really determine what it might be.

That is, a code that means &quot;according to our best available detection

methods this doesn&#39;t look like it is zxx&quot;.<span style="font-size: 11pt; color: rgb(31, 73, 125);">" If you want to use

mis for that, I would argue that that is significantly changing the semantics

of mis. (Even though mis is unstable, it is unstable on a qualitative level; this

is a categorical change.) I definitely oppose that. If you want an ID for "undetermined

human language", then that should be proposed. We should not usurp an

existing ID for that purpose.</span></p></div></div></blockquote><div><br>It is a significant broadening of the semantics. And I&#39;m not fixed on that. It just seems that doing that broadening is congruent with the removal of &quot;(Other)&quot; that you&#39;ve proposed in other cases, and transforms a useless and dangerous (for stability) code into a useful code. And since it is a broadening, it is consistent with BCP 47.

<br><br>However, if that is too big a step to stomach, the alternative is to strongly recommend that people never use &quot;mis&quot;, and propose a new code for ISO 639-2 that has the meaning of &quot;Any language&quot;.
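<br><br>To make the stability point concrete, here is a minimal sketch in Python. The language universe and the two code tables are toy assumptions for illustration, not the actual ISO 639-2 tables, and the function names are hypothetical.
<pre>
```python
# Toy model of the stability argument. ALL_LANGUAGES is an assumed,
# deliberately tiny universe of languages, not a real registry.
ALL_LANGUAGES = {"Ainu", "Andamanese", "Etruscan", "Latvian", "Lithuanian"}


def exclusive_mis(coded):
    """mis read as "any language not otherwise encoded".

    The result depends on which languages already have their own code.
    """
    return ALL_LANGUAGES - set(coded.values())


def inclusive_mis(coded):
    """mis broadened to "any language".

    The result ignores the code table entirely.
    """
    return set(ALL_LANGUAGES)


# Hypothetical code tables before and after Ainu receives its own code.
before = {"lav": "Latvian", "lit": "Lithuanian"}
after = dict(before, ain="Ainu")

# Languages that fall out of the exclusion-based mis when ain is added.
shrunk = exclusive_mis(before) - exclusive_mis(after)
print(sorted(shrunk))

# The inclusive definition denotes the same set before and after.
print(inclusive_mis(before) == inclusive_mis(after))
```
</pre>
In this sketch the exclusion-based set loses Ainu as soon as &quot;ain&quot; is added, while the inclusive set is unaffected. That is the sense in which only the broadened definition stays stable once content has been tagged.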

<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div link="blue" vlink="purple" lang="EN-US"><div><p><span style="font-size: 11pt; color: rgb(31, 73, 125);">

Re 4: I don't see how your example differs from this: "Nous

avons une phrase en français (but this is in English)". The fact that the

parenthetical text is in English doesn't change the fact that the other text

is in French. Similarly, in your example, the fact that there is a comment in

English does not change the fact that the rest of the text is not in a human

language. Do we create tags for "French with embedded bits of English"?</span></p></div></div></blockquote><div><br>You have a good point. Again, I&#39;m not hard and fast about this issue, but I think there is definitely a significant distinction in usage between &quot;this is a chunk of stuff that looks like random binary data, like a JPEG&quot;, and &quot;this is stuff that looks like it might be written in a programming language&quot;, a distinction that I think would be useful to provide for in BCP 47. On the detection front, it is much easier to determine &quot;this is random binary&quot;, while not necessarily very easy to determine &quot;this is a programming language fragment&quot;.

<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div link="blue" vlink="purple" lang="EN-US"><div><p><span style="font-size: 11pt; color: rgb(31, 73, 125);">

Peter</span></p>

<p><span style="font-size: 11pt; color: rgb(31, 73, 125);">&nbsp;</span></p>

<div style="border-style: solid none none; border-color: rgb(181, 196, 223) -moz-use-text-color -moz-use-text-color; border-width: 1pt medium medium; padding: 3pt 0in 0in;">

<p><b><span style="font-size: 10pt;">From:</span></b><span style="font-size: 10pt;">

<a href="mailto:mark.edward.davis@gmail.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">mark.edward.davis@gmail.com</a> [mailto:<a href="mailto:mark.edward.davis@gmail.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">

mark.edward.davis@gmail.com</a>] <b>On Behalf

Of </b>Mark Davis<br>

<b>Sent:</b> Monday, April 16, 2007 3:49 PM<br>

<b>To:</b> Peter Constable<br>

<b>Cc:</b> <a href="mailto:ietf-languages@iana.org" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">ietf-languages@iana.org</a>; <a href="mailto:ltru@lists.ietf.org" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">

ltru@lists.ietf.org</a><span class="q"><br>

<b>Subject:</b> Re: [Ltru] Re: &quot;mis&quot; update review request</span></span></p>

</div>

<p>&nbsp;</p><div><span class="e" id="q_111fce7d40efa055_3">

<p>1. I think we have to be very careful here. The meaning of a

standard like ISO 639-2 is established not by <i>what we wish it would have

said, </i>nor by <i>what we would find out if we were able to read Peter&#39;s

mind.</i> It is established by the wording in the standard, and how reasonable

people could interpret it. The fact that &quot;mis&quot; was incorporated in

order to account for MARC codes is interesting, but is not in the text of the

standard. We can&#39;t expect users of BCP 47 to all be able to read Peter&#39;s mind

before tagging. <br>

<br>

2. When we are looking at stability, that is very important: our goal is that

once content is correctly tagged, people can depend on the fact that we will

not change the meaning of a tag out from under them. So clarifications that we

add in future versions of 4646 or the registry are fine, as long as they do not

narrow the range of reasonable interpretations. We can broaden them. So in the

case of &quot;mis&quot;, a proposed narrowing to include just the MARC codes is

clearly disallowed, since it was nowhere stated in ISO 639-2 at the time that

&quot;mis&quot; was added to the language registry (the BCP 47 semantics are

established at the time we add the code). That is one of the key principles of

BCP 47: to isolate us where necessary from instabilities in the source

standards. <br>

<br>

(The one exception we might be able to make is where something is so badly

defined that most reasonable people couldn&#39;t come up with any consistent

definition for it.)<br>

<br>

3. Now, I think there are steps that can be taken to make the above moot. I

think Peter&#39;s suggestion for ISO 639-X of broadening all of the Collections to

remove the (Other) is exactly the right strategy, and if this can be done

before 4646bis is issued, all the better. So having </p>

<ul type="disc">

 <li>aus&nbsp;&nbsp;&nbsp; Australian languages means

     any of the languages on <a href="http://www.ethnologue.com/show_family.asp?subid=90498" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://www.ethnologue.com/show_family.asp?subid=90498</a>

</li>

 <li>bat&nbsp;&nbsp;&nbsp; Baltic (Other) =&gt; Baltic

     languages, means any of the languages on <a href="http://www.ethnologue.com/show_family.asp?subid=90207" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://www.ethnologue.com/show_family.asp?subid=90207

</a></li>

 <li>mis&nbsp;&nbsp;&nbsp; Miscellaneous languages,

     essentially the root for <a href="http://www.ethnologue.com/family_index.asp" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://www.ethnologue.com/family_index.asp</a></li>

</ul>

<p style="margin-bottom: 12pt;">and so on. This is useful on a

number of levels; it resolves a number of problems in the interpretation of

language codes, and makes the source standards themselves more stable. (In the

ideal case, we would have codes for each of the possible &quot;decision points&quot;

in the language tree. That is, if we look at any language code such as <a href="http://www.ethnologue.com/show_lang_family.asp?code=eng" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://www.ethnologue.com/show_lang_family.asp?code=eng

</a>

we&#39;d have codes for each of the parent groupings, not just some of them, like

&quot;Australian languages&quot;.) <br>

<br>

3(b). Randy raised the issue as to whether &quot;mis&quot; in the broad sense is

useful (as something that has linguistic content, but I don&#39;t know what it is).

It very much follows the model in #3. There are times when detection can only

determine that it looks like there is some linguistic content -- it is not just

binary data -- but current detection can&#39;t really determine what it might be.

That is, a code that means &quot;according to our best available detection

methods this doesn&#39;t look like it is zxx&quot;. <br>

<br>

4. I&#39;m leery of using zxx for programming languages, instead of just binary.

There is clearly some linguistic content in &quot;if (content == null) { /*

remove the item in the lookup table */ ...}&quot;. Maybe we need another code

for this, something different than either &#39;art&#39; or &#39;zxx&#39;. <br>

<br>

Mark</p>

<div>

<p><span>On 4/14/07, <b>Peter Constable</b>

&lt;<a href="mailto:petercon@microsoft.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">petercon@microsoft.com</a>&gt;

wrote:</span></p>

<p>From: Randy Presuhn [mailto:<a href="mailto:randy_presuhn@mindspring.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">randy_presuhn@mindspring.com</a>]<br>

<br>

<br>

&gt; I find it very hard to believe that a reasonable analysis<br>

&gt; (whether done by human or machine) would classify a text as <br>

&gt; being &quot;mis&quot; without being able to recognize which of the<br>

&gt; languages in that grouping the text belonged to.&nbsp;&nbsp;I can<br>

&gt; believe someone could look at text and say &quot;it&#39;s a Slavic<br>

&gt; language, but I&#39;m not sure which one.&quot;&nbsp;&nbsp;Do we really think <br>

&gt; someone or something would look at some text and say &quot;it&#39;s<br>

&gt; Ainu, Andamanese, or Etruscan, but I can&#39;t tell which, so<br>

&gt; I&#39;ll tag it &#39;mis&#39;&quot;?<br>

<br>

If someone were so tempted, I would argue that would be inappropriate use of

mis. Since they do not know what it is, their declaration is that the language

identity is not determined, and the appropriate tag for that is und.

Appropriate use of mis does not require that one know the language of the

content; it does, however, require that one know it is *not* a language covered

by any of the available tags. <br>

<br>

<br>

<br>

Peter<br>

<br>

_______________________________________________<br>

Ltru mailing list<br>

<a href="mailto:Ltru@ietf.org" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">Ltru@ietf.org</a><br>

<a href="https://www1.ietf.org/mailman/listinfo/ltru" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">https://www1.ietf.org/mailman/listinfo/ltru

</a></p>

</div>

<p><br>

<br clear="all">

<br>

-- <br>

Mark </p>

</span></div></div>

</div>

</blockquote></div><br><br clear="all"><br>-- <br>Mark