[Ltru] Re: "mis" update review request

Mark Davis mark.davis at icu-project.org
Fri Apr 20 22:52:59 CEST 2007


> I don't expect to see .cxx, .h, etc. files tagged with language tags any
> time soon

Well, every file available on the web, like
http://www.cs.duke.edu/csed/tapestry/win/date.h (chosen at random) gets some
language tag when processed at Google (I can't say what MSN, Yahoo, and
other search engines do). So right under your nose millions of pages of
source code are getting tagged, all the time. We are faced with the
practical problem of what the best thing to do is according to the standard.
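
To make the decision concrete, here is a minimal sketch in Python of the
fallback choice under discussion. It is only an illustration under stated
assumptions: the function name, the extension list, and the NUL-byte test for
"compiled" content are all hypothetical, and this does not describe what
Google or any other search engine actually does.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical list of source-code extensions; not taken from any standard.
SOURCE_EXTENSIONS = {".h", ".c", ".cxx", ".cpp", ".java", ".py"}


def fallback_language_tag(filename: str, data: bytes) -> Optional[str]:
    """Suggest a tag for content with no declared language (illustrative only)."""
    # Assumption: NUL bytes indicate compiled or binary content, which has
    # no linguistic content, so "zxx" applies.
    if b"\x00" in data:
        return "zxx"
    # Source code has some linguistic content but is not plainly "en";
    # "und" looks like the best choice currently available.
    if PurePosixPath(filename).suffix.lower() in SOURCE_EXTENSIONS:
        return "und"
    # Anything else is left to ordinary language detection.
    return None
```

Under these assumptions, a file like date.h would come out as "und", a
compiled executable as "zxx", and ordinary text would go on to normal
language detection.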

Mark

On 4/20/07, Peter Constable <petercon at microsoft.com> wrote:
>
>  I say your programming code example is a boundary case in the sense that
> I don't expect to see .cxx, .h, etc. files tagged with language tags any
> time soon, and I don't expect to see a book on programming concepts tagged
> as anything other than en, no matter how many pages of source code samples
> it has.
>
>
>
> (Granted, in an XML representation of that book there may be a question as
> to how individual elements should be tagged, but it's not clear to me in
> that scenario what difference it really makes whether you have <code sample
> xml:lang="en"> or <code sample xml:lang="zxx"> or <code sample
> xml:lang="und"> or <code sample xml:lang=""> or whatever.)
>
>
>
>
>
> Peter
>
>
>
> *From:* mark.edward.davis at gmail.com [mailto:mark.edward.davis at gmail.com] *On
> Behalf Of *Mark Davis
> *Sent:* Friday, April 20, 2007 8:59 AM
> *To:* Peter Constable
> *Cc:* ietf-languages at alvestrand.no; ltru at lists.ietf.org
> *Subject:* Re: [Ltru] Re: "mis" update review request
>
>
>
> I don't think the programming language fragment is really a boundary
> condition. Most source code nowadays is not just random hex; there is
> typically, not exceptionally, some real linguistic content. I would agree
> with you that a hex dump of a *compiled* program, such as perhaps you used
> for your example, is sensible to tag as zxx, but based on the wording of the
> standards, I don't think we can expect zxx to apply to typical source code.
> Yet, while there may be some embedded English, we don't want to call it
> "en" either.
>
> It looks to me like the best choice *currently* would be "und"; as I said,
> I think it might be useful to have a special tag for this just because it is
> a reasonably common case that is otherwise difficult to categorize. An
> alternative would be to *explicitly* broaden the description of "zxx" to
> be "no linguistic content, or programming source code". That would be a
> compatible change to 4646bis, since it is a broadening.
>
> Mark
>
> On 4/20/07, *Peter Constable* <petercon at microsoft.com> wrote:
>
> *From:* Mark Davis [mailto:mark.davis at icu-project.org]
>
> > As in example #9 of http://docs.google.com/Doc?id=dfqr8rd5_11g425c9 ,
>
> > to think that the following contains "no linguistic content" is bizarre.
>
>
> > It obviously contains linguistic content.
>
> if (linguisticContent == null) { throw new Exception(""); }
>
>
>
> You could say the same of this:
>
> MZ
>
>    ÿÿ  ¸       @                                   à
> ­º
>  ´            Í!¸LÍ!This program cannot be run in DOS mode.
>
> $       Tbï›
> È
> È
> È7ÅïÈ
> È7ÅüÈ
> È7ÅúÈ
> È
> €ÈÉ
> È7ÅìÈ3
> È7ÅýÈ
> È7ÅùÈ
> ÈRich
> È
>
>
>
> We could probably come up with all kinds of boundary cases for which there
> is no "right" answer. I don't know what use it would be.
>
> Peter
>
>
> _______________________________________________
> Ltru mailing list
> Ltru at ietf.org
> https://www1.ietf.org/mailman/listinfo/ltru
>
>
>
>
> --
> Mark
>


-- 
Mark