[Ltru] RE: "mis" update review request

Mon Apr 16 14:50:03 CEST 2007

Peter Constable wrote:
> Does that apply to its use in xml:lang="" too? That's different from
> omitting xml:lang.
> 
> If you have an XML document with xml:lang="en" for the root element, then
> any subelement is assumed to be in English by default. The attribute
> xml:lang="" is the recommended way of breaking such inheritance. In
> practical terms, it seems to say 'I'm not telling you what the language
> of the content is, but I'm saying it's not English'. Or should it be interpreted
> as saying 'I'm giving no information except that you should not assume
> that the language of the content is the same as declared for its parent'?
> 

In practice xml:lang="" is most likely to arise when merging information 
from multiple XML documents. Some of this may be correctly tagged, but 
much of it (unfortunately the majority) is likely to have no language 
information.

When doing such a merge automatically (which my software does 
frequently) and without using heuristics, a plausible approach is to 
preserve exactly the information given in the input. The output 
document may then tag the very same long string in one place with the 
correct language information, and in another with xml:lang="".

For example, I may be merging two blogs, and my output XML document may be:

<merge xml:lang="en">
   The comments made on this topic were:
  <comment retrievedFrom="http://good.example.org/">
    <opinion>I think this is an excellent idea.</opinion>
  </comment>
  <comment retrievedFrom="http://bad.example.org/" xml:lang="">
    <opinion>I think this is an excellent idea.</opinion>
  </comment>
</merge>

Here the first comment was correctly tagged, and the second wasn't. The 
(hypothetical) merge software then deletes the redundant tag on the 
first comment, and adds a necessary tag to the second (necessary for 
correct verbatim quoting, since otherwise the second comment would 
inherit "en" from the root element).
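
A rough Python sketch of that merge step, purely as an illustration: 
the attach() function and its arguments are hypothetical names of mine, 
not part of any existing tool, and it assumes the caller already knows 
the language (or lack of one) declared in each source document.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def attach(parent, parent_lang, child, source_lang):
    """Append child to parent, preserving exactly the language
    information that the source gave (None means it gave none)."""
    if source_lang == parent_lang:
        # Redundant: the child inherits the same value anyway.
        child.attrib.pop(XML_LANG, None)
    elif source_lang is None:
        # Unknown language: break inheritance from the parent.
        child.set(XML_LANG, "")
    else:
        child.set(XML_LANG, source_lang)
    parent.append(child)

merge = ET.Element("merge", {XML_LANG: "en"})
good = ET.Element("comment", {"retrievedFrom": "http://good.example.org/",
                              XML_LANG: "en"})
bad = ET.Element("comment", {"retrievedFrom": "http://bad.example.org/"})
attach(merge, "en", good, "en")   # tag on the first comment is dropped
attach(merge, "en", bad, None)    # second comment gets xml:lang=""
```

No heuristics are involved here; the function only decides which tags 
are needed so that the merged document says no more and no less than 
the inputs did.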

I don't think xml:lang="" should be given any semantics other than a 
processing one: software should not apply any language knowledge from 
the parent element to this element. (This does not rule out more 
sophisticated ways of detecting language, e.g. in my example above, 
detecting that the two strings are identical, and long enough that 
they are unlikely to be in two different languages - so we could 
heuristically tag the second string as "en" too.)
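
To make that processing-only reading concrete, here is a minimal 
Python sketch of how a reader might resolve the language in effect for 
each element. The function name effective_languages() is my own 
invention for illustration, and the sketch deliberately applies no 
heuristics: an empty xml:lang simply yields "unknown" (None).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def effective_languages(element, inherited=None):
    """Yield (element, language) pairs; language is None when unknown."""
    declared = element.get(XML_LANG)
    if declared is None:
        lang = inherited      # no attribute: inherit from the parent
    elif declared == "":
        lang = None           # xml:lang="": discard the inherited value
    else:
        lang = declared       # explicit tag overrides the parent
    yield element, lang
    for child in element:
        yield from effective_languages(child, lang)

doc = ET.fromstring("""
<merge xml:lang="en">
  <comment retrievedFrom="http://good.example.org/">
    <opinion>I think this is an excellent idea.</opinion>
  </comment>
  <comment retrievedFrom="http://bad.example.org/" xml:lang="">
    <opinion>I think this is an excellent idea.</opinion>
  </comment>
</merge>
""")

for element, lang in effective_languages(doc):
    print(element.tag, lang)
```

On this reading the second opinion comes out with no language at all, 
rather than "not English"; any guess that it is "en" would have to be 
layered on top as a separate, clearly heuristic step.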

On the wider topic, I wonder if there is a compromise comment that can 
be added to the "mis" definition, one that suggests that, when tagging, 
other more specific correct tags should be used in preference, but when 
reading, no such assumption can be made, due to the inherent instability 
of such usage. This would not invalidate any prior use, but would 
suggest that the more conservative approach is preferred.

Jeremy

-- 
Hewlett-Packard Limited
Registered Office: Cain Road, Bracknell, Berks RG12 1HN
Registered No: 690597 England