Structured documents and absence of language information

Martin Duerst duerst@w3.org
Mon, 22 Apr 2002 11:00:15 +0900


Dear language code specialists,

I would like your input/advice on a somewhat general question:

XML uses RFC 3066 language tags (mostly based on ISO 639-*) to identify
languages with the xml:lang attribute. Recently, we have become aware that
XML is often used to combine or wrap data and document pieces from different
origins. These may contain no language information at all. Because the
xml:lang attribute is inherited, including such pieces without care
may 'taint' the included piece with language information from the
wrapper, which is not at all guaranteed to be correct. There is
currently no well-defined way to say that in a given subtree, there
is no language information. This can only be said for a whole XML
document (just by not using an xml:lang attribute).
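
To make the inheritance concrete, here is a minimal Python sketch (not
taken from any specification) of how a consumer might work out the
language in effect for each element. The function name
inherited_languages and the wrapper/payload element names are made up
for illustration only.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the fixed XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def inherited_languages(elem, parent_lang=None, out=None):
    """Map every element to the xml:lang value in effect for it.

    An element without its own xml:lang takes the value of its parent;
    None means that no language information applies at all.
    """
    if out is None:
        out = {}
    lang = elem.get(XML_LANG, parent_lang)
    out[elem] = lang
    for child in elem:
        inherited_languages(child, lang, out)
    return out


# Hypothetical wrapper whose own text is known to be English, around a
# payload that arrived without any language information.
doc = ET.fromstring(
    '<wrapper xml:lang="en"><note>English text</note>'
    '<payload><item>origin unknown</item></payload></wrapper>'
)
langs = inherited_languages(doc)
item = doc.find("payload/item")
print(langs[item])  # prints: en
```

The item element never declared a language, yet it ends up labelled
"en" purely because of where it was placed.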

To take an example, this is the (slightly simplified) structure of
a SOAP envelope:

<env:Envelope xml:lang="en" xmlns:env="...soap...">
    <env:Header>...</env:Header>
    <env:Body>
    </env:Body>
</env:Envelope>

This contains xml:lang="en" because (in this case) the creator of the
envelope knows that the information in the header is in English.

Now let's assume an SVG document, without xml:lang:

<svg xmlns="...svg...">...</svg>

If we now put the SVG document inside the SOAP envelope, we get:

<env:Envelope xml:lang="en" xmlns:env="...soap...">
    <env:Header>...</env:Header>
    <env:Body>
      <svg xmlns="...svg...">...</svg>
    </env:Body>
</env:Envelope>

Now the <svg> part has been tainted by xml:lang="en", which may
be completely wrong. To avoid this, some way is needed to say
that there is no language information. This should preferably
be done with a special value of xml:lang (rather than with a
new attribute). Let's for a moment use @@@ to stand for this
value. Then the problem above could be solved either by putting
this on the svg element:

<env:Envelope xml:lang="en" xmlns:env="...soap...">
    <env:Header>...</env:Header>
    <env:Body>
      <svg xmlns="...svg..." xml:lang="@@@">...</svg>
    </env:Body>
</env:Envelope>

or, to avoid changing the svg element and to make processing
easier, it could be put on the innermost wrapping element:

<env:Envelope xml:lang="en" xmlns:env="...soap...">
    <env:Header>...</env:Header>
    <env:Body xml:lang="@@@">
      <svg xmlns="...svg...">...</svg>
    </env:Body>
</env:Envelope>
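
Wrapping code could produce the second variant along the following
lines. This is only a sketch: the wrap helper, the urn:example:...
namespace names and the use of the empty string as the stand-in for
@@@ are all assumptions made for illustration, since the actual value
is exactly the open question.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def wrap(body, payload, no_lang_value=""):
    """Append payload to body without letting it inherit a language.

    If the payload carries no xml:lang of its own, the innermost
    wrapper (body) is marked with no_lang_value, so the payload itself
    stays untouched. no_lang_value stands in for "@@@".
    """
    if payload.get(XML_LANG) is None:
        body.set(XML_LANG, no_lang_value)
    body.append(payload)
    return body


# Placeholder namespaces; the real SOAP and SVG namespace names are
# deliberately not reproduced here.
envelope = ET.fromstring(
    '<e:Envelope xmlns:e="urn:example:envelope" xml:lang="en">'
    '<e:Header/><e:Body/></e:Envelope>'
)
body = envelope.find("{urn:example:envelope}Body")
wrap(body, ET.fromstring('<svg xmlns="urn:example:svg"/>'))
print(ET.tostring(envelope, encoding="unicode"))
```

A payload that does declare its own xml:lang is left alone, because
its own value overrides whatever the wrapper says.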


The question where your input is appreciated most is what should
be used for the xml:lang value (@@@ above).

We have so far mainly looked at two choices. The first one, and
the one preferred in the W3C I18N WG/IG, is to use the empty string:

    xml:lang=""

The advantage of this is that it's self-evident, and it's
consistent with other, similar attributes. This use would have to
be defined in the XML specification (either in a new version or
by an erratum).

The second possibility we have considered is to use the language
code "und" (Undetermined). One question here is whether it's
okay to use that for things that are not in any natural language
at all (e.g. pure numeric data, programs, mathematics,...).
As far as I know, the ISO 639 standards don't apply to such things.
The other problem is that we would need to change XML to say
that e.g.

<?xml version='...' ?>
<root xml:lang='und'>
...
</root>

is the same as:

<?xml version='...' ?>
<root>
...
</root>

This is not too difficult to say in a new version of the spec,
but it's not easy to actually have it deployed, because a lot of
programmers have to be instructed about the use of 'und' as a
special value.
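
In code, the special-casing that every consumer would have to carry
might look roughly like this. The helper name language_info is
hypothetical; the case-insensitive comparison reflects that RFC 3066
tags are not case-sensitive.

```python
def language_info(xml_lang):
    """Return the usable language tag, or None if there is none.

    xml_lang is the xml:lang value in effect for an element, or None
    when no xml:lang applies. Under the 'und' proposal, 'und' has to
    be treated exactly like a missing attribute.
    """
    if xml_lang is None:
        return None
    if xml_lang.lower() == "und":
        return None
    return xml_lang


print(language_info("und") == language_info(None))
```

Any consumer that compares xml:lang values without a step like this
would see the two documents above as different.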

So we would prefer to use xml:lang='' rather than xml:lang='und'
to indicate the absence of any language information.

Please tell us whether you can agree with this or not, and whether
there are additional arguments and issues that we should consider.

Many thanks in advance for your feedback.

Regards,   Martin.