[Suppress-Script] Initial list of 300 languages

Peter Constable petercon at microsoft.com
Thu Mar 16 04:07:57 CET 2006


> From: McDonald, Ira [mailto:imcdonald at sharplabs.com]


> OK - I want to reply in defense of printer manufacturers...

> Printers tend to be developed by those computer programmers
> who annoy Michael Everson - IPP as a protocol (and many others)
> REQUIRES that all received protocol parameters be validated
> FIRST for syntax and SECOND for content - when IPP/1.1 was
> defined, a four-character script subtag in the second position
> was a syntax error.
> 
> Below is a verbatim quote from page 87 of IPP/1.1 (RFC2911):
> 
> 
> 4.1.8 'naturalLanguage'
> 
>    The 'naturalLanguage' attribute syntax is a standard identifier for a
>    natural language and optionally a country.  The values for this
>    syntax type are defined by RFC 1766 [RFC1766].  Though RFC 1766
>    requires that the values be case-insensitive US-ASCII [ASCII], IPP
>    requires all lower case to simplify comparing by IPP clients and
>    Printer objects.  Examples include:
> 
>       'en':  for English
>       'en-us': for US English
>       'fr': for French
>       'de':  for German
> 
>    The maximum length of 'naturalLanguage' values used to represent IPP
>    attribute values is 63 octets.

So, this is your defense of printer manufacturers? They are concerned first and foremost with syntax, they take their syntax from RFC 1766, and then they conclude that a four-character subtag in the second position* is a syntax error? (*You said "script" subtag -- that's content, not syntax.) Then I suggest that they go back to school to learn how to read BNF syntax grammars. Below is a verbatim quote from page 2 of RFC 1766:

<quote>
   The language tag is composed of 1 or more parts: A primary language
   tag and a (possibly empty) series of subtags.

   The syntax of this tag in RFC-822 EBNF is:

    Language-Tag = Primary-tag *( "-" Subtag )
    Primary-tag = 1*8ALPHA
    Subtag = 1*8ALPHA
</quote>
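
To make the point concrete, here is a minimal sketch of that grammar as a pattern. The function name is hypothetical and the check covers syntax only (no registry or ISO code lookup), but under those assumptions any subtag of one to eight letters, in any position, passes.

```python
import re

# RFC 1766 grammar: Language-Tag = Primary-tag *( "-" Subtag ),
# where Primary-tag and Subtag are each 1*8ALPHA.
RFC1766_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def is_rfc1766_syntax(tag: str) -> bool:
    """Return True if tag matches the RFC 1766 EBNF (syntax only)."""
    return RFC1766_TAG.fullmatch(tag) is not None
```

A four-letter second subtag such as the one in "sr-latn" is therefore syntactically fine; only empty subtags, non-letters, or subtags longer than eight letters fail.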

 
> I personally have seen code at Sharp and Xerox that specifically
> checks for language and optional country subtags and treats any
> other language-tag as a syntax error.

That code would be imposing restrictions that RFC 1766 never imposed. Again, quoting verbatim (pp. 2-3):

<quote>
   In the first subtag:

    -    All 2-letter codes are interpreted as ISO 3166 alpha-2
         country codes denoting the area in which the language is
         used.

    -    Codes of 3 to 8 letters may be registered with the IANA by
         anyone who feels a need for it, according to the rules in
         chapter 5 of this document.

   The information in the subtag may for instance be:

    -    Country identification, such as en-US (this usage is
         described in ISO 639)

    -    Dialect or variant information, such as no-nynorsk or en-
         cockney

    -    Languages not listed in ISO 639 that are not variants of
         any listed language, which can be registered with the i-
         prefix, such as i-cherokee

    -    Script variations, such as az-arabic and az-cyrillic

   In the second and subsequent subtag, any value can be registered.
</quote>

Note that the RFC explicitly anticipated the possibility of the second subtag (in our terms -- "first subtag" in RFC 1766 terms) being used to indicate a script distinction.
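
As an illustration only, the sketch below runs the RFC's own example tags through two checks: the RFC 1766 grammar, and a stricter "language plus optional two-letter country" pattern. The strict pattern is my assumption about the kind of check described, not actual Sharp or Xerox code, and all names are hypothetical. The tags are lowercased, as IPP requires.

```python
import re

# RFC 1766 syntax: 1*8ALPHA, then any number of "-" 1*8ALPHA subtags.
RFC1766_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")

# Hypothetical over-strict check: two-letter language, optional two-letter country.
STRICT_TAG = re.compile(r"[a-z]{2}(?:-[a-z]{2})?")

# Example tags taken from RFC 1766 itself, lowercased per IPP.
RFC_EXAMPLES = ["en-us", "no-nynorsk", "en-cockney",
                "i-cherokee", "az-arabic", "az-cyrillic"]


def wrongly_rejected(tags):
    """Return tags that are valid RFC 1766 syntax but fail the strict check."""
    return [t for t in tags
            if RFC1766_TAG.fullmatch(t) and not STRICT_TAG.fullmatch(t)]
```

Under these assumptions, `wrongly_rejected(RFC_EXAMPLES)` returns every example except "en-us", including both script-variation tags.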

Needless to say, I'm not impressed by the defense; I actually give the printer manufacturers more credit: I'm inclined to suspect that they weren't actually referring to RFC 1766 directly (I can't imagine anyone getting syntax *that* wrong if they had actually read it). They were probably working from someone's incomplete notion of what RFC 1766 had to say, based on having seen tags like "en-US" or "de-DE" on numerous occasions and making leaps of inference from them.


I'd better stop now. This has gotten way off topic.


Peter Constable

