suppress-script values for fil, mi, pes, prs, qu members

Wed Oct 20 08:15:25 CEST 2010

From: Philip Newton [mailto:philip.newton at gmail.com] 

>> Ummm... If it's not giving a default script, then what is it giving?

> A way for implementations to recognise that "en-Latn" and "en" 
> are to be considered the same

I guess that's a not unreasonable way to look at it, but they can be considered equivalent (in common cases--there can still be exceptions) precisely because there is a script (Latin) that can be assumed by default.
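
As a rough illustration of that equivalence, here is a minimal Python sketch of how an implementation might treat "en-Latn" and "en" as the same for matching purposes. The table is only a small illustrative subset of Suppress-Script values, the function names are hypothetical, and the parsing assumes the simple shape language[-script][-rest] (no extlang subtags); it is not meant as a conformant BCP 47 parser.

```python
# Illustrative subset only; the authoritative Suppress-Script values are in
# the IANA Language Subtag Registry.
SUPPRESS_SCRIPT = {"en": "Latn", "fr": "Latn", "ru": "Cyrl", "ja": "Jpan"}


def strip_default_script(tag: str) -> str:
    """Drop the script subtag when it equals the language's Suppress-Script.

    Assumes the tag has the shape language[-script][-rest]; hypothetical helper.
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    # A script subtag is four letters and, in this simplified shape,
    # directly follows the language subtag.
    if len(subtags) > 1 and len(subtags[1]) == 4 and subtags[1].isalpha():
        script = subtags[1].title()
        if SUPPRESS_SCRIPT.get(language) == script:
            return "-".join([language] + subtags[2:])
    return tag


def same_for_matching(a: str, b: str) -> bool:
    """Compare two tags case-insensitively after removing default scripts."""
    return strip_default_script(a).lower() == strip_default_script(b).lower()
```

With this sketch, same_for_matching("en-Latn", "en") is true, while "en-Brai" and "en" stay distinct because Braille is not the assumed default. A language with no entry in the table (for instance "sr" here) gets no such shortcut, so "sr-Latn" and "sr" are not treated as the same.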

To get a historical perspective on s-s, you can trawl through the LTRU archives. You can also refer to a paper I wrote in 2002, "Toward a model for language identification", that reflects the perspective I, at least, contributed as we were introducing script subtags into BCP 47.

http://www.sil.org/silewp/abstract.asp?ref=2002-003

In particular, section 5, "Default values and implicit tagging", addresses a particular issue: On the one hand, a consistent approach to describing orthographic forms would always include a declaration of the script used. On the other hand, however, there was a lot of established practice using language tags without any script subtags. I also commented on how implicitly-assumed defaults might allow for simpler tags in these cases. Here's an excerpt:

"...That would suggest a need for much richer identifiers than 
are used in existing implementations.

"That is not necessarily the intent here. The proposed morphology 
describes the structure of a fully-qualified identifier. Often, however, 
qualifiers might not be needed...

"The reason for this is that, in practice, certain defaults are often 
applicable. For instance, while language and script are logically 
independent, in actual practice only certain combinations do occur, 
and in most situations there is an unmarked case. So, for example, the 
vast majority of English text data that the average person is likely to 
encounter uses the common English writing system. Some English text 
data may be in Braille, some may be in phonetic transcription, some may 
be in some form of shorthand or any script you might want to imagine, 
but by a large margin most of it is written using the characters of good 
old ASCII. It is the unmarked case.

"In general, where there are conventions that are considered the norm, 
it would be possible to treat them as default, unmarked cases that do not 
need explicit indication in an identifier. As a result, identifiers could have 
implicit semantics and could also be used for more than one category."

What I meant in that last sentence was, e.g., that the "en" could be used to denote language (in the strict sense) English, but given implicit semantics of "Latin", it could also be used to denote the notion of a particular language in a particular form of writing (a category that I referred to in that paper as "writing system") of 'English in its common written form'.

On the basis of what I wrote in that paper, I introduced the basic idea for s-s to LTRU. There was an issue to be worked through, though, which this other comment in my paper touched on:

"Most importantly, the use of implicit default semantics would depend
upon the ready availability of information for all written languages
regarding defaults with respect to writing systems and orthographies."

A debate ensued within LTRU as to whether it would be better to document cases in which best practice for common use would be to include a script subtag, or to document cases in which best practice for common use would be to leave out the script subtag. In either approach, there was a problem in compiling comprehensive information, particularly when we knew there would be lots of ambiguous cases (e.g. neo-literate languages in environments in which more than one script is in established use for other dominant languages). 

I think we ended up going the way we did--document the cases in which a script subtag could be left out in common usage--because it was felt it would be easier to identify and document those cases. And that's all I'm suggesting, really: that there are some additional cases for which s-s makes sense and, having identified them, that we document them.

Peter