More hunches - come on, we need better than that.

Fri May 23 18:04:00 CEST 2003

Michael cc'd me and asked me to weigh in.

[Note that I am not subscribed to ietf-languages, so I have
not been following all your ongoing argumentation. I only
occasionally get cc'd on snippets that migrate off of that list.]

> We do not encode duplicate characters, John, for good reasons, and I 
> think that yi-Hebr and sr-Cyrl are tantamount to duplicate codes.

I disagree on that point. As I indicated in a separate
response:

François Yergeau's argument is good for this. Once
you allow sr-latn, you imply the hierarchy:

   sr  ---- sr-cyrl
         |_ sr-latn

where "sr" just means "Serbian", and defaults to an expectation
of the Cyrillic script, because that is what most material is
in. But it would be valid to label Latin Serbian data "sr" as
well, and if you needed to be specific, or restrict a search
to only Cyrillic data, you would need the "sr-cyrl" tag, too.
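
A minimal sketch of how that hierarchy behaves in a search,
assuming simple prefix matching on hyphen boundaries. The names
matches, search and DOCUMENTS, and the sample data, are
hypothetical and exist only for illustration:

```python
def matches(language_range: str, tag: str) -> bool:
    """True if tag equals the range, or extends it at a hyphen boundary."""
    r = language_range.lower()
    t = tag.lower()
    return t == r or t.startswith(r + "-")


# Hypothetical tagged documents (illustrative data only).
DOCUMENTS = {
    "doc1": "sr",       # Serbian, script unspecified
    "doc2": "sr-Cyrl",  # Serbian, explicitly Cyrillic
    "doc3": "sr-Latn",  # Serbian, explicitly Latin
    "doc4": "srn",      # a different language code; must not match "sr"
}


def search(language_range: str) -> list:
    """Names of all documents whose tag falls under the given range."""
    return sorted(name for name, tag in DOCUMENTS.items()
                  if matches(language_range, tag))


print(search("sr"))       # ['doc1', 'doc2', 'doc3']
print(search("sr-Cyrl"))  # ['doc2']
```

A search on "sr" picks up all Serbian data regardless of script,
while a search on "sr-Cyrl" picks up only the data explicitly
tagged as Cyrillic. Data tagged with plain "sr" is not returned
by the narrower search, which is why the explicit tag is needed.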

> ISO 639 has all those duplicate language codes (T/B) and it is a 
> mess. It seems to me that we ought not have duplicate codes. I have 
> asked for the opinion of people like Ken who have more experience in 
> software and database stuff than I do.

O.k., you've got it. I don't agree that an extended code which
represents an explicit restriction of the semantics of a language
tag to indicate the script of the written form is a *duplication*
of the language tag.

> My response is that I (frankly) feel bullied by this process. I have 
> objections to duplicates. I am being told that I *have* to approve 
> these because I approved another one, and I've said I consider them a 
> different case, and I've asked the proponents of these to talk to 
> some people whose judgement I trust more than my own. And no, I'm not 
> so sure about Mark's judgement in this matter. It seems to me he 
> wants the quickest fix possible, and I am not sure that is what this 
> RFC is for. Maybe it is. But I am not sure of it.

I can't speak to whether you are being bullied by the process. :-(

But while I have all kinds of theoretical objections to what Mark
has been asking for, particularly in the context of "solving"
the Traditional/Simplified Chinese tagging problem with
variant script tags, Hant and Hans, I don't see much of a practical
objection. It amounts to just one more (ad hoc) extension of an
already ad hoc system, and it clearly meets some implementation
requirements.

As for what the RFC is for, if we stop trying to be purist about
seeing it as representing language tags per se, and instead
see it as a practical (albeit ad hoc and inconsistent) mechanism
for creating identifying tags for written language forms
significantly distinct (and sufficiently prominent) to require
distinct, reliably machine-readable labels for information
processing needs, then the case for what Mark has been asking
to be registered is much easier to make.

> 
> I think this is controversial, and I have insisted on knowing that 
> all the players here are satisfied with the process. I have asked 
> that you return to Peter Edberg's paper, which asked many questions, 
> and decide a firm policy on this matter and the relationship it has 
> with the RFC, the intent of tagging.

The intent of the tagging is that for some machine processes,
some piece of text needs to be tagged as zh-Hans or zh-Hant
(or sr-Cyrl or sr-Latn, or uz-Latn or uz-Arab, or whatever),
because it needs to be processed differently depending on
what writing form is involved. Conceptually, this is no
different than indicating en-US versus en-GB to trigger which
dictionary you use for spelling correction, for example.
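
As a rough sketch of that kind of tag-driven dispatch, assuming
a lookup table keyed by lowercased tags and a fallback that drops
trailing subtags one at a time. The table contents and the name
pick_handler are hypothetical, and the region tag for British
English is written en-GB here:

```python
# Hypothetical mapping from tags to processing resources.
HANDLERS = {
    "en-us": "US English spelling dictionary",
    "en-gb": "British English spelling dictionary",
    "en": "generic English spelling dictionary",
    "zh-hans": "Simplified Chinese segmenter",
    "zh-hant": "Traditional Chinese segmenter",
}


def pick_handler(tag: str):
    """Return the resource for the most specific matching tag, or None."""
    parts = tag.lower().split("-")
    while parts:
        key = "-".join(parts)
        if key in HANDLERS:
            return HANDLERS[key]
        parts.pop()  # drop the last subtag and retry with a shorter tag
    return None


print(pick_handler("en-US"))    # US English spelling dictionary
print(pick_handler("en-AU"))    # generic English spelling dictionary
print(pick_handler("zh-Hant"))  # Traditional Chinese segmenter
print(pick_handler("sr-Cyrl"))  # None
```

The script subtags play the same role in this table as the
region subtags do: each one selects a different processing path
for text in the same language.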

*I* am the one who is the most purist and the greatest stickler about
what language is, versus what script is, versus what orthography
is, for example. But I think it is hopeless to try to get
RFC-3066 tags to make all those distinctions consistently -- I'm
not trying to fix that at this point.

Instead, *practically*, the information processing implementations
simply need to make certain distinctions. To do so, they need
extensions that aren't currently accounted for. I don't see
much point in standing in their way on grounds of consistency,
since the whole RFC 3066 apparatus is already inconsistent
in its treatment of language (written or not).

> 
> You bully very sweetly.
> 
> I want Edberg and Whistler on board on this as well. I am cc:ing them 
> particularly now because I guess no one else has gone to try to see 
> if they have the same consensus view. 

You've got my view above.

I think language tags are completely confused already. I don't
see a problem with adding a little more inconsistency, if it
resolves implementation issues.

I wouldn't advocate okaying the registration of out-and-out
duplicates or badly documented registration requests, but I
don't think that is the issue here.

> John Cowan seems to be on the 
> right track, if he can help us move from vagueness to concrete 
> guidelines.

The guideline should be: does this registration reasonably meet
a demonstrable implementation requirement, without causing
difficulty for people who are using the existing registrations?

It shouldn't be: does this registration make an already
inconsistent labelling system marginally more inconsistent? ;-)

--Ken
> -- 
> Michael Everson * * Everson Typography *  * http://www.evertype.com