A proposed solution for descriptions

Mon Jun 19 16:28:12 CEST 2006

Addison Phillips <addison at yahoo dash inc dot com> wrote:

> 1. The form <U+1234> is equally unnecessarily obscure. There exist 
> perfectly good escape formats (\u1234 and \U123456) that would 
> probably serve better (since several programming languages will 
> interpret these formats). The choice of an NCR format stolen from XML 
> was (not actually, but may as well have been) arbitrary and is no 
> better or worse than Mark's suggestion (well, maybe slightly worse). 
> However, any array of "gunk" in the file requires additional 
> processing or observation on the part of the user.

If it's really necessary, I can try digging back through the archives to 
find the discussion of hex NCRs and why they were chosen over other 
formats (if any were considered).

One of the disadvantages of \u1234 is that we wanted to be able to 
suppress leading zeros in the Unicode Scalar Value.  That is, the 
c-with-cedilla in "Provençal" can be just &#xE7; and not &#x00E7; or 
&#x0000E7;.  This provides a modest byte saving.  A sequence like \u00E7 
cannot be variable-length like this, because there would be ambiguity if 
the sequence were followed by a character that could be interpreted as 
another hex digit.  With leading zeros suppressed, "Proven\uE7al" could 
be interpreted as "Proven" + U+0E7A + "l", and that would force us to 
adopt more conventions to break the ambiguity, or require fixed-length 
sequences.
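
To make the ambiguity concrete, here is a minimal Python sketch.  The 
NCR pattern follows the &#x...; form used above.  The variable-length 
"\u" reader is purely hypothetical (no such convention was adopted), 
and the function names and the six-digit cap are assumptions made for 
illustration only.

```python
import re

# Hex NCR as used in the registry: "&#x" + hex digits + ";".
# The closing ";" marks the end, so leading zeros can be dropped safely.
NCR = re.compile(r"&#x([0-9A-Fa-f]+);")

# Hypothetical variable-length "\u" escape with no terminator.
# A greedy reader consumes every hex digit it can (capped at 6 here).
GREEDY_U = re.compile(r"\\u([0-9A-Fa-f]{1,6})")


def _to_char(match):
    """Turn the captured hex digits into the corresponding character."""
    return chr(int(match.group(1), 16))


def decode_ncr(text):
    """Decode &#x...; references; the length of the digit run is free."""
    return NCR.sub(_to_char, text)


def decode_greedy_u(text):
    """Decode the hypothetical unterminated \\u form, greedily."""
    return GREEDY_U.sub(_to_char, text)


ncr_short = decode_ncr("Proven&#xE7;al")     # leading zeros suppressed
ncr_padded = decode_ncr("Proven&#x00E7;al")  # same character, padded
greedy = decode_greedy_u(r"Proven\uE7al")    # the "a" is swallowed as a digit

# ascii() keeps the output readable on any terminal encoding.
print(ascii(ncr_short), ascii(ncr_padded), ascii(greedy))
```

Both NCR spellings decode to the same string, while the greedy reader 
takes the "a" of "al" as a third hex digit and yields U+0E7A followed 
by "l" instead of U+00E7 followed by "al".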

> 2. The IETF does allow UTF-8 registries. Apparently one already exists 
> (I don't remember which). The problem here was the decision/need to 
> publish the initial registry as an I-D. I would not support an ASCII 
> only registry otherwise. I don't believe Mark Davis would either: we 
> are both committed supporters of Unicode. And personally I find an 
> ASCII only registry to be stupid.

While avoiding the word "stupid," I will note (and I doubt anyone will 
argue) that it has caused quite a bit of trouble.

> 3. The ONLY way to change the format of the registry is to update RFC 
> 3066bis. There will be an opportunity to change the format when we 
> update that document to support ISO 639-3. When that happens, I hope 
> that we will convert the registry to UTF-8 and that this foolishness 
> will be consigned to the dustbin of history.

I agree on both counts: (1) we should do it in the future, and (2) we 
cannot do it yet, so there is no point proposing it now.

> 4. Well... I do note that the sequence <U+201B> is a perfectly good 
> "latin-script" string. This list *could* register such strings and 
> ignore the guidance in RFC 3066bis, but I think that this would be 
> extremely confusing.

The sequence "<U+201B>" could be registered, but would have no special 
meaning.  It would merely make people think we didn't understand our own 
syntax rules.

> Finally: I would stipulate that the purpose of the Description field 
> is to identify to human users of the registry (i.e. implementers) what 
> the subtag values "mean". This is not at all the same thing as 
> asserting the actual description or name of the subtag in any 
> particular language. Such applications are important and users should 
> refer to external references, such as the ISO and UN standards 
> themselves or to projects such as CLDR to obtain display names in any 
> particular language.

This was the major argument, months ago, against adding alternative 
Description fields that weren't part of the corresponding core standard. 
By discussing "Ivory Coast" and "Book Norwegian" we have essentially 
abandoned that goal.

> I agree 100% with Debbie that the registry should pick up *exactly* 
> what ISO 639, 3166, 15924, or UN M.49 emits.

If ISO 639 had not added a language "N'Ko" with a different apostrophe 
from that used in ISO 15924 for the script "N’Ko", I would not have had 
to suggest changing one to match the other, and we would not be having 
this debate right now.
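
For anyone who wants to see the mismatch rather than squint at it, a 
small Python sketch follows.  The two strings are copied from the 
names as quoted above; the variable and function names are my own and 
exist only for illustration.

```python
import unicodedata

language_name = "N'Ko"      # as quoted above for the ISO 639 language
script_name = "N\u2019Ko"   # as quoted above for the ISO 15924 script


def describe(text):
    """List each character as (code point, Unicode character name)."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Pair the characters up position by position and keep the mismatches.
differences = [
    (a, b)
    for a, b in zip(describe(language_name), describe(script_name))
    if a != b
]

for a, b in differences:
    print(a, "vs", b)
```

The names differ in exactly one position: U+0027 APOSTROPHE in one and 
U+2019 RIGHT SINGLE QUOTATION MARK in the other.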

--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/