A proposed solution for descriptions

Addison Phillips addison at yahoo-inc.com
Mon Jun 19 18:06:29 CEST 2006

Some notes follow... elision as necessary.


Addison Phillips
Internationalization Architect - Yahoo! Inc.

Internationalization is an architecture.
It is not a feature.  

> -----Original Message-----
> From: Doug Ewell [mailto:dewell at adelphia.net] 
> Sent: June 19, 2006 7:28
> To: ietf-languages at iana.org
> Cc: Addison Phillips; 'Mark Crispin'; 'Debbie Garside'
> Subject: Re: A proposed solution for descriptions
>
> One of the disadvantages of \u1234 is that we wanted to be able to
> suppress leading zeros in the Unicode Scalar Value.  That is, the
> c-with-cedilla in "Provençal" can be just &#xE7; and not &#x00E7; or
> &#x0000E7;.  This provides a modest byte saving.  A sequence like
> \u00E7 cannot be variable-length like this, because there would be
> ambiguity if the sequence were followed by a character that could be
> interpreted as another hex digit.

Variable width is not necessarily better than fixed width (see pointless protracted debate: "which should I use, UTF-8 or UTF-16"). The byte saving is so modest as to be meaningless. The problem with \u#### is, of course, the representation of supplementary characters (i.e. those > U+FFFF). They would require either a surrogate pair (a la JavaScript or JSON) or an alternate escape form (such as \U######). A delimited format was chosen because it avoids such things. On the other hand, plenty of programming languages recognize the \u convention.
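
To make the trade-off concrete, here is a minimal Python sketch of the two escape styles. It assumes the delimited form is a hex NCR ("&#x", hex digits, ";"). The function names are hypothetical and exist only for this illustration; nothing here is part of the registry or of any library.

```python
import re

# Delimited hex NCR: "&#x", one to six hex digits, then ";".
# The closing ";" is what lets the digit count vary safely.
NCR = re.compile(r"&#x([0-9A-Fa-f]{1,6});")


def decode_ncrs(text: str) -> str:
    """Replace each hex NCR with the character it names."""
    def _replace(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        # Leave anything beyond the Unicode range untouched.
        return chr(code_point) if code_point <= 0x10FFFF else match.group(0)
    return NCR.sub(_replace, text)


def encode_ncrs(text: str) -> str:
    """Escape non-ASCII characters as hex NCRs with no leading zeros."""
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};" for ch in text
    )


def to_u_escapes(text: str) -> str:
    """Escape non-ASCII as fixed-width \\uXXXX (JavaScript/JSON style).

    Code points above U+FFFF do not fit in four hex digits, so they
    are written as a UTF-16 surrogate pair.
    """
    out = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 0x80:
            out.append(ch)
        elif code_point <= 0xFFFF:
            out.append(f"\\u{code_point:04X}")
        else:
            offset = code_point - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            out.append(f"\\u{high:04X}\\u{low:04X}")
    return "".join(out)


if __name__ == "__main__":
    print(encode_ncrs("Provençal"))
    print(to_u_escapes("Provençal"))
    print(encode_ncrs("\U00010400"))
    print(to_u_escapes("\U00010400"))
```

Note that in "Provençal" the character after the cedilla is "a", which is itself a hex digit. An undelimited, variable-length "\uE7al" could not be parsed unambiguously, whereas "&#xE7;al" can.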

Whatever. LTRU chose a format and that's the format.

> > And personally I find an 
> > ASCII only registry to be stupid.
> While avoiding the word "stupid," I will note (and I doubt 
> anyone will argue) that it has caused quite a bit of trouble.

Sometimes stronger language is called for<g>. A UTF-8 registry would not fix the current debate. Sure, we would not have to look at the NCRs, but there would still be the debate about which characters to use.

> > 4. Well... I do note that the sequence <U+201B> is a perfectly good 
> > "latin-script" string. This list *could* register such strings and 
> > ignore the guidance in RFC 3066bis, but I think that this would be 
> > extremely confusing.
> The sequence "<U+201B>" could be registered, but would have 
> no special meaning.  It would merely make people think we didn't 
> understand our own syntax rules.

Yes, but it is good to point out these things before others think of them.

> > Finally: I would stipulate that the purpose of the Description field
> > is to identify to human users of the registry (i.e. implementers) what
> > the subtag values "mean". This is not at all the same thing as
> > asserting the actual description or name of the subtag in any
> > particular language. Such applications are important and users should
> > refer to external references, such as the ISO and UN standards
> > themselves or to projects such as CLDR to obtain display names in any
> > particular language.
> This was the major argument, months ago, against adding alternative
> Description fields that weren't part of the corresponding core standard.
> By discussing "Ivory Coast" and "Book Norwegian" we have essentially
> abandoned that goal.

No. We have discussed whether that is the goal (again). Notwithstanding what the LTRU WG thinks, the IETF-languages list must form a consensus on proposed registrations. It is the IETF-languages list that determines what is or is not to be registered. I agree that the thinking in the LTRU WG provides at least a framework for the thinking here. But the rules in RFC 3066bis are sufficiently clear about what is allowed in descriptions (nearly anything that this list approves).

Personally, I use the descriptions to make clear what the darned subtags are, not as authoritative representations of same in any particular language.

> > I agree 100% with Debbie that the registry should pick up *exactly* 
> > what ISO 639, 3166, 15924, or UN M.49 emits.
> If ISO 639 had not added a language "N'Ko" with a different apostrophe
> from that used in ISO 15924 for the script "N’Ko", I would not have had
> to suggest changing one to match the other, and we would not be having
> this debate right now.

Forced is not exactly the right word. That there are differences is unfortunate. But it does not necessarily follow that slavishly following the Ur-standards harms the registry in any way. We would be better off pointing out the discrepancy to the ISO 639 and ISO 15924 MAs and letting them figure it out. In practice it makes no difference in the use or operation of the registry.

I agree that consistency is a good thing to have, but not at the expense of eternal hand-wringing.

One final point: an advantage of this particular registry and its process is its application of human reasoning to registrations. In adapting items from the ISO standards, we can apply Occam's razor to good effect and modify the descriptions only when necessary. When there is some doubt, I'll follow whatever Michael proposes or give a "-1" with my reasons. Otherwise we are just chasing our tails.
