Duplicate Busters: Survey #2

Doug Ewell doug at ewellic.org
Sun Aug 3 21:32:52 CEST 2008

Frank Ellermann <nobody at xyzzy dot claranet dot de> wrote:

>> You think it's a good thing to have both "Hanunoo"
>> and "Hanun?o", or "Ge'ez" and "Ge?ez"?
> If both exist in the source, yes.  Note that I can't
> read the last word in your question.  Apparently you
> sent your mail as base64 UTF-8, this already limits
> your audience.  My MUA (not the same as a year ago)
> doesn't consider this as hostile, but my OS offers
> only a substitute glyph for "whatever it is".

First, I've told Outlook Express 6.0 not to base64-encode my text (Tools 
| Options | Send | Plain Text Settings | Encode text using: None).  The 
header on my sent message says "Content-Transfer-Encoding: 8bit".  So I 
tend to believe OE did what I told it to do, and some gateway along the 
way applied the base64 layer.

Second, it is 2008 and mail agents are supposed to be able to deal with 
base64-encoded UTF-8 by now.
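
For what it's worth, undoing the base64 layer is mechanical for any agent
that reads the MIME headers. The following is only a minimal Python sketch
using the standard library's email package; the message itself is a
made-up example (not my actual mail), built by base64-encoding a short
UTF-8 body the way a gateway might.

```python
import base64
from email import message_from_bytes
from email.policy import default

# Hypothetical body: "Ge" + U+02BB + "ez", encoded as UTF-8.
body = "Ge\u02bbez\n".encode("utf-8")

# Hypothetical message as it might look after a gateway has applied
# base64 to what was originally sent as 8bit.
raw = (
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    + base64.b64encode(body)
    + b"\r\n"
)

msg = message_from_bytes(raw, policy=default)

# get_content() reverses the transfer encoding (base64) and then
# decodes the bytes using the declared charset (UTF-8).
text = msg.get_content()
```

Whether the result is then legible is a font question on the reading
side, not an encoding question.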

> I'm not going to decode base64 into raw UTF-8, and
> then UTF-8 into a hex. code point.  It is far better
> in the subtag registry, where I get the correct NCR,
> and don't need to worry about anybody's fonts.

1. The Registry is moving to UTF-8.  This has been decided.

2. You don't ever have to worry about anybody else's fonts, only your own.

3. The debate over "use the correct spelling" versus "make it legible on 
the oldest, most limited system" is at the heart of the section of 
Survey #2 that deals with Ge&#x2BB;ez and such.  I don't believe the 
solution is "encode everything twice."  Others do.  That's why your 
4645bis Editor has initiated surveys rather than make these changes 
unilaterally, as some have suggested I should do.
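
To illustrate that the NCR form and the UTF-8 form identify the same code
point, here is a rough Python sketch. The helper names to_ncr and from_ncr
are invented for this example and do not describe any actual Registry
tooling; html.unescape is the standard-library function and also expands
named entities such as &amp;, which may or may not be wanted.

```python
import html

def to_ncr(text):
    """Replace each non-ASCII character with a hexadecimal NCR."""
    return "".join(
        ch if ord(ch) < 128 else "&#x{:X};".format(ord(ch))
        for ch in text
    )

def from_ncr(text):
    """Expand character references back into the characters themselves."""
    return html.unescape(text)

name = "Ge\u02bbez"            # U+02BB MODIFIER LETTER TURNED COMMA
escaped = to_ncr(name)         # "Ge&#x2BB;ez"
restored = from_ncr(escaped)   # same string as name
utf8_bytes = name.encode("utf-8")
```

Either form can be produced from the other without loss, so the choice
between them is about what readers see, not about what is encoded.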

>> Do the protocols depend on the exact name, or do
>> they depend on the code element and meaning?
> They are based on whatever 639-1/2 said, that is how
> most 1766-3066-4646 language subtags are specified.

Let me try again: Do they depend on the EXACT name, to the extent that 
any change in hyphenation or apostrophe usage will cause compatibility 
problems?  Please provide examples.

> Any "redefinitions" in ISO 639-3 can be incompatible,
> see the "zh" vs. "cmn" part of the zh-Latn debate.

There has been no redefinition of 'zh'/'zho' in any part of 639.  639-3 
introduces the concept of macrolanguage and defines 'zho' as "any 
language that is sometimes called Chinese," the same meaning it has 
under 639-1 and -2, and introduces additional code elements for specific 
"Chinese" languages.  ietf-languages members don't agree on whether 
'pinyin' should refer to 'zh' or 'zh-cmn'/'cmn', but that is not a 
matter of 639-3 redefining anything.

> Or the stunt to redefine "fy" some years ago, where
> I fortunately never found any fy-DE or similar cases.
> Limiting an existing code (Frisian to Western Frisian,
> Chinese to Mandarin, Yugoslavia to Serbia, etc.) is
> in theory always wrong.  In practice it might work...

But the fact is that ISO does change these supposedly sacred names from 
time to time.  While we are worried about preserving exact hyphens and 
apostrophes and about whether "Borna" can be interpreted the same as 
"Borna (Ethiopia)", ISO can and does make much larger-scale changes.

>> You mean something like this?
>> Comments: Listed as "Ainu" in ISO 639-2
> Yes, I'm not sure how important or helpful the info is.
> The GG-IM-JE comment is a similar idea.

Neither am I.  But we can consider adding such comments to any subtags 
where the exact ISO name isn't preserved.

> [hyphenation of macedo romanian]
>> there have been participants on both lists who have
>> insisted that the precise ISO 639-2 name AND the
>> precise 639-3 name must be kept intact, down to the
>> last space or hyphen or &#x2BB;
> Maybe that was me, maybe it was Debbie.  The solution
> to pick one name in both source standards is fine.

Do you mean "one name from each standard," meaning we have to keep 
trivially different names, or "one name encompassing both standards," 
meaning we can choose one and discard the other?
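
By "trivially different" I mean something like the following Python
sketch. The folding rules here are my own assumption for illustration
(unify the three apostrophe-like characters that come up in this thread,
treat hyphens as spaces, drop combining accents, ignore case); they are
not a rule from RFC 4646 or from any ISO standard, and fold_name and
trivially_different are hypothetical names.

```python
import unicodedata

# Assumed set: U+0027 APOSTROPHE, U+02BB MODIFIER LETTER TURNED COMMA,
# U+2019 RIGHT SINGLE QUOTATION MARK.
APOSTROPHES = "\u0027\u02bb\u2019"

def fold_name(name):
    """Reduce a name to a comparison key under the assumed rules."""
    out = []
    for ch in unicodedata.normalize("NFD", name):
        if unicodedata.combining(ch):
            continue              # drop accents such as the acute on "o"
        if ch in APOSTROPHES:
            ch = "'"              # unify apostrophe-like characters
        elif ch == "-":
            ch = " "              # treat hyphen and space alike
        out.append(ch)
    # Collapse runs of whitespace and ignore case.
    return " ".join("".join(out).split()).casefold()

def trivially_different(a, b):
    """True if the names differ, but only in ways the folding ignores."""
    return a != b and fold_name(a) == fold_name(b)

pairs = [
    ("Ge\u02bbez", "Ge'ez"),
    ("Hanun\u00f3o", "Hanunoo"),
    ("Macedo-Romanian", "Macedo Romanian"),
    ("Borna", "Borna (Ethiopia)"),
]
results = [trivially_different(a, b) for a, b in pairs]
```

Under these assumed rules the first three pairs count as trivially
different and the last does not, which is roughly the line the survey
is asking people to draw.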

>> ISO 15924 lists only Ge&#x2BB;ez, not Ge'ez.
> Then this ASCII Ge'ez has to be removed in the next
> round of modifications.  How did a name get into the
> registry if it is not in the source, did we miss the
> change ?

RFC 4646 doesn't prohibit this list from adding Description fields 
beyond those in the source standards.  They must not conflict with the 
existing description(s).  ietf-languages agreed to add the 
ASCII-apostrophe version in June 2006 after a lengthy debate.

Actually, ISO 15924 itself changed: it originally used &#x2019;, and 
that change is what sparked the lengthy debate.

Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
http://www.alvestrand.no/mailman/listinfo/ietf-languages  ˆ
