"This case isn't the important one" (was Re: Visually confusable characters (8))

Asmus Freytag asmusf at ix.netcom.com
Mon Aug 11 20:36:26 CEST 2014

I am responding to Vint's message, because, for some reason, I do not 
receive Andrew's messages via the list.

On 8/11/2014 7:47 AM, Vint Cerf wrote:
> Amen to Andrew's basic point.
> v
> On Mon, Aug 11, 2014 at 10:42 AM, Andrew Sullivan 
> <ajs at anvilwalrusden.com> wrote:
>     That behaviour is surprising to me given what I understood at
>     the time we worked on and published IDNA2008.  (It is in fact
>     surprising to me even now when I read the text of the standard, but I
>     understand the argument that in fact the new character is somehow
>     unrelated enough to the former combining sequence that the combining
>     sequence never really worked, but that doesn't matter.  I would
>     probably find that argument more compelling if I understood why this
>     case is different from ö in Swedish vs. ö in German, but never mind
>     that, either.)

First, the very same case has long been in place for ø in Danish (and
Norwegian), which looks like the sequence o + combining /, but is not
deemed identical to it.
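
This can be checked with Python's standard unicodedata module (chosen here purely for illustration; nothing in the thread refers to it). The sketch below assumes that "combining /" means U+0338 COMBINING LONG SOLIDUS OVERLAY, and it contrasts ø with ö, which does have a canonical decomposition.

```python
import unicodedata

O_WITH_STROKE = "\u00f8"    # ø LATIN SMALL LETTER O WITH STROKE
O_PLUS_OVERLAY = "o\u0338"  # o + COMBINING LONG SOLIDUS OVERLAY (assumed "combining /")
O_WITH_DIAERESIS = "\u00f6" # ö LATIN SMALL LETTER O WITH DIAERESIS
O_PLUS_DIAERESIS = "o\u0308"  # o + COMBINING DIAERESIS


def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """True if the two strings are identical after normalization to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# ø has no decomposition mapping, so no normalization form relates it
# to the sequence that merely looks like it.
print(repr(unicodedata.decomposition(O_WITH_STROKE)))
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, equivalent(O_WITH_STROKE, O_PLUS_OVERLAY, form))

# ö, by contrast, is canonically equivalent to o + combining diaeresis.
print(unicodedata.decomposition(O_WITH_DIAERESIS))
print(equivalent(O_WITH_DIAERESIS, O_PLUS_DIAERESIS))
```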

The combining / exists for a well-defined purpose, viz. mathematical
notation (negation of symbols and operators).

However, for letters, marks that are overlays (stroke, bar, etc.) are
extremely problematic, because while the concept can be articulated,
there is wide variability in how the overlay could be applied.
Horizontal strokes, in particular, can be applied to any part of a
glyph (stem, bowl, part of a bowl, etc.), making a decomposition
intractable. (For diagonal strokes you have similar issues with
angle and length.)

As a result, Unicode has the principle of encoding all overlays
as precomposed forms (except for mathematics where only
those forms are precomposed where the negation is applied
irregularly). The exception for mathematics makes sense, because
there's a reasonably consistent semantics (negation) associated
with the combination, and the use is fully productive (can be
applied to essentially any symbol or operator).
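
The productive use in mathematics can be illustrated the same way. This is a minimal sketch using Python's unicodedata module, again assuming U+0338 is the overlay in question; the helper name `negate` is made up for the example. Where a precomposed negated operator exists, NFC maps the sequence onto it; where none exists, the combining sequence simply remains.

```python
import unicodedata

OVERLAY = "\u0338"  # COMBINING LONG SOLIDUS OVERLAY (assumed "combining /")


def negate(symbol: str) -> str:
    """Append the overlay to a symbol and normalize the result to NFC."""
    return unicodedata.normalize("NFC", symbol + OVERLAY)


# = and ∈ and ⊂ have precomposed negated forms; ⊕ does not.
for symbol in ("=", "\u2208", "\u2282", "\u2295"):
    result = negate(symbol)
    code_points = " ".join(f"U+{ord(c):04X}" for c in result)
    print(symbol, "->", result, f"({code_points})")
```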

The case under consideration is rather similar. The combining
hamza exists for a particular use case (Koran), but is otherwise
not part of the orthography. As I understand it, the use of the
combined form for a non-Arabic language is unrelated to
applying a "hamza" even though it uses the same squiggle.
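
For readers who want to see the encoding facts, here is a small sketch with Python's unicodedata module. It assumes the character under discussion is U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE (the code point is not named in this message) and that the competing sequence is U+0628 followed by U+0654. It needs a Python whose Unicode data is version 7.0 or later.

```python
import unicodedata

BEH_WITH_HAMZA = "\u08a1"        # assumed code point under discussion
BEH_PLUS_HAMZA = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def nfc(text: str) -> str:
    """Normalize to NFC, the form IDNA2008 labels are expected to be in."""
    return unicodedata.normalize("NFC", text)


print(unicodedata.name(BEH_WITH_HAMZA))
# No decomposition mapping is defined for the precomposed letter...
print(repr(unicodedata.decomposition(BEH_WITH_HAMZA)))
# ...so NFC leaves the combining sequence alone and the two stay distinct.
print(nfc(BEH_PLUS_HAMZA) == BEH_WITH_HAMZA)
print(len(nfc(BEH_PLUS_HAMZA)))
```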

It's really important to step back and realize that composition
in Unicode is not intended to work like a "glyph composition
toolkit". It is intended to handle certain systematic (productive)
cases, where a mark (for example breve or macron) can be
applied to many characters to indicate short/long pronunciation.

In technical use, these combinations are unrestricted, which is
reflected in Unicode by the use of combining marks.
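
A short sketch of what "productive" means in practice, using Python's unicodedata module (the helper name `apply_mark` is hypothetical). A macron or breve can be placed on any base letter: common combinations normalize to a precomposed character, and the rest are still valid as combining sequences.

```python
import unicodedata

MACRON = "\u0304"  # COMBINING MACRON
BREVE = "\u0306"   # COMBINING BREVE


def apply_mark(base: str, mark: str) -> str:
    """Attach a combining mark to a base letter and normalize to NFC."""
    return unicodedata.normalize("NFC", base + mark)


for base in "aeq":
    for mark in (MACRON, BREVE):
        result = apply_mark(base, mark)
        kind = "precomposed" if len(result) == 1 else "combining sequence"
        print(base, "+", unicodedata.name(mark), "->", result, f"({kind})")
```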

What this has to do with two letters (whether 'a' and 'a' or
ö and ö) being used in two different languages is a bit unclear
to me, so I don't understand Andrew's question.

>     What is important at least for me now is to understand the extent to
>     which this sort of thing happens, what our expectation ought to be in
>     the future about its recurrence, and what implications that has for
>     how we build network protocols atop Unicode.

This "thing" happens regularly (but not really frequently) and
usually not in the context of two languages competing with
each other, but more often in the context of some technical
or limited use needing a combining approach (because in that
context, there really is an underlying combination or "apply
this mark to that character") and an orthographic use of a
fixed symbol which is deemed not analyzable in that context.
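
To get a rough feel for how often this occurs, one could scan the character database for letters whose names suggest an overlay but which carry no decomposition. The sketch below does that with Python's unicodedata module. Matching on a name fragment is only a heuristic assumed for this example, not an official Unicode property, and the counts depend on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata


def undecomposed_letters(name_fragment: str) -> list:
    """Letters whose name contains `name_fragment` and that have no decomposition."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # Only letters (general categories Lu, Ll, Lt, Lm, Lo) are of interest.
        if not unicodedata.category(ch).startswith("L"):
            continue
        name = unicodedata.name(ch, "")
        if name_fragment in name and not unicodedata.decomposition(ch):
            found.append(ch)
    return found


stroked = undecomposed_letters("WITH STROKE")
print(len(stroked), "letters named WITH STROKE have no decomposition")
print(" ".join(stroked[:10]))
```

The same function can be called with other fragments, such as "WITH BAR" or "WITH HAMZA ABOVE", to survey other overlay-like or fixed-form letters.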

For obvious reasons, this "thing" tends to happen for minority
languages, not to say "obscure" ones, if only for the simple
reason that the common, well-known, and prominent ones
are all known and accounted for - though not without this
"thing" already being part of existing Unicode (see the example above).

I keep coming back to the question of why, with the
in-your-face Scandinavian example of long standing,
this is suddenly such an issue for a rather obscure language.

Or, to put it in terms of expectations: I would not expect
this particular code point to be handled in a totally ad-hoc
fashion, if more prominent examples went unchallenged,
and, presumably, are being dealt with more systematically
by other means.

>     Best regards,
>     A
>     --
>     Andrew Sullivan
>     ajs at anvilwalrusden.com
>     _______________________________________________
>     Idna-update mailing list
>     Idna-update at alvestrand.no
>     http://www.alvestrand.no/mailman/listinfo/idna-update
