Fwd: Unicode 7.0.0, (combining) Hamza Above, and normalization for comparison

Wed Aug 6 01:07:24 CEST 2014

 I hadn't heard back from John, but I'm guessing that the right place to
discuss this is here, based on Marc's email.

---------- Forwarded message ----------
From: Mark Davis ☕️ <mark at macchiato.com>
Date: Wed, Jul 30, 2014 at 7:38 AM
Subject: Re: Unicode 7.0.0, (combining) Hamza Above, and normalization for
comparison
To: John C Klensin <john+w3c at jck.com>, Patrik Fältström <paf at frobbit.se>
Cc: member-i18n-core at w3.org, Asmus Freytag <asmusf at ix.netcom.com>

On what email address is this being discussed?  I'd like to convey to that
list some comments from an internal discussion about
draft-klensin-idna-5892upd-unicode70-00.txt.

(These are not my wording, but I agree with them. I edited slightly for
flow. I will add that from a confusability standpoint, the proposed draft
accomplishes nothing, since there are thousands of cases of confusable
characters; restricting just this one character has no useful effect; like
removing a quart of water from a lake.)

For U+08A1, this certainly is a *letter* of Fula (Fulfulde, Pula, ...),
a large language spoken across swaths of West Africa. Fula is mostly
written with the Latin script, but Islamists also write it in Ajami
(Arabic extensions for African languages), particularly in Guinea.
See:

http://en.wikipedia.org/wiki/Fula_orthographies

The *letter* in question is the one used to write the phoneme /ɓ/,
the bilabial implosive. See:

http://en.wikipedia.org/wiki/%C6%81

for the African alphabet convention for the Latin writing of this letter.

For the Arabic Ajami alphabets for Fula, the form has been missing.
For whatever reason, in at least one Fulfulde Ajami orthography,
this implosive was (reasonably) represented by using a Hamza
diacritic on the beh letter. Following the way such *diacritic* (ijam)
letter derivations are encoded in the Unicode Standard, a separate,
non-decomposed entry was required. Note that this use of Hamza
is *different* from the Arabic (language) use of a combining Hamza
to indicate a glottal stop, often in combination with a letter that
is actually pronounced as a vowel.

As to *why* it was encoded as a single, undecomposed letter,
that is explained at length in the proposal document, as well
as in the section on Hamza in the Unicode Standard, which you
have referred to in the Internet Draft you mention.

The newly encoded character U+08A1 for Unicode 7.0 has
*already* been added to the relevant table "Arabic Letters
With Hamza Above" in the draft core specification for
Unicode 7.0, where, like the long-encoded U+0681
and U+076C, it is noted as having no decomposition.
(The core specification will be posted around October -- it
is still undergoing its final editing for all the 7.0 additions.)

U+08A1 does not have a canonical decomposition in Unicode 7.0
(nor, of course, will it *ever* have a canonical decomposition,
because of normalization stability). This is exactly the same treatment
that U+0681 and U+076C got, and for exactly the same reasons.
(And, as you know, of course, those characters date back to
Unicode 4.1 for U+076C and even earlier, Unicode 1.1 for U+0681.)

Note that it is incorrect to assert that U+08A1 ARABIC LETTER BEH
WITH HAMZA ABOVE "should" be normalized to U+0628 ARABIC LETTER
BEH + U+0654 ARABIC HAMZA ABOVE. Those are distinct sequences,
and they are never going to compare equal in their NFC normalizations.
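
The point about normalization can be checked mechanically. What follows
is a minimal Python sketch, assuming a runtime whose bundled unicodedata
tables are at Unicode 7.0 or later. The helper names nfc_equal and
has_canonical_decomposition are made up for illustration. For contrast it
also tests U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE, which does have a
canonical decomposition.

```python
import unicodedata

BEH_WITH_HAMZA = "\u08A1"         # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_PLUS_HAMZA = "\u0628\u0654"   # BEH + combining HAMZA ABOVE
ALEF_WITH_HAMZA = "\u0623"        # ARABIC LETTER ALEF WITH HAMZA ABOVE
ALEF_PLUS_HAMZA = "\u0627\u0654"  # ALEF + combining HAMZA ABOVE


def nfc_equal(a, b):
    """Return True if the two strings are identical after NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def has_canonical_decomposition(ch):
    """Return True if ch has a canonical (untagged) decomposition mapping."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings start with a tag such as "<compat>".
    return bool(mapping) and not mapping.startswith("<")


if __name__ == "__main__":
    print(nfc_equal(BEH_WITH_HAMZA, BEH_PLUS_HAMZA))
    print(nfc_equal(ALEF_WITH_HAMZA, ALEF_PLUS_HAMZA))
    for ch in (BEH_WITH_HAMZA, "\u0681", "\u076C", ALEF_WITH_HAMZA):
        print("U+%04X" % ord(ch), has_canonical_decomposition(ch))
```

Under those assumptions, the BEH pair should stay distinct while the ALEF
pair compares equal, which is the distinction drawn above.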

I am concerned that the Internet Draft here is heading in
exactly the wrong direction. If it ends up changing RFC 5892
to override the derivation for U+08A1 and force it to INVALID,
all I can see that accomplishing is to guarantee forever that
correctly spelled Ajami Fulfulde cannot be used in domain
names, and that instead people would end up having to use
misspellings to represent their implosive b in a domain name.

With all due respect to the Arabic script experts that have been
consulted, I rather doubt that they are experts on Ajami orthographies
in West Africa, or are in touch with the people who would be
supporting those languages and implementing keyboarding
and such for West Africa.

Also, I don't see any way you can justify the abrupt (and permanent)
discontinuity that this would put in place between the treatment
of U+08A1 for Fulfulde and U+076C for Ormuri or U+0681 for Pashto.

If you are looking for a more analogous precedent I suggest, for example:

U+2C65 LATIN SMALL LETTER A WITH STROKE

That was added in Unicode 5.0, and nobody has ever had any problem
with it being PVALID in IDNA. It only has limited use in a minor
orthography, but what is the harm?

Now, if you examine U+2C65, you could well claim that it *should*
be decomposed to "a" plus the combining stroke overlay, U+0338.
And both of those have been encoded for a long, long time in
the standard, so in principle, somebody *could* have been representing
their data for a letter a with stroke before Unicode 5.0 using the
sequence with the stroke overlay. It might even look o.k. in text,
depending on the font support
for the combina̸tion. But the Unicode Standard has rules now for the
encoding of certain combinations of base letters and diacritic modifiers
that overlay or modify the base character form. So U+2C65 was
separately encoded. And there is no normalization of the sequence
involved. That stroked letter use is, in text, distinct from somebody, say,
using a bunch of overlay strokes as a strikethrough convention for
some reason: a̸a̸a̸a̸a̸a̸
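
The same kind of check works for the stroke overlay. This is a small
Python sketch, again assuming the standard unicodedata module with
reasonably current tables; the constant names and the nfc helper are
made up for illustration. It contrasts "a" + U+0338, which does not
compose, with "<" + U+0338, which composes to U+226E NOT LESS-THAN.

```python
import unicodedata

A_WITH_STROKE = "\u2C65"       # LATIN SMALL LETTER A WITH STROKE
A_PLUS_OVERLAY = "a\u0338"     # "a" + COMBINING LONG SOLIDUS OVERLAY
NOT_LESS_THAN = "\u226E"       # NOT LESS-THAN
LESS_PLUS_OVERLAY = "<\u0338"  # "<" + COMBINING LONG SOLIDUS OVERLAY


def nfc(text):
    """Return the NFC form of text."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    # The overlay sequence on "a" is left alone by NFC.
    print(nfc(A_PLUS_OVERLAY) == A_PLUS_OVERLAY)
    # It therefore never becomes the separately encoded letter.
    print(nfc(A_PLUS_OVERLAY) == A_WITH_STROKE)
    # The same overlay does compose where a canonical mapping exists.
    print(nfc(LESS_PLUS_OVERLAY) == NOT_LESS_THAN)
```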

Consider the Hamza diacritic as falling in this same class of edge cases,
if you will.

And in this case, I don't think it will be doing anybody any favors to
update RFC 5892 to make U+08A1 DISALLOWED in IDNA. It doesn't
"fix" normalization for it. All it accomplishes is to force any Fulfulde
user of Ajami orthography to misspell their text in order to use a /ɓ/
in a domain name. It would just create an unexplained (and unfixable)
discontinuity between what the domain registrations would accept
and what the Fulfulde input and spelling tools would support. Or I
guess it would just force people to give up the Arabic spellings and
go back to the more widely supported Latin alphabets for Fula to
get their domain names.

What would be accomplished by forcing another point incompatibility
that just ends up getting carried around forever?

====

There are four levels at which confusables, including homoglyphs,
can be addressed for domain names:

1. Encoding
2. Protocol (IDNA)
3. Label Generation Ruleset
4. String Review

A more natural level [for addressing confusables] would be the Label
Generation Ruleset level. For an LGR, there are three ways to deal with
homoglyphs, one of which is not available on the protocol level. The first
two of these are to rule out a code point (by not including it in the LGR's
repertoire), or to rule out a code point or sequence conditionally. Unlike
using these methods on the Protocol level, doing so on the LGR level means
that it is possible to be more restrictive, say, for the root of the DNS
than for domains several levels down the tree. The downside of using the
LGR is, of course, that it is specific to the given zone on the internet.

The upside is that an LGR has additional mechanisms, such as defining a
"blocked" variant. That creates an "either/or" situation, where both are
permitted, but not at the same time in the same position of an otherwise
identical label. This is a very nice solution for a number of
confusables/homoglyphs that are systemic (not dependent on accidents of
rendering or "arms length" similarity).
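
To make the "either/or" behaviour concrete, here is a rough Python sketch
of a blocked variant. It is not the actual LGR format or any registry's
implementation; the table BLOCKED_VARIANTS, the function variant_key, and
the class Zone are all hypothetical names chosen for illustration, and the
single table entry is just the pair discussed in this thread.

```python
# Hypothetical variant table: a code point and the sequence it is a
# blocked variant of. A real LGR would define this per zone.
BLOCKED_VARIANTS = {
    "\u08A1": "\u0628\u0654",  # BEH WITH HAMZA ABOVE vs. BEH + HAMZA ABOVE
}


def variant_key(label):
    """Fold every variant to one form so that variant labels collide."""
    for code_point, sequence in BLOCKED_VARIANTS.items():
        label = label.replace(code_point, sequence)
    return label


class Zone:
    """Toy registry in which a label blocks all of its variant labels."""

    def __init__(self):
        self._taken = {}  # variant key -> label that was registered

    def register(self, label):
        """Register label; return False if it or a variant is taken."""
        key = variant_key(label)
        if key in self._taken:
            return False
        self._taken[key] = label
        return True


if __name__ == "__main__":
    zone = Zone()
    print(zone.register("\u08A1\u0627"))        # single-letter form
    print(zone.register("\u0628\u0654\u0627"))  # sequence form, now blocked
```

Both spellings are acceptable to this toy zone, but whichever is
registered first blocks the other, and no case-by-case review is needed.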

Unlike the final level, String Review, an LGR has the advantage of being
applied mechanically without any case-by-case review, which is why it's
appropriate for cases like the one that gave rise to this discussion.

In principle, both the Label Generation Ruleset and the String Review are
created/carried out by people/entities that have access to the necessary
and specific linguistic and script expertise, unlike IDNA which seems to be
created largely by protocol experts.

On Tue, Jul 22, 2014 at 12:22 AM, John C Klensin <john+w3c at jck.com> wrote:

> Hi.   I was asked to forward the announcement of this Internet
> Draft to this group once it was posted.  See attached.
>
> For information -- comments welcome, but the core issue may be
> rather specific to concerns that surround IDNs and IDNA.  Or not.
>
> Of course, if I/we are still completely confused, corrections
> and explanations would be welcome.
>
>     john
>
>
> ---------- Forwarded message ----------
> From: internet-drafts at ietf.org
> To: i-d-announce at ietf.org
> Cc:
> Date: Mon, 21 Jul 2014 04:03:58 -0700
> Subject: I-D Action: draft-klensin-idna-5892upd-unicode70-00.txt
>
> A New Internet-Draft is available from the on-line Internet-Drafts
> directories.
>
>
>         Title           : IDNA Update for Unicode 7.0.0
>         Authors         : John C Klensin
>                           Patrik Faltstrom
>         Filename        : draft-klensin-idna-5892upd-unicode70-00.txt
>         Pages           : 10
>         Date            : 2014-07-21
>
> Abstract:
>    The current version of the IDNA specifications anticipated that each
>    new version of Unicode would be reviewed to verify that no changes
>    had been introduced that required adjustments to the set of rules
>    and, in particular, whether new exceptions or backward compatibility
>    adjustments were needed.  That review was conducted for Unicode 7.0.0
>    and identified a problematic new code point.  This specification
>    updates RFC 5892 to disallow that code point and provides information
>    about the reasons why that exclusion is appropriate.  It also applies
>    an editorial clarification that was the subject of an earlier
>    erratum.
>
>
> The IETF datatracker status page for this draft is:
> https://datatracker.ietf.org/doc/draft-klensin-idna-5892upd-unicode70/
>
> There's also a htmlized version available at:
> http://tools.ietf.org/html/draft-klensin-idna-5892upd-unicode70-00
>
>
> Please note that it may take a couple of minutes from the time of
> submission until the htmlized version and diff are available at
> tools.ietf.org.
>
> Internet-Drafts are also available by anonymous FTP at:
> ftp://ftp.ietf.org/internet-drafts/
>
> _______________________________________________
> I-D-Announce mailing list
> I-D-Announce at ietf.org
> https://www.ietf.org/mailman/listinfo/i-d-announce
> Internet-Draft directories: http://www.ietf.org/shadow.html
> or ftp://ftp.ietf.org/ietf/1shadow-sites.txt
>
>