Unicode 7.0.0, (combining) Hamza Above, and normalization for comparison

Wed Aug 6 07:03:20 CEST 2014

To be honest, I do not think it matters where it is discussed.

   Patrik

On 6 aug 2014, at 01:07, Mark Davis ☕️ <mark at macchiato.com> wrote:

> 
> I hadn't heard back from John, but I'm guessing that the right place to discuss this is here, based on Marc's email.
> 
> 
> ---------- Forwarded message ----------
> From: Mark Davis ☕️ <mark at macchiato.com>
> Date: Wed, Jul 30, 2014 at 7:38 AM
> Subject: Re: Unicode 7.0.0, (combining) Hamza Above, and normalization for comparison
> To: John C Klensin <john+w3c at jck.com>, Patrik Fältström <paf at frobbit.se>
> Cc: member-i18n-core at w3.org, Asmus Freytag <asmusf at ix.netcom.com>
> 
> 
> On what email address is this being discussed?  I'd like to convey to that list some comments from an internal discussion about draft-klensin-idna-5892upd-unicode70-00.txt. 
> 
> (These are not my wording, but I agree with them. I edited slightly for flow. I will add that from a confusability standpoint, the proposed draft accomplishes nothing, since there are thousands of cases of confusable characters; restricting just this one character has no useful effect; like removing a quart of water from a lake.)
> 
> For U+08A1, this certainly is a *letter* of Fula (Fulfulde, Pula, ...),
> a large language spoken across swaths of
> West Africa. Fula is mostly written with the Latin script,
> but Islamists also write it in Ajami (Arabic extensions for African
> languages), particularly in Guinea.
> See:
> 
> http://en.wikipedia.org/wiki/Fula_orthographies
> 
> The *letter* in question is the one used to write the phoneme /ɓ/,
> the bilabial implosive. See:
> 
> http://en.wikipedia.org/wiki/%C6%81
> 
> for the African alphabet convention for the Latin writing of this letter.
> 
> For the Arabic Ajami alphabets for Fula, the form has been missing.
> For whatever reason, in at least one Fulfulde Ajami orthography,
> this implosive was (reasonably) represented by using a Hamza
> diacritic on the beh letter. Following the way such *diacritic* (ijam)
> letter derivations are encoded in the Unicode Standard, a separate,
> non-decomposed entry was required. Note that this use of Hamza
> is *different* from the Arabic (language) use of a combining Hamza
> to indicate a glottal stop, often in combination with a letter that
> is actually pronounced as a vowel.
> 
> As to *why* it was encoded as a single, undecomposed letter,
> that is explained at length in the proposal document, as well
> as in the section on Hamza in the Unicode Standard, which you
> have referred to in the Internet Draft you mention.
> 
> The newly encoded character U+08A1 for Unicode 7.0 has
> *already* been added to the relevant table "Arabic Letters
> With Hamza Above" in the draft core specification for
> Unicode 7.0, where, like the long-encoded U+0681
> and U+076C, it is noted as having no decomposition.
> (The core specification will be posted around October -- it
> is still undergoing its final editing for all the 7.0 additions.)
> 
> U+08A1 does not have a canonical decomposition in Unicode 7.0
> (nor, of course, will it *ever* have a canonical decomposition,
> because of normalization stability). This is exactly the same treatment
> that U+0681 and U+076C got, and for exactly the same reasons.
> (And, as you know, of course, those characters date back to
> Unicode 4.1 for U+076C and even earlier, Unicode 1.1 for U+0681.)
> 
> Note that it is incorrect to assert that U+08A1 ARABIC LETTER BEH
> WITH HAMZA ABOVE "should" be normalized to U+0628 ARABIC LETTER
> BEH + U+0654 ARABIC HAMZA ABOVE. Those are distinct sequences,
> and they are never going to compare equal in their NFC normalizations.
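
The normalization claims above can be checked with a minimal sketch, assuming Python's standard `unicodedata` module and an interpreter whose Unicode data is at version 7.0 or later. The helper names `canonical_decomposition` and `nfc_equal` are illustrative only and come from no specification. U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE is included purely as a contrast case, since it does have a canonical decomposition.

```python
import unicodedata

# Letters with Hamza Above named in the message, each encoded atomically.
ATOMIC = ["\u08A1", "\u0681", "\u076C"]

def canonical_decomposition(ch: str) -> str:
    """Return the canonical decomposition mapping, or '' if there is none."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings start with a tag such as '<compat>'; ignore those.
    return "" if mapping.startswith("<") else mapping

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

beh_hamza_atomic = "\u08A1"           # ARABIC LETTER BEH WITH HAMZA ABOVE
beh_hamza_sequence = "\u0628\u0654"   # BEH + combining HAMZA ABOVE
alef_hamza_atomic = "\u0623"          # contrast case with a decomposition
alef_hamza_sequence = "\u0627\u0654"  # ALEF + combining HAMZA ABOVE

if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for ch in ATOMIC:
        print(f"U+{ord(ch):04X} decomposition:", repr(canonical_decomposition(ch)))
    print("BEH forms equal under NFC: ", nfc_equal(beh_hamza_atomic, beh_hamza_sequence))
    print("ALEF forms equal under NFC:", nfc_equal(alef_hamza_atomic, alef_hamza_sequence))
```

Under these assumptions the two BEH forms stay distinct after normalization, while the two ALEF forms compare equal.
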
> 
> I am concerned that the Internet Draft here is heading in
> exactly the wrong direction. If it ends up changing RFC 5892
> to override the derivation for U+08A1 and force it to INVALID,
> all I can see that accomplishing is to guarantee forever that
> correctly spelled Ajami Fulfulde cannot be used in domain
> names, and that instead people would end up having to use
> misspellings to represent their implosive b in domain names.
> 
> With all due respect to the Arabic script experts that have been
> consulted, I rather doubt that they are experts on Ajami orthographies
> in West Africa, or are in touch with the people who would be
> supporting those languages and implementing keyboarding
> and such for West Africa.
> 
> Also, I don't see any way you can justify the abrupt (and permanent)
> discontinuity that this would put in place between the treatment
> of U+08A1 for Fulfulde and U+076C for Ormuri or U+0681 for Pashto.
> 
> 
> If you are looking for a more analogous precedent I suggest, for example:
> 
> U+2C65 LATIN SMALL LETTER A WITH STROKE
> 
> That was added in Unicode 5.0, and nobody has ever had any problem
> with it being PVALID in IDNA. It only has limited use in a minor orthography,
> but what is the harm?
> 
> Now, if you examine U+2C65, you could well claim that it *should*
> be decomposed to "a" plus the combining stroke overlay, U+0338.
> And both of those have been encoded for a long, long time in
> the standard, so in principle, somebody *could* have been representing
> their data for a letter a with stroke before Unicode 5.0 using the sequence with the stroke
> overlay. It might even look o.k. in text, depending on the font support
> for the combina̸tion. But the Unicode Standard has rules now for the
> encoding of certain combinations of base letters and diacritic modifiers
> that overlay or modify the base character form. So U+2C65 was
> separately encoded. And there is no normalization of the sequence
> involved. That stroked letter use is, in text, distinct from somebody, say,
> using a bunch of overlay strokes as a strikethrough convention for
> some reason: a̸a̸a̸a̸a̸a̸
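
The same kind of check applies to the U+2C65 precedent. This is a rough sketch using Python's standard `unicodedata` module; the constant and helper names are illustrative. The pair "=" plus U+0338, which does compose to U+2260 NOT EQUAL TO, is added only as a contrast and is not taken from the message.

```python
import unicodedata

A_WITH_STROKE = "\u2C65"    # LATIN SMALL LETTER A WITH STROKE
A_PLUS_OVERLAY = "a\u0338"  # 'a' + COMBINING LONG SOLIDUS OVERLAY

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

if __name__ == "__main__":
    # The atomic letter does not decompose, and the sequence does not compose.
    print(nfd(A_WITH_STROKE) == A_WITH_STROKE)
    print(nfc(A_PLUS_OVERLAY) == A_PLUS_OVERLAY)
    print(nfc(A_PLUS_OVERLAY) == nfc(A_WITH_STROKE))
    # Contrast: '=' + U+0338 has a precomposed canonical equivalent.
    print(nfc("=\u0338") == "\u2260")
```
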
> 
> Consider the Hamza diacritic as falling in this same class of edge cases,
> if you will.
> 
> And in this case, I don't think it will be doing anybody any favors to
> update RFC 5892 to make U+08A1 DISALLOWED in IDNA. It doesn't
> "fix" normalization for it. All it accomplishes is to force any Fulfulde
> user of Ajami orthography to misspell their text in order to use a /ɓ/
> in a domain name. It would just create an unexplained (and unfixable)
> discontinuity between what the domain registrations would accept
> and what the Fulfulde input and spelling tools would support. Or I
> guess it would just force people to give up the Arabic spellings and
> go back to the more widely supported Latin alphabets for Fula to
> get their domain names.
> 
> What would be accomplished by
> forcing another point incompatibility that just ends up getting
> carried around forever?
> 
> ====
> 
> There are four levels at which confusables, including homoglyphs, can be addressed for domain names:
> 
> 1. Encoding
> 2. Protocol (IDNA)
> 3. Label Generation Ruleset
> 4. String Review
> 
> A more natural level [for addressing confusables] would be the Label Generation Ruleset level. For an LGR, there are three ways to deal with homoglyphs, one of which is not available on the protocol level. The first two of these are to rule out a code point (by not including it in the LGR's repertoire), or to rule out a code point or sequence conditionally. Unlike using these methods on the Protocol level, doing so on the LGR level means that it is possible to be more restrictive, say, for the root of the DNS than for domains several levels down the tree. The downside of using the LGR is, of course, that it is specific to the given zone on the internet.
> 
> The upside is that an LGR has additional mechanisms, such as defining a "blocked" variant. That creates an "either/or" situation, where both are permitted, but not at the same time in the same position of an otherwise identical label. This is a very nice solution for a number of confusables/homoglyphs that are systemic (not dependent on accidents of rendering or "arms length" similarity).
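
The "blocked" variant idea can be illustrated with a small sketch. It makes several assumptions: the `VARIANTS` table, the `index_label` function, and the `Zone` class are all hypothetical names, and this is neither an actual LGR format nor any registry's implementation. The sketch only shows the either/or behaviour described above, where two labels that differ solely by a variant pair cannot both be registered in one zone.

```python
# Hypothetical variant table: each sequence maps to the form used as index key.
VARIANTS = {
    "\u0628\u0654": "\u08A1",  # BEH + HAMZA ABOVE  <->  BEH WITH HAMZA ABOVE
}

def index_label(label: str) -> str:
    """Fold every variant sequence to one form so that variants collide."""
    for sequence, folded in VARIANTS.items():
        label = label.replace(sequence, folded)
    return label

class Zone:
    """Toy registry for one zone; each zone could carry its own rules."""

    def __init__(self) -> None:
        self._by_index: dict[str, str] = {}  # index label -> registered label

    def register(self, label: str) -> bool:
        key = index_label(label)
        if key in self._by_index:
            # Either the same label or a blocked variant of an existing one.
            return False
        self._by_index[key] = label
        return True

if __name__ == "__main__":
    zone = Zone()
    print(zone.register("\u08A1\u0627"))        # first form is accepted
    print(zone.register("\u0628\u0654\u0627"))  # variant of it is blocked
    print(zone.register("\u0628\u0627"))        # unrelated label is accepted
```

Either spelling can be registered first; the other is then refused mechanically, without case-by-case review.
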
> 
> Unlike the final level, String Review, an LGR has the advantage of being applied mechanically without any case-by-case review, which is why it's appropriate for cases like the one that gave rise to this discussion.
> 
> In principle, both the Label Generation Ruleset and the String Review are created or carried out by people or entities that have access to the necessary and specific linguistic and script expertise, unlike IDNA, which seems to be created largely by protocol experts.
> 
> 
> On Tue, Jul 22, 2014 at 12:22 AM, John C Klensin <john+w3c at jck.com> wrote:
> Hi.   I was asked to forward the announcement of this Internet
> Draft to this group once it was posted.  See attached.
> 
> For information -- comments welcome, but the core issue may be
> rather specific to concerns that surround IDNs and IDNA.  Or not.
> 
> Of course, if I/we are still completely confused, corrections
> and explanations would be welcome.
> 
>     john
> 
> 
> ---------- Forwarded message ----------
> From: internet-drafts at ietf.org
> To: i-d-announce at ietf.org
> Cc: 
> Date: Mon, 21 Jul 2014 04:03:58 -0700
> Subject: I-D Action: draft-klensin-idna-5892upd-unicode70-00.txt
> 
> A New Internet-Draft is available from the on-line Internet-Drafts directories.
> 
> 
>         Title           : IDNA Update for Unicode 7.0.0
>         Authors         : John C Klensin
>                           Patrik Faltstrom
>         Filename        : draft-klensin-idna-5892upd-unicode70-00.txt
>         Pages           : 10
>         Date            : 2014-07-21
> 
> Abstract:
>    The current version of the IDNA specifications anticipated that each
>    new version of Unicode would be reviewed to verify that no changes
>    had been introduced that required adjustments to the set of rules
>    and, in particular, whether new exceptions or backward compatibility
>    adjustments were needed.  That review was conducted for Unicode 7.0.0
>    and identified a problematic new code point.  This specification
>    updates RFC 5892 to disallow that code point and provides information
>    about the reasons why that exclusion is appropriate.  It also applies
>    an editorial clarification that was the subject of an earlier
>    erratum.
> 
> 
> The IETF datatracker status page for this draft is:
> https://datatracker.ietf.org/doc/draft-klensin-idna-5892upd-unicode70/
> 
> There's also a htmlized version available at:
> http://tools.ietf.org/html/draft-klensin-idna-5892upd-unicode70-00
> 
> 
> Please note that it may take a couple of minutes from the time of submission
> until the htmlized version and diff are available at tools.ietf.org.
> 
> Internet-Drafts are also available by anonymous FTP at:
> ftp://ftp.ietf.org/internet-drafts/
> 
> _______________________________________________
> I-D-Announce mailing list
> I-D-Announce at ietf.org
> https://www.ietf.org/mailman/listinfo/i-d-announce
> Internet-Draft directories: http://www.ietf.org/shadow.html
> or ftp://ftp.ietf.org/ietf/1shadow-sites.txt
> 
> 
> 
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
