Unicode 7.0.0, (combining) Hamza Above, and normalization for comparison

Mark Davis ☕️ mark at macchiato.com
Wed Aug 6 13:07:28 CEST 2014


I don't much care either, as long as there is a known place *for* it to be
discussed.


Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*


On Tue, Aug 5, 2014 at 10:03 PM, Patrik Fältström <paf at frobbit.se> wrote:

> To be honest, I do not think it matters where it is discussed.
>
>    Patrik
>
> On 6 aug 2014, at 01:07, Mark Davis ☕️ <mark at macchiato.com> wrote:
>
> >
> > I hadn't heard back from John, but I'm guessing that the right place to
> > discuss this is here, based on Marc's email.
> >
> >
> > ---------- Forwarded message ----------
> > From: Mark Davis ☕️ <mark at macchiato.com>
> > Date: Wed, Jul 30, 2014 at 7:38 AM
> > Subject: Re: Unicode 7.0.0, (combining) Hamza Above, and normalization for comparison
> > To: John C Klensin <john+w3c at jck.com>, Patrik Fältström <paf at frobbit.se>
> > Cc: member-i18n-core at w3.org, Asmus Freytag <asmusf at ix.netcom.com>
> >
> >
> > On what email address is this being discussed?  I'd like to convey to
> > that list some comments from an internal discussion about
> > draft-klensin-idna-5892upd-unicode70-00.txt.
> >
> > (These are not my words, but I agree with them. I edited slightly for
> > flow. I will add that from a confusability standpoint, the proposed draft
> > accomplishes nothing, since there are thousands of cases of confusable
> > characters; restricting just this one character has no useful effect. It
> > is like removing a quart of water from a lake.)
> >
> > For U+08A1, this certainly is a *letter* of Fula (Fulfulde, Pula, ...),
> > a large language spoken across swaths of
> > West Africa. Fula is mostly written with the Latin script,
> > but Islamists also write it in Ajami (Arabic extensions for African
> > languages), particularly in Guinea.
> > See:
> >
> > http://en.wikipedia.org/wiki/Fula_orthographies
> >
> > The *letter* in question is the one used to write the phoneme /ɓ/,
> > the bilabial implosive. See:
> >
> > http://en.wikipedia.org/wiki/%C6%81
> >
> > for the African alphabet convention for the Latin writing of this letter.
> >
> > For the Arabic Ajami alphabets for Fula, the form has been missing.
> > For whatever reason, in at least one Fulfulde Ajami orthography,
> > this implosive was (reasonably) represented by using a Hamza
> > diacritic on the beh letter. Following the way such *diacritic* (ijam)
> > letter derivations are encoded in the Unicode Standard, a separate,
> > non-decomposed entry was required. Note that this use of Hamza
> > is *different* from the Arabic (language) use of a combining Hamza
> > to indicate a glottal stop, often in combination with a letter that
> > is actually pronounced as a vowel.
> >
> > As to *why* it was encoded as a single, undecomposed letter,
> > that is explained at length in the proposal document, as well
> > as in the section on Hamza in the Unicode Standard, which you
> > have referred to in the Internet Draft you mention.
> >
> > The newly encoded character U+08A1 for Unicode 7.0 has
> > *already* been added to the relevant table "Arabic Letters
> > With Hamza Above" in the draft core specification for
> > Unicode 7.0, where, like the long-encoded U+0681
> > and U+076C, it is noted as having no decomposition.
> > (The core specification will be posted around October -- it
> > is still undergoing its final editing for all the 7.0 additions.)
> >
> > U+08A1 does not have a canonical decomposition in Unicode 7.0
> > (nor, of course, will it *ever* have a canonical decomposition,
> > because of normalization stability). This is exactly the same treatment
> > that U+0681 and U+076C got, and for exactly the same reasons.
> > (And, as you know, of course, those characters date back to
> > Unicode 4.1 for U+076C and even earlier, Unicode 1.1 for U+0681.)
> >
> > Note that it is incorrect to assert that U+08A1 ARABIC LETTER BEH
> > WITH HAMZA ABOVE "should" be normalized to U+0628 ARABIC LETTER
> > BEH + U+0654 ARABIC HAMZA ABOVE. Those are distinct sequences,
> > and they are never going to compare equal in their NFC normalizations.
> >
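The normalization claims above can be checked with a minimal sketch using Python's standard `unicodedata` module. It assumes a Python build whose Unicode database is at version 7.0 or later; the helper names `has_canonical_decomposition` and `nfc_equal` are illustrative, not part of any specification.

```python
import unicodedata

BEH_WITH_HAMZA = "\u08A1"        # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_THEN_HAMZA = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def has_canonical_decomposition(ch):
    """Return True if ch has a canonical decomposition mapping.

    unicodedata.decomposition() returns "" when there is no mapping and
    prefixes compatibility mappings with a tag such as "<compat>".
    """
    mapping = unicodedata.decomposition(ch)
    return bool(mapping) and not mapping.startswith("<")


def nfc_equal(a, b):
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # U+0623 (ALEF WITH HAMZA ABOVE) is included only as a contrast case.
    for ch in (BEH_WITH_HAMZA, "\u0681", "\u076C", "\u0623"):
        print("U+%04X" % ord(ch), has_canonical_decomposition(ch))
    print("NFC equal:", nfc_equal(BEH_WITH_HAMZA, BEH_THEN_HAMZA))
```

U+0623 is there for contrast: it does carry a canonical decomposition (U+0627 U+0654), so the precomposed and decomposed spellings of that letter compare equal under NFC, while the two spellings of beh with hamza do not.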
> > I am concerned that the Internet Draft here is heading in
> > exactly the wrong direction. If it ends up changing RFC 5892
> > to override the derivation for U+08A1 and force it to INVALID,
> > all I can see that accomplishing is to guarantee forever that
> > correctly spelled Ajami Fulfulde cannot be used in domain
> > names, and that instead people would end up having to use
> > misspellings to represent their implosive b in domain names.
> >
> > With all due respect to the Arabic script experts that have been
> > consulted, I rather doubt that they are experts on Ajami orthographies
> > in West Africa, or are in touch with the people who would be
> > supporting those languages and implementing keyboarding
> > and such for West Africa.
> >
> > Also, I don't see any way you can justify the abrupt (and permanent)
> > discontinuity that this would put in place between the treatment
> > of U+08A1 for Fulfulde and U+076C for Ormuri or U+0681 for Pashto.
> >
> >
> > If you are looking for a more analogous precedent I suggest, for example:
> >
> > U+2C65 LATIN SMALL LETTER A WITH STROKE
> >
> > That was added in Unicode 5.0, and nobody has ever had any problem
> > with it being PVALID in IDNA. It only has limited use in a minor
> > orthography, but what is the harm?
> >
> > Now, if you examine U+2C65, you could well claim that it *should*
> > be decomposed to "a" plus the combining stroke overlay, U+0338.
> > And both of those have been encoded for a long, long time in
> > the standard, so in principle, somebody *could* have been representing
> > their data for a letter a with stroke before Unicode 5.0 using the
> > sequence with the stroke overlay. It might even look o.k. in text,
> > depending on the font support
> > for the combina̸tion. But the Unicode Standard has rules now for the
> > encoding of certain combinations of base letters and diacritic modifiers
> > that overlay or modify the base character form. So U+2C65 was
> > separately encoded. And there is no normalization of the sequence
> > involved. That stroked letter use is, in text, distinct from somebody,
> > say, using a bunch of overlay strokes as a strikethrough convention for
> > some reason: a̸a̸a̸a̸a̸a̸
> >
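The U+2C65 precedent can be illustrated the same way. The sketch below, again assuming Python's standard `unicodedata` module, tests whether two strings become identical under any of the four normalization forms; the function name `equal_under_any_form` is hypothetical. U+0338 is formally named COMBINING LONG SOLIDUS OVERLAY.

```python
import unicodedata

A_WITH_STROKE = "\u2C65"    # LATIN SMALL LETTER A WITH STROKE
A_PLUS_OVERLAY = "a\u0338"  # "a" + COMBINING LONG SOLIDUS OVERLAY

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def equal_under_any_form(a, b):
    """Return True if a and b match under at least one normalization form."""
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in FORMS
    )


if __name__ == "__main__":
    print(equal_under_any_form(A_WITH_STROKE, A_PLUS_OVERLAY))
    # Contrast: NOT EQUAL TO (U+2260) does decompose to "=" + U+0338.
    print(equal_under_any_form("\u2260", "=\u0338"))
```

The contrast case shows that the overlay does take part in normalization for some characters, just not for the stroked letter.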
> > Consider the Hamza diacritic as falling in this same class of edge cases,
> > if you will.
> >
> > And in this case, I don't think it will be doing anybody any favors to
> > update RFC 5892 to make U+08A1 DISALLOWED in IDNA. It doesn't
> > "fix" normalization for it. All it accomplishes is to force any Fulfulde
> > user of Ajami orthography to misspell their text in order to use a /ɓ/
> > in a domain name. It would just create an unexplained (and unfixable)
> > discontinuity between what the domain registrations would accept
> > and what the Fulfulde input and spelling tools would support. Or I
> > guess it would just force people to give up the Arabic spellings and
> > go back to the more widely supported Latin alphabets for Fula to
> > get their domain names.
> >
> > What would be accomplished by
> > forcing another point incompatibility that just ends up getting
> > carried around forever?
> >
> > ====
> >
> > There are four levels at which confusables, including homoglyphs, can
> > be addressed for domain names:
> >
> > 1. Encoding
> > 2. Protocol (IDNA)
> > 3. Label Generation Ruleset
> > 4. String Review
> >
> > A more natural level [for addressing confusables] would be the
> > Label Generation Ruleset level. For an LGR, there are three ways to deal
> > with homoglyphs, one of which is not available on the protocol level. The
> > first two of these are to rule out a code point (by not including it in
> > the LGR's repertoire), or to rule out a code point or sequence
> > conditionally. Unlike using these methods on the Protocol level, doing so
> > on the LGR level means that it is possible to be more restrictive, say,
> > for the root of the DNS than for domains several levels down the tree.
> > The downside of using the LGR is, of course, that it is specific to the
> > given zone on the internet.
> >
> > The upside is that an LGR has additional mechanisms, such as defining a
> > "blocked" variant. That creates an "either/or" situation, where both are
> > permitted, but not at the same time in the same position of an otherwise
> > identical label. This is a very nice solution for a number of
> > confusables/homoglyphs that are systemic (not dependent on accidents of
> > rendering or "arms length" similarity).
> >
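A rough sketch of the "blocked variant" idea follows, under stated assumptions. The `VARIANTS` table, the `variant_key` function and the `Zone` class are all hypothetical names invented for illustration. Real Label Generation Rulesets are defined per zone and are considerably richer than this.

```python
import unicodedata

# Hypothetical variant table for a single zone (an assumption for this
# sketch): the two spellings below are blocked variants of each other.
VARIANTS = [("\u08A1", "\u0628\u0654")]


def variant_key(label):
    """Fold every variant spelling in a label to one representative form."""
    key = unicodedata.normalize("NFC", label)
    for representative, other in VARIANTS:
        key = key.replace(other, representative)
    return key


class Zone:
    """Toy registry: at most one label per variant key may be registered."""

    def __init__(self):
        self._registered = {}  # variant key -> label as first registered

    def register(self, label):
        """Register label; return False if a variant of it is already taken."""
        key = variant_key(label)
        if key in self._registered:
            return False
        self._registered[key] = unicodedata.normalize("NFC", label)
        return True


if __name__ == "__main__":
    zone = Zone()
    print(zone.register("\u08A1\u0627"))        # precomposed spelling
    print(zone.register("\u0628\u0654\u0627"))  # sequence spelling, same key
```

In this sketch either spelling is acceptable on its own, and whichever label is registered first blocks the other. That matches the "either/or" behavior described above, and it needs no change at the protocol level.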
> > Unlike the final level, String Review, an LGR has the advantage of being
> > applied mechanically without any case-by-case review, which is why it's
> > appropriate for cases like the one that gave rise to this discussion.
> >
> > In principle, both the Label Generation Ruleset and the String Review are
> > created/carried out by people/entities that have access to the necessary
> > and specific linguistic and script expertise, unlike IDNA, which seems to
> > be created largely by protocol experts.
> >
> >
> > On Tue, Jul 22, 2014 at 12:22 AM, John C Klensin <john+w3c at jck.com> wrote:
> > Hi.   I was asked to forward the announcement of this Internet
> > Draft to this group once it was posted.  See attached.
> >
> > For information -- comments welcome, but the core issue may be
> > rather specific to concerns that surround IDNs and IDNA.  Or not.
> >
> > Of course, if I/we are still completely confused, corrections
> > and explanations would be welcome.
> >
> >     john
> >
> >
> > ---------- Forwarded message ----------
> > From: internet-drafts at ietf.org
> > To: i-d-announce at ietf.org
> > Cc:
> > Date: Mon, 21 Jul 2014 04:03:58 -0700
> > Subject: I-D Action: draft-klensin-idna-5892upd-unicode70-00.txt
> >
> > A New Internet-Draft is available from the on-line Internet-Drafts
> > directories.
> >
> >
> >         Title           : IDNA Update for Unicode 7.0.0
> >         Authors         : John C Klensin
> >                           Patrik Faltstrom
> >         Filename        : draft-klensin-idna-5892upd-unicode70-00.txt
> >         Pages           : 10
> >         Date            : 2014-07-21
> >
> > Abstract:
> >    The current version of the IDNA specifications anticipated that each
> >    new version of Unicode would be reviewed to verify that no changes
> >    had been introduced that required adjustments to the set of rules
> >    and, in particular, whether new exceptions or backward compatibility
> >    adjustments were needed.  That review was conducted for Unicode 7.0.0
> >    and identified a problematic new code point.  This specification
> >    updates RFC 5892 to disallow that code point and provides information
> >    about the reasons why that exclusion is appropriate.  It also applies
> >    an editorial clarification that was the subject of an earlier
> >    erratum.
> >
> >
> > The IETF datatracker status page for this draft is:
> > https://datatracker.ietf.org/doc/draft-klensin-idna-5892upd-unicode70/
> >
> > There's also a htmlized version available at:
> > http://tools.ietf.org/html/draft-klensin-idna-5892upd-unicode70-00
> >
> >
> > Please note that it may take a couple of minutes from the time of
> > submission until the htmlized version and diff are available at
> > tools.ietf.org.
> >
> > Internet-Drafts are also available by anonymous FTP at:
> > ftp://ftp.ietf.org/internet-drafts/
> >
> > _______________________________________________
> > I-D-Announce mailing list
> > I-D-Announce at ietf.org
> > https://www.ietf.org/mailman/listinfo/i-d-announce
> > Internet-Draft directories: http://www.ietf.org/shadow.html
> > or ftp://ftp.ietf.org/ietf/1shadow-sites.txt
> >
> >
> >
> > _______________________________________________
> > Idna-update mailing list
> > Idna-update at alvestrand.no
> > http://www.alvestrand.no/mailman/listinfo/idna-update
>
>