Unicode 7.0.0, (combining) Hamza Above, and normalization for comparison

Wed Aug 6 15:20:25 CEST 2014

The problem with excluding this one particular character is that it
introduces a discrepancy with no real value. It's a bit like addressing
global warming by requiring that Harley Davidsons with licence plate
numbers divisible by 3 cannot use the US interstate. It might make some
people feel like they are doing something good, but it just complicates the
rules with no measurable benefit.

The principle that is being espoused is something like:

P1: Any characters that are visually confusable with others should be
excluded from domain names.

But why apply that principle to this (rather rare) character, that is only
used in a particular language, when it is not applied to thousands of other
characters, and characters that are vastly more commonly used? And if P1 is
not the principle that is being applied, what is?

Mark <https://google.com/+MarkDavis>

 *— Il meglio è l’inimico del bene —*

On Wed, Aug 6, 2014 at 5:02 AM, Vint Cerf <vint at google.com> wrote:

> Mark,
>
> I think it is important to distinguish the use of a character for purposes
> other than domain names and for use in domain names. The long preface in
> your email makes the case that the character is useful and legitimate in
> the language but I think it has long been understood that domain names have
> properties and requirements for unambiguous comparison that might lead to
> the conclusion that, despite the character's undeniable utility in written
> language, it may still not be appropriate for use in domain names.
>
> In consequence of that line of reasoning, I would separate the legitimate
> presence of the character in UNICODE from its use in domain names.
>
> The next question, of course, is whether there is any harm to users that
> can be anticipated by the use of the character in domain names. I think it
> is clear that whatever conclusion is reached, it should apply to all levels
> of domain name and therefore to all labels. Adopting a rule concerning a
> character that is applied only at TLD or SLD level but cannot be enforced
> at lower levels of a DNS hierarchy seems like a mistake, so if there is a
> problem with a character, that problem should be solved for all use in
> labels.
>
> I want to emphasize that, so far in this text, I have not taken a position
> regarding the use of U+08A1 in domain names. I am only discussing the
> process of deciding whether a character should be rendered invalid for use
> in domain name labels. The argument that the character is useful for
> properly formed written language is not necessarily an argument for its
> permitted use in domain names, if harms are identified in the latter usage
> that are considered unacceptable.
>
> It is also clear from past experience that reasonable people can differ in
> their assessment of the degree of risk or harm that a particular character
> poses. So, now we get to the central question of whether U+08A1 produces
> sufficient risk to be banned from use in domain names.
>
> In Klensin's draft RFC, the argument is made that the previous way in
> which BEH WITH HAMZA ABOVE was accommodated, including for use in domain
> names, was a sequence U+0628 followed by U+0654. The incorporation of a
> new, combined character U+08A1 creates ambiguity because there would be two
> ways for this character to be used in a domain name, but the two do not
> compare equal. For users, this creates the risk that two labels could be
> registered that would look the same but presumably take one to distinct
> destinations in the event that two different registrations of labels
> containing instances of these characters (pre-composed U+08A1 and the
> U+0628/U+0654 sequence) were permitted. Making U+08A1 PVALID while also
> allowing the earlier two-code-point sequence creates exactly this ambiguity.
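
As a rough illustration of the comparison problem described in the paragraph above, the following Python sketch uses the standard `unicodedata` module to show that the precomposed code point and the two-code-point sequence remain distinct after normalization. It assumes a Python build whose Unicode database is version 7.0 or later. The helper names `labels_match` and `code_points` are made up for this example, and the comparison is plain NFC equality, not a full IDNA lookup.

```python
import unicodedata

PRECOMPOSED = "\u08A1"      # ARABIC LETTER BEH WITH HAMZA ABOVE
SEQUENCE = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def labels_match(a, b):
    """Illustrative helper: compare two labels after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def code_points(s):
    """Render a string as a list of U+XXXX code point names for display."""
    return ["U+%04X" % ord(ch) for ch in s]


if __name__ == "__main__":
    for form in ("NFC", "NFD"):
        print(form,
              code_points(unicodedata.normalize(form, PRECOMPOSED)),
              code_points(unicodedata.normalize(form, SEQUENCE)))
    # Neither form maps one spelling onto the other, so the labels differ.
    print("labels match:", labels_match(PRECOMPOSED, SEQUENCE))
```

Because neither spelling is rewritten by normalization, two visually similar labels built from them would be treated as different names unless some other layer intervenes.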
>
> This is not a trivial problem. There are similar problems even in the
> purely Latin character cases where numbers and letters can look the same,
> depending on fonts, but be distinct for purposes of label comparisons. The
> fact that such problematic forms exist should not be an argument for
> introducing more of them.
>
> I think this discussion boils down to a principle of not introducing
> additional ambiguity where there had been none before. It is also fair to
> say that this is not a question of excluding the character itself from
> UNICODE for which ample argument has been made for its inclusion, but a
> question of allowing or disallowing its use in domain name labels.
>
> vint
>
>
>
> On Tue, Aug 5, 2014 at 7:07 PM, Mark Davis ☕️ <mark at macchiato.com> wrote:
>
>>
>>  I hadn't heard back from John, but I'm guessing that the right place
>> to discuss this is here, based on Marc's email.
>>
>>
>> ---------- Forwarded message ----------
>> From: Mark Davis ☕️ <mark at macchiato.com>
>> Date: Wed, Jul 30, 2014 at 7:38 AM
>> Subject: Re: Unicode 7.0.0, (combining) Hamza Above, and normalization
>> for comparison
>> To: John C Klensin <john+w3c at jck.com>, Patrik Fältström <paf at frobbit.se>
>> Cc: member-i18n-core at w3.org, Asmus Freytag <asmusf at ix.netcom.com>
>>
>>
>> On what email address is this being discussed?  I'd like to convey to
>> that list some comments from an internal discussion about
>> draft-klensin-idna-5892upd-unicode70-00.txt.
>>
>> (These are not my words, but I agree with them. I edited slightly for
>> flow. I will add that from a confusability standpoint, the proposed draft
>> accomplishes nothing, since there are thousands of cases of confusable
>> characters; restricting just this one character has no useful effect; like
>> removing a quart of water from a lake.)
>>
>> For U+08A1, this certainly is a *letter* of Fula (Fulfulde, Pula, ...),
>> a large language spoken across swaths of
>> West Africa. Fula is mostly written with the Latin script,
>> but Islamists also write it in Ajami (Arabic extensions for African
>> languages), particularly in Guinea.
>> See:
>>
>> http://en.wikipedia.org/wiki/Fula_orthographies
>>
>> The *letter* in question is the one used to write the phoneme /ɓ/,
>> the bilabial implosive. See:
>>
>> http://en.wikipedia.org/wiki/%C6%81
>>
>> for the African alphabet convention for the Latin writing of this letter.
>>
>> For the Arabic Ajami alphabets for Fula, the form has been missing.
>> For whatever reason, in at least one Fulfulde Ajami orthography,
>> this implosive was (reasonably) represented by using a Hamza
>> diacritic on the beh letter. Following the way such *diacritic* (ijam)
>> letter derivations are encoded in the Unicode Standard, a separate,
>> non-decomposed entry was required. Note that this use of Hamza
>> is *different* from the Arabic (language) use of a combining Hamza
>> to indicate a glottal stop, often in combination with a letter that
>> is actually pronounced as a vowel.
>>
>> As to *why* it was encoded as a single, undecomposed letter,
>> that is explained at length in the proposal document, as well
>> as in the section on Hamza in the Unicode Standard, which you
>> have referred to in the Internet Draft you mention.
>>
>> The newly encoded character U+08A1 for Unicode 7.0 has
>> *already* been added to the relevant table "Arabic Letters
>> With Hamza Above" in the draft core specification for
>> Unicode 7.0, where, like the long-encoded U+0681
>> and U+076C, it is noted as having no decomposition.
>> (The core specification will be posted around October -- it
>> is still undergoing its final editing for all the 7.0 additions.)
>>
>> U+08A1 does not have a canonical decomposition in Unicode 7.0
>> (nor, of course, will it *ever* have a canonical decomposition,
>> because of normalization stability). This is exactly the same treatment
>> that U+0681 and U+076C got, and for exactly the same reasons.
>> (And, as you know, of course, those characters date back to
>> Unicode 4.1 for U+076C and even earlier, Unicode 1.1 for U+0681.)
>>
>> Note that it is incorrect to assert that U+08A1 ARABIC LETTER BEH
>> WITH HAMZA ABOVE "should" be normalized to U+0628 ARABIC LETTER
>> BEH + U+0654 ARABIC HAMZA ABOVE. Those are distinct sequences,
>> and they are never going to compare equal in their NFC normalizations.
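
The decomposition claims above can be checked against the Unicode Character Database. The sketch below is only an illustration: it uses Python's standard `unicodedata.decomposition`, assumes a Unicode database of version 7.0 or later, and the helper name `canonical_decomposition` is hypothetical. For contrast it also lists three long-standing Arabic letters with Hamza Above that do have canonical decompositions.

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the canonical decomposition mapping as hex code points, or ''.

    Compatibility mappings carry a tag such as '<compat>' and are ignored,
    since only canonical mappings matter for NFC/NFD equivalence.
    """
    mapping = unicodedata.decomposition(ch)
    if mapping.startswith("<"):
        return ""
    return mapping


# Letters the message describes as having no decomposition.
NO_DECOMPOSITION = ["\u0681", "\u076C", "\u08A1"]
# Letters with Hamza Above that do decompose to base letter + U+0654.
WITH_DECOMPOSITION = ["\u0623", "\u0624", "\u0626"]

if __name__ == "__main__":
    for ch in NO_DECOMPOSITION + WITH_DECOMPOSITION:
        name = unicodedata.name(ch, "<unnamed in this Unicode version>")
        print("U+%04X %-45s %r" % (ord(ch), name, canonical_decomposition(ch)))
```

An empty mapping means normalization leaves the character untouched, which is why the precomposed letter and the base-plus-Hamza sequence never become equal.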
>>
>> I am concerned that the Internet Draft here is heading in
>> exactly the wrong direction. If it ends up changing RFC 5892
>> to override the derivation for U+08A1 and force it to INVALID,
>> all I can see that accomplishing is to guarantee forever that
>> correctly spelled Ajami Fulfulde cannot be used in domain
>> names, and that instead people would end up having to use
>> misspellings to represent their implosive b in domain names.
>>
>> With all due respect to the Arabic script experts that have been
>> consulted, I rather doubt that they are experts on Ajami orthographies
>> in West Africa, or are in touch with the people who would be
>> supporting those languages and implementing keyboarding
>> and such for West Africa.
>>
>> Also, I don't see any way you can justify the abrupt (and permanent)
>> discontinuity that this would put in place between the treatment
>> of U+08A1 for Fulfulde and U+076C for Ormuri or U+0681 for Pashto.
>>
>>
>> If you are looking for a more analogous precedent I suggest, for example:
>>
>> U+2C65 LATIN SMALL LETTER A WITH STROKE
>>
>> That was added in Unicode 5.0, and nobody has ever had any problem
>> with it being PVALID in IDNA. It only has limited use in a minor
>> orthography, but what is the harm?
>>
>> Now, if you examine U+2C65, you could well claim that it *should*
>> be decomposed to "a" plus the combining stroke overlay, U+0338.
>> And both of those have been encoded for a long, long time in
>> the standard, so in principle, somebody *could* have been representing
>> their data for a letter a with stroke before Unicode 5.0 using the
>> sequence with the stroke overlay. It might even look o.k. in text,
>> depending on the font support for the combina̸tion. But the Unicode
>> Standard has rules now for the
>> encoding of certain combinations of base letters and diacritic modifiers
>> that overlay or modify the base character form. So U+2C65 was
>> separately encoded. And there is no normalization of the sequence
>> involved. That stroked letter use is, in text, distinct from somebody,
>> say, using a bunch of overlay strokes as a strikethrough convention for
>> some reason: a̸a̸a̸a̸a̸a̸
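
The Latin precedent can be checked the same way. This is a minimal sketch, assuming Python's standard `unicodedata` module; the function names are invented for the example. It tests all four normalization forms and uses "é" as a contrasting case where the precomposed letter and the combining sequence are canonically equivalent.

```python
import unicodedata

A_WITH_STROKE = "\u2C65"     # LATIN SMALL LETTER A WITH STROKE
A_PLUS_OVERLAY = "a\u0338"   # 'a' + COMBINING LONG SOLIDUS OVERLAY

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def equal_under(form, a, b):
    """True if the two strings are identical after the given normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def distinct_in_all_forms(a, b):
    """True if no normalization form makes the two strings compare equal."""
    return not any(equal_under(form, a, b) for form in FORMS)


if __name__ == "__main__":
    print("a with stroke vs. a + overlay distinct:",
          distinct_in_all_forms(A_WITH_STROKE, A_PLUS_OVERLAY))
    # Contrast: e-acute and e + combining acute are canonically equivalent.
    print("e-acute vs. e + acute equal under NFC:",
          equal_under("NFC", "\u00E9", "e\u0301"))
```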
>>
>> Consider the Hamza diacritic as falling in this same class of edge cases,
>> if you will.
>>
>> And in this case, I don't think it will be doing anybody any favors to
>> update RFC 5892 to make U+08A1 DISALLOWED in IDNA. It doesn't
>> "fix" normalization for it. All it accomplishes is to force any Fulfulde
>> user of Ajami orthography to misspell their text in order to use a /ɓ/
>> in a domain name. It would just create an unexplained (and unfixable)
>> discontinuity between what the domain registrations would accept
>> and what the Fulfulde input and spelling tools would support. Or I
>> guess it would just force people to give up the Arabic spellings and
>> go back to the more widely supported Latin alphabets for Fula to
>> get their domain names.
>>
>> What would be accomplished by
>> forcing another point incompatibility that just ends up getting
>> carried around forever?
>>
>> ====
>> 
>> There are four levels at which confusables, including homoglyphs,
>> can be addressed for domain names:
>>
>> 1. Encoding
>> 2. Protocol (IDNA)
>> 3. Label Generation Ruleset
>> 4. String Review
>>
>> A more natural level [for addressing confusables] would be the
>> Label Generation Ruleset level. For an LGR, there are three ways to deal
>> with homoglyphs, one of which is not available on the protocol level. The
>> first two of these are to rule out a code point (by not including it in the
>> LGR's repertoire), or to rule out a code point or sequence conditionally.
>> Unlike using these methods on the Protocol level, doing so on the LGR level
>> means that it is possible to be more restrictive, say, for the root of the
>> DNS than for domains several levels down the tree. The downside of using
>> the LGR is, of course, that it is specific to the given zone on the
>> internet.
>>
>> The upside is that an LGR has additional mechanisms, such as defining a
>> "blocked" variant. That creates an "either/or" situation, where both are
>> permitted, but not at the same time in the same position of an otherwise
>> identical label. This is a very nice solution for a number of
>> confusables/homoglyphs that are systemic (not dependent on accidents of
>> rendering or "arms length" similarity).
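
The repertoire and "blocked" variant mechanisms described above can be sketched in a few lines of Python. This is a deliberately simplified model under stated assumptions, not an actual LGR format or any registry's implementation: the `Zone` class, its method names, the variant table, and the status strings are all hypothetical. Each zone has its own repertoire, and two labels that differ only by a listed variant cannot both be allocated.

```python
class Zone:
    """Toy model of one zone's label generation rules (hypothetical)."""

    def __init__(self, repertoire, variants):
        self.repertoire = set(repertoire)   # code points this zone permits
        self.variants = dict(variants)      # sequence -> representative form
        self.registered = {}                # variant key -> allocated label

    def variant_key(self, label):
        """Map every listed variant to its representative form."""
        key = label
        for source, target in self.variants.items():
            key = key.replace(source, target)
        return key

    def register(self, label):
        """Return 'invalid', 'taken', 'blocked', or 'allocated'."""
        if any(ch not in self.repertoire for ch in label):
            return "invalid"                # outside this zone's repertoire
        key = self.variant_key(label)
        if key in self.registered:
            # The same label is simply taken; a variant of it is blocked.
            return "taken" if self.registered[key] == label else "blocked"
        self.registered[key] = label
        return "allocated"


# Assumed variant table: the precomposed letter and the sequence are variants.
VARIANTS = {"\u08A1": "\u0628\u0654"}

if __name__ == "__main__":
    permissive = Zone("\u0627\u0628\u0654\u08A1", VARIANTS)
    print(permissive.register("\u08A1\u0627"))        # first spelling
    print(permissive.register("\u0628\u0654\u0627"))  # variant of the first
    strict = Zone("\u0627\u0628", VARIANTS)           # smaller repertoire
    print(strict.register("\u08A1\u0627"))
```

The two zones illustrate the point made in the text: a stricter zone can exclude the code point outright, while a more permissive one allows either spelling but never both for the same label.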
>>
>> Unlike the final level, String Review, an LGR has the advantage of being
>> applied mechanically without any case-by-case review, which is why it's
>> appropriate for cases like the one that gave rise to this discussion.
>>
>> In principle, both the Label Generation Ruleset and the String Review are
>> created/carried out by people/entities that have access to the necessary
>> and specific linguistic and script expertise, unlike IDNA which seems to be
>> created largely by protocol experts.
>>
>>
>> On Tue, Jul 22, 2014 at 12:22 AM, John C Klensin <john+w3c at jck.com>
>> wrote:
>>
>>> Hi.   I was asked to forward the announcement of this Internet
>>> Draft to this group once it was posted.  See attached.
>>>
>>> For information -- comments welcome, but the core issue may be
>>> rather specific to concerns that surround IDNs and IDNA.  Or not.
>>>
>>> Of course, if I/we are still completely confused, corrections
>>> and explanations would be welcome.
>>>
>>>     john
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: internet-drafts at ietf.org
>>> To: i-d-announce at ietf.org
>>> Cc:
>>> Date: Mon, 21 Jul 2014 04:03:58 -0700
>>> Subject: I-D Action: draft-klensin-idna-5892upd-unicode70-00.txt
>>>
>>> A New Internet-Draft is available from the on-line Internet-Drafts
>>> directories.
>>>
>>>
>>>         Title           : IDNA Update for Unicode 7.0.0
>>>         Authors         : John C Klensin
>>>                           Patrik Faltstrom
>>>         Filename        : draft-klensin-idna-5892upd-unicode70-00.txt
>>>         Pages           : 10
>>>         Date            : 2014-07-21
>>>
>>> Abstract:
>>>    The current version of the IDNA specifications anticipated that each
>>>    new version of Unicode would be reviewed to verify that no changes
>>>    had been introduced that required adjustments to the set of rules
>>>    and, in particular, whether new exceptions or backward compatibility
>>>    adjustments were needed.  That review was conducted for Unicode 7.0.0
>>>    and identified a problematic new code point.  This specification
>>>    updates RFC 5892 to disallow that code point and provides information
>>>    about the reasons why that exclusion is appropriate.  It also applies
>>>    an editorial clarification that was the subject of an earlier
>>>    erratum.
>>>
>>>
>>> The IETF datatracker status page for this draft is:
>>> https://datatracker.ietf.org/doc/draft-klensin-idna-5892upd-unicode70/
>>>
>>> There's also a htmlized version available at:
>>> http://tools.ietf.org/html/draft-klensin-idna-5892upd-unicode70-00
>>>
>>>
>>> Please note that it may take a couple of minutes from the time of
>>> submission until the htmlized version and diff are available at
>>> tools.ietf.org.
>>>
>>> Internet-Drafts are also available by anonymous FTP at:
>>> ftp://ftp.ietf.org/internet-drafts/
>>>
>>> _______________________________________________
>>> I-D-Announce mailing list
>>> I-D-Announce at ietf.org
>>> https://www.ietf.org/mailman/listinfo/i-d-announce
>>> Internet-Draft directories: http://www.ietf.org/shadow.html
>>> or ftp://ftp.ietf.org/ietf/1shadow-sites.txt
>>>
>>>
>>
>