Unicode 7.0.0, (combining) Hamza Above, and normalization for comparison

Wed Aug 6 15:22:39 CEST 2014

That is not the argument, Mark - it has to do with different encodings of
what looks like the same thing - that can be exploited by phishing, for
example.

v

On Wed, Aug 6, 2014 at 9:20 AM, Mark Davis ☕️ <mark at macchiato.com> wrote:

> The problem with excluding this one particular character is that it
> introduces a discrepancy with no real value. It's a bit like addressing
> global warming by requiring that Harley Davidsons with licence plate
> numbers divisible by 3 cannot use the US interstate. It might make some
> people feel like they are doing something good, but just complicates the
> rules with no measurable benefit.
>
> The principle that is being espoused is something like:
>
> P1: Any characters that are visually confusable with others should be
> excluded from domain names.
>
> But why apply that principle to this (rather rare) character, that is only
> used in a particular language, when it is not applied to thousands of other
> characters, and characters that are vastly more commonly used? And if P1 is
> not the principle that is being applied, what is?
>
>
>
> Mark <https://google.com/+MarkDavis>
>
>  *— Il meglio è l’inimico del bene —*
>
>
> On Wed, Aug 6, 2014 at 5:02 AM, Vint Cerf <vint at google.com> wrote:
>
>> Mark,
>>
>> I think it is important to distinguish the use of a character for
>> purposes other than domain names and for use in domain names. The long
>> preface in your email makes the case that the character is useful and
>> legitimate in the language but I think it has long been understood that
>> domain names have properties and requirements for unambiguous comparison
>> that might lead to the conclusion that, despite the character's undeniable
>> utility in written language, it may still not be appropriate for use in
>> domain names.
>>
>> In consequence of that line of reasoning, I would separate the legitimate
>> presence of the character in UNICODE from its use in domain names.
>>
>> The next question, of course, is whether there is any harm to users that
>> can be anticipated by the use of the character in domain names. I think it
>> is clear that whatever conclusion is reached, it should apply to all levels
>> of domain name and therefore to all labels. Adopting a rule concerning a
>> character that is applied only at TLD or SLD level but cannot be enforced
>> at lower levels of a DNS hierarchy seems like a mistake, so if there is a
>> problem with a character, that problem should be solved for all use in
>> labels.
>>
>> I want to emphasize that, so far in this text, I have not taken a
>> position regarding the use of U+08A1 in domain names. I am only discussing
>> the process of deciding whether a character should be rendered invalid for
>> use in domain name labels. The argument that the character is useful for
>> properly formed written language is not necessarily an argument for its
>> permitted use in domain names, if harms are identified in the latter usage
>> that are considered unacceptable.
>>
>> It is also clear from past experience that reasonable people can differ
>> in their assessment of the degree of risk or harm that a particular
>> character poses. So, now we get to the central question whether U+08A1
>> produces sufficient risk to be banned from use in domain names.
>>
>> In Klensin's draft RFC, the argument is made that the previous way in
>> which BEH WITH HAMZA ABOVE was accommodated, including for use in domain
>> names, was a sequence U+0628 followed by U+0654. The incorporation of a
>> new, combined character U+08A1 creates ambiguity because there would be two
>> ways for this character to be used in a domain name, but the two do not
>> compare equal. For users, this creates the risk that two labels could be
>> registered that would look the same but presumably take one to distinct
>> destinations in the event that two different registrations of labels
>> containing instances of these characters (pre-composed U+08A1 and the
>> U+0628/U+0654 sequence) were permitted. Making U+08A1 PVALID while also
>> allowing the earlier combining sequence creates exactly this ambiguity.
>>
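The comparison problem described above can be illustrated with a minimal
Python sketch. It assumes a Python build whose unicodedata tables are at
Unicode 7.0 or later (so that U+08A1 is assigned); the label pair and the
helper name equal_under are hypothetical and used only for illustration.

```python
import unicodedata

PRECOMPOSED = "\u08A1"      # ARABIC LETTER BEH WITH HAMZA ABOVE
SEQUENCE = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def equal_under(form, a, b):
    """Return True if a and b are identical after normalizing both to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Illustrative label pair; the trailing letter (U+0627 ALEF) is arbitrary.
label_a = PRECOMPOSED + "\u0627"
label_b = SEQUENCE + "\u0627"

# Compare the two spellings under each of the four normalization forms.
results = {form: equal_under(form, label_a, label_b)
           for form in ("NFC", "NFD", "NFKC", "NFKD")}

for form, same in results.items():
    print(form, "equal" if same else "distinct")
```

Under these assumptions no normalization form unifies the two spellings, so
a comparison based on normalized code points treats them as two different
labels.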
>> This is not a trivial problem. There are similar problems even in the
>> purely Latin character cases where numbers and letters can look the same,
>> depending on fonts, but be distinct for purposes of label comparisons. The
>> fact that such problematic forms exist should not be an argument for
>> introducing more of them.
>>
>> I think this discussion boils down to a principle of not introducing
>> additional ambiguity where there had been none before. It is also fair to
>> say that this is not a question of excluding the character itself from
>> UNICODE for which ample argument has been made for its inclusion, but a
>> question of allowing or disallowing its use in domain name labels.
>>
>> vint
>>
>>
>>
>> On Tue, Aug 5, 2014 at 7:07 PM, Mark Davis ☕️ <mark at macchiato.com> wrote:
>>
>>>
>>>  I hadn't heard back from John, but I'm guessing that the right place
>>> to discuss this is here, based on Marc's email.
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Mark Davis ☕️ <mark at macchiato.com>
>>> Date: Wed, Jul 30, 2014 at 7:38 AM
>>> Subject: Re: Unicode 7.0.0, (combining) Hamza Above, and normalization
>>> for comparison
>>> To: John C Klensin <john+w3c at jck.com>, Patrik Fältström <paf at frobbit.se>
>>> Cc: member-i18n-core at w3.org, Asmus Freytag <asmusf at ix.netcom.com>
>>>
>>>
>>> On what email address is this being discussed?  I'd like to convey to
>>> that list some comments from an internal discussion about
>>> draft-klensin-idna-5892upd-unicode70-00.txt.
>>>
>>> (These are not my wording, but I agree with them. I edited slightly for
>>> flow. I will add that from a confusability standpoint, the proposed draft
>>> accomplishes nothing, since there are thousands of cases of confusable
>>> characters; restricting just this one character has no useful effect; like
>>> removing a quart of water from a lake.)
>>>
>>> For U+08A1, this certainly is a *letter* of Fula (Fulfulde, Pula, ...),
>>> a large language spoken across swaths of
>>> West Africa. Fula is mostly written with the Latin script,
>>> but Islamists also write it in Ajami (Arabic extensions for African
>>> languages), particularly in Guinea.
>>> See:
>>>
>>> http://en.wikipedia.org/wiki/Fula_orthographies
>>>
>>> The *letter* in question is the one used to write the phoneme /ɓ/,
>>> the bilabial implosive. See:
>>>
>>> http://en.wikipedia.org/wiki/%C6%81
>>>
>>> for the African alphabet convention for the Latin writing of this letter.
>>>
>>> For the Arabic Ajami alphabets for Fula, the form has been missing.
>>> For whatever reason, in at least one Fulfulde Ajami orthography,
>>> this implosive was (reasonably) represented by using a Hamza
>>> diacritic on the beh letter. Following the way such *diacritic* (ijam)
>>> letter derivations are encoded in the Unicode Standard, a separate,
>>> non-decomposed entry was required. Note that this use of Hamza
>>> is *different* from the Arabic (language) use of a combining Hamza
>>> to indicate a glottal stop, often in combination with a letter that
>>> is actually pronounced as a vowel.
>>>
>>> As to *why* it was encoded as a single, undecomposed letter,
>>> that is explained at length in the proposal document, as well
>>> as in the section on Hamza in the Unicode Standard, which you
>>> have referred to in the Internet Draft you mention.
>>>
>>> The newly encoded character U+08A1 for Unicode 7.0 has
>>> *already* been added to the relevant table "Arabic Letters
>>> With Hamza Above" in the draft core specification for
>>> Unicode 7.0, where, like the long-encoded U+0681
>>> and U+076C, it is noted as having no decomposition.
>>> (The core specification will be posted around October -- it
>>> is still undergoing its final editing for all the 7.0 additions.)
>>>
>>> U+08A1 does not have a canonical decomposition in Unicode 7.0
>>> (nor, of course, will it *ever* have a canonical decomposition,
>>> because of normalization stability). This is exactly the same treatment
>>> that U+0681 and U+076C got, and for exactly the same reasons.
>>> (And, as you know, of course, those characters date back to
>>> Unicode 4.1 for U+076C and even earlier, Unicode 1.1 for U+0681.)
>>>
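The "no decomposition" statement can be checked against the Unicode
Character Database with a short Python sketch. It assumes the interpreter's
unicodedata module is at Unicode 7.0 or later; the helper name
canonical_decomposition is hypothetical. U+0623 ARABIC LETTER ALEF WITH
HAMZA ABOVE is included only as a contrasting character that does have a
canonical decomposition.

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the UCD decomposition mapping for ch, or '' if it has none."""
    return unicodedata.decomposition(ch)


# BEH, HAH and REH with hamza above: the three letters discussed above.
no_mapping = ["\u08A1", "\u0681", "\u076C"]

for ch in no_mapping:
    mapping = canonical_decomposition(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mapping!r}")

# Contrast: ALEF WITH HAMZA ABOVE decomposes to ALEF + HAMZA ABOVE.
alef_mapping = canonical_decomposition("\u0623")
print(f"U+0623 {unicodedata.name(chr(0x0623))}: {alef_mapping!r}")
```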
>>> Note that it is incorrect to assert that U+08A1 ARABIC LETTER BEH
>>> WITH HAMZA ABOVE "should" be normalized to U+0628 ARABIC LETTER
>>> BEH + U+0654 ARABIC HAMZA ABOVE. Those are distinct sequences,
>>> and they are never going to compare equal in their NFC normalizations.
>>>
>>> I am concerned that the Internet Draft here is heading in
>>> exactly the wrong direction. If it ends up changing RFC 5892
>>> to override the derivation for U+08A1 and force it to INVALID,
>>> all I can see that accomplishing is to guarantee forever that
>>> correctly spelled Ajami Fulfulde cannot be used in domain
>>> names, and that instead people would end up having to use
>>> misspellings to represent their implosive b in a domain name.
>>>
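To make "override the derivation" concrete, here is a heavily simplified
Python sketch of how an exceptions table takes precedence over a rule-based
derivation. It is not the RFC 5892 algorithm: only the Exceptions,
Unassigned, Unstable and LetterDigits steps are modelled, and the function
names and the proposed_override table are hypothetical. It also assumes
unicodedata tables at Unicode 7.0 or later.

```python
import unicodedata

# General categories treated as letters/digits in this simplified model.
LETTER_DIGITS = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def is_unstable(ch):
    """True if ch changes under NFKC(casefold(NFKC(ch)))."""
    folded = unicodedata.normalize("NFKC", ch).casefold()
    return unicodedata.normalize("NFKC", folded) != ch


def derive(ch, exceptions=None):
    """Return a simplified derived property value for one code point."""
    exceptions = exceptions or {}
    if ch in exceptions:                      # exceptions win over all rules
        return exceptions[ch]
    category = unicodedata.category(ch)
    if category == "Cn":
        return "UNASSIGNED"
    if is_unstable(ch):
        return "DISALLOWED"
    if category in LETTER_DIGITS:
        return "PVALID"
    return "DISALLOWED"


# Hypothetical exceptions entry modelling the change under discussion.
proposed_override = {"\u08A1": "DISALLOWED"}

for cp in ("\u08A1", "\u076C", "\u0681"):
    print(f"U+{ord(cp):04X}", derive(cp), "->", derive(cp, proposed_override))
```

In this model the override changes the result for U+08A1 only, while the
two older letters with hamza above keep the value produced by the rules.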
>>> With all due respect to the Arabic script experts that have been
>>> consulted, I rather doubt that they are experts on Ajami orthographies
>>> in West Africa, or are in touch with the people who would be
>>> supporting those languages and implementing keyboarding
>>> and such for West Africa.
>>>
>>> Also, I don't see any way you can justify the abrupt (and permanent)
>>> discontinuity that this would put in place between the treatment
>>> of U+08A1 for Fulfulde and U+076C for Ormuri or U+0681 for Pashto.
>>>
>>>
>>> If you are looking for a more analogous precedent I suggest, for
>>> example:
>>>
>>> U+2C65 LATIN SMALL LETTER A WITH STROKE
>>>
>>> That was added in Unicode 5.0, and nobody has ever had any problem
>>> with it being PVALID in IDNA. It only has limited use in a minor
>>> orthography, but what is the harm?
>>>
>>> Now, if you examine U+2C65, you could well claim that it *should*
>>> be decomposed to "a" plus the combining stroke overlay, U+0338.
>>> And both of those have been encoded for a long, long time in
>>> the standard, so in principle, somebody *could* have been representing
>>> their data for a letter a with stroke before Unicode 5.0 using the
>>> sequence with the stroke overlay. It might even look o.k. in text,
>>> depending on the font support
>>> for the combina̸tion. But the Unicode Standard has rules now for the
>>> encoding of certain combinations of base letters and diacritic modifiers
>>> that overlay or modify the base character form. So U+2C65 was
>>> separately encoded. And there is no normalization of the sequence
>>> involved. That stroked letter use is, in text, distinct from somebody,
>>> say, using a bunch of overlay strokes as a strikethrough convention for
>>> some reason: a̸a̸a̸a̸a̸a̸
>>>
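The Latin precedent can be sketched the same way in Python. The variable
names are hypothetical. Note that the Unicode Character Database name of
U+0338 is COMBINING LONG SOLIDUS OVERLAY; the "=" example is added only to
show that U+0338 does take part in canonical composition for some base
characters, just not for "a".

```python
import unicodedata

STROKED_A = "\u2C65"    # LATIN SMALL LETTER A WITH STROKE
OVERLAY_A = "a\u0338"   # "a" + COMBINING LONG SOLIDUS OVERLAY

# NFC leaves the overlay sequence alone: there is nothing to compose it into.
nfc_overlay = unicodedata.normalize("NFC", OVERLAY_A)

# NFD leaves the atomic letter alone: it has no canonical decomposition.
nfd_stroked = unicodedata.normalize("NFD", STROKED_A)

# Contrast: "=" + U+0338 does compose, to U+2260 NOT EQUAL TO.
nfc_not_equal = unicodedata.normalize("NFC", "=\u0338")

print("overlay sequence unchanged by NFC:", nfc_overlay == OVERLAY_A)
print("stroked letter unchanged by NFD:", nfd_stroked == STROKED_A)
print("'=' + overlay composes:", nfc_not_equal == "\u2260")
```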
>>> Consider the Hamza diacritic as falling in this same class of edge cases,
>>> if you will.
>>>
>>> And in this case, I don't think it will be doing anybody any favors to
>>> update RFC 5892 to make U+08A1 DISALLOWED in IDNA. It doesn't
>>> "fix" normalization for it. All it accomplishes is to force any Fulfulde
>>> user of Ajami orthography to misspell their text in order to use a /ɓ/
>>> in a domain name. It would just create an unexplained (and unfixable)
>>> discontinuity between what the domain registrations would accept
>>> and what the Fulfulde input and spelling tools would support. Or I
>>> guess it would just force people to give up the Arabic spellings and
>>> go back to the more widely supported Latin alphabets for Fula to
>>> get their domain names.
>>>
>>> What would be accomplished by
>>> forcing another point incompatibility that just ends up getting
>>> carried around forever?
>>>
>>> ====
>>> 
>>> There are four levels at which confusables, including homoglyphs, can be
>>> addressed for domain names:
>>>
>>> 1. Encoding
>>> 2. Protocol (IDNA)
>>> 3. Label Generation Ruleset
>>> 4. String Review
>>>
>>> A more natural level [for addressing confusables] would be the
>>> Label Generation Ruleset level. For an LGR, there are three ways to deal
>>> with homoglyphs, one of which is not available on the protocol level. The
>>> first two of these are to rule out a code point (by not including it in the
>>> LGR's repertoire), or to rule out a code point or sequence conditionally.
>>> Unlike using these methods on the Protocol level, doing so on the LGR level
>>> means that it is possible to be more restrictive, say, for the root of the
>>> DNS than for domains several levels down the tree. The downside of using
>>> the LGR is, of course, that it is specific to the given zone on the
>>> internet.
>>>
>>> The upside is that an LGR has additional mechanisms, such as defining a
>>> "blocked" variant. That creates an "either/or" situation, where both are
>>> permitted, but not at the same time in the same position of an otherwise
>>> identical label. This is a very nice solution for a number of
>>> confusables/homoglyphs that are systemic (not dependent on accidents of
>>> rendering or "arms length" similarity).
>>>
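The "blocked" variant idea can be sketched as a toy registry in Python.
This is not an actual Label Generation Ruleset format or any registry's
implementation; the class Zone, its methods, and the single variant mapping
are hypothetical and chosen only to mirror the U+08A1 case.

```python
class Zone:
    """Toy registry for one zone with 'blocked variant' handling."""

    # Assumed variant rule: treat U+08A1 and U+0628 U+0654 as the same slot.
    VARIANTS = {"\u08A1": "\u0628\u0654"}

    def __init__(self):
        self._holder_by_key = {}

    def variant_key(self, label):
        """Map a label to a key shared by all of its variant spellings."""
        for variant, canonical in self.VARIANTS.items():
            label = label.replace(variant, canonical)
        return label

    def register(self, label):
        """Register label unless it, or a variant of it, is already taken."""
        key = self.variant_key(label)
        if key in self._holder_by_key:
            return False          # blocked: the slot already has a holder
        self._holder_by_key[key] = label
        return True


zone = Zone()
first = zone.register("\u08A1\u0627")         # precomposed spelling
second = zone.register("\u0628\u0654\u0627")  # sequence spelling, same slot
other = zone.register("\u0628\u0627")         # unrelated label
print(first, second, other)
```

Either spelling is acceptable on its own, but whichever is registered first
blocks the other within that zone, which is the "either/or" behaviour
described above.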
>>> Unlike the final level, String Review, an LGR has the advantage of being
>>> applied mechanically without any case-by-case review, which is why it's
>>> appropriate for cases like the one that gave rise to this discussion.
>>>
>>> In principle, both the Label Generation Ruleset and the String Review are
>>> created/carried out by people/entities that have access to the necessary
>>> and specific linguistic and script expertise, unlike IDNA which seems to be
>>> created largely by protocol experts.
>>>
>>>
>>> On Tue, Jul 22, 2014 at 12:22 AM, John C Klensin <john+w3c at jck.com>
>>> wrote:
>>>
>>>> Hi.   I was asked to forward the announcement of this Internet
>>>> Draft to this group once it was posted.  See attached.
>>>>
>>>> For information -- comments welcome, but the core issue may be
>>>> rather specific to concerns that surround IDNs and IDNA.  Or not.
>>>>
>>>> Of course, if I/we are still completely confused, corrections
>>>> and explanations would be welcome.
>>>>
>>>>     john
>>>>
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: internet-drafts at ietf.org
>>>> To: i-d-announce at ietf.org
>>>> Cc:
>>>> Date: Mon, 21 Jul 2014 04:03:58 -0700
>>>> Subject: I-D Action: draft-klensin-idna-5892upd-unicode70-00.txt
>>>>
>>>> A New Internet-Draft is available from the on-line Internet-Drafts
>>>> directories.
>>>>
>>>>
>>>>         Title           : IDNA Update for Unicode 7.0.0
>>>>         Authors         : John C Klensin
>>>>                           Patrik Faltstrom
>>>>         Filename        : draft-klensin-idna-5892upd-unicode70-00.txt
>>>>         Pages           : 10
>>>>         Date            : 2014-07-21
>>>>
>>>> Abstract:
>>>>    The current version of the IDNA specifications anticipated that each
>>>>    new version of Unicode would be reviewed to verify that no changes
>>>>    had been introduced that required adjustments to the set of rules
>>>>    and, in particular, whether new exceptions or backward compatibility
>>>>    adjustments were needed.  That review was conducted for Unicode 7.0.0
>>>>    and identified a problematic new code point.  This specification
>>>>    updates RFC 5892 to disallow that code point and provides information
>>>>    about the reasons why that exclusion is appropriate.  It also applies
>>>>    an editorial clarification that was the subject of an earlier
>>>>    erratum.
>>>>
>>>>
>>>> The IETF datatracker status page for this draft is:
>>>> https://datatracker.ietf.org/doc/draft-klensin-idna-5892upd-unicode70/
>>>>
>>>> There's also a htmlized version available at:
>>>> http://tools.ietf.org/html/draft-klensin-idna-5892upd-unicode70-00
>>>>
>>>>
>>>> Please note that it may take a couple of minutes from the time of
>>>> submission
>>>> until the htmlized version and diff are available at tools.ietf.org.
>>>>
>>>> Internet-Drafts are also available by anonymous FTP at:
>>>> ftp://ftp.ietf.org/internet-drafts/
>>>>
>>>> _______________________________________________
>>>> I-D-Announce mailing list
>>>> I-D-Announce at ietf.org
>>>> https://www.ietf.org/mailman/listinfo/i-d-announce
>>>> Internet-Draft directories: http://www.ietf.org/shadow.html
>>>> or ftp://ftp.ietf.org/ietf/1shadow-sites.txt
>>>>
>>>>
>>>
>>>
>>> _______________________________________________
>>> Idna-update mailing list
>>> Idna-update at alvestrand.no
>>> http://www.alvestrand.no/mailman/listinfo/idna-update
>>>
>>>
>>
>