Unicode 7.0.0, (combining) Hamza Above, and normalization for comparison

Mark Davis ☕️ mark at macchiato.com
Wed Aug 6 16:26:07 CEST 2014


> that is not the argument mark

I'm not sure why it isn't the argument. When I say *X is confusable with Y*,
I mean "X looks like Y but is encoded differently". That appears to be
exactly the same as your "different encodings of what looks like the same
thing".

So I'll try translating P1 into your terms:

P1: Any characters that are visually confusable with others should be
excluded from domain names.
=>
P1: Any characters that have a different encoding but look the same as
others should be excluded from domain names.
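
For the case in this thread, "looks the same but is encoded differently"
can be checked mechanically. The following is a minimal sketch, assuming
Python's standard unicodedata module; the helper names nfc_equal and
has_canonical_decomposition are made up for illustration. It compares the
precomposed U+08A1 with the sequence U+0628 U+0654 after NFC, and contrasts
that with U+0623, which does have a canonical decomposition.

```python
import unicodedata

PRECOMPOSED = "\u08A1"      # ARABIC LETTER BEH WITH HAMZA ABOVE
SEQUENCE = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def nfc_equal(a: str, b: str) -> bool:
    """Return True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def has_canonical_decomposition(ch: str) -> bool:
    """Return True if ch has a canonical (untagged) decomposition mapping."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings carry a tag such as "<compat>"; canonical ones do not.
    return bool(mapping) and not mapping.startswith("<")


if __name__ == "__main__":
    # The two encodings of beh with hamza above stay distinct under NFC.
    print(nfc_equal(PRECOMPOSED, SEQUENCE))
    print(has_canonical_decomposition(PRECOMPOSED))
    # Contrast: U+0623 ALEF WITH HAMZA ABOVE decomposes to U+0627 U+0654.
    print(nfc_equal("\u0623", "\u0627\u0654"))
```

Normalization therefore does nothing to unify the two forms, so any
protection against registering both has to come from some other layer.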


Mark <https://google.com/+MarkDavis>

 *— Il meglio è l’inimico del bene —*


On Wed, Aug 6, 2014 at 6:22 AM, Vint Cerf <vint at google.com> wrote:

> that is not the argument mark - it has to do with different encodings of
> what looks like the same thing - that can be exploited by phishing, for
> example.
>
> v
>
>
>
> On Wed, Aug 6, 2014 at 9:20 AM, Mark Davis ☕️ <mark at macchiato.com> wrote:
>
>> The problem with excluding this one particular character is that it
>> introduces a discrepancy with no real value. It's a bit like addressing
>> global warming by requiring that Harley Davidsons with licence plate
>> numbers divisible by 3 cannot use the US interstate. It might make some
>> people feel like they are doing something good, but just complicates the
>> rules with no measurable benefit.
>>
>> The principle that is being espoused is something like:
>>
>> P1: Any characters that are visually confusable with others should be
>> excluded from domain names.
>>
>> But why apply that principle to this (rather rare) character, that is
>> only used in a particular language, when it is not applied to thousands of
>> other characters, and characters that are vastly more commonly used? And if
>> P1 is not the principle that is being applied, what is?
>>
>>
>>
>> Mark <https://google.com/+MarkDavis>
>>
>>  *— Il meglio è l’inimico del bene —*
>>
>>
>> On Wed, Aug 6, 2014 at 5:02 AM, Vint Cerf <vint at google.com> wrote:
>>
>>> Mark,
>>>
>>> I think it is important to distinguish the use of a character for
>>> purposes other than domain names and for use in domain names. The long
>>> preface in your email makes the case that the character is useful and
>>> legitimate in the language but I think it has long been understood that
>>> domain names have properties and requirements for unambiguous comparison
>>> that might lead to the conclusion that, despite the character's undeniable
>>> utility in written language, it may still not be appropriate for use in
>>> domain names.
>>>
>>> In consequence of that line of reasoning, I would separate the
>>> legitimate presence of the character in UNICODE from its use in domain
>>> names.
>>>
>>> The next question, of course, is whether there is any harm to users that
>>> can be anticipated by the use of the character in domain names. I think it
>>> is clear that whatever conclusion is reached, it should apply to all levels
>>> of domain name and therefore to all labels. Adopting a rule concerning a
>>> character that is applied only at TLD or SLD level but cannot be enforced
>>> at lower levels of a DNS hierarchy seems like a mistake, so if there is a
>>> problem with a character, that problem should be solved for all use in
>>> labels.
>>>
>>> I want to emphasize that, so far in this text, I have not taken a
>>> position regarding the use of U+08A1 in domain names. I am only discussing
>>> the process of deciding whether a character should be rendered invalid for
>>> use in domain name labels. The argument that the character is useful for
>>> properly formed written language is not necessarily an argument for its
>>> permitted use in domain names, if harms are identified in the latter usage
>>> that are considered unacceptable.
>>>
>>> It is also clear from past experience that reasonable people can differ
>>> in their assessment of the degree of risk or harm that a particular
>>> character poses. So, now we get to the central question whether U+08A1
>>> produces sufficient risk to be banned from use in domain names.
>>>
>>> In Klensin's draft RFC, the argument is made that the previous way in
>>> which BEH WITH HAMZA ABOVE was accommodated, including for use in domain
>>> names, was a sequence U+0628 followed by U+0654. The incorporation of a
>>> new, combined character U+08A1 creates ambiguity because there would be two
>>> ways for this character to be used in a domain name, but the two do not
>>> compare equal. For users, this creates the risk that two labels could be
>>> registered that would look the same but presumably take one to distinct
>>> destinations in the event that two different registrations of labels
>>> containing instances of these characters (pre-composed U+08A1 and the
>>> U+0628/U+0654 sequence) were permitted. Making U+08A1 PVALID while also
>>> allowing the earlier composed form creates exactly this ambiguity.
>>>
>>> This is not a trivial problem. There are similar problems even in the
>>> purely Latin character cases where numbers and letters can look the same,
>>> depending on fonts, but be distinct for purposes of label comparisons. The
>>> fact that such problematic forms exist should not be an argument for
>>> introducing more of them.
>>>
>>> I think this discussion boils down to a principle of not introducing
>>> additional ambiguity where there had been none before. It is also fair to
>>> say that this is not a question of excluding the character itself from
>>> UNICODE for which ample argument has been made for its inclusion, but a
>>> question of allowing or disallowing its use in domain name labels.
>>>
>>> vint
>>>
>>>
>>>
>>> On Tue, Aug 5, 2014 at 7:07 PM, Mark Davis ☕️ <mark at macchiato.com>
>>> wrote:
>>>
>>>>
>>>> I hadn't heard back from John, but I'm guessing that the right place
>>>> to discuss this is here, based on Marc's email.
>>>>
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Mark Davis ☕️ <mark at macchiato.com>
>>>> Date: Wed, Jul 30, 2014 at 7:38 AM
>>>> Subject: Re: Unicode 7.0.0, (combining) Hamza Above, and normalization
>>>> for comparison
>>>> To: John C Klensin <john+w3c at jck.com>, Patrik Fältström <paf at frobbit.se>
>>>> Cc: member-i18n-core at w3.org, Asmus Freytag <asmusf at ix.netcom.com>
>>>>
>>>>
>>>> On what email address is this being discussed?  I'd like to convey to
>>>> that list some comments from an internal discussion about
>>>> draft-klensin-idna-5892upd-unicode70-00.txt.
>>>>
>>>> (This is not my wording, but I agree with it. I edited slightly for
>>>> flow. I will add that, from a confusability standpoint, the proposed
>>>> draft accomplishes nothing: there are thousands of cases of confusable
>>>> characters, so restricting just this one character has no useful effect.
>>>> It is like removing a quart of water from a lake.)
>>>>
>>>> For U+08A1, this certainly is a *letter* of Fula (Fulfulde, Pula, ...),
>>>> a large language spoken across swaths of
>>>> West Africa. Fula is mostly written with the Latin script,
>>>> but Islamists also write it in Ajami (Arabic extensions for African
>>>> languages), particularly in Guinea.
>>>> See:
>>>>
>>>> http://en.wikipedia.org/wiki/Fula_orthographies
>>>>
>>>> The *letter* in question is the one used to write the phoneme /ɓ/,
>>>> the bilabial implosive. See:
>>>>
>>>> http://en.wikipedia.org/wiki/%C6%81
>>>>
>>>> for the African alphabet convention for the Latin writing of this
>>>> letter.
>>>>
>>>> For the Arabic Ajami alphabets for Fula, the form has been missing.
>>>> For whatever reason, in at least one Fulfulde Ajami orthography,
>>>> this implosive was (reasonably) represented by using a Hamza
>>>> diacritic on the beh letter. Following the way such *diacritic* (ijam)
>>>> letter derivations are encoded in the Unicode Standard, a separate,
>>>> non-decomposed entry was required. Note that this use of Hamza
>>>> is *different* from the Arabic (language) use of a combining Hamza
>>>> to indicate a glottal stop, often in combination with a letter that
>>>> is actually pronounced as a vowel.
>>>>
>>>> As to *why* it was encoded as a single, undecomposed letter,
>>>> that is explained at length in the proposal document, as well
>>>> as in the section on Hamza in the Unicode Standard, which you
>>>> have referred to in the Internet Draft you mention.
>>>>
>>>> The newly encoded character U+08A1 for Unicode 7.0 has
>>>> *already* been added to the relevant table "Arabic Letters
>>>> With Hamza Above" in the draft core specification for
>>>> Unicode 7.0, where, like the long-encoded U+0681
>>>> and U+076C, it is noted as having no decomposition.
>>>> (The core specification will be posted around October -- it
>>>> is still undergoing its final editing for all the 7.0 additions.)
>>>>
>>>> U+08A1 does not have a canonical decomposition in Unicode 7.0
>>>> (nor, of course, will it *ever* have a canonical decomposition,
>>>> because of normalization stability). This is exactly the same treatment
>>>> that U+0681 and U+076C got, and for exactly the same reasons.
>>>> (And, as you know, of course, those characters date back to
>>>> Unicode 4.1 for U+076C and even earlier, Unicode 1.1 for U+0681.)
>>>>
>>>> Note that it is incorrect to assert that U+08A1 ARABIC LETTER BEH
>>>> WITH HAMZA ABOVE "should" be normalized to U+0628 ARABIC LETTER
>>>> BEH + U+0654 ARABIC HAMZA ABOVE. Those are distinct sequences,
>>>> and they are never going to compare equal in their NFC normalizations.
>>>>
>>>> I am concerned that the Internet Draft here is heading in
>>>> exactly the wrong direction. If it ends up changing RFC 5892
>>>> to override the derivation for U+08A1 and force it to INVALID,
>>>> all I can see that accomplishing is to guarantee forever that
>>>> correctly spelled Ajami Fulfulde cannot be used in domain
>>>> names, and that instead people would end up having to use
>>>> misspellings to represent their implosive b in domain names.
>>>>
>>>> With all due respect to the Arabic script experts that have been
>>>> consulted, I rather doubt that they are experts on Ajami orthographies
>>>> in West Africa, or are in touch with the people who would be
>>>> supporting those languages and implementing keyboarding
>>>> and such for West Africa.
>>>>
>>>> Also, I don't see any way you can justify the abrupt (and permanent)
>>>> discontinuity that this would put in place between the treatment
>>>> of U+08A1 for Fulfulde and U+076C for Ormuri or U+0681 for Pashto.
>>>>
>>>>
>>>> If you are looking for a more analogous precedent I suggest, for
>>>> example:
>>>>
>>>> U+2C65 LATIN SMALL LETTER A WITH STROKE
>>>>
>>>> That was added in Unicode 5.0, and nobody has ever had any problem
>>>> with it being PVALID in IDNA. It only has limited use in a minor
>>>> orthography, but what is the harm?
>>>>
>>>> Now, if you examine U+2C65, you could well claim that it *should*
>>>> be decomposed to "a" plus the combining stroke overlay, U+0338.
>>>> And both of those have been encoded for a long, long time in
>>>> the standard, so in principle, somebody *could* have been representing
>>>> their data for a letter a with stroke before Unicode 5.0 using the
>>>> sequence with the stroke overlay. It might even look o.k. in text,
>>>> depending on the font support
>>>> for the combina̸tion. But the Unicode Standard has rules now for the
>>>> encoding of certain combinations of base letters and diacritic modifiers
>>>> that overlay or modify the base character form. So U+2C65 was
>>>> separately encoded. And there is no normalization of the sequence
>>>> involved. That stroked letter use is, in text, distinct from somebody,
>>>> say, using a bunch of overlay strokes as a strikethrough convention
>>>> for some reason: a̸a̸a̸a̸a̸a̸
>>>>
>>>> Consider the Hamza diacritic as falling in this same class of edge
>>>> cases, if you will.
>>>>
>>>> And in this case, I don't think it will be doing anybody any favors to
>>>> update RFC 5892 to make U+08A1 DISALLOWED in IDNA. It doesn't
>>>> "fix" normalization for it. All it accomplishes is to force any Fulfulde
>>>> user of Ajami orthography to misspell their text in order to use a /ɓ/
>>>> in a domain name. It would just create an unexplained (and unfixable)
>>>> discontinuity between what the domain registrations would accept
>>>> and what the Fulfulde input and spelling tools would support. Or I
>>>> guess it would just force people to give up the Arabic spellings and
>>>> go back to the more widely supported Latin alphabets for Fula to
>>>> get their domain names.
>>>>
>>>> What would be accomplished by
>>>> forcing another point incompatibility that just ends up getting
>>>> carried around forever?
>>>>
>>>> ====
>>>> There are four levels at which confusables, including homoglyphs,
>>>> can be addressed for domain names:
>>>>
>>>> 1. Encoding
>>>> 2. Protocol (IDNA)
>>>> 3. Label Generation Ruleset
>>>> 4. String Review
>>>>
>>>> A more natural level [for addressing confusables] would be the
>>>> Label Generation Ruleset level. For an LGR, there are three ways to deal
>>>> with homoglyphs, one of which is not available on the protocol level. The
>>>> first two of these are to rule out a code point (by not including it in the
>>>> LGR's repertoire), or to rule out a code point or sequence conditionally.
>>>> Unlike using these methods on the Protocol level, doing so on the LGR level
>>>> means that it is possible to be more restrictive, say, for the root of the
>>>> DNS than for domains several levels down the tree. The downside of using
>>>> the LGR is, of course, that it is specific to the given zone on the
>>>> internet.
>>>>
>>>> The upside is that an LGR has additional mechanisms, such as defining a
>>>> "blocked" variant. That creates an "either/or" situation, where both are
>>>> permitted, but not at the same time in the same position of an otherwise
>>>> identical label. This is a very nice solution for a number of
>>>> confusables/homoglyphs that are systemic (not dependent on accidents of
>>>> rendering or "arms length" similarity).
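
The "blocked variant" mechanism described above can be sketched roughly as
follows. This is an illustration only, not an actual LGR format or any
registry's implementation: the names VARIANT_SETS, variant_key and Zone are
hypothetical, and the single variant set shown (U+08A1 versus U+0628 U+0654)
is an assumption made for the example.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical variant set: two encodings that render alike.
VARIANT_SETS: List[Set[str]] = [
    {"\u08A1", "\u0628\u0654"},
]


def variant_key(label: str, variant_sets: Iterable[Set[str]] = VARIANT_SETS) -> str:
    """Map a label to a key that is shared by all of its variants."""
    key = label
    for members in variant_sets:
        representative = min(members)
        # Replace longer members first so multi-code-point sequences stay intact.
        for member in sorted(members, key=len, reverse=True):
            key = key.replace(member, representative)
    return key


class Zone:
    """Toy zone that allows either variant of a label, but never both."""

    def __init__(self) -> None:
        self._registered: Dict[str, str] = {}  # variant key -> registered label

    def register(self, label: str) -> bool:
        key = variant_key(label)
        if key in self._registered:
            return False  # the label itself, or a blocked variant, already exists
        self._registered[key] = label
        return True


if __name__ == "__main__":
    zone = Zone()
    print(zone.register("\u08A1\u0627"))        # first registration is accepted
    print(zone.register("\u0628\u0654\u0627"))  # variant of the first is blocked
```

Either spelling can be registered first; whichever arrives second is
refused. That is the "either/or" behaviour, and it is decided per zone
rather than fixed in the protocol.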
>>>>
>>>> Unlike the final level, String Review, an LGR has the advantage of
>>>> being applied mechanically without any case-by-case review, which is why
>>>> it's appropriate for cases like the one that gave rise to this discussion.
>>>>
>>>> In principle, both the Label Generation Ruleset and the String Review
>>>> are created/carried out by people/entities that have access to the
>>>> necessary and specific linguistic and script expertise, unlike IDNA which
>>>> seems to be created largely by protocol experts.
>>>>
>>>>
>>>> On Tue, Jul 22, 2014 at 12:22 AM, John C Klensin <john+w3c at jck.com>
>>>> wrote:
>>>>
>>>>> Hi.   I was asked to forward the announcement of this Internet
>>>>> Draft to this group once it was posted.  See attached.
>>>>>
>>>>> For information -- comments welcome, but the core issue may be
>>>>> rather specific to concerns that surround IDNs and IDNA.  Or not.
>>>>>
>>>>> Of course, if I/we are still completely confused, corrections
>>>>> and explanations would be welcome.
>>>>>
>>>>>     john
>>>>>
>>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: internet-drafts at ietf.org
>>>>> To: i-d-announce at ietf.org
>>>>> Cc:
>>>>> Date: Mon, 21 Jul 2014 04:03:58 -0700
>>>>> Subject: I-D Action: draft-klensin-idna-5892upd-unicode70-00.txt
>>>>>
>>>>> A New Internet-Draft is available from the on-line Internet-Drafts
>>>>> directories.
>>>>>
>>>>>
>>>>>         Title           : IDNA Update for Unicode 7.0.0
>>>>>         Authors         : John C Klensin
>>>>>                           Patrik Faltstrom
>>>>>         Filename        : draft-klensin-idna-5892upd-unicode70-00.txt
>>>>>         Pages           : 10
>>>>>         Date            : 2014-07-21
>>>>>
>>>>> Abstract:
>>>>>    The current version of the IDNA specifications anticipated that each
>>>>>    new version of Unicode would be reviewed to verify that no changes
>>>>>    had been introduced that required adjustments to the set of rules
>>>>>    and, in particular, whether new exceptions or backward compatibility
>>>>>    adjustments were needed.  That review was conducted for Unicode 7.0.0
>>>>>    and identified a problematic new code point.  This specification
>>>>>    updates RFC 5892 to disallow that code point and provides information
>>>>>    about the reasons why that exclusion is appropriate.  It also applies
>>>>>    an editorial clarification that was the subject of an earlier
>>>>>    erratum.
>>>>>
>>>>>
>>>>> The IETF datatracker status page for this draft is:
>>>>> https://datatracker.ietf.org/doc/draft-klensin-idna-5892upd-unicode70/
>>>>>
>>>>> There's also a htmlized version available at:
>>>>> http://tools.ietf.org/html/draft-klensin-idna-5892upd-unicode70-00
>>>>>
>>>>>
>>>>> Please note that it may take a couple of minutes from the time of
>>>>> submission until the htmlized version and diff are available at
>>>>> tools.ietf.org.
>>>>>
>>>>> Internet-Drafts are also available by anonymous FTP at:
>>>>> ftp://ftp.ietf.org/internet-drafts/
>>>>>
>>>>> _______________________________________________
>>>>> I-D-Announce mailing list
>>>>> I-D-Announce at ietf.org
>>>>> https://www.ietf.org/mailman/listinfo/i-d-announce
>>>>> Internet-Draft directories: http://www.ietf.org/shadow.html
>>>>> or ftp://ftp.ietf.org/ietf/1shadow-sites.txt
>>>>>
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Idna-update mailing list
>>>> Idna-update at alvestrand.no
>>>> http://www.alvestrand.no/mailman/listinfo/idna-update
>>>>
>>>>
>>>
>>
>