Visually confusable characters (7)

Sun Aug 10 21:16:48 CEST 2014

On 8/9/2014 10:48 AM, John C Klensin wrote:

John,

I thought it best to reply to your points individually, as some
branches of the discussion are probably not going to be as deep.

As you wrote them "in no particular order", I'm going to respond
to them in the same way.

This message responds to point (7)

A./

> (7) This matching mess is so horrible that the only options are
> to accept it and live with it or to replace IDNA2008 with yet
> another version.
>
> Sorry, but no.  Selectively DISALLOWing selected code points,
> especially newly-added ones, was anticipated in the IDNA design.
> However horrible the idea may seem, the WG and broader community
> understood the issue and accepted the risks and possible costs
> (or thought it did).  Had we not done so, there would be no
> provision in IDNA for a review process and a mechanism for
> excluding newly-added characters.

John,

I have no objection in principle to the need for IDNA2008 to have the
ability to restrict the repertoire. Communities sometimes do adopt
orthographies that are inherently not compatible with forming the basis
of an identifier scheme.

The most egregious example is an orthography that is reported to
use @ as a letter. I don't see this going into Unicode, but if it ever did,
I'd fully support IDNA2008 in DISALLOWING it.

However, reserving the ability to review is one thing. Applying it in
isolated circumstances in an inconsistent manner is another.

Others have given examples of many cases that are structurally equivalent
to the character that you singled out, so I will not repeat them here.

The issue is not whether the matching is a mess, but whether the response
to that mess is well structured and coherent, and also whether it makes
a reasonable dent in the problem.

Leaving 95%+ of the existing cases of sequences identical in appearance to
non-decomposable singletons to the tender mercies of LGRs (in the generic
sense, see my response to (3)) or to other means of addressing confusables
(e.g. String Review or whatever its equivalent is in a given zone) does not
strike me as strengthening the case for taking this particular action out of
context now.

>   What is missing is a way to
> exclude particular combining sequences that turn out to be
> problematic, especially ones that become problematic as a result
> of additions to Unicode.   But, Andrew's comments about IDNA201x
> or IDNA202x notwithstanding, it appears to me that provisions
> for such exclusions would rather easily be added to the existing
> model and that it could be done with little or no disruption as
> long as one was very careful about cases that would appear to be
> retroactive.

Disallowed sequences appear to be just CONTEXTO expressions of the form
"x must not be followed immediately by y" -- so it would depend on whether
you can add additional expressions. You might want to have a shorthand way
of listing them, so you don't have to write a full rule for each.

This would help in the majority of cases where the sequence not only has
been declared by Unicode to designate something DIFFERENT from the
singleton character, but where the sequence is also not ordinarily used in
any orthography.
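
As a rough illustration of the "x must not be followed immediately by y"
form with a shorthand list of pairs, here is a minimal Python sketch. The
names DISALLOWED_SEQUENCES, find_disallowed_sequence and is_label_allowed
are hypothetical, the single pair listed is only the sequence discussed in
this thread, and none of this is IDNA2008's actual rule syntax.

```python
# Hypothetical shorthand table: each pair (x, y) stands for the rule
# "x must not be followed immediately by y". This is an illustration,
# not the rule format used by IDNA2008.
DISALLOWED_SEQUENCES = {
    ("\u0628", "\u0654"),  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE
}


def find_disallowed_sequence(label, pairs=DISALLOWED_SEQUENCES):
    """Return the index of the first disallowed pair in label, or -1."""
    for i in range(len(label) - 1):
        if (label[i], label[i + 1]) in pairs:
            return i
    return -1


def is_label_allowed(label, pairs=DISALLOWED_SEQUENCES):
    """True if no listed pair occurs as adjacent code points in label."""
    return find_disallowed_sequence(label, pairs) == -1
```

A table of pairs keeps each addition to one line, which is the point of
the shorthand: a new problematic sequence would not need a full rule of
its own.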

For example, Unicode has a policy of not decomposing overlays, but provides
combining overlays (for example for use in the negation of math operators).
If the combining overlays were DISALLOWED in general, or in particular
sequences, that would be retroactive, but, importantly, should not affect
labels that are intended to be based on actual spellings.
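
The effect of not decomposing overlays can be checked with Python's
standard unicodedata module. The sketch below uses U+00F8 (LATIN SMALL
LETTER O WITH STROKE) and the sequence "o" + U+0338 (COMBINING LONG
SOLIDUS OVERLAY) as one illustrative pair; the choice of pair and the
function name overlay_report are assumptions made for this example.

```python
import unicodedata


def overlay_report():
    """Compare an overlay singleton with a base + combining overlay sequence."""
    stroke_o = "\u00f8"      # LATIN SMALL LETTER O WITH STROKE
    overlay_seq = "o\u0338"  # o + COMBINING LONG SOLIDUS OVERLAY

    return {
        # An empty decomposition string means the singleton has no
        # canonical or compatibility decomposition.
        "singleton_decomposes": unicodedata.decomposition(stroke_o) != "",
        # Normalization does not map either form onto the other.
        "nfc_unifies": unicodedata.normalize("NFC", overlay_seq) == stroke_o,
        "nfd_unifies": unicodedata.normalize("NFD", stroke_o) == overlay_seq,
    }


if __name__ == "__main__":
    for key, value in overlay_report().items():
        print(key, value)
```

All three values come out False: normalization leaves the singleton and
the sequence as distinct strings, so any matching between them has to
come from somewhere other than NFC/NFD.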

I just had a conversation with an Arabic expert, and he assures me that
the sequence "beh + hamza" is in fact limited to Koranic contexts and can't
be typed on ordinary keyboard layouts.

He further mentioned in passing that the sequence should also be
indistinguishable from "yeh + hamza", presumably in some positional
contexts, which, if true, would further illustrate that while the intent is
laudable, the proposed update, by considering only a single code point,
apparently does little to improve on the "mess", as you call it.

The proposed update also contains the suggestion that applications might
use the sequence to achieve the effect of the disallowed singleton.
I am really troubled by that one. Recommending to use a sequence that is
not part of the orthography and can't be typed or searched by users for
whom it is intended compounds the "linguistic damage".  It is not a
benign or useful fallback and the recommendation should not have been
made.

A./