Visually confusable characters (3)

Asmus Freytag asmusf at
Sun Aug 10 21:15:01 CEST 2014

On 8/9/2014 10:48 AM, John C Klensin wrote:


I thought it best to reply to your points individually, as some
branches of the discussion are probably not going to be as deep.

As you wrote them "in no particular order", I'm going to respond
to them in the same way.

This message responds to point (3)

> (3) The LGR rules and process have something to do with this or
> can be applied to help with it.
> The process that led to the LGR developments, and the LGR
> process itself, was agreed-to by the ICANN Board and community
> based on some very restrictive conditions.  First, it applies
> only to TLDs to be proposed to ICANN in some future (and as yet
> undefined) application round, not even to IDN TLDs that are now
> approved or in process.  Second, some a priori conditions were
> applied to it that would not apply to lower-level registrations,
> particularly the prohibition on archaic scripts (and,
> presumably, characters and avoidance of IDNA CONTEXTJ and
> CONTEXTO characters.   So any applicability to registrations at
> the second level or below or to non-contracted parties would be
> only as guidelines.  Based on experience, those guidelines would
> be strongly resisted if ICANN tried to turn them into policies
> (and simply ignored (at best) by any ccTLD that concluded its
> interests lay elsewhere).


There is a basic misunderstanding here that we need to get past in order
to communicate effectively, so I put this at the outset.

When I speak of "label generation rulesets", I do so in a very generic
sense. The sense I employ is best understood as the combination of the
concept of an "idn table" plus certain validation/disposition evaluations
that can be automated based on a suitable machine-readable expression of
the relevant policy aspects.

The sense in which I use the term can be seen as, more or less, equivalent
to something that can be expressed in the proposed XML format.
That format is purposefully *not* restricted in the same way as the
*Root Zone* LGR project; instead, it is designed to express all registered
IDN tables (including those policy aspects that translate directly
into a machine-readable format).
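
For concreteness, a minimal fragment in that XML format (as specified in
RFC 7940) might look as follows. The repertoire and the variant pair shown
(U+08A1 versus the BEH + HAMZA ABOVE sequence) are purely illustrative,
not taken from any actual LGR:

```xml
<?xml version="1.0" encoding="utf-8"?>
<lgr xmlns="urn:ietf:params:xml:ns:lgr-1.0">
  <meta>
    <version>1</version>
  </meta>
  <data>
    <!-- U+08A1 and the sequence U+0628 U+0654 declared as
         mutually blocked variants -->
    <char cp="08A1">
      <var cp="0628 0654" type="blocked"/>
    </char>
    <char cp="0628 0654">
      <var cp="08A1" type="blocked"/>
    </char>
  </data>
</lgr>
```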

So, unless I say "root zone", I do not refer to an actual process or the
restrictions you mention, but to a broader technology.

> Incidentally, if a gTLD label was applied for containing a
> character that was exclusive to Fula-in-Arabic-script and that
> fell into the scope of the LGR rules, I'd expect it to be
> rejected unless the Arabic script panel had first recommended
> including that character and proven that it had the expertise to
> make the relevant judgments, including judgments about possible
> conflicts and requirements for variants.

The root zone process, so far, is on record that homographs (homoglyphs)
need to be addressed. Because, for the root, being restrictive on the
repertoire end is facilitated by the exclusion of historic/technical usage,
many homographs (homoglyphs) are already excluded from the Maximal Starting
Repertoire.

As for the expertise of the Arabic panel, this is no longer a theoretical
question, as the panel exists (and its composition is public). You may make
your own evaluation of this, but it strikes me that, as far as the Arabic
script is concerned, the level of expertise collected there exceeds the
collective expertise on that subject that was brought to bear on IDNA
itself.

>   If such a label was
> accepted anyway, I'd expect the sort of conflicts and appeals
> that would invoke the "start over" provisions of the VIP/IDN/LGR
> Process.  If that reasoning is correct, U+08A1 would be excluded
> from gTLDs as surely as if it were DISALLOWED by IDNA.  The
> _only_ issue with it in that regard is whether it should be
> allowed in places where it might otherwise co-occur with the BEH
> with HAMZA ABOVE combining sequence (of course, as Vint, Andrew,
> and others have pointed out, the real problem where isn't U+08A1
> except insofar as it a symptom of a problem that we thought
> normalization solved and didn't/ doesn't).

Label Generation Rulesets (in the sense that I understand this term) go
beyond the mechanisms available in IDNA2008.

IDNA2008 contains mechanisms to restrict the repertoire by defining the
permitted code points, and mechanisms to further restrict labels by
requiring or prohibiting code points in certain contexts.
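
As an illustration of the contextual mechanism, here is a rough sketch of
the CONTEXTJ rule for ZERO WIDTH JOINER from RFC 5892 (Appendix A.2): ZWJ
is permitted only directly after a virama (a code point with canonical
combining class 9). The function name is mine, and this is not a complete
IDNA2008 validator:

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER, a CONTEXTJ code point in IDNA2008

def zwj_context_ok(label: str) -> bool:
    """RFC 5892 Appendix A.2 (sketch): ZWJ is allowed only when the
    preceding code point is a virama (combining class 9)."""
    for i, ch in enumerate(label):
        if ch == ZWJ:
            if i == 0 or unicodedata.combining(label[i - 1]) != 9:
                return False
    return True

# DEVANAGARI KA + VIRAMA + ZWJ + SSA: ZWJ follows a virama, so it passes.
assert zwj_context_ok("\u0915\u094d\u200d\u0937") is True
# ZWJ directly after KA (no virama): rejected.
assert zwj_context_ok("\u0915\u200d\u0937") is False
```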

LGRs have additional mechanisms available.

The most important is the ability to create equivalence classes among
code points (and sequences), known as variant sets.

These variant sets can then be subject to machine-interpretable policies
resulting in dispositions for labels.

Of the possible dispositions, I am interested here only in the one that
leads to mutual exclusion of labels that differ only by a code point or
sequence from the same variant set (at the same location).

Like normalization, this mechanism, thus applied, leads to only a single
successful label. Lookup is unambiguous. A spoofed label can't be created
and, if used for lookup, will fail. Unlike normalization, it is not
possible to determine a priori which of a set of labels might exist, which
can be an issue when trying to type in a label from a printout.
(But that issue exists in general, because of other types of confusables,
not just homographs/homoglyphs in the strict sense.)

LGRs thus have the ability to address homoglyphs/homographs in
situations where it is not tenable to pick "favorites".

For the Root Zone process, the expectation is certainly that LGRs
submitted for the root will be reviewed and rejected if they do not
make use of that provision.

For other zones there is, of course, no way to enforce that; this does not
diminish the fact that the technical mechanism of applying mutual
exclusion over a variant set leads to a linguistically less harmful
mitigation of the homograph/homoglyph problem.

> Suggestions about "blocked variant mechanisms" and the like are
> instances of either the LGR problem and binding to TLDs or
> general issues about ICANN scope and authority.  There has never
> been a _DNS_ "variant" mechanism and, for reasons that have been
> discussed extensively elsewhere really cannot be without a
> complete and fairly fundamental DNS redesign.  DNSng is even
> harder to contemplate than IDNA202x; ideas that depend on it
> are, at best, irrelevant to any contemporary discussion.

The blocked variant mechanism is something that is applied during
registration; I don't see any way you would be able to inject it into the
operation (lookup) side of the DNS.

Like repertoire restriction and normalization, the blocked-variant
approach is intended to limit the namespace. (Unlike its cousin,
allocated variants, it does not affect operation.)

So, yes, I'm aware of the limitations, but I am also aware of how
a well-designed LGR (in the generic sense) can be used in any zone
to make lookup safer and, in particular, to address the homograph/
homoglyph problem robustly but flexibly.


More information about the Idna-update mailing list