Browser IDN display policy: opinions sought

Eric Brunner-Williams ebw at abenaki.wabanaki.net
Thu Dec 22 17:05:25 CET 2011


I've been thinking about John's recent remark that "the potential
for confusion with a 37 character repertoire is far less than the
potential with a repertoire of circa 50K characters or more". I write
to point out a context, outline a mechanism, and offer a rationale.

For the sets of points in the string space at Damerau–Levenshtein
distance 1 from the strings "microsoft", "twitter", "facebook",
"google", and "apple", in the .COM namespace (Verisign registry
operator, ICANN policy authority, registration policy implemented by
ICANN accredited registrars), the non-intermediated non-NXDOMAIN
resolution ratios are 61%, 74%, 81%, 83%, and 86%,
respectively*,**.
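
To make "the set of points at distance 1" concrete, here is a minimal
sketch, not the study's method. It assumes the 37 character LDH
repertoire (a-z, 0-9, hyphen) and the usual four single edits
(deletion, adjacent transposition, substitution, insertion). The name
dl1_neighbors and the filter on leading or trailing hyphens are my own
choices for illustration.

```python
import string

# Assumed repertoire: 26 letters + 10 digits + hyphen = 37 characters (LDH).
LDH = string.ascii_lowercase + string.digits + "-"


def dl1_neighbors(label, alphabet=LDH):
    """Return the strings at Damerau-Levenshtein distance exactly 1 from label."""
    splits = [(label[:i], label[i:]) for i in range(len(label) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    out = deletes | transposes | substitutes | inserts
    out.discard(label)  # distance 0 is not a neighbor
    # Illustrative filter: drop empty labels and labels that begin or
    # end with a hyphen, which are not valid LDH labels.
    return {s for s in out
            if s and not s.startswith("-") and not s.endswith("-")}


if __name__ == "__main__":
    neighbors = dl1_neighbors("apple")
    print(len(neighbors), sorted(neighbors)[:5])
```

Appending ".com" to each element gives the candidate names whose
resolution the ratios above describe.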

1. Given the order in which scripts were introduced in both the MdR
and Beijing roots, that is, LDH first, any use of non-Latin homoglyphs
to extend the repertoire over which a resolvable string at
Damerau–Levenshtein distance 1 from an existing string in a namespace
may be found is likely to be preceded by a non-zero number of
resolvable strings at Damerau–Levenshtein distance 1 within LDH.

Restated, typosquatting in LDH precedes typosquatting in LDH+.

2. Densities of non-intermediated non-NXDOMAIN resolution at
Damerau–Levenshtein distance 1 from this, or any equivalent, sample
set of strings, in namespaces which differ in registry policy, e.g.,
.cat, .tel, .museum, or which differ in policy authority, e.g., .sa,
may be determined by resolving applications, at run time, as well as
earlier.

Restated, a browser vendor may determine that the root problem is in
namespace policy authority, not a property of glyphs, and test any
resolution for Damerau–Levenshtein metric 1 neighbors, and act upon
that resolution-time or earlier data.

3. With likelihood 0.3 or greater, unmediated non-NXDOMAIN resolutions
within Damerau–Levenshtein distance 1 of these five strings contain
discoverable links to a monetizing agent -- Google's DoubleClick
business unit. High correlations exist for the associated NS records,
and the addresses mapped by A records, for strings within this
distance 1 set.

I suggest that the context missing is, unfortunately, the policy
authority and its implementers.

A mechanism any author of a resolution-requesting application may
exercise, at any point in time, including "run time", and hence not
necessarily limited to the static representations of a dynamic set of
resolution contexts, roots included, is to check the density of
apparently unmediated non-NXDOMAIN returns at Damerau–Levenshtein
distance 1, about some suite of strings, and/or around the string
whose resolution is attempted. High density is distinguishable from
low density, and the application logic may branch upon some
empirically observed property of a stringspace within a namespace.
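
A minimal sketch of such a check, under stated assumptions: the
candidate names are supplied by the caller, "resolves" means the
system resolver returned at least one address (socket.getaddrinfo did
not fail), and the 0.5 threshold is an arbitrary placeholder rather
than an empirically derived value. The function names are
hypothetical. Note that a plain lookup cannot tell a mediated answer
(for example, an access provider rewriting NXDOMAIN) from an
unmediated one; that would need a separate test.

```python
import socket


def resolves_via_system(name):
    """True if the system resolver returns any address for name."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except (socket.gaierror, UnicodeError):
        # gaierror covers NXDOMAIN and other lookup failures;
        # UnicodeError covers names the IDNA codec rejects.
        return False


def resolution_density(candidates, resolves=resolves_via_system):
    """Fraction of candidate names for which resolves(name) is true."""
    names = list(candidates)
    if not names:
        return 0.0
    hits = sum(1 for name in names if resolves(name))
    return hits / len(names)


def is_dense(candidates, resolves=resolves_via_system, threshold=0.5):
    """Branch point: True when the observed density reaches the threshold."""
    return resolution_density(candidates, resolves) >= threshold
```

The resolver is passed in as a parameter so that an application can
substitute cached or previously collected data for live lookups, which
corresponds to the "or earlier" case above.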

A rationale is that the application author, unless specifically
motivated, say an "evil browser" (and we've seen plenty of browser
hacks to fake IDN TLDs over the past decade, which while not "evil",
are not "good" either), is unlikely to share in the recurring revenue
obtained by any registry operator and its database access providers
(registrars), the homomorph-exploiting registrant, and the policy
authority which materially benefits from all of these registrations.

In my modest opinion, the observations made by operators of the .sa
registry are not well answered by appeals to script repertoires that
do not address pre-existing conditions, and their root causes in
policy and economics allowed by policy, in the LDH set of code points.

It's solstice, so I'm going to go hang apples on the trees for the
deer, and corn for the squirrels. Enjoy the day, tomorrow will be
longer, and the night, for no night will be longer.

Eric

(*) The authors of the typosquatting study do not appear to have
limited their Damerau–Levenshtein distance 1 sets to the AZERTY or
QWERTY keyboard-adjacent set, the "i" "l" "1" or "o" "0" sets, or
similar subsets of the 101-key, 37-character constraints. If their
data did not contain these limits, then for those subsets, the ratios
may be significantly higher, though not in excess of 100%.
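
As an illustration of what such a restricted subset could look like,
here is a sketch that keeps only substitutions by a horizontally
adjacent QWERTY key or by a look-alike character. The adjacency model
(same row, immediate left or right neighbor only) and the function
names are simplifying assumptions of mine, not anything taken from the
study.

```python
# Assumed, simplified QWERTY layout: adjacency is left/right within a row.
QWERTY_ROWS = ["1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm"]
# Look-alike groups named in the footnote above.
LOOKALIKES = [set("il1"), set("o0")]


def likely_substitutes(ch):
    """Characters adjacent to ch on a QWERTY row, or visually similar to it."""
    out = set()
    for row in QWERTY_ROWS:
        i = row.find(ch)
        if i >= 0:
            out.update(row[max(i - 1, 0):i + 2])
    for group in LOOKALIKES:
        if ch in group:
            out |= group
    out.discard(ch)
    return out


def restricted_neighbors(label):
    """Single-substitution neighbors limited to likely_substitutes()."""
    return {label[:i] + c + label[i + 1:]
            for i, ch in enumerate(label)
            for c in likely_substitutes(ch)}
```

For "google" this yields strings such as "goog1e" and "g0ogle" but not
arbitrary substitutions such as "gozgle".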

(**) The authors of the typosquatting study found 1,502 of a total
possible 2,249 points in the .COM string space defined by six strings
(the sixth being the author's name "saphos") resolved. If the
resolutions were unmediated, these registrations created an additional
$294 revenue for ICANN, and approximately $9,000 additional revenue
for Verisign, and possibly a similar additional revenue amount for a
number of registrars.
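
For reference, the arithmetic implied by the figures in (**) can be
checked as below. This only divides the numbers quoted above; the
per-registration amounts are what those numbers imply, not a statement
of any actual fee schedule.

```python
# Figures quoted in footnote (**).
resolved = 1502
possible = 2249
icann_revenue_usd = 294
verisign_revenue_usd = 9000  # stated as approximate

overall_ratio = resolved / possible
icann_per_name = icann_revenue_usd / resolved
verisign_per_name = verisign_revenue_usd / resolved

print(f"overall resolution ratio: {overall_ratio:.1%}")
print(f"implied ICANN revenue per name: ${icann_per_name:.3f}")
print(f"implied Verisign revenue per name: ${verisign_per_name:.2f}")
```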


