Confusability (Re: New version,
John C Klensin
klensin at jck.com
Fri Jun 22 23:38:03 CEST 2007
--On Monday, 18 June, 2007 16:30 -0700 Kenneth Whistler
<kenw at sybase.com> wrote:
> And please note that as the UTC has examined this issue
> in more detail, the number of required contexts has been
> pared down to a mere handful currently -- all of which
> can be described in terms of constrained regular expressions.
> In particular, ZWNJ seems only required in Persian (not
> other languages using the Arabic script) in some specific
> contexts, and then for the Malayalam and Khmer scripts,
> also in very specific contexts.
But we don't have a good way to determine at lookup time that
Persian, rather than some other Arabic-script language, is in
use, do we? It seems to me that the appearance
of ZWNJ somewhere in the DNS tree -- for example in a label that
otherwise consists of Cyrillic or Han characters -- could cause
such serious problems that either a lookup-time rule is needed
or we need to go back and review _how much_ Persian, Malayalam,
and Khmer need that character.
At lookup time, one can certainly detect Arabic (or Malayalam or
Khmer) script and then reject a label if ZWNJ is mixed with
another script. If, when ZWNJ appears in the contexts of those
scripts, it always causes presentation changes, then I would
imagine that the residual difficulties can be dealt with by
registry restrictions -- by registries who presumably understand
the issues because they are doing registrations in Arabic.
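To make the lookup-time check concrete, here is a rough sketch in
Python. It is only an illustration under stated assumptions, not a
proposed algorithm: the names (ZWNJ_SCRIPT_RANGES, script_of,
zwnj_acceptable) are hypothetical, and the code-point ranges are
approximate block ranges standing in for the real Unicode Script
property, which a serious implementation would consult instead.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Assumption: approximate block ranges used as a stand-in for the
# Unicode Script property of each character.
ZWNJ_SCRIPT_RANGES = {
    "Arabic": [(0x0600, 0x06FF), (0x0750, 0x077F)],
    "Malayalam": [(0x0D00, 0x0D7F)],
    "Khmer": [(0x1780, 0x17FF)],
}


def script_of(ch):
    """Return the name of the ZWNJ-relevant script for ch, or None."""
    cp = ord(ch)
    for name, ranges in ZWNJ_SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    return None


def zwnj_acceptable(label):
    """Accept a label with ZWNJ only if every other character
    belongs to one single script among Arabic, Malayalam, Khmer."""
    if ZWNJ not in label:
        return True  # nothing for this particular rule to check
    scripts = {script_of(ch) for ch in label if ch != ZWNJ}
    # Exactly one script, and no character outside the three scripts.
    return len(scripts) == 1 and None not in scripts
```

With this sketch, a Cyrillic or Han label containing ZWNJ is rejected,
as is a label mixing, say, Arabic and Malayalam around a ZWNJ. It is
deliberately stricter than a real rule would be (digits and hyphens
alongside ZWNJ are also rejected), and it says nothing about whether
the ZWNJ sits in a context where it affects presentation; that part
would still fall to the registries.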
> ZWJ seems only required in the Sinhala script.
> That is not to say that ZWNJ and ZWJ aren't much more widely
> used in the Arabic script and in many Indian scripts for
> presentational purposes -- but the few instances above
> are the only ones we currently know about where important
> semantic distinctions require the presence of a ZWNJ or a ZWJ
> to be "spelled" correctly, from the point of view of an end
> user. The UTC is engaged in a dialogue with the Government of
> India now to determine if there are other specialized contexts
> of this sort involving ZWNJ or ZWJ for any of the scripts of
This is great and I (at least) look forward to hearing the
results of your dialogue. I do want to caution about
"semantic distinctions": unless the relevant parties are
willing to restrict domain labels to strings that could be
orthographically correct in the language(s) --whether or not
they actually form words-- moving too far in the direction of
semantic distinctions may eliminate some strings that would be
perfectly reasonable mnemonics.
Even then, we need to be quite careful about assumptions that
depend on language: across the DNS, registration restrictions
are a rather weak tool, even though they are likely to be very
important at the first few levels, and, as discussed in earlier
notes, language information is not available at lookup time.
> for details.