New version of strawman for IDNAv2

Sat Feb 28 00:27:13 CET 2009

--On Friday, February 27, 2009 14:59 -0800 Mark Davis
<mark at macchiato.com> wrote:

> What I was trying to do is set expectations correctly. That is:
> 
>    - It is false to say that the contextual rules will disallow
>      joiners in all cases where they make no visual difference.
>    - It is true to say that the contextual rules can remove the
>      vast majority of cases where joiners make no visual difference.

Understood and appreciated.  If that didn't come through in my
note, I was writing too quickly.

For whatever it is worth, my (personal, not speaking for anyone
else) expectation about the contextual rules is that

	(i) they would block most of the risky cases, especially
	those involving scripts in which the joiners have no
	meaningful interpretation, and

	(ii) they would act as an extra level of alert to
	registries using scripts in which the joiners sometimes
	have a useful effect, flagging the cases where they do not.

Put differently, the rules would block the use of those
characters for the registries and scripts where it might be
reasonable for the registries to have no clue about how to
handle them, restricting the need for specific registry action
and restrictions to the scripts in which the joiners would
sometimes make sense.
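
To make that concrete, here is a rough sketch of what such a
contextual check could look like.  It assumes a deliberately
simplified rule (a joiner is acceptable only immediately after a
virama, i.e. a character with canonical combining class 9); the
function name is hypothetical and the real rules would need more
cases, notably for Arabic-script joining contexts.

```python
import unicodedata

ZWNJ = "\u200c"   # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"    # ZERO WIDTH JOINER
VIRAMA_CCC = 9    # canonical combining class assigned to viramas


def joiners_in_context(label: str) -> bool:
    """Return True if every ZWJ/ZWNJ in label directly follows a virama.

    Simplified assumption: this is the only context in which a joiner
    is accepted.  Labels with no joiners pass trivially.
    """
    for i, ch in enumerate(label):
        if ch in (ZWJ, ZWNJ):
            # A joiner at the start of the label has no preceding character.
            if i == 0:
                return False
            if unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True
```

A check like this rejects a joiner dropped into a Latin label
outright, but it says nothing about whether a joiner after a
virama actually changes the rendering, which is the part that
stays with the registry.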

I think that objective is consistent with your comments and
analysis above.  As I've said before, I think there are few
aspects of IDNs that do not require several areas of work, by
several groups, to get completely right and, as a corollary,
that only rarely will a single technique or decision point be
sufficient to repel an attack by a smart and determined attacker.

> As to the issue you cite, let me give a bit of background. The
> joiners were originally designed to be purely formatting
> controls, to indicate display preferences. More precisely,
> they are requests for rendering a particular way, ideally to
> be honored by the font/rendering system if possible. But a
> font/rendering system may not be able to honor the request,
> and would then just give the normal display.
> 
> The goal was indeed to encode characters where a *semantic*
> difference (rather than a display preference) was necessary.
> Thus these characters were meant to be optional. However,
> semantic vs display differences are to a certain degree a
> judgment call, and communities have developed with the
> expectation of behavior whereby the presence or absence of the
> joiner would be seen as the "wrong" spelling of the word.

This makes tremendous sense, better explains the advice to map
them to nothing in some circumstances, and, at least for me, is
very helpful.

> Now, in the case of Malayalam, the chilu characters were added
> because it was not possible to represent certain text with the
> current characters. And that is a good reason for adding new
> characters. In so doing, for the purpose of identifiers, it is
> the case that the joiners are not really necessary. And while
> the UTC has no stated opinion on this issue, it would
> certainly consider proposals for the addition of characters to
> represent semantic differences for Indic scripts that may
> currently be using joiner/non-joiner.

Again, very helpful (and something I will pass along) although,
as with some of the other recent discussions, I hope we don't
have to get blocked on Unicode changes just as I hope we don't do
anything that would block us on DNS changes and deployment (I
believe that Unicode changes can be made much more quickly than
we can implement and deploy DNS ones, but still...).

> Note -- and this is important -- that the addition of
> characters that have the same effect as "virama + non-joiner"
> or "virama + joiner" often do absolutely nothing to reduce
> cases of visual similarity: the new characters will look just
> like other characters in circumstances where the joiners had
> no effect! The advantage from the IDNA perspective, however,
> is that it makes testing easier.

Understood.  And see comment above about registry responsibility.

> I am less familiar with the case of Sinhala, but I believe
> that the main issue with all of the other Indic-based scripts
> is the same as the sigma case. That is, it is not a
> significant problem to identify input strings A and B; the
> problem is that the one that is returned from the DNS is not
> the "correct" form. That is, the canonicalization and
> equivalence class are acceptable, but the particular
> representative of that equivalence class that is in the DNS is
> not the preferred form. 

And that is a problem that, unless someone has more insight into
the situation than I do, can be solved only with server-side
matching.  Perhaps there are patches short of that which would
address some of the cases, but server-side matching is the only
real solution (just as it was with ASCII case forms).   I'd love
to see that, but see no way to get there from here that doesn't
require putting IDNs on hold for a decade or more.
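
A small illustration of the "right equivalence class, wrong
representative" problem, as a sketch only: Python's str.casefold()
stands in here for whatever mapping the protocol applies before
lookup (an assumption; it is not the actual IDNA mapping), and
match_key is a hypothetical helper name.

```python
def match_key(label: str) -> str:
    """Map a label to the form that would be stored and compared.

    str.casefold() is used as a stand-in for the client-side mapping step.
    """
    return label.casefold()


# Greek word ending in final sigma (U+03C2), and the same letters
# typed with the non-final sigma (U+03C3).
with_final_sigma = "\u03bf\u03b4\u03cc\u03c2"
with_medial_sigma = "\u03bf\u03b4\u03cc\u03c3"

# Both inputs land in the same equivalence class, so lookup works.
same_class = match_key(with_final_sigma) == match_key(with_medial_sigma)

# The stored representative is not the form the user considers
# correct: the final sigma has been mapped away.
preferred_form_preserved = match_key(with_final_sigma) == with_final_sigma
```

Here same_class is true while preferred_form_preserved is false:
the two inputs are identified correctly, but what comes back is the
mapped form.  Without matching on the server there is no place to
keep the preferred spelling.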

> For some cases, as with Arabic, two
> words with different meanings may map together, but frankly,
> that is on the order of the difference between:
> 
> therapist.com // read as "Therapist"
> therapist.com // read as "the rapist"

An interesting example.  I'll leave the issue of whether it is a
proper analogy in the hands of our Arabic-script-using friends.

Historically (going back to before the dawn of the DNS), our
assumption was that words would be joined together in labels (or
host names) only with intervening hyphens and not by simple
concatenation.  That model would have eliminated the problem
above.  We lost that battle when the DNS went commercial and
marketing types decided that they would prefer concatenation,
often taking advantage of case preservation and server-side
matching to make things work (hence we typically see
"SillyLabel" rather than "sillylabel").  For the IDN case,
where that option is not available because we have to map
rather than match on the server, I believe we should give
considerable deference to what the registries and other relevant
communities think is appropriate.  But your point is clear.

> Personally, I believe that we could get along quite well without
> having the four special cases (eszett, sigma, joiners) in
> IDNA2008. That is, the advantages of compatibility outweigh
> the utility of a breaking change. And as we all know, IDNA
> does not guarantee that all the text in any given language can
> be in IDNs: the simplest example is the English word: *can't*.

At least some of my grammarian friends would claim that isn't a
word at all but a typographical and pronunciation convenience.
However, as I believe everyone knows by now, there are lots of
other examples in English that are equally obvious.

> That is my personal opinion -- not necessarily representing
> particular Unicode Consortium members, who may have different,
> reasonable views of the importance of these cases.

I appreciate your making the distinction.

     john