New version of strawman for IDNAv2

Fri Feb 27 23:59:35 CET 2009

What I was trying to do is set expectations correctly. That is:

   - It is false to say that the contextual rules will disallow joiners in
   all cases where they make no visual difference.
   - It is true to say that the contextual rules can remove the vast
   majority of cases where joiners make no visual difference.
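
To make the two statements above concrete, here is a minimal sketch of one
such contextual rule: a joiner is accepted only when it directly follows a
virama. This is a simplified approximation for illustration, not the actual
rule text; the function name is hypothetical, and the virama test relies on
the assumption that viramas are the characters with canonical combining
class 9.

```python
import unicodedata

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER
VIRAMA_CCC = 9   # canonical combining class used for virama characters


def joiners_follow_virama(label: str) -> bool:
    """Return True if every ZWJ/ZWNJ in label directly follows a virama."""
    for i, ch in enumerate(label):
        if ch in (ZWJ, ZWNJ):
            # A joiner at the start, or after a non-virama, is rejected.
            if i == 0 or unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True
```

A check like this rejects joiners after almost all characters, but it still
accepts "virama + joiner" sequences whose rendering happens to be identical
with and without the joiner, which is why the first statement is false.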

As to the issue you cite, let me give a bit of background. The joiners were
originally designed to be purely formatting controls, to indicate display
preferences. More precisely, they are requests for rendering a particular
way, ideally to be honored by the font/rendering system if possible. But a
font/rendering system may not be able to honor the request, and would then
just give the normal display.

The goal was indeed to encode characters where a *semantic* difference
(rather than a display preference) was necessary. Thus these characters were
meant to be optional. However, the line between semantic and display
differences is to a certain degree a judgment call, and communities have
come to expect behavior whereby the presence or absence of the joiner is
seen as the "wrong" spelling of a word.

Now, in the case of Malayalam, the chilu characters were added because it
was not possible to represent certain text with the characters available at
the time. And that is a good reason for adding new characters. With those
additions, the joiners are not really necessary for the purpose of
identifiers. And while the UTC has no stated opinion on this issue, it would
certainly consider proposals for the addition of characters to represent
semantic differences for Indic scripts that may currently be using
joiner/non-joiner.
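
As a rough illustration of what the added characters change at the code
point level, the sketch below compares one chilu character with a joiner
sequence. Pairing CHILLU N with "NA + virama + ZWJ" is my assumption about
the older representation, and the helper name is hypothetical. The point is
only that the two are distinct strings that normalization does not unify.

```python
import unicodedata

CHILLU_N = "\u0D7B"                   # MALAYALAM LETTER CHILLU N
NA_VIRAMA_ZWJ = "\u0D28\u0D4D\u200D"  # NA + VIRAMA + ZERO WIDTH JOINER


def same_after_nfc(a: str, b: str) -> bool:
    """Return True if a and b are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Here `same_after_nfc(CHILLU_N, NA_VIRAMA_ZWJ)` is False, so an identifier
profile can allow the single character without having to allow the joiner.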

Note -- and this is important -- that the addition of characters that have
the same effect as "virama + non-joiner" or "virama + joiner" often does
absolutely nothing to reduce cases of visual similarity: the new characters
will look just like other characters in circumstances where the joiners had
no effect! The advantage from the IDNA perspective, however, is that it
makes testing easier.

I am less familiar with the case of Sinhala, but I believe that the main
issue with all of the other Indic-based scripts is the same as the sigma
case. That is, it is not a significant problem to treat input strings A
and B as identical; the problem is that the one returned from the DNS is not
the "correct" form. In other words, the canonicalization and equivalence
class are
acceptable, but the particular representative of that equivalence class that
is in the DNS is not the preferred form. For some cases, as with Arabic, two
words with different meanings may map together, but frankly, that is on the
order of the difference between:

therapist.com // read as "Therapist"
therapist.com // read as "the rapist"
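
The sigma case mentioned above can be seen with Python's built-in "idna"
codec, which implements the older IDNA2003 mapping. This is a small sketch
only; the helper name and the sample word are my own choices for
illustration.

```python
def idna2003_round_trip(label: str) -> str:
    """Encode a label with the IDNA2003 codec and decode it back."""
    return label.encode("idna").decode("idna")


# Greek "bolos" spelled with a final sigma (U+03C2) as the last letter.
WITH_FINAL_SIGMA = "\u03b2\u03cc\u03bb\u03bf\u03c2"
# The same letters with a regular small sigma (U+03C3) at the end.
WITH_PLAIN_SIGMA = "\u03b2\u03cc\u03bb\u03bf\u03c3"
```

Both spellings map to the same name, and what comes back is the
plain-sigma form: the equivalence class is acceptable, but its
representative is not the spelling a Greek reader would prefer.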

Personally, I believe that we could get along quite well without having the
four special cases (eszett, final sigma, and the two joiners) in IDNA2008.
That is, the advantages
of compatibility outweigh the utility of a breaking change. And as we all
know, IDNA does not guarantee that all the text in any given language can be
in IDNs: the simplest example is the English word: *can't*.

That is my personal opinion -- not necessarily representing particular
Unicode Consortium members, who may have different, reasonable views of the
importance of these cases.

Mark

On Fri, Feb 27, 2009 at 07:43, John C Klensin <klensin at jck.com> wrote:

>
>
> --On Friday, February 27, 2009 07:27 -0800 Mark Davis
> <mark at macchiato.com> wrote:
>
> >...
> > With Indic scripts, the situation is slightly different. The
> > rules limit the cases severely, disallowing joiners where they
> > don't make a visual difference after almost all characters.
> > However, taking the example of Malayalam, something like half
> > of the cases where it allows joiners will not typically have a
> > difference in visual display. With Tamil even fewer, with
> > Sinhala, more.
>
> FWIW, I was told early in this week that, with additional
> precomposed characters added in Unicode 5.1, Malayalam doesn't
> need joiners in IDNs at all.  The person who made the comment
> more or less suggested that the proper solution to the issue
> with joiners and the other Indic scripts was to add sufficient
> additional characters to Unicode to capture the important
> shaping forms (e.g., the half-characters), thereby also
> eliminating the requirement.  He felt that was a better solution
> generally because people who write the language understand about
> half-characters but have never heard of ZWJ.
>
> I did not encourage him to either make the relevant proposals or
> to sit around waiting for this to happen (proposals or not).
>
> However, to your specific point, Mark: do you suggest that the
> fact that we cannot completely cover all cases implies that we
> should give up and either abandon the joiners or leave this all
> to the registries?
>
>    john
>
>