New version of strawman for IDNAv2

Mark Davis mark at macchiato.com
Fri Feb 27 16:27:30 CET 2009


I want to make sure that people understand that the contextual rules in
http://www.unicode.org/reports/tr31/#Layout_and_Format_Control_Characters
(section 2.3) are not perfect. They do not characterize *precisely* all and
only those cases where joiners make a visual difference. Part of the issue is
that the results may vary somewhat by font.

What those rules do is filter the problematic cases down to an extremely
small set, as a percentage of total text. They prevent problems with the
majority of scripts: Latin, Cyrillic, CJK, and so on. They also do a good
job with Arabic, with any normal fonts.

With Indic scripts, the situation is slightly different. The rules limit the
cases severely: they disallow joiners after almost all characters, where
joiners make no visual difference. However, taking the example of Malayalam,
something like half of the cases where the rules do allow joiners will not
typically have a difference in visual display. With Tamil even fewer, with
Sinhala, more.

Now, that "filtering down to an extremely small set" is worth doing, either
in the protocol or via client side notification, but I just wanted people to
understand the limitations, that it is not a panacea.
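
As a rough illustration of what such a contextual filter does, here is a
minimal Python sketch. It assumes a deliberately simplified rule (a joiner is
accepted only immediately after a character with canonical combining class 9,
i.e. a virama) and is not the full rule set of UAX #31 section 2.3; in
particular it leaves out the Arabic joining-type context, which Python's
unicodedata module does not expose. The function names are hypothetical.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
VIRAMA_CCC = 9   # canonical combining class assigned to viramas


def joiners_in_allowed_context(label):
    """Simplified, assumed rule: every ZWJ/ZWNJ must directly follow a virama."""
    for i, ch in enumerate(label):
        if ch in (ZWJ, ZWNJ):
            if i == 0 or unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True


def strip_joiners(label):
    """IDNA2003-style treatment: both joiners are mapped to nothing."""
    return label.replace(ZWJ, "").replace(ZWNJ, "")


# Latin label with an invisible joiner: "a-acute", "b", ZWJ, "c"
latin_with_zwj = "\u00e1b" + ZWJ + "c"

# Malayalam KA + VIRAMA + ZWJ: a context the simplified rule accepts
malayalam_with_zwj = "\u0d15\u0d4d" + ZWJ

if __name__ == "__main__":
    print(joiners_in_allowed_context(latin_with_zwj))
    print(joiners_in_allowed_context(malayalam_with_zwj))
    print(latin_with_zwj == "\u00e1bc")
    print(strip_joiners(latin_with_zwj) == "\u00e1bc")
```

The Latin label is rejected because the joiner follows "b", while the
Malayalam sequence is accepted. Being accepted only means the joiner sits in
a context where it *may* matter; as noted above, it still may not change the
display in a given font.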

Mark


On Thu, Feb 26, 2009 at 20:17, John C Klensin <klensin at jck.com> wrote:

>
>
> --On Thursday, February 26, 2009 19:46 -0800 Paul Hoffman
> <phoffman at imc.org> wrote:
>
> > This tees into John's recent thread on parsing the issues and
> > finding a middle ground. I have included many of the
> > suggestions from the mailing list and off-line responses. Most
> > significantly, I have changed ZWNJ and ZWJ from "mapped to
> > nothing" to being allowed so that Arabic labels will be more
> > realistic.
>
> Paul,
>
> With the understanding that I still don't believe this is the
> right way to go, one technical correction and one issue:
>
> (1) ZWJ and ZWNJ are not needed for Arabic language orthography.
> ZWNJ is needed for Persian languages and what are sometimes
> called Indo-Arabic ones (e.g., Urdu, but there are _many_
> others).  Both ZWJ and ZWNJ are needed for several of the Indic
> scripts and associated languages (although slightly fewer with
> Unicode 5.1 than with Unicode 3.2).
>
> (2) When one considers the number of registries/zones on the
> Internet or even those that exist only at the second level
> (i.e., maintaining registrations for third-level names), it is
> certain that some of them will be operated by people with bad
> intentions.  Given that, are you confident that ZWJ/ZWNJ can
> simply be treated as ordinary characters, relying on the
> registries to prevent those characters where they would be fully
> invisible?
>
> When faced with that question very early in the IDNA2008 design
> process, we concluded that there were four possible answers:
>
>        (i) Yes, we trust the registries and are willing to live
>        with labels like "ábc" failing to compare equal to
>        "áb<ZWJ>c" despite looking exactly the same when
>        displayed by normal rendering software.
>
>        (ii) We don't quite trust the registries but are
>        confident that all rendering software, on all operating
>        systems, that encounters strings like "áb<ZWJ>c" will
>        get upset in sufficiently vivid ways to warn the user off.
>        We didn't think rendering ZWJ as a little box or
>        question mark would be adequate for that case because it
>        might be a legitimate character for which no font was
>        available even though it would at least not be confused
>        with "ábc".
>
>        (iii) We either leave things as they are in IDNA2003
>        (map to nothing) or simply ban the character.  Either
>        one puts the scripts that need one or both of these
>        characters at an intolerable disadvantage.
>
>        (iv) We adopt some sort of "contextual rule" model,
>        despite the complexity it adds.
>
> Obviously, we chose the fourth.  We did so because we didn't
> believe the assumptions that (i) or (ii) implied and did not
> consider (iii) to be acceptable given the number of people who
> use the relevant scripts.   As I read your document, you are
> proposing (i).   Is that correct and, if so, could you explain a
> bit better how you see the tradeoffs?
>
> Please also note that, if you permit ZWJ and/or ZWNJ as
> characters, we end up in exactly the same situation that you and
> others have objected to with Eszett and Final Sigma, i.e., an
> input string that converts to a different A-label in IDNA2003
> and IDNA2008.  I'm prepared to live with that but, to the degree
> to which you consider it a problem so serious as to require
> rechartering and a completely different document strategy, I'd
> like to better understand the exception and its implications.
> In particular, I don't see the section of your outline document
> that discussed the transition strategy that many people (I think
> including you, but could be wrong about that) have argued is
> absolutely essential if there are going to be any
> incompatibilities of that sort.
>
> best,
>    john