I want to make sure that people understand that the contextual rules in <a href="http://www.unicode.org/reports/tr31/#Layout_and_Format_Control_Characters">http://www.unicode.org/reports/tr31/#Layout_and_Format_Control_Characters</a> (section 2.3) are not perfect. They do not characterize <i>precisely</i> all and only those cases where joiners make a visual difference. Part of the issue is that the results may vary somewhat by font.<br>
<br>What those rules do is filter the problematic cases down to an extremely small set, as a percentage of total text. They prevent problems with the majority of scripts: Latin, Cyrillic, CJK, and so on. They also do a good job with Arabic, with any normal fonts.<br>
<br>With Indic scripts, the situation is slightly different. The rules limit the cases severely, disallowing joiners after almost all characters, where they don't make a visual difference. However, taking the example of Malayalam, something like half of the cases where the rules allow joiners will not typically have a difference in visual display. With Tamil, even fewer; with Sinhala, more.<br>
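<br>To make the Indic case concrete, here is a rough sketch in Python of only the virama-context part of such a rule. It assumes a joiner is acceptable only when it immediately follows a character whose canonical combining class is 9 (Virama). The function name is hypothetical, and the Arabic joining-type conditions are left out entirely, so this is an illustration and not the full rule set.<br>

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
VIRAMA_CCC = 9   # canonical combining class assigned to viramas


def joiners_in_virama_context(label: str) -> bool:
    """Return True if every ZWJ/ZWNJ in label directly follows a virama.

    Simplified assumption: only the "after a virama" condition is checked;
    the Arabic-script joining-type conditions are not modelled here.
    """
    for i, ch in enumerate(label):
        if ch in (ZWNJ, ZWJ):
            # A joiner at the start of the label has no preceding character.
            if i == 0:
                return False
            if unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True
```

<br>Under this check, a Latin string such as "ab" + ZWJ + "c" is rejected, while Malayalam KA + VIRAMA + ZWJ + KA is accepted. Passing the check says only that the joiner sits in a position where it <i>can</i> matter; whether it actually changes the display still depends on the script and the font.<br>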
<br>Now, that "filtering down to an extremely small set" is worth doing, either in the protocol or via client-side notification, but I just wanted people to understand the limitation: it is not a panacea.<br>
<br clear="all">Mark<br>
<br><br><div class="gmail_quote">On Thu, Feb 26, 2009 at 20:17, John C Klensin <span dir="ltr"><<a href="mailto:klensin@jck.com">klensin@jck.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<br>
<br>
--On Thursday, February 26, 2009 19:46 -0800 Paul Hoffman<br>
<div class="Ih2E3d"><<a href="mailto:phoffman@imc.org">phoffman@imc.org</a>> wrote:<br>
<br>
> This tees into John's recent thread on parsing the issues and<br>
> finding a middle ground. I have included many of the<br>
> suggestions from the mailing list and off-line responses. Most<br>
> significantly, I have changed ZWNJ and ZWJ from "mapped to<br>
> nothing" to being allowed so that Arabic labels will be more<br>
> realistic.<br>
<br>
</div>Paul,<br>
<br>
With the understanding that I still don't believe this is the<br>
right way to go, one technical correction and one issue:<br>
<br>
(1) ZWJ and ZWNJ are not needed for Arabic language orthography.<br>
ZWNJ is needed for Persian languages and what are sometimes<br>
called Indo-Arabic ones (e.g., Urdu, but there are _many_<br>
others). Both ZWJ and ZWNJ are needed for several of the Indic<br>
scripts and associated languages (although slightly fewer with<br>
Unicode 5.1 than with Unicode 3.2).<br>
<br>
(2) When one considers the number of registries/zones on the<br>
Internet or even those that exist only at the second level<br>
(i.e., maintaining registrations for third-level names), it is<br>
certain that some of them will be operated by people with bad<br>
intentions. Given that, are you confident that ZWJ/ZWNJ can<br>
simply be treated as ordinary characters, relying on the<br>
registries to prevent those characters where they would be fully<br>
invisible?<br>
<br>
When faced with that question very early in the IDNA2008 design<br>
process, we concluded that there were four possible answers:<br>
<br>
(i) Yes, we trust the registries and are willing to live<br>
with labels like "ábc" failing to compare equal to<br>
"áb<ZWJ>c" despite looking exactly the same when<br>
displayed by normal rendering software.<br>
<br>
(ii) We don't quite trust the registries but are<br>
confident that all rendering software, on all operating<br>
systems, that encounter strings like "áb<ZWJ>c" will<br>
get upset in sufficiently vivid ways to warn the user off.<br>
We didn't think rendering ZWJ as a little box or<br>
question mark would be adequate for that case because it<br>
might be a legitimate character for which no font was<br>
available even though it would at least not be confused<br>
with "ábc".<br>
<br>
(iii) We either leave things as they are in IDNA2003<br>
(map to nothing) or simply ban the character. Either<br>
one puts the scripts that need one or both of these<br>
characters at an intolerable disadvantage.<br>
<br>
(iv) We adopt some sort of "contextual rule" model,<br>
despite the complexity it adds.<br>
<br>
Obviously, we chose the fourth. We did so because we didn't<br>
believe the assumptions that (i) or (ii) implied and did not<br>
consider (iii) to be acceptable given the number of people who<br>
use the relevant scripts. As I read your document, you are<br>
proposing (i). Is that correct and, if so, could you explain a<br>
bit better how you see the tradeoffs?<br>
<br>
Please also note that, if you permit ZWJ and/or ZWNJ as<br>
characters, we end up in exactly the same situation that you and<br>
others have objected to with Eszett and Final Sigma, i.e., an<br>
input string that converts to a different A-label in IDNA2003<br>
and IDNA2008. I'm prepared to live with that but, to the degree<br>
to which you consider it a problem so serious as to require<br>
rechartering and a completely different document strategy, I'd<br>
like to better understand the exception and its implications.<br>
In particular, I don't see the section of your outline document<br>
that discussed the transition strategy that many people (I think<br>
including you, but could be wrong about that) have argued is<br>
absolutely essential if there are going to be any<br>
incompatibilities of that sort.<br>
<br>
best,<br>
<font color="#888888"> john<br>
</font><div><div></div><div class="Wj3C7c"><br>
</div></div></blockquote></div><br>