<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">

Just a small suggestion: minimization of rules may conflict with clarity and understanding. Perhaps one could distinguish the "rules" from "an implementation of the rules", where the implementation does try to fold some tests together?<div><br class="webkit-block-placeholder"></div><div>v</div><div><br><div><div>On Feb 7, 2008, at 2:01 AM, Mark Davis wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite">I don't think these are the minimal rules yet, since some *parts* of these rules can be removed. Here are the ones you say to keep:<br><br>Proposed rules (numbering for reference):<ol><li>Only characters with the BIDI properties L, R, AL, EN, ES, BN, ON and NSM are allowed.</li> <li>ES and ON are not allowed in the first position</li><li>ES and ON, followed by zero or more NSM, is not allowed in the last position</li><li>If an L is present, no R, AL or AN may be present.</li><li>If an EN is present, no AN may be present</li> <li>The first character may not be an NSM</li><li>The first character may not be an EN (European Number) or an AN (Arabic Number).</li></ol>Comments<br><ol><li>All of this applies only to Bidi labels: that is, those with BIDI properties R, AL, or AN. Because of that, #4 can be changed to be simply: No L.<br> </li><li>According to #1, AN is not allowed at all. That would remove #5, and remove AN from #4 and #7. However, I think that's a mistake; as discussed before, we need to include AN in #1.</li><li>Because the protocol limits labels to only [[:L:][:Mn:][:Mc:][:Nd:]] plus a handful of exceptions, #1 is redundant with the other restrictions in the protocol document. If it is retained, we should at least have a note about that.<br> </li><li>The only characters allowed in NSM are [:Mn:] and [:Me:]. The protocol forbids [:Me:] entirely, and forbids [:Mc:] in first position. So #6 is redundant. 
If it is retained, we should at least have a note about that.</li> <li>#7 can be combined with #2; both are about the first character.<br></li></ol>Assuming the rest are required (although it might be worth trying the removal of individual properties to make sure all are), here is what a minimal list would look like, I think:<br> <br><div style="margin-left: 40px;">For any label containing a character with BIDI property value R, AL, or AN:<br></div><ol style="margin-left: 40px;"><li>No character can have the BIDI property value L.</li><li>The first character cannot have BIDI property value ES, ON, EN, or AN.<br> </li><li>The label cannot end with a sequence where the first character has BIDI property value ES or ON, and the remainder is a sequence of zero or more characters with BIDI property value NSM.</li><li>The label cannot contain two characters where one has BIDI property value EN and the other has BIDI property value AN.<br> </li></ol>This would be equivalent to saying in regex notation:<br><br><div style="margin-left: 40px;">If the label matches:<br></div><ul style="margin-left: 40px;"><li>.*[[:bc=R:][:bc=AL:][:bc=AN:]].*</li></ul><div style="margin-left: 40px;"> Then it MUST NOT also match:<br></div><ul style="margin-left: 40px;"><li>.*[:bc=L:].*<br>|[[:bc=ES:][:bc=ON:][:bc=EN:][:bc=AN:]].*<br>|.*[[:bc=ES:][:bc=ON:]][:bc=NSM:]*<br>|.*[:bc=EN:].*[:bc=AN:].*<br>|.*[:bc=AN:].*[:bc=EN:].*<br> </li></ul>We would need to verify that this matches all the exclusions above, of course. You can do that with ICU's regex, if you are feeling adventurous! 
You might have to use Perl syntax for the properties, eg \p{bc=L} and so on.<br> <br>Mark<br><br><div class="gmail_quote">On Feb 6, 2008 9:53 PM, Erik van der Poel &lt;<a href="mailto:erikv@google.com">erikv@google.com</a>&gt; wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> It turns out that ICU4C has an option called<br>UBIDI_KEEP_BASE_COMBINING, and this did the trick (putting the base<br>character and combining mark in the "correct" order). When I did that,<br>I found that I no longer needed the new rule that I mentioned in my<br> previous email. You're also right that "If an R, AL or AN is present,<br>no L may be present" is simply redundant (same as the previous rule).<br>So here are all the rules, with "Keep" and "Remove" indicated:<br> <br>Keep:<br><div class="Ih2E3d">Only characters with the BIDI properties L, R, AL, EN, ES, BN, ON<br>and NSM are allowed.<br><br></div>Keep:<br>ES and ON are not allowed in the first position<br><br>Keep:<br>ES and ON, followed by zero or more NSM, is not allowed in the<br> last position<br><br>Keep:<br><div class="Ih2E3d">If an L is present, no R, AL or AN may be present.<br><br></div>Remove (redundant):<br><div class="Ih2E3d">If an R, AL or AN is present, no L may be present.<br><br></div> Keep:<br><div class="Ih2E3d">If an EN is present, no AN may be present<br><br></div>Remove (redundant):<br><div class="Ih2E3d">If an AN is present, no EN may be present<br><br></div>Remove:<br><div class="Ih2E3d">If an AN is present, at least one R or AL must be present<br> <br></div>Keep:<br>The first character may not be an NSM<br><br>Keep:<br>The first character may not be an EN (European Number) or an AN<br>(Arabic Number).<br><br>Harald, are you also able to remove the rules that I marked "Remove"<br> above, and still have the tests pass?<br><font color="#888888"><br>Erik<br></font><div><div></div><div class="Wj3C7c"><br>On Feb 6, 2008 7:52 
PM, Mark Davis &lt;<a href="mailto:mark.davis@icu-project.org">mark.davis@icu-project.org</a>&gt; wrote:<br> &gt; Here is what is happening with that. The BIDI algorithm is designed for<br>&gt; display, and in display, the NSMs are designed to follow their base -- in<br>&gt; display order. That means an NSM following an R character will come after it<br> &gt; within its level (odd). So that is covered by the following rule:<br>&gt;<br>&gt;<br>&gt;<br>&gt;<br>&gt; L3. Combining marks applied to a right-to-left base character will at this<br>&gt; point precede their base character. If the rendering engine expects them to<br> &gt; follow the base characters in the final display process, then the ordering<br>&gt; of the marks and the base character must be reversed.<br>&gt;<br>&gt;<br>&gt;<br>&gt; What you are not seeing when you just look at the text is that what the bidi<br> &gt; algorithm actually produces is a series of levels associated with the text.<br>&gt; That level information is available at the time of L3 for the display engine<br>&gt; to use.<br>&gt;<br>&gt; So what you need to do is make your test do L3 (not done by ICU, since it is<br> &gt; targeted at display layout). So you need to make one more pass through each<br>&gt; segment of text that is at an odd level, and reverse any sequence matching<br>&gt; the following regex:<br>&gt;  /[:bc=NSM:]+ [:^bc=NSM:]/<br> &gt;<br>&gt; =================<br>&gt;<br>&gt; I haven't looked in detail at "If an R, AL or AN is present, no L may be<br>&gt; present.", and don't have the other rules handy here -- it may be redundant<br> &gt; with them. But just to restate the test case:<br>&gt;<br>&gt; For the collision test, your test should be checking two environments:<br>&gt;<br>&gt; a) RLM + test_string<br>&gt; b) LRM + test_string<br>&gt;<br>&gt; You'll test a series of strings that cover all the combinations. 
There is a<br> &gt; collision failure if string1 has the same bidi results in either (a) or (b)<br>&gt; as string2 in either (a) or (b).<br>&gt;<br>&gt; With that, if I have<br>&gt;<br>&gt; test_string1 = Ab<br>&gt; test_string2 = bA<br> &gt;<br>&gt; I will get:<br>&gt;<br>&gt; test_string1 in (a) =&gt; bA<br>&gt; test_string2 in (b) =&gt; bA<br>&gt;<br>&gt; thus a collision. I also get a collision with<br>&gt;<br>&gt; test_string1 in (b) =&gt; Ab<br>&gt;  test_string2 in (a) =&gt; Ab<br> &gt;<br>&gt; As I said, though, I don't have the other rules handy here -- "If an R, AL<br>&gt; or AN is present, no L may be present." may be just redundant.<br>&gt;<br>&gt; Mark<br>&gt;<br></div></div></blockquote> </div><br><br clear="all"><br>-- <br>Mark <div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">_______________________________________________</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Idna-update mailing list</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><a href="mailto:Idna-update@alvestrand.no">Idna-update@alvestrand.no</a></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><a href="http://www.alvestrand.no/mailman/listinfo/idna-update">http://www.alvestrand.no/mailman/listinfo/idna-update</a></div> </blockquote></div><br></div></body></html>