tables-06b.txt Pseudo-code clarification

Kenneth Whistler kenw at sybase.com
Wed Jul 22 03:46:41 CEST 2009


Patrik,

Before delving into the syntax issues remaining for
the rule sets for A.5, A.6, A.8, and A.9, I want to
back up and consider the pseudo-code conventions
described at the top of Appendix A. Some of the
concerns raised about the syntax for the Hebrew
gershayim result, I think, from the fact that the
pseudo-code conventions aren't as clear as they
could be.

In particular, the meaning of the constructs
Before(cp) and After(cp) is unclear enough that it
will likely lead to misunderstandings and inconsistent
attempts at implementation of the rules.

I suggest that this be addressed by rewriting the
paragraph which explains the pseudo-code conventions.
Among other things, breaking a set of conventions like
these out of paragraph form into bullet-like sections
will make them easier for people to read and understand.

In a set of rules like this, I think conciseness is less
important than clarity, so my suggested rewrite will be
a little more long-winded than the current draft, but I 
hope much clearer in the long run.

Also, because of the way this pseudo-code is trying to
mix property functions and string position functions,
I think it is important to introduce an explicit
"Undefined" term which can then be used consistently
to deal with invalid string positions or property functions
involving invalid codepoints.

So here is my attempt at a rewrite for clarity. I'm not
trying to change the intent of any of this pseudo-code,
as developed to express the rule sets for CONTEXTO --
just to make it clearer and more rigorous.

--Ken

=========================================================

The grammatical rules are expressed in pseudo-code. The
conventions used for that pseudo-code are explained here.

Each rule is constructed as a Boolean expression that
evaluates to either True or False. A simple "True;" or
"False;" rule sets the default result value for the rule set.
Subsequent conditional rules that evaluate to True or
False may re-set the result value.

A special value "Undefined" is used to deal with any
error conditions, such as an attempt to test a character
before the start of a label or after the end of a label.
If any term of a rule evaluates to Undefined, further
evaluation of the rule immediately terminates, as the
result value of the rule will itself be Undefined.

cp represents the codepoint to be tested.

FirstChar is a special term which denotes the first codepoint
in a label.

LastChar is a special term which denotes the last codepoint
in a label.

.eq. represents the equality relation.

     A .eq. B evaluates to True if A equals B.
     
.ne. represents the non-equality relation.

     A .ne. B evaluates to True if A is not equal to B.
     
.in. represents the set inclusion relation.

     A .in. B evaluates to True if A is a member of the set B.
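
For implementers, the three relations above and the Undefined
convention might be sketched in Python roughly as follows. This is
only an illustration under stated assumptions: the names UNDEFINED,
eq, ne, is_in and evaluate_rule are hypothetical, and combining the
terms of a rule with a logical AND is an assumption made for the
example, not something the conventions require.

```python
# Hypothetical sentinel standing in for the "Undefined" value.
class _Undefined:
    def __repr__(self):
        return "Undefined"


UNDEFINED = _Undefined()


def eq(a, b):
    """A .eq. B; Undefined on either side makes the result Undefined."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b


def ne(a, b):
    """A .ne. B; Undefined on either side makes the result Undefined."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a != b


def is_in(a, members):
    """A .in. B; an Undefined A makes the result Undefined."""
    if a is UNDEFINED:
        return UNDEFINED
    return a in members


def evaluate_rule(terms):
    """Evaluate zero-argument callables left to right.

    Evaluation stops at the first term that yields Undefined.
    ASSUMPTION: the remaining terms are combined with a logical AND.
    """
    values = []
    for term in terms:
        value = term()
        if value is UNDEFINED:
            return UNDEFINED  # later terms are never evaluated
        values.append(value)
    return all(values)
```

Passing the terms as callables is what lets evaluation stop early:
a term that follows an Undefined one is never invoked.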
     
A functional notation, Function_Name(cp), is used to express
string positions within a label, Boolean character property
tests of a codepoint, or a regular expression match. When such
function names refer to Boolean character property tests, the
function names use the exact Unicode character property name
for the property in question, and "cp" is evaluated as the
Unicode value of the codepoint to be tested, rather than as its
position in the label. When such function names refer to string
positions within a label, "cp" is evaluated as its position in
the label.

RegExpMatch(X) takes as its parameter X a schematic regular
expression consisting of a mix of Unicode character property
values and literal Unicode codepoints.

Script(cp) returns the value of the Unicode Script property,
as defined in Scripts.txt in the Unicode Character Database.

Canonical_Combining_Class(cp) returns the value of the
Unicode Canonical_Combining_Class property, as defined in
UnicodeData.txt in the Unicode Character Database.
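
As a rough illustration of a property function that takes the
Unicode value of a codepoint, Canonical_Combining_Class(cp) can be
sketched in Python with the standard unicodedata module. The function
name canonical_combining_class is hypothetical, and the values
returned reflect whatever Unicode version the Python build ships
with. The standard library has no equivalent lookup for the Script
property, so Script(cp) would need data from Scripts.txt or a
third-party package.

```python
import unicodedata


def canonical_combining_class(cp):
    """Return the Canonical_Combining_Class of cp.

    cp is the integer Unicode value of the codepoint, not its
    position in the label.
    """
    return unicodedata.combining(chr(cp))
```

For example, canonical_combining_class(0x094D) gives 9 (the Virama
class, here for DEVANAGARI SIGN VIRAMA), while a non-combining
character such as U+0061 gives 0.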

Before(cp) returns the codepoint of the character
immediately preceding cp in logical order in the string
representing the label. Before(FirstChar) evaluates to
Undefined.

After(cp) returns the codepoint of the character
immediately following cp in logical order in the string
representing the label. After(LastChar) evaluates to
Undefined.

Note that "Before" and "After" do not refer
to the visual display order of the character in a label,
which may be reversed or otherwise modified by the
bidirectional algorithm for labels including characters
from scripts written right-to-left.
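
A minimal Python sketch of Before and After as string position
functions, assuming the label is held as a Python string in logical
order and "cp" is passed as a zero-based index into it. The names
before, after and UNDEFINED are hypothetical, chosen only for this
example.

```python
# Hypothetical sentinel standing in for the "Undefined" value.
UNDEFINED = object()


def before(label, pos):
    """Codepoint immediately preceding position pos, in logical order."""
    if not 0 < pos < len(label):
        # Covers Before(FirstChar) and any invalid position.
        return UNDEFINED
    return ord(label[pos - 1])


def after(label, pos):
    """Codepoint immediately following position pos, in logical order."""
    if not 0 <= pos < len(label) - 1:
        # Covers After(LastChar) and any invalid position.
        return UNDEFINED
    return ord(label[pos + 1])


# Alef, gershayim, bet, stored in logical order even though the
# label is displayed right-to-left.
label = "\u05D0\u05F4\u05D1"
preceding = before(label, 1)   # 0x05D0, the alef
following = after(label, 1)    # 0x05D1, the bet
```

Because the functions index the stored string, the bidirectional
algorithm never enters into the result.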

Repeated evaluation for all characters in a label makes
use of the special construct:

   For All Characters:
      Expression;
   End For;
   
This construct requires repeated evaluation of "Expression"
for each codepoint in the label, starting from FirstChar
and proceeding to LastChar.
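
One possible reading of this construct, sketched in Python. The
conventions do not spell out how the per-character results are
combined, so the following are assumptions made for the example:
the expression returns True or False to re-set the result value,
returns None when its condition does not apply, and an Undefined
result ends evaluation at once. The names for_all_characters and
UNDEFINED are hypothetical.

```python
# Hypothetical sentinel standing in for the "Undefined" value.
UNDEFINED = object()


def for_all_characters(label, expression, default):
    """Evaluate expression(label, pos) from FirstChar to LastChar.

    default is the result value set by a preceding "True;" or
    "False;" rule.
    """
    result = default
    for pos in range(len(label)):
        value = expression(label, pos)
        if value is UNDEFINED:
            return UNDEFINED      # ASSUMPTION: Undefined aborts the loop
        if value is not None:
            result = value        # a conditional rule re-set the result
    return result


def contains_letter_b(label, pos):
    """Toy conditional rule: True when the character at pos is 'b'."""
    return True if label[pos] == "b" else None


found = for_all_characters("abc", contains_letter_b, False)
not_found = for_all_characters("xyz", contains_letter_b, False)
```

Here found ends up True and not_found keeps the default of False.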

===============================================================


