Tables: Context Rules

Patrik Fältström patrik at frobbit.se
Wed Jul 15 17:06:26 CEST 2009


On 19 nov 2008, at 17.22, Mark Davis wrote:

> *Location*
>
> *Since this will not be part of the final document: the text will be
> moved to the IANA registry and be maintained there -- there needs to
> be a note to the readers and editor to that effect at the top of the
> section. There should also be an ed note there (in John's style)
> indicating that the following rules still require much work.*

An IANA rule set is always filled with initial data from an RFC, so I
do not see that this has to be moved any further than what has already
happened in version -05.

> *Pseudocode*
>
> *There should be some explanation of the syntax and functions, even
> if not precise. The syntax needs to be a bit more extended to be
> useful.*

Ok

> *I'd suggest defining P to be the current position of the character
> being tested, F to be the position of the first character, and L to
> be the position of the last character. Then we don't need constructs
> such as LastChar, and can be more expressive, because we have to be
> able to look at more than one character before/after; eg we can then
> use Script(Character[P-2]) to get the script of the previous to last
> character. (Note: I include F just so we don't have to decide between
> zero-based or one-based, but it would be even simpler to do
> zero-based.)*

Version -05 of the document does include some text that should make
things easier to understand.

> *I'd also prefer just using = instead of .eq., but that's just a
> preference.*

I did choose .eq. as some think = is not a test, but assignment...

> *The rules need to be carefully reviewed for clarity and consistency
> with the text (and vice versa). For example, even for a simple case
> like Geresh there are many problems.*
>
>
> Overview:
> The script of the preceding character and the subsequent character, if
> any, MUST be Hebrew.//
> The scope of "if any" must be clear. Is it to apply to both the
> preceding and subsequent, or just the subsequent?

The way I read the text (of course, as I wrote it), it says: "The
character before MUST be Hebrew" plus "If there is a character after,
then that MUST be Hebrew".

I am now changing to:

The script of the preceding character MUST be Hebrew.

> // And it must not require the second, because it can be final in a
> word, which means it is fine to follow with "-" or other non-Hebrew.

I do not like the word "word" here. We talk about checking whole
labels. Maybe that should be clarified. That there is no character
before implies that the character we look at is the first in a label.

I think your point here is that if a label has characters from more
than one script, then the character after the Geresh might be in a
different script? I.e. in reality the rule you want is "only" that the
character before the Geresh is Hebrew script?

> Rule Set:
> If FirstChar .eq. True then False;
> Else If BeforeScript .eq. Hebrew Then
>    If AfterScript .eq. Hebrew Then True;
>    Else False;
>
> // This is missing a trailing Else (made clear by my block
> // indentation).
> // While it shouldn't require an AfterScript, even the syntax is
> // ill-defined:
> //    What is the value of AfterScript if there is no character
> //    after? There is no check to make sure that it isn't LastChar.

I have now changed to:

False;
If Script(Before(cp)) .eq. Hebrew Then True;
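
As a rough illustration only, the revised Geresh rule could be sketched
in Python as follows. The helper names (is_hebrew, geresh_allowed) are
made up for this example, and since Python's standard library does not
expose the Unicode Script property, the script test is approximated by
looking at the character name. That approximation is an assumption and
not how the document defines Script().

```python
import unicodedata

GERESH = "\u05F3"  # HEBREW PUNCTUATION GERESH


def is_hebrew(ch: str) -> bool:
    # Assumption: approximate Script(ch) == Hebrew via the character name,
    # because the stdlib has no Script property lookup.
    return unicodedata.name(ch, "").startswith("HEBREW ")


def geresh_allowed(label: str, pos: int) -> bool:
    """Default is False; True only if the preceding character is Hebrew."""
    if label[pos] != GERESH:
        raise ValueError("character at pos is not GERESH")
    if pos == 0:
        # No character before: the Geresh is first in the label.
        return False
    return is_hebrew(label[pos - 1])
```

Note that nothing is required of the character after the Geresh, so a
Geresh at the end of a label, or followed by "-", passes the check.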

> *9. HYPHEN-MINUS*
> Overview: Must appear at the beginning or end of a label.
> ...
> Rule Set:
> If FirstChar .eq. True Then False;
> If LastChar .eq. True Then False;
> Else True;
> =>
> Overview: Must appear neither at the beginning nor at the end of a
> label, and must not be in both the third and fourth positions in the
> string.
> Rule Set:
> If P = F OR P = L Then False;
> Else if P = F+2 And Character[P+1] = "-" Then False;
> Else if P = F+3 And Character[P-1] = "-" Then False;
> Else True;
>
> Hyphen-Minus is quite unlike the rest of the rules in that we can
> NEVER have the above 3 conditions changed. We should just remove it
> from the CONTEXTO rules, since the conditions for its use are in
> Protocol as a separate condition (Hyphen - P4.3.2.1, although this
> needs fleshing out, see previous note) from the CONTEXT conditions
> (P4.3.2.3).

I think the requirements for '-' in positions 3 and 4 are already taken
care of by other policies. Otherwise we would also need "x" and "n" as
contextual rules.
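
For reference, the rule set proposed in the quoted text (with P, F and
L) can be written out as a small Python sketch. This is only an
illustration of that proposal, assuming zero-based positions so that
F = 0 and L = len(label) - 1; the function name hyphen_allowed is made
up for the example.

```python
def hyphen_allowed(label: str, p: int) -> bool:
    """Evaluate the proposed HYPHEN-MINUS rule for the hyphen at position p."""
    if label[p] != "-":
        raise ValueError("character at p is not HYPHEN-MINUS")
    first, last = 0, len(label) - 1
    # Must not be the first or the last character of the label.
    if p == first or p == last:
        return False
    # Must not form "--" in the third and fourth positions.
    if p == first + 2 and label[p + 1] == "-":
        return False
    if p == first + 3 and label[p - 1] == "-":
        return False
    return True
```

With this sketch, both hyphens in "xn--abc" are rejected, while the
hyphen in "a-b" is accepted.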

> *10. ZERO WIDTH NON-JOINER*
> For the rule sets I suggest the following. Rationale: As long as it is
> pseudocode -- it is made up for this purpose and matches no real
> programming language -- we should use a pseudocode that actually works
> to give the same meaning as the prose. And the conditions needed to be
> tighter, as per
> http://unicode.org/reports/tr31/#Layout_and_Format_Control_Characters
> ===
>
> The script must be one in which the use of this character causes
> significant visual transformation of one or both of the adjacent
> characters.
> =>
> The script must be one in which the use of this character causes
> visual transformation of one or both of the adjacent characters that
> are required for significant semantic distinctions in at least some
> cases. This includes ZWNJ after certain Virama characters, and between
> particular joining characters in cursive scripts like Arabic.
> [[anchor9a: The script list for this character is _not_ complete and,
> in particular, more Indic scripts certainly need to be listed.]]

Is it possible to get a complete list, or is your suggestion that the  
list should be open?

> RuleSet
>
> If BeforeScript .eq. ( Deva | Tamil |... ) Then
>  If P = F OR P = L Then False;
>  Else if Canonical_Combining_Class(Character[P-1]) != Virama Then False;
>  Else if Not IsLetter(Character[P-2]) Then False;
>  Else if Not ScriptCount(Character[P-2] + Character[P-1]) > 1 Then False;
>  Else False;
> Else if BeforeScript != Arabic Then False;
> Else if Not MatchesBefore([[:jt=D:][:jt=L:]][:jt=T:]*) Then False;
> Else if Not MatchesAfter([:jt=T:]*[[:jt=D:][:jt=R:]]) Then False;
> Else True;
>
> For more information see Section 2.3 Layout and Format Control
> Characters in [UAX31].

Ok...

The problem for me here is that you change the syntax/pseudocode AND  
you come up with new rules. ;-)

Not easy for me to digest, as I have to translate back to the
pseudocode I use in the document as well as understand your rules...

I hope I have done the right thing, although the rules are not as  
explicit as yours.
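
To make the Virama part of the quoted rule set a bit more concrete,
here is a partial Python sketch. It only covers the position test, the
Canonical_Combining_Class test and the IsLetter test. The ScriptCount
condition and the Arabic joining-type branch are left out, because they
need Unicode data that the Python standard library does not provide.
The function name is made up, and returning True when all checks pass
is an assumption about the intent (the quoted rule ends that branch
with "Else False").

```python
import unicodedata

ZWNJ = "\u200C"   # ZERO WIDTH NON-JOINER
CCC_VIRAMA = 9    # canonical combining class value for Virama


def zwnj_after_virama(label: str, p: int) -> bool:
    """Partial check: ZWNJ at p follows a letter plus a Virama."""
    if label[p] != ZWNJ:
        raise ValueError("character at p is not ZWNJ")
    # Must be neither first nor last in the label.
    if p == 0 or p == len(label) - 1:
        return False
    # Need two characters before the ZWNJ: letter, then Virama.
    if p < 2:
        return False
    if unicodedata.combining(label[p - 1]) != CCC_VIRAMA:
        return False
    if not unicodedata.category(label[p - 2]).startswith("L"):
        return False
    # Assumption: passing all checks means the ZWNJ is acceptable.
    return True
```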

> *11. ZERO WIDTH JOINER*
> The script must be one in which the use of this character causes
> significant visual transformation of one or both of the adjacent
> characters.
> =>
> The script must be one in which the use of this character causes
> visual transformation of one or both of the adjacent characters that
> are required for significant semantic distinctions in at least some
> cases. This includes ZWNJ after certain Virama characters, and between
> particular joining characters in cursive scripts like Arabic.
> [[anchor9a: The script list for this character is _not_ complete and,
> in particular, more Indic scripts certainly need to be listed.]]
>
> RuleSet
> If BeforeScript .eq. ( Deva | Tamil |... ) Then
>  If P = F OR P = L Then False;
>  Else if Canonical_Combining_Class(Character[P-1]) != Virama Then False;
>  Else if Not IsLetter(Character[P-2]) Then False;
>  Else if Not ScriptCount(Character[P-2] + Character[P-1]) > 1 Then False;
>  Else False;
> Else False;

Ok, I have changed this in my document to something similar, along the
same lines as ZWNJ.

> *14. MODIFIER LETTER PRIME *
>
> Add a description: also used in Cyrillic transcription, where it
> must be after a consonant.
>
> If BeforeScript .eq. Greek Then
> ...
> =>
> If IsLetter(Character[-1]) And BeforeScript = Cyrillic Then True;
> ...

Ok. Also changed.

    paf
