Context Rules - with apologies for the long email

Thu Jul 16 18:00:53 CEST 2009

Mark

On Wed, Jul 15, 2009 at 23:43, Chris Wright <chris at ausregistry.com.au> wrote:

> As this is my first post to this list, a very quick background, for setting
> the context purpose only:
>  - AusRegistry is the .au domain name registry. Our registry software is in
> use in several other TLD registries many of which are eagerly awaiting their
> ccTLD IDNs from ICANN
>  - We have been following this list for the last 12 months or so
>  - We have spent almost the last 12 months implementing IDNs (IDN 2008) in
> our registry software, and have implemented all of the current drafts with
> the exception of mappings
>  - We have especially focused on Arabic IDNs and IDN 2008 (see Arabic
> comments further down)
>  - We now have a fully working initial implementation that includes
> configurable elements to address registry policy controls such as
> generation, bundling and blocking of variants etc
>
> On Context Rules:
>
> As implementers we found that the current definitions are a little bit
> ambiguous and redundant in some places, the pseudo code can be confusing to
> follow and in most cases we returned to the descriptions to guess what was
> intended. We would be happy to take a shot at improving these, however we
> have some broader concerns with context rules that we would like to address
> first.

I agree about the former; I also think that we have too many CONTEXT* items
in the first place, and that we should reduce them to what is required
before trying to fix the pseudocode.

>
>
> Appendix A.1. HYPHEN-MINUS
>
> In the protocol document it states in 4.2.3.1:
>
> > The Unicode string MUST NOT contain "--" (two consecutive hyphens) in
> > the third and fourth character positions.
>
> This is an explicit disallow of that character in those positions for all
> labels (ALL CONTEXTS), yet this context rule restricts the use of hyphens in
> the first and last position for all labels (ALL CONTEXTS) i.e. this context
> rule appears to have no specific context. Given this, why isn't this just
> another protocol rule similar to the one above?
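To make the comparison concrete, here is a minimal Python sketch that treats all three hyphen restrictions as plain label-wide checks, with no context involved. The function name and the message strings are made up for illustration; nothing here is taken from the drafts.

```python
def hyphen_problems(label: str) -> list[str]:
    """Return the hyphen restrictions that a label breaks (illustrative only)."""
    problems = []
    # Restriction quoted from 4.2.3.1: no "--" in the third and fourth positions.
    if label[2:4] == "--":
        problems.append("'--' in third and fourth positions")
    # Restrictions from the A.1 context rule: no hyphen first or last.
    if label.startswith("-"):
        problems.append("leading hyphen")
    if label.endswith("-"):
        problems.append("trailing hyphen")
    return problems
```

All three checks look only at fixed positions in the label, which is the sense in which the A.1 rule has no specific context.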
>
> Appendix A.2 ZERO WIDTH NON-JOINER
>
> In the regular expression you reference 'Joining_Type:L' which we initially
> assumed to mean the Unicode Joining type property of the character must be L
> as described in http://www.unicode.org/Public/UNIDATA/ArabicShaping.txt,
> however upon further investigation there are no code points with that
> property value. It wasn't until we found
> http://unicode.org/review/pr-96.html which appears to clarify what was
> trying to be achieved (we think), that it became clear to us that this rule
> should actually include type D as well. Is this what was intended?

I think the intent was to match
http://www.unicode.org/reports/tr31/tr31-10.html#Layout_and_Format_Control_Characters,
table A1.

(Note that there is a correction there to fix the dual joining language.
That is, the regex was fine, but the textual explanation was in error. This
is the draft update for Unicode 5.2, due in October.)

>
>
> Additionally if you scan the Unicode code table for all code points with
> the joining type property of T,L,R and D there are many more scripts than
> the 3 listed in the context rule. There are PVALID code points in the
> following scripts:
>
> Gujarati, Lepcha, Rejang, Kharoshthi, Devanagari, Tamil, Malayalam,
> Tagalog, Syloti_Nagri, Gurmukhi, Sinhala, Lao, Hanunoo, Kayah_Li, Bengali,
> Thai, Myanmar, Ethiopic, Buhid, Cyrillic, Tibetan, Inherited, Nko, Thaana,
> Oriya, Telugu, Kannada, Limbu, Buginese, Sundanese, Arabic, Khmer,
> Mongolian, Saurashtra, Syriac, Hebrew, Tagbanwa, Balinese, Cham
>
> If the Unicode consortium sees fit to define this rule without referring to
> the scripts at all (as in the second link above) why does the protocol
> restrict this further? Most registries will not allow the combining of
> characters of different scripts together in the same label anyway, as
> recommended by ICANN and I thought in the draft documents as well (at some
> point)

In practice, while the transparent characters can be almost anything, the
preceding and following characters are limited, e.g.:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\p{Joining_Type%3DDual_Joining}\p{Joining_Type%3DRight_Joining}]

However, you have a good point; I don't think there is need to restrict the
scripts.
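As a rough illustration, the joining-type test can be written without any reference to scripts. The Python sketch below assumes the reading discussed above (L or D before, R or D after, with transparent characters skipped). It uses a tiny hand-picked joining-type table rather than the full ArabicShaping.txt data, and the names SAMPLE_JOINING_TYPE and zwnj_joining_context_ok are hypothetical. Any other branches of the rule are left out.

```python
ZWNJ = "\u200C"

# Tiny illustrative sample only; a real implementation would load the
# Joining_Type values from the Unicode data files.
SAMPLE_JOINING_TYPE = {
    "\u0628": "D",  # ARABIC LETTER BEH (dual joining)
    "\u0644": "D",  # ARABIC LETTER LAM (dual joining)
    "\u0627": "R",  # ARABIC LETTER ALEF (right joining)
    "\u064E": "T",  # ARABIC FATHA (transparent)
}

def joining_type(ch: str) -> str:
    """Look up the sample table; anything not listed is treated as non-joining."""
    return SAMPLE_JOINING_TYPE.get(ch, "U")

def zwnj_joining_context_ok(label: str, i: int) -> bool:
    """Check {L,D} T* ZWNJ T* {R,D} around the ZWNJ at index i."""
    if label[i] != ZWNJ:
        raise ValueError("index does not point at a ZWNJ")
    # Walk left over transparent characters to the nearest other character.
    j = i - 1
    while j >= 0 and joining_type(label[j]) == "T":
        j -= 1
    if j < 0 or joining_type(label[j]) not in ("L", "D"):
        return False
    # Walk right over transparent characters in the same way.
    k = i + 1
    while k < len(label) and joining_type(label[k]) == "T":
        k += 1
    return k < len(label) and joining_type(label[k]) in ("R", "D")
```

For example, zwnj_joining_context_ok("\u0628\u200C\u0627", 1) passes (D before, R after), while the same call on "a\u200Cb" fails because neither neighbour joins.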

>
>
> Given that this rule is to be applied on lookup we are concerned that the
> IDNA protocol may end up restricting scripts/languages that have not yet
> been considered.
>
> Appendix A.3 ZERO WIDTH JOINER
>
> Given it's on lookup, same argument as the last 2 paragraphs of A.2
>
> Appendix A.4. MIDDLE DOT
>
> No specific comments on this rule other than it is really policy as
> discussed below
>
> Appendix A.5 GREEK LOWER NUMERAL SIGN
>
> The script function is defined in tables-05, page 13 as:
>
> > Script(cp) return the name of the Script cp is of
>
> Given this, my reading of this rule prohibits the use of GREEK LOWER
> NUMERAL SIGN in any domain that does not contain code points with the
> 'Greek' script property. This means code points with the script property of
> 'Common' (for example) cannot be used in these labels, that would mean, for
> example, this context rule would prohibit a registrant from using a hyphen
> to separate the words of their domain, is this really what was intended?
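The reading described above can be sketched in a few lines of Python. Note that the standard unicodedata module does not expose the Script property, so script_guess below is a crude stand-in based on character names, good enough only for the handful of code points in this example; both function names are hypothetical.

```python
import unicodedata

def script_guess(ch: str) -> str:
    """Crude stand-in for Script(cp): NOT the real Unicode Script property."""
    name = unicodedata.name(ch, "")
    return "Greek" if name.startswith("GREEK ") else "Other"

def all_greek(label: str) -> bool:
    """Whole-label reading of the rule: every code point must be Greek."""
    return all(script_guess(ch) == "Greek" for ch in label)
```

Under this reading, a label made of U+0375 followed by Greek letters passes, but the same label with a HYPHEN-MINUS or an ASCII digit added fails, because those code points are not Greek.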
>
> Appendix A.6.  MODIFIER LETTER PRIME
>
> Similar problem to A.5, all characters must be script common (lower case c
> common). Additionally, this context rule is intended to be applied to labels
> containing MODIFIER LETTER PRIME, it states that all characters must be in
> the script of 'Greek' however MODIFIER LETTER PRIME is actually in script
> 'Common' thus this rule should always fail!
>
> Appendix A.7. COMBINING CYRILLIC TITLO
>
> As per A.5.
>
> Appendix A.8 HEBREW PUNCTUATION GERESH
>
> This one is hard to decipher, I could argue that (given the indentation):
>
> Assuming H is Hebrew, C is Chinese and P is the punctuation then the label
> HHHPCCCCHP would satisfy the rule
>
> The definition of LastChar and/or .eq. is ambiguous at best. I interpret
> these lines
>
> If Script(Before(cp)) .eq.  Hebrew And
>         LastChar .eq. cp Then True;
>
> to say: if there is an HPG where the script of the code point before it is
> Hebrew, and the LastChar is also an HPG, then True. I'm pretty sure that is
> not what was intended. I think the .eq. here is meant to ask whether the
> code point being tested is itself the last code point in the label, not
> whether the last code point has the same value as the code point being
> tested. It's difficult even to explain correctly. The indentation is also
> hard to follow: is it meant to mean anything, or is it just for visual
> purposes? I believe what was intended is something more like:
>
> IsLast(cp) True if code point is the last code point in the label
>
> False;
> If Script(Before(cp)) .eq. Hebrew And IsLast(cp) Then True;
> If Script(Before(cp)) .eq.  Hebrew And
>   Script(After(cp)) .eq.  Hebrew Then True;
>
> Additionally the rule does not allow this character to appear at the start
> of the label, because of the definition of the Before function, was this
> intended? Regardless this is fine, however inconsistent with A.10. and
> A.11., which explicitly states this requirement and then implements a rule
> to prevent it.
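The rewritten pseudocode above translates fairly directly into Python. This is only a sketch of the proposed reading: is_hebrew is a name-based stand-in for Script(cp) .eq. Hebrew (unicodedata has no Script property), and geresh_ok is a hypothetical name.

```python
import unicodedata

def is_hebrew(ch: str) -> bool:
    """Crude stand-in for Script(cp) .eq. Hebrew, based on the character name."""
    return unicodedata.name(ch, "").startswith("HEBREW ")

def geresh_ok(label: str, i: int) -> bool:
    """Proposed rule for the GERESH at index i, following the pseudocode."""
    # Before(cp) is undefined at the start of the label, so the rule fails there.
    if i == 0 or not is_hebrew(label[i - 1]):
        return False
    # IsLast(cp): the tested code point is itself the last one in the label.
    if i == len(label) - 1:
        return True
    return is_hebrew(label[i + 1])
```

With H as U+05D0, C as U+4E2D and P as U+05F3, the label HHHPCCCCHP is rejected under this version: the final P passes, but the first P fails because the code point after it is not Hebrew.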
>
> Appendix A.9.  HEBREW PUNCTUATION GERSHAYIM
>
> As per A.8.
>
> Appendix A.10.  IDEOGRAPHIC ITERATION MARK;
>
> See final paragraph of A.8.
>
> Appendix A.11.  VERTICAL IDEOGRAPHIC ITERATION MARK
>
> See final paragraph of A.8.
>
> Appendix A.12.  KATAKANA MIDDLE DOT
>
> Similar to the case in A.8, this rule implicitly uses Before and After to
> state that the code point cannot appear at the beginning or end of the
> label. Is this intended? Why not state it explicitly, as in A.10 and A.11?
>
> Appendix A.13.  ARABIC-INDIC DIGITS & Appendix A.14.  EXTENDED ARABIC-INDIC
> DIGITS
>
> Quoting BIDI draft document
>
> BIDI description in section 2
>
> > A label containing a character of type R, AL or AN MUST satisfy all 7
> > of the rules below.
>
> AND BIDI rule number 5 of section 2
>
> > 5.  If an EN is present, no AN may be present, and vice versa.
>
> Given the above, these context rules are not needed, as the BIDI rules
> explicitly prohibit this from happening anyway!
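The overlap is easy to check with Python's unicodedata module, which reports ARABIC-INDIC DIGITS (U+0660..U+0669) as bidi class AN and EXTENDED ARABIC-INDIC DIGITS (U+06F0..U+06F9) as EN. The sketch below covers only rule 5 as quoted above, not the other six BIDI rules, and the function name is made up.

```python
import unicodedata

def violates_bidi_rule_5(label: str) -> bool:
    """True if the label contains both an EN and an AN character."""
    classes = {unicodedata.bidirectional(ch) for ch in label}
    return "EN" in classes and "AN" in classes
```

A label mixing the two digit ranges therefore already fails rule 5, whether or not the A.13 and A.14 context rules exist.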

For essentially all of the above, I don't think we want them to be context
at all. I'll put out a separate note.

>
>
>
> ISSUES WITH CONTEXT RULES THAT ARE FOR REGISTRATION ONLY
>
> Firstly I'd like to support Shawn Steele in his view that these should be
> described more like:
>
> > Maybe instead of "Lookup: true/false" it could be something like
> > "Context: Registration Only" or "Context: Registration and Lookup".
>
> We assume that the audience for these rules is human, and this wording
> makes things much clearer.
>
> However, I'd really like to ask what the purpose of registration-only
> context rules is. Given that these rules won't stop domains being looked
> up, as soon as a registry disagrees with one of the registration-only rules
> it will simply stop applying it! For example, take the context rule
> concerning 'modifier letter prime' discussed above, which excludes a hyphen
> or digits: why would I as a registry operator want to apply it as written?
> If I allow hyphens to be used in those names, the domain names will still
> work when looked up. This leads me to a greater discussion about how
> registration-only context rules are really just policy decisions!
>
> I acknowledge and welcome the effort put in by all, and I think they are
> important aspects of IDNs that should be documented and circulated, however,
> I feel they are only implemented by registries and thus should only be
> recommendations for best common practices for registries, not enforced parts
> of the standard/protocol/<insert the right word here>. The document should
> be educating the registries about the potential issues with certain
> characters that they may allow, but that should be it. Not all languages have
> been represented during the development of these rules. What if some
> language that has not been considered yet needs to use this 'modifier
> letter prime' (or the focus of some other context rule) and we have now
> FORCED, through the protocol, that it cannot, when really it should just be a
> recommendation for the zone administrator (registry) about a business rule
> they should apply to registrations? I believe there will potentially be many
> more 'context' style rules that will need to be developed as we move forward,
> and these will not become part of the standards documents; they will just be
> rules that are implemented by registries.
>
> As an example, if we look at the 'Yiddish' rules used by the .SE registry
> (documented here http://iana.org/domains/idn-tables/tables/se_yi_1.0.html)
> they state that certain combining marks are only allowed to be used with
> certain base characters (i.e. in a given context), these are also 'context'
> rules, but not things that should be part of the standard. As long as the
> code points are PVALID (and if this means some code points become PVALID by
> exception, then so be it), leave it to the registries to define the
> contexts in which they actually can and can't be used.
>
> A BCP document that points out the issue each context rule is trying to
> address, shows some examples, and then recommends that zone
> administrators produce policy requirements to prohibit certain registrations
> using context style rules should be sufficient, and is the most flexible
> approach moving forward. No zone administrator in their right mind would do
> anything to risk their namespace being branded as unsafe etc. I also believe
> that lower levels of DNS hierarchy will never apply 'registration only'
> context rules!

I agree; I think unless the BIDI and CONTEXT rules are required in the
lookup procedure, they might as well just be guidelines for registrars.

>
>
> Thanks
>
> Chris Wright
> Chief Technology Officer
> AusRegistry Pty Ltd
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>