Context Rules - with apologies for the long email

Chris Wright chris at ausregistry.com.au
Fri Jul 17 02:08:20 CEST 2009


Shawn,

I definitely agree with you: context rules should not be applied on lookup, and I hope no one read my post as saying otherwise. What I am saying is that I don’t think the rules should exist (as rules) at all; and if they must stay, I have a whole bunch of clarifying questions which hopefully someone can answer.

Of great importance is the script restriction in the joiner rule that IS applied on lookup.

c.

From: Shawn Steele [mailto:Shawn.Steele at microsoft.com]
Sent: Friday, 17 July 2009 7:35 AM
To: Mark Davis ⌛; Chris Wright
Cc: idna-update at alvestrand.no
Subject: RE: Context Rules - with apologies for the long email

Thanks Chris for providing input, more perspectives are always good ☺

I prefer the lookup rules to be more relaxed than the registration rules.  If the name isn’t registered, then it won’t work, so I don’t think the rules should be required on lookup just to force registry best practices.

These rules are also more difficult to update than data (if they might require code changes), particularly when multiple clients/versions that need lookup are taken into account.  Servicing & testing those kinds of changes in, say, XP, Vista, Server 2003, Win7, etc., can be very expensive, especially if we then decide a new rule needs to be added (maybe for new code points), or a rule needs to be tweaked for some edge case.

-Shawn

From: idna-update-bounces at alvestrand.no [mailto:idna-update-bounces at alvestrand.no] On Behalf Of Mark Davis ⌛
Sent: Thursday, July 16,  2009 9:01
To: Chris Wright
Cc: idna-update at alvestrand.no
Subject: Re: Context Rules - with apologies for the long email


Mark
On Wed, Jul 15, 2009 at 23:43, Chris Wright <chris at ausregistry.com.au> wrote:
As this is my first post to this list, a very quick background, for setting the context purpose only:
 - AusRegistry is the .au domain name registry. Our registry software is in use in several other TLD registries many of which are eagerly awaiting their ccTLD IDNs from ICANN
 - We have been following this list for the last 12 months or so
 - We have spent almost the last 12 months implementing IDNs (IDN 2008) in our registry software, and have implemented all of the current drafts with the exception of mappings
 - We have focused especially on Arabic IDNs and IDN 2008 (see Arabic comments further down)
 - We now have a fully working initial implementation that includes configurable elements to address registry policy controls such as generation, bundling and blocking of variants, etc.

On Context Rules:

As implementers we found that the current definitions are a little ambiguous and redundant in some places. The pseudocode can be confusing to follow, and in most cases we returned to the descriptions to guess what was intended. We would be happy to take a shot at improving these; however, we have some broader concerns with context rules that we would like to address first.

I agree about the former; I also think that we have too many CONTEXT* items in the first place, and that we should reduce them to what is required before trying to fix the pseudocode.




Appendix A.1. HYPHEN-MINUS

In the protocol document it states in 4.2.3.1:

> The Unicode string MUST NOT contain "--" (two consecutive hyphens) in
> the third and fourth character positions.

This is an explicit disallow of that character in those positions for all labels (ALL CONTEXTS), yet this context rule restricts the use of hyphens in the first and last position for all labels (ALL CONTEXTS) i.e. this context rule appears to have no specific context. Given this, why isn't this just another protocol rule similar to the one above?
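
To make the comparison concrete, here is a minimal Python sketch of the two checks side by side. The function name `hyphen_problems` and the message strings are illustrative assumptions, not taken from any draft; positions are counted in characters of the Unicode label.

```python
def hyphen_problems(label):
    """Return the hyphen-related problems found in a Unicode label."""
    problems = []
    # Protocol rule (4.2.3.1): no "--" in the third and fourth positions.
    if label[2:4] == "--":
        problems.append("'--' in third and fourth positions")
    # Context rule A.1 as described above: no hyphen first or last.
    if label.startswith("-") or label.endswith("-"):
        problems.append("hyphen in first or last position")
    return problems


print(hyphen_problems("ab--cd"))
print(hyphen_problems("-abc"))
print(hyphen_problems("a-b"))
```

Neither check looks at the neighbouring characters or their scripts; both depend only on position within the label, which is why the two look like the same kind of rule.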

Appendix A.2 ZERO WIDTH NON-JOINER

In the regular expression you reference 'Joining_Type:L', which we initially assumed to mean that the Unicode Joining_Type property of the character must be L, as described in http://www.unicode.org/Public/UNIDATA/ArabicShaping.txt. However, upon further investigation there are no code points with that property value. It wasn't until we found http://unicode.org/review/pr-96.html, which appears to clarify what the rule was trying to achieve (we think), that it became clear to us that this rule should actually include type D as well. Is this what was intended?

I think the intent was to match http://www.unicode.org/reports/tr31/tr31-10.html#Layout_and_Format_Control_Characters, table A1.

(Note that there is a correction there to fix the dual joining language. That is, the regex was fine, but the textual explanation was in error. This is the draft update for Unicode 5.2, due in October.)
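
For readers following along, the regular-expression part of the ZWNJ rule, read with both L and D allowed before the joiner, might be sketched in Python like this. The joining-type table is a tiny hand-picked subset for illustration only (a real implementation would derive it from ArabicShaping.txt and the general categories), the Virama branch of the rule is left out, and the names `JOINING_TYPE`, `joining_type` and `zwnj_regex_ok` are hypothetical.

```python
ZWNJ = "\u200c"

# Illustrative subset only: BEH and LAM are Dual_Joining, ALEF is
# Right_Joining, FATHA (a combining mark) is Transparent.
JOINING_TYPE = {"\u0628": "D", "\u0644": "D", "\u0627": "R", "\u064e": "T"}


def joining_type(ch):
    # Anything not in the sample table is treated as non-joining (U).
    return JOINING_TYPE.get(ch, "U")


def zwnj_regex_ok(label, i):
    """Check {L,D} T* ZWNJ T* {R,D} around the ZWNJ at index i."""
    if label[i] != ZWNJ:
        raise ValueError("index does not point at a ZWNJ")
    j = i - 1
    while j >= 0 and joining_type(label[j]) == "T":
        j -= 1  # skip transparent characters before the ZWNJ
    if j < 0 or joining_type(label[j]) not in ("L", "D"):
        return False
    k = i + 1
    while k < len(label) and joining_type(label[k]) == "T":
        k += 1  # skip transparent characters after the ZWNJ
    return k < len(label) and joining_type(label[k]) in ("R", "D")


print(zwnj_regex_ok("\u0628\u200c\u0627", 1))  # BEH ZWNJ ALEF
print(zwnj_regex_ok("\u0627\u200c\u0628", 1))  # ALEF ZWNJ BEH
```

Note that nothing in this check refers to the script of the characters; only the joining types matter.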



Additionally if you scan the Unicode code table for all code points with the joining type property of T,L,R and D there are many more scripts than the 3 listed in the context rule. There are PVALID code points in the following scripts:

Gujarati, Lepcha, Rejang, Kharoshthi, Devanagari, Tamil, Malayalam, Tagalog, Syloti_Nagri, Gurmukhi, Sinhala, Lao, Hanunoo, Kayah_Li, Bengali, Thai, Myanmar, Ethiopic, Buhid, Cyrillic, Tibetan, Inherited, Nko, Thaana, Oriya, Telugu, Kannada, Limbu, Buginese, Sundanese, Arabic, Khmer, Mongolian, Saurashtra, Syriac, Hebrew, Tagbanwa, Balinese, Cham

If the Unicode Consortium sees fit to define this rule without referring to the scripts at all (as in the second link above), why does the protocol restrict this further? Most registries will not allow characters of different scripts to be combined in the same label anyway, as recommended by ICANN and, I thought, in the draft documents as well (at some point).

In practice, while the transparent characters can be almost anything, the preceding and following characters are limited, e.g.

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\p{Joining_Type%3DDual_Joining}\p{Joining_Type%3DRight_Joining}]

However, you have a good point; I don't think there is need to restrict the scripts.



Given that this rule is to be applied on lookup we are concerned that the IDNA protocol may end up restricting scripts/languages that have not yet been considered.

Appendix A.3 ZERO WIDTH JOINER

Given it's applied on lookup, the same argument as in the last 2 paragraphs of A.2 applies.

Appendix A.4. MIDDLE DOT

No specific comments on this rule other than it is really policy as discussed below

Appendix A.5 GREEK LOWER NUMERAL SIGN

The script function is defined in tables-05, page 13 as:

> Script(cp) return the name of the Script cp is of

Given this, my reading of this rule prohibits the use of GREEK LOWER NUMERAL SIGN in any domain that does not contain code points with the 'Greek' script property. This means code points with the script property of 'Common' (for example) cannot be used in these labels. That would mean, for example, that this context rule prohibits a registrant from using a hyphen to separate the words of their domain. Is this really what was intended?

Appendix A.6.  MODIFIER LETTER PRIME

Similar problem to A.5, all characters must be script common (lower case c common). Additionally, this context rule is intended to be applied to labels containing MODIFIER LETTER PRIME. It states that all characters must be in the script 'Greek'; however, MODIFIER LETTER PRIME is itself in script 'Common', thus this rule should always fail!
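
The readings in A.5 and A.6 can be checked with a small Python sketch. It assumes the rule means "every code point in the label must have script Greek", which is the reading described above rather than confirmed intent. Python's standard library has no Script property, so `script` below is a crude stand-in that looks at the character name; `script` and `all_greek` are illustrative names only.

```python
import unicodedata

KERAIA = "\u0375"  # GREEK LOWER NUMERAL SIGN
PRIME = "\u02b9"   # MODIFIER LETTER PRIME


def script(ch):
    # Assumption: approximate Script(cp) by the Unicode character name.
    # Good enough for these examples, not a real Script lookup.
    name = unicodedata.name(ch, "")
    return "Greek" if name.startswith("GREEK ") else "Other"


def all_greek(label):
    """True if every code point in the label is (approximately) Greek."""
    return all(script(ch) == "Greek" for ch in label)


print(all_greek("\u03b1" + KERAIA + "\u03b2"))        # alpha, keraia, beta
print(all_greek("\u03b1" + KERAIA + "-" + "\u03b2"))  # same, with a hyphen
print(all_greek("\u03b1" + PRIME))                    # alpha, modifier prime
```

Under this reading the hyphenated label is rejected because HYPHEN-MINUS is not Greek, and any label containing MODIFIER LETTER PRIME is rejected because the character being tested is itself not Greek.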

Appendix A.7. COMBINING CYRILLIC TITLO

As per A.5.

Appendix A.8 HEBREW PUNCTUATION GERESH

This one is hard to decipher, I could argue that (given the indentation):

Assuming H is Hebrew, C is Chinese and P is the punctuation then the label HHHPCCCCHP would satisfy the rule

The definition of LastChar and/or .eq. is ambiguous at best. I interpret these lines

If Script(Before(cp)) .eq.  Hebrew And
        LastChar .eq. cp Then True;

to say: if there is an HPG where the script of the code point before it is Hebrew, and the LastChar is also an HPG, then True. But I'm pretty sure that is not what was intended. I think the .eq. in this case is meant to ask 'is the LastChar the code point being tested?', not 'is the last code point equal to (i.e. the same code point value as) the code point being tested?'. It's even difficult to explain correctly. Also the indentation is hard to follow: is it actually meant to mean anything, or is it just for visual purposes? I believe what was intended is something more like:

IsLast(cp) True if code point is the last code point in the label

False;
If Script(Before(cp)) .eq. Hebrew And IsLast(cp) Then True;
If Script(Before(cp)) .eq.  Hebrew And
  Script(After(cp)) .eq.  Hebrew Then True;
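
The difference between the two readings can be sketched in Python. This is only an illustration under stated assumptions: `is_hebrew` approximates the Script test by looking at the character name, `literal_reading` implements just the two quoted lines with `.eq.` taken as a comparison of code point values, and `proposed_reading` follows the IsLast version above. All function names are hypothetical.

```python
import unicodedata

GERESH = "\u05f3"  # HEBREW PUNCTUATION GERESH


def is_hebrew(ch):
    # Crude stand-in for Script(cp) .eq. Hebrew, based on the name.
    return unicodedata.name(ch, "").startswith("HEBREW ")


def literal_reading(label, i):
    # "LastChar .eq. cp": the label's last character has the same
    # code point value as the character being tested.
    return i > 0 and is_hebrew(label[i - 1]) and label[-1] == label[i]


def proposed_reading(label, i):
    # IsLast(cp): the character being tested is itself the last one.
    if i == 0 or not is_hebrew(label[i - 1]):
        return False
    if i == len(label) - 1:
        return True
    return is_hebrew(label[i + 1])


H, C, P = "\u05d0", "\u4e2d", GERESH
label = H * 3 + P + C * 4 + H + P  # the HHHPCCCCHP example
positions = [i for i, ch in enumerate(label) if ch == P]
print([literal_reading(label, i) for i in positions])   # [True, True]
print([proposed_reading(label, i) for i in positions])  # [False, True]
```

With the literal reading both occurrences pass, so the mixed Hebrew/Chinese label is accepted; with the proposed reading the first occurrence fails because it is followed by a non-Hebrew character.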

Additionally, because of the definition of the Before function, the rule does not allow this character to appear at the start of the label. Was this intended? Regardless, this is fine, but it is inconsistent with A.10 and A.11, which explicitly state this requirement and then implement a rule to enforce it.

Appendix A.9.  HEBREW PUNCTUATION GERSHAYIM

As per A.8.

Appendix A.10.  IDEOGRAPHIC ITERATION MARK

See final paragraph of A.8.

Appendix A.11.  VERTICAL IDEOGRAPHIC ITERATION MARK

See final paragraph of A.8.

Appendix A.12.  KATAKANA MIDDLE DOT

Similar to the case in A.8, this rule implicitly uses Before and After to state that the code point cannot appear at the beginning or end of the label. Is this intended? Why not state it explicitly, as in A.10 and A.11?

Appendix A.13.  ARABIC-INDIC DIGITS & Appendix A.14.  EXTENDED ARABIC-INDIC DIGITS

Quoting BIDI draft document

BIDI description in section 2

> A label containing a character of type R, AL or AN MUST satisfy all 7
> of the rules below.

AND BIDI rule number 5 of section 2

> 5.  If an EN is present, no AN may be present, and vice versa.

Given the above, these context rules are not needed, as the BIDI rules explicitly prohibit this from happening anyway!
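
A quick way to see the overlap, as a hedged sketch: Python's `unicodedata.bidirectional` reports ARABIC-INDIC DIGITS as AN and EXTENDED ARABIC-INDIC DIGITS as EN, so a label mixing the two already trips BIDI rule 5. The function name `violates_bidi_rule_5` is an assumption, and the sketch checks rule 5 only, not the other six rules.

```python
import unicodedata


def violates_bidi_rule_5(label):
    """True if the label contains both an EN and an AN character."""
    classes = {unicodedata.bidirectional(ch) for ch in label}
    return "EN" in classes and "AN" in classes


arabic_indic_one = "\u0661"    # ARABIC-INDIC DIGIT ONE, bidi class AN
extended_one = "\u06f1"        # EXTENDED ARABIC-INDIC DIGIT ONE, class EN

print(violates_bidi_rule_5(arabic_indic_one + extended_one))
print(violates_bidi_rule_5(arabic_indic_one + "\u0662"))
print(violates_bidi_rule_5(arabic_indic_one + "1"))
```

Because any label containing an AN character is subject to the BIDI rules, the first and third labels are rejected without any separate context rule for the digits.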

For essentially all of the above, I don't think we want them to be context at all. I'll put out a separate note.




ISSUES WITH CONTEXT RULES THAT ARE FOR REGISTRATION ONLY

Firstly I'd like to support Shawn Steele in his view that these should be described more like:

> Maybe instead of "Lookup: true/false" it could be something like "Context: Registration Only" or "Context: Registration and Lookup".

We assume that the audience for these rules is human, and this wording makes things much clearer.
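
As a rough illustration of that labelling, an implementation could carry the context as data rather than as a true/false flag. This is a sketch only: the `Context` enum, the `RULES` table and `rules_for_lookup` are hypothetical names, and the table lists just the three rules whose status is mentioned in this message.

```python
from enum import Enum


class Context(Enum):
    REGISTRATION_ONLY = "Registration Only"
    REGISTRATION_AND_LOOKUP = "Registration and Lookup"


# Partial table for illustration: the joiner rules are applied on lookup,
# the MODIFIER LETTER PRIME rule is registration only.
RULES = {
    "ZERO WIDTH NON-JOINER": Context.REGISTRATION_AND_LOOKUP,
    "ZERO WIDTH JOINER": Context.REGISTRATION_AND_LOOKUP,
    "MODIFIER LETTER PRIME": Context.REGISTRATION_ONLY,
}


def rules_for_lookup():
    """Names of the rules a lookup client would need to implement."""
    return sorted(
        name for name, ctx in RULES.items()
        if ctx is Context.REGISTRATION_AND_LOOKUP
    )


print(rules_for_lookup())
```

A lookup client built this way only ever needs the rules returned by `rules_for_lookup`, which keeps the set that requires code changes on the client side small.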

However, I'd really like to ask what the purpose of registration-only context rules is. Given that these rules won't stop domains being looked up, as soon as a registry disagrees with one of the registration-only rules, they will simply stop applying it! For example, take the context rule concerning 'modifier letter prime' discussed above, which excludes a hyphen or digits: why would I as a registry operator want to apply it as written? If I allow hyphens to be used in those names, the domain names will still work when looked up. This leads me to a greater discussion about how registration-only context rules are really just policy decisions!

I acknowledge and welcome the effort put in by all, and I think these are important aspects of IDNs that should be documented and circulated. However, I feel they are only implemented by registries and thus should only be recommendations for best common practices for registries, not enforced parts of the standard/protocol/<insert the right word here>. The document should be educating the registries about the potential issues with certain characters that they may allow, but that should be it. Not all languages have been represented during the development of these rules. What if some language that has not been considered yet needs to use this 'modifier letter prime' (or the subject of some other context rule), and we have now FORCED through the protocol that it can't, when really it should just be a recommendation to the zone administrator (registry) about a business rule they should apply to registrations? I believe there will potentially be many more 'context' style rules that will need to be developed as we move forward, and these will not become part of the standards documents, just rules that are implemented by registries.

As an example, if we look at the 'Yiddish' rules used by the .SE registry (documented here: http://iana.org/domains/idn-tables/tables/se_yi_1.0.html), they state that certain combining marks are only allowed to be used with certain base characters (i.e. in a given context). These are also 'context' rules, but not things that should be part of the standard. As long as the code points are PVALID (and if this means some code points become PVALID by exception then so be it), leave it to the registries to define the contexts in which they actually can and can't be used.

A BCP document that points out the issues each context rule is trying to address, shows some examples, and then recommends that zone administrators produce policy requirements to prohibit certain registrations using context-style rules should be sufficient, and is the most flexible approach moving forward. No zone administrator in their right mind would do anything to risk their namespace being branded as unsafe. I also believe that lower levels of the DNS hierarchy will never apply 'registration only' context rules!

I agree; I think unless the BIDI and CONTEXT rules are required in the lookup procedure, they might as well just be guidelines for registrars.



Thanks

Chris Wright
Chief Technology Officer
AusRegistry Pty Ltd

