Draft on IDN Tables in XML

Andrew Sullivan ajs at anvilwalrusden.com
Wed Mar 7 17:35:09 CET 2012


On Wed, Mar 07, 2012 at 12:47:36PM +1100, James Mitchell wrote:
> 2) The set of code points (or sequences of code points) that are considered equivalent by the registry

Could we please not use "equivalent" for this?  That word has caused
enough trouble already.

> The table should not attempt to place rules on the use of code
> points within a label as these rules are often non-trivial. One can
> easily tell whether a name is registered by performing a DNS lookup
> or a WHOIS query for the name. Alternatively a registrar will be
> able to notify a potential registrant should a name be considered
> "invalid".

Without rules on the use of code points within a label, the table
needs to be supplemented by something else in order to create a
complete policy.  Where would you want to put those rules?

> Further to the above the table should not attempt to define those
> variants that are activated/allowed/blocked. An active variant can
> be determined from a query to the DNS or WHOIS and these protocols
> will have to be used, considering a variant may have been activated
> post-registration. Additionally the rules for determining whether a
> variant can be activated are non-trivial. Consider the example
> below.

See above.  Of course they're not trivial.  But they need to be
expressed somewhere so that one can unambiguously determine whether a
string is a candidate to be a U-label in a zone.  If you can't
determine that, then the policy doesn't actually cover everything it's
supposed to.
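
To illustrate what I mean by "complete", here is a minimal sketch in
Python.  Everything in it is an assumption made up for illustration:
the repertoire (Russian lowercase letters, digits, hyphen), the single
label-level rule, and the names REPERTOIRE, LABEL_RULES and
is_candidate.  It is not a proposal for the format.  The point is only
that the per-code-point table and the label-level rules together give
a yes/no answer, and the table alone does not.

```python
# Hypothetical repertoire: Cyrillic a..ya (U+0430..U+044F), IO (U+0451),
# ASCII digits, and hyphen-minus. A real registry table would differ.
REPERTOIRE = (
    set(range(0x0430, 0x0450)) | {0x0451} | set(range(0x30, 0x3A)) | {0x2D}
)


def no_edge_hyphen(label):
    """Label-level rule: a label may not start or end with a hyphen."""
    return not (label.startswith("-") or label.endswith("-"))


# Rules that look at the whole label rather than at single code points.
# Only one is shown; a real policy would likely need more.
LABEL_RULES = [no_edge_hyphen]


def is_candidate(label):
    """Return True if the label passes the repertoire and every label rule."""
    if not label:
        return False
    if any(ord(ch) not in REPERTOIRE for ch in label):
        return False
    return all(rule(label) for rule in LABEL_RULES)
```

With only REPERTOIRE, a label such as "-\u0451\u043b\u043a\u0430"
would look acceptable; it is the label-level rule that rejects it.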

> To avoid the somewhat common mistake of incorrectly defining equivalence I suggest that equivalent sequences of code points are defined in one place. For example
> <char cp="0627">
>      <var cp="0625"/>
> </char>
> <char cp="0625">
> 	<!-- whoops, forgot to identify 0627 as an equivalent character -->
> </char>
> should be expressed as
> <equivalent>
> 	<char cp="0625"/>
> 	<char cp="0627"/>
> </equivalent>

This won't work for cases where the alternation goes only one way.
In Russian, for instance, IE (U+0435) can be used (casually, though
not formally) where IO (U+0451) is used; but it is never the case that
IO can be used where IE is used.  Trivially, in French it is sometimes
the case that one substitutes undecorated characters for decorated
characters; but you don't do things the opposite way.  Now, of course,
you might just say, "Do them all in order to avoid confusion."  But
this is an example where at least some registries have talked about
wanting to ensure blocking for some cases and withholding for others;
and in that case, "equivalence" is certainly the wrong concept and
symmetry is not what you get.
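
A small sketch of the difference, again in Python and again with
made-up names (DIRECTED_VARIANTS, variant_labels, symmetric_closure
are all hypothetical): a directed mapping says "IO may be replaced by
IE" without implying the reverse, whereas an <equivalent> grouping
amounts to taking the symmetric closure of that mapping.

```python
from itertools import product

# Directed mapping: the key may be replaced by any of its values,
# but not the other way round. Here: IO (U+0451) -> IE (U+0435).
DIRECTED_VARIANTS = {"\u0451": {"\u0435"}}


def variant_labels(label, variants=DIRECTED_VARIANTS):
    """Return every label reachable by substituting allowed variants."""
    choices = [sorted({ch} | variants.get(ch, set())) for ch in label]
    return {"".join(combo) for combo in product(*choices)} - {label}


def symmetric_closure(variants):
    """What an 'equivalence' grouping implies: every edge in both directions."""
    closed = {src: set(dsts) for src, dsts in variants.items()}
    for src, dsts in variants.items():
        for dst in dsts:
            closed.setdefault(dst, set()).add(src)
    return closed
```

Under the directed mapping, a label spelled with IO has the IE
spelling as a variant, but a label spelled with IE has no variants at
all.  Pass symmetric_closure(DIRECTED_VARIANTS) instead and the IE
label suddenly acquires the IO spelling as a variant, which is exactly
the symmetry a registry might not want.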



Andrew Sullivan
ajs at anvilwalrusden.com
