Draft on IDN Tables in XML

Dillon, Chris c.dillon at ucl.ac.uk
Mon Mar 5 13:22:42 CET 2012

Dear colleagues,

I have been reading this correspondence with Chinese in mind and would like to raise some questions.

This is a case where there are two major forms of a script: Simplified Chinese (used, for example, in mainland China and Singapore) and Traditional Chinese (used elsewhere). Many characters are the same everywhere, but a subset of characters have been abbreviated in Simplified Chinese.

There is an additional complication: a small number of characters are used only in particular places, e.g. in Hong Kong (for Cantonese) or in Singapore. What would be the best way to include those?

In the RFC 3743-style tables at http://www.iana.org/domains/idn-tables/, Simplified Chinese Preferred Variants and Traditional Chinese Preferred Variants typically have their own columns.

http://tools.ietf.org/html/rfc5646 gives the following example tags for Chinese; which should be standard for Chinese in this XML-based system?

Language subtag plus Script subtag:

      zh-Hant (Chinese written using the Traditional Chinese script)

      zh-Hans (Chinese written using the Simplified Chinese script)

Extended language subtags and their primary language subtag counterparts:

      zh-cmn-Hans-CN (Chinese, Mandarin, Simplified script, as used in China)

      cmn-Hans-CN (Mandarin Chinese, Simplified script, as used in China)

      zh-yue-HK (Chinese, Cantonese, as used in Hong Kong SAR)

      yue-HK (Cantonese Chinese, as used in Hong Kong SAR)


Language-Script-Region:

      zh-Hans-CN (Chinese written using the Simplified script as used in mainland China)
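
To make the structure of the tags above easier to compare, here is a minimal Python sketch that splits a tag into its language, extended language, script and region subtags. It relies only on the subtag shapes described in RFC 5646 (three letters for an extended language, four letters for a script, two letters or three digits for a region). The function name `split_tag` is hypothetical, and variants, extensions and private-use subtags are deliberately not handled.

```python
def split_tag(tag):
    """Split a simple language tag into its main subtags (illustrative only)."""
    subtags = tag.split("-")
    parts = {
        "language": subtags[0].lower(),
        "extlang": None,
        "script": None,
        "region": None,
    }
    for sub in subtags[1:]:
        nothing_later = parts["script"] is None and parts["region"] is None
        if (len(sub) == 3 and sub.isalpha()
                and parts["extlang"] is None and nothing_later):
            parts["extlang"] = sub.lower()      # e.g. cmn, yue
        elif len(sub) == 4 and sub.isalpha() and nothing_later:
            parts["script"] = sub.title()       # e.g. Hans, Hant
        elif parts["region"] is None and (
                (len(sub) == 2 and sub.isalpha())
                or (len(sub) == 3 and sub.isdigit())):
            parts["region"] = sub.upper()       # e.g. CN, HK
        else:
            break  # anything else is outside the scope of this sketch
    return parts


for example in ("zh-Hant", "zh-cmn-Hans-CN", "yue-HK"):
    print(example, split_tag(example))
```

Seen this way, zh-cmn-Hans-CN and cmn-Hans-CN differ only in whether Mandarin is carried as an extended language subtag or as the primary language subtag.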

A problem that many tables share is that they show only Unicode code point numbers and no characters, so people working with the tables often need to convert code points into characters or characters into code points. Is there any way that the XML could contain both? (I think there are Unicode fonts containing nearly all the characters.)
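
Until a table format carries both forms, the conversion itself is easy to script. The following is a minimal Python sketch, assuming code points are written in the usual "U+XXXX" notation; the function names `to_char`, `to_code_point` and `annotate` are hypothetical and are not taken from any IDN table tooling.

```python
def to_char(code_point):
    """Turn a notation such as 'U+4E2D' or '4e2d' into the character itself."""
    text = code_point.strip().upper()
    if text.startswith("U+"):
        text = text[2:]
    return chr(int(text, 16))


def to_code_point(char):
    """Turn a single character into 'U+XXXX' notation (at least four hex digits)."""
    return "U+{:04X}".format(ord(char))


def annotate(code_point):
    """Show the code point and the character side by side."""
    char = to_char(code_point)
    return "{} ({})".format(to_code_point(char), char)


for cp in ("U+4E2D", "U+56FD", "U+570B"):
    print(annotate(cp))
```

Whether the characters display correctly still depends on the fonts available to the reader.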

I would be grateful for the answers to any or all of these questions.


Chris Dillon.
Research Associate in Linguistic Computing, Dept of Information Studies, UCL, Gower St, London WC1E 6BT Tel +44 20 7679 1599 (int 31599) ucl.ac.uk/dis/people/chrisdillon

-----Original Message-----
From: vip-bounces at icann.org [mailto:vip-bounces at icann.org] On Behalf Of Kim Davies
Sent: 01 March 2012 19:15
To: vip at icann.org; idna-update at alvestrand.no
Subject: [vip] Draft on IDN Tables in XML


I have posted a first draft regarding a format that could be used for representing IDN Tables in XML to the I-D Repository:


After discussion with a number of folks that felt this would be good work to undertake, I've put together a first cut which is not comprehensive, but I think goes some way toward a potential format.

Unless there is interest in this being a more formal activity, my intention is to publish the final result independently as an Informational RFC. However, the mechanism of publication is secondary to coming up with something useful that would benefit TLD registries and other implementors. A list of design goals, from the document, is as follows:

	* MUST be in a format that can be implemented in a reasonably straightforward manner in software;
	* The format SHOULD be able to be checked for formatting errors, such that common mistakes can be caught;
	* An IDN Table MUST be able to express the set of valid code points that are allowed for registration under a specific zone administrator's policies;
	* MUST be able to express computed alternatives to a given domain name based on a one-to-one, or one-to-many relationship. These computed alternatives are commonly known as "IDN variants";
	* IDN Variants SHOULD be able to be tagged with specific categories, such that the categories can be used to support registry policy (such as whether to list the computed variant in the zone, or to merely block it from registration);
	* IDN Variants MUST be able to be stipulated based on contextual information. For example, specific variants may only be applicable when they follow another specific code point, or when the code point is displayed in a specific presentation form;
	* The data contained within the table MUST be unambiguous, such that independent implementations that utilise the contents will arrive at the same results;
	* IDN Tables SHOULD be suitable for comparison and re-use, such that one could easily compare the contents of two or more to see the differences, to merge them, and so on;
	* As many existing IDN Tables as practicable SHOULD be able to be migrated to the new format with all applicable logic retained.
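
As a rough illustration of the code point and variant goals in the list above, the following Python sketch checks a label against a set of permitted code points and computes its variants from a one-to-many mapping. It is not the XML format proposed in the draft: the `TABLE` structure, its three entries and the function names are hypothetical, and variant categories and contextual rules are left out.

```python
from itertools import product

# Hypothetical table: each permitted code point maps to its variant code points.
TABLE = {
    0x4E2D: [],          # U+4E2D, no variants in this toy table
    0x56FD: [0x570B],    # U+56FD (Simplified) -> U+570B (Traditional)
    0x570B: [0x56FD],    # U+570B (Traditional) -> U+56FD (Simplified)
}


def is_valid(label):
    """A label is valid if every code point in it is listed in the table."""
    return all(ord(ch) in TABLE for ch in label)


def variants(label):
    """Return every computed variant of a valid label, excluding the label itself."""
    if not is_valid(label):
        raise ValueError("label contains a code point outside the table")
    # For each position, the choices are the original code point plus its variants.
    choices = [[ord(ch)] + TABLE[ord(ch)] for ch in label]
    computed = {"".join(chr(cp) for cp in combo) for combo in product(*choices)}
    computed.discard(label)
    return sorted(computed)


print(variants("\u4e2d\u56fd"))
```

Because every position contributes its own alternatives, the number of computed variants grows multiplicatively with label length, which is one reason a registry might tag some variants as blocked rather than list them in the zone.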

It is explicitly NOT the goal of this format to:

	* Stipulate what code points should be listed in an IDN Table by a zone administrator. What registration policies are used for a particular zone is outside the scope of this memo.
	* Stipulate what a consumer of an IDN Table must do when they determine a particular domain is valid or invalid, or arrive at a set of computed IDN variants. IDN Tables are only used to describe rules for computing code points, but do not prescribe how registries and other parties utilise them.

I'd appreciate any feedback.


