Draft on IDN Tables in XML

Kim Davies kim.davies at icann.org
Wed Mar 7 01:56:23 CET 2012


Hi Chris,

On Mar 5, 2012, at 4:22 AM, Dillon, Chris wrote:

In the RFC3743-style tables at http://www.iana.org/domains/idn-tables/ typically Simplified Chinese Preferred Variants and Traditional Chinese Preferred Variants have their own columns.

http://tools.ietf.org/html/rfc5646 gives the following example tags for Chinese; which should be standard for Chinese in this XML-based system?

I would assume simply "zh" would be sufficient. It is not a requirement to stipulate the script in a language tag. Also, the entire tag is discretionary: if, for example, you created a fictitious table that had no bearing on any specific language or script, you would not be required to specify one.
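
To illustrate that the script part is optional, here is a rough sketch (not part of the draft or of any existing tool) that pulls the language and the optional script out of a tag. The function name split_tag is made up for this example, and it only looks at the language and script subtags, ignoring everything else RFC 5646 allows:

```python
def split_tag(tag):
    """Return (language, script) for a language tag; script is None if absent.

    Simplified sketch: only the primary language subtag and an optional
    four-letter script subtag are recognised. Region, variant, extension
    and private-use subtags are not validated or returned.
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    script = None
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            # A singleton starts an extension or private-use section; stop here.
            break
        if len(subtag) == 4 and subtag.isalpha():
            # Four letters is the shape of a script subtag (e.g. Hans, Hant).
            script = subtag.title()
            break
    return language, script


if __name__ == "__main__":
    for tag in ("zh", "zh-Hans", "zh-Hant-TW"):
        print(tag, split_tag(tag))
```

With this, "zh" comes back with no script at all, while "zh-Hans" and "zh-Hant-TW" come back with Hans and Hant respectively.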

A problem that many tables share is that one sees only Unicode numbers, no characters, and so when humans work with the tables, they often need to turn Unicode codes into characters or characters into Unicode codes. Is there any way that the XML could contain both (I think there are Unicode fonts containing nearly all the characters)?

Creating a tool that takes the code points and turns them into something readable should be a trivial exercise, precisely because of the standardised format. I think it would be best to avoid superfluous descriptions of the individual code points in the spec itself, and I would rather encourage tools that present the XML file in a readable form (as a web page, etc.).

For example, I can print human-readable representations from the XML table as follows very simply:

kim@gumleaf:idntables[master*]$ python
Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import idntables, unicodedata
>>> table = idntables.load("samples/nz_Latn_1.0.xml")
>>> for char in sorted(table._codepoints):
...     print "%s [U+%04X] %s" % (unichr(char), char, unicodedata.name(unichr(char)))
...
0 [U+0030] DIGIT ZERO
1 [U+0031] DIGIT ONE
2 [U+0032] DIGIT TWO
3 [U+0033] DIGIT THREE
4 [U+0034] DIGIT FOUR
5 [U+0035] DIGIT FIVE
6 [U+0036] DIGIT SIX
7 [U+0037] DIGIT SEVEN
8 [U+0038] DIGIT EIGHT
9 [U+0039] DIGIT NINE
a [U+0061] LATIN SMALL LETTER A
b [U+0062] LATIN SMALL LETTER B
c [U+0063] LATIN SMALL LETTER C
d [U+0064] LATIN SMALL LETTER D
e [U+0065] LATIN SMALL LETTER E
f [U+0066] LATIN SMALL LETTER F
g [U+0067] LATIN SMALL LETTER G
h [U+0068] LATIN SMALL LETTER H
i [U+0069] LATIN SMALL LETTER I
j [U+006A] LATIN SMALL LETTER J
k [U+006B] LATIN SMALL LETTER K
l [U+006C] LATIN SMALL LETTER L
m [U+006D] LATIN SMALL LETTER M
n [U+006E] LATIN SMALL LETTER N
o [U+006F] LATIN SMALL LETTER O
p [U+0070] LATIN SMALL LETTER P
q [U+0071] LATIN SMALL LETTER Q
r [U+0072] LATIN SMALL LETTER R
s [U+0073] LATIN SMALL LETTER S
t [U+0074] LATIN SMALL LETTER T
u [U+0075] LATIN SMALL LETTER U
v [U+0076] LATIN SMALL LETTER V
w [U+0077] LATIN SMALL LETTER W
x [U+0078] LATIN SMALL LETTER X
y [U+0079] LATIN SMALL LETTER Y
z [U+007A] LATIN SMALL LETTER Z
ā [U+0101] LATIN SMALL LETTER A WITH MACRON
ē [U+0113] LATIN SMALL LETTER E WITH MACRON
ī [U+012B] LATIN SMALL LETTER I WITH MACRON
ō [U+014D] LATIN SMALL LETTER O WITH MACRON
ū [U+016B] LATIN SMALL LETTER U WITH MACRON
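
The session above relies on the idntables module and on Python 2. As a minimal sketch of the same idea that needs only the standard library under Python 3, something like the following should work. The function names describe and codepoints_of are invented for this example, and the list of code points is assumed to have been extracted from the table already:

```python
import unicodedata


def describe(codepoints):
    """Return one readable line per code point: character, U+XXXX, name."""
    lines = []
    for cp in sorted(set(codepoints)):
        ch = chr(cp)
        # Fall back to a placeholder for code points without a Unicode name.
        name = unicodedata.name(ch, "<unnamed>")
        lines.append("%s [U+%04X] %s" % (ch, cp, name))
    return lines


def codepoints_of(text):
    """Go the other way: turn characters into U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in text]


if __name__ == "__main__":
    # Hypothetical input: a few code points as they might be read from a table.
    for line in describe([0x0101, 0x0061, 0x0030]):
        print(line)
    print(codepoints_of("ā"))
```

The second function covers the opposite direction, from characters back to code points.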

kim