Draft on IDN Tables in XML

Dillon, Chris c.dillon at ucl.ac.uk
Thu Mar 8 14:04:35 CET 2012


Dear Kim,

Good to hear that there is a tool which will mass-convert from Unicode U+ notation to characters!

(Incidentally, in case anyone is interested in converting individual codes ad hoc, these are the easiest methods I have discovered:
To go from U+ notation to a character, type the code without the U+ in a recent version of Microsoft Word and press Alt-x (at least in the PC version).
To go from a character to U+ notation, type the character in the Characters field (below Mixed input) on the following website and click Convert; it gives you just about every code known to mankind:
http://www.rishida.net/tools/conversion )
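
For anyone who would rather script the two conversions described above, here is a minimal Python 3 sketch. The function names from_u_notation and to_u_notation are made up for illustration; only the built-in chr(), ord() and int() are relied on.

```python
def from_u_notation(code):
    """Turn 'U+0101' (or just '0101') into the corresponding character."""
    text = code.strip()
    if text.upper().startswith("U+"):
        text = text[2:]          # drop the 'U+' prefix if present
    return chr(int(text, 16))    # hex digits -> code point -> character


def to_u_notation(char):
    """Turn a single character into U+ notation, zero-padded to 4 hex digits."""
    return "U+%04X" % ord(char)


if __name__ == "__main__":
    print(from_u_notation("U+0101"))
    print(to_u_notation("ā"))
```

Code points beyond U+FFFF simply come out with five or six hex digits, which matches the usual convention.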

In this/these XML table(s), it would be good to require that zh be stipulated, but something is also needed to indicate where the Simplified Chinese and Traditional Chinese Preferred Variants are stored and, possibly, the relatively small number of characters that may only be used in, e.g., Singapore, Hong Kong or Taiwan.

Regards,

Chris.
==
Research Associate in Linguistic Computing, Dept of Information Studies, UCL, Gower St, London WC1E 6BT Tel +44 20 7679 1599 (int 31599) ucl.ac.uk/dis/people/chrisdillon

From: Kim Davies [mailto:kim.davies at icann.org]
Sent: 07 March 2012 00:56
To: Dillon, Chris
Cc: vip at icann.org; idna-update at alvestrand.no
Subject: Re: Draft on IDN Tables in XML

Hi Chris,

On Mar 5, 2012, at 4:22 AM, Dillon, Chris wrote:

In the RFC3743-style tables at http://www.iana.org/domains/idn-tables/ typically Simplified Chinese Preferred Variants and Traditional Chinese Preferred Variants have their own columns.

http://tools.ietf.org/html/rfc5646 gives the following example tags for Chinese; which should be standard for Chinese in this XML-based system?

I would assume simply "zh" would be sufficient. It is not a requirement to stipulate the script in a language tag. Also, the entire tag is discretionary — if, for example, you created a fictitious table that had no bearing on any specific language or script, you would not be required to specify one.
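
To illustrate the point that the script subtag is optional, here is a rough Python 3 sketch that pulls the language and, if present, the script out of a tag such as "zh" or "zh-Hant". It is a simplification and not a full RFC 5646 parser: the helper name split_language_tag is hypothetical, and the only rule it applies is that a four-letter alphabetic subtag after the primary language is a script.

```python
def split_language_tag(tag):
    """Return (language, script) for a language tag; script is None if absent.

    Simplified: does not validate subtags or handle grandfathered tags.
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    script = None
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            # A singleton (e.g. 'x' or 'u') starts an extension or
            # private-use sequence, so stop looking for a script.
            break
        if len(subtag) == 4 and subtag.isascii() and subtag.isalpha():
            script = subtag.title()   # scripts are conventionally title case
            break
    return language, script


if __name__ == "__main__":
    for tag in ("zh", "zh-Hant", "zh-Hans-CN", "zh-TW"):
        print(tag, split_language_tag(tag))
```

A table tagged only "zh" would therefore carry no script information at all, which is why the Simplified/Traditional distinction would have to be recorded somewhere else if it matters.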

A problem that many tables share is that one sees only Unicode numbers, no characters, and so when humans work with the tables, they often need to turn Unicode codes into characters or characters into Unicode codes. Is there any way that the XML could contain both (I think there are Unicode fonts containing nearly all the characters)?

Creating a tool that takes the code points and turns them into something readable should be a trivial exercise, precisely because of the standardised format. I think it would be best to avoid superfluous descriptions of the individual code points in the spec itself, and would rather encourage tools that present the XML file in such a way as to be readable (as a web page, etc.).

For example, I can very simply print human-readable representations from the XML table as follows:

kim at gumleaf:idntables[master*]$ python
Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import idntables, unicodedata
>>> table = idntables.load("samples/nz_Latn_1.0.xml")
>>> for char in sorted(table._codepoints):
...     print "%s [U+%04X] %s" % (unichr(char), char, unicodedata.name(unichr(char)))
...
0 [U+0030] DIGIT ZERO
1 [U+0031] DIGIT ONE
2 [U+0032] DIGIT TWO
3 [U+0033] DIGIT THREE
4 [U+0034] DIGIT FOUR
5 [U+0035] DIGIT FIVE
6 [U+0036] DIGIT SIX
7 [U+0037] DIGIT SEVEN
8 [U+0038] DIGIT EIGHT
9 [U+0039] DIGIT NINE
a [U+0061] LATIN SMALL LETTER A
b [U+0062] LATIN SMALL LETTER B
c [U+0063] LATIN SMALL LETTER C
d [U+0064] LATIN SMALL LETTER D
e [U+0065] LATIN SMALL LETTER E
f [U+0066] LATIN SMALL LETTER F
g [U+0067] LATIN SMALL LETTER G
h [U+0068] LATIN SMALL LETTER H
i [U+0069] LATIN SMALL LETTER I
j [U+006A] LATIN SMALL LETTER J
k [U+006B] LATIN SMALL LETTER K
l [U+006C] LATIN SMALL LETTER L
m [U+006D] LATIN SMALL LETTER M
n [U+006E] LATIN SMALL LETTER N
o [U+006F] LATIN SMALL LETTER O
p [U+0070] LATIN SMALL LETTER P
q [U+0071] LATIN SMALL LETTER Q
r [U+0072] LATIN SMALL LETTER R
s [U+0073] LATIN SMALL LETTER S
t [U+0074] LATIN SMALL LETTER T
u [U+0075] LATIN SMALL LETTER U
v [U+0076] LATIN SMALL LETTER V
w [U+0077] LATIN SMALL LETTER W
x [U+0078] LATIN SMALL LETTER X
y [U+0079] LATIN SMALL LETTER Y
z [U+007A] LATIN SMALL LETTER Z
ā [U+0101] LATIN SMALL LETTER A WITH MACRON
ē [U+0113] LATIN SMALL LETTER E WITH MACRON
ī [U+012B] LATIN SMALL LETTER I WITH MACRON
ō [U+014D] LATIN SMALL LETTER O WITH MACRON
ū [U+016B] LATIN SMALL LETTER U WITH MACRON
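
The session above uses Python 2 and the idntables module. As a rough Python 3 sketch of the same idea that runs without that module, assume the code points are already available as a plain set of integers. The function name describe and the small hard-coded sample set are illustrative stand-ins, not the contents of the actual table.

```python
import unicodedata


def describe(codepoints):
    """Return one 'char [U+XXXX] NAME' line per code point, in sorted order."""
    lines = []
    for cp in sorted(codepoints):
        char = chr(cp)
        # Fall back to a placeholder for code points without a Unicode name.
        name = unicodedata.name(char, "<unnamed>")
        lines.append("%s [U+%04X] %s" % (char, cp, name))
    return lines


if __name__ == "__main__":
    # Hypothetical sample; a real tool would read these from the XML table.
    sample = {0x0030, 0x0061, 0x0101}
    for line in describe(sample):
        print(line)
```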

kim