UTC Agenda Item: IDNA proposal
markdavis at google.com
Thu Nov 23 17:56:38 CET 2006
Thanks, Patrik, nice work.
Could you put the scripts in a determinate order (e.g., sorted either by code
-- such as Latn -- or by long name)? That would make comparison easier.
This direction is very promising, and close to what Unicode is recommending,
- Character Restrictions
   - characters must have general categories of: letter, mark,
     number (Nd), but
      - no uppercase/titlecase characters
      - no characters that aren't NFKC
   - plus ZWJ/ZWNJ, but in very limited contexts*
- Whole Field Restrictions
   - field must be NFC
   - no field starts with a mark
   - nameprep bidi restrictions (loosened to allow marks at end of
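
As a rough illustration of the restrictions listed above, here is a minimal Python sketch using only the standard unicodedata module. The function names char_allowed and field_allowed are hypothetical, chosen for this example. It assumes "no uppercase/titlecase" means general categories Lu and Lt, and it deliberately leaves out the ZWJ/ZWNJ contexts and the bidi restrictions, since their details are not spelled out here.

```python
import unicodedata

def char_allowed(ch: str) -> bool:
    """Per-character check (hypothetical helper, not from any spec)."""
    cat = unicodedata.category(ch)
    # Letters (L*), marks (M*), and decimal digits (Nd) only.
    if not (cat[0] in "LM" or cat == "Nd"):
        return False
    # Assumption: "no uppercase/titlecase" is read as categories Lu and Lt.
    if cat in ("Lu", "Lt"):
        return False
    # The character must be unchanged by NFKC normalization.
    return unicodedata.normalize("NFKC", ch) == ch

def field_allowed(field: str) -> bool:
    """Whole-field check; ZWJ/ZWNJ and bidi rules are NOT implemented."""
    if not field:
        return False
    if not all(char_allowed(ch) for ch in field):
        return False
    # The field as a whole must already be in NFC.
    if unicodedata.normalize("NFC", field) != field:
        return False
    # No field may start with a mark.
    return not unicodedata.category(field[0]).startswith("M")
```

The results depend on the Unicode version bundled with the Python interpreter, so tables generated this way may differ slightly between installations.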
The Unicode identifiers (http://unicode.org/reports/tr31/) differ somewhat
in that they add connector punctuation, like "_", and disallow numbers at
the start of identifiers (since the target is programming languages). They
also provide for stability extensions, so that once a character is allowed in
an identifier, it remains allowed in all future versions of Unicode. See
* The ZWJ/ZWNJ provision is still in draft stage. The UTC is issuing a PRI regarding
use of ZWJ/ZWNJ in identifiers based on L2/06-353 (
* See also
including Table 3. Characters for Natural Language Identifiers
On 11/22/06, Patrik Fältström <patrik at frobbit.se> wrote:
> A version that accepts classes Ll, Lo and Mn can be found as
> What about class Nd?
> On 22 nov 2006, at 14.07, Harald Alvestrand wrote:
> > Class Mn contains the HEBREW POINT QAMATS that the -bidi draft is
> > busy defending. Can't eliminate that.
> > Harald
> > --On 22. november 2006 13:52 +0100 Patrik Fältström
> > <patrik at frobbit.se> wrote:
> >> I have recreated the tables using a new algorithm (based on input from
> >> Kenneth mostly).
> >> (1) Use the scripts.txt file for the script definitions, do not use
> >> the blocks definitions
> >> (2) Remove codepoints where cp != NFKC(cp)
> >> (3) Remove codepoints where cp != lowercase(cp)
> >> (4) Remove codepoints where class(cp) != "Ll"
> >> (5) Include codepoints that are part of US-ASCII (0-9, A-Z and a-z)
> >> The result of doing this for U+0000 - U+FFFF can be found as
> >> http://stupid.domain.name/idnabis/table-ll.html
> >> If I instead accept both classes Ll and Lo in step 4, then the
> >> result can be found as
> >> http://stupid.domain.name/idnabis/table-lllo.html
> >> Please let me know what you think.
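
Steps (2) through (5) quoted above can be sketched in Python roughly as follows. This is an illustrative reconstruction, not the script that produced the tables. The names include and build_table are hypothetical, and step (1), grouping by script from scripts.txt, is omitted because the standard unicodedata module does not expose the Script property. The classes parameter stands in for the step 4 variants (Ll only, or Ll and Lo).

```python
import unicodedata

# Step 5: US-ASCII digits and letters are always included.
ASCII_ALNUM = set(
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
)

def include(cp: int, classes=("Ll",)) -> bool:
    """Decide whether one code point survives the filter (hypothetical helper)."""
    ch = chr(cp)
    if ch in ASCII_ALNUM:                        # step 5
        return True
    if unicodedata.normalize("NFKC", ch) != ch:  # step 2: cp != NFKC(cp)
        return False
    if ch.lower() != ch:                         # step 3: cp != lowercase(cp)
        return False
    return unicodedata.category(ch) in classes   # step 4: class filter

def build_table(classes=("Ll",), start=0x0000, end=0xFFFF):
    """Return the included code points in [start, end], in code point order."""
    return [cp for cp in range(start, end + 1) if include(cp, classes)]
```

For example, build_table(classes=("Ll", "Lo")) would correspond to the second variant, subject to whatever Unicode version the interpreter ships with.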
> >> I have this comment regarding one entry from class Lm:
> >>>> | Exclude | U+02BB | U+02BB | Lm | MODIFIER LETTER TURNED COMMA |
> >>>> | Exclude | U+02BC | U+02BC | Lm | MODIFIER LETTER APOSTROPHE |
> >>> As ASCII isn't directly encodable using Punycode, one of these will
> >>> need to be allowed for Pacific languages, which use the
> >>> apostrophe, e.g., Hawaiʻi. It is often ignored, but in languages like
> >>> Tongan it can make a difference.
> >> I have not taken this into account when creating these tables.
> >> Regards, Patrik
> >> _______________________________________________
> >> Idna-update mailing list
> >> Idna-update at alvestrand.no
> >> http://www.alvestrand.no/mailman/listinfo/idna-update