UTC Agenda Item: IDNA proposal

Mark Davis markdavis at google.com
Thu Nov 23 17:56:38 CET 2006


Thanks, Patrik, nice work.

Could you put the scripts in a deterministic order (e.g. sorted either by
code -- such as Latn -- or by long name)? That would make comparison easier.

This direction is very promising, and close to what Unicode is recommending,
which is:

   - Character Restrictions
      - characters must have general categories of: letter, mark,
        number (Nd), but
      - no uppercase/titlecase characters
      - no characters that aren't NFKC
      - plus ZWJ/ZWNJ, but in very limited contexts*
   - Whole Field Restrictions
      - field must be NFC
      - no field starts with a mark
      - nameprep bidi restrictions (loosened to allow marks at end of
        field)
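
As a rough illustration only, the restrictions above could be sketched in
Python with the standard unicodedata module. The function names are
hypothetical, the ZWJ/ZWNJ context rules and the nameprep bidi rules are
not modelled (ZWJ/ZWNJ are simply let through), and the results depend on
the Unicode version bundled with the interpreter:

```python
import unicodedata

ZWJ, ZWNJ = "\u200d", "\u200c"


def char_allowed(ch):
    """Per-character checks from the list above (hypothetical helper)."""
    if ch in (ZWJ, ZWNJ):
        # Assumption: the "very limited contexts" rule is NOT checked here.
        return True
    cat = unicodedata.category(ch)
    # Letter (L*), mark (M*) or decimal number (Nd) only.
    if not (cat[0] in "LM" or cat == "Nd"):
        return False
    # No uppercase or titlecase letters.
    if cat in ("Lu", "Lt"):
        return False
    # The character must be unchanged by NFKC.
    return unicodedata.normalize("NFKC", ch) == ch


def field_allowed(field):
    """Whole-field checks; bidi restrictions are deliberately omitted."""
    if not field:
        return False
    if unicodedata.normalize("NFC", field) != field:
        return False
    if unicodedata.category(field[0])[0] == "M":
        return False
    return all(char_allowed(ch) for ch in field)
```

With these definitions, field_allowed("example") and
field_allowed("hawai\u02bbi") are accepted, while "Example" (uppercase),
"\ufb01" (not NFKC) and "e\u0301" (not NFC) are rejected.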

The Unicode identifiers (http://unicode.org/reports/tr31/) differ somewhat
in that they add connector punctuation, like "_", and disallow numbers at
the start of identifiers (since the target is programming languages). They
also provide for stability extensions, so that once a character is allowed
in identifiers, it stays allowed in all future versions of Unicode. See
http://www.unicode.org/reports/tr31/#Backward_Compatibility
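
For a quick feel for those two differences, Python's str.isidentifier()
can serve as a stand-in. Its identifier syntax is based on TR31 but is
not identical to it (for one, Python also lets "_" start an identifier),
so treat this as an approximation rather than the TR31 definition:

```python
# Connector punctuation is accepted; a leading digit is not.
samples = ["abc", "a_b", "abc1", "1abc"]
results = {s: s.isidentifier() for s in samples}

for name, ok in results.items():
    print(f"{name!r}: {ok}")
```

Here "a_b" and "abc1" are accepted and "1abc" is rejected, whereas the
IDNA-style rules above would reject "a_b" and allow a leading digit.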

* The ZWJ/ZWNJ rule is still in draft stage. The UTC is issuing a PRI
(Public Review Issue) regarding use of ZWJ/ZWNJ in identifiers based on
L2/06-353 (http://www.unicode.org/L2/L2006/06353-zwj-issues.html)

* See also
http://www.unicode.org/reports/tr31/#Specific_Character_Adjustments,
including Table 3. Characters for Natural Language Identifiers

Mark

On 11/22/06, Patrik Fältström <patrik at frobbit.se> wrote:
>
> Version that accept classes Ll, Lo and Mn can be found as
>
> http://stupid.domain.name/idnabis/table-lllomn.html
>
> What about class Nd?
>
>     Patrik
>
> On 22 nov 2006, at 14.07, Harald Alvestrand wrote:
>
> > Class Mn contains the HEBREW POINT QAMATS that the -bidi draft is
> > busy defending. Can't eliminate that.
> >
> >            Harald
> >
> > --On 22. november 2006 13:52 +0100 Patrik Fältström
> > <patrik at frobbit.se> wrote:
> >
> >> I have recreated the tables using a new algorithm (based on input from
> >> Kenneth mostly).
> >>
> >> (1) Use the scripts.txt file for the script definitions, do not use
> >> the blocks definitions
> >>
> >> (2) Remove codepoints where cp != NFKC(cp)
> >>
> >> (3) Remove codepoints where cp != lowercase(cp)
> >>
> >> (4) Remove codepoints where class(cp) != "Ll"
> >>
> >> (5) Include codepoints that are part of US-ASCII (0-9, A-Z and a-z)
> >>
> >> The result of doing this for U+0000 - U+FFFF can be found as
> >>
> >> http://stupid.domain.name/idnabis/table-ll.html
> >>
> >> If I instead in step 4 accept things of class both Ll and Lo, then
> >> the result can be found as
> >>
> >> http://stupid.domain.name/idnabis/table-lllo.html
> >>
> >> Please let me know what you think.
> >>
> >> I have this comment regarding one entry from class Lm:
> >>
> >>>>  | Exclude  | U+02BB | U+02BB | Lm    | MODIFIER LETTER TURNED COMMA |
> >>>>  | Exclude  | U+02BC | U+02BC | Lm    | MODIFIER LETTER APOSTROPHE   |
> >>>>
> >>>
> >>> As ASCII isn't directly encodable using Punycode, one of these is going
> >>> to be needed to be allowed for Pacific languages, which use the
> >>> apostrophe. eg, Hawaiʻi. It is often ignored, but in languages like
> >>> Tongan it can make a difference.
> >>
> >> I have not taken this into account when creating these tables.
> >>
> >>      Regards, Patrik
> >>
> >> _______________________________________________
> >> Idna-update mailing list
> >> Idna-update at alvestrand.no
> >> http://www.alvestrand.no/mailman/listinfo/idna-update
> >>
> >
> >
> >
> >
>
>



-- 
Mark