Proposed new Firefox IDN display algorithm

Mark Davis ☕ mark at macchiato.com
Mon Jan 23 22:12:56 CET 2012


Some comments on part of the document.

   - Count Common or Inherited characters that are only used with a limited
   number of scripts as being in either or each script, instead of ignoring
   them completely. For example, U+0640 ARABIC TATWEEL is used with the
   scripts Arabic and Syriac, but not Latin or Hangul. This work would be
   potentially time-consuming and complicated; we may have to call in domain
   experts.

In Unicode 6.1 (due out soon), the Unicode Consortium is adding the
Script_Extensions property, which provides that data. The sample code in #39
should be updated to use it, which would handle those cases.
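
A minimal sketch of how Script_Extensions could feed into a single-script
check. The two tables are hypothetical hand-written excerpts, not real UCD
data (a real implementation would load Scripts.txt and ScriptExtensions.txt),
and the function names are my own. The TATWEEL entry lists only the two
scripts named in the proposal; the published property value may list more.

```python
# Hypothetical excerpt of the Script property (real data: UCD Scripts.txt).
SCRIPT = {
    "a": "Latin",
    "\u0628": "Arabic",  # ARABIC LETTER BEH
    "\u0710": "Syriac",  # SYRIAC LETTER ALAPH
    "\u0640": "Common",  # ARABIC TATWEEL
}

# Hypothetical excerpt of Script_Extensions (real data: ScriptExtensions.txt).
SCRIPT_EXTENSIONS = {
    "\u0640": {"Arabic", "Syriac"},
}


def candidate_scripts(ch):
    """Return the scripts ch may belong to, or None if it is unconstrained."""
    extensions = SCRIPT_EXTENSIONS.get(ch)
    if extensions is not None:
        return set(extensions)
    script = SCRIPT[ch]
    if script in ("Common", "Inherited"):
        return None  # no extension data, so the character fits any script
    return {script}


def is_single_script(label):
    """True if at least one script is compatible with every character."""
    allowed = None
    for ch in label:
        scripts = candidate_scripts(ch)
        if scripts is None:
            continue
        allowed = scripts if allowed is None else allowed & scripts
    return allowed is None or len(allowed) > 0
```

With these tables, TATWEEL next to an Arabic or Syriac letter passes, while
TATWEEL next to a Latin letter is reported as mixed.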

   - Check for mixing numbers from different systems, such as U+0660 ( ٠ )
   ARABIC-INDIC DIGIT ZERO with U+06F0 ( ۰ ) EXTENDED ARABIC-INDIC DIGIT ZERO,
   or U+09EA ( ৪ ) BENGALI DIGIT FOUR with U+0038 ( 8 ) DIGIT EIGHT. Perhaps
   we could restrict non-Arabic numerals to particular languages, e.g. Bengali
   numerals to Bengali?

Most of the check for different numbering systems is handled by the script
detection. The only real additional work is to verify that there is no
more than one numbering system. That is, the Bengali 4 has a script of
Bengali, so if you have "a৪" it counts as two different scripts, Bengali
and Latin.
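
The extra "no more than one numbering system" test can be sketched with
Python's standard unicodedata module. This is only an illustration, and the
function names are my own: each decimal digit is mapped to the code point of
the zero in its own digit run, and the label is flagged if more than one
distinct zero appears.

```python
import unicodedata


def digit_systems(label):
    """Return the zero code points of all decimal-digit runs used in label."""
    zeros = set()
    for ch in label:
        if unicodedata.category(ch) == "Nd":
            # Decimal digits are encoded as contiguous runs 0..9, so
            # subtracting the digit value gives the zero of that run.
            zeros.add(ord(ch) - unicodedata.decimal(ch))
    return zeros


def mixes_digit_systems(label):
    """True if label uses decimal digits from more than one numbering system."""
    return len(digit_systems(label)) > 1
```

A label such as "a৪" has only one numbering system, so this test alone does
not reject it; the script check is what catches that case.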

   - Check for strings which contain both simplified-only and
   traditional-only Chinese characters, using the Unihan data in the Unicode
   Character Database. Does our platform have access to this data? If not, how
   large is it?

The Unihan database has mappings from simplified to traditional and vice
versa. Those mappings are about 16K each (binary on disk). However, the data
needed for just a simple test built from that information would be markedly
smaller.
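
One possible shape for such a simple test, as a sketch only. It assumes the
mappings have already been reduced to two sets of characters; the two-entry
sets below are hypothetical stand-ins for sets that would be derived from the
Unihan variant mappings, and the names are my own.

```python
# Hypothetical two-entry excerpts; the real sets would be derived from the
# Unihan simplified/traditional variant mappings.
SIMPLIFIED_ONLY = frozenset("\u56FD\u6C49")   # simplified forms of "country", "Han"
TRADITIONAL_ONLY = frozenset("\u570B\u6F22")  # traditional forms of the same two


def mixes_simplified_and_traditional(label):
    """True if label has both a simplified-only and a traditional-only character."""
    chars = set(label)
    return bool(chars & SIMPLIFIED_ONLY) and bool(chars & TRADITIONAL_ONLY)
```

Characters shared by both writing systems, such as U+4E2D, are in neither
set and so never trigger the check on their own.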

   - Detect sequences of the same nonspacing mark.
   - Check to see that all the characters are in the sets of exemplar
   characters for at least one language in the Unicode Common Locale Data
   Repository. [XXX What does this mean? -- Gerv]

The Unicode CLDR project gathers information on the characters used in
given languages: both the main characters and commonly used 'foreign'
characters. The check would require every character in a label to fall
within that set for at least one language.
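
A sketch of that check, under loud assumptions: the exemplar sets below are
hypothetical and heavily truncated, written by hand rather than taken from
CLDR, and the function name is my own.

```python
# Hypothetical, truncated exemplar sets; real ones would come from CLDR.
EXEMPLARS = {
    "en": set("abcdefghijklmnopqrstuvwxyz"),
    "de": set("abcdefghijklmnopqrstuvwxyz") | set("\u00E4\u00F6\u00FC\u00DF"),
    "ru": set("\u0430\u043F\u043E\u0447\u0442"),  # only a few Cyrillic letters
}


def languages_covering(label):
    """Return the languages whose exemplar set contains every character."""
    chars = set(label)
    return sorted(lang for lang, exemplars in EXEMPLARS.items()
                  if chars <= exemplars)
```

A label passes if the returned list is non-empty. A Latin label with one
Cyrillic letter substituted, for instance, is covered by no language here.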

Mark
*— Il meglio è l’inimico del bene —*
[https://plus.google.com/114199149796022210033]



On Fri, Jan 20, 2012 at 10:38, Gervase Markham <gerv at mozilla.org> wrote:

> Thanks to all on this list who provided input; I have taken several of
> your suggestions into this proposal for a change to the way Firefox chooses
> how to display IDNs:
>
> https://wiki.mozilla.org/IDN_Display_Algorithm
>
> Comments, particularly on the "Possible Issues and Open Questions", would
> be very welcome.
>
> Gerv