Proposed new Firefox IDN display algorithm

Mark Davis ☕ mark at macchiato.com
Mon Jan 30 17:42:05 CET 2012


Mark
— The best is the enemy of the good —
https://plus.google.com/114199149796022210033



On Mon, Jan 30, 2012 at 05:07, Gervase Markham <gerv at mozilla.org> wrote:

> Hi Mark,
>
> Again, thanks for your very helpful input.
>
>
> On 23/01/12 21:12, Mark Davis ☕ wrote:
>
>> The Unicode Consortium in U6.1 (due out soon) is adding the property
>> Script_Extensions, to provide that data. The sample code in #39 should
>> be updated to include that, so that those cases are handled.
>>
>
> Can you be a bit more specific about "soon"? :-)
>

"soon" for 6.1 and UTS #46 is February

There's a UTC meeting in a week that will be reviewing UTS #36 and UTS #39,
so any feedback people have on them is welcome:

http://www.unicode.org/reports/tr36/proposed.html
http://www.unicode.org/reports/tr39/proposed.html


>
> So this data will associate a number (N > 1) of language names with each
> Common or Inherited character?


No, it associates multiple scripts with certain common/inherited
characters; the data is at

http://unicode.org/Public/6.1.0/ucd/ScriptExtensions.txt

For example, the katakana-hiragana prolonged sound mark (U+30FC) is Common,
but the newer data allows people to detect it as out of place in (say) a
Cyrillic string:

30FC          ; Hira Kana # Lm       KATAKANA-HIRAGANA PROLONGED SOUND MARK
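
As a purely illustrative sketch (ICU4J and the class name here are my own
choice, not part of the proposal), the Script_Extensions data can be queried
roughly like this:

    import com.ibm.icu.lang.UScript;

    public class ScriptExtensionsDemo {
        public static void main(String[] args) {
            int c = 0x30FC;  // KATAKANA-HIRAGANA PROLONGED SOUND MARK

            // The plain Script property only reports "Common"...
            System.out.println(UScript.getName(UScript.getScript(c)));   // Common

            // ...but Script_Extensions ties the character to Hiragana and Katakana,
            // so it can be flagged as out of place in, say, a Cyrillic label.
            System.out.println(UScript.hasScript(c, UScript.HIRAGANA));  // true
            System.out.println(UScript.hasScript(c, UScript.KATAKANA));  // true
            System.out.println(UScript.hasScript(c, UScript.CYRILLIC));  // false
        }
    }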



>
>
>  Most of the check for different numbering systems is handled by the
>> script detection. The only real additional work is to verify that there
>> is no more than one numbering system.
>>
>
>   * Check to see that all the characters are in the sets of exemplar
>>
>>    characters for at least one language in the Unicode Common Locale
>>    Data Repository. [XXX What does this mean? -- Gerv]
>>
>> The Unicode CLDR project gathers information on the characters used in
>> given languages, both the main characters, and those commonly used
>> 'foreign' characters.
>>
>
> Let me put my query another way: "what does this check add that is not
> covered by the previous checks"? Is it a way of expanding the definition of
> what's in a particular script, to include characters which are technically
> classed as being in other scripts? Or something else?
>

No, it is to make the tests more specific to given languages, as a way of
excluding unfamiliar characters from the same script. There are, for
example, far more Latin characters than most people realize; even
excluding compatibility variants, there are over 1,000. Someone might not
realize that 'ꜱ' (U+A731) is not a regular 's', but a small-capital variant.
There are a couple of ways to approach this problem.

UTS #39 provides categorizations of characters for identifiers; see:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Alatn%3A%5D-%5B%3Anfkdqc%3Dn%3A%5D&g=identifier-restriction

CLDR provides information on which characters are used in which languages,
allowing someone to limit characters to those used by, for example,
official languages, or to those supported in the UI of a product. (This may
not be a good strategy IMO, but it is a technique suitable for some
environments.)

http://unicode.org/repos/cldr-tmp/trunk/diff/by_type/misc.exemplarCharacters.html

(Both of these are generated from machine-readable data.)
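
To make that concrete, here is a minimal sketch of the exemplar-set
technique using ICU4J (the class name, the choice of locales, and the test
strings are my own, purely for illustration):

    import com.ibm.icu.text.UnicodeSet;
    import com.ibm.icu.util.LocaleData;
    import com.ibm.icu.util.ULocale;

    public class ExemplarCheckSketch {
        // Union of the main exemplar characters of the chosen languages.
        static UnicodeSet allowedFor(ULocale... locales) {
            UnicodeSet allowed = new UnicodeSet();
            for (ULocale loc : locales) {
                // 0 = default options; returns the standard exemplar set
                allowed.addAll(LocaleData.getExemplarSet(loc, 0));
            }
            return allowed.freeze();
        }

        public static void main(String[] args) {
            UnicodeSet allowed = allowedFor(ULocale.ENGLISH, ULocale.FRENCH);
            System.out.println(allowed.containsAll("sale"));        // true
            System.out.println(allowed.containsAll("\uA731ale"));   // false: U+A731
                // (small capital s) is Latin, but in neither exemplar set
        }
    }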

Hope that helps,


> Gerv
>