What rules have been used for the current list of codepoints?

Patrik Fältström patrik at frobbit.se
Thu Dec 14 09:54:19 CET 2006


On 13 dec 2006, at 17.45, Mark Davis wrote:

> My comments on the list proposed:
>
>> Do I see a consensus on this list that I should remove rule 2?
> Yes, #2 needs to be removed -- many of these are required for modern
> languages.*
>
>> Do I see a consensus on this list that I should also include Lm
>> and Nd? (Then rule 4 can be removed.)
> Yes, #3 needs to be expanded by adding Lm -- again, many of these are
> required for modern languages.*, **

What about Nd? You say Lm should be added, but Nd?

> In addition,
> #1 needs to be removed -- there are many modern languages that use IPA
> characters.*

Ok

> #6 should be 'casefolded' (this is almost completely the same as
> lowercase, but there are a few important exceptions)

Ok

Can you point out a codepoint where there is a difference, so I can
make sure the software I use does the right thing? A test case.
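
A possible test case, offered as a sketch rather than a definitive list: the two code points below are ones where Python's built-in `str.lower()` leaves the character unchanged while `str.casefold()` does not. The helper name `differs` and the `CANDIDATES` table are made up for illustration, and the results depend on the Unicode version of the Python build in use.

```python
import unicodedata

# Code points that lowercasing leaves alone but case folding changes.
CANDIDATES = {
    "\u00df": "ss",      # LATIN SMALL LETTER SHARP S folds to "ss"
    "\u03c2": "\u03c3",  # GREEK SMALL LETTER FINAL SIGMA folds to SIGMA
}


def differs(cp: str) -> bool:
    """Return True if lowercase(cp) == cp but casefold(cp) != cp."""
    return cp.lower() == cp and cp.casefold() != cp


if __name__ == "__main__":
    for cp in CANDIDATES:
        print(
            f"U+{ord(cp):04X} {unicodedata.name(cp)}: "
            f"lower={cp.lower()!r} casefold={cp.casefold()!r} "
            f"differs={differs(cp)}"
        )
```

A rule written as "lowercase(cp) != cp" would accept both of these code points, while "casefold(cp) != cp" would reject them.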

> * It would be possible to sift through to see which are only technical,
> and which are used in modern languages, but as a class they can't be
> excluded.

It is when we look at individual codepoints and say we need a subset
of the class and not the whole class (or even, "we can NOT include
this codepoint") that we are in a situation where we need new
definitions, possibly a new class definition. And that was the state
I felt we were in about 2 weeks ago.

> ** There are pluses and minuses to adding Nd as well;

Ok.

> I'd then recommend a slightly different formulation, because it is
> unclear, when you have rule X saying 'ok' and rule Y saying 'not ok',
> which one wins.

I said "the first rule that matches".
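
To make "the first rule that matches" concrete, here is a minimal sketch of the seven rules quoted further down, evaluated in order. Several things in it are assumptions: the names `RULES` and `first_match` are invented for this example, a codepoint that matches no rule is treated as not ok, the "IPA Extensions" block is hard-coded as U+0250..U+02AF, and rule 2 (script "Inherited") is left out because Python's standard `unicodedata` module does not expose the script property.

```python
import unicodedata

IPA_EXTENSIONS = range(0x0250, 0x02B0)  # block U+0250..U+02AF

# Each entry is (predicate, verdict). Order matters: the first predicate
# that matches decides the outcome.
RULES = [
    (lambda cp: ord(cp) in IPA_EXTENSIONS, False),                # rule 1
    # rule 2 (script "Inherited") omitted: no script data in the stdlib
    (lambda cp: "A" <= cp <= "Z", True),                          # rule 3
    (lambda cp: "0" <= cp <= "9", True),                          # rule 4
    (lambda cp: unicodedata.normalize("NFKC", cp) != cp, False),  # rule 5
    (lambda cp: cp.lower() != cp, False),                         # rule 6
    (lambda cp: unicodedata.category(cp) in {"Ll", "Lo", "Mn", "Mc"},
     True),                                                       # rule 7
]


def first_match(cp: str, default: bool = False) -> bool:
    """Return the verdict of the first rule whose predicate matches cp."""
    for predicate, verdict in RULES:
        if predicate(cp):
            return verdict
    return default  # assumption: no matching rule means "not ok"


if __name__ == "__main__":
    for cp in ["A", "a", "5", "\u0250", "\u00c5", "\uff41", "-"]:
        print(f"U+{ord(cp):04X}: {'ok' if first_match(cp) else 'not ok'}")
```

With this ordering, "A" is ok because rule 3 fires before rule 6 gets a chance to reject it.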

> So I'd recast as a series of additions and removals; thus the later one
> 'wins'. Then the rules would be written as:
>
> 0. Start with the empty set.
> 1. If generalCategory(cp) is [Ll, Lo, Lm, Mn, Mc], add cp
> 2. If NFKC(cp) != cp, remove cp
> 3. If casefold(cp) != cp, remove cp
> 4. If cp is in [-A-Z0-9], add cp

That is a different way of stating the same thing, yes.
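
The additions-and-removals formulation quoted above can be sketched per codepoint as follows, with each later step overriding the earlier ones. This is only an illustration: the function name `allowed` and the constants are invented here, the ASCII set is copied literally from step 4 as quoted, and the answers depend on the Unicode tables shipped with the Python build in use.

```python
import unicodedata

ADD_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc"}
ASCII_EXTRAS = set("-ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")


def allowed(cp: str) -> bool:
    """Apply steps 0-4 in order; the last step that applies wins."""
    ok = False                                      # 0. start with the empty set
    if unicodedata.category(cp) in ADD_CATEGORIES:  # 1. add by general category
        ok = True
    if unicodedata.normalize("NFKC", cp) != cp:     # 2. remove if NFKC changes it
        ok = False
    if cp.casefold() != cp:                         # 3. remove if case folding changes it
        ok = False
    if cp in ASCII_EXTRAS:                          # 4. add hyphen, A-Z, 0-9
        ok = True
    return ok


if __name__ == "__main__":
    for cp in ["a", "A", "-", "5", "\u00df", "\u02bc", "\u02b0", "\u0250"]:
        print(f"U+{ord(cp):04X}: {'in' if allowed(cp) else 'out'}")
```

In this sketch U+00DF is removed by step 3, whereas a test of the form "lowercase(cp) != cp" would let it through.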

    Patrik

> Mark
>
> On 12/13/06, Patrik Fältström <patrik at frobbit.se> wrote:
>>
>> I understand there is confusion about what rules have been used TODAY
>> for the list of codepoints.
>>
>> These are the rules; the first that matches tells whether the
>> codepoint is ok to include or not.
>>
>> 1. If block is "IPA Extensions", the codepoint is not ok
>> 2. If the script is "Inherited", the codepoint is not ok
>> 3. If the codepoint is [A-Z], the codepoint is ok
>> 4. If the codepoint is [0-9], the codepoint is ok
>> 5. If NFKC(cp) != cp, the codepoint is not ok
>> 6. If lowercase(cp) != cp, the codepoint is not ok
>> 7. If class is [Ll, Lo, Mn, Mc], the codepoint is ok
>>
>> I have a suggestion that rule 7 should also include classes Lm and
>> Nd, but I have not included that.
>>
>> Do I see a consensus on this list that I should also include Lm and
>> Nd? (Then rule 4 can be removed.)
>>
>> I also have a suggestion that rule 2 above should be removed, that I
>> went one step too far in conclusions from earlier discussions.
>>
>> Do I see a consensus on this list that I should remove rule 2?
>>
>> BTW, the URL to the latest document is
>> http://stupid.domain.name/idnabis/table-latest.html.
>>
>> Other changes you will see are:
>>
>> (a) The list of rules (that you see above) will be included in the
>> document
>> (b) The scripts will be in English alphabetical order
>>
>>      Patrik
>>
>> _______________________________________________
>> Idna-update mailing list
>> Idna-update at alvestrand.no
>> http://www.alvestrand.no/mailman/listinfo/idna-update
>>
>
>
>
> -- 
> Mark


