What rules have been used for the current list of codepoints?

Mark Davis mark.davis at icu-project.org
Fri Dec 15 00:33:44 CET 2006


On 12/14/06, John C Klensin <klensin at jck.com> wrote:
>
> Moving up a few thousand feet, I believe we are missing some
> things here.  In no particular order and with the understanding
> that I am deliberately not attempting to be precise in order to
> emphasize the principles...
>
> (I got interrupted for several hours before finishing this; one
> part of it is partially covered by some subsequent discussion,
> but only part, so please bear with me.)
>
> (1) As soon as someone says "X is ok because it can be used only
> in context Y, even though it would be problematic in other
> contexts", we have taken ourselves beyond rules about individual
> code points and into a world in which one must be able to
> rigorously define Y and the relationship.   This would seem to
> apply to various separators that would look like prohibited
> characters in other contexts, such as non-breaking spaces,
> shape-approximations to apostrophes.   It would also apply to
> any characters that would disappear (become invisible) outside
> some specialized contexts in which they have a well-defined and
> obvious impact on whatever surrounds them, such as ZWJ and ZWNJ.
> And, finally, this impacts any model that suggests that certain
> combining characters be permitted only in conjunction with
> particular scripts.  Look-ahead is not part of IDNA (or
> nameprep/stringprep) now.  While adding it is not infeasible, it
> isn't a small step.


I agree that it isn't a small step, and we need to be very limited in what
we accept. That's why the default-ignorable code points have been added
to the list.

However, ZWJ and ZWNJ are required by certain languages in order to
represent fairly common words, such as the name of a country in that
country's own language. I believe that this is worth making an exception
for. The complexity can be contained, I believe, by expressing the
condition in terms of standard regular expressions, which essentially any
implementation has access to. The performance implications are also
contained, because the number of instances where those regular expressions
need to be invoked will be, as a percentage of all IDNs, extremely low.

See http://www.unicode.org/review/pr-96.html
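
For instance, here is a rough sketch of such a contextual test in Python
(not the exact PR-96 wording); the joining-type table is an illustrative
stub, since the real data comes from the Unicode file ArabicShaping.txt:

    import re
    import unicodedata

    ZWNJ = "\u200c"

    # Illustrative, incomplete joining-type table. D = dual-joining,
    # R = right-joining, T = transparent; unknown characters are
    # treated as non-joining (U).
    JOINING_TYPE = {
        "\u0628": "D",  # ARABIC LETTER BEH
        "\u0647": "D",  # ARABIC LETTER HEH
        "\u0627": "R",  # ARABIC LETTER ALEF
        "\u064e": "T",  # ARABIC FATHA (transparent)
    }

    def zwnj_allowed(label, i):
        """Allow ZWNJ at position i either directly after a virama
        (combining class 9), or between joining letters, i.e. in the
        context {L,D} T* ZWNJ T* {R,D}."""
        assert label[i] == ZWNJ
        if i > 0 and unicodedata.combining(label[i - 1]) == 9:
            return True
        types = "".join(JOINING_TYPE.get(c, "U") for c in label)
        marked = types[:i] + "Z" + types[i + 1:]  # mark the ZWNJ itself
        return re.search(r"[LD]T*ZT*[RD]", marked) is not None

    print(zwnj_allowed("\u0628\u200c\u0647", 1))  # True: BEH joins to HEH
    print(zwnj_allowed("\u0627\u200c\u0647", 1))  # False: ALEF does not
                                                  # join to a following letter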

> (2) One of our principles (or, if you prefer, meta-rules) is "no
> non-language characters".  The input from the user/ registrant/
> good-DNS-behavior side of things has been pretty clear about
> that.    If we have a script that contains a number of
> non-language characters, and those characters are not identified
> by some class or property that permits them to be discriminated
> from the rest of the script, then we have a problem -- either
> with the principle or with the way we have so far gone about
> this.  That is the issue with the IPA Block: clearly it contains
> some characters we need.  Clearly it contains some characters
> that violate this principle.  If we can neither keep it nor
> eliminate it, either the principle needs to give way or we need
> a different criterion or rule set.  It is not clear what those
> might be.


As has been said elsewhere, the criteria should generally be based on
script, not block. For more on the dangers of blocks versus scripts, see
http://www.unicode.org/reports/tr18/#Character_Blocks
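
A small illustration of the difference, assuming the third-party Python
"regex" module, which exposes the Script and Block properties:

    import regex  # third-party module: pip install regex

    # U+037E GREEK QUESTION MARK sits in the "Greek and Coptic" block,
    # but its Script property is Common: it is canonically equivalent to
    # U+003B SEMICOLON, shared punctuation rather than a Greek letter.
    ch = "\u037e"
    print(bool(regex.match(r"\p{Block=Greek_and_Coptic}", ch)))  # True
    print(bool(regex.match(r"\p{Script=Greek}", ch)))            # False

A block-based rule would treat that character as Greek; a script-based
rule would not.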

Now, as others have said, we could take on the project of going through each
of the IPA characters to see which are used in modern languages. There are
difficulties with this, as noted. If, however, we really wanted to do this,
the much more tractable task would be to (a) *first* identify which we think
could cause some significant problem, and only (b) *then* see which of those
are used in modern languages. If you want to take a pass at (a), that would
be useful.

Note that in http://www.unicode.org/reports/tr39/#References, under
xidmodifications.txt
(http://www.unicode.org/reports/tr39/data/xidmodifications.txt), we do have
a pass at separating out characters on a character-by-character basis.
However, that file is directed more towards notifying users that there may
be a problem (in which case one can be a bit more aggressive) than towards
a hard-and-fast prohibition in the protocol.
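
If one did want to consume that file mechanically, here is a hypothetical
sketch, assuming the common UCD-style "XXXX[..YYYY] ; status" layout with
"#" comments (verify against the actual data file):

    def load_xid_modifications(path):
        """Parse a UCD-style data file into a {codepoint: status} map."""
        statuses = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.split("#", 1)[0].strip()  # drop comments
                if not line:
                    continue
                cps, status = [fld.strip() for fld in line.split(";")][:2]
                if ".." in cps:
                    start, end = (int(c, 16) for c in cps.split(".."))
                else:
                    start = end = int(cps, 16)
                for cp in range(start, end + 1):
                    statuses[cp] = status
        return statuses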

> (3) We cannot establish a principle that strings coming into
> IDNA (or Nameprep) must already be normalized (to NFC at least).
> The rule that NFKC(cp) must equal cp is well and good, but,
> taken by itself,  I think it eliminates all sequences involving
> combining characters for which there are precombined sequences
> and may have some other ill effects.  Am I missing something in
> this, or does the rule need further refinement (note that this
> interacts with (1) above).


There is a common misunderstanding about NFKC and NFC.

The vast majority of combining marks satisfy the requirement that NFKC(cp)
= cp. That requirement does *not* eliminate the separate requirement that
NFKC(whole_field) = whole_field, which is a *current* requirement on the
output of IDNA. If my statement here is too obscure I can elaborate... Have
to run to a meeting now.
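
To make the distinction concrete, a quick illustration using Python's
standard unicodedata module:

    import unicodedata

    s = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

    # Each code point, taken alone, is NFKC-stable...
    for cp in s:
        assert unicodedata.normalize("NFKC", cp) == cp

    # ...but the two-character sequence is not: NFKC composes it to U+00E9.
    print(unicodedata.normalize("NFKC", s) == s)  # False

So a per-code-point rule does not remove the need for the whole-field
normalization requirement.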

>      john