prohibiting previously mapped and unmapped characters

Thu Nov 30 22:54:48 CET 2006

Finally jumping on this discussion (I joined the list a few days ago but
read the list archive before that). My own opinion is that it is rather
hard to create repertoire restrictions beyond fixing the obvious
mistakes/issues:
- forbid a combining mark as the first character of a label
- make normalization fail on unassigned characters ("stable"
normalization concept)
- make it easier to move to Unicode 5.0 and beyond (required if we want
to support Myanmar, Khmer and many newly encoded scripts) by using
existing Unicode properties instead of code point enumeration.
- allow combining marks as the last characters of an RTL label (this in
fact makes some currently invalid input valid after the update, so it is
in fact an extension)
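
To make the first three items concrete, here is a minimal Python sketch,
not a proposal for the spec text. The function name check_label is
hypothetical, and the property data comes from whatever Unicode version
the local Python build ships in its unicodedata module, so results for
recently encoded characters may vary. It tests General_Category values
rather than enumerating code points, which is the point of the third
item. Note that category Cn also covers noncharacters such as U+FFFF.

```python
import unicodedata


def check_label(label):
    """Return a list of problems found in one label (empty if none).

    Hypothetical helper: it only looks at General_Category values.
    """
    if not label:
        return ["empty label"]

    problems = []

    # Categories Mn, Mc and Me are the combining marks.
    if unicodedata.category(label[0]).startswith("M"):
        problems.append("label starts with a combining mark")

    # Cn means "not assigned" in the Unicode version of this Python build
    # (this includes noncharacters).
    for ch in label:
        if unicodedata.category(ch) == "Cn":
            problems.append("unassigned code point U+%04X" % ord(ch))

    return problems


if __name__ == "__main__":
    for sample in ["abc", "\u0301abc", "a\uffff"]:
        print(repr(sample), check_label(sample))
```

Because the checks are property based, moving to a newer Unicode version
only requires newer property data, not a new list of code points.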

Carving subsets inside scripts will be much more difficult, as there
will be as many opinions as linguists who get involved. Even sub-setting
the symbol set is not an obvious task. I see many benefits in creating
such a subset (after all, I worked with Mark Davis on such a list in
Unicode UTS #39), but I am not convinced it belongs at the protocol
level. Maybe having such a list as an informative annex in stringprep,
also informatively referenced by nameprep, could be sufficient to
create a stronger guideline than the current situation. I think it is a
useful exercise and I am glad many people are looking into it, but
fixing the items above seems a higher priority to me.

BTW, when we created the IDN subset exposed in UTS #39, we looked at all
IANA IDN repertoires and other repertoires that were publicly available
(such as the ones from DENIC, Vietnam, Brazil, etc.) to make sure that
we covered all of them. We even negotiated with JPRS to remove one
problematic character from their list. However, the longer IDN is
deployed, the more difficult it will be to deprecate characters that
have been used in registrations, whether known to this group or not.

Michel