"This case isn't the important one" (was Re: Visually confusable characters (8))

Andrew Sullivan ajs at anvilwalrusden.com
Mon Aug 11 23:43:07 CEST 2014


On Mon, Aug 11, 2014 at 07:44:34PM +0000, Whistler, Ken wrote:
> U+08A1 was encoded separately because the Hamza above the beh
> was functioning as an ijam *AND* because the principle had been
> established in the Unicode Standard (very explicitly, I might add),
> that new Arabic characters consisting of a skeleton plus ijam
> are separately encoded.

Once again, this is very helpful to me.  Thanks.

I suspect that what is going to be necessary for protocols is some
sort of mapping mechanism to deal with this principle.  It seems
obvious that someone whose writing system uses a skeleton plus ijam
that is not in Unicode version n is going to use some other combining
sequence to write it.  It seems similarly obvious that after Unicode
version n+m becomes available (e.g. due to a system update that changes
the libraries), the same person is going to be (intentionally or not)
using the new separately encoded character.  For documents, this might
not be disastrous.  For protocols, it's going to cause at least
confusion and at worst failed interoperability or security problems.  
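
To make the concern concrete, here is a minimal Python sketch using the
case from this thread: BEH followed by combining HAMZA ABOVE (U+0628
U+0654) versus the separately encoded U+08A1.  It relies only on the
standard unicodedata module; the variable and function names are mine
and purely illustrative.  Because U+08A1 has no canonical or
compatibility decomposition, none of the four standard normalization
forms brings the two spellings together.

```python
import unicodedata

# The two spellings discussed in this thread.
combining_sequence = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE
separately_encoded = "\u08A1"        # ARABIC LETTER BEH WITH HAMZA ABOVE


def same_after(form: str, a: str, b: str) -> bool:
    """Return True if a and b are identical after normalizing to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Compare the two spellings under every standard normalization form.
results = {
    form: same_after(form, combining_sequence, separately_encoded)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, equal in results.items():
    print(form, "equal" if equal else "different")
```

A protocol that compares identifiers after normalization alone would
therefore treat the two spellings as distinct strings.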

I'm wondering (and this really is thinking out loud) whether we need
some pre-wire stage mapping that offers an opportunity to
"proto-normalize" these characters together or something like that.  I
don't know.  But it seems one might want to render one of them
inoperative in truly internationalized (as opposed to localized)

Anyway, I need to think about this quite a bit more.  Thank you very
much for your help.

Best regards,


Andrew Sullivan
ajs at anvilwalrusden.com

More information about the Idna-update mailing list