Unicode 7.0.0, (combining) Hamza Above, and normalization

Fri Aug 8 20:16:20 CEST 2014

--On Friday, August 08, 2014 16:01 +0000 John Levine
<johnl at taugh.com> wrote:

>> I think this is an important insight and it may indeed be the
>> case that normalization for Domain Name purposes and
>> normalization for other purposes are not as aligned as we
>> supposed. ...
> 
> If I may stick my semi-informed oar in, it seems to me that for
> linguistic purposes, homographs are generally not an issue.
> Remember all those manual typewriters that didn't have digit 1
> or 0 keys, so you used letters l and O instead.
> 
> In our case, homographs are a big deal.  So can we just say
> that, and decide to do whatever minimizes homograph issues
> even though it's not the same as what would reflect linguistic
> usage?

John, while I could quibble about your choice of examples (and
assume that someone else will), I think what you are suggesting
is the direction in which Vint, I, and others are headed.

Part of what makes that example important is not "linguistics"
but context: swap 0 and O, or l and 1, in a sentence in a
language that uses Latin script and few people will have any
trouble figuring out what was intended.  Our world is one of short
identifiers that may be linked to words for mnemonic purposes
but are not restricted to words, sentences, or even single or
user-predicted scripts.  Not to pick on Shawn or his company,
but, if I see MICR0S0FT (aka micr0s0ft) in a DNS label, I'm
going to be pretty sure that something odd is going on and start
being cautious.  On the other hand, if I were to see G00GLES
(g00gles), I'd be less certain because those forms might be
someone being cute or even associated with a clever way to
trademark something that, in its more normal spelling, is a
too-common term.

A point raised earlier, I think by Andrew, still stands: we
could reasonably agree that there is a problem and still decide
that the advantages of mitigating it (or the part(s) of it we
could get to) are outweighed by the costs and potential
confusion of doing so, at least beyond figuring out how to warn
people in virtual large type.  

I think we first need to agree that there is a problem (or
conclude that there really isn't).  Personally, I find your
perspective and Vint's much more helpful to that goal than I do
suggestions that we have to accept whatever the Unicode
Consortium has done for linguistic, phonetic, or geopolitical
reasons.

 best,
     john