Non-Unicode interfaces to IDNs (was: Re: Unicode 7.0.0, (combining) Hamza Above, and normalization for comparison)

John C Klensin klensin at jck.com
Thu Aug 7 22:29:11 CEST 2014



--On Thursday, August 07, 2014 20:07 +0200 Jefsey
<jefsey at jefsey.com> wrote:

> At 18:09 07/08/2014, John C Klensin wrote:
>> Jefsey,
>> 
>> I am not sure I understand what you are talking about but, if
>> I do, it is about an almost completely different topic.
> 
> John,
> 
> as you may remember I am not interested in Unicode per se. I
> am interested in the open pragmatic use of the digisphere
> through whatever is available and people want to use. This
> calls for external (fringe to fringe) innovations not to be
> constrained by any internal (end to end) MUST. As far as I
understand, the issue being raised concerns orthotypography
in a specific language?

Actually, perhaps the opposite: one view of the problem is a
desire to keep the IDN uses of Unicode language-independent.

> This has been ruled out of the IDNA2008 scope. Because it was
> ruled out of the Unicode scope.

Yes.  And that makes it out of scope for this list, not just
for the Unicode 7.0.0 topic thread.

For whatever it is worth, the Internet is rapidly moving in a
direction in which there will be two kinds of Web browsers:
those that work (more or less exclusively) with Unicode encoded
in UTF-8 and those that don't work with contemporary tools and
facilities.  That set of issues is obviously much broader than
IDNs.   If you feel a need to support or rely on non-Unicode
interfaces, you should probably be standing in front of that
juggernaut, not trying to fine-tune IDNA edge cases.

> Now, what may have to be clarified is what a user calls
> confusable. For users, "confusables" are different codepoint
> sequences that look the same (whatever the reason). If the
> Hamza-added sequences are not creating Internet use confusion,
> we are not concerned.   If they are, we are concerned. However,
> we MUST not decide for others: we are only to give them the
> possibility to decide by themselves.

Then you are discussing yet another problem.   People who survey
the written forms of languages indicate that the language in
which the particular character in question is written (variously
known as Fula, Fulani, Peul, and other names) is almost always
written in this century in Latin script.  I haven't even tried
to do the research but a guess from my recollections about the
history of the area is that Latin script started to predominate
about the time of the beginning of French presence in Northern
Africa.  The Unicode discussion (and various online articles
that may or may not be independent) say that the language is
still written in Arabic script by "Islamists" (a category whose
definition one can guess at but whose meaning in this context is
uncertain).   The first question is whether that community
is likely to be registering and using domain names that use this
particular character.  I have no way to guess the answer to that
question but want to stress that this character is a non-issue
for anyone who reads and writes the Fula language exclusively in
Latin characters (and may be less of an issue for those who
don't read or write, and maybe haven't heard of, that language).

The second part is more complicated.  Even in the Unicode
context as of Version 6.3 and earlier, one can create something
that looks just like this character by using BEH and Combining
Hamza Above.  That is clearly inconvenient and annoying, in the
same sense that it would be inconvenient and annoying to have to
figure out a way to enter U+0063 and then U+0327 any time you
wanted to write "ç" (U+00E7) (I'm assuming a French-relevant
example will work better for you than the usual Swedish one).  
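
To make the two pairs concrete, here is a minimal sketch in
Python (standard library only; it assumes an interpreter whose
Unicode tables include Unicode 7.0, where U+08A1 first appears):

    import unicodedata

    # Two ways to produce what renders as "beh with hamza above":
    atomic    = "\u08A1"        # ARABIC LETTER BEH WITH HAMZA ABOVE
    combining = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

    # The analogous Latin pair for "c with cedilla":
    precomposed = "\u00E7"      # LATIN SMALL LETTER C WITH CEDILLA
    sequence    = "c\u0327"     # "c" + COMBINING CEDILLA

    for s in (atomic, combining, precomposed, sequence):
        print(s, [f"U+{ord(ch):04X}" for ch in s])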

That brings us to the crux of this rather subtle problem.   If
you (or the users you are concerned about) think of "ç" as a
special and unique letter, rather than as a decorated "c", then,
following the Fula example that leads to U+08A1, that letter
should not compare equal, even after normalization, to
the U+0063 U+0327 combination and, unless the latter is
meaningful in some language that does not consider the
combination a letter, the combination should not be allowed at
all... any more than we would expect "w" to compare equal to the
sequence "vv" or "uu".   But, if both U+00E7 and the combining
sequence are allowed (as is the case today) and both are
normalized using the same method, the results will compare
equal... not because they look alike, but because they are the
same character.  Moreover, the IDNA requirement that all
A-labels be in NFC form (which doesn't affect your ordinary
text) effectively makes it impossible to incorporate the
combining sequence into the DNS, so there is no potential for
two representations of the same character that do not compare
equal.
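
For the Latin pair, normalization does exactly what one would
hope.  A minimal sketch in Python, using the standard
unicodedata module:

    import unicodedata

    precomposed = "\u00E7"   # the single code point form
    sequence    = "c\u0327"  # "c" followed by COMBINING CEDILLA

    # The two differ as code point sequences, but NFC makes them
    # identical, which is why they compare equal after normalization:
    assert precomposed != sequence
    assert unicodedata.normalize("NFC", sequence) == precomposed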

However, because of the distinction that was explained earlier,
and perhaps because Arabic script is subject to different rules
in practice than Latin script, the intent (or at least the
effect of letting the existing rules work without introducing an
exception) is that the DNS be able to accommodate both ARABIC
BEH WITH HAMZA ABOVE as a single, atomic character coded as
U+08A1 and ARABIC BEH with HAMZA ABOVE as a two-character
combining sequence with identical appearance and without the two
comparing equal after normalization.  
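
The Arabic pair behaves differently.  A sketch of the same
check (again assuming Unicode 7.0 tables):

    import unicodedata

    atomic    = "\u08A1"        # no canonical decomposition assigned
    combining = "\u0628\u0654"  # BEH + ARABIC HAMZA ABOVE

    # Because U+08A1 has no decomposition mapping, normalization
    # leaves both forms untouched, so the identical-looking strings
    # never compare equal:
    assert unicodedata.normalize("NFC", combining) == combining
    assert unicodedata.normalize("NFD", atomic) == atomic
    assert unicodedata.normalize("NFC", combining) != atomic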

_That_ is the present issue.   

Now, in a world in which one were not using Unicode but instead
used, e.g., national language code pages, both the French and
Fula cases above would be entirely non-issues.   In ISO 8859-1
and its descendants, "ç" appears as a single character, there is
no combining cedilla, and it makes no difference whether that ISO
8859-1 character is mapped to Unicode by using the single code
point or the combining sequence as long as one is consistent
about it.  Similarly, if one created a seven- or eight-bit code
page for Fula written in extended Arabic script, the combining
Hamza Above would probably not be included, the single code
point probably would, and it would make no difference how the
character was mapped to Unicode as long as one was consistent
about it.
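
For the Latin case, that consistency is easy to demonstrate; a
sketch in Python, using its ISO 8859-1 codec:

    # ISO 8859-1 has exactly one byte for "ç"; the codec maps it to
    # the single code point U+00E7, never to the combining sequence,
    # and the round trip is lossless as long as the mapping is
    # applied consistently:
    assert bytes([0xE7]).decode("iso-8859-1") == "\u00E7"
    assert "\u00E7".encode("iso-8859-1") == bytes([0xE7])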

The only problem with the above paragraph is that it is
questionable whether continued use of ISO 8859-1 is viable
today, much less whether there would be international acceptance
of new
code pages.  Whether it is viable today or not, it is clear that
it is getting rapidly less plausible no matter how much you (or
others) might wish for it.

> A digital name supports a semantic address through a group of
> visual signs, whatever the underlying code, version, etc. For
> the time being, the IETF has chosen a single underlying
> typographic code to support the digital names' signs.

This may be a vocabulary issue, but I would associate the term
"typographic code" with the combination of some sort of
reference to a character repertoire (a coded character set,
encoding, and code point would be one such reference, but
definitely not the only possibility) with a reference to a type
style or type family or member thereof (or, less precisely, a
"font").   The IETF has _never_ made such a choice.

>  That
> code does not consider orthotypography (i.e. semantic
> constraints that are language-dependent). So the IETF has
> chosen that its end-to-end protocols are not concerned by
> orthotypographic issues. Tomorrow we can choose an additional
> code to Unicode: we need the use of these two codes to be
> transparent to our own use, independently from any
> orthotypographic issue. This is the metric of our choice.

I think that, in practice, the window on your making and using
that choice is closing rapidly, if it has not closed already.
See above.  The only constraint that existing IETF work is going
to impose on you is that there be an orderly and consistent
mapping
between whatever you decide to use and the Unicode repertoire.
If there is not, you will find that you are introducing your own
form of confusion, one that will be very hard for your users to
understand.   However, if you want to create or adopt a
language-specific or even typography-specific coding, don't let
me discourage you from trying as long as you don't expect most
people on this list (or even me) to discuss it with you.
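
The "orderly and consistent mapping" constraint amounts to a
round-trip requirement.  A sketch of such a check, using a
hypothetical (purely illustrative) two-entry code-page table:

    # Any real table would need the same round-trip property for
    # every code point it covers.
    to_unicode   = {0x41: "\u0041", 0xE7: "\u00E7"}
    from_unicode = {char: byte for byte, char in to_unicode.items()}

    for byte, char in to_unicode.items():
        assert from_unicode[char] == byte  # lossless in both directions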

> 1. TLD managers must be able to use their own add-ons to
> support (or not) orthotypographic aspects in their zone. How do
> you know that you will not create conflicts today with Hamza,
> or in other "equivalent" cases?

Such TLD managers would face at least two problems.  One would
be getting equivalent plug-ins into all of the browsers that
potential users of the TLD (or URIs that include it) might use
or reference.  Otherwise, they would expose users to wildly
inconsistent behavior, which is generally not a good idea.
Reducing the chance of that sort of confusion is one of many
things driving the "UTF-8 only" movement.  The second is that
the TLD manager has very little control over the TLD
environment.  If the plug-ins merely change the appearance of
characters to make them a little more attractive, that might be
a non-problem.  But, if they create distinctions that don't
exist elsewhere or map some characters into others, there is a
lot of potential for very bad things to happen.   In particular,
the propagation of such plug-ins would create a wonderful
opportunity for people with malicious intent because, at least
conceptually, any plug-in that can alter the form of a domain
name or URI can alter it all the way to something completely
different.


> 2. If you consider Hamza you must consider French majuscules.
> Or am I wrong?

You are mostly wrong because the issue with French majuscules is
that, if you had a CCS that distinguished them from conventional
upper and lower case letters, information would be lost in
mapping that CCS to Unicode, causing the "Consistency" condition
mentioned above to fail.  There is no loss of information in
this Hamza case because the mappings to and from Unicode are
trivial (because the discussion is entirely about Unicode) and,
as discussed above, any specialized code page would behave in a
consistent and predictable way.   Interestingly --and part of
the challenge the Unicode Consortium faces-- if normalization
brought U+08A1 together with the combining sequence but the
"separate character" argument continued to hold, the
normalization process would lose the information about whether
the original was that "separate character" or the combining
sequence.  But that situation is much like the counterfactual
c-with-cedilla case mentioned above, not anything to do with
majuscules.
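
The information loss is easy to see in that counterfactual Latin
case.  A sketch in Python:

    import unicodedata

    # After NFC, one can no longer tell which form was originally
    # entered; that is the loss the counterfactual rule would create:
    a = unicodedata.normalize("NFC", "\u00E7")   # started precomposed
    b = unicodedata.normalize("NFC", "c\u0327")  # started as a sequence
    assert a == b == "\u00E7"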

And I suspect this list has now had enough of this discussion.

best,
   john


