Unicode position on local mapping

John C Klensin klensin at jck.com
Wed Feb 18 23:29:05 CET 2009



--On Wednesday, February 18, 2009 15:56 +0100 JFC Morfin
<jefsey at jefsey.com> wrote:

> Dear John, Sorry, but I am travelling and have a poor view
> of my screen. I hope that this mail will make sense, as I can
> hardly edit my text.

I am having what may be similar visibility/editing problems, but
perhaps we can work through them.

> The point is that you assume that characters fold
> bijectively. This is not the case: "é", "è", "ê", "ë" and "e"
> fold the same, to the same Unicode "E".

Well, yes and no.  Yes, it illustrates the problem.  But, while
we understand this particular mapping  for French (as used in
France), and the somewhat more complex case that Cary describes
for Swedish, it also illustrates the difficulties of trying to
do these mappings in the protocol, rather than via some set of
registry conventions or prohibitions.

In particular, please observe:

(1) While "é", "è", "ê", "ë" and "e" all convert* to the same
upper-case character, Unicode "E", in French as used in France,
they convert to separate upper-case forms with the respective
diacritical marks in various other areas that consider
themselves Francophone.  Some or all of them also have different
upper-case associations in various other Latin-script-based
writing systems for other languages.  That doesn't make anyone
"right" or "wrong" except within their own systems; it just
means that no global, protocol-level rule that recognizes the
local mapping preference is possible (at least not without
carrying very language-, dialect-, or orthography-specific
information in the labels).  I hope that, by this time, we all
understand why that is not practicable.  If that is not true, I
need to reopen the "alternatives" draft sometime soon.

	* I am deliberately avoiding terms like "case folding"
	in this response because they have very specific
	meanings in Unicode and I want to avoid causing further
	confusion if possible.
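
To make the point concrete, here is a small illustrative Python
fragment (mine, and purely a sketch; the translation table is a
hypothetical stand-in for the French-in-France convention, not
anything specified anywhere).  Unicode's default,
locale-independent mappings preserve the diacritics, so the
accent-stripping behavior cannot be derived from the characters
alone:

	for ch in "éèêë":
	    print(ch, "->", ch.upper())  # É, È, Ê, Ë: each keeps its accent

	# The "é/è/ê/ë all become E" behavior is a convention of some
	# Francophone communities, not a property of the characters,
	# so a protocol that sees only the label cannot know whether
	# to apply it:
	FRENCH_FRANCE_UPPER = str.maketrans("éèêë", "EEEE")  # hypothetical
	print("école".translate(FRENCH_FRANCE_UPPER).upper())  # ECOLE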

(2) Because IDNA ends up with lower-case characters (whether by
the mappings of IDNA2003 or the prohibition of IDNA2008), the
observation that, in some contexts, multiple lower-case
characters convert to the same upper-case one is not a
problem as long as one does not conclude that the lower-case
characters thereby become equivalent.  However, this is
precisely the problem with doing even case conversion in the
protocol: if your environment requires that "E" convert to any
of four characters depending on context and/or external
information, then (i) there is no practical way with IDNA to
specify that contextual information, (ii) someone else might
believe that the only way to get a lower-case e-with-acute from
an upper case character is for the upper-case one to be
E-with-acute, and (iii) the very fact that the conversion model
is not strictly one-to-one means that the Unicode LowerCase
operation is unsuitable for your purposes, since it can get to a
lower case accented character only from an upper case character
with the same accent.
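
As a sketch of point (iii), again in Python and again purely
illustrative: the Unicode LowerCase operation can reach an
accented lower-case character only from the identically-accented
upper-case one, so it has no way to express "E lowers to é in
this context":

	assert "E".lower() == "e"  # never "é", "è", "ê", or "ë"
	assert "É".lower() == "é"  # the accent must already be present
	# Unicode keeps the four lower-case characters distinct on the
	# way up as well; only the local convention collapses them:
	assert {c.upper() for c in "éèêë"} == {"É", "È", "Ê", "Ë"}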

Note that this problem is almost identical to the idea of
converting from Traditional to Simplified Chinese in the
protocol, which got a lot of discussion during IDNA2003.  One
cannot do it in the protocol, both because doing so would
require language/locale information (to avoid, e.g., converting
Japanese Kanji to Simplified Chinese) and because the mappings
are not strictly one-to-one.  And it seems to me that the right
protocol solution is the same: avoid doing anything that
either loses or makes up information and then leave the rest up
to local policy makers and registry implementation of their
decisions.
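
A tiny hand-built table (illustrative only; real conversion
tables are, of course, far larger) shows the inversion problem:
both the Traditional character for "to issue" and the one for
"hair" simplify to the same character, so the reverse mapping is
ambiguous without knowing which word was meant:

	TRAD_TO_SIMP = {
	    "發": "发",  # "to issue"
	    "髮": "发",  # "hair" -- collides with the entry above
	}
	simp_to_trad = {}
	for trad, simp in TRAD_TO_SIMP.items():
	    simp_to_trad.setdefault(simp, []).append(trad)
	print(simp_to_trad["发"])  # ['發', '髮']: no unique inverse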

The situation also points out a difficulty with the Unicode
position paper and my "include lower case mappings" trial
balloon:  Perhaps the correct rule is not "LowerCase characters
for scripts that recognize case" (or even CaseFold them) but 

	"if local mappings are to include case mapping
	operations, then LowerCase is the only appropriate one.
	If LowerCase isn't appropriate for you, then you should
	do no mappings at all.  The latter will cause putative
	domain names or URIs that contain upper-case characters
	to fail entirely, but at least will not create any
	ambiguity because the only labels/URIs permitted will
	have to be U-labels containing the unambiguous
	lower-case characters."
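
In code form, that rule might look like the following sketch
(Python; the function name and parameter are mine and purely
illustrative, on the assumption that any local mapping runs in
the user-facing application before IDNA processing begins):

	def prepare_label(label, lowercase_is_appropriate):
	    if lowercase_is_appropriate:
	        return label.lower()  # the only permitted case mapping
	    if label != label.lower():
	        # No mapping at all: upper-case input fails outright
	        # rather than being mapped ambiguously.
	        raise ValueError("label contains upper-case characters")
	    return label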


>> But your explanation makes part of the point I was trying to
>> make to Mark and Andrew:  if something we do constrains an
>> implementation past the point that the implementers (or
>> relevant policy-makers) consider acceptable, we will see
>> "solutions" that cause far more "massive" interoperability
>> problems than a mere few mismatched characters, even if those
>> lead to false positives.  If we treat a pair of characters
>> that might be considered the same as distinct,
> 
> They are distinct as lower cases, the same symbol (not the
> same character) as upper cases.
> The whole internationalization doctrine has this problem of not
> differentiating between the symbol and the character.

Indeed.  A different way to say this is that a given upper case
character may or may not be "the same as" a given lower case
character depending on circumstances, even though the symbols
are obviously different.  But we cannot solve that problem with
Unicode or, as far as I know, any character coding system yet
devised that does not provide for extensive metadata to identify
intent... if, in fact, we could even agree that it is a problem.
>...

>> Without expressing a position as to whether it would be wise
>> or not, practical or not, this is consistent with the reasons
>> why many domains have created sunrise or similar mechanisms
>> to aid with the introduction of new facilities such as IDNs.
>> And decisions about the relationship between IDNs that contain
>> decorated characters and the base label string (with only
>> undecorated ones) will have to be taken in every domain that
>> uses Latin characters and introduces IDNs.
> 
> This decision is simple: Unicode or not Unicode, the language
> or the keyboard, internationalisation or multilingualisation.
> The problem is not related to IDNA but to decades of an
> erroneous internationalisation strategy.

The long-standing model for internationalization (with which you
may reasonably disagree) is to do the things globally that
preserve as much information as possible and then sort the rest
out through localization.  It inevitably does not work
perfectly, and that is obviously a problem.

For IDNs, the only way that I can imagine to do what you are
looking for would be to require that every user specify a
language and locale for every label to be looked up and that
every file containing a domain name or URI contain that
information in a standard way so that it could be used in lookup
protocols.  From a technical standpoint, I think we could figure
out how to do that, but its impracticality boggles my mind.

> However, the solution is not simple. Yet to force people
> against their rights and will is not a good solution because it
> is not stable.

See above and my earlier comments.
 
>...
>>> 2. is not legally acceptable:
>>> 
>>> --- "école" means school;
>>> --- "ecole" can be a TM: ex.
>>> http://www.defl.ca/fr/ecole.html. For other terms the
>>> accentuated and the non-accentuated terms are different
>>> words or TMs.
>> 
>> Here is where you confuse me, or perhaps we confuse each
>> other. Because of its exact-match properties, the DNS (with
>> or without IDNs) is notably unsuited to a "do what I mean"
>> function.  I am probably misunderstanding you, but it appears
>> from the above that you are expecting a system that will
>> cause the same pair of labels to be treated as matching under
>> some circumstances and as not-matching in others.
> 
> In your terms, yes. In French terms, no.
> 
> The problem is that "your" approach is based on
> internationalization (i.e. internationalising English ASCII as
> a reference), not on multilingualization (each language/script
> being its own reference).

Maybe.  I would prefer to believe that it is based on the
assumption that it is possible to do useful things with a single
international (multiscript/multilanguage) character coding
system, rather than requiring that every application be language
and locale-specific.  Unicode is the only well-known instance of
such a multiscript/multilanguage coded character system, but I
believe that any theoretical alternative to it would exhibit the
same properties with regard to this issue.

> I do not claim that an internationalization solution is
> impossible, only that so far it has not been found. And I fear
> that, when looking for a solution, French engineers, lead users
> and users will look more for a multilingualization than an
> internationalization solution. This would technically divide
> the Internet, giving the French solution the ability to support
> multilateralization as a general feature. We both know the
> outcome: that French internet would have a presentation layer.

That would be necessary, but not sufficient.  You would also
need to establish sufficient context for the correct
presentation layer to be chosen, even if, e.g., a French user,
expecting French-style access to network resources,
interpretation of names, etc., wanted to access resources while
using guest equipment on a non-French network.  And a
multilingual French user would need to be able to access French
and non-French materials with the presentation interfaces
appropriate to each, with name and context information in those
presentation layers/interfaces, not further down in the stack.
I can imagine ways of thinking about those issues, but they are
not easy or obvious and would require a tremendous amount of
linguistically and culturally sensitive information handling and
access, plus the information needed to make it all work.

>> I don't know how to do that in the DNS
>> or in any conceivable IDN protocol that rests on top of it.
> 
> This is the question to be answered. I do not have a global
> solution either, and this is why I suppose that no-case
> folding might be the most acceptable patch. But the decision
> of accepting that patch is not mine: it belongs to the market
> and to the French national communities.

And, as long as whatever decisions they make are fully informed
about costs and tradeoffs, I have no problem with that (and they
would quite properly ignore me if I did).

>>> 3. would violate the French language and people's equality.
>>> I do not see it being legally and technically accepted just
>>> because the IETF did not find a solution.
>> 
>> I don't understand how one can have a reasonable expectation
>> of both "A" and "not-A" being true at the same time, which
>> your comment appears to require unless the situation is to be
>> treated as unfair and anticompetitive.
> 
> The problem lies with case-folding being accepted in the DNS
> technology while the impact of case-folding has not been
> considered enough.

And that is precisely why the current IDNA2008 model removes all
mappings from the protocol.

>...
> We misunderstand. What is anti-competitive is that people not
> born with accentuated names or whose name is not challenged by


By the same token, the ASCII DNS, as interpreted by most
application protocols, has always been unfair to people who use
apostrophes in their names, and one cannot use initials offset
by periods either.  Allowing apostrophes and periods in domain
names is not fundamentally a DNS problem, but the former would
cause chaos with many command languages and the escapes required
by the latter would cause massive user confusion.  I suppose you
can argue that it is unfair; others would claim it is a
reasonable tradeoff.  Getting from either to "anticompetitive"
seems like hyperbole to me, but you obviously may disagree.
Where that discussion really takes us, IMO, is whether it is
reasonable to expect every valid word in every language, and
every plausible combination of such words, to be able to be
accommodated in the DNS.  That is a very difficult goal and it
is incompatible with other goals.  The question again is about
the reasonable tradeoff points.

> Yes there is at least one: not to create an artificial
> problem by forcing a language to follow the rules of another
> one. Please remember that this problem is not related to IDNA
> but to a wrong doctrine (internationalization) and to the
> resulting rough AZERTY standard (one could imagine solutions,
> but that would call for a wide debate of the francophonie).

See above and please remember that the perception that the DNS
is about "words" or "language", rather than appropriate
language-based mnemonics, is itself problematic.

>... 
> PS. You seem to imply that I would consider challenging IDNA.
> I remind you that my whole effort is precisely the opposite:
> to build a better architecture to include and enforce IDNA
> (and to extend it when needed/possible). This is why I urge
> this WG to proceed, because we need the IDNA solution to
> finally settle before we can build around/on top of it and
> stay fully interoperable.

I understand and appreciate that.  And frustrating as it
sometimes becomes for us and others, I appreciate your continued
efforts to engage on these issues and to be clear about the
problems you see.  But you identified decisions that could be
made by people who are not under your control and who might be
less sympathetic to the importance of overall Internet
interoperability than I believe you are.

>...

regards,
    john


