Unicode position on local mapping

JFC Morfin jefsey at jefsey.com
Thu Feb 19 03:18:35 CET 2009


2009/2/18 John C Klensin <klensin at jck.com>

> > The point is that you assume that characters bijectively fold.
> > This is not the case. "é","è", "ê", "ë" and "e" fold the
> > same: as the same unicode "E".
>
> Well, yes and no.  Yes, it illustrates the problem.  But, while
> we understand this particular mapping  for French (as used in
> France),


As used on QWERTY keyboards. Roughly 48% of French people use accented
upper-case letters.


> and the somewhat more complex case that Cary describes
> for Swedish, it also illustrates the difficulties of trying to
> do these mappings in the protocol, rather than via some set of
> registry conventions or prohibitions.


Correct from the IETF legacy-Internet IDNA perspective.
Not so from an "Internet PLUS IDNA" perspective (cf. final note).

> In particular, please observe:
> (1) While "é","è", "ê", "ë" and "e" all convert* to the same
> upper-case character, Unicode "E" in French as used in France,
> they convert to separate upper-case forms with the respective
> diacritical marks in various other areas that consider
> themselves Francophone.


See above. This relates to French as used on the Internet plus part of the
Francophone world.
Anyway, the point is not only about the French example, but about the French
community having the technical, political, commercial, financial, etc.
capacity to develop something else if it faces something that dissatisfies
its members. This is something we would like to avoid.
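To make this concrete, here is a small Python sketch. The french_fr_upper
function is my hypothetical rendering of the France-style convention of
dropping accents on capitals; it is not Unicode's default uppercase mapping,
which keeps the accents:

    import unicodedata

    def french_fr_upper(s: str) -> str:
        # Hypothetical "France" convention: uppercase, then drop the accents.
        # Unicode's default uppercase keeps them: "é".upper() == "É".
        decomposed = unicodedata.normalize("NFD", s.upper())
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    for ch in ["é", "è", "ê", "ë", "e"]:
        print(ch, "->", french_fr_upper(ch))  # all five print "E"

Five distinct lower-case characters collapse onto a single "E", so no global
rule can invert the fold without extra information.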

> Some or all of them also have different
> upper-case associations with various other Latin-script-based
> writing systems for other languages.  That doesn't make anyone
> "right" or "wrong" except within their own systems, it just
> means that no global, protocol-level rule that recognizes the
> local mapping preference is possible (at least without carrying
> very language/dialect/orthography-specific information in the
> labels).  I hope that, by this time, we all understand why that
> is not practicable.  If that is not true, I need to reopen the
> "alternatives" draft sometime soon.


Maybe we will have to do it, because I am not sure I understand why that is
not practicable.

>        * I am deliberately avoiding terms like "case folding"
>        in this response because they have very specific
>        meanings in Unicode and I want to avoid causing further
>        confusion if possible.


I picked the French language case because it is simple (upper-case letters
are a good example), but the problem is a Unicode case-folding issue. We are
confronted with an incompleteness of a feature of the Unicode standard (in
the way _we_ want to use that standard). It is up to us to analyse it
carefully and fix it.
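For the record, a minimal illustration of the distinction (Python's
str.casefold() implements Unicode full case folding, str.lower() the
LowerCase mapping):

    # LowerCase and CaseFold are different Unicode operations.
    print("ß".lower())      # 'ß'  -- LowerCase leaves the German sharp s alone
    print("ß".casefold())   # 'ss' -- CaseFold expands it, irreversibly
    print("É".lower())      # 'é'
    print("É".casefold())   # 'é'  -- for most Latin letters the two agree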

> (2) Because IDNA ends up with lower-case characters (whether by
> the mappings of IDNA2003 or the prohibition of IDNA2008), the
> observation that, in some contexts, multiple lower-case
> characters that convert to the same upper-case one is not a
> problem as long as one does not conclude that the lower-case
> characters thereby become equivalent.  However, this is
> precisely the problem with doing even case conversion in the
> protocol: if your environment requires that "E" convert to any
> of four characters depending on context and/or external
> information, then



> (i) there is no practical way with IDNA to specify that contextual
> information,


I would not bet on this :-) so I prefer not to challenge people over it.


> (ii) someone else might believe that the only way to get a lower-case
> e-with-acute from an upper case character is for the upper-case one to
> be E-with-acute, and


We obviously all agree that this is _one_ way. Outside of the Internet the
French language survives well enough without that way. This duality adds to
the problem.


> (iii) the very fact that the conversion model is not strictly one-to-one
> means that the Unicode LowerCase operation is unsuitable for your purposes,
> since it can get to a lower case accented character only from an upper case
> character with the same accent.


Correct.
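Indeed, a two-line check (in Python) shows point (iii):

    print("É".lower())  # 'é' -- the accent travels with the character
    print("E".lower())  # 'e' -- a plain E can never yield 'é' through LowerCase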


> Note that this problem is almost identical with the idea of
> converting from Traditional to Simplified Chinese in the
> protocol that got a lot of discussion with IDNA2003.   One
> cannot do it in the protocol because doing  so would require
> both language/locale information (to avoid, e.g., converting
> Japanese Kanji to Simplified Chinese) and because the mappings
> are not strictly one-to-one.  And it seems to me that the right
> protocol solution is the same also: avoid doing anything that
> either loses or makes up information and then leave the rest up
> to local policy makers and registry implementation of their
> decisions.


What belongs to the protocol is to say whether case-folding is enacted or
not, and to support upper-case letters as characters either separate or not
separate from their lower-case counterparts.
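As a side note on the Chinese analogy above, the many-to-one problem fits in
a few lines (the two characters below are a well-known hand-picked pair, not
a real conversion table):

    # 發 (to emit) and 髮 (hair) are distinct Traditional characters that
    # both simplify to 发, so Simplified -> Traditional is ambiguous
    # without language or context information.
    trad_to_simp = {"發": "发", "髮": "发"}
    simp_to_trad = {}
    for trad, simp in trad_to_simp.items():
        simp_to_trad.setdefault(simp, []).append(trad)
    print(simp_to_trad)  # {'发': ['發', '髮']} -- one form, two candidates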

> The situation also points out a difficulty with the Unicode
> position paper and my "include lower case mappings" trial
> balloon:  Perhaps the correct rule is not "LowerCase characters
> for scripts that recognize case" (or even CaseFold them) but
>
>        "if local mappings are to include case mapping
>        operations, then LowerCase is the only appropriate one.


Why ? "église" and "Eglise" (church and Church) are two different words,
while "eglise" has no meaning and does not give any hint.

>
>        If LowerCase isn't appropriate for you, then you should
>        do no mappings at all.  The latter will cause putative
>        domain names or URIs that contain upper-case characters
>        to fail entirely,


Why? The fact that "e", "é", etc. can be related in some way to "E" does not
mean that "E" is not a character in its own right.

>        but at least will not create any
>        ambiguity because the only labels/URIs permitted will
>        have to be U-labels containing the unambiguous
>        lower-case characters."


See above. This actually creates ambiguity.

> >> But your explanation makes part of the point I was trying to
> >> make to Mark and Andrew:  if something we do constrains an
> >> implementation past the point that the implementers (or
> >> relevant policy-makers) consider acceptable, we will see
> >> "solutions" that cause far more "massive" interoperability
> >> problems than a mere few mismatched characters, even if those
> >> lead to false positives.  If we treat a pair of characters
> >> that might be considered the same as distinct,
> >
> > They are distinct as lower cases, the same symbol (not the
> > same character) as upper cases.
> > The whole internationalization doctrine has this problem of not
> > differentiating between the symbol and the character.
>
> Indeed.  A different way to say this is that a given upper case
> character may or may not be "the same as" a given lower case
> character depending on circumstances, even though the symbols
> are obviously different.  But we cannot solve that problem with
> Unicode or, as far as I know, any character coding system yet
> devised that does not provide for extensive metadata to identify
> intent...


I do not think so, because the first "intent" to document is simply the way
in which I used Unicode. That information is protocol information.


> if, in fact, we could even agree that it is a problem.

>...

> >> Without expressing a position as to whether it would be wise
> >> or not, practical or not, this is consistent with the reasons
> >> why many domains have created sunrise or similar mechanisms
> >> to aid with the introduction of new facilities such as IDNs.
> >> And decisions about the relationship between IDNs that contain
> >> decorated characters and the base label string (with only
> >> undecorated ones) will have to be taken in every domain that
> >> uses Latin characters and introduces IDNs.
> >
> > This decision is simple: unicode or not unicode, the language
> > or the keyboard, internationalisation or multilingualisation.
> > The problem is not related to IDNA but to decades of an
> > erroneous internationalisation strategy.
>
> The long-standing model for internationalization (with which you
> may reasonably disagree) is to do the things globally that
> preserve as much information as possible and then sort the rest
> out as a localization model.   It inevitably does not work
> perfectly and that is obviously a problem.


Here, the issue is not internationalization; it is that IDNA is not end to
end and is entropic. In the same way, "oe" and the short space are not on
the QWERTY keyboard because the standardisation process was entropic. The
IDNA process loses information (through the mapping applied before the
Punycode encoding) and has no metadata to document how to restore it.
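A minimal illustration with Python's built-in "idna" codec, which implements
IDNA2003 (strictly, it is the nameprep mapping step before the Punycode
stage that discards the case):

    ace = "Église".encode("idna")   # the mapping lowercases É before encoding
    print(ace)                      # an ASCII xn-- label
    print(ace.decode("idna"))       # 'église' -- the capital is gone for good,
                                    # and nothing in the label records it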

My fear is that this is one of the points "too many" that may upset users
and lead them to consider alternative solutions instead of parallel
interoperable ones.

> For IDNs, the only way that I can imagine to do what you are
> looking for would be to require that every user specify a
> language and locale for every label to be looked up and that
> every file containing a domain name or URI contain that
> information in a standard way so that it could be used in lookup
> protocols.  From a technical standpoint, I think we could figure
> out how to do that, but its impracticality boggles my mind.


What I am looking for is a language/script-transparent end-to-end network
and a multilingualized DNS. I fully understand that, for compatibility
reasons with previous RFCs, this is not something the IETF can directly
document. But I do not understand the rigidity in doing so, when some
minimal flexibility could help reduce the problem (for the French language
and others).

> > However, the solution is not simple. Yet to force people
> > against their right and will is not a good solution because it
> > is not stable.
>
> See above and my earlier comments.
>
> >...
> >>> 2. is not legally acceptable:
> >>>
> >>> --- "école" means school;
> >>> --- "ecole" can be a TM: ex.
> >>> http://www.defl.ca/fr/ecole.html. For other terms the
> >>> accentuated and the non-accentuated terms are different
> >>> words or TMs.
> >>
> >> Here is where you confuse me, or perhaps we confuse each
> >> other. Because of its exact-match properties, the DNS (with
> >> or without IDNs) is notably unsuited to a "do what I mean"
> >> function.  I am probably misunderstanding you, but it appears
> >> from the above that you are expecting a system that will
> >> cause the same pair of labels to be treated as matching under
> >> some circumstances and as not-matching in others.
> >
> > In your terms, yes. In French terms, no.
> >
> > The problem is that "your" approach is based on
> > internationalization (i.e. internationalising English ASCII as
> > a reference) not on multilingualization (each language/script
> > being its own reference).
>
> Maybe.  I would prefer to believe that it is based on the
> assumption that it is possible to do useful things with a single
> international (multiscript/multilanguage) character coding
> system, rather than requiring that every application be language
> and locale-specific.


Correct. But you cannot demand that Unicode deliver things that were not
intended in its charter.


> Unicode is the only well-known instance of
> such a multiscript/multilanguage coded character system, but I
> believe that any theoretical alternative to it would exhibit the
> same properties with regard to this issue.


We were talking about the lingual support layer, on a general basis. All I
am saying is that internationalization, as a proposition centred on a given
language, does not help in evaluating and attending to issues such as this
one, while reasoning from a multilingualization point of view instead seems
to suggest more possibilities to investigate.


> > I do not claim there is not an internationalization
> > impossibility, only that so far it was not found. And I fear
> > that when looking for a solution French engineers, lead users
> > and users will look more for a multilingualization than an
> > internationalization solution. This would technically divide
> > the Internet. Giving the French solution the ability to support
> > multilateralization as a general feature. We both know the
> > outcome: that French internet would have a presentation layer.
>
> That would be necessary, but not sufficient.  You would also
> need to establish sufficient context for the correct
> presentation layer to be chosen, even if, e.g., a French user,
> expecting French-style access to network resources,
> interpretation of names, etc., wanted to access resources while
> using guest equipment on a non-French network.  And a
> multilingual French user would need to be able to access French
> and non-French materials with the presentation interfaces
> appropriate to each and name and context information in those
> presentation layers/interfaces, not further down in the stack.
> I can imagine ways of thinking about those issues, but they are
> not easy or obvious and would require a tremendous amount of
> linguistically and culturally-sensitive information handling and
> access and the information needed to make it all work.
>

Correct if you are thinking of "alternative" solutions; not if you consider
a "parallel" extended-solutions approach, which is what I am aiming at.

>
> >> I don't know how to do that in the DNS
> >> or in any conceivable IDN protocol that rests on top of it.
> >
> > This is the question to be answered. I do not have a global
> > solution either, and this is why I suppose that no-case
> > folding might be the most acceptable patch. But the decision
> > of accepting that patch is not mine: it belongs to the market
> > and to the French national communities.
>
> And, as long as whatever decisions they make are fully informed
> about costs and tradeoffs, I have no problem with that (and they
> would quite properly ignore me if I did).


We have no problem (?) if the Francophonie, governments, France at large,
etc. take the decision.
This will not be the case if some ventures start up and sell competing
"French-speaking Internets".

> >>> 3. would violate the French language and people's equality.
> >>> I do not see it being legally and technically accepted just
> >>> because the IETF did not find a solution.
> >>
> >> I don't understand how one can have a reasonable expectation
> >> of both "A" and "not-A" being true at the same time, which
> >> your comment appears to require unless the situation is to be
> >> treated as unfair and anticompetitive.
> >
> > The problem lies with case-folding being accepted in the DNS
> > technology while the impact of case-folding has not been
> > considered enough.
>
> And that is precisely why the current IDNA2008 model removes all
> mappings from the protocol.


But it does not add the possibility of signalling which type of mapping has
been carried out, if any.

> > We misunderstand. What is anti-competitive is that people not
> > born with accentuated names or whose name is not challenged by
>
> By the same token, the ASCII DNS, as interpreted by most
> application protocols, has always been unfair to people who use
> apostrophes in their names, nor can one use initials offset by
> periods.   Allowing apostrophes and periods in domain names is
> not fundamentally a DNS problem, but the former would cause
> chaos with many command languages and the escapes required by
> the latter would cause massive user confusion.   I suppose you
> can argue that it is unfair; others would claim it is a
> reasonable tradeoff.  Getting from either to "anticompetitive"
> seems like hyperbole to me, but you obviously may disagree.
> Where that discussion really takes us, IMO, is whether it is
> reasonable to expect every valid word in every language,
> and every plausible combination of such words, to be able to be
> accommodated in the DNS.  That is a very difficult goal and it
> is incompatible with other goals.  The question again is about
> the reasonable tradeoff points.


"Technical trade-off reasonability is unfortunately dependent from the
assessment of the effort carried to address the problem. There is a whole
jurisprudence and agreement set about the Technical Barriers to Trade
(TBT/WTO). If some ccTLD or some venture starts proposing a solution, the
impact might be bad on the TLD managers, most of them being totally unaware.

>
>
> > Yes there is at least one : not to create an artificial
> > problem in forcing a language to follow the rules of another
> > one. Please remember that this problem is not related to IDNA
> > but to a wrong doctrine (internationalization) and to the
> > resulting rough AZERTY standard (one could imagine solutions,
> > but that would call on a wide debate of the francophonie).
>
> See above and please remember that the perception that the DNS
> is about "words" or "language", rather than appropriate
> language-based mnemonics, is itself problematic.


Would you bet the market value of "business.com" on this? :-)

>
>
> >...
> > PS. You seem to imply that I would consider challenging IDNA.
> > I remind you that my whole effort is precisely the opposite.
> > To build a better architecture to include and enforce IDNA
> > (and to extend it when needed/possible). This is why I urge
> > this WG to proceed, because we need the IDNA solution to
> > finally settle before we can build around/on top of it and
> > stay fully interoperable.
>
> I understand and appreciate that.  And frustrating as it
> sometimes becomes for us and others, I appreciate your continued
> efforts to engage on these issues and to be clear about the
> problems you see.  But you identified decisions that could be
> made by people who are not under your control and who might be
> less sympathetic to the importance of overall Internet
> interoperability than I believe you are.


This is why, in a network environment, the only solution is to propose
something that meets market/user needs well enough that they do not have to
take such decisions.

Best
jfc