Message-Id: <6.1.2.0.2.20050212152430.04519eb0@mail.jefsey.com>
Date: Sat, 12 Feb 2005 23:03:21 +0100
To: John C Klensin <john-ietf@jck.com>
From: "JFC (Jefsey) Morfin" <jefsey@jefsey.com>
In-Reply-To: <175DC8BA0617C083CBD9B0B7@scan.jck.com>
References: <E1CzfLQ-0004mH-00@mx06.mrf.mail.rcn.net>
	<200502112102.34383.blilly@erols.com>
	<175DC8BA0617C083CBD9B0B7@scan.jck.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed
Cc: ietf@ietf.org
Subject: Re: IDN security violation? Please comment
Precedence: list
Sender: ietf-bounces@ietf.org
Errors-To: ietf-bounces@ietf.org

John,
Maybe some analysis will help to structure the debate. Lingual digital
relations are supported through three layers: (1) computer interoperability,
(2) human interintelligibility, (3) human interface.

- at layer 2 relations are brain to brain and support interintelligibility 
in using written languages. The scripts of these languages are supported 
through the Unicode system and are to be tagged for computer recognition.
- at layer 1 relations are end to end and support interoperability in
using protocols with various digital, hexa, 7 or 8 bits coding and
parameter systems registered with the IANA.

One of these protocols is the DNS, which uses a "-.0Z" numbering plan within
the 7-bit area, simplifying its human use by reference to the universally
used Arabic 0-9 digits and the internationally used Roman A-Z characters.
This also permits an easy bridging with other plans restricted to 0-9, O-B, 
or 0-F and the direct support of telephone numeric names. It has a direct 
total or partial mnemonic capacity for persons having English, Latin or 
Latin scripted languages.

Internationalization, at the end to end layer (punycode in the DNS case;
not defined for the email LHS), permits support of multilingualization at
the brain to brain layer and provides the same mnemonic capacity to people
having other languages. Vernacularization is the process which permits
human interfaces and application processes to take full advantage of
multilingualization, in usage cases ranging from language menus or combos
to full IRI support.
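
For the DNS case, the punycode mapping can be observed with a few lines of
Python. This is only a sketch: it assumes Python's built-in "idna" codec
(IDNA 2003 as in RFC 3490), and the helper name to_ace and the sample names
are chosen here for illustration. It shows that a name with a non-ASCII
label is rewritten, while a plain ASCII name comes out unchanged.

```python
# Sketch only: relies on Python's built-in "idna" codec (RFC 3490 era IDNA).
# The function name and the sample names below are illustrative assumptions.

def to_ace(name: str) -> str:
    """Return the ASCII-compatible (punycode) form of a domain name."""
    return name.encode("idna").decode("ascii")


if __name__ == "__main__":
    # A pure ASCII name is not affected by the conversion.
    print(to_ace("example.com"))
    # "b\u00fccher" is "bucher" with u-umlaut; its label gains an "xn--" prefix.
    print(to_ace("b\u00fccher.example"))
```

The second call is where the multilingualization layer becomes visible; in
the first it is transparent.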

A common problem is to overlook the multilingualization layer because it is
transparent in English (an ASCII string is not affected by punycode). This
layer violation creates the security violation under discussion. Here, the
layer violation is Verisign's disregard of the ICANN requirements (at the
multilingualization layer) that IDNs be registered using codes from a
single language Table.
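
A registration check against a single language Table could look roughly
like the sketch below. The two tables are hypothetical and deliberately
tiny (real IDN language Tables are published by registries and are much
richer); the names TABLES and tables_covering are assumptions made for
this illustration only.

```python
# Hypothetical demo tables: NOT real registry language Tables.
TABLES = {
    "latin-demo": set("abcdefghijklmnopqrstuvwxyz0123456789-"),
    # U+0430..U+044F are the basic Cyrillic small letters.
    "cyrillic-demo": {chr(c) for c in range(0x0430, 0x0450)} | set("0123456789-"),
}


def tables_covering(label, tables=TABLES):
    """Return the sorted names of tables containing every character of label."""
    chars = set(label.lower())
    return sorted(name for name, allowed in tables.items() if chars <= allowed)


if __name__ == "__main__":
    # All Latin letters: covered by one table, so registrable under this rule.
    print(tables_covering("paypal"))
    # Second letter replaced by Cyrillic U+0430: no single table covers it.
    print(tables_covering("p\u0430ypal"))
```

A label for which the result is empty mixes codes from several tables and
would be refused under a "single Table" rule.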

This common oversight of the multilingualization layer is aggravated by the
proposal of a unique internationalization layer langtag (independent
of IDN language Tables) used where it does not belong: to describe all the
vernacular views of a language.

IMHO, a correct generalized approach to multilingualism on the Internet
consists in structurally acknowledging the three layers, so that users can
be told clearly which exact context they are in. This should be based upon
a language tag with five constructors (lang5tag):

- three internationalization layer descriptors. They are used to register
the IDN Tables: the language, the script and the domain of use. RFC 3066
defines the use of ISO 639 codes for the language. RFC 3066bis proposes
to use the codes of ISO 3166 for national domains and ISO 15924 for the
scripts. This is a basically correct proposition; there are more general
and more precise sources if needed.

- a multilingualization layer descriptor: the authoritative reference for 
the considered view of the language.

- a vernacularization layer descriptor: the style, that is the environment 
of the considered application (protocol, administrative, familial, formal, 
commercial, SMS, adult, etc.)
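
The five constructors listed above can be pictured as a small record. This
is a sketch under stated assumptions, not a proposed syntax: the class name
Lang5Tag, the field names, the "-" separator and the sample values are all
invented here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lang5Tag:
    """Illustrative five-part language tag (hypothetical layout)."""
    language: str  # internationalization layer: ISO 639 code
    script: str    # internationalization layer: ISO 15924 code
    domain: str    # internationalization layer: ISO 3166 code
    referent: str  # multilingualization layer: authoritative reference
    style: str     # vernacularization layer: environment of the application

    def render(self) -> str:
        # The "-" separator is an assumption; no serialization is defined.
        return "-".join(
            (self.language, self.script, self.domain, self.referent, self.style)
        )


if __name__ == "__main__":
    # Sample values are placeholders, not registered codes for the last two.
    tag = Lang5Tag("zh", "Hans", "CN", "ref", "sms")
    print(tag.render())
```

Keeping the five parts as separate fields lets an application compare the
three internationalization descriptors against an IDN Table while treating
the referent and the style independently.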

This lang5tag should be part of the IRI description, and supported by an
icon to be shown in the browser bar. An example: if you send a mail that
your boss's secretary will print and place in his daily folder, you may
want him to know you sent it from a Chinese mobile rather than from your
English text processor. An ISO 7000 conformant glyph system can probably
be designed.

jfc


On 15:00 12/02/2005, John C Klensin said:
>--On Friday, 11 February, 2005 21:02 -0500 Bruce Lilly
><blilly@erols.com> wrote:
>
> > While I do not dispute that some mobile devices might use some
> > subset of some version of Unicode for text in some languages,
> > my point was, in response to John Klensin's "Until and unless
> > every one of us has a keyboard that permits easy input of
> > every Unicode character", that not only do I not expect to
> > have a keyboard permitting *easy* entry (no, that doesn't mean
> > "Grafiti" or "Decuma") of *every* Unicode character any time
> > soon, I don't expect it *ever*, because the Unicode code space
> > is expanding (in contradiction to the original Unicode Design
> > Principles) faster than the available memory space on
> > low-power, compact, mobile devices.
>
>Bruce (and others),
>
>You can argue and pick at this interminably, but I think you are
>missing the key point.
>
>There is, IMO, an extremely strong argument for saying
>
>         "Look, DNS names, and DNs as used in X.509 certs, are
>         ultimately protocol identifiers.  Safe and stable
>         operation of the Internet requires that protocol
>         identifiers be written in a small, restricted, generally
>         recognized, and easily distinguishable, set of
>         characters.  And everyone who has studied which
>         characters to use when the principles of "protocol
>         identifiers" and " statements are applied, including our
>         very internationalization-conscious friends at the ITU,
>         have concluded that the right characters are a subset of
>         those in the Roman-based script family.  The subset seem
>         to always be "without the embellishments of diacritical
>         marks or other embellishments".  It is almost always
>         defined in terms of case-independent matching rules or
>         in terms of only a single case being permitted -- more
>         often upper historically, although there are some
>         substantive arguments for lower. "
>
>The choice of Roman characters is ultimately based on the
>observation that, while there are several _languages_ that are
>more  widespread than English, nothing in the above says
>anything about English.  Those Roman-based characters are, for
>one reason or another, used, either as a primary or a secondary
>script, by more languages and people than everything else in the
>world put together.  That contributes significantly to
>"recognizable", which is an important criterion.
>
>And neither the "protocol parameter" argument, nor the argument
>that more characters would lead to more opportunities for
>confusion, came as a surprise to the IETF community
>within the last week or two.  Both arguments were raised,
>passionately and at great length, when the IDN effort was first
>coming together.  They were raised on the IETF list, on more
>than one WG list, in BOFs, etc.
>
>There is a second argument that can be made with equal strength.
>People like to write their names correctly.  Inability to do
>that is a profound source of irritation (at least) and was
>important enough, even in the 60s, to influence the way
>characters are handled in important operating systems to this
>day.  More generally, people prefer that the identifiers they
>pick have mnemonic value to them, and that means the ability to
>pick those identifiers based on their languages and scripts.
>Please note that argument applies at the geek interface level;
>we don't need to get up to the user interface one to make it.
>When we do get to the user interface and start worrying about
>non-expert would-be users of the Internet, we immediately
>encounter some very passionate, and almost certainly correct,
>arguments that users should be able to deal with, and navigate,
>the Internet and do so completely in their own languages and
>scripts.
>
>The problems with that argument, including opportunities for
>deliberate or accidental confusion among similar-looking
>characters, also come as no surprise to the IETF.  Like the
>"protocol parameter" position, they were discussed openly and at
>great length, with examples, many years ago.
>
>With both of those arguments in hand, and with the problems with
>each at least moderately well understood, the IETF (or at least
>everyone who could be persuaded to pay attention) made a
>decision.  That decision was made years ago and under
>considerable marketplace pressure, that, for the particular set
>of issue areas that included DNS names, the second set of
>arguments -- that accessibility in "native scripts" (and Unicode
>in particular) was more important than the "protocol
>identifier" argument -- were the dominant ones and that we needed
>to do this.   By implication at least, we decided that we would
>need to accept and understand the problems that decision caused
>and deal with them.
>
>There was another group of questions, which is the more
>complicated piece of the issue.  The obvious way to get the
>right functionality is not necessarily the best one.  There is a
>nasty tradeoff between techniques that can, at least in theory,
>be deployed quickly and ones that are likely to take longer but
>might be more satisfactory in the long term.   There is another
>nasty tradeoff between making something work well for the people
>who know that they need it and are willing to make an investment
>in conversion and upgrading of systems to get it versus making
>it work reasonably well (and perhaps more quickly) for everyone.
>
>Again, the IETF made decisions on those points.  My personal
>view is that some of those decisions were not especially
>well-informed and may even have been wrong, but they were
>decisions made in the community and made after the dissenting
>views were strongly expressed.
>
>So, today, we've got IDNs and IDNA.   Even if one believes that
>the _only_ reason for standardizing them is to provide a common,
>interoperable, way of doing something that people will clearly
>do somehow, the standards seem justified.  (For the record, I do
>not subscribe to the "that is the only reason for a standard"
>position in this case.)   I see no way to go back, even if we
>wanted to, and reestablish the "protocol parameter" argument for
>the DNS.
>
>So we are down to some serious and important questions --  but,
>again, ones that are neither new nor surprising.   In
>particular, since you and others have picked up bits from my
>earlier notes and interpreted them (I'm sure unintentionally)
>differently from what I intended:
>
>(i) The observation about YAH00 versus yah00 wasn't intended to
>say that a lower case test would solve very many problems.  It
>was only to point out that the particular YAH00 example wasn't a
>particularly good one, since it could be detected by the most
>trivial of tests.  I agree that test is not likely to be
>effective against a determined attacker or more clever examples.
>
>(ii) I have never argued that the "one label one script"
>requirement that Mark Davis and others have suggested is without
>value.  My comment was only that a requirement of that type was
>going to be a little harder to apply --in many cases and
>consistently-- than a casual reader might assume.  None of this
>is easy.  Life is hard.
>
>(iii) The observation about "...easy input of every Unicode
>character" is not, in any respect, an attempt to get us back to
>protocol identifiers.  It was, instead, about one of the more
>subtle questions associated with the IDNA story.  IDNA's most
>passionate advocates are convinced that, once a sufficient
>deployment level is achieved, no one will need to look at the
>internal, "punycode" form of IDNs, but will see only the "native
>character" form.  Others of us are convinced that user-visible
>punycode will be around forever, just as user-visible URLs will
>be.  We believe that will be driven partially by security
>concerns (I can more accurately compare two punycode strings by
>eyeball than I can a pair of arbitrary "native character"
>strings).  We believe that the difficulties you might have
>reading an IRI that contains an unfamiliar script out of a
>printed article or sign and typing it into a computer will cause
>you to wish that the punycode representation were readily
>available, because "recognize the character and then figure out
>how to key it in" is likely to be an insurmountable pair of
>problems.   The issue isn't one of the expansion of Unicode or
>how many keystrokes are needed: if you can identify the
>character, any BMP Unicode character can be keyed in a little
>over four keystrokes, and non-BMP characters don't take many
>more (the "little" is determined by whatever you need to do to
>indicate that characters are being specified by offset).  The
>issue is recognizing the character accurately in the first
>place.  The cell phone story is equally unimportant because the
>first step in that story is identifying the right language so as
>to permit you to pick up the right phone (or switch it into the
>right state).  Language identification may or may not be harder
>than character identification, but it isn't likely to be easy in
>the general case.  Without language identification, you are back
>to character identification and four (or five or six) digit
>offsets.
>
>(iv) The TLD managers worldwide are not crying "please protect
>us from IDNs" and this latest "discovery" is unlikely to change
>that.  What they are saying is "we want and need to implement
>IDNs, please help us understand how to do that safely".   The
>answer to that question doesn't require "regulation" from on
>high.  It does require getting and sharing a much more subtle
>understanding of the issues, options, and tools than we have so
>far been successful in communicating.  IMO, the IETF should be
>putting energy into those issues and tools --and to alternatives
>to the use of DNS names (with IDNA) when that is appropriate.
>But efforts to move in those directions have gotten zero
>traction.   _That_ is, IMO, our problem, not whether we can turn
>back the clock and make a "protocol parameter" decision (or turn
>it back even further and reduce the number of scripts and
>characters in the world by several orders of magnitude).
>
>This isn't easy.  It is never going to be easy.  It poses
>opportunities for various nasty behavior that are harder to
>detect and defeat than in a hostname/LDH-only world. The easiest way
>to get ourselves into trouble is probably to pretend it is easy
>and ignore the hard, risky, or edge cases.  We need to learn to
>cope: wishing for an easier and more homogeneous world or easier
>times generally, or wishing that an irreversible decision be
>reversed, won't get us much of anywhere, no matter how
>passionately those wishes are made.  And, like it or not, we are
>at least as much risk of fragmenting the Internet by
>appearing to say "no" to some languages or scripts as we are
>from confusion among characters in well-thought-out
>internationalization efforts.
>
>      john
>
>
>
>_______________________________________________
>Ietf mailing list
>Ietf@ietf.org
>https://www1.ietf.org/mailman/listinfo/ietf

