IDNA decode?

Mon May 16 17:36:03 CEST 2011

--On Monday, May 16, 2011 14:00 +0200 Simon Josefsson
<simon at josefsson.org> wrote:

> Yoshiro YONEYA <yoshiro.yoneya at jprs.co.jp> writes:
> 
>> On Mon, 16 May 2011 09:20:27 +0200 Simon Josefsson
>> <simon at josefsson.org> wrote:
>> 
>>> I had a feature request [1] regarding converting from IDN
>>> form to Unicode form.  I couldn't find a description how
>>> this is done in the IDNA2008 document set, but I must be
>>> missing it.  Could anyone point me in the right direction?
>> 
>> Section 5.3 "A-label Input" of RFC5891 describing how to
>> convert A-label  into U-label.
> 
> I read that as being part of the Domain Name Lookup protocol?

Yes and no, depending on your reasoning about display.  There
are reasons other than display for converting A-labels to
U-labels.  Consider EAI and other protocols and applications
that involve a strong preference for the use of U-labels rather
than A-labels.

> Converting from IDN form to Unicode form is a display
> operation, and does not necessarily have anything to do with
> lookup.

Yes.   In retrospect, the reverse conversion might have been
spelled out a little better.  But the general design of the
IDNA2008 relationships (and RFC 5891 in particular) is to
establish the equivalence relationship between U-labels and
A-labels.  Conversions are valid if they preserve that
equivalence and invalid otherwise.  Unlike IDNA2003 or your
discussion below, there deliberately is no algorithm in
pseudo-code.
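
As a non-authoritative illustration of that equivalence test (the
RFC gives no code for it), here is a minimal sketch in Python.  It
assumes the standard library's "punycode" codec is an acceptable
Punycode implementation; the function name is hypothetical.  It
checks only the round trip, not the other U-label validity rules.

```python
def a_label_to_u_label(a_label: str) -> str:
    """Decode an A-label and confirm that re-encoding reproduces it."""
    lowered = a_label.lower()
    if not lowered.startswith("xn--"):
        raise ValueError("not an A-label: " + a_label)
    # Punycode-decode everything after the "xn--" prefix.
    # Non-ASCII input or malformed Punycode raises a UnicodeError,
    # which is a subclass of ValueError.
    u_label = lowered[4:].encode("ascii").decode("punycode")
    # Equivalence check: encoding the result must give back the input.
    round_trip = "xn--" + u_label.encode("punycode").decode("ascii")
    if round_trip != lowered:
        raise ValueError("label does not survive the round trip: " + a_label)
    return u_label
```

A caller that wants to keep going on failure can catch ValueError
and fall back to showing the original A-label.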

> Also, the section only covers labels, not entire domains.

That was deliberate.

> For reference the section is:
> 
> http://tools.ietf.org/html/rfc5891#section-5.3
> 
> It is not clear from that section, but what could be done is
> something like this:

(I've numbered your steps for convenience of discussion.)

>  (1) Check that the domain name and labels follows RFC 1034
>      (with updates) and split it into labels, make a note
>      whether it ends with a '.' or not.
>
>  (2) For each label L do
>    (2.1) If the label does not begin with 'xn--' do nothing
>    (2.2) If the label begins with 'xn--' then do
>      (2.2.1) convert label to lowercase
>      (2.2.2) XXX are tests in section 5.4 and 5.5
>              needed for display?
>              the last section of 5.3 suggests it is a SHOULD
>      (2.2.3) remove the 'xn--' part
>      (2.2.4) perform punycode decode on the remaining part
>              XXX what should be done if this step fails?
>      (2.2.5) replace the original label with the decoded label
>    (2.3) assemble the domain name from the labels,
>          adding the final '.' if present in the input.
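
The quoted steps can be sketched roughly as follows.  This is an
illustration only, in Python, using the standard library's
"punycode" codec.  The function name is hypothetical, the RFC 1034
syntax check from step (1) and the tests of 2.2.2 are omitted, and
keeping the original label when decoding fails is just one possible
answer to the open question in 2.2.4.

```python
def idn_to_unicode(domain: str) -> str:
    """Convert 'xn--' labels of a dotted domain name to Unicode."""
    # (1) Split into labels and note whether the name ends with '.'.
    trailing_dot = domain.endswith(".")
    body = domain[:-1] if trailing_dot else domain
    result = []
    for label in body.split("."):
        lowered = label.lower()  # (2.2.1), done early so 'XN--' also matches
        if not lowered.startswith("xn--"):
            result.append(label)  # (2.1) leave other labels alone
            continue
        try:
            # (2.2.3) drop the prefix, (2.2.4) Punycode-decode the rest.
            decoded = lowered[4:].encode("ascii").decode("punycode")
        except UnicodeError:
            # Assumption: on failure, keep the label exactly as received.
            decoded = label
        result.append(decoded)  # (2.2.5)
    # (2.3) Reassemble, restoring the final '.' if it was present.
    name = ".".join(result)
    return name + "." if trailing_dot else name
```

For example, idn_to_unicode("www.xn--bcher-kva.example.") gives
"www.bücher.example." with the trailing dot preserved.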

Yes, more or less.  I think that you are missing a bit for a
cautious and effective implementation.  What follows is personal
opinion about what I would consider and should not be taken as
authoritative in any way (the procedural statements might
reasonably be read as "MAY").

	(i) Note that (1) may not be possible as stated because,
	under some circumstances, you will get the domain name
	only as length-label pairs.  If you do, 1034/1035 tell
	you how to build a dot-separated name, but you have no
	way to know whether a trailing dot was intended.  You
	simply need a convention.

	(ii) Add a test after 2.1: "if the third and fourth
	characters of the label are "--" and the first and
	second are not "xn", balk".  The definition/specification
	of "balk" is outside the standard but, in many cases,
	this rather trivial test is going to be a good idea.

	(iii) If the environment is doing mapping and that
	mapping includes different label separators, add a step
	after 2.3 to substitute the local label delimiters for
	".".  Similarly, if the environment is nominally right
	to left, reorder labels to match local conventions.
	Note that both of these things require information that
	needs to be supplied out of band: neither can be
	deduced with 100% reliability from the FQDN or any of
	its labels.

	(iv) 2.2.2(XXX)  Up to you.  See below.

	(v) 2.2.4(XXX).   Not quite certain I understand your
	reference and question here.  The last sentence of 5.3
	is a very clear "MAY" and 5.5 is the Punycode conversion
	itself.  See below and the discussion of "balk" in (ii)
	above.
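
Points (i) and (ii) above might look like this in Python.  Both
function names are hypothetical, "balk" is arbitrarily taken to
mean "raise ValueError", and the wire-format reader assumes an
uncompressed name (no compression pointers) with ASCII labels.

```python
def check_hyphen_positions(label: str) -> None:
    """(ii) Balk if characters 3 and 4 are '--' but the label lacks 'xn'."""
    lowered = label.lower()
    if lowered[2:4] == "--" and lowered[:2] != "xn":
        raise ValueError("unexpected '--' in positions 3-4: " + label)


def wire_to_dotted(wire: bytes, trailing_dot: bool = True) -> str:
    """(i) Build a dot-separated name from length-label pairs.

    The wire form cannot say whether a trailing dot was intended,
    so the caller picks the convention through trailing_dot.
    """
    labels = []
    pos = 0
    while True:
        if pos >= len(wire):
            raise ValueError("name is missing its terminating zero octet")
        length = wire[pos]
        if length == 0:
            break  # the zero-length root label ends the name
        if length > 63:
            raise ValueError("compression pointer or bad length octet")
        label = wire[pos + 1:pos + 1 + length]
        if len(label) != length:
            raise ValueError("label is truncated")
        labels.append(label.decode("ascii"))
        pos += 1 + length
    name = ".".join(labels)
    return name + "." if trailing_dot else name
```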

Several of the questions you seem to be asking come down to "how
much validation is necessary" or "how much validation should
actually be done".  The answer is that IDNA2008 deliberately
doesn't specify those answers.  You need to do a threat analysis
about the environment: how much do you trust what you find in
the DNS and what would be the costs if something dubious or
invalid were snuck in there.  I tend to err on the side of
either "trust but verify" or "trust no one", but I might change
my mind and take a chance in an environment that was severely
constrained for processing time or memory or where I had some
out of band knowledge that what was in the DNS could be trusted.

I think it is worth noting that a lot of folks around ICANN and
elsewhere seem to assume that there is a step between (2) and
(3) to verify that each label contains characters from only one
script and that all labels represent exactly the same script.
There is nothing to justify either of those requirements in
IDNA2008 or 1034/1035, but that doesn't prevent an expectation
that the DNS (or the programs that call it) will enforce various
people's business plans or moral principles.

If I were implementing this as a subroutine library, I think I'd
consider providing some extra arguments to pass some of the
problem out to the application (which presumably has hope of
knowing what is going on with the user): selection between
extensive and minimal validation, alternate label separators (if
one decides to tolerate those), and maybe even whether the
environment expects a special label ordering and which one.
But that is about implementation and adaptability of the
implementation to various risks and needs, not about what
IDNA2008 requires.
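
One possible shape for such a library interface, sketched in
Python.  Every name here (DecodeOptions, its fields, decode_domain)
is invented for illustration and nothing in it comes from IDNA2008.
"Strict" is reduced to a Punycode round-trip check, and label
reordering is a plain reversal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecodeOptions:
    strict: bool = True           # extensive vs. minimal validation
    extra_separators: str = ""    # e.g. "\u3002", if tolerated on input
    reverse_labels: bool = False  # environment wants reversed label order


def decode_label(label: str, strict: bool) -> str:
    lowered = label.lower()
    if not lowered.startswith("xn--"):
        return label
    try:
        decoded = lowered[4:].encode("ascii").decode("punycode")
        re_encoded = "xn--" + decoded.encode("punycode").decode("ascii")
        if strict and re_encoded != lowered:
            raise ValueError("round trip mismatch: " + label)
    except ValueError:  # UnicodeError is a subclass of ValueError
        if strict:
            raise
        return label  # minimal mode: show the label as received
    return decoded


def decode_domain(domain: str, options: Optional[DecodeOptions] = None) -> str:
    opts = options or DecodeOptions()
    # Map any tolerated alternate separators to ".".
    for sep in opts.extra_separators:
        domain = domain.replace(sep, ".")
    trailing_dot = domain.endswith(".")
    body = domain[:-1] if trailing_dot else domain
    labels = [decode_label(label, opts.strict) for label in body.split(".")]
    if opts.reverse_labels:
        labels.reverse()
    name = ".".join(labels)
    return name + "." if trailing_dot else name
```

The application, which knows its users, picks the options.  The
library itself makes no guess about separators or ordering.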

> However maybe some other part of the documents already
> explains this process.  Or is display of IDN's intentionally
> left outside the scope of IDNA2008?

If it was anywhere other than in 5891, it would be in 5894,
which is not normative.   The interesting question, IMO, is
whether it would be helpful (and uncontroversial) to insert some
of the discussion above --still non-normatively-- into 5894bis
if there ever were such a thing.

   john