Implementation questions

Mon Dec 22 12:15:28 CET 2008

Mark Davis wrote:
>
>
>   *IDNA Implementation Questions*
>
> In looking at how to implement 2008 (and maintain backward 
> compatibility with 2003), we are wrestling with some practical 
> questions that we'd appreciate feedback on. These are not questioning 
> the spec, but rather trying to see how to apply it to some practical 
> cases we've come up against.
>
>
>     Scenario
>
> Look at the following scenario, where we have three processes that 
> handle an IRI (perhaps just passing it through), with the final one 
> using it to access the DNS. (We'll use the term IRI for both when the 
> domain name labels are in punycode or in Unicode. They aren't 
> necessarily known to be A-Labels or U-Labels at any given point.)
>
> P1 => P2 => P3 => P4 => DNS
This is a very open-ended configuration. But let's shoot.
>
>
>     Questions
>
>
> 1. Suppose that P2 is on Unicode 5.1, and the others are on Unicode 
> 6.0. If P2 does a validity check, then it could prevent a perfectly 
> valid IRI from being correctly looked up. To prevent this problem, 
> does that mean that the best practice is for only P4 to do validity 
> checking? Or should the others do some weaker form of validity 
> checking, like skipping a check for UNASSIGNED?
If P2 checks at all, P2 should check for UNASSIGNED.

If P2 trusts P1 so much that it can skip the check for UNASSIGNED, it 
should trust P1 enough to skip all checks, which violates the scenario 
restriction above.

There is no "perfectly valid" IRI; there are IRIs that are valid under 
Unicode 6.0, and there are IRIs that are valid under Unicode 5.1. For 
that matter, there are IRIs that are valid under Unicode 7.0. We have 
accepted that these can't be looked up until upgrades happen.
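
As a rough illustration of why the answer depends on the checker's own Unicode version, here is a minimal Python sketch. The function name is made up for this example, Python's unicodedata tables stand in for whatever tables a process ships with, and general category "Cn" only approximates the UNASSIGNED class (it also matches noncharacters).

```python
import unicodedata

def unassigned_code_points(label: str) -> list:
    """Return the code points in `label` that the local Unicode tables do not assign.

    Hypothetical helper: category "Cn" is used as a stand-in for UNASSIGNED.
    """
    return [f"U+{ord(ch):04X}" for ch in label if unicodedata.category(ch) == "Cn"]

# The result is only meaningful relative to this version of the tables.
print("Local Unicode version:", unicodedata.unidata_version)

# U+FFFF is a noncharacter, used here only because it can never become assigned.
print(unassigned_code_points("example"))
print(unassigned_code_points("exam\uffffple"))
```

A process built against older tables would report more code points here than one built against newer tables, for exactly the same input.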

This is exactly the same class of processing logic needed if P1->Pn are 
expecting NFC-compliant UTF-8. Either they trust that input is valid, or 
they check.
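
The NFC analogy can be sketched the same way. This is only an illustration; accept_utf8 and its trust_upstream flag are names invented for the example, not part of any spec.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """True if `text` is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text

def accept_utf8(raw: bytes, trust_upstream: bool) -> str:
    """Either trust the sender, or check the input before using it."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on malformed UTF-8
    if not trust_upstream and not is_nfc(text):
        raise ValueError("input is not NFC-normalized")
    return text
```

With trust_upstream set, a decomposed sequence such as "e" followed by U+0301 passes through untouched; without it, the same input is rejected.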

>
> 2. Suppose P3 is a non-IDNA aware process, so IRIs should be converted 
> to Punycode by P2 before sending. Should one do a validity check in 
> P2? How do we avoid problem #1 in that case?
Given the answer to #1, this is not an issue. P2 should check.
>
> 3. The current protocol spec appears to only require validity checking 
> when converting to punycode. So when an IRI is already in punycode 
> (which could have been from IDNA2003 application), it might not  
> undergo any checking at all when going from P1 to the DNS; so 
> everything depends on the registry's doing the right thing. Is it best 
> to check anyway, or does that run into problem #1?
See above about "trust". Either P2 and the later processes trust P1 to 
have done the checks, or they don't. If they do checks at all, they need 
to do all checks.
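
A sketch of what "check anyway" could look like for a label that arrives already in punycode: decode it and run the same checks a Unicode label would get. The function name and the two checks are placeholders chosen for this example (the real validity rules are longer), and Python's punycode codec is used only for the raw decoding step.

```python
import unicodedata

def check_ace_label(label: str) -> str:
    """Decode an 'xn--' label and apply the same checks as for a Unicode label.

    Hypothetical helper; the checks below are a small subset for illustration.
    """
    if not label.lower().startswith("xn--"):
        return label  # plain ASCII label, nothing to decode
    decoded = label[4:].lower().encode("ascii").decode("punycode")
    if any(unicodedata.category(ch) == "Cn" for ch in decoded):
        raise ValueError("label contains a code point unassigned in local tables")
    if unicodedata.normalize("NFC", decoded) != decoded:
        raise ValueError("label is not NFC-normalized")
    return decoded
```

A process that trusts its upstream would simply not call this; one that does not trust it would call it for every label, whichever form the label arrived in.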
>
> 4. If P2 accepts an IRI in Unicode and passes it on to P3 in Unicode 
> (never converting to punycode), should it do any validity checking?
See above.
>
> 5. When a search engine does indexing, it has to map together IRIs 
> that are "equivalent" (resolving to the same logical location). When 
> it provides an IRI to the user for a page, that IRI should go to the 
> indexed page. However, because IDNA2003 and IDNA2008 browsers may go 
> to different places with the same IRI, which do we provide? If we try 
> to test for which browser the user has, that is clumsy and error-prone.
When a search engine does indexing, it maps together a lot more URLs 
than the ones that appear superficially syntactically equivalent.

My personal answer is that the URI (with a punycoded domain name) should 
be provided, because that's what the search engine has actually observed 
to cause the page to be fetched, and reconstructing an IRI from a URI 
is an error-prone process (you basically have to guess).

Given that some people have decided that they want to provide IRIs to 
the user, some A-labels have to be converted back to U-labels. In both 
the cases we have let ourselves be twisted around by (final sigma and 
ß), the punycode form maps back to exactly one U-label.

This U-label will cause a different A-label in an IDNA2003 browser than 
in an IDNA2008 browser. But there's absolutely no ambiguity on which 
U-label to return; there is no possible U-label that one can return to 
an IDNA2003 browser that can cause that browser to go to the IDNA2008 site.

So the answer is simple: Return the A-label form. Preferably by 
abandoning the idea of returning IRIs.

                             Harald