Treatment of UNASSIGNED Characters in Unicode

Mark Davis mark at macchiato.com
Sun Dec 21 21:43:41 CET 2008


The text in question was "(something that is now recognized as a
considerable source of risk)".

As the high-order bit, we could fix this textual problem with a few lines in
Rationale that summarize your argument, in a Section X, and then instead of
the above text, we could have "(something that may be a source of risk - see
Section X)". That would address the dangling reference for "now recognized",
and give people some concrete reasoning.

==

As the second-order bit, however, I think the argument you convey is
inconsistent with actual statements in Protocol.

1. If you say that the registries cannot be depended on, then we need to
remove the text that IDNA2008 relies on "the assumption that the names
present in the DNS are valid". See below.

http://tools.ietf.org/html/draft-ietf-idnabis-protocol-08#section-5.1

   Although some validity checks are necessary to avoid serious
   problems with the protocol (see Section 5.5 ff.), the lookup-side
   tests are more permissive and rely on the assumption that names
   that are present in the DNS are valid.

2. Since the onus is then on lookup, we would also need to tighten the
lookup requirements so that they are as strong as the registry requirements.
This would imply that:

a. lookup would be required to test for CONTEXT and BIDI rules.
b. lookup would be required to test A-Labels, rather than allow them to sail
through unchallenged.

http://tools.ietf.org/html/draft-ietf-idnabis-protocol-08#section-5.4

   If the input to this procedure appears to be an A-label (i.e., it
   starts in "xn--"), the lookup application MAY *[test it].*

For example, the A-Label xn--iny-zx5a.com (the pernicious I♥NY.com) could go
sailing through without any checks, even though one of the characters is
DISALLOWED (not just UNASSIGNED). So in particular, suppose I have a process
P1 that takes in IRIs and transforms them to A-Label form. It is not doing a
lookup, so it is not subject to the requirements of Protocol. It passes that
off to a second process P2 that does a lookup. P2 is handed an A-Label, so
it is not required to do any tests.
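
To make the P1/P2 case concrete, here is a minimal Python sketch of what
"testing" an A-Label might involve. It uses only the standard library's
punycode codec and unicodedata module. The function names, and the use of
general category as a stand-in for the DISALLOWED test, are my assumptions
for illustration; the real derivation is in the Tables document and is
more involved. It handles a single label (no ".com").

```python
import unicodedata

ACE_PREFIX = "xn--"


def decode_a_label(label):
    """Decode one "xn--" label to Unicode with the stdlib punycode codec."""
    label = label.lower()
    if not label.startswith(ACE_PREFIX):
        raise ValueError(f"not an A-label candidate: {label!r}")
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def round_trips(label):
    """True if decoding and re-encoding reproduces the same (lowercased) label."""
    decoded = decode_a_label(label)
    encoded = ACE_PREFIX + decoded.encode("punycode").decode("ascii")
    return encoded == label.lower()


def suspect_code_points(label):
    """Return (character, general category) pairs for anything that is not
    a letter, a mark, a decimal digit, or the hyphen.

    Assumption: this is only a rough proxy for DISALLOWED, not the actual
    IDNA2008 table derivation.
    """
    suspects = []
    for ch in decode_a_label(label):
        category = unicodedata.category(ch)
        if ch == "-" or category[0] in "LM" or category == "Nd":
            continue
        suspects.append((ch, category))
    return suspects


def p2_lookup_would_proceed(label, test_a_labels):
    """Hypothetical P2: it inspects the label only if test_a_labels is set."""
    if not test_a_labels:
        return True  # the "MAY test" path: nothing is examined
    return round_trips(label) and not suspect_code_points(label)
```

With test_a_labels=False, xn--iny-zx5a is accepted as-is; with
test_a_labels=True, the decoded heart symbol (U+2665, general category So)
is flagged and the lookup is refused.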

Or P2 could just be picking up IRIs from web pages, XML documents, or any
other sources, where the real originator of the A-Label is obscured, and may
have been an IDNA2003 process, which allows for UNASSIGNED code points in
lookup.

===

And yet I don't want to argue that we have to do #1 and #2. What we all
really would like to do is block an IRI that uses code points that are
UNASSIGNED as of the current, latest version of Unicode, but let through
IRIs that use code points that have been tested by a process that supports a
later version of Unicode. But there doesn't appear to be a feasible way to
do that.
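
As an illustration of why this is version-dependent, here is a small Python
sketch of an UNASSIGNED check on the lookup side. It relies on whatever
Unicode version the local unicodedata module was built with
(unicodedata.unidata_version), and it treats general category Cn as
"unassigned". That is an approximation on my part: Cn also covers
noncharacters, which the IDNA2008 tables classify differently. The function
names are hypothetical.

```python
import unicodedata


def unassigned_code_points(label):
    """List code points that the local Unicode database does not assign.

    Assumption: general category Cn approximates UNASSIGNED; it also
    includes noncharacters, which a real implementation would separate out.
    """
    return [f"U+{ord(ch):04X}" for ch in label
            if unicodedata.category(ch) == "Cn"]


def lookup_permitted(label):
    """Refuse lookup if any code point is unassigned in the local version."""
    return not unassigned_code_points(label)


def local_unicode_version():
    """The Unicode version this process is able to judge code points by."""
    return unicodedata.unidata_version
```

Two processes built against different Unicode versions can give different
answers for the same label, and the process doing the lookup has no way of
knowing which version the originator of the label tested against.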

[The more I think about the application of these constraints in real-life
implementations -- who has to do what, and how we are going to make this work
-- the more my head swims! See the implementation questions note that I
filed yesterday (
http://www.alvestrand.no/pipermail/idna-update/2008-December/003287.html).]

Mark

PS.

John: I noticed an orthogonal textual problem in the statement in #1. The
"(see Section 5.5 ff.)" note sets the reader's expectation that 5.5 will
explain what the "serious problems" are. But it just lists the
requirements, not the "serious problems". Maybe that could be addressed by
moving that parenthetical to after "the lookup-side tests".

On Sun, Dec 21, 2008 at 03:43, Vint Cerf <vint at google.com> wrote:

>
> Mark,
>
> the simplest reason I see for NOT permitting UNASSIGNED
> characters to be included in lookup has to do with our
> inability to assure compliance especially at lower levels of
> the domain name space. Unscrupulous or merely incompetent or
> inattentive registrars (by this I do not mean the ICANN
> definition of "registrar" but rather any entity that places
> domain names into zone files at any level in the system) might
> use (i.e., register domain names with) unassigned characters in
> an attempt to cause confusion or to use misleading
> registrations for abusive purposes. By prohibiting the lookup
> of UNASSIGNED characters, such abuses are blocked.
>
> Since the complete property list for an unassigned code point is
> unknown, and remains unknown until the code point is assigned, we
> can't know whether that code point will
>
>        -- turn out to be DISALLOWED (presumably because it is
>        assigned to a symbol, punctuation, or a letter that
>        decomposes under NFC to some other character)
>
>        -- turn out to be something that requires contextual
>        treatment (i.e., CONTEXTO or CONTEXTJ and which one),
>        much less what the relevant rules would be.
>
>        -- turn out to be PVALID.
>
> In essence, permitting it to be looked up establishes a "PVALID
> until proven otherwise" status, which therefore raises most (but
> not quite all) of the issues associated with changing the status
> of a character from PVALID to DISALLOWED. Something I think
> most of us would not want to facilitate.
>
>
> Vint Cerf
> Google
> 1818 Library Street, Suite 400
> Reston, VA 20190
> 202-370-5637
> vint at google.com