Rationale problems

Fri Dec 5 22:30:11 CET 2008

Mark

On Thu, Nov 27, 2008 at 13:07, John C Klensin <klensin at jck.com> wrote:

> Suggestions for which comments do not appear below have been
> responded to and incorporated into the document, although not
> always exactly as suggested.  See the comment about "harmless"
> in an earlier note for several of these -- if people don't like
> suggested rephrasing or other changes, they should object
> on-list in a timely fashion.  And, whether they like the precise
> changes suggested or not, they should check the document when it
> appears to be sure they are comfortable with it.
>
>
> --On Thursday, 20 November, 2008 13:12 -0800 Mark Davis
> <mark at macchiato.com> wrote:
>
> >...
> > ------------------------------
> >
> > Some
> > characters are sufficiently problematic for use in IDNs that
> > they should be excluded for both registration and lookup
> > (i.e., IDNA-conforming applications performing name lookup
> > should verify that these
> > characters are absent; if they are present, the label strings
> > should be rejected rather than converted to A-labels and
> > looked up).
> >
> > =>
> >
> > Some characters are inappropriate for use in IDNs and are thus
> > excluded for both registration and lookup (i.e., IDNA-
> > conforming applications performing name lookup should verify
> > that these characters are absent; if they are present, the
> > label strings should be rejected rather than converted to
> > A-labels and looked up). Some of these characters are
> > problematic for use in IDNs (such as the FRACTION SLASH
> > character), while some of them (such as the HEART symbol)
> > simply fall outside the conventions for typical identifiers
> > (basically letters and numbers).
> >
> >
> > *Rationale. *For only a minuscule fraction of the characters
> > that were unmapped in IDNA2003 and are illegal in IDNA2008 is
> > there any evidence of being "problematic". Also the "should
> > be" language is more appropriate for a proposal, not
> > describing the current spec. The above wording makes clear the
> > main reason for this break in compatibility, while noting the
> > problematic nature of some characters. Also supplies some
> > concrete examples (we could use more of that!).
>
> While I've made this change almost as suggested, people should
> check it carefully, since "problematic" is somewhat in the eye
> of the beholder and, more important, in operational experience
> with IDNs.  I also note that I've had as many requests to take
> examples out (because they can be misleading and/or because
> people disagree with the exact ones used) as I have to include
> more of them.  Either the WG needs to make up its collective
> mind, or I will keep applying my editorial judgment.
>
> >...
> > ------------------------------
> >
> > If
> > a character is classified as "DISALLOWED" in error and the
> > error is sufficiently problematic, the only recourse would be
> > either to introduce a new code point into Unicode and classify
> > it as "PROTOCOL-VALID" or for the IETF to accept the
> > considerable costs of an incompatible change and replace the
> > relevant RFC with one containing appropriate exceptions.
> >
> > =>
> >
> > If a character is classified as "DISALLOWED" when it should
> > not have been, there are two possible recourses:
> > (a)
> > Replace the relevant RFC with one containing appropriate
> > exceptions, accepting a change that many people feel to have
> > considerable costs. (b) Propose a new character in Unicode
> > that would be identical (except for its behavior in IDNA),
> > which would be extremely unlikely to be accepted, since it
> > would violate Unicode and ISO policies on duplicate encoding.
>
> People have complained about statements that appear to predict
> what the IETF will do in the future or to commit it to future
> actions and those statements have been removed or corrected as
> they have been identified.  It is, IMO, much less reasonable to
> make statements that predict or attempt to constrain future
> Unicode or ISO policies.  If the WG wants this sort of
> statement, I believe it should be made only as a direct and
> attributable quotation, e.g., "Mark Davis, speaking for the
> Unicode Consortium and ISO/IEC JTC1/SC2, says 'such a proposal
> would be extremely unlikely...' [Ref to date/time/place of
> statement]".

I actually don't like (b) either. The only reason I propose it is that your
original text was even more problematic, since it would give the completely
false expectation that the Unicode consortium would be open to that.

If you aren't going to make this change, then I'd suggest the following
text:

If a character is classified as "DISALLOWED" when it should not have been,
it is possible to replace the relevant RFC with one containing
appropriate exceptions. However, many people feel that this recourse
has considerable costs.

>
> There are some similar statements in the documents that we need
> help identifying so that they can be removed or converted to the
> above form.
>
> > *Rationale.* Reflect reality. There is no particular consensus
> > that changing DISALLOWED to PVALID would in fact have such
> > costs.
>
> I note that the claim that it is reasonably safe to change
> DISALLOWED to PVALID has been suggested several times and has
> gotten little or no traction in the WG or at its meetings.   If
> anything, there is rough consensus in the other direction (not
> just "some people do feel..."). This one is up to Vint but,
> absent instructions from him, I do not believe it is appropriate
> to change this section (or the corresponding text in Tables) in
> the direction of weakening the requirement.

I may be just banging my head against a brick wall here, but nobody has been
willing to step up to the plate to say that "this causes me problems because
of situation X". No concrete examples have been cited. And if you can't give
even one single example of this being a problem, then you *at least* should
qualify it to indicate that it is an opinion.

>
> > However, because some people do feel that is to be the
> > case, we can certainly reflect that opinion here. We also
> > don't want to hold out false hope that Unicode/ISO would do a
> > duplicate encoding just for the purpose of IDNA.
>
> > ------------------------------
> >
> > For
> > example, it is generally believed that labels containing
> > characters from more than one script are a bad practice
> > although there may be some important exceptions to that
> > principle.
> >
> > =>
> >
> > For example, labels containing characters from more than one
> > script are problematic where those characters can cause
> > problems of visual confusion, such as using a Cyrillic
> > character for "a" -- which looks exactly like a Latin "a" --
> > in the midst of an otherwise Latin label. In other cases,
> > mixing scripts may be perfectly acceptable, such as using
> > Latin letters in the midst of Chinese characters.
> >
> > *Rationale. *Use concrete examples of problems.
>
> See comment above about examples.  Since confusion is not the
> only criterion, what is, or is not, appropriate in mixing
> scripts is very much a local judgment, I don't believe we should
> go much further than the existing text unless there is fairly
> clear WG consensus (and maybe some evidence of registry/operator
> consensus) to do so.

Again, unless you can cite at least one concrete example of a problem, it is
spurious to make this kind of "problematic" claim.
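
As an illustration of the Cyrillic "a" example quoted above, here is a
minimal sketch in Python. It is a rough heuristic, not anything taken from
the drafts: it guesses a character's script from the first word of its
Unicode character name, and the function name scripts_in_label is
hypothetical.

```python
import unicodedata


def scripts_in_label(label: str) -> set[str]:
    """Return a rough set of scripts used by the letters in a label.

    Assumption: the first word of the Unicode character name (e.g. "LATIN",
    "CYRILLIC", "CJK") is used as a stand-in for the script property, which
    the standard library does not expose directly.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():  # skip digits, hyphens and other non-letters
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


# Latin label with U+0430 CYRILLIC SMALL LETTER A in place of Latin "a".
spoofed = scripts_in_label("p\u0430ypal")

# Latin letters mixed with Chinese characters.
mixed_but_distinct = scripts_in_label("abc\u4e2d\u6587")
```

Both labels mix scripts, so a script count alone cannot separate the
confusable case from the acceptable one; that judgment needs additional data
or local policy.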

>
>
> > ------------------------------
> >
> > Other
> > issues in domain name identification and processing arise
> > because IDNA2003 specified that several other characters be
> > treated ...
> > [[anchor16:
> > Above text is a substitute for an earlier (pre -01) version
> > and is hoped to be more clear. Comments and improvements
> > welcome.]]
> >
> > =>
> >
> > [Remove]
> >
> >
> > Rationale. Unless at least one example of a concrete problem
> > can be provided, this needs to be removed. What is wrong with
> > changing the Regex for recognizing a URL from using [.] as the
> > label delimiter to using [.。｡﹒．]  (that is,
> > [\x{002E}\x{FF0E}\x{FE52}\x{3002}\x{FF61}])?
>
> This has been explained several times in terms of the
> requirement to be able to identify domain names in text
> independent of whether they contain non-ASCII characters or not
> and independent of whether the classification algorithm is
> IDNA-aware.  There are firm DNS and security considerations for
> such identification, and the main clue is the presence of what
> is called a "faceted name" in other contexts, i.e., a string of
> characters containing embedded periods without spaces before or
> after the periods.  There is no regex inherently involved in
> that test, so "changing the regex" is irrelevant.  Issues about
> URL recognition independent of domain names are almost as much
> so, since URLs do come with additional clues about recognition.
> If the text isn't good enough, I invite suggestions about how to
> fix it, but the DNS community has already told us, multiple
> times, that eliminating the rule or broadening the selection of
> "dot-oids" is a showstopper.

Regex is only a mechanism. It is not exactly rocket science to recognize a
fixed set of characters instead of just a single one. And this has not
represented a real problem in practice. This is really a data-extraction
issue, recognizing when a sequence of characters in flowing text is a URL.
And people do this all the time. It is at a level far, far above the DNS.

>
>
>
> > ------------------------------
> >
> > Highly Localized Preprocessing.
> >
> > =>
> > [Remove this section and reword neighboring text. ]
> >
> >
> > *Rationale.* Major security issue (see notes on protocol).
>
> Whether the text can be improved or not, the principles behind
> it reflect operational reality.  As far as I can tell,
> there are only two ways to eliminate the issue you are concerned
> about:
>
>        (i) Require support, in the protocol, for all IDNA2003
>        mappings, including character-elimination and case
>        folding mappings.   That would prevent some changes we
>        have decided to make because they are substantively
>        important for some scripts, would cause Unicode
>        characters added after 3.2 to behave differently from
>        characters present in 3.2, and would deprive us of the
>        isomorphism between A-label and U-labels.   It would
>        probably not require support for base characters that
>        IDNA2008 disallows (such as symbols), but any mappings
>        onto an IDNA2008 PVALID target would have to be
>        supported.
>
>        (ii) Prohibit all mappings, insisting that the _only_
>        native-character form for IDNs is the IDNA2008 U-label
>        form.  That would yield absolutely consistent and
>        predictable behavior across implementations but, by
>        prohibiting obvious case mappings and the like, would be
>        nearly certain to be ignored by at least some
>        implementers.  Note that "ignored by some in violation of
>        the protocol" is far more dangerous than "apply a
>        permitted local interpretation" because, in the former
>        case, other implementations not only have no idea what
>        to expect but may safely believe that no mappings occur.
>
> If one doesn't like either of those extreme options, there is no
> choice but to put in some weasel-words about local mappings and
> interpretations.   I have to say that I've got a growing level
> of sympathy for the second, but I do not believe it to be
> acceptable to the WG or the community.

There is also option #3. Require (or at least strongly recommend) that IF
there is a mapping, that it match IDNA2003 as much as possible. The
deviations from the IDNA2003 mapping are really rather small, and can be
listed in a small section. I'd be glad to supply the text. Compared with the
current complexity of IDNA2008, they are trivial.

Option #3 could prevent the large majority of the mismatching security
problems associated with "Local" mappings.
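
For readers who want to see what the IDNA2003 mapping does in practice, here
is a small sketch using Python's built-in "idna" codec, which implements
RFC 3490 (IDNA2003) including nameprep. The helper name to_a_label_2003 is
hypothetical, and the sketch says nothing about how IDNA2008 itself would
specify a mapping.

```python
def to_a_label_2003(name: str) -> bytes:
    """Convert a name to its ASCII form using the IDNA2003 rules.

    Uses the standard library "idna" codec (RFC 3490 with nameprep), which
    applies the IDNA2003 mappings such as case folding before encoding.
    """
    return name.encode("idna")


# Upper and lower case inputs map to the same ASCII form under IDNA2003.
upper_form = to_a_label_2003("B\u00DCCHER")
lower_form = to_a_label_2003("b\u00FCcher")
```

If a local mapping produced anything other than this result for inputs that
IDNA2003 maps, two implementations could look up different names for the
same user input, which is the mismatch option #3 is meant to limit.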

>
>
> > ------------------------------
> >
> > Anyone looking up a label in a DNS zone is required to
> > ...
> > o Avoid validating other contextual rules about characters,
> > including mixed-script label prohibitions, although such rules
> > may be used to influence presentation decisions in the user
> > interface.
> >
> > =>
> > [Remove last clause. Add editorial note that this needs to be
> > reviewed against final text in protocol.]
> >
> > *Rationale. *The protocol does not, and should not, *require*
> > someone not to validate that a purported U-Label is actually a
> > U-Label!! Secondly, this and any other place in the document
> > that reiterates what protocol requires needs to be marked to
> > be verified before publication for accuracy, so that items
> > like this don't mistakenly get through.
>
> Note added.  Last clause will not be removed until we finalize
> the protocol text and can do that verification.
>
> That said, the purpose of this statement is to do something you
> have argued for elsewhere and have even argued that not doing it
> is a security risk.  Statements like this draw a clear boundary
> between things about which it is appropriate to warn the user
> but then look up and things that are appropriately not looked
> up.  We have to have predictable behavior about the latter or we
> get into really strange territory.

I don't think I've ever argued that contextual rules MUST NOT be validated
in lookup. The only reason that would make sense is if we anticipate that
CONTEXT rules could change so as to invalidate previously valid labels.
That, of course, would be a major source of instability in the protocol.

I've been assuming all along, even though the guidelines for how to do
CONTEXT rules are not yet present, that that would be forbidden. Are you
saying that CONTEXT rules could be changed in this way? If so, what would
that mean to registries? Would they have to invalidate all of the current
registrations that would become invalid?

>
> > ------------------------------
> >
> > characters are permanently excluded
> >
> > =>
> >
> > characters are excluded
> >
> > *Rationale. *
> > In accordance with "expected" language elsewhere. We need only
> > say "excluded".
>
> See comments above.  The IETF expectation is "permanently".
> This has been discussed several times and no other conclusion
> has been reached.

I really don't understand. In this very same email, you said "People have
complained about statements that appear to predict what the IETF will do in
the future or to commit it to future actions and those statements have been
removed or corrected as they have been identified."

Ironically, this is precisely one of those statements! The correct thing to
do, in accord with your own statement, is to remove the word "permanently",
because it "predicts what the IETF will do in the future or commits it to
future actions".

>
> > ------------------------------
> >
> > For example, an essential element of the ASCII case-mapping
> > functions is that uppercase(character) must be equal to
> > uppercase(lowercase(character)). That requirement may not be
> > satisfied with IDNs. For example, there are some characters in
> > scripts that use case distinction that do not have
> > counterparts in one case or the other.
> >
> > =>
> > [Delete]
> >
> > *Rationale. *While the roundtripping under case operations of
> > ASCII characters is a feature, it is not an "essential"
> > feature of ASCII.
>
> It is an essential feature of DNS, hostname, ASCII.

That is reasonable. But the wording is then still incorrect. It should be:

For example, for the DNS an essential element of the ASCII case-
mapping functions is that uppercase(character) must be equal
to uppercase(lowercase(character)).
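
The property in question can be checked directly. The following Python
sketch is illustrative only: the function name roundtrips is hypothetical,
and it uses Python's default Unicode case mappings rather than any mapping
defined by IDNA.

```python
import string


def roundtrips(ch: str) -> bool:
    """True if uppercase(ch) == uppercase(lowercase(ch))."""
    return ch.upper() == ch.lower().upper()


# Every ASCII letter satisfies the property.
ascii_ok = all(roundtrips(c) for c in string.ascii_letters)

# U+212A KELVIN SIGN lowercases to ASCII "k", which uppercases to ASCII "K",
# not back to U+212A, so the property fails outside ASCII.
kelvin_ok = roundtrips("\u212A")
```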

>
>
> > Moreover, even in ASCII, strings do not
> > roundtrip: "McGowan" doesn't roundtrip, for example. And
> > neither of these points is relevant to the argument at hand,
> > they just weaken it.
>
> See if you can get WG consensus for changing this text.
>
> > ------------------------------
> >
> > For
> > example, putative labels that contain unassigned code points
> > will now be rejected, while IDNA2003 permitted them (something
> > that is now recognized as a considerable source of risk)
> >
> > =>
> >
> > For
> > example, putative labels that contain unassigned code points
> > will now be rejected, while IDNA2003 permitted them (something
> > many feel to be a considerable source of risk)
> >
> >
> > *Rationale.*
> >  It isn't so recognized. There is no example of any case where
> > it is a risk, since a label containing unassigned characters
> > was always rejected in registration in IDNA2003, and therefore
> > couldn't be matched. Note that IDNA2008 also does not require
> > lookup to completely verify that putative U-Labels are actual
> > U-Labels.
>
> While I respect your opinion, please see if you can get WG
> consensus for changing this text.   Even if you do, I'd expect
> that you would get some pushback in IETF Last Call because
> several people have done the analysis, particularly about
> normalization of code points not yet assigned, and come to a
> conclusion different from yours.

This is frustrating. The phrasing "is now recognized" is clearly, simply,
false. It implies that *everyone* recognizes it as a risk, when that is
hardly true. The only thing I'm asking for is to change it to reflect
reality, that there is a body of opinion that it is a risk. And I think
we're even willing to say that it is a large body of opinion -- despite no
evidence for this being provided!
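
For reference, a lookup-side check for unassigned code points is small. This
Python sketch is an assumption about how such a check might look, not text
from either specification: the function name has_unassigned is hypothetical,
and the answer depends on the Unicode version bundled with the Python build
(see unicodedata.unidata_version).

```python
import unicodedata


def has_unassigned(label: str) -> bool:
    """True if any code point in the label has General_Category Cn.

    Cn covers unassigned code points (and noncharacters) in the Unicode
    version that this Python build ships with.
    """
    return any(unicodedata.category(ch) == "Cn" for ch in label)


# U+0378 is an unassigned code point in the Greek and Coptic block.
rejected = has_unassigned("ab\u0378")
accepted = not has_unassigned("abc")
```

Because the result changes as new Unicode versions assign code points, two
clients built against different versions can disagree about the same label.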

>
>    john
>
>
>