Issues lists and the "preprocessing" topic
John C Klensin
klensin at jck.com
Tue Aug 19 02:48:04 CEST 2008
Hi.
I'm finishing cross-checking and updating the Protocol and
Rationale documents, but, in the interim and in the hope of
keeping things moving forward,
(1) I'm attaching updated versions of the "outstanding issues"
documents circulated before the IETF meetings. These reflect
decisions made there. Those who disagree (especially if you
have something new to say) are _strongly_ encouraged to make
your case on the mailing list. If we have silence, I'm going to
ask Vint to declare many of these issues closed (which ones
should be obvious from the lists). I've also reorganized the
lists by status, i.e., separating those things that I believe
are still in need of discussion from those that I believe are
finished or nearly so. Opinions may differ about those
categories but, if so, I hope that those who disagree will speak
up very soon.
(2) I've been working on the "preprocessing" and mapping issue
in an attempt to reflect where we stand in the documents. It
is unlikely that the next versions of the drafts will have this
completely right, but I want to try to return to principles and
see if we agree (or not) about them. If we do agree, we can
then have discussion about tuning the text to best reflect those
principles. If we do not, then I believe that it would be more
effective to discuss those disagreements about principles rather
than quibbling about specific text.
I believe that:
(a) Our target is to have any IDN that moves across the
network contain non-ASCII labels in either U-label or
A-label form (i.e., that no mapping should be required).
In addition, IDNs in protocol contexts, including HTML
"href"s, should be in A-label form (i.e., be URIs, not
IRIs). We aren't going to completely accomplish either
of those goals for the reasons below, but they are still
desirable targets.
(b) Both long-term and short-term, systems that actually
read and manipulate strings typed by users are going to
need more flexibility than ones that process files.
Such flexibility may include some operations that we
talked about during the IDNA2003 development period but
that, as far as I know, have not been implemented on any
significant scale. For example, my current preferred
email client distinguishes between "copy link location"
and "copy email address", so that, given
<a href="mailto:foo at example.com">Joe Blow</a>
Copy link location would yield "mailto:foo at example.com"
Copy email address would yield "foo at example.com"
and a possible "copy" would yield "Joe Blow",
each in the relevant copy buffer ("clipboard").
One can imagine similar copy operations applied to IDNs
or IRIs that would yield domain names containing
A-labels or URIs, respectively.
Obviously, the typical user would have no clue about the
differences among these operations initially. But,
faced with situations in which "copy and paste" just
doesn't work, or in which strings cannot be passed to a
colleague or even to a different application on the same
system without display problems (e.g., rows of question
marks or boxes or characters drawn from some other CCS),
and given decent user interfaces, they would learn
quickly.
(c) Compatibility with IDNA2003 will require mapping of
stored strings in some contexts. Ideally, those
mappings should be strictly confined to characters
mapped by IDNA2003 and interfaces to them should be
designed to encourage migration to no-mapping forms.
Some types of applications, such as indexing ones, might
need to preserve these types of mapping much longer than
others. At the other extreme, web browsers might be
configured to warn before mapping, or even to reject
domain names that require mapping, unless the user was
clearly referencing older pages.
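
The three copy operations described in (b) can be sketched as follows.
This is only an illustration, assuming Python's standard html.parser
module; the class name AnchorCopy and its attributes are hypothetical,
the address is written as foo@example.com on the assumption that this
is what "foo at example.com" stands for, and nothing here describes
how any particular mail client works.

```python
from html.parser import HTMLParser


class AnchorCopy(HTMLParser):
    """Collects the three strings a client could offer to copy from one <a> element."""

    def __init__(self):
        super().__init__()
        self.link_location = None   # full href value ("copy link location")
        self.email_address = None   # href without the "mailto:" scheme
        self.text = ""              # visible anchor text (plain "copy")
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            href = dict(attrs).get("href")
            self.link_location = href
            if href and href.lower().startswith("mailto:"):
                self.email_address = href[len("mailto:"):]

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor:
            self.text += data


parser = AnchorCopy()
parser.feed('<a href="mailto:foo@example.com">Joe Blow</a>')
parser.close()
print(parser.link_location)
print(parser.email_address)
print(parser.text)
```

An IDN-aware variant would add a fourth attribute holding the domain
part converted to A-label form.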
The implications of the above are that we not only aren't
encouraging extensive local-option mapping, we are encouraging
no mapping at all except for backward compatibility when
necessary and as a user interface convenience. For the latter,
the expectation is that one will make the mappings as early as
possible and use only the mapped (U-label or A-label) form in
files; storing anything else in a file or sending it across the
network is strongly discouraged. Also, even when mappings are
done, the rule that is now present in the documents still
stands, i.e., one must not map a PVALID or CONTEXT character
into anything else -- mapping is permitted only for DISALLOWED
characters.
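
The mapping rule just stated can be checked mechanically. The sketch
below is an assumption-laden illustration: DERIVED_PROPERTY is a
hypothetical four-entry excerpt (a real implementation would take the
values from the Tables document), and mapping_violations is a name
invented here, not something defined in the drafts.

```python
# Hypothetical excerpt of derived property values, for illustration only.
DERIVED_PROPERTY = {
    "a": "PVALID",
    "\u0131": "PVALID",      # LATIN SMALL LETTER DOTLESS I
    "\u200c": "CONTEXT",     # ZERO WIDTH NON-JOINER
    "A": "DISALLOWED",       # upper-case letters are assumed DISALLOWED here
}


def mapping_violations(mapping):
    """Return the source characters that a local mapping must not touch.

    Only DISALLOWED characters may be mapped. Characters missing from the
    excerpt are reported as violations, which is a conservative choice.
    """
    return [
        src
        for src, dst in mapping.items()
        if src != dst and DERIVED_PROPERTY.get(src) != "DISALLOWED"
    ]


case_fold = {"A": "a"}            # maps a DISALLOWED character: permitted
dotless_fold = {"\u0131": "i"}    # maps a PVALID character: not permitted
print(mapping_violations(case_fold))
print(mapping_violations(dotless_fold))
```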
So, do others agree with that and, if not, where are the
disagreements and why?
john
-------------- next part --------------
Issues list, IDNABIS Protocol (as of 20080807)
This revision has been reorganized to group issues into three
groups: those that need additional discussion to resolve, those
on which comments have not been received from other than the
originators and that should be closed if no further comments
are received, and those that are believed to be settled after
IETF 72 in Dublin and that will be closed unless there are
protests on the mailing list against the meeting decisions.
Note that comments from those who originated the issue and that
repeat their initial remarks are not helpful for any of these
categories -- additional perspectives would be much more
useful.
Section numbers referenced are the same in
draft-ietf-idnabis-protocol-01 through -03 except as noted.
"Critical Path" slide numbers refer to the slides used during
IETF 72 and are followed by the titles of those slides.
There are several issues that, because of text moving back and
forth and suggestions for additional moves, are in the issues
list and summary for Rationale. They are, in general, not
repeated here.
Note that, unlike Rationale and some independent issues, no
specific "outstanding issues" list was posted for Protocol in
May. That was due, in part, to the underwhelming response to
the other postings. So this document incorporates what would
have been that list.
-------------
Discussion and Resolution Still Required
-------------
** P.11 ** Placeholder: Description of steps in both lookup and
registration
These two sections still need work.
Status: Discussion anchor in text (anchor2 in -02).
Not worth doing much with until some related issues are
sorted out (e.g., whether these should be recombined, see
above). Specific suggestions welcome.
Status: No change at IETF 72
** P.12 ** Placeholder: Preprocessing
There is a lengthy placeholder in the text about
preprocessing issues. Even if the WG doesn't take on the
task of standardizing preprocessing, there is a great deal
of controversy as to whether it is necessary, should be
required, and should be standardized down to the last
mapping or if our goal is actually to change the processing
model, not just the descriptive one.
Comment: this needs to be resolved in the WG before the
text in Protocol can be made internally consistent and
consistent with Rationale. Aspects of this topic have been
discussed extensively in postings in recent days, including
the Issues/Status report for Rationale-01.
Critical Path slide 3, "Rationale 2".
Status: Text being rewritten after discussions at IETF 72.
Please watch for it and comment.
--------------------
Comments needed or issues will be considered settled. Some
items have been placed in this category because new text in
protocol-03 or protocol-04, suggested by on-list or in-Dublin
discussions, is believed to have resolved the issue.
--------------------
** P.6 ** Requirement for Policy
Mark writes about Section 4.4: "While exact policies are not
specified as part of IDNA2008 and it is expected that
different registries may specify different policies, there
SHOULD be policies." This SHOULD is pointless, unless some
constraints or guidance are given. Otherwise my policy could
be "any valid IDNA label", which would be precisely the same
as no policy at all.
See also R.9 and R.27.
Comment: I fully expect that some zone administrators will
adopt exactly that sort of policy. We've said that policy
decisions, and consistent application of those decisions, by
zone administrators are an important part of the
registration model. We have said that specifying those
policies and determining their adequacy is not an IETF
matter, but rather a matter for governments, enterprise
management, ICANN, and so on. We have seen application
implementers evaluate per-zone policies and respond with
decisions about what to display. So, I don't think this is
pointless. The problem is whether different language would
better describe the handoff.
Of course, were we to decide that our audience is purely
protocol implementers, all of this would go away (i.e., this
is linked to the "reorganize documents and remove rationale"
issue of R.1).
Critical Path slide 3, "Rationale 2".
Status: New text coming for the next versions of the
documents after more discussion in Dublin (IETF 72).
Please watch for it and comment.
** P.7 ** Universality of Unicode
Section 5.2 says: "The local character set, character
coding conventions, and, as necessary, display and
presentation conventions, are converted to Unicode (without
surrogates), paralleling the process described above in
Section 4.2."
Mark writes: In the vast majority of cases in modern
software, the local charset IS Unicode, so this may be
confusing. Also, UTF-16 does and must use surrogate code
units, so this needs to be more precise. And excluding
surrogate code points isn't necessary since gc=Cs are
forbidden anyway. Suggest:
"The string is converted from the local character set into
Unicode, if it is not already Unicode. The exact nature of
this conversion is beyond the scope of this document, but
may involve normalization, as described in Section 4.2."
Comment: I don't know how to evaluate "vast majority...",
but I keep running across examples and discussions that
suggest that the presumed minority is fairly large. The
most recent example is a lengthy discussion about
interoperability problems with email text body parts;
problems that would presumably be infrequent or trivial if
most systems primarily supported Unicode.
Status: The specific textual change suggested has been
made and the virtual timer has run out. Anyone who doesn't
like this should object immediately, otherwise I'll ask Vint
to declare the change final.
** P.8 ** Validation of A-labels
Section 5.4 says: "In general, that conversion and testing
should be performed if the domain name will later be
presented to the user in native character form (this
requires that the lookup application be IDNA-aware)."
Mark writes: Suppose that program X creates an A-Label from
a U-Label, then sends that A-Label to program Y, which sends
it to program Z, which sends it to program W, which displays
it. It sounds like each of Y, Z, W need to validate. Is
that the intent of this text? If it is only W that needs to
validate, then it gets a bit murky in today's world, where
the boundaries between cooperating processes and programs
are very fuzzy.
Comment: That "murky" situation is exactly why the text
leaves a lot of judgment in the hands of the application.
Note that the sentence after the one you quoted now says, in
part, "others may treat the string as opaque to avoid the
additional processing at the expense of providing less
protection and information to users", which is intended to
be a very clear statement that there is a trade off
involved. I believe that, if you read the surrounding text,
you will find that the specific answer to your question is
that W should (note lower case) validate unless it has some
reason not to. Note that W has to do most of the work of
validation in order to display the string anyway; a reason
not to validate might be knowing, as a consequence of system
design and knowledge of adequate checking and auditing, that
it wouldn't have received the string unless it was valid.
Validation is optional, and less likely, for Y and Z, but
their implementers may certainly do so if they are being
cautious.
That text is consistent, I think, with on-list discussions
that seem to have concluded that we must be very careful
about not imposing IDNA requirements on programs that are
not IDNA-aware but that there are some serious spoofing,
abuse, and malware opportunities if programs simply assume
the validity of strings that appear to be A-labels.
Critical Path slide 4, "Protocol 1".
Status: Suggestions for clearer text would be welcome; no
substantive change between -01 and -02. Status unchanged
during IETF 72.
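
As a rough illustration of the kind of check W might make before
display, the sketch below tests whether a string that looks like an
A-label survives a decode and re-encode round trip. It assumes
Python's built-in "punycode" codec; the function name
a_label_round_trips is invented here, and the test covers only the
encoding, not the character, contextual, or bidi checks in the
Protocol draft. Programs Y and Z would simply pass the string along
untouched if they treat it as opaque.

```python
ACE_PREFIX = "xn--"


def a_label_round_trips(label):
    """True if label has the ACE prefix, decodes to a non-ASCII string,
    and re-encodes to exactly the same (lower-cased) label."""
    lowered = label.lower()
    if not lowered.startswith(ACE_PREFIX):
        return False
    try:
        u_label = lowered[len(ACE_PREFIX):].encode("ascii").decode("punycode")
        re_encoded = ACE_PREFIX + u_label.encode("punycode").decode("ascii")
    except (UnicodeError, ValueError):
        # Undecodable input: the string only appears to be an A-label.
        return False
    return re_encoded == lowered and not u_label.isascii()


for candidate in ("xn--bcher-kva", "xn--!!!", "example"):
    print(candidate, a_label_round_trips(candidate))
```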
** P.10 ** Similarity of registration and resolution procedures
Mark points out that the steps in 5.5 are all the same as
in 4.3 -- except for bidi. This fact should be very clear
in the text.
See also P.15
Comment: Of course, they were more different until the MAYBE
categories were removed. There still seems to be merit in
describing them separately because implementers of
registration procedures and actions and implementers of
lookup ones tend to be different even if there is some
shared code. A comment could be inserted noting the
parallelism (sic), but I'm not quite sure how or where to do
that without encouraging people to get sloppy about the
differences that do exist.
Specific suggestions and discussion would be welcome.
Unchanged at IETF 72.
** P.14 ** (Editorial) Text about A-labels on registration
Should this text be moved from 4.1 to 4.3?
Status: See placeholder in draft. No discussion through
IETF 72. People with strong opinions should express them
soon.
** P.15 ** Reducing duplication in "registration" and "lookup"
The description of steps in Sections 4 and 5 are very
similar. There are suggestions that the two be recombined
(see Mark's notes from some time ago) and another one that
we create a new section with the common material and point
to it from the two existing (but shortened) sections (see
Marcos's notes on Protocol).
Comment: Keeping the two sections separate contributes,
IMO, to the goal of making it easier for people to find the
information they need to implement things correctly, so I'm
nervous about recombining the sections. Marcos's
suggestion seems like a middle ground, although it will
result in more page-flipping.
Status: We need some discussion, of which none has occurred
through IETF 72 (other than comments from those who
originally made the suggestions). At least rough consensus
on this is needed if it is going to be changed.
Otherwise, the default is more or less status quo.
** P.16 ** Versions and the Conceptual Rules
Marcos suggested that the Conceptual Rules Registry and the
Derived Properties one have a formal version structure. I
may or may not understand the suggestion the way he
intends, but, if I do, this is an Issue.
Comment: The IETF has had poor success with version numbers
in tables and the like, especially those that are intended
for future compatibility, with the notorious "MIME-Version:
1.0" header standing out as an example. So I'm reluctant
to do this unless we have a clear understanding of how we (and
implementations) would use those versions, what error
conditions we would expect and to whom they would be
reported, etc. Much more discussion and/or text needed.
That said, asking IANA to keep a "last modified" date in
the registry in some easily-processed form would seem both
reasonable and, at worst, harmless.
Critical Path slide 4, "Protocol 1".
Status: No discussion during IETF 72. If others don't
express opinions, this will be dropped or turned into a
suggestion for a "last modified date". In any event, it
is now an issue for "tables".
-------------------
Resolved in Dublin (IETF 72) or otherwise settled
These items are final and the issues closed unless arguments
are raised on the list that reverse the apparent Dublin
consensus.
-------------------
** P.1 ** Location of the Contextual Rules table
This table is in the Appendices to Protocol at present; in
previous versions it was in Rationale. Based on list
discussions, it should probably be moved to Tables,
probably along with some additional material that is
still in Rationale.
Status: Resolved in Dublin (IETF 72); text will be moved.
The relevant text has been sent to Patrik.
** P.2 ** Format of the Contextual Rules table.
At present (Protocol-02), that table consists of a condensed
format that more or less matches the format used for
standard entries in Tables. With it, important fields are
separated by semicolons, potentially followed by comments
that start with "#". As an example, the first entry in the
table looks like the following in Protocol-01:
002D; HYPHEN-MINUS; F;
Must not appear at the beginning or end of a label;
Regular expression:
[^^]\u002D|\u002D[^$] ;
# Note that a prohibition on having two hyphens as
the third and fourth characters of anything but a
valid A-label appears in the specification.
Mark Davis suggested a different format. He wrote (I have not
preserved his formatting):
I suggest that the table be formatted for clarity to not
depend on whitespace -- using names for each field -- and be
broken into a list of condition/result pairs.
Code point: 200C
Name: ZERO WIDTH NON-JOINER
Lookup: True
# Allow ZWNJ for breaking cursive connection, as needed in
Farsi.
Before: [[:Joining_Type=Dual_Joining:]
[:Joining_Type=Left_Joining:]]
[:Joining_Type=Transparent:]*
After: [:Joining_Type=Transparent:]*
[[:Joining_Type=Dual_Joining:]
[:Joining_Type=Right_Joining:]]
Value: PVALID
Comment: There is no dependency on whitespace, at least in
Protocol-01. Mark's comments may have reflected an earlier
version. I've illustrated a version of these changes with
the alternate appendix (see elsewhere), but I'm not sure
about it for two reasons. The first is whether people
prefer a more compact format or one that uses more vertical white
space. We hope that the Contextual Rules registry will remain
small, but ending up with a significantly longer Tables
document (remember that this section is normative) as the
result of format may not be in our best interest. The
second point is related to the next issue.
Critical Path slide 4, "Protocol 1".
Status: Resolved in Dublin (IETF 72), in favor of a
"rules" approach rather than a regex one. Both appendices
have been removed from Protocol and text has been sent to
Patrik for inclusion in Tables and modification, as
needed, into the style of that document.
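
To make the Before/After conditions in the ZWNJ example concrete, here
is a minimal sketch of them as a plain rule. It assumes a tiny,
hand-written JOINING_TYPE excerpt (a real implementation would load
Joining_Type from the Unicode data files), and the function names are
hypothetical. It implements only the two conditions quoted above.

```python
ZWNJ = "\u200c"

# Hypothetical excerpt of Joining_Type values: D = Dual_Joining,
# R = Right_Joining, L = Left_Joining, T = Transparent.
JOINING_TYPE = {
    "\u0627": "R",  # ARABIC LETTER ALEF
    "\u0645": "D",  # ARABIC LETTER MEEM
    "\u0646": "D",  # ARABIC LETTER NOON
    "\u0647": "D",  # ARABIC LETTER HEH
    "\u06cc": "D",  # ARABIC LETTER FARSI YEH
    "\u064e": "T",  # ARABIC FATHA
}


def _nearest_non_transparent(chars):
    """Joining type of the first character that is not Transparent, or None."""
    for ch in chars:
        joining_type = JOINING_TYPE.get(ch)
        if joining_type != "T":
            return joining_type
    return None


def zwnj_allowed(label, index):
    """Apply the Before and After conditions to the ZWNJ at label[index]."""
    if label[index] != ZWNJ:
        raise ValueError("no ZWNJ at that position")
    before = _nearest_non_transparent(reversed(label[:index]))
    after = _nearest_non_transparent(label[index + 1:])
    return before in ("D", "L") and after in ("D", "R")


good = "\u0646\u0627\u0645\u0647\u200c\u0627\u06cc"  # HEH, ZWNJ, ALEF
bad = "\u0627\u200c\u0645"                            # ALEF, ZWNJ, MEEM
print(zwnj_allowed(good, 4), zwnj_allowed(bad, 1))
```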
** P.3 ** Definition of the Contextual Rules - Regex or otherwise
Based on my understanding of what was being asked for, the formal
definitions of the Contextual Rules were written to use a strict
regular expression syntax with one regular expression per rule.
That syntax, illustrated by the example above for U+002D,
is not easy to look at or understand, but would lend itself
to an automatic rule interpreter. Mark's example is not
one of a single rule, or even formal use of a regular
expression. If we are going to go that route, there may be
even more simple ways to express the rules, leaving
applications implementers on their own for the
formalizations to be used in their code. A second appendix
has been supplied in -02 as the beginning of a suggestion
for discussion.
Please examine the two examples above, the discussion in
the text, and the forms used in both appendices and advise
on what you would like to see and why.
Note that they are provided in these versions of the
document for comparison purposes only. Assuming we can
agree on which one we want, only one will survive into the
post-IETF versions of the documents.
Critical Path slide 4, "Protocol 1".
Status: Resolved in Dublin (IETF 72), in favor of a "rules"
approach rather than a regex one. Both appendices have
been removed from Protocol and text has been sent to
Patrik for inclusion in Tables.
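
For comparison, the hyphen restriction can be written both ways in a
few lines. This is a sketch only, assuming Python's re module; the
regular expression uses ordinary anchors and is an illustration, not
the expression from the draft, and both function names are invented
here.

```python
import re


def hyphen_rule_ok(label):
    """Rule form: HYPHEN-MINUS must not begin or end a label."""
    return not (label.startswith("-") or label.endswith("-"))


# Regex form of the same condition: a hyphen at the very start or very end.
_HYPHEN_AT_EDGE = re.compile(r"\A-|-\Z")


def hyphen_regex_ok(label):
    """Regex form: accept the label if no edge hyphen is found."""
    return _HYPHEN_AT_EDGE.search(label) is None


samples = ["abc", "a-b", "-ab", "ab-", "-"]
print([(s, hyphen_rule_ok(s), hyphen_regex_ok(s)) for s in samples])
```

The two forms agree on these samples; the rule form is arguably easier
to read, while the regex form is easier to feed to a generic
interpreter.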
** P.4 ** Protocol reference to Bidi Constraints
In 4.3.2.4: the bidi constraints apply to more than just
single labels.
Comment: Noted. The question of those Bidi constraints is
probably one of the larger and more substantive open issues
we face.
Status: Resolved in Dublin (IETF 72) after some very strong
assertions from the Security and DNS folks about the
implausibility of cross-label checking. Since this appears
to be a showstopper for them that would certainly result in
blocking DISCUSS positions, I think the topic is dead in
the IDNABIS WG -- anyone who wants to argue it should
probably take it up in another arena. Text in Protocol
and Rationale is being conformed.
** P.5 ** Bidi-checking Requirement on Lookup
Should this be a SHOULD or a MUST?
Comment: See the discussion anchor and text in Section 5.5
(note that this anchor has been in the document for some
time and there have been no comments on-list).
Critical Path slide 5, "Protocol 2".
Status: Still no comments through IETF 72. The anchor has
been removed and this is considered done.
** P.9 ** Use of "in parallel"
Section 5.5 and elsewhere use the term "in parallel" to
describe the relationship between two (or more) sets of
steps or procedures. Mark expresses concern that this will
create confusion with concurrent operations, which is not
intended. He suggests other wording.
Comment: Specific suggestions for alternate text would be
welcome.
Status: The term "parallel" has been removed from both
Protocol and Rationale. This is an editorial matter and we
are therefore done with it unless someone objects RSN.
** P.13 ** Labels starting in combining marks
In Section 5.5 (lookup validation and testing), the text
contains a prohibition on labels starting with combining
marks. I think we have consensus on the prohibition. Is
the statement of it adequate?
Status: There is a discussion anchor in the text. It will
be removed if there is no discussion on this in the near
future. No discussion through IETF 72 and no comments on
list. Anchor removed. Done.
-------------- next part --------------
Issues list, IDNABIS Rationale (as of 20080807)
This revision has been reorganized to group issues into three
groups: those that need additional discussion to resolve, those
on which comments have not been received from other than the
originators and that should be closed if no further comments
are received, and those that are believed to be settled after
IETF 72 in Dublin and that will be closed unless there are
protests on the mailing list against the meeting decisions.
Note that comments from those who originated the issue and that
repeat their initial remarks are not helpful for any of these
categories -- additional perspectives would be much more
useful.
Section numbers refer to draft-ietf-idnabis-rationale-00
through -02. "Critical Path" slide numbers refer to the slides
used during IETF 72 and are followed by the titles of those
slides.
This is more or less a status summary and issue discussion for
that document in the hope of facilitating wider discussion. I
will further update the list as needed and create a list
without comments for discussion and tentative decisions before
Dublin. The numbers given are to facilitate referencing.
The issues/comments from R.10 through R.25 inclusive are
derived from messages from Mark Davis, primarily the note sent
Monday, 07 July, 2008 08:05 -0700. Consequently and just as a
convenience, in places the text reads as a reply directed to
him.
-------------
Discussion and Resolution Still Required
-------------
** R.1 ** Should this document exist at all?
There is an ongoing discussion about whether the documents
should be reorganized in part to move material that is
necessary to the protocol implementation entirely out of
this document and then to drop this document.
Critical Path slide 2, "Rationale 1"
Status: Not really discussed during IETF 72; awaiting
instructions from Vint and/or a meaningful discussion.
The rest of this list assumes that the document is
retained; otherwise it would be irrelevant.
** R.2 ** Normative text in this document.
If the document is retained, should we try to remove all
normative references from the other documents to it and/or
remove all normative text from it and put it in the other
documents? (Those two definitions of the issue probably
amount to the same thing.)
Comment: This issue is discussed at some length in my note
on document organization.
Critical Path slide 2, "Rationale 1"
Status: No substantive discussion during IETF 72.
Discussion needed after R.1 is resolved.
** R.3 ** PROTOCOL-VALID Explanation
Section 6.1.1 contains an overview of Protocol-Valid. Is
that explanation adequate? If more is needed, what?
In addition, there is an inconsistency between Rationale
and Tables about the relationship between the two CONTEXT
categories and PROTOCOL-VALID. Tables treats the two
categories as disjoint. The explanation in Rationale
treats them as special subset cases of PROTOCOL-VALID,
more or less as PROTOCOL-VALID with special flags set.
Comment: The approach used in Tables is computationally
easier and reflects the structure of both documents before
a tentative decision in January to try the subset approach.
My impression from discussions of these revisions in
various meetings around the world is that the Tables
(disjoint) approach is more generally understood. The two
documents obviously should be consistent; unless there are
strong enough reasons to use the subset approach to justify
changing Tables, Rationale will be changed to match it in
the first post-IETF version.
Status: The overview text about PROTOCOL-VALID has been
changed to something that may be more clear. Comments or
suggestions for improvements on that issue or the more
general ones discussed above are welcome. No progress
during IETF 72. This section probably needs to be
harmonized with "Tables".
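
The difference between the two models can be shown with a short
sketch, assuming Python's enum module. All class, member, and function
names are invented for illustration; in particular CONTEXTJ is assumed
to be the name of the second CONTEXT category, and the flag names do
not appear in either document.

```python
from enum import Enum, Flag, auto


class DisjointValue(Enum):
    """Tables-style model: every code point has exactly one value."""
    PVALID = auto()
    CONTEXTJ = auto()
    CONTEXTO = auto()
    DISALLOWED = auto()


class SubsetValue(Flag):
    """Rationale-style model: contextual cases are PVALID plus a flag."""
    PVALID = auto()
    NEEDS_JOIN_RULE = auto()
    NEEDS_OTHER_RULE = auto()


def is_pvalid_disjoint(value):
    """One comparison; contextual code points are a separate case."""
    return value is DisjointValue.PVALID


def is_pvalid_subset(value):
    """Contextual code points also count as PVALID; flags must be inspected."""
    return bool(value & SubsetValue.PVALID)


zwnj_disjoint = DisjointValue.CONTEXTJ
zwnj_subset = SubsetValue.PVALID | SubsetValue.NEEDS_JOIN_RULE
print(is_pvalid_disjoint(zwnj_disjoint), is_pvalid_subset(zwnj_subset))
```

Under the disjoint model a lookup can branch once on a single value;
under the subset model the same code point answers "yes" to the PVALID
test and then needs a second check for its flags.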
** R.7 ** Explanation of removal of symbols
Section 10.5 discusses the reasons why symbols are not
permitted in IDNA2008. There has been controversy about
some of the statements and examples, with some disagreement
about whether some of them are even factually correct. We
either need to identify everything that is controversial in
this section and trim it out (which might leave very
little), fix specific examples on which we can agree, accept
(and possibly note) the disagreements, or adopt some other
strategy. Of course, if we drop rationale and explanatory
material in favor of a "this is just how it is" approach,
the specific issues with this section vanish.
See also R.25, which focuses on a specific part of the
explanation. There was no substantive discussion of the
issue at IETF 72 and this issue is still open.
** R.5 ** Permanence of DISALLOWED
There has been extensive on-list discussion about whether
migration of characters from DISALLOWED to PROTOCOL-VALID
(or CONTEXTO) should be easy, or at least easier than
migration from PROTOCOL-VALID to DISALLOWED. I do not
believe we have reached consensus even though the material
in 6.1.2 reflects what I believe to have been the general
trend of the discussions when they ran down. Does anyone
have anything new to say about this and, if not, should I
remove the placeholder?
See also R.11
Status: Issue raised at IETF 72 in Dublin, but no
conclusion reached. Placeholder remains in -02.
Discussion (not just repetition) needed.
** R.16 ** Description of major changes from IDNA2003, invalid
labels
Should item 11 ("Make some currently-valid labels that are
not actually IDNA labels invalid.") be dropped?
Mark said: 'Why do we care that labels invalid under IDNA2003
are also invalid under IDNA2008? Why wouldn't they be?
Perhaps an example would help to clarify this.'
Comment: That isn't what it says, actually. I think the
intent of the statement was to call out the fact that
conforming IDNA2003 implementations can look up labels
starting in "xn--" that are not valid A-labels and labels
that contain "--" in the third and fourth positions that don't
start in "xn". In IDNA2008, both of those lookup operations
are somewhat discouraged and conforming applications MAY
decline to look them up. (N.B., those two reasons for not
looking something up are separate and part of the
outstanding issues list for Protocol.)
Status: No substantive discussion at IETF 72. Awaiting
comments.
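
The two cases called out in the comment can be told apart with a
simple classifier. This is a sketch under stated assumptions: it uses
Python's built-in "punycode" codec, the category strings and function
names are invented here, and "decodable" is weaker than "valid
A-label" because no character or contextual checks are made.

```python
def classify_label(label):
    """Sort a label into one of four illustrative categories."""
    lowered = label.lower()
    if not (len(lowered) >= 4 and lowered[2:4] == "--"):
        return "ordinary"
    if not lowered.startswith("xn"):
        return "hyphens-without-xn"
    try:
        lowered[4:].encode("ascii").decode("punycode")
    except (UnicodeError, ValueError):
        return "xn-but-undecodable"
    return "xn-decodable"


def may_decline_lookup(label):
    """The two cases for which an application MAY decline the lookup."""
    return classify_label(label) in ("hyphens-without-xn", "xn-but-undecodable")


for candidate in ("example", "ab--cd", "xn--!!!", "xn--bcher-kva"):
    print(candidate, classify_label(candidate), may_decline_lookup(candidate))
```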
** R.20 ** Display of A-labels (punycode-coded strings) to users
The text says 'Applications MAY allow the display and
user input of A-labels, but are encouraged to not do so
except as an interface for special purposes, possibly for
debugging, or to cope with display limitations.'
Mark writes: There is widespread use of the A-Label to
signal a possible spoof -- while you discuss that later, I
think it's swimming against the tide not to mention it here.
Comment: We definitely need to talk about this. There is a
difference between recognizing that something is done, even
on a "widespread" basis, and encouraging it. From a
security practices and human factors standpoint, switching
into A-labels for too many different reasons is a bad idea
under both the principle that excessive warnings cause
typical users to ignore all of them after a while and
because the user has no way to differentiate among the cases
(at least without a handy A-label -> Unicode code point
list mapper in addition to an A-label -> U-label mapper and
knowledge as to what to do with both). Consider the two
most popular causes of A-label display today -- failure to
have the relevant script installed (much more likely an
indicator of "you are unlikely to be able to read the
content at that destination" than of evil-doing unless one
makes the classic assumption that anyone you don't know is a
nasty barbarian) and failure to be part of a TLD that
practices approved registration hygiene (extremely prone to
false positives unless the TLD is actively recruiting
evildoers) -- and then think about the effectiveness of
A-labels as a spoof warning in environments in which we know
that the typical user's response to being barraged by "an
<incomprehensible> thing might happen if you continue, do
you want to continue" messages is to click "yes" every time.
Your original suggestion (a half-dozen years ago) to color
these things was actually much better.
I don't think there is a place for any of that discussion or
associated recommendations in the document, but I also don't
think we should be saying things that constitute
recommending the practice. The text as it stands is just a
MAY with a fairly weak "encouraged" clause so people who
believe that A-label display is the best solution are still
conforming.
Whatever is said about display of A-labels, it seems clear
that users should be able to type them, in some way, on
input. Does that need to be said explicitly?
Status: It is now (-02) said explicitly, although the text
is easily removed if people conclude that it is clutter.
This topic was not discussed at IETF 72 and I await further
comments.
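
The "A-label -> Unicode code point list mapper" mentioned in the
comment is easy to sketch, which may help show how little the A-label
alone tells a typical user. The code assumes Python's built-in
"punycode" codec and unicodedata module; describe_a_label is a
hypothetical name, and no validity checking is attempted.

```python
import unicodedata


def describe_a_label(a_label):
    """Return (u_label, [(code point, character name), ...]) for an A-label."""
    lowered = a_label.lower()
    if not lowered.startswith("xn--"):
        raise ValueError("not an A-label candidate")
    u_label = lowered[4:].encode("ascii").decode("punycode")
    listing = [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in u_label
    ]
    return u_label, listing


u_label, listing = describe_a_label("xn--bcher-kva")
print(u_label)
for code_point, name in listing:
    print(code_point, name)
```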
** R.24 ** Local (user interface) mappings
The text says: "None of those local decisions are a threat
to interoperability as long as (i) only U-labels and
A-labels are used in interchange with systems outside the
local environment,...".
Mark writes: Doesn't really follow that there are no
problems. The obvious example of interoperability problems
are where a Turkish friend has a URL that works in his
browser, copies the text in an email and sends to me. When
I click on it, it either 404's or **much worse**, goes to a
different website.
Comment: First of all, Mark and others have suggested many
times that the Turkish "i" is such a rare and unique case
that there is no point worrying about it. I'm not convinced
of that, but I think we either need to take it seriously as
a problem... or not.
More to the point, the key point is in the text following
that which you quoted, i.e., "(ii) no character that would
be valid in a U-label as itself is mapped to something
else,...". As long as the dotless "i" (U+0131) remains
PVALID (it is today), then the case I assume you are calling
out above (since you specifically mentioned a Turkish
friend) is prohibited (which doesn't mean it won't happen --
see below).
But I think this case identifies the sources of our
disagreement about required standardized mapping and a
number of other issues (including the one about the [mis]use
of A-label display discussed above). If I'm making the right
inference from several of your comments, you tend to look at
whatever is going on today and say "this is the pattern, it
is in wide use, we need to accept it, standardize it, and
promote it". My view is that, in a world in which we are
talking about planning for the next billion Internet users
and the billion after that, we should be figuring out what
is optimal, including learning from the consequences of
previous behaviors and decisions, and then developing plans
about how to get there (even if that implies a
slightly-bumpy transition).
In this particular case, I look at some of the possibilities
that could lead one to end up with a different target host
or site than intended and say "another reason why we need to
restrict URLs to final (non-mapped) characters only and get
people out of the habit of believing that they can use, or
advertise, one character in a URL or email address and
expect a different character in the DNS". You look at
figures about how often Google encounters characters that
need mapping and say "we have to preserve the mappings
forever" while I look at the same data and say "we have to
think through the transition and legacy issues very
carefully, but need to get that behavior under control
before it gets even worse".
Suggestions about better text would be welcome, but the
group needs to figure out what it wants to do with the
underlying difference in models between "make the Internet
better by stabilizing current behavior patterns" and "make
the Internet better by facilitating evolution to
better-designed patterns".
See note from Mark titled "Interoperability" and my
response.
Critical Path slide 3, "Rationale 2"
Status: Placeholder/discussion anchor inserted in text. The
text itself has been rewritten in -02 somewhat to reflect
IETF 72 discussions. I believe we now have consensus on
the principles although probably still not on the text.
Suggestions welcome.
** R.25 ** Explanation of the symbol prohibition
In Section 10.5, one of the bullet points starts "Most
Unicode names for letters are, in most cases, fairly
intuitive, unambiguous and recognizable to users of the
relevant script....and there are far more squares of
various flavors in Unicode than there are hearts or stars."
Mark wrote: This just needs to be removed; the argumentation
is faulty. For the same pronunciation, Chinese has hundreds
of possible characters. If you want another reason (and
someone to point a finger at), you could say: "The Unicode
Standard recommends that these types of identifiers not
contain symbols [UAX31]."
Comment: This needs further discussion. I believe that the
comment about Chinese is totally irrelevant: after all, a
large number of languages have multiple characters within
the same script for the same phoneme (as well as using the
same character to represent multiple phonemes) and none of
that generally affects the name of the character. We may
need to agree to disagree and then let the WG decide whether
the discussion is important enough to try to tune in the
light of the disagreement. I don't know if better phrasing
would help things; I gather from the above that you are
convinced that it would not.
Status: A placeholder/discussion anchor has been inserted
into the text. This topic needs further discussion from
additional parties.
--------------------
Comments needed or issues will be considered settled. Some
items have been placed in this category because new text in
rationale-02, suggested by on-list or in-Dublin discussions, is
believed to have resolved the issue.
--------------------
** R.4 ** Contextual rules and their application
Section 6.1.1.2 contains a discussion of contextual rules
and a placeholder for an in-depth explanation of the syntax
for those rules. It may not be right and/or clear.
See P.1, P.2, and, especially P.3.
Status: Contextual rule sections have been removed from
Protocol for placement in Tables. This section will be
changed to conform. Please watch for it and comment if
needed.
** R.9 ** Scope of requirements on Registries
The language of the first sentence of Section 10.1.2
appears to make a requirement on all DNS names and servers.
That would require an update to RFC 1123, among other
things. Such an update is out of scope and probably
undesirable.
Comment: That reading was never intended. The language
assumed more context than the reader might have had and was
generally sloppy. There is a more general issue about
policy requirements; see R.27, which remains an open issue
due to new text.
Status: This text was changed in the -01 document to
restrict its scope to zones supporting IDNA and
eliminate the restriction on other prefixes.
WG members should review both the change and the note in
the text of -01 to verify that they are what is desired.
There have been no such comments on-list or during IETF 72,
so we will soon decide that we are done.
** R.15 ** Description of major changes from IDNA2003, Bidi
Mark suggested removing item 9 ("Make bidirectional domain
names in a paragraph display in a non-surprising fashion.")
because it is just a special case of the previous item.
Comment: that text wasn't mine and I'd like to hear from Paul
and/or Harald and Cary before making any changes. I believe
the intent of the two separate statements was to distinguish
between the case in which one knows that the string is a
domain and the case in which one needs to deduce that the
string is a domain name from running text.
Status: A placeholder comment has been inserted in the -01
document to identify this issue. In the -02 document, item 9
has been tentatively removed and item 8 rewritten a bit to
make the relevant distinctions. People should check that new
text carefully to be sure it reflects their intent.
** R.17 ** Safe, but only in conjunction...
Mark wrote: '"that are safe for use only in
conjunction". Since you never say why they are unsafe, this
needs clarification. Do you mean this because of visible
confusability?'
Comment: The reference was in conjunction with combining
characters that represent much the same situation as
the joiners, i.e., that they don't add decoration to the
previous character but, instead, just change its
presentation form. An example of this is Arabic Tatweel
(U+0640), which has been discussed extensively in comments
by individuals and from ASIWG. I don't know how many other
examples there are. In the case of Tatweel, the ASIWG
recommendation has been to ban it entirely (i.e., treat it as
DISALLOWED). If we follow that advice, it may not be a
real example, but perhaps it illustrates the point.
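As an illustration only, the following Python sketch looks at Tatweel through the standard unicodedata module. The two Arabic strings are arbitrary examples, and differs_only_by_tatweel is a hypothetical comparison aid, not a mapping that any of the documents propose.

```python
import unicodedata

TATWEEL = "\u0640"

def describe(ch):
    """Return (name, general category, canonical combining class)."""
    return unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch)

def differs_only_by_tatweel(a, b):
    """Hypothetical check: distinct strings that match once Tatweel is ignored."""
    return a != b and a.replace(TATWEEL, "") == b.replace(TATWEEL, "")

plain = "\u0633\u0644\u0627\u0645"            # four Arabic letters
stretched = "\u0633\u0640\u0644\u0627\u0645"  # same letters plus one Tatweel
```

Tatweel is a modifier letter with combining class 0 rather than a combining mark, so the two strings are different code point sequences even though the second is only an elongated rendering of the first.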
It is worth noting that, as long as the approval process for
changes remains IETF Review, it is sufficient for this
document to use a relaxed definition of "safe", to be
evaluated on a case-by-case basis, rather than trying to
narrow things down to definitions that would be automatic.
Text that would further clarify this would be welcome.
Status: No comments or suggested text received. This will
be dropped as an issue unless there is a discussion RSN.
** R.18 ** DISALLOWED in error.
Mark commented on the statement 'If a character is
classified as "DISALLOWED" in error and the error is
sufficiently problematic, the only recourse would be either
to introduce a new code point into Unicode and classify it as
"PROTOCOL-VALID..."': 'Unless you have some evidence to
think that this is a real possibility (I don't), it should
be removed.'
Comment: I don't feel that this is worth fighting wars, or
even writing long explanations (again), over. If there is
consensus that it should come out, it will come out.
However, I think we should all understand that most of
these kinds of "unless you have evidence", "unless you
can prove this could happen", and "unless you can show an
example" arguments can produce different conclusions
depending on how they are stated. For example, the issue
with the above could be stated (with apologies for being
obnoxious) "Unless there is evidence that the Unicode
Consortium has never made, and will never make, a serious
mistake, then the text should stay in". Again, I'm not
arguing for keeping the text, but I do think we need to make
decisions without getting trapped by the phrasing of
questions.
Status: No further discussion. Unless some, from someone other
than Mark or myself, appears soon, this will be considered
settled.
** R.23 ** Detecting domain names in text
Section 9, Paragraph 2 of the text contains the statement
"If a domain name appears in an arbitrary context (such as
running text), one may be faced with the requirement to know
that a string is a domain name in order to adjust for the
different forms of dots but also to have traditional dots to
recognize that a string is a domain name -- an obvious
contradiction."
Mark wrote: "Not a contradiction, remove. Example, if one
recognizes full-width dot in detecting URLs, then one can
clearly use them in parsing within labels."
Comment: Either I don't understand your example, or it
supports my point. If one "recognizes full-width dot in
detecting URLs", then one has already made a decision that
goes outside IDNA and the URL standards. One could as
easily "recognize" any other character, whether dot-like or
not and whether on the IDNA2003 "treat as dot" list or not.
That recognition would occur in one of two cases:
(i) One treats the dot-like character as equivalent to an
ASCII dot throughout the application. The domain name is
then recognized, in essence, by recognizing the ASCII dot as
usual. That might well work for a full-width character in
contexts that are used to treating all full-width
Latin-derived characters as identical to their ASCII
equivalents, but certainly would not apply to various other
characters that people have insisted are dot-like enough (or
sentence-separator-like enough) to be treated as label
separators.
(ii) One treats the dot-like character as an ASCII dot
because one knows that one is in a domain name or URL
context. But therein lies the contradiction to which I
referred: if you need to know the context in order to
determine whether the dot-oid is to be treated as an ASCII
dot and/or label separator, then you cannot use the
dot-oid to determine the context.
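The circularity in case (ii) can be sketched in Python. The separator list is the one IDNA2003 (RFC 3490, Section 3.1) treats as dots; the regular expression is a deliberately naive, hypothetical detector for running text, and the function names are invented for this illustration.

```python
import re

# Label separators that IDNA2003 (RFC 3490, Section 3.1) treats as dots.
IDNA2003_DOTS = "\u002E\u3002\uFF0E\uFF61"

def split_labels(name):
    """Split a string ALREADY KNOWN to be a domain name into labels."""
    return re.split("[" + re.escape(IDNA2003_DOTS) + "]", name)

# Hypothetical detector for running text: it recognizes ASCII dots only.
ASCII_DOMAIN = re.compile(r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)+\b", re.IGNORECASE)

def find_domains(text):
    """Return substrings of running text that look like ASCII-dotted names."""
    return ASCII_DOMAIN.findall(text)

with_ascii_dots = "see www.example.com now"
with_ideographic_dots = "see www\u3002example\u3002com now"
```

split_labels handles either form, but only once something else has established that the string is a domain name. find_domains, which has to establish that, sees the first sentence and misses the second. Teaching it about U+3002 would make it fire on ordinary CJK sentence endings as well.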
Obviously the explanation in the text is not good enough.
Status: New text was inserted in -01, along with a
placeholder note. Further rewriting was done in -02 to
reflect the input from other areas during IETF 72.
Comments and suggestions welcome.
** R.27 ** The role of policy vis-a-vis this spec.
This specification (and Protocol) stress that registry
policies are an important element of a working, complete,
IDN environment. We don't specify any policies, even
minimal ones, and "our policy is 'open season on users'" is
a possible response. Is that plausible or do we need to do
something else? And, if something else, what?
See R.9 and P.6.
Critical Path slide 3, "Rationale 2"
Status: New text coming for the next versions of the
documents after more discussion in Dublin (IETF 72).
Please watch for it and comment.
** R.28 ** Transitions for applications taking advantage of
IDNA2003 mappings
There are a number of web pages in the wild in which
characters mapped out by IDNA2003 (e.g., "Mathematical"
forms, subscript and superscript digits) are used,
presumably in the interest of a distinctive presentation.
Rationale does not discuss that issue nor offer specific
advice about transitions. Should it?
Status: New text in -02. Please review and comment.
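For readers who want to see what "mapped out" means here, a minimal Python sketch follows. It uses only NFKC from the standard unicodedata module, which is one step of the IDNA2003 Nameprep processing rather than the whole of it; the sample string and helper names are invented for illustration.

```python
import unicodedata

def nfkc(s):
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", s)

def needs_compat_mapping(label):
    """True if NFKC would change the string, i.e. it is not in final form."""
    return nfkc(label) != label

# MATHEMATICAL BOLD SMALL A followed by SUPERSCRIPT TWO
fancy = "\U0001D41A\u00B2"
```

A page that writes a name with such characters relies on the mapping to reach the plain form, and a check like needs_compat_mapping is one way a transition tool might flag those cases.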
-------------------
Resolved in Dublin (IETF 72) or otherwise settled
These items are final and the issues closed unless arguments
are raised on the list that reverse the apparent Dublin
consensus.
-------------------
** R.6 ** User agents and warnings
The last paragraph in Section 6 (actually 6.3 on layered
restrictions) explicitly points out the role of user agents,
and then concludes with a warning against threats that
cannot be completely prevented or blocked. That sentence
is redundant with a disclaimer in Security Considerations.
Should it be removed? (I believe that someone explicitly
asked that something along these lines be said in 6.3, but
it is redundant, so I'm checking.) Silence will be
interpreted as "leave the text, drop the placeholder".
Status: No comments before or during IETF 72 in Dublin.
Placeholder has been removed and this issue is believed to
be settled.
** R.8 ** Mechanisms for updating the context registry
Section 13.2 ("IDNA Context Registry") contains a discussion
of the updating rules for the Contextual Rules registry.
Are those rules the ones we want?
Comment: Pursuant to discussion on-list, this section was
rewritten in -01 to require IETF review and approval. As
discussed on the list, I (and at least some others) still
hope that we can eventually get to a process that is more
based on expert review, but it appears to be well in the
future.
Critical Path slide 2, "Rationale 1"
Status: This is believed to have been settled at IETF 72 in
Dublin and there have been no comments on the list about
the -01 changes. Most of the material in 13.2 (part of
IANA Consideration) has been removed to the Tables document
and a pointer inserted.
** R.10 ** Stability of Labels.
Mark Davis wrote: I believe quite strongly that once a
domain name is valid, it should not be invalidated by any
later version of IDNA. Now, while we cannot prevent a later
RFC from doing that, we *can* prevent such invalidation by
the normal process of updating tables under these RFCs for
new versions, adding exceptions, and changing contextual
rules.
Comment: I've included this in the Rationale list to be sure
that it doesn't get lost. I believe that there is consensus
on this point. If that is the case, the only issue is
identifying any places where the documents might be
inconsistent with it. If there are no additional comments
on this subject through the end of IETF, I propose to
interpret that as agreement.
Status: No further comments; interpreted as agreement.
Done.
** R.11 ** Instability of Nonlabels.
Mark wrote:
I also think that making nonlabels stable should *not* be a
goal. It can't really be achieved anyway, since the presence
of an UNALLOWED character can make a label be invalid in
version X yet valid in version Y (where that character is
defined).
Comment: See R.10 and R.5 above. I assume that "UNALLOWED"
was a typo for "UNASSIGNED", not "DISALLOWED" or something
else. I've included this in the Rationale list to be sure
that it doesn't get lost. I believe that there is consensus
on this point although perhaps not as clearly so as with
R.10, but that it is worth making a distinction between
UNASSIGNED and DISALLOWED in terms of stability (see R.5).
If that is the case, the only issue is identifying any
places where the documents might be inconsistent with it.
On the other hand, we may just have a misunderstanding. See
the second "Description of major changes..." list, below.
Status: Some hall discussions (not in the WG) during IETF
72 lead me to believe that we have agreement that strings
that do not qualify as labels cannot be guaranteed to be
preserved in that state when the reason is because they
contain UNASSIGNED characters. Rationale explicitly says
that strings that are invalid because they contain CONTEXT
characters but fail the rule tests cannot be guaranteed to
not change state. However, the real issue is whether a
character, once DISALLOWED, stays DISALLOWED (see R.5).
Unless there is dissent from that analysis I propose to drop
this issue and leave R.5.
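The version dependence of UNASSIGNED can be sketched with Python's unicodedata module. This is an approximation under stated assumptions: general category Cn also covers noncharacters, the answer depends on the Unicode version bundled with the interpreter, and the function names are invented for this note.

```python
import unicodedata

def unassigned_code_points(label):
    """List characters that this Python's Unicode tables report as Cn."""
    return [ch for ch in label if unicodedata.category(ch) == "Cn"]

def report(label):
    """Summarize the result together with the Unicode version consulted."""
    found = unassigned_code_points(label)
    status = "has unassigned code points" if found else "all assigned"
    return f"Unicode {unicodedata.unidata_version}: {status}"
```

Running the same check against tables for a later Unicode version can turn a rejected string into an accepted one, which is why stability cannot be promised for strings that fail only on this test.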
** R.12 ** Management.
(Again from Mark's note and again in this list to prevent
its getting lost)
The process of adding backwards compatibility characters,
context conditions, and exceptions needs to be much more
definitive.
Comment: I think the discussions during and since IETF 71
have focused on "change the RFCs using normal IETF
processes". -01 has been changed to reflect that (see R.8
for more discussion on the specific topic of Contextual
Rules). If we believe that, I don't know how much more can
be said at this stage.
Status: There has been extensive on-list discussion,
confirmed by discussion in Dublin. This is believed to
have been settled at IETF 72 and, subject to the usual
qualifications, is done.
** R.13 ** Statement about the reason why IDNA uses Unicode
The statement in 00 was "strange" and problematic.
Status: After discussion on the list, Ken Whistler's
suggested text was substituted into the text of -01. To the
extent necessary (I think we reached agreement on the list),
people should verify that is what is wanted.
Status: Silence after -02 is posted will be interpreted as
consent and this Issue will be dropped.
** R.14 ** General editorial suggestions from Mark.
Status: All of these that appear editorial and
uncontroversial have been incorporated. Those that were
not considered editorial and uncontroversial are noted
elsewhere in this list.
People should check diffs and/or Mark's list to be sure they
are satisfied with the changes.
Since no comments have been received on any of the changes,
they are considered done.
** R.19 ** Slightly-redundant text.
The last paragraph of section 6.3 is redundant with a
similar comment in Security Considerations. Should it be
retained?
Comment: Mark suggests "yes" and I'm inclined to agree. If
others agree, I'm removing the "anchor" comment.
Status: No dissenting comments either on-list or during
IETF 72. This is considered Done and the anchor has been
removed.
** R.21 ** The "ae" example and discussion text.
In Section 7.3, Paragraph 4, the text reads 'the
two-character sequence "ae" is usually treated as a fully
acceptable alternate orthography.' Mark suggests adding
'for the "umlauted a" character'.
Comment: This should have been clear from the previous
paragraph and the fact that the sentence in question starts
with "That character (U+00E4)", a very explicit
back-reference to that previous paragraph. However, the
suggested change seems relatively harmless.
Status: The suggested change has been made to the text.
Anyone who is unhappy about it should say so. Previous
experience indicates that the RFC Editor may take exception
to that many repetitions of "umlauted a", but we will deal
with that if and when we get there.
Status: No comments received; this is considered done.
** R.22 ** "cannot be represented directly in domain names"
In the first paragraph of Section 9, the text "use
characters that cannot be represented directly in domain
names but for which interpretations are provided." appears.
Mark asks: "What is meant by this, and how is it different
in IDNA2008? In both IDNA2003 and 2008 they are illegal."
Comment: This was intended to get at the mapping issues,
both with characters that disappear under NFKC and hence can
be interpreted to be part of domain names and, in
particular, to odd cases like the Sharp S (Eszett) mapping
to "ss". The situations are definitely different between
IDNA2003 and 2008. I think what I was trying to do with
that convoluted sentence was to identify the situation
without getting into a discussion of mapping and mapping
issues. I obviously failed.
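The Sharp S case can be seen with Python's built-in "idna" codec, which implements the IDNA2003 ToASCII operation of RFC 3490, including Nameprep. The wrapper name and the sample label are chosen for illustration only.

```python
def idna2003_to_ascii(label):
    """Run a single label through Python's IDNA2003 ("idna") codec."""
    return label.encode("idna").decode("ascii")

# U+00DF (Sharp S) is folded to "ss" before any Punycode step is needed.
mapped = idna2003_to_ascii("stra\u00dfe")
```

Under IDNA2003 the label written with U+00DF and the label written with "ss" are the same name on the wire, so the original character is not recoverable from what is looked up.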
In -01, a placeholder was inserted in the text, along
with preliminary suggested text from Patrik. Please review,
comment, and suggest improvements or alternatives if
appropriate.
Status: No discussion on list or at IETF 72. This issue is
considered done.
** R.26 ** Trimming additional text.
An editorial note appears in the Change Log at the end of
Section 15.8 about whether a list of additional text
sections should be rewritten, trimmed, or dropped. The
group should review that list.
Status: Change Log has been trimmed in -02, taking the
questionable text and note with it.