Issues lists and the "preprocessing" topic

Erik van der Poel erikv at google.com
Wed Aug 20 17:09:24 CEST 2008


Hi John,

I'm not quite sure how to put this into words, but here goes, anyway...

You seem to distinguish text typed by a user from that found in
"files", but you don't seem to admit that some files are created by
plain text editors that allow users to type whatever they want. In
principle, "files" could be created by specific apps that enforce
particular transformations from user-typed text to some canonical form
that depends on context (e.g. "protocol contexts" like HTML hrefs).
However, the fact is that various "files" (e.g. HTML, email) are typed
directly by users, and various apps process these files automatically,
without user intervention. So, if we are to exchange such files,
whether over the net or via other media such as DVDs, the receiving
apps ought to agree on the transformations that are to be applied to
the text, otherwise we will have interoperability problems. Any
spec/rationale that ignores these facts is not very useful (I'm not
saying "useless").
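
To make the interoperability concern concrete, here is a rough sketch of
two hypothetical receiving apps handed the same user-typed host name. It
assumes Python's built-in "idna" codec, which applies the IDNA2003
mapping. The function names and the "strict" policy are illustrative
assumptions only, not taken from any draft or from any real
implementation.

```python
# Two hypothetical receivers handed the same uncanonicalized, user-typed name.

def mapping_receiver(name: str) -> str:
    """Map the name IDNA2003-style using Python's built-in idna codec."""
    return name.encode("idna").decode("ascii")


def strict_receiver(name: str) -> str:
    """Illustrative policy: accept only lowercase ASCII names, map nothing.

    This is a simplification; it does not validate the labels themselves.
    """
    if not name.isascii() or name != name.lower():
        raise ValueError(f"not in canonical form: {name!r}")
    return name


typed = "B\u00fccher.example"  # "Bücher.example" as a user might type it

print(mapping_receiver(typed))  # xn--bcher-kva.example
try:
    strict_receiver(typed)
except ValueError as err:
    print("rejected:", err)
```

The same file resolves in one app and fails in the other, which is the
kind of disagreement between receivers described above.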

Now, the recent IETF meeting has made it even clearer to me that there
seems to be a rift between "clean spec proponents" and "real-world
HTML implementors". IDNA may be new and rare enough to point the
HTML/email/etc ship(s) in a new direction via prescriptive specs, but
I am quite skeptical. I believe we may have already arrived at the
point in time where we must write *descriptive* specs. Note that HTML
5 has many sections that are descriptive, but also has a few that are
prescriptive, especially those that cover new features.

Bottom line: IDNA200X may choose the prescriptive route if the WG
agrees to do so, but my fear is that it may be ignored by
implementors. The arguments given for strict usage of A-labels and
U-labels in various contexts are, in my opinion, not as compelling as
the argument that receiving apps ought to agree on the transformations
that are to be applied to uncanonicalized user-typed text that *is*
being sent over the net and other media.

Yes, yes, be conservative in what you send, and liberal in what you
accept, but it may be difficult for the WG to convince receiving app
implementors to become less liberal in what they accept. HTML
implementors are notorious in this regard. :-)

The browser developers have been far too quiet on this mailing list.
What are your opinions, plans, etc?

Erik

On Mon, Aug 18, 2008 at 5:48 PM, John C Klensin <klensin at jck.com> wrote:
> Hi.
>
> I'm finishing cross-checking and updating the Protocol and
> Rationale documents, but, in the interim and in the hope of
> keeping things moving forward,
>
> (1) I'm attaching updated versions of the "outstanding issues"
> documents circulated before the IETF meetings.  These reflect
> decisions made there.  Those who disagree (especially if you
> have something new to say) are _strongly_ encouraged to make
> your case on the mailing list.  If we have silence, I'm going to
> ask Vint to declare many of these issues closed (which ones
> should be obvious from the lists).   I've also reorganized the
> lists by status, i.e., separating those things that I believe
> are still in need of discussion from those that I believe are
> finished or nearly so.  Opinions may differ about those
> categories but, if so, I hope that those who disagree will speak
> up very soon.
>
>
> (2) I've been working on the "preprocessing" and mapping issue
> in an attempt to reflect where we stand in the documents.   It
> is unlikely that the next versions of the drafts will have this
> completely right, but I want to try to return to principles and
> see if we agree (or not) about them.   If we do agree, we can
> then have discussion about tuning the text to best reflect those
> principles.  If we do not, then I believe that it would be more
> effective to discuss those disagreements about principles rather
> than quibbling about specific text.
>
> I believe that:
>
>        (a) Our target is to have any IDN that moves across the
>        network contain non-ASCII labels in either U-label or
>        A-label form (i.e., that no mapping should be required).
>        In addition, IDNs in protocol contexts, including HTML
>        "href"s, should be in A-label form (i.e., be URIs, not
>        IRIs). We aren't going to completely accomplish either
>        of those goals for the reasons below, but they are still
>        desirable targets.
>
>        (b) Both long-term and short-term, systems that actually
>        read and manipulate strings typed by users are going to
>        need more flexibility than ones that process files.
>        Such flexibility may include some operations that we
>        talked about during the IDNA2003 development period but
>        that, as far as I know, have not been implemented on any
>        significant scale.  For example, my current preferred
>        email client distinguishes between "copy link location"
>        and "copy email address", so that, given
>            <a href="mailto:foo@example.com">Joe Blow</a>
>        Copy link location would yield "mailto:foo@example.com"
>        Copy email address would yield "foo@example.com"
>        and a possible "copy" would yield "Joe Blow",
>        each in the relevant copy buffer ("clipboard").
>
>        One can imagine similar copy operations applied to IDNs
>        or IRIs that would yield domain names containing
>        A-labels or URIs, respectively.
>
>        Obviously, the typical user would have no clue about the
>        differences among these operations initially.  But,
>        faced with situations in which "copy and paste" just
>        doesn't work, strings that cannot be passed to a
>        colleague or even a different application on the same
>        system without display problems (e.g., rows of
>        question marks or boxes or characters drawn from some
>        other CCS), and given decent user interfaces, they would
>        learn quickly.
>
>        (c) Compatibility with IDNA2003 will require mapping of
>        stored strings in some contexts.  Ideally, those
>        mappings should be strictly confined to characters
>        mapped by IDNA2003 and interfaces to them should be
>        designed to encourage migration to no-mapping forms.
>        Some types of applications, such as indexing ones, might
>        need to preserve these types of mapping much longer than
>        others.  At the other extreme, web browsers might be
>        configured to warn before mapping, or even to reject
>        domain names that require them, unless the user was
>        clearly referencing older pages.
>
> The implications of the above are that we not only aren't
> encouraging extensive local-option mapping, we are encouraging
> no mapping at all except for backward compatibility when
> necessary and as a user interface convenience.   For the latter,
> the expectation is that one will make the mappings as early as
> possible and use only the mapped (U-label or A-label) form in
> files; storing anything else in a file or sending it across the
> network is strongly discouraged.   Also, even when mappings are
> done, the rule that is now present in the documents still
> stands, i.e., one must not map a PVALID or CONTEXT character
> into anything else -- mapping is permitted only for DISALLOWED
> characters.
>
> So, do others agree with that and, if not, where are the
> disagreements and why?
>
>    john
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>
>

