Parsing the issues and finding a middle ground -- another attempt

Wed Mar 4 03:57:13 CET 2009

On Tue, Mar 3, 2009 at 3:54 PM, John C Klensin <klensin at jck.com> wrote:
> (Message rearranged ... easy part first)
> --On Tuesday, March 03, 2009 15:05 -0800 Erik van der Poel
> <erikv at google.com> wrote:
>> I can live with Eszett, Final Sigma, ZWJ and ZWNJ under xn--,
>> but I'd be happier if we heard some kind of confirmation or
>> approval from the German, Greek and Iranian registries,
>> respectively.
>
> We already have "confirmation" from the German registry.  I
> think the notes that Marcos posted months ago were completely
> clear.  If I correctly understand the position of the Greek
> registry, they would prefer the IDNA2003 mappings for Final
> Sigma and would also prefer additional mappings to cover the
> Tonos cases.  If we adopted this model, they could effectively
> get the behavior they want for Final Sigma by declining to
> register any labels that actually contained that character.  I
> don't know whether they would be better or worse off with the
> character itself as PVALID or banned, but my instinct says it
> would be better treated as PVALID in the protocol.
>
> For ZWJ/ ZWNJ, there are actually a fairly significant number of
> registries and scripts involved.  The characters are as, or
> more, important in Pakistan than they are in Iran and may be
> more important for Devanagari and some other Indic scripts than
> they are for anything Arabic-script-based.  Over the years,
> we've gotten significant input from the Indian registry that
> they are needed (although no longer, with Unicode 5.1, for
> Malayalam).

I realize that ZWJ and ZWNJ are needed for scripts and languages other
than the Iranian registry's chief scripts/languages. It'd be nice to
hear from at least one registry that really needs each of the four
special cases. I'd especially like to hear that they have considered
and understood the transition issues, particularly in the context of
URLs/URIs/IRIs. E.g. that the plan is to encourage registrants to wait
until most of the installed clients can handle them before using them
extensively. But of course the registry's business is their own
business, and I don't need them to respond -- it'd just be nice,
that's all.

>> I am sympathetic with John's concern that implementations
>> ought to be able to perform local mappings for Turkish, and
>> still claim compliance with the protocol.
>>
>> I am also sympathetic with John's concern that other
>> protocols, such as email, ought to be able to mandate U-labels
>> or A-labels (without mapping).
>>...
>
> I've been hinting about what is about to follow but seem to get
> sidetracked every time I try to construct a note that works its
> way through all of the cases and issues.  Just to mention
> another alternative before we go too far down one path or
> the other...
>
> Independent of what various implementations have done and gotten
> away with often enough to establish a frequent practice
> (remembering that a small fraction of the web pages in the world
> is still a very large number), the intended (by the IETF, at
> least) reading of the URI spec is that IDNs in the domain name
> field must be in A-label form.  There is no real provision in
> that interpretation of that spec for %-escaped UTF-8: one can
> write them, but it isn't an IDN slot so one should really expect
> that they would be looked up in the DNS with the percent signs
> and digits, not converted to A-labels.

I'm not sure what the specs say, but I would have thought that one
should unescape the %-escapes after extracting the domain name from
the URI, before looking up the string in the DNS. So, to look up a
literal percent sign, one would have to escape it as e.g.
http://100%25genuine.biz.
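
To make that reading concrete, here is a minimal Python sketch. It
shows the behavior I would have expected, not necessarily what the
URI spec mandates; the function name host_for_lookup is made up for
illustration, and the %-escapes are assumed to encode UTF-8.

```python
from urllib.parse import urlsplit, unquote

def host_for_lookup(uri: str) -> str:
    """Extract the host from a URI, then undo %-escapes (assumed UTF-8).

    This follows the 'unescape after extraction' reading discussed
    above; it is an illustration, not a statement of what the URI
    spec requires.
    """
    host = urlsplit(uri).hostname  # extraction happens first
    if host is None:
        raise ValueError("URI has no host component")
    return unquote(host)           # unescaping happens second

# A literal percent sign has to be written as %25 in the URI.
print(host_for_lookup("http://100%25genuine.biz/"))
# %-escaped UTF-8 in the host comes back as the non-ASCII label.
print(host_for_lookup("http://b%C3%BCcher.example/"))
```

Under John's reading, by contrast, the string would go to the DNS
with the percent signs and hex digits still in it.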

> IRIs, obviously, can contain UTF-8 strings in the domain part.
>
> Just as an idea, rather than as a proposal, one possibility
> would be to more clearly specify that the IRI-> URI mapping was
> required to convert the domain-specific field into A-label form
> (rather than converting to %-escape form) _and_ to apply the
> IDNA2003 mappings (or whatever mappings one could get consensus
> on) there, possibly with the mappings specified on a
> per-protocol basis.
>
> That would:
>
>        * Isolate the mappings completely from IDNA, while
>        providing a logical place to apply them.
>
>        * Permit isolating concerns about the web generally, and
>        HTML/HTTP in particular, from other protocols and
>        concerns.
>
>        * Be consistent with current extended use of URIs if one
>        adopted some changes in terminology (but with no effect
>        on practice).  Specifically, since %-escapes are valid
>        in IRIs (although generally silly), we could describe
>        the extensions that use non-ASCII strings in URI slots
>        as extensions that permit IRIs in those contexts rather
>        than the standard-required URIs.  That still doesn't
>        make it a good idea (that is a separate discussion), but
>        would decouple the issue of non-ASCII characters, and
>        even %-escapes for domain slots, in URIs from the URI
>        spec and make them issues of where IRIs are
>        permitted.   And, for protocols where mapping was
>        appropriate, it would permit the mappings to be applied
>        regardless of whether the domain name was written in
>        UTF-8, in %-escaped UTF-8 or, in principle, in some
>        other character set or in A-labels -- the decisions
>        would be matters for the IRI spec, not IDNA.
>
> I don't know if this is a good idea or not.  And, obviously, the
> above is just an outline of a description rather than what I had
> intended to write.  But it is a different logical possibility for
> dealing with the mapping issues and one that would make most of
> the transition issues into web issues rather than IDNA issues.

I'm delighted to say that this all sounds reasonable, though I'm sure
many would implement mappings in other contexts as well, such as
clickable domain names in plain text email.
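
As a rough sketch of what such an IRI-to-URI step for the domain
field might look like, here is a small Python example. It assumes
Python's built-in "idna" codec, which implements the IDNA2003
(Nameprep) mappings; the function name iri_host_to_a_label is
hypothetical, and the per-protocol choice of mappings is not modeled.

```python
from urllib.parse import unquote

def iri_host_to_a_label(host: str) -> str:
    """Convert the domain field of an IRI to A-label form.

    Accepts the host written in UTF-8 or in %-escaped UTF-8, applies
    the IDNA2003 mappings via Python's built-in 'idna' codec, and
    returns the ASCII form. Labels that are already ASCII (including
    existing A-labels) pass through unchanged.
    """
    unescaped = unquote(host)  # treat %-escapes as UTF-8
    return unescaped.encode("idna").decode("ascii")

# Non-ASCII and %-escaped input end up as the same A-label.
print(iri_host_to_a_label("Bücher.example"))
print(iri_host_to_a_label("b%C3%BCcher.example"))
# Under the IDNA2003 mappings, Eszett is mapped to "ss".
print(iri_host_to_a_label("straße.example"))
```

The point is only that the mapping lives in the IRI-to-URI
conversion, outside IDNA proper, so a protocol that wants no mapping
could skip this step and require A-labels or U-labels directly.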

Erik