Mappings

John C Klensin klensin at jck.com
Tue Jul 21 17:29:23 CEST 2009



--On Tuesday, 21 July, 2009 10:00 +1000 Chris Wright
<chris at ausregistry.com.au> wrote:

>... 
>> Now, what goes on between the registrar and the user to
>> identify the relevant U-label is, well, between the registrar
>> and the user.  My personal recommendation is that registrars
>> stay as close to expecting U-labels from users as possible,
>...
> I 100% agree and understand that the 'real' label is the
> U-label that has a 1-1 corresponding relationship with an
> A-label, registries should only deal with U-labels/A-labels
> and the rationale for all of that makes sense.
> 
> So I guess what you are saying is that by stating that
> registries should only accept Labels in NFC form with protocol
> valid code points (PVALID or CONTEXTx) you are implicitly
> saying that someone (probably registrars) SHOULD apply NFC to
> any string before sending it to the registry, and then by
> virtue of the fact that all uppercase code points are not
> protocol valid, you are implying that 'someone' SHOULD
> lowercase / case fold names before sending them through, and
> then further implying that because all of this is done on
> registration, application developers SHOULD do the same thing
> before looking up names?

No.  We have a slightly different model here.  First of all, I
believe that the question of how to balance letting the users
know what is actually happening and trying to keep them
relatively close to it versus trying to do a lot with smoke and
mirrors to yield a better and more convenient experience is a
very complex one, involving equally complex tradeoffs.  The
observation that domain names are used as identifiers, or parts
of identifiers, further complicates the tradeoffs, especially
when (as is the case here), different types of systems are going
to try to compare those identifiers in different ways, sometimes
following whatever standards are applicable.  So we have URIs
compared by string matching, with rules in some places saying
that, if a string match fails, the URIs are different.  We have
domain names compared by string matching, by comparing query
results, by resolving aliases and then doing string matching, by
noticing that part of the string is an IDN and comparing ACE
forms (as IDNA2003 requires but IDNA-unaware applications may
not know about), and possibly in other ways, most of them varied
by the observation that "string comparison" is conditioned by
the "case-insensitive matching that assumes ASCII if the high
bit in an octet isn't set" rule that seems very odd if stated
that way.
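That octet-level rule is easier to see in code than in prose.  A
minimal sketch (a hypothetical helper, not from any RFC text):
fold case only for octets in the ASCII letter range, and leave
high-bit octets alone.

```python
def dns_octet_equal(a: bytes, b: bytes) -> bool:
    """Compare two DNS names octet by octet, folding case only
    for ASCII letters (octets with the high bit clear)."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        # Fold A-Z to a-z; high-bit octets are left untouched.
        if 0x41 <= x <= 0x5A:
            x += 0x20
        if 0x41 <= y <= 0x5A:
            y += 0x20
        if x != y:
            return False
    return True

# ASCII labels match case-insensitively...
assert dns_octet_equal(b"Example.COM", b"example.com")
# ...but raw UTF-8 octets (high bit set) are not case-folded.
assert not dns_octet_equal("bücher".encode("utf-8"),
                           "BÜCHER".encode("utf-8"))
```

Which is exactly why UTF-8 stuffed directly into the DNS does not
get the case-insensitive behavior users expect from ASCII names.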

My own guess is that, as a community, we will never completely
agree about the right way to make those tradeoffs.  And,
regardless of what this WG does, there will be inconsistencies
in behavior because there are many sets of standards applicable
to different uses of those identifiers and they are not, beyond
a certain point, consistent (even if they were always followed).
There are also some common practices that are not standards,
such as use of UTF-8 directly (with rules and restrictions
different from the IDNA ones) for identifiers that the user
cannot tell from DNS-based identifiers.


> My concern is about when users cannot (or do not) input
> U-labels that we describe how to make a U-label. I understand
> that this is broad and thus possibly impractical, but if we
> assume the starting point is a Unicode string we should be
> able to describe something. 

One answer is that that particular assumption is not justified, as
pointed out in the mapping document and in earlier versions of
Rationale and Protocol.  Another is that it has to be the
responsibility of the application or API that ultimately calls
on IDNA (or the operating system in which it is embedded) to get
that straightened out and to do so in a way that is compatible
with other things that it is doing.   It would be, at least IMO,
even more unreasonable to ask users to type IDNs in some special
and unusual way than it would be to force them into escaping
characters just to get U-labels directly (of course, even
that tactic would not work if the local system's base user
interface character set is not Unicode-based).

So we are not "describing how to make a U-label", precisely
because we would have to consider making that description
differently for each relevant operating system or embedded
environment.  What we have said, all along, is "by the time this
gets to IDNA, it must be in this form and contain only these
characters".   The mapping document supplements that very
general requirement by identifying cases for which there is
fairly general agreement that mapping is desirable to avoid
confounding very obvious user expectations.  Going further than
that puts us at greater risk of making the kinds of decisions
that many of us feel would put us at an inappropriate point in
the "smoke and mirrors" direction because what the user would
see on translation from a stored A-label to a U-label might look
too different from what the user put in.
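For illustration only, the kind of limited mapping the mapping
document contemplates (case mapping, width mapping, NFC as the
final step) might be sketched as below.  The function names are
hypothetical and this is not the normative algorithm; width
mapping is approximated here via the Unicode <wide>/<narrow>
compatibility decompositions.

```python
import unicodedata

def width_map(ch: str) -> str:
    """Map fullwidth/halfwidth characters to their compatibility
    decomposition (decomposition types <wide> and <narrow>)."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<wide>") or decomp.startswith("<narrow>"):
        return "".join(chr(int(cp, 16)) for cp in decomp.split()[1:])
    return ch

def map_label(label: str) -> str:
    """Hypothetical pre-lookup mapping in the spirit of the
    mapping document: lowercase, width-map, then NFC last."""
    label = label.lower()                          # case mapping
    label = "".join(width_map(c) for c in label)   # width mapping
    return unicodedata.normalize("NFC", label)     # NFC as last step

# Fullwidth "EXAMPLE" maps to plain lowercase ASCII...
assert map_label("\uFF25\uFF38\uFF21\uFF2D\uFF30\uFF2C\uFF25") == "example"
# ...and a decomposed u + combining diaeresis composes under NFC.
assert map_label("Bu\u0308cher") == "b\u00FCcher"
```

Anything much beyond those steps is where the "smoke and mirrors"
concern above starts to bite.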

At some level, those considerations might reasonably apply to
the registration side as much as to the lookup one.  But, in
addition to some other issues (which I hope are adequately
covered in Rationale), the WG concluded that it was especially
important on registration that the registrant be extremely clear
about what was being registered, about what the U-label was, and
that she be reinforced in the understanding that the canonical
U-label form is the "real" form of the label, to be used in any
context in which the "that form will always work" guarantee is
more important than, e.g., the preferences of a marketing
department.  We aren't telling those registrants what those
circumstances are -- we can't know, and opinions and advice will
differ.


> We have algorithms that decide if
> a code point should be PVALID or not, surely the logic that
> was used in that instance would allow us to come up with a way
> those algorithms can be applied to turn a Unicode string into
> a U-label (I am not expecting that this will always be
> possible). I think the process described in the mapping
> document should be sufficient for most cases, i.e. some form of
> case folding/lowercasing, width mapping, followed by NFC. I am
> concerned about the ordering though, on looking into it more
> the following concern comes up, regardless of where the
> mappings document places NFC, NFC will always need to be done
> as a last step anyway (as the registry expects the label to be
> in NFC form):

Well, at least it will need to be checked.
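That check is one line with Python's unicodedata (the helper name
here is hypothetical; is_normalized requires Python 3.8+):

```python
import unicodedata

def is_nfc(label: str) -> bool:
    """Hypothetical registry-side check: is the label already in
    NFC form?  (unicodedata.is_normalized is Python 3.8+.)"""
    return unicodedata.is_normalized("NFC", label)

assert is_nfc("b\u00FCcher")        # composed ü: already NFC
assert not is_nfc("bu\u0308cher")   # u + combining diaeresis: not NFC
```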

> So the question to the Unicode experts is does:
>...

> I have to admit that you do now have me questioning my own
> position on mappings and whether that process (ie. The
> mappings) needs to be specified as a MUST as part of the
> lookup process, perhaps a SHOULD will be sufficient. I am
> going to think about this more...

Let me know what you conclude and, perhaps more important, why.
We may not agree, but understanding of each other's perspectives
is useful.

>...
>> Conversely, if a registrar or registrant decides to submit
>> only U-labels and A-labels, there is no issue about mapping
>> (or much of anything else unless you are also applying variant
>> processing, which is also outside the WG's scope).
> 
> Agreed
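As an aside, the 1-1 U-label/A-label correspondence discussed
above is easy to demonstrate with Python's built-in "idna" codec.
That codec implements IDNA2003 rather than the IDNA2008 protocol
under discussion here, but the ACE round trip it shows is the
same idea:

```python
# The built-in "idna" codec (IDNA2003) converts between a U-label
# and its ACE (A-label) form, and the mapping round-trips exactly.
u_label = "bücher"
a_label = u_label.encode("idna")          # ACE form
assert a_label == b"xn--bcher-kva"
assert a_label.decode("idna") == u_label  # back to the U-label
```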

    john




