I-D ACTION:draft-klensin-idnabis-issues-01.txt
Kenneth Whistler
kenw at sybase.com
Sat Mar 10 01:44:33 CET 2007
John,
> # 8.2. More Flexibility in User Agents
>
> # For example, an essential element of the ASCII
> # case-mapping functions, that uppercase(character) =
> # uppercase(lowercase(character)),
>
> > Replace character by string, and you see that this is false
> > for ASCII (and it is not clear what the relevance is).
>
> I have made that replacement and, if it is not true for ASCII
> even after it, I'm missing something very fundamental.
O.k., exegesis follows. When Mark says, "replace character
by string", he means, consider the following text:
   For example, an essential element of the ASCII
   case-mapping functions, that uppercase(string) =
   uppercase(lowercase(string)), ...
But I think he misread that as implying roundtripping of
casing -- which clearly is not the case for ASCII strings.
In other words:
   uppercase(lowercase(C)) = C is true for ASCII
   uppercase(lowercase(S)) = S is false for ASCII
I agree with you, however, that for ASCII-only, the
statement as it stands would be true: in other words, it
doesn't matter if you uppercase a string or uppercase
the lowercase of a string -- you end up with the same
result either way.
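To make the distinction concrete, here is a minimal Python sketch of the two properties being discussed. The helper names upper_is_stable and round_trips are mine, invented for illustration, and Python's built-in str.upper() and str.lower() stand in for the ASCII case-mapping functions. It checks the identity from the draft against a few ASCII strings and, separately, whether lowercasing and then uppercasing returns the original string.

```python
def upper_is_stable(s: str) -> bool:
    """True when uppercasing s directly matches uppercasing its lowercase form."""
    return s.upper() == s.lower().upper()


def round_trips(s: str) -> bool:
    """True when lowercasing and then uppercasing s gives back s unchanged."""
    return s.lower().upper() == s


# Illustrative ASCII-only sample strings (not taken from the draft).
samples = ["example", "Example", "EXAMPLE", "www.Example.COM"]

for s in samples:
    print(f"{s!r}: identity holds={upper_is_stable(s)}, round trip={round_trips(s)}")

# The identity in the draft also holds for every individual ASCII character.
assert all(upper_is_stable(chr(c)) for c in range(128))
```

The identity holds for every sample, while the round trip only returns the original string when that string had no lowercase letters to begin with, which is why the two readings of the draft text lead to different conclusions.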
> Unless
> I have somehow mis-stated the condition, it is essential to
> the matching rules of the DNS, so, if there is a flaw, it
> hasn't been obvious.
However, that said, I agree with the thrust of Mark's comment
that it isn't clear what the relevance is in this section.
In fact, I find the entire paragraph in the draft
obscure:
   As suggested earlier in this section, it appears to be desirable to
   do as little character mapping as possible consistent with having
   Unicode work correctly (e.g., NFC mapping to resolve different
   codings for the same character is still necessary) and to make the
   mapping between A-labels and U-labels idempotent. Case-mapping is
   not an exception to this principle: if only lower case characters can
   be registered in the DNS (i.e., present in a U-label), then IDNA200x
   should prohibit upper-case characters as input. Some other
   considerations reinforce this conclusion. For example, an essential
   element of the ASCII case-mapping functions, that
   uppercase(character) = uppercase(lowercase(character)), may not be
   satisfied with IDNs: the relationship may even be language-dependent.
   Of course, the expectations of users who are accustomed to a
   case-insensitive DNS environment will probably be well-served if user
   agents perform case mapping prior to IDNA processing, but the IDNA
   procedures themselves should neither require such mapping nor expect
   it when it isn't natural to the localized environment.
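For what it is worth, the draft's claim that the identity "may not be satisfied with IDNs" can be checked against the default Unicode case mappings. The following is a sketch only: it relies on Python's str.upper() and str.lower(), which apply the language-independent Unicode mappings, and the helper name identity_holds and the choice of test characters are mine, not the draft's.

```python
import unicodedata


def identity_holds(ch: str) -> bool:
    """Check uppercase(ch) == uppercase(lowercase(ch)) with default Unicode mappings."""
    return ch.upper() == ch.lower().upper()


# Candidates chosen for illustration: one ASCII control case and two
# non-ASCII characters whose lowercase form does not map back to them.
candidates = ["A", "\u212a", "\u0130"]

for ch in candidates:
    name = unicodedata.name(ch)
    print(f"U+{ord(ch):04X} {name}: identity holds={identity_holds(ch)}")
```

KELVIN SIGN (U+212A) lowercases to an ordinary "k", which then uppercases to "K" rather than back to U+212A. LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130) lowercases to "i" plus a combining dot above, which uppercases to "I" plus the combining dot, again not the original character. Neither case involves any language-specific rule.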
This is intended in the draft to serve as justification for
not doing casefolding as part of IDNAprep (or whatever the
process is called), and, in keeping with the title of the
section, presumably, arguing that user agents should be
flexible in their handling of casing of IDNs.
But I have been reading, re-reading, and re-re-reading, and
my conclusion is that this comes down to essentially:
   As suggested earlier in this section, it appears to be desirable to
   do as little character mapping as possible. mumbo jumbo mumbo
   jumbo. Case-mapping is character mapping. mumbo jumbo mumbo
   jumbo. Therefore, IDNA procedures themselves should not
   require case-mapping. User agents can take care of it.
Now maybe the intent here is to keep the text hard to interpret,
I don't know. But even the advice doesn't seem well-structured.
Here is a crack at rewriting the text to do this better:
==================================================================
As suggested earlier in this section, it appears to be desirable
to do as little character mapping as possible in the IDNA
procedures themselves. Some character mapping is required
to ensure that the procedures mapping A-labels to U-labels
and back are idempotent, and to ensure that canonical
equivalence requirements for the use of Unicode itself are
followed (e.g., NFC normalization of input), but other character
mapping should be avoided.
With regard to case folding, the situation is as follows.
If only lowercase letters can be registered in the DNS (i.e.,
be present in a U-label), then the character mappings implied
by case folding can be avoided in the IDNA procedures by
simply prohibiting uppercase letters as input. This keeps
the IDNA procedures simpler, but at the cost of requiring
some greater degree of flexibility in user agents.
[[ Note: remember that "more flexibility in user agents" is
nominally the topic of this section! ]]
The expectations of users who are accustomed to a case-insensitive
DNS environment will probably be well-served if user agents
perform case folding (to lowercase) prior to IDNA processing,
even though the IDNA procedures themselves should neither
require nor expect such mappings. And due caution is in
order. It is not advisable to perform language-specific
case mappings on IDNs, as this potentially could result in
different resolutions for the same input. For example, the
string "III", if lowercased by Turkish casing rules, would
result in a different U-label than if lowercased by English
casing rules.
===================================================================
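The proposed text can be sketched in a few lines of Python, under loudly stated assumptions. Python's standard library has no locale-aware casing, so the Turkish rule is approximated here by a hand-written translation table; lower_turkish_sketch and prepare_label_sketch are hypothetical names of mine and are not part of any IDNA specification or library. The second helper is only one plausible shape for what a user agent might do before calling the IDNA procedures: language-independent lowercasing followed by NFC normalization.

```python
import unicodedata

# Assumption: approximate Turkish lowercasing by mapping the two letters that
# differ from the default rules (I -> dotless i, dotted capital I -> i).
_TURKISH_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})


def lower_turkish_sketch(s: str) -> str:
    """Hypothetical stand-in for lowercasing under Turkish casing rules."""
    return s.translate(_TURKISH_LOWER).lower()


def prepare_label_sketch(s: str) -> str:
    """Hypothetical user-agent step: language-independent lowercase, then NFC."""
    return unicodedata.normalize("NFC", s.lower())


label = "III"
default_form = prepare_label_sketch(label)
turkish_form = lower_turkish_sketch(label)

print(f"default rules: {default_form!r}")
print(f"Turkish rules: {turkish_form!r}")
print(f"same label: {default_form == turkish_form}")

# NFC resolves different codings of the same character:
# "e" followed by a combining acute accent becomes the precomposed letter.
print(prepare_label_sketch("Cafe\u0301") == "caf\u00e9")
```

The same three keystrokes yield "iii" under the default rules and three dotless i characters (U+0131) under the Turkish approximation, so a user agent that picked its casing rules from a localization setting would hand a different string to the IDNA procedures.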
I think something like that is much clearer. Note that it
doesn't change the recommendation you are trying to make
in idnabis-issues, namely that IDNA should not do casefolding,
but leave any casefolding to the user agents before they
call the IDNA procedures to do IDN resolving. However, I
really think the recommendation on flexibility for the
user agents needs the caution about this spelled out.
Instead of vague text that "the [casing] relationship may even
be language-dependent" as putatively contributing to the
argument to keep casefolding out of IDNA procedures (which
doesn't hold any water), the *real* issue here is that
if IDNA procedures don't do language-*in*dependent casefolding
as a matter of course, and you leave this up to user agents,
then you need to caution them *not* to go down the garden
path of applying language-dependent casefolding to URIs
before doing domain name resolution, or you are opening
yourself up to another entire class of spoofings and
incomprehensible label behavior, where the exact same
input string resolves or not (or worse, resolves *differently*)
depending on a localization setting in a browser.
--Ken