Mark Davis ⌛
mark at macchiato.com
Tue Jun 30 22:57:25 CEST 2009
On Tue, Jun 30, 2009 at 02:07, John C Klensin <klensin at jck.com> wrote:
> --On Monday, June 29, 2009 12:46 -0700 Mark Davis ⌛
> <mark at macchiato.com> wrote:
> >> > Now, my position is still that the simplest and most
> >> > compatible option open to us is to simply map with NFKC +
> >> > Casefold.
> >> I continue to believe that CaseFold is a showstopper. When
> >> its results are not identical to those produced by LowerCase,
> >> it produces results that are astonishing to some users and
> >> leads us into the "is that a separate character or not" trap
> >> that we've seen manifested at least twice. I note that TUS
> >> recommends against its use for mapping (as distinct from
> >> comparison) and appears to do so for just the reason that it
> >> involves too much information loss.
> > You need to provide actual data behind this. Please list
> > exactly the characters that you mean, and why you think they
> > are problematic. Note also that the formulation that I gave
> > means that any character that is PVALID would automatically be
> > excluded, eg if final-sigma is PVALID then it is unaffected.
> > And we can certainly introduce other exceptions.
> I don't like operating by exception when it can be avoided (an
> argument you have made as well). Getting into situations in
> which exceptions are required is not advantageous if we are
> trying to be as Unicode version independent as possible. I
> also prefer operations that casual users understand and believe
> they do (e.g., a Lower Case operation is fairly comprehensible
> to any user of a script that supports case distinctions, while
> CaseFolding is dependent on Unicode coding decisions). Final
> Sigma and Eszett are, indeed, the current examples but it
> continues to appear that LowerCase is both necessary and
> sufficient and that, if transformations it does not cover are
> needed, it is they that should be handled by exception.
What I am saying is that there would be exceptions to the set of mappings in
any event, not particularly just the case mappings.
I find myself puzzled. You seem to be focused on the name of the property,
rather than the results. I suggest that you list the differences between the
Lowercase mapping and the CaseFold mapping, and indicate at least one
example where there is the possibility of a real problem. The only possible
issues I could see would be:
- The character would be more likely interpreted as a different valid
character than the one it maps to, or
- We might add it as PVALID in the future.
I don't see any of those cases, so if you do, please list them for
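
One way to produce such a list mechanically is sketched below in Python. This is only an illustration, not part of any IDNA draft: the function name is hypothetical, Python's str.lower() and str.casefold() stand in for the Unicode Lowercase and CaseFold mappings, and the result depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata


def lower_vs_casefold_differences():
    """Return (code point, name, lower(), casefold()) for every character
    whose lowercase mapping differs from its case folding.

    Assumption: str.lower() and str.casefold() approximate the Unicode
    Lowercase and CaseFold mappings for the Unicode version this Python
    build ships with.
    """
    diffs = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        lowered = ch.lower()
        folded = ch.casefold()
        if lowered != folded:
            name = unicodedata.name(ch, "<unnamed>")
            diffs.append((cp, name, lowered, folded))
    return diffs


if __name__ == "__main__":
    for cp, name, lowered, folded in lower_vs_casefold_differences():
        print(f"U+{cp:04X} {name}: lower={lowered!r} casefold={folded!r}")
```

The output includes eszett (U+00DF, which lowercases to itself but folds to "ss") and final sigma (U+03C2, which lowercases to itself but folds to U+03C3). Each remaining entry can then be checked against the two criteria listed above.
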
> > And I know full well about the issues in TUS, having written
> > or participated in the writing of them.
> I assumed, given your role in the case-handling material, that
> you had written it. What I'm having trouble understanding is
> why, given the perfectly logical (at least to me) explanation of
> why CaseFold should be used for matching only that appears
> there, you keep wanting to use it as a mandatory mapping
> operation here.
The problem is that we are calling upon mapping to do the work of matching.
As you know, we can't actually change the mapping operation in the DNS, so
we are forced into this. Part of what I did was to go through all the
mappings, and try to make a reasoned judgment about which mapping operations
would serve as both, and not serve as complications. What I'd like is a
concrete review of the results, rather than vague (untestable) statements of
> A different way of looking at this is that I'm trying to resist
> mapping transformations that non-expert users believe lose
> significant information unless they can be demonstrated to be
> really important. Whatever can be said for, e.g., the
> FinalSigma -> Lower Case Sigma transformation and the tradeoffs
> between information-preservation and IDNA2003 compatibility, it
> seems to be generally understood that the transformation is
> information-losing. From that perspective, and the related
> perspective of minimizing complexity by choosing simpler
> operations rather than more complicated ones and not performing
> mappings that are not justified by real-world usage, it seems to
> me that it is you who need to make the case for operations that
> lose more information, for more complexity, and for mapping of
> more characters.
There are at least two different motivations for the mapping.
- Don't have unnecessary compatibility breakage with IDNA2003
- Meet people's expectations. A subcategory of this is where people see a
name, paste it in, and it doesn't work because there is a variant character
(eg µ instead of μ).
As to performing mappings that are not justified by real-world usage: what
data are you making your claims based on?
As to losing information, I think you are quite mistaken. The information
lost in case folding is far greater than that lost in the other mappings proposed. Look
at the following, for example:
That is far more different, to more people, than the difference between l+j
and the lj character in http://ljubav.rs (the lj being a single character
transliteration of љ).
Or the difference between fullwidth and normal:
I'm afraid that a focus on case-mapping is, and will be perceived as, a
Western-European language focus; excluding mappings that are important to
other parts of the world.
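
The compatibility variants mentioned above (µ versus μ, the single-character lj, fullwidth forms) can be checked with a short Python sketch. It is an approximation under stated assumptions: the helper name nfkc_casefold is hypothetical, and "normalize, fold, normalize again" is used here as a stand-in for the "NFKC + Casefold" mapping under discussion, not as the exact IDNA2003 or Stringprep procedure.

```python
import unicodedata


def nfkc_casefold(label):
    """Approximate an 'NFKC + CaseFold' mapping for a single label.

    Assumption: normalizing, case folding, then normalizing again is
    close enough for illustration; this is not the Stringprep algorithm.
    """
    folded = unicodedata.normalize("NFKC", label).casefold()
    return unicodedata.normalize("NFKC", folded)


# Illustrative inputs only; the labels themselves are made up.
samples = {
    "micro sign": "\u00b5",                                   # µ
    "lj digraph": "\u01c9ubav",                               # single-character lj + "ubav"
    "fullwidth": "\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45",  # fullwidth "example"
    "uppercase": "EXAMPLE",
}

if __name__ == "__main__":
    for kind, text in samples.items():
        mapped = nfkc_casefold(text)
        print(f"{kind}: {text!r} -> {mapped!r}")
```

Under this sketch the micro sign maps to Greek small mu (U+03BC), the single-character lj becomes the two letters l and j, and the fullwidth letters become their ASCII counterparts, alongside the ordinary uppercase-to-lowercase change.
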
> >> ...
> > You make it sound like final sigma, ZWJ/NJ, eszett and the
> > other cases under discussion were oversights in the process of
> > developing the current IDNA. That wasn't the case; these were
> > deliberate choices made at the time. A case mapping is also a
> > 'loss of information', but one that people clearly want.
> Taking the last as an example, I think "a case mapping" was a
> deliberate choice, one that I supported at the time and, given
> the assumptions behind IDNA2003, would support again. I do
> not believe it is plausible to argue that a majority of the
> participants in the original IDNA WG, much less in the IETF,
> understood the implications of the differences between case
> folding and lower case mapping well enough to have exercised
> informed consent, much less to have made a "deliberate choice".
> Instead, they were informed by experts, yourself included, that
> toCaseFold was the correct explanation and went along with it
> despite some concerns about individual characters (which most of
> the participants did not understand either).
> Obviously one can have both "not an oversight" and "insufficient
> understanding to have informed consent", so we are not
> necessarily disagreeing.
It would, of course, be useful if we were all experts on all the topics
involved. Failing that, we do have to rely on others' information; for
example, on your knowledge of the DNS. That doesn't, of course, mean that
anyone gets a free pass...
> >> > The rest of the tests for U-Label remain unchanged.
> >> I believe that doing this by the type of change to Tables that
> >> you recommend either requires a change to the way that the
> >> definition of U-label is stated or requires us to abandon the
> >> very clear concept of a U-label that is completely symmetric,
> >> with no information loss in either direction, with an A-label.
> > I don't see why you would think that. A U-Label remains just
> > the way it is, and has a 1-1 relation with an A-Label. The
> > only difference is that we have an additional category of
> > M-Label; one that is not a U-Label but maps to one.
> At a minimum, the already-complicated pictures in Defs will
> require redrawing, which was not mentioned in your list. But,
> independent of that bit of work, I still wish we could avoid
> introducing yet another label category.
If an ASCII picture were the only thing standing between us and a successful