Mark Davis ⌛
mark at macchiato.com
Mon Jun 29 21:46:05 CEST 2009
On Mon, Jun 29, 2009 at 10:55, John C Klensin <klensin at jck.com> wrote:
> Several comments inline...
> --On Sunday, June 28, 2009 21:28 -0700 Mark Davis ⌛
> <mark at macchiato.com> wrote:
> > Returning to the discussion, now that some of my other
> > standards work is under control (RFC4646bis was approved,
> > whew!)
> > Now, my position is still that the simplest and most
> > compatible option open to us is to simply map with NFKC +
> > Casefold.
> I continue to believe that CaseFold is a showstopper. When its
> results are not identical to those produced by LowerCase, it
> produces results that are astonishing to some users and leads us
> into the "is that a separate character or not" trap that we've
> seen manifested at least twice. I note that TUS recommends
> against its use for mapping (as distinct from comparison) and
> appears to do so for just the reason that it involves too much
> information loss.
You need to provide actual data behind this. Please list exactly the
characters that you mean, and why you think they are problematic. Note also
that the formulation that I gave means that any character that is PVALID
would automatically be excluded, e.g., if final-sigma is PVALID then it is
unaffected. And we can certainly introduce other exceptions.
And I know full well about the issues in TUS, having written or participated
in the writing of them.
> > Proposal: A. Tables document
> > Add a new type of character: REMAP. A character is REMAP if it
> > meets *all of* the following criteria:
> > 1. The character is not PVALID or CONTEXTO
> > 2. If remapped by the Unicode property NFKC_Casefold*, then
> > the resulting character(s) are all PVALID or CONTEXTO
> > 3. The character is a LetterDigit or Pd
> > 4. The character has one of the following
> > Decomposition_Type values: initial, medial, final,
> > isolated, wide, narrow, or compat
> I am very concerned that collapsing initial, medial, and final
> together may get us into problems with other language
> communities similar to those we have gotten into with Final
> Sigma, especially when those communities denote word boundaries
> by the appearance of final or initial forms and hence would use
> those forms in a style similar to the way "BigCompany" or
> "big-company" might be used in ASCII.
The mechanism used to indicate boundaries is not, as you think, the use of
the presentation forms; it is the use of the ZWNJ/J, which we already
> As I've said several times before, even if we disallow the
> NFKC-affected forms of those characters, if a need arises, we can
> (painfully) redefine them as PVALID and allow them. But, if we
> map them to something else, we lose all information about what
> was intended/desired and end up in precisely the mess we have
> with e.g., Final Sigma and ZWJ/ZWNJ in which "the right thing
> to do" poses enough compatibility problems to cause opposition
> to making changes.
You make it sound like final sigma, ZWJ/NJ, eszett and the other cases
under discussion were oversights in the process of developing the current
IDNA. That wasn't the case; these were deliberate choices made at the time.
A case mapping is also a 'loss of information', but one that people clearly
If you have any particular characters that you think would be of concern,
you should raise them as issues.
> > 5. The character does not have the Script value: Hangul
> > The REMAP characters are removed from DISALLOWED, so that the
> > TABLES values form a partition (all the values are disjoint).
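
A rough sketch of how the five REMAP criteria above could be checked per code point. Everything here is illustrative: `is_valid` is a hypothetical predicate standing in for "PVALID or CONTEXTO", `nfkc_casefold` only approximates the NFKC_Casefold property (it does not drop default-ignorable code points), the LetterDigits categories are taken from the Tables draft as I read it, and the Hangul test uses the character name because Python's `unicodedata` does not expose the Script property.

```python
import unicodedata

# Decomposition_Type values from criterion 4, as they appear in the
# output of unicodedata.decomposition(), e.g. "<wide> 0041".
REMAP_DECOMP_TYPES = {"initial", "medial", "final", "isolated",
                      "wide", "narrow", "compat"}

# General_Category values assumed to make up LetterDigits, plus Pd.
LETTER_DIGIT_OR_PD = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc", "Pd"}


def nfkc_casefold(text):
    # Approximation of NFKC_Casefold: NFKC, case fold, NFKC again.
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


def decomposition_type(ch):
    # Returns the tag inside <...>, or None for canonical/no decomposition.
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp[1:decomp.index(">")]
    return None


def is_remap(ch, is_valid):
    """is_valid(c) is hypothetical: True if c is PVALID or CONTEXTO."""
    if is_valid(ch):                                        # criterion 1
        return False
    mapped = nfkc_casefold(ch)
    if not mapped or not all(is_valid(c) for c in mapped):  # criterion 2
        return False
    if unicodedata.category(ch) not in LETTER_DIGIT_OR_PD:  # criterion 3
        return False
    if decomposition_type(ch) not in REMAP_DECOMP_TYPES:    # criterion 4
        return False
    # Criterion 5: stand-in for Script=Hangul, based on the character name.
    if "HANGUL" in unicodedata.name(ch, ""):
        return False
    return True
```

With a toy `is_valid` that accepts only lowercase ASCII letters, digits and hyphen, FULLWIDTH LATIN CAPITAL LETTER A (U+FF21) comes out as REMAP, while plain "A" does not, because it has no Decomposition_Type and so fails criterion 4.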
> This strikes me as dangerous -- see below.
> > B. Protocols document
> > Change sections 4.2.1 and 5.3 so as to
> > require:
> > 1. Mapping all REMAP characters according to the Unicode
> > property NFKC_Casefold,
> > 2. Then normalizing the result according to NFC.
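
The two steps B1 and B2 could look roughly like this. It is a sketch under assumptions: `is_remap` is a hypothetical predicate for membership in the proposed REMAP category, and `nfkc_casefold` is an approximation of the NFKC_Casefold property built from `unicodedata`, not the property data itself.

```python
import unicodedata


def nfkc_casefold(text):
    # Approximation of NFKC_Casefold: NFKC, case fold, NFKC again.
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


def remap_label(label, is_remap):
    """is_remap(ch) is hypothetical: True if ch is in the REMAP category."""
    # B1: map only the REMAP characters; leave everything else untouched.
    mapped = "".join(nfkc_casefold(ch) if is_remap(ch) else ch
                     for ch in label)
    # B2: normalize the whole result to NFC.
    return unicodedata.normalize("NFC", mapped)
```

For example, if `is_remap` accepts only characters with a `<wide>` decomposition, a label typed in fullwidth Latin letters maps to its ordinary lowercase form, and a label containing no REMAP characters is changed only by the NFC step.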
> Making this change to 4.2.1 eliminates the requirement that the
> registrant understand _exactly_ what is being registered, i.e.,
> that the communication path between the registrant and registry
> occur only using U-labels and/or A-labels. My understanding was
> that we had reached one of the clearer points of consensus we had in
> these discussions that the "no mapping on registration"
> restriction was appropriate. Are you proposing to reopen that
Sorry, you are correct. This would only affect the lookup part.
> > The rest of the tests for U-Label remain unchanged.
> I believe that doing this by the type of change to Tables that
> you recommend either requires a change to the way that the
> definition of U-label is stated or requires us to abandon the
> very clear concept of a U-label that is completely symmetric,
> with no information loss in either direction, with an A-label.
I don't see why you would think that. A U-Label remains just the way it is,
and has a 1-1 relation with an A-Label. The only difference is that we have
an additional category of M-Label; one that is not a U-Label but maps to
> There is also a subtle interaction with Section 5.5: if the
> mapping is performed by the time Section 5.3 concludes (or,
> under special circumstances, not applied at all), then Section
> 5.5 must also prohibit REMAP.
You are correct; that was my intention, but I forgot to mention it. Yes,
there needs to be a change in 5.5.
o Labels containing prohibited code points, i.e., those that are
assigned to the "DISALLOWED" category in the permitted character
o Labels containing remapped code points, i.e., those that are
assigned to the "REMAP" category in the permitted character
> > C. Defs document
> > 1. Define REMAP
> > 2. Define an M-Label to be one which if remapped according
> > to B1+B2, results in a U-Label.
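
Read together with the description of an M-Label as something that is not itself a U-Label, the definition in C2 reduces to a two-part test. In this sketch both callbacks are hypothetical: `is_u_label` stands for the existing U-Label tests and `remap` for the B1+B2 mapping.

```python
def is_m_label(label, remap, is_u_label):
    """Both callbacks are hypothetical stand-ins.

    remap(label)      -> the label after steps B1 and B2
    is_u_label(label) -> True if the label passes the U-Label tests
    """
    # An M-Label is not a U-Label itself, but becomes one after remapping.
    return (not is_u_label(label)) and is_u_label(remap(label))
```

A label that is already a U-Label is never an M-Label under this reading, so the 1-1 relation between U-Labels and A-Labels is not touched by the check.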
> The idea of an M-Label still makes me uncomfortable. Again, we
> have had that discussion before.