CONTEXTO Proposal

Tue Jul 21 18:51:02 CEST 2009

--On Monday, 20 July, 2009 13:09 -0700 Mark Davis ⌛
<mark at macchiato.com> wrote:

> I believe that none of the current CONTEXTO characters are
> really required to be CONTEXTO, and all should be simply
> PVALID. I'd like to ask for a consensus call on this. There is
> a copy at
> http://www.macchiato.com/unicode/idna/exceptions/contexto-proposal
> in case emailers make this less readable.
>...

Mark,

You've been arguing for elimination of CONTEXTO, and elimination
of CONTEXTJ as a category (replacing it, if needed, by special
cases for ZWJ and ZWNJ only) since at least the pre-WG meeting
in January 2008.  For CONTEXTO, I see nothing new in this
proposal other than its format and, unfortunately, have very
limited time right now to try to rehash those old arguments.  I
am also concerned that many formal consensus calls at the level
of individual details will bog us down sufficiently, especially
if there are any long speeches at the microphone, to prevent any
real progress in Stockholm.  So I urge Vint to consider those
requests very carefully.

With regard to your specific suggestions for emptying the
category, I hope I can summarize the previous discussions and
design decisions rather than repeating them (I'm going to take
hyphen last).

First, I agree with Michel's observation about fonts and
typographic variations.  I also think it cannot be
overemphasized that people see what they expect to see: if a
user considers a particular character "normal" in particular
circumstances, a glyph in that position doesn't need to look
very much like that character to be mistaken for it.  For
example, in an orthography that doesn't have dots except at the
baseline, many users will mistake middle dots for a baseline dot
and some will even mistake high dots for the baseline one unless
they are somehow warned to look at them.  That hypothesis about
user perception behavior was confirmed by research in human
factors and perception going back to the late 50s.  I can try to
dig the citations out for you if that would be helpful, but
probably not today.

Second, "but is illegal anyway" is unconvincing.  If the subject
characters are PVALID, then they can appear in contexts in which
the "confusable" character would be banned, and can be used as a
workaround to simulate that character if someone wants to use it
that way.  Taking Geresh and my "O'"-prefixed Latin character
name as an example, we know that people want to write those
names.  Your note seems to assume that the issue here is
confusability with banned characters.  It is not.  It is,
instead, a variation on the "how do you type that in" question
that Shawn, Chris, and others have been raising.  If someone
uses Geresh in one of those strings and displays it (just as
they might display mixed-case characters, etc.), the user will
believe it is the usual accent character and have no way to type
the actual character in.  If they try to use the non-combining
accent (single quote), they will get parsing errors since the
name would otherwise be basic Latin.  If they can type the
Geresh (or figure out how to escape it in), it will work.  But,
in an area where names like that are common, and domaineers and
other name-marketers have gotten a foothold, it would be only a
matter of time before "...is illegal anyway" gets turned around
into "since single quote is DISALLOWED, why not map it into
Geresh so we can exercise our right to write our names".

I certainly would not encourage that, but it is the context in
which we operate.  
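
As an illustration only (a sketch, not text from any of the
drafts), the difference between treating Geresh as PVALID and
subjecting it to a contextual rule might look like the following.
It assumes the rule is "Geresh must immediately follow a Hebrew
character"; the function names are hypothetical, and the script
test uses Unicode character names as a rough stand-in for a real
Script property lookup.

```python
import unicodedata

GERESH = "\u05F3"  # HEBREW PUNCTUATION GERESH


def is_hebrew(ch):
    # Approximation: the standard library has no Script property,
    # so treat any character whose name starts with "HEBREW" as Hebrew.
    return unicodedata.name(ch, "").startswith("HEBREW")


def geresh_ok_in_context(label):
    """Assumed rule: every Geresh must directly follow a Hebrew character."""
    for i, ch in enumerate(label):
        if ch == GERESH:
            if i == 0 or not is_hebrew(label[i - 1]):
                return False
    return True


# A Latin label using Geresh to simulate an apostrophe (looks like o'brien).
latin_lookalike = "o" + GERESH + "brien"
# Geresh in its ordinary position after a Hebrew letter (tsadi).
hebrew_use = "\u05E6" + GERESH
```

Under this assumed rule the Hebrew use passes and the Latin
look-alike is rejected; with Geresh simply PVALID, both would be
accepted.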

The Arabic-Indic digit issue is complicated.  We've gotten
strong recommendations from a UN and Arab League-based working
group whose participants include experts on several of the
relevant writing systems and who have looked at the domain
name-related issues (not just writing of ordinary text).  There
is not complete agreement within that group on those
recommendations (as we have seen when the discussions have
spilled over onto this list).  There are those who, also on this
list, have expressed doubts about the legitimacy of the working
group itself.  And, as Alireza points out, there is an
additional complication due to the Jawa use of Indo-Arabic digit
2, so whatever the rule is and wherever it is put, it needs to
be stated very carefully.
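
For concreteness, one possible formulation of such a rule is
sketched below.  It is an assumption made for illustration, not
the working group's recommendation: it supposes the rule is "a
label may not mix Arabic-Indic digits (U+0660..U+0669) with
Extended Arabic-Indic digits (U+06F0..U+06F9)", and the function
name is hypothetical.  It makes no attempt to handle the
additional complication Alireza raised.

```python
ARABIC_INDIC = range(0x0660, 0x066A)           # U+0660..U+0669
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)  # U+06F0..U+06F9


def digits_not_mixed(label):
    """Assumed rule: a label must not contain digits from both ranges."""
    has_arabic_indic = any(ord(c) in ARABIC_INDIC for c in label)
    has_extended = any(ord(c) in EXTENDED_ARABIC_INDIC for c in label)
    return not (has_arabic_indic and has_extended)
```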

FWIW, much of that issue would not exist had Unicode not made
the decision to code the overlapping Arabic-Indic digits twice
while viewing Persian and Indic forms as just font/display
variations.   I think I understand the reasons for those
decisions and believe that other decisions would have caused
other problems, but it may be useful to remember that this isn't
a problem of IDNA's making independent of Unicode design
decisions.

I don't know how the IETF can evaluate the claims and
counterclaims in this area, but it does appear to me that what
we have is something that those who have commented believe they
can deal with.

As far as whether those rules should be in Bidi or handled as
contextual rules, you will probably recall that I originally
proposed handling them in Bidi, partially to avoid much more
general arguments about digit-mixing across scripts.  I still
believe that would have been a better decision, although only
slightly so.  But the WG concluded that it should be in the
Contextual rules and I've moved on and, in the interest of
getting things done and because it affects the definition but
not what is permitted or prohibited under the protocol, I'm
opposed to reopening the issue.

Finally, there is hyphen.  Handling it as a contextual rule is a
decision that Patrik and I made one afternoon because it seemed
elegant, obvious, and very clear.  Of course one could do it in
other ways, but it is trickier than you suggest because it is
not clear that it is a "requirement of the DNS system".  First
of all, as several people keep reminding us, the basic
requirement of the DNS system is "octets".  IDNA (both 2003 and
2008) imposes the LDH requirement on anything that appears in an
IDN context (even an all-ASCII string) but that is not a DNS
requirement; it is a requirement imposed by protocols that use
the DNS and chose to require (and often enforce) that rule.
Second, once one moves into the non-ASCII space, IDNA doesn't
inherit the "no leading or trailing hyphens" rule until the IDNA
specifications say it does (which both IDNA2003 and IDNA2008 do).

My recollection is that the WG decided not to make the
consecutive hyphen test on lookup, but I can't find it in my
notes so, unless someone else believes that text is unnecessary
and undesirable, I'll get it into Protocol-14.
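
The hyphen checks discussed above can be sketched roughly as
follows.  This is an illustration, not Protocol text: it assumes
that the "consecutive hyphen test" means rejecting "--" in the
third and fourth positions of a label, it follows the recollection
above that this test is skipped on lookup, and the function and
parameter names are hypothetical.

```python
def hyphen_ok(label, on_lookup=False):
    """Return True if the label passes the assumed hyphen restrictions."""
    # No leading or trailing hyphen, on registration and on lookup alike.
    if label.startswith("-") or label.endswith("-"):
        return False
    # Assumed consecutive-hyphen test: "--" in the third and fourth
    # positions, applied only when not performing a lookup.
    if not on_lookup and label[2:4] == "--":
        return False
    return True
```

With these assumptions, "ab--cd" is rejected at registration but
not on lookup, while "-abc" and "abc-" are rejected in both cases.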

regards,
   john

p.s. even if the CONTEXTO category were emptied, there would be
a good argument for leaving it in the spec because (i) we might
need it in the future and (ii) removing it might cause errors
due to side effects in other areas of the documents.  It is just
a little late.