CONTEXTO Proposal

Tue Jul 21 23:00:30 CEST 2009

I am not arguing for removing the ability to have a CONTEXTO or CONTEXTJ
category; it doesn't hurt anything to have them (other than added complexity
for the reader).

What I'm talking about is the *choice* of characters in CONTEXTO. There is
good reason to have constraints on hyphen and the ARABIC-INDIC cases,
although I think those would be more clearly, consistently, and effectively
handled in other sections (Protocol and Bidi).
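
A minimal sketch of what constraints of this kind amount to in code. The
rules coded here are assumptions made for illustration, not text taken from
the drafts: the hyphen constraint is taken to be "no leading or trailing
hyphen", and the digit constraint is taken to be "do not mix ARABIC-INDIC
digits (U+0660..U+0669) with EXTENDED ARABIC-INDIC digits (U+06F0..U+06F9)
in one label". The function names are hypothetical.

```python
def hyphen_ok(label: str) -> bool:
    """Assumed rule: a label must not start or end with HYPHEN-MINUS."""
    return not (label.startswith("-") or label.endswith("-"))


def digits_ok(label: str) -> bool:
    """Assumed rule: the two Arabic-Indic digit ranges are not mixed."""
    # U+0660..U+0669: ARABIC-INDIC DIGIT ZERO..NINE
    has_arabic_indic = any("\u0660" <= ch <= "\u0669" for ch in label)
    # U+06F0..U+06F9: EXTENDED ARABIC-INDIC DIGIT ZERO..NINE
    has_extended = any("\u06f0" <= ch <= "\u06f9" for ch in label)
    return not (has_arabic_indic and has_extended)
```

Neither check depends on the CONTEXTO machinery itself, so the same tests
could sit in whichever section ends up stating the rule.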

So we are only talking about 6 characters. The problem is that we have
really only turned our attention to these characters at the very last
moment.

   - The rules for all of the CONTEXT characters are not in particularly
   good shape; look at just the changes in this last week.
   - The 6 characters are a small fraction of the set of similar characters
   that are just plain PVALID. There has been no systematic effort to identify
   other characters that have the same issues as the 6 listed.
   - The particular constraints on these 6 characters have essentially zero
   value; anyone trying to prevent harmful instances of characters has to
   have fundamentally different, and more sophisticated, data.
   - Tables should at least have a documented reason for constraining each
   of these 6 characters; otherwise it's a mystery to users.
   - I can only see the constraints on these 6 characters causing problems:
   where we get the rules wrong (as with the Katakana middle dot, which has
   to have different constraints, or the middle dot, which is used in
   orthographies other than Catalan), people are denied perfectly reasonable
   URLs in their language until this RFC is revised.
   - The Bidi constraints -- which *are* valuable -- may end up being tarred
   with the same brush by association.
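
To make the concern about getting a rule wrong concrete, here is a minimal
sketch of a contextual rule for MIDDLE DOT (U+00B7). It assumes the rule is
"permitted only between two lowercase l characters", which covers the
Catalan case. The rule as coded and the function name are assumptions for
illustration, not a quotation of the tables document.

```python
MIDDLE_DOT = "\u00b7"


def middle_dot_ok(label: str) -> bool:
    """Assumed rule: every MIDDLE DOT must sit between two 'l' characters."""
    for i, ch in enumerate(label):
        if ch != MIDDLE_DOT:
            continue
        # Neighbours of the dot; empty string at either end of the label.
        before = label[i - 1] if i > 0 else ""
        after = label[i + 1] if i + 1 < len(label) else ""
        if before != "l" or after != "l":
            return False
    return True
```

Under this assumed rule "col·legi" passes, while any label that places the
dot between other letters is rejected, whatever orthography it comes from.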

Since the set of rules and constraints has changed, with only a few people
saying yay or nay, we should have a consensus call on the final result.

But in the end, you've worn me down -- I don't care. It is not required to
process these on lookup, and along the lines of what Chris said, I suspect
that no registries that are not contractually bound to follow this list
would bother checking them. So I suspect that the vast majority of
implementations will just ignore the CONTEXTO constraints -- as will any
software where I have a choice.

Mark

On Tue, Jul 21, 2009 at 09:51, John C Klensin <klensin at jck.com> wrote:

>
>
> --On Monday, 20 July, 2009 13:09 -0700 Mark Davis ⌛
> <mark at macchiato.com> wrote:
>
> > I believe that none of the current CONTEXTO characters are
> > really required to be CONTEXTO, and all should be simply
> > PVALID. I'd like to ask for a consensus call on this. There is
> > a copy at
> > http://www.macchiato.com/unicode/idna/exceptions/contexto-proposal
> > in case emailers make this less readable.
> >...
>
> Mark,
>
> You've been arguing for elimination of CONTEXTO, and elimination
> of CONTEXTJ as a category (replacing it, if needed, by special
> cases for ZWJ and ZWNJ only) since at least the pre-WG meeting
> in January 2008.  For CONTEXTO, I see nothing new in this
> proposal other than its format and, unfortunately, have very
> limited time right now to try to rehash those old arguments.  I
> am also concerned that many formal consensus calls at the level
> of individual details will bog us down sufficiently, especially
> if there are any long speeches at the microphone, to prevent any
> real progress in Stockholm.  So I urge Vint to consider those
> requests very carefully.
>
> With regard to your specific suggestions for emptying the
> category, I hope I can summarize the previous discussions and
> design decisions rather than repeating them (I'm going to take
> hyphen last).
>
> First, I agree with Michel's observation about fonts and
> typographic variations.  I also think it cannot be
> overemphasized that people see what they expect to see: if a
> user considers a particular character "normal" in particular
> circumstances, a glyph in that position doesn't need to look
> very much like that character to be mistaken for it.  For
> example, in an orthography that doesn't have dots except at the
> baseline, many users will mistake middle dots for a baseline dot
> and some will even mistake high dots for the baseline one unless
> they are somehow warned to look at them.  That hypothesis about
> user perception behavior was confirmed by research in human
> factors and perception going back to the late 50s.  I can try to
> dig the citations out for you if that would be helpful, but
> probably not today.
>
> Second, "but is illegal anyway" is unconvincing.  If the subject
> characters are PVALID, then they can appear in contexts in which
> the "confusable" character would be banned and used as a
> workaround to simulate that character if someone wants to use it
> that way.  Taking Geresh and my "O'"-prefixed Latin character
> name as an example, we know that people want to write those
> names.  Your note seems to believe that the issue here is
> confusability with banned characters.  It is not.   It is,
> instead, a variation on the "how do you type that in" question
> that Shawn, Chris, and others have been raising.  If someone
> uses Geresh in one of those strings and displays it (just as
> they might display mixed-case characters, etc.), the user will
> believe it is the usual accent character and have no way to type
> the actual character in.  If they try to use the non-combining
> accent (single quote), they will get parsing errors since the
> name would otherwise be basic Latin.  If they can type the
> Geresh (or figure out how to escape it in), it will work.  But,
> in an area where names like that are common, and domaineers and
> other name-marketers have gotten a foothold, it would be only a
> matter of time before "...is illegal anyway" gets turned around
> into "since single quote is DISALLOWED, why not map it into
> Geresh so we can exercise our right to write our names".
>
> I certainly would not encourage that, but it is the context in
> which we operate.
>
> The Arabic-Indic digit issue is complicated.  We've gotten
> strong recommendations from a UN and Arab League-based working
> group whose participants include experts on several of the
> relevant writing systems and who have looked at the domain
> name-related issues (not just writing of ordinary text).  There
> is not complete agreement within that group on those
> recommendations (as we have seen when the discussions have
> spilled over onto this list).  There are those who, also on this
> list, have expressed doubts about the legitimacy of the working
> group itself.  And, as Alireza points out, there is an
> additional complication due to the Jawa use of Indo-Arabic digit
> 2, so whatever the rule is and wherever it is put, it needs to
> be stated very carefully.
>
> FWIW, much of that issue would not exist had Unicode not made
> the decision to code the overlapping Arabic-Indic digits twice
> while viewing Persian and Indic forms as just font/display
> variations.   I think I understand the reasons for those
> decisions and believe that other decisions would have caused
> other problems, but it may be useful to remember that this isn't
> a problem of IDNA's making independent of Unicode design
> decisions.
>
> I don't know how the IETF can evaluate the claims and
> counterclaims in this area, but it does appear to me that what
> we have is something that those who have commented believe they
> can deal with.
>
> As far as whether those rules should be in Bidi or handled as
> contextual rules, you will probably recall that I originally
> proposed handling them in Bidi, partially to avoid much more
> general arguments about digit-mixing across scripts.  I still
> believe that would have been a better decision, although only
> slightly so.  But the WG concluded that it should be in the
> Contextual rules and I've moved on and, in the interest of
> getting things done and because it affects the definition but
> not what is permitted or prohibited under the protocol, I'm
> opposed to reopening the issue.
>
> Finally, there is hyphen.  Handling it as a contextual rule is a
> decision that Patrik and I made one afternoon because it seemed
> elegant, obvious, and very clear.  Of course one could do it in
> other ways, but it is trickier than you suggest because it is
> not clear that it is a "requirement of the DNS system".  First
> of all, as several people keep reminding us, the basic
> requirement of the DNS system is "octets".  IDNA (both 2003 and
> 2008) imposes the LDH requirement on anything that appears in an
> IDN context (even an all-ASCII string) but that is not a DNS
> requirement; it is a requirement imposed by protocols that use
> the DNS and chose to require (and often enforce) that rule.
> Second, once one moves into the non-ASCII space, IDNA doesn't
> inherit the "no leading or trailing hyphens" rule until the IDNA
> specifications say it does (which both IDNA2003 and IDNA2008 do).
>
> My recollection is that the WG decided to not make the
> consecutive hyphen test on lookup, but I can't find it in my
> notes so, unless someone else believes that text is unnecessary
> and undesirable, I'll get it into Protocol-14.
>
> regards,
>   john
>
> p.s. even if the CONTEXTO category were emptied, there would be
> a good argument for leaving it in the spec because (i) we might
> need it in the future and (ii) removing it might cause errors
> due to side effects in other areas of the documents.  It is just
> a little late.
>
>
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>