[Json] Json and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

Asmus Freytag asmusf at ix.netcom.com
Thu Jan 22 09:06:18 CET 2015

On 1/21/2015 7:37 PM, Andrew Sullivan wrote:
> Hi,
> On Wed, Jan 21, 2015 at 06:58:09PM -0800, Asmus Freytag wrote:
>> I would go further, and claim that the notion that "*all homographs
>> are the same abstract character*" is *misplaced, if not incorrect*.
> This does not seem false to me, and actually I'm not sure that it
> would be problematic for John either.


As I already mentioned in the reply to Nico, I overstated this to make a 
point.

>> U+08A1 is not the only character that has a non-decomposable
>> homograph, and because the encoding of it wasn't an accident, but
>> follows a principle applied by the Unicode Technical Committee, it
>> won't, and can't be the last instance of a non-decomposable
>> homograph.
> I also agree with this, but it appears that it may represent something
> problematic for IETF identifiers.  Moreover, "non-decomposable
> homograph" is not entirely useful here, because it's not merely the
> non-decomposability that is at issue.

If it were canonically decomposable, it would be normalizable, and the 
UTC is religious about not adding code points that can be normalized. 
Leaving out the decomposition is not a "cheap" way to get around the 
restriction on not adding normalizable characters. It reflects the 
considered opinion of the UTC that the two ways to encode the same shape 
(homograph) do not have the same identity (are, in other words, not the 
same abstract character).
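The U+08A1 case can be checked directly with Python's unicodedata module 
(a minimal sketch): ARABIC LETTER BEH WITH HAMZA ABOVE carries no 
canonical decomposition, so no normalization form ever unifies it with 
the visually identical sequence BEH + HAMZA ABOVE.

```python
import unicodedata

precomposed = "\u08A1"       # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"    # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

# U+08A1 has no canonical decomposition mapping: this returns "".
print(repr(unicodedata.decomposition(precomposed)))

# Consequently, normalization never maps one homograph to the other.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, precomposed) != \
           unicodedata.normalize(form, sequence)
```

The two strings therefore remain distinct identifiers under IDNA's 
NFC-based processing, even though they render identically.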

> There's also the fact that this
> particular case (and all the other cases I know of so far) are not
> susceptible to the other stability tests and are always in the same
> script.

Yes, homographs are not limited to the case of scripts evolving into one 
another, or to borrowings between them.

>   It may well be that this is merely revealing the extent to
> which I have missed important cases, I will cheerfully concede.

Probably not literally cheerful about having missed them. I missed a few, 
like your Tamil example, because when we looked at that, we considered 
the Root Zone, where digits are out.

>   That
> hardly suggests that our identifier system is robust, since if engaged
> people like me (who are nevertheless admitted amateurs) are missing
> chunks of problems for identifiers, we can hardly expect ordinary
> operators to get policies right.

There are limits to what you can bake into the protocol, because what is 
desirable in a global community may be violently at odds with what is 
natural, and therefore desirable, at a local level. The protocol has to 
cater to all levels.

I personally see the best chance in "leading by example" and doing it in 
a way that the specification of the example is complete enough to be 
useful. So, where to find examples?

The process of a preliminary investigation of the repertoire of letters 
of modern, widely used scripts for use in the Root Zone has already 
identified and documented a number of issues, and has resulted in a cut 
from some 97,000 PVALID code points to about 34,000 code points deemed a 
safe starting point for globally useful identifiers.

The next stage of this process will further reduce the number of code 
points actually recommended for the Root Zone, and will add further 
documentation of the issues by local and global experts.

Once this repertoire and its background documentation are completed, it 
will be possible to recommend the resulting set as a much safer starting 
point than IDNA2008 alone, while still remaining open to needs other 
than those of the Root Zone.

Part and parcel of this effort is the attempt at a more rigorous format 
for the specification of "IDN tables" (or Label Generation Rulesets, a 
better term): a format that captures not only the repertoire but also 
all context rules in machine-processable form, so that they can be 
specified explicitly, compared, and shared.
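A context rule of the kind such a format would capture can be sketched 
in a few lines. The rule below is a hypothetical illustration, not taken 
from any actual IDN table: it permits a combining HAMZA ABOVE (U+0654) 
only immediately after one of a listed set of base letters.

```python
import re

# Hypothetical context rule (illustrative only): U+0654 (ARABIC HAMZA
# ABOVE) is permitted only immediately after one of these base letters.
ALLOWED_BASES = "\u0627\u0628\u0648\u064A"  # ALEF, BEH, WAW, YEH

# A label violates the rule if U+0654 appears at the start of the label
# or after any code point outside the allowed set.
VIOLATION = re.compile(r"(?:^|[^" + ALLOWED_BASES + r"])\u0654")

def satisfies_context_rule(label: str) -> bool:
    """True if every U+0654 in the label follows an allowed base."""
    return VIOLATION.search(label) is None
```

Because the rule is expressed as data plus a mechanical test rather than 
prose, two registries can compare their rules, and a registry tool can 
apply them automatically.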

>> it appears not to be due to a "breakdown" of the encoding process
>> and also does not constitute a break of any encoding stability
>> promises by the Unicode Consortium.
> I will not speak for anyone else, but any worries I have are not
> directed at UTC.  My worry is much worse: that we're asking Unicode to
> provide something that nobody can, especially when the full generality
> of the goal of Unicode is taken into consideration.

That's something that I would take as given. What you call the "full 
generality" of a universal character code standard simply forces its 
creators and maintainers to add features that do not play well in some 
contexts, but are essential in others.

> Unicode has a really hard set of problems to solve.  I don't think
> anyone is intentionally suggesting, "Oh, those clowns at Unicode laid
> an egg."  (If they are, then I'll say I think that's setting the bar
> unreasonably high, and is quite unfair.)  But I do think that this new
> character highlights a bunch of issues that are super important for
> identifiers, especially when those identifiers are wandering around
> without locale clues.  I think for the sake of the Internet we must
> all worry about the implications of that.

OK, now that we've gotten that out of the way (and I had been wondering 
a bit), it's indeed time to think about the implications.

I've become convinced that a robust identifier system cannot be based on 
the existing encoding technology (or, in fact, on any sufficiently 
universal encoding technology) without the support of context rules and 
"blocked variants". The latter are a means to rigorously exclude the 
co-occurrence of certain code point sequences in otherwise identical 
labels, while preserving the freedom of the creator of the label (the 
applicant) to choose the sequence that best matches their needs.

Having a format that allows rigorous specification and machine 
evaluation of such blocked variants makes it possible to implement them 
consistently, so they can be filtered out mechanically ahead of any 
case-by-case analysis that might be required for mere "accidental" 
similarities or other, softer, forms of confusability.
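Such mechanical evaluation of blocked variants can be sketched briefly. 
The variant table below is hypothetical, not drawn from any published 
ruleset: each set lists code point sequences treated as interchangeable 
homographs; folding every member to one fixed representative lets two 
labels be compared mechanically, and equal folded forms means one label 
blocks the other.

```python
# Hypothetical variant sets (illustrative only): sequences within a set
# are treated as interchangeable homographs.
VARIANT_SETS = [
    {"\u08A1", "\u0628\u0654"},  # BEH WITH HAMZA ABOVE vs. BEH + HAMZA ABOVE
]

def canonicalize(label: str) -> str:
    """Fold every member of each variant set to one fixed representative."""
    for variant_set in VARIANT_SETS:
        representative = min(variant_set)  # any fixed choice works
        for member in variant_set:
            label = label.replace(member, representative)
    return label

def blocked(label_a: str, label_b: str) -> bool:
    """True if the two distinct labels may not co-exist in one zone."""
    return label_a != label_b and canonicalize(label_a) == canonicalize(label_b)
```

A registry can run this filter over every application against the 
existing zone contents before any human review begins; whichever variant 
spelling is registered first, the others are blocked automatically.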

Some of these tools are being created now; they were not available when 
IDNA2008 first came into existence. In updating IDNA2008 (or the 
recommendations on how to deploy it), these additions to the technical 
landscape should be considered.

