[Json] Json and U+08A1 and related cases

Sun Jan 25 03:57:50 CET 2015

On 1/24/2015 5:15 PM, Shawn Steele wrote:
>
> As long as we’re being very open about the identifiers, I think that 
> DNS may have been intended to be unique identifiers, but they have 
> evolved into human readable (for the most part) identifiers.  If they 
> were “just” unique, a bunch of #s would’ve sufficed.  Clearly now they 
> are not just unique identifiers, but also cater to linguistic behavior.
>

They are reasonably mnemonic, without being subject in all instances to 
the same rules as actual words or phrases.

> I think that the important part of the name resolution isn’t whether 
> or not certain characters are “allowed”, but rather that they resolve 
> to the same thing (eg: they’re identifiers).
>

There are at least two flavors of "allowed" here.

One is whether a code point is permitted by the protocol, or, perhaps 
permitted in certain contexts. The protocol addresses this in a black & 
white manner, globally.

The other is whether two labels may coexist that differ only by 
otherwise confusable (or homograph) code points/sequences.

Here, you have two basic options.

You can set up an exclusion mechanism. Once one of the labels has been 
registered, the other can no longer be registered. (In some contexts, 
these are called "blocked variants"). This mechanism works fine for a 
whole lot of scenarios. It doesn't a priori eliminate any of the 
variants, so if one language needs one, while another language needs the 
other, you can have users of both languages compete normally for the 
available name space, without allowing malicious or accidental spoofing. 
Such an exclusion mechanism, if mechanically applied (without 
case-by-case review and/or appeals), is a robust method to manage such 
contentions. It has the further advantage that it impacts only 
registration of labels, not their lookup.
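
As a minimal sketch of such a mechanically applied exclusion mechanism 
(the variant table, the Zone class and the labels are hypothetical, made 
up for illustration, and not any registry's actual policy): each label is 
reduced to a key by replacing every variant sequence with one 
representative, and a registration is refused when that key is taken.

```python
# Hypothetical variant table: (sequence, representative) pairs.
# A real zone would derive this from its own policy; this entry is
# only the pair under discussion in this thread.
VARIANT_TABLE = [
    ("\u0628\u0654", "\u08a1"),  # BEH + HAMZA ABOVE vs. BEH WITH HAMZA ABOVE
]


def variant_key(label):
    """Replace every variant sequence by its representative."""
    for sequence, representative in VARIANT_TABLE:
        label = label.replace(sequence, representative)
    return label


class Zone:
    """Toy registry that blocks variants of already registered labels."""

    def __init__(self):
        self._label_by_key = {}

    def register(self, label):
        key = variant_key(label)
        if key in self._label_by_key:
            return False  # the label itself, or a blocked variant, is taken
        self._label_by_key[key] = label
        return True
```

Whichever variant is registered first wins, and the other is blocked. 
Nothing in the sketch touches lookup: only register() consults the table.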

The other option is the one you describe:

> I don’t think that it’s important that DNS support all possible 
> combinations, but that where names are resolved that they are 
> consistent.  Currently 5 names can resolve to the same IP, and I don’t 
> see a problem with that.  So I think that it should be totally 
> possible for the “confusable” characters to merely resolve to the same 
> thing.  Eg: be bundled.  Sure, then people can’t register some names 
> that use similar letters (or variations), but then it isn’t 
> confusing.  Also you have a round-tripping problem because if 5 names 
> resolve to the same thing, which do you display?
>

This kind of bundling is called "allocatable variants" in some contexts. 
They can be appropriate where there is a reasonable expectation that 
some users would use one, and other users would use one of the other 
variants in a bundle to access the same IP, either because users 
normally don't make the distinction reliably enough, or because, 
depending on system configuration etc., they may not be able to input 
one of the variants. There are examples in Arabic and Chinese 
where this kind of thing is done today, and for good reason.

However, the downside of this approach is that you can quickly get a 
very large number of variant labels (especially if the label is long) 
because variant code points could appear in many positions (and even the 
set of variant code points at a given position could be larger than just 
2 or 3).
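
To put rough numbers on this, here is a small sketch. The variant sets 
are hypothetical, picked only to show the arithmetic: the number of 
variant labels is the product of the variant-set sizes over all 
positions in the label.

```python
from itertools import product
from math import prod

# Hypothetical variant sets, for illustration only. Code points not
# listed here are treated as having no variants.
VARIANT_SETS = {
    "a": ["a", "\u0430"],            # Latin a, Cyrillic a
    "e": ["e", "\u0435"],            # Latin e, Cyrillic ie
    "o": ["o", "\u043e", "\u03bf"],  # Latin o, Cyrillic o, Greek omicron
}


def choices(label):
    """Per-position lists of the code points that may appear there."""
    return [VARIANT_SETS.get(ch, [ch]) for ch in label]


def variant_count(label):
    """Number of variant labels, including the original label."""
    return prod(len(options) for options in choices(label))


def variant_labels(label):
    """Enumerate all variant labels (only practical for short labels)."""
    return ["".join(combo) for combo in product(*choices(label))]
```

With these sets, a label of ten positions that each have two variants 
already yields 2**10 = 1024 labels, all of which would need to be 
allocated for the bundle to be complete.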

When you work this out for the FQDN, the number of names for the same IP 
could be interestingly large. Also, since there's no way to enforce 
this, you may not actually end at the same IP. But at least, as long as 
the bundle goes to the same registrant, it would present a block to 
malicious spoofing by a third party.

In the case we are discussing here (the one that led the IETF to delay the 
IDNA tables for Unicode 7.0), I see no case for doing something like a 
bundle. There simply isn't the expectation that some users would 
regularly use the code point sequence to input the label. In fact, 
normally, if you did anything on the protocol level it would be a 
context rule to disallow the sequence altogether (it's not really 
needed). However, it was there first, and all that, so on the protocol 
level you can't do anything, or nothing that wouldn't make the situation 
worse.
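
For reference, the pair in question can be checked with a short sketch 
using Python's standard unicodedata module. The constant and function 
names are my own. The sequence is U+0628 (BEH) followed by U+0654 
(HAMZA ABOVE). Since U+08A1 has no decomposition, normalization leaves 
the two spellings distinct.

```python
import unicodedata

SINGLE = "\u08a1"          # ARABIC LETTER BEH WITH HAMZA ABOVE
SEQUENCE = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def has_decomposition(ch):
    """True if the character database lists a decomposition mapping."""
    return unicodedata.decomposition(ch) != ""


def same_after_normalization(a, b, form="NFC"):
    """True if both strings normalize to the same code point sequence."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```

Here has_decomposition(SINGLE) is false, and 
same_after_normalization(SINGLE, SEQUENCE) is false for NFC as well as 
NFKC, so the two can only be tied together by a variant rule at 
registration time.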

The next best thing is to recommend that zone operators implement the kind 
of exclusion mechanism represented by 'blocked variants'.

A./

> -Shawn
>
> *From:*Idna-update [mailto:idna-update-bounces at alvestrand.no] *On 
> Behalf Of *Vint Cerf
> *Sent:* Saturday, January 24, 2015 6:45 AM
> *To:* Martin J. Dürst
> *Cc:* John C Klensin; Asmus Freytag; idna-update at alvestrand.no; The IESG
> *Subject:* Re: [Json] Json and U+08A1 and related cases
>
> I have been following this discussion with some interest and have come 
> away with a thought that some of you may wish to refine or perhaps 
> debate. Basically, I see the UNICODE effort as only partly aligned to 
> the needs of the Internet's Domain Name System and the effort to use 
> the UNICODE character parameters/descriptors/properties does not 
> always line up with the desirable properties of the use of characters 
> in the DNS. It seems to me useful to recall that domain names are 
> identifiers that are not expected or even intended to follow purely 
> linguistic constraints. They are used to create what are intended to 
> be unique identifiers. Characters that have a high probability of 
> looking the same but are encoded differently work against that goal. 
> Of course I am fully aware of the confusability of the lower case 
> letter "L" and the digit "ONE" (and "OH" and "ZERO") that is sometimes 
> used as an example of the inconsistent toleration of confusion in the 
> ASCII labels but I consider this to be an argument of the form "you 
> allowed a case of confusion therefore you should tolerate all confusion".
>
> I do wonder whether it is worth considering an attempt to create a new 
> set of properties of UNICODED characters that are of specific use to 
> the DNS. The IDNA 2008 work tried to use properties of characters 
> developed for purposes other than the DNS and the fit is not always 
> perfect.
>
> vint
>
> On Fri, Jan 23, 2015 at 4:14 AM, "Martin J. Dürst" 
> <duerst at it.aoyama.ac.jp <mailto:duerst at it.aoyama.ac.jp>> wrote:
>
>     Hello Asmus,
>
>     On 2015/01/22 11:58, Asmus Freytag wrote:
>
>         I would go further, and claim that the notion that "*all
>         homographs are the same abstract character*" is *misplaced,
>         if not incorrect*.
>
>
>     That's fine. Nobody would claim that 8 (U+0038) and ৪ (Bengali 4,
>     U+09EA) are the same abstract character. (How 'homographic' they
>     look will depend on what fonts your mail user agent uses :-)
>
>         U+08A1 is not the only character that has a non-decomposable
>         homograph, and because the encoding of it wasn't an accident,
>         but follows a principle applied by the Unicode Technical
>         Committee, it won't, and can't be the last instance of a
>         non-decomposable homograph.
>
>         The "failure of U+08A1 to have a (non-identity)
>         decomposition", while it perhaps complicates the design of a
>         system of robust mnemonic identifiers (such as IDNs), appears
>         not to be due to a "breakdown" of the encoding process and also
>         does not constitute a break of any encoding stability promises
>         by the Unicode Consortium.
>
>         Rather, it represents reasoned, and principled judgment of
>         what is or isn't the "same abstract character". That judgment
>         has to be made somewhere in the process, and the bodies
>         responsible for character encoding get to make the
>         determination.
>
>
>     While I can agree with this characterization, many judgements on
>     character encoding are by their very nature borderline, and U+08A1
>     definitely in many aspects is borderline. What I hope is that the
>     Unicode Technical Committee, when making future, similar
>     decisions, hopefully puts the borderline a bit more in support of
>     applications such as identifiers, and a bit less in favor of
>     splitting. Also, that it realize that when principles lead to more
>     and more homograph encodings, it may very well pay off to
>     reexamine some of these principles before going down a slippery slope.
>
>     Regards,   Martin.
>
>
>     _______________________________________________
>     Idna-update mailing list
>     Idna-update at alvestrand.no <mailto:Idna-update at alvestrand.no>
>     http://www.alvestrand.no/mailman/listinfo/idna-update
