<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 1/24/2015 5:15 PM, Shawn Steele
wrote:<br>
</div>
<blockquote
cite="mid:CY1PR0301MB0731B01A94DD3DE4BB0865B682340@CY1PR0301MB0731.namprd03.prod.outlook.com"
type="cite">
<div class="WordSection1">
<p class="MsoNormal"><span
style="font-size:11.0pt;font-family:'Calibri',sans-serif;color:#1F497D">As
long as we’re being very open about the identifiers, I think
that DNS may have been intended to be unique identifiers,
but they have evolved into human readable (for the most
part) identifiers. If they were “just” unique, a bunch of
#s would’ve sufficed. Clearly now they are not just unique
identifiers, but also cater to linguistic behavior.</span></p>
</div>
</blockquote>
<br>
They are reasonably mnemonic, without being subject in all instances
to the same rules as actual words or phrases.<br>
<br>
<blockquote
cite="mid:CY1PR0301MB0731B01A94DD3DE4BB0865B682340@CY1PR0301MB0731.namprd03.prod.outlook.com"
type="cite">
<div class="WordSection1">
<p class="MsoNormal"><span
style="font-size:11.0pt;font-family:'Calibri',sans-serif;color:#1F497D">I
think that the important part of the name resolution isn’t
whether or not certain characters are “allowed”, but rather
that they resolve to the same thing (eg: they’re
identifiers). <br>
</span></p>
</div>
</blockquote>
<br>
There are at least two flavors of "allowed" here.<br>
<br>
One is whether a code point is permitted by the protocol at all, or
perhaps permitted only in certain contexts. The protocol addresses this
in a black &amp; white manner, globally.<br>
<br>
The other is whether two labels may coexist that differ only by two
confusable (or homograph) code points or sequences.<br>
<br>
Here, you have two basic options.<br>
<br>
You can set up an exclusion mechanism. Once one of the labels has
been registered, the other can no longer be registered. (In some
contexts, these are called "blocked variants"). This mechanism works
fine for a whole lot of scenarios. It doesn't a priori eliminate any
of the variants, so if one language needs one, while another
language needs the other, you can have users of both languages
compete normally for the available name space, without allowing
malicious or accidental spoofing. Such an exclusion mechanism, if
mechanically applied (without case-by-case review and/or appeals),
is a robust method to manage such contentions. It has the further
advantage that it impacts only registration of labels, not their
lookup.<br>
<br>
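As a minimal sketch of such an exclusion mechanism (the variant table, the
Zone class and its method names are assumptions made up for illustration,
not any registry's actual rules), registration could be checked against a
collapsed form of the label, while lookup stays an exact match:<br>
<pre>
```python
# Toy variant table: each code point maps to one representative of its
# confusable set. Hypothetical data; a real zone would define its own.
REPRESENTATIVE = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A, looks like LATIN a
    "\u043e": "o",  # CYRILLIC SMALL LETTER O, looks like LATIN o
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE, looks like LATIN e
}


def variant_key(label):
    """Collapse a label so that all of its variants share one key."""
    return "".join(REPRESENTATIVE.get(ch, ch) for ch in label)


class Zone:
    def __init__(self):
        self.registered = set()    # labels that actually resolve
        self._taken_keys = set()   # variant keys that are already occupied

    def register(self, label):
        """Register label unless it, or one of its variants, is taken."""
        key = variant_key(label)
        if key in self._taken_keys:
            return False           # blocked variant: refuse mechanically
        self._taken_keys.add(key)
        self.registered.add(label)
        return True

    def resolves(self, label):
        """Lookup is exact; the variant table is never consulted here."""
        return label in self.registered


zone = Zone()
print(zone.register("example"))        # True
print(zone.register("ex\u0430mple"))   # False
```
</pre>
Whichever label of a variant set is registered first wins, regardless of
script; the others are refused at registration time, and lookup is not
affected at all.<br>
<br>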
The other option is the one you describe:<br>
<br>
<blockquote
cite="mid:CY1PR0301MB0731B01A94DD3DE4BB0865B682340@CY1PR0301MB0731.namprd03.prod.outlook.com"
type="cite">
<div class="WordSection1">
<p class="MsoNormal"><span
style="font-size:11.0pt;font-family:'Calibri',sans-serif;color:#1F497D">
I don’t think that it’s important that DNS support all
possible combinations, but rather that, where names are
resolved, they are consistent. Currently 5 names can resolve to
the same IP, and I don’t see a problem with that. So I
think that it should be totally possible for the
“confusable” characters to merely resolve to the same
thing. Eg: be bundled. Sure, then people can’t register
some names that use similar letters (or variations), but
then it isn’t confusing. Also you have a round-tripping
problem because if 5 names resolve to the same thing, which
do you display? </span></p>
</div>
</blockquote>
<br>
This kind of bundling is called "allocatable variants" in some
contexts. It can be appropriate where there is a reasonable
expectation that some users would use one variant, and other users
another variant in the bundle, to access the same IP, either
because users normally don't make the distinction reliably enough,
or because, depending on system configuration etc., they may
not be able to input one of the variants. There are examples in
Arabic and Chinese where this kind of thing is done today, and for
good reason.<br>
<br>
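A rough sketch of what allocating such a bundle could look like follows.
The single variant pair, the dict standing in for a zone, and the function
names are assumptions for illustration only; real variant tables are far
larger and defined per zone:<br>
<pre>
```python
from itertools import product

# Hypothetical variant data: one simplified/traditional Chinese pair.
VARIANT_SETS = [
    ("\u56fd", "\u570b"),   # simplified and traditional form of "country"
]
ALTERNATIVES = {ch: group for group in VARIANT_SETS for ch in group}


def bundle(label):
    """Return every variant of label, the label itself included."""
    choices = [ALTERNATIVES.get(ch, (ch,)) for ch in label]
    return {"".join(combo) for combo in product(*choices)}


def allocate(zone, label, target):
    """Point the whole bundle at one target; refuse if any name is taken."""
    names = bundle(label)
    if any(name in zone for name in names):
        return False
    for name in names:
        zone[name] = target
    return True


zone = {}                                   # name -> target, a stand-in
allocate(zone, "\u4e2d\u56fd", "192.0.2.1")
print(zone["\u4e2d\u570b"])                 # 192.0.2.1
```
</pre>
Both spellings end up with the same registrant and the same target, so a
user who can only type one of them still arrives at the same place.<br>
<br>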
However, the downside of this approach is that you can quickly get a
very large number of variant labels (especially if the label is
long) because variant code points could appear in many positions
(and even the set of variant code points at a given position could
be larger than just 2 or 3).<br>
<br>
When you work this out for the FQDN, the number of names for the
same IP could be interestingly large. Also, since there's no way to
enforce this, you may not actually end up at the same IP. But at least,
as long as the bundle goes to the same registrant, it would present
a block to malicious spoofing by a third party.<br>
<br>
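To get a feel for the numbers, here is a small back-of-the-envelope
calculation. The set sizes are made up and the function names are mine;
the only point is that the counts multiply, first across positions in a
label and then across the labels of the FQDN:<br>
<pre>
```python
from math import prod


def label_variants(set_sizes):
    """Variant labels for one label: product of per-position set sizes.

    set_sizes[i] is the number of interchangeable code points at
    position i (1 where the code point has no variants).
    """
    return prod(set_sizes)


def fqdn_variants(labels):
    """Variant names for a whole FQDN: product over its labels."""
    return prod(label_variants(sizes) for sizes in labels)


# Made-up example: an 8-character label where every position has 2
# variants, under a 4-character label where two positions have 3 each.
second_level = [2] * 8       # 2**8 = 256 variant labels
top_level = [3, 1, 3, 1]     # 3 * 3 = 9 variant labels
print(label_variants(second_level))               # 256
print(fqdn_variants([second_level, top_level]))   # 2304
```
</pre>
<br>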
In the case we are discussing here (the one that led the IETF to delay
the IDNA tables for Unicode 7.0), I see no case for doing something
like a bundle. There simply isn't the expectation that some users
would regularly use the code point sequence to input the label. In
fact, normally, if you did anything on the protocol level it would
be a context rule to disallow the sequence altogether (it's not
really needed). However, it was there first, and all that, so on the
protocol level you can't do anything, or nothing that wouldn't make
the situation worse.<br>
<br>
Next best thing is to recommend that zone operators implement the
kind of exclusion mechanism represented by 'blocked variants'.<br>
<br>
A./<br>
<br>
<blockquote
cite="mid:CY1PR0301MB0731B01A94DD3DE4BB0865B682340@CY1PR0301MB0731.namprd03.prod.outlook.com"
type="cite">
<div class="WordSection1">
<p class="MsoNormal"><span
style="font-size:11.0pt;font-family:'Calibri',sans-serif;color:#1F497D">-Shawn<o:p></o:p></span></p>
<p class="MsoNormal"><b><span
style="font-size:11.0pt;font-family:'Calibri',sans-serif">From:</span></b><span
style="font-size:11.0pt;font-family:'Calibri',sans-serif">
Idna-update [<a class="moz-txt-link-freetext" href="mailto:idna-update-bounces@alvestrand.no">mailto:idna-update-bounces@alvestrand.no</a>]
<b>On Behalf Of </b>Vint Cerf<br>
<b>Sent:</b> Saturday, January 24, 2015 6:45 AM<br>
<b>To:</b> Martin J. Dürst<br>
<b>Cc:</b> John C Klensin; Asmus Freytag;
<a class="moz-txt-link-abbreviated" href="mailto:idna-update@alvestrand.no">idna-update@alvestrand.no</a>; The IESG<br>
<b>Subject:</b> Re: [Json] Json and U+08A1 and related cases<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal">I have been following this discussion
with some interest and have come away with a thought that
some of you may wish to refine or perhaps debate. Basically,
I see the UNICODE effort as only partly aligned to the needs
of the Internet's Domain Name System and the effort to use
the UNICODE character parameters/descriptors/properties does
not always line up with the desirable properties of the use
of characters in the DNS. It seems to me useful to recall
that domain names are identifiers that are not expected or
even intended to follow purely linguistic constraints. They
are used to create what are intended to be unique
identifiers. Characters that have a high probability of
looking the same but are encoded differently work against
that goal. Of course I am fully aware of the confusability
of the lower case letter "L" and the digit "ONE" (and "OH"
and "ZERO") that is sometimes used as an example of the
inconsistent toleration of confusion in the ASCII labels but
I consider this to be an argument of the form "you allowed a
case of confusion therefore you should tolerate all
confusion". <o:p></o:p></p>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">I do wonder whether it is worth
considering an attempt to create a new set of properties
of UNICODED characters that are of specific use to the
DNS. The IDNA 2008 work tried to use properties of
characters developed for purposes other than the DNS and
the fit is not always perfect. <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">vint<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal">On Fri, Jan 23, 2015 at 4:14 AM,
"Martin J. Dürst" &lt;<a moz-do-not-send="true"
href="mailto:duerst@it.aoyama.ac.jp" target="_blank">duerst@it.aoyama.ac.jp</a>&gt;
wrote:<o:p></o:p></p>
<blockquote style="border:none;border-left:solid #CCCCCC
1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-right:0in">
<p class="MsoNormal" style="margin-bottom:12.0pt">Hello
Asmus,<br>
<br>
On 2015/01/22 11:58, Asmus Freytag wrote:<o:p></o:p></p>
<blockquote style="border:none;border-left:solid #CCCCCC
1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-right:0in">
<p class="MsoNormal">I would go further, and claim that
the notion that "*all homographs are the same abstract
character*" is *misplaced, if not incorrect*.<o:p></o:p></p>
</blockquote>
<p class="MsoNormal" style="margin-bottom:12.0pt"><br>
That's fine. Nobody would claim that 8 (U+0038) and <span
style="font-family:'Shonar Bangla',sans-serif">
৪</span> (Bengali 4, U+09EA) are the same abstract
character. (How 'homographic' they look will depend on
what fonts your mail user agent uses :-)<br>
<br>
<o:p></o:p></p>
<blockquote style="border:none;border-left:solid #CCCCCC
1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-right:0in">
<p class="MsoNormal">U+08A1 is not the only character
that has a non-decomposable homograph, and because the
encoding of it wasn't an accident, but follows a
principle applied by the Unicode Technical Committee, it
won't, and can't, be the last instance of a
non-decomposable homograph.<br>
<br>
The "failure of U+08A1 to have a (non-identity)
decomposition", while it perhaps complicates the design
of a system of robust mnemonic identifiers (such as
IDNs), appears not to be due to a "breakdown" of the
encoding process and also does not constitute a break of
any encoding stability promises by the Unicode
Consortium.<br>
<br>
Rather, it represents reasoned and principled judgment
of what is or isn't the "same abstract character". That
judgment has to be made somewhere in the process, and
the bodies responsible for character encoding get to
make the determination.<o:p></o:p></p>
</blockquote>
<p class="MsoNormal"><br>
While I can agree with this characterization, many
judgements on character encoding are by their very
nature borderline, and U+08A1 is definitely borderline in
many respects. What I hope is that the Unicode Technical
Committee, when making similar decisions in the future,
puts the borderline a bit more in support of
applications such as identifiers, and a bit less in
favor of splitting. Also, that it realize that when
principles lead to more and more homograph encodings, it
may very well pay off to reexamine some of these
principles before going down a slippery slope.<br>
<br>
Regards, Martin.<o:p></o:p></p>
<div>
<div>
<p class="MsoNormal"><br>
_______________________________________________<br>
Idna-update mailing list<br>
<a moz-do-not-send="true"
href="mailto:Idna-update@alvestrand.no"
target="_blank">Idna-update@alvestrand.no</a><br>
<a moz-do-not-send="true"
href="http://www.alvestrand.no/mailman/listinfo/idna-update"
target="_blank">http://www.alvestrand.no/mailman/listinfo/idna-update</a><o:p></o:p></p>
</div>
</div>
</blockquote>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</div>
<br>
</blockquote>
<br>
</body>
</html>