IDNA and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

Asmus Freytag asmusf at ix.netcom.com
Tue Jan 27 01:51:20 CET 2015


John,

thank you for this well-considered reply.

I see no better way to add to what I wrote earlier than by inserting a 
few comments and additional thoughts inline. Sorry if that makes the 
reply somewhat lengthy.

On 1/26/2015 2:05 PM, John C Klensin wrote:
> Asmus,
>
> Some reflections on your note with the understanding that I
> largely agree with Pete's concerns and will try to avoid
> repeating them (more on that below).  I do wish we could either
> give a precise and consistent definition to "homograph" (or
> "homoglyph" and get agreement on using them) or stop using the
> terms entirely because they have, IMO, introduced a great deal
> of confusion into our discussions.  FWIW, I like your
> definitions, I just despair of getting very many people to use
> them consistently.

I will continue to use them consistently.    :)

> Nick, most of my response to your observation about "handle as
> confusables" appears toward the end of this note.  Since I'm
> complaining about terminology, I've also discovered that phrase
> is usually followed by a great deal of handwaving about what
> that sort of "handling" means in practice.  Again, please see
> below but try to explain _exactly_ what you mean/intend by such
> handling for both the entire DNS hierarchy and the various
> identifiers that might be covered by PRECIS or other systems of
> protocols.
>
> More (lots more, unfortunately) below...
>
>
> --On Sunday, January 25, 2015 22:30 -0800 Asmus Freytag
> <asmusf at ix.netcom.com> wrote:
>   
>>   From the perspective of a robust identifier system, I would
>> want an underlying encoding that is constructed so that it
>> serves as a catalog of unique, non-overlapping shapes (that is
>> shapes that are positively distinguishable). With that, I
>> could ensure that a unique sequence of code values results in
>> a unique, distinguishable graphical form (rendered label) that
>> users can select with confidence.
> As you certainly know, perhaps unless one could keep all
> identifiers as unambiguously non-words, one would also want
> something that is a good match to people's perceptions of how
> the sounds they make are represented in printed form and vice
> versa.    Those criteria conflict -- no perfect solution is
> possible unless, perhaps, one allows only numeric identifiers
> and restricts them to a single set of digits.

Thanks for stating that. I had that in mind when drafting my original
passage, but I skipped it, so that I could get to my other points.

Not everyone has a strong visual memory. Remembering that is useful.

>   
>
> So, while I believe your comment above is correct, I also
> believe it is somewhat of a strawman in the real world with
> realistic systems, at best, considering those goals as important
> decision-making criteria.   In the real world, one ends up
> making tradeoff decisions.  One can prefer that one's own
> criteria are considered most important but, if the encoding
> system is going to serve multiple objectives, probably the most
> it is plausible to expect is consistency of application of
> whatever conventions and priorities are developed.

That implication was the point of my strawman exercise.

>
> FWIW, I note that Harald Alvestrand and I explored some related
> ideas in RFC 5242, largely to demonstrate that they were
> unworkable.   Whatever one might say about the reasons why that
> effort failed, I'd suggest that the original design for what
> became ISO 10646 came a bit closer to the objectives above
> simply by being intended as a single unified character set with
> identical characters (even from multiple scripts) assigned a
> single code point and not combining characters.

In fact, the opposite - that design had four parallel code tables for 
the four main communities of users of the Han script. Can you imagine 
not a few dozen homographs, but tens of thousands?

That was, and I can say so with the confidence of someone who was 
involved, one of the primary reasons that led to its defeat.

Combining characters exist unquestionably in dozens of scripts where 
they are clearly letters (even if graphically dependent on a base 
shape). In other scripts, including Latin, they are so clearly 
productive that to enumerate all combinations seemed hopeless. I 
remember one draft proposal listing 2,000 of them (and it was considered 
a "partial" set).

Now, most of these combinations occur in scholarly use, and the legacy 
encodings made no attempt to deal with that. The combinations needed for 
the principal orthographies (forgetting Africa for a bit) are a manageable 
number. But you cannot create a universal character encoding and leave out 
scholarly use... anyway, the result is that there's a need to accommodate 
productive uses while at the same time dealing with legacy -- your other 
remarks show that you know what is involved there -- so this is just 
further delineation of detail, not disagreement in principle.

> The Unicode
> design of more or less concatenating script-related blocks
> (originally based on concatenating existing language or
> script-specific national and ISO standards, see below) and of
> allowing combining sequences is very different and probably much
> more practical (I assume you would drop "probably" from that
> statement).

It was hotly debated, and many people who normally wouldn't dream of 
being active in character encoding actually did educate themselves on 
the difference, and selected Unicode's approach (as it was then) by a 
wide margin. Even Unicode had to morph a bit over time -- but that's a 
different story with which you are all familiar.

>
>> It's a mistake to assume that this describes in any way the
>> primary mission of the Unicode Standard.
>>
>> Instead, Unicode is concerned with allowing authors to create
>> strings for code values that will be (more or less
>> predictably) rendered so that human readers can discern an
>> intended textual meaning. "If I want to write "foo" in
>> language X, which codes should (and shouldn't) I use?" is the
>> question that goes to the heart of this problem.
> I think that is an excellent summary and I appreciate your
> providing it.  It also means that there is an inherent conflict
> between identifier applications and that primary mission.

A key point.


> That
> conflict is likely to be less when some language context can be
> inferred or assumed than when it cannot, but not completely
> absent in either case.

Writing, as a human activity, is full of ambiguities - as are the 
languages that are written. Humans are amazingly good at dealing with 
these, but the ambiguities open a window for other very human 
activities, like spoofs. (And they are by no means all visual.)

>   Unless one resolves that conflict by
> moving away from Unicode for identifiers or adopting truly
> draconian restrictions such as the "all numeric" one above, it
> also suggests that, whatever we do, there are going to be some
> rough edges in identifier use of Unicode.  Vint's observation
> about "argument[s] of the form 'you allowed a case of confusion
> therefore you should tolerate all confusion'" and a similar
> observation about "we did that before, therefore it is ok to do
> it again" are, IMO, both applicable here.

Yes to the rough edges - they are ultimately not solely due to any 
technical decisions Unicode has or hasn't made, but due to the way 
mnemonics are grounded in language, and how language is represented in 
writing.

>
> I believe the Unicode Consortium tried to develop one set of
> rules that would work moderately well for general cases with UAX
> #31.  The IETF studied that specification when IDNA was first
> being developed and concluded that it wasn't a sufficient match
> for DNS needs.  While I believe that both UAX #31 and IDNA have
> influenced each other over the years, they haven't converged
> sufficiently to be interoperable.

I've been recently involved in researching a repertoire for the DNS Root 
Zone. This process uses some of the data presented in UAX#31 to be more 
restrictive (and therefore more robust) than raw IDNA 2008. That said, I 
do appreciate your point here.

>    I believe that at least some
> of the differences are real differences in design criteria, not
> just matters of taste.  As with the more general identifier
> versus text issue, that is probably because the primary (and
> priority) use cases are different and not merely because of
> differences in taste.  In any event, I think it was inevitable
> that neither is completely adequate to cover all identifier
> cases.

Natural-language-friendly systems of identifiers are not all subject to 
spoofing and other forms of intentional abuse or unintentional confusion. 
Some systems have namespaces that are populated by a single user (or by 
cooperating users). For both, take programming languages as an example. 
For such systems, many of the restrictions, rules and policies germane 
to the DNS just do not apply.

>
>> It's not quite as simple as that, because there's also the
>> need to make certain automated processes (from spell-checker
>> to sorting) come up with the correct interpretation of the
>> text - the latter, for example, is behind the need to
>> separately encode a Greek omicron from a Latin o.
> Because I think understanding how we got to where we are today is
> helpful to informing future decisions, I suggest that, while
> that example is helpful, it is important to note that the
> _reason_ those two code points are encoded separately is a
> direct consequence of the way early versions of Unicode were
> created, i.e., that the original Greek letter block of Unicode
> was identical in repertoire and order to the Greek letters of
> ISO/IEC 8859-7 and ELOT 928.  I recall (perhaps
> incorrectly) that part of the rationale for doing things that
> way in the Unicode of the late 1980s was to preserve ISO script
> relationships and orderings (that is more or less confirmed in
> the chronology at
> http://www.unicode.org/history/versionone.html).  It was not
> some abstract reasoning about, e.g., sorting or spell correction.

The 8859 series was not purely reproduced. Overlapping code points were 
removed - sometimes not rigorously enough, because legacy constraints 
were overwhelming (think of mu and micro) and could not always be 
resolved.

However, Latin-2 and up were not maintained, so already there the 
division into scripts, rather than merely preserving some of the layout, 
was visible as the guiding principle. That such division is useful, 
among other things for sorting, was clear from the start. But it has since 
become even more important, to the point where borrowed letter forms 
need to be re-encoded in the target script to make the system continue 
to work (Kurdish Q and W are examples of recent additions of Latin 
letters to the Cyrillic script - and the rationale was sorting).
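
(A quick illustration of the point in code -- a toy sketch in Python, using 
the standard unicodedata module: the two code points render identically in 
most fonts, yet no normalization form will ever unify them, which is what 
lets script-sensitive processes such as sorting tell them apart.)

    import unicodedata

    latin_o = "\u006F"   # LATIN SMALL LETTER O
    omicron = "\u03BF"   # GREEK SMALL LETTER OMICRON

    print(unicodedata.name(latin_o))   # LATIN SMALL LETTER O
    print(unicodedata.name(omicron))   # GREEK SMALL LETTER OMICRON

    # Identical appearance, but none of the normalization forms conflates
    # them -- "confusable", not canonically equivalent.
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        assert unicodedata.normalize(form, latin_o) != \
               unicodedata.normalize(form, omicron)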

>
>> Another complication is that human readers can be very
>> tolerant when associating shapes to letters in well
>> established contexts, but not tolerant at all outside of these
>> contexts. If you consider all, including decorative type
>> faces, the letter 'a' can have a bewildering array of actual
>> shapes, without losing its essential "a-ness" --- when used in
>> context. Some of the shapes for capital A can look like U
>> (check your Fraktur fonts), and out of context of running text
>> in Fraktur would be misidentified by many users.
> And that, as I understand it, is the reason why Unicode has a
> supposedly-firm rule against assigning separate code points to
> font (or type style) variations.

I would phrase that differently. Unicode encodes the "identity" of the 
character; in some instances, especially where the identity is membership 
in a well-enumerable alphabet, the associated shapes can have 
considerable variety in practice, without invalidating the encoding.

In other cases, notably for symbols, but sometimes also for punctuation, 
and, in large measure for ideographs, the shape as such defines the 
identity. (For ideographs, whether a particular variation is permissible 
or identity-changing is subject to very detailed analysis.)

>   At best, that is where more
> tradeoffs intrude because the relationship between, e.g., "a"
> (U+0061) and "MATHEMATICAL BOLD SMALL A" (U+1D41A) appears to
> the casual observer to be entirely about font variations.   Even
> if difference between textual and mathematical usage is accepted
> as significant enough to justify a separate set of codes, the
> relationship between U+1D41A and, e.g., U+1D44E (italic rather
> than bold) appears to be nothing more than a font variation
> within the same semantic context.

When used as a letter in text, "a" can be bold, italic, underlined, etc. 
and doesn't stop being an "a". (And it can be the two-story form with a 
handle or the single-story bowl form without issues.) Ransom notes show 
that even inconsistent rendering does not make the text illegible.

When used as a symbol (as part of a notation), all these different flavors 
of the "a" do in fact have a different identity. In phonetic notation, 
the bowl and handle forms of 'a' each have fixed, unambiguous 
identities - they can no longer be substituted. While you cannot say 
what a bold a means in any mathematical text without recourse to more 
context (such as the field or sub-field, or the author's conventions), 
you can assert that in general, substituting an italic form for a bold 
(or an unmarked) form is likely to change the meaning of the text. 
(Especially if the text contains some italic forms already, which then 
would be conflated.)

Being a universal encoding, Unicode must cater to overlapping 
conventions for the use of the same form.

In most cases, this can be handled by forcing the user to apply the 
conventions on top of the encoding, but in some cases, that's so far 
from optimal that both uses have to be supported in the encoding.

I agree that the casual user doesn't tend to reflect on the fact that 
0061 is a letter and 1D41A is a letterlike symbol, and that therefore 
different rules apply. So, you'll find many mathematical texts where the 
author incorrectly relies on style rather than semantic encoding (or, at 
a minimum, semantic markup)...
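
(To make the 0061 / 1D41A distinction concrete in code -- a small Python 
sketch using the standard unicodedata module: the mathematical symbol has 
no canonical decomposition, so NFC leaves it alone, while its compatibility 
mapping lets NFKC fold it back to the plain Latin letter.)

    import unicodedata

    math_bold_a = "\U0001D41A"   # MATHEMATICAL BOLD SMALL A
    plain_a = "\u0061"           # LATIN SMALL LETTER A

    # No canonical decomposition, so NFC keeps the symbol as-is ...
    print(unicodedata.normalize("NFC", math_bold_a) == math_bold_a)   # True

    # ... but the compatibility mapping lets NFKC fold it to the plain letter.
    print(unicodedata.normalize("NFKC", math_bold_a) == plain_a)      # True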

>
> >From my perspective, we can design identifier standards that
> adjust to almost any reasonable, consistent, and well-described
> set of conventions of a repertoire and coding system optimized
> for different goals, including the goals you describe.  The
> fundamental problems we seem to be facing here are less the
> differences in design goals than failures in "consistent and
> well-described", even if we consider only decisions made from
> 3.2 or 4.1 forward.

As a work in progress, and being required to respond to continual 
changes in how people write (emoji, anyone?), Unicode has to retain some 
flexibility that prevents an overly rigorous system of constraints. 
Especially as some of the things people do are such oddballs that it's 
really tough to anticipate them correctly up front.

The Cherokee are currently in the process of adding a lower case to 
their script. Totally unexpected, and it breaks with all ideas about 
the stability of case foldings. What to do?
>
>> Finally, Unicode is intentionally designed to be the *only*
>> such system, so that code conversion (other than trivial
>> re-packaging) is in principle needed only for accessing legacy
>> data. However, at the start, all data was legacy, and Unicode
>> had to be designed to allow migration of both data and systems.
> Presumably hence the compatibility decisions with existing
> standards, as discussed above.
>
>> Canonical decomposition entered the picture because the legacy
>> was at odds with how the underlying writing system was
>> analyzed. In looking at the way orthographies were developed
>> based on the Latin and Cyrillic alphabets, it's obvious that
>> plain letterforms are re-used over and over again, but with
>> the addition of a mark or modifier. These modifiers are named,
>> have their own identity, and can, in principle, be applied to
>> any letter -- often causing a predictable variation of value
>> of the base letter.
>>
>> Legacy, instead, cataloged the combinations.
> And that is exactly what the standard says happened and why, at
> least the way we read it and as it was explained to us.  With
> the help of what we understood normalization to be for, with the
> exclusion of all compatibility characters (perhaps too harshly
> restrictive for some Chinese cases), and with a few
> character-by-character adjustments, we thought we were coping
> fairly well.  However, the assumptions we made --again, based on
> both what the Standard appeared to say in plain language and
> advice from your colleagues -- included a belief that no new
> characters would be added within a given script if it could be
> composed from existing characters (i.e., that both the base
> character and any required combining characters were already
> coded).

The fact that Danish o-slash was not decomposed into o+slash was 
something that was "in your face" from Unicode 2.x (whenever the 
decompositions were first cataloged). This is not exactly an obscure 
character. Just pointing that out - and yes, I've managed to 
occasionally forget that one myself.
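
(For anyone who wants the data rather than the prose -- a minimal check 
with Python's unicodedata module: e-acute carries a canonical 
decomposition, the o-slash does not.)

    import unicodedata

    e_acute = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
    o_slash = "\u00F8"   # LATIN SMALL LETTER O WITH STROKE (Danish o-slash)

    # e-acute decomposes canonically into base letter plus combining mark ...
    print([unicodedata.name(c) for c in unicodedata.normalize("NFD", e_acute)])
    # ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

    # ... while o-slash has no canonical decomposition at all.
    print(unicodedata.decomposition(o_slash))                 # '' (none)
    print(unicodedata.normalize("NFD", o_slash) == o_slash)   # True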


>   What took us by surprise this time around --
> surprises that have caused some of us to question a whole series
> of basic assumptions -- is that, in addition to the conditions
> stated in the standard for adding new code points, there are an
> additional set of rules and cases about phonetic, semantic, and
> use-case distinctions that justify new code points.

The way this works, loosely speaking, is this: if someone comes and says 
"I want XY as a single code point", but, under all reasonable assumptions 
about the way this is supported, the consensus view is that X + Y already 
correctly represents the "identity" of what is being requested as XY, 
then the request dies very unceremoniously.

If the consensus view is that X + Y does not adequately represent 
the identity of what is being requested as XY, then the request can be 
considered, because in that case decomposing XY to X + Y would have been 
a non-starter even if XY had been encoded in Unicode 1.0.

(I'm using identity here as a unifying concept - it's not spelled out 
that way everywhere in the text of the standard, but you can find it 
mentioned. I find it really simplifies some arguments and should be used 
more often.)

For slashes, bars, strokes, hooks, and other overlaid or attached 
alterations of a base shape, there's a long-standing principle that says 
that the sequence is not specific enough to describe the modification, 
and thus doesn't capture the identity.

For the hamza, I gave the argument.

And, in general, for Arabic, the presumption is that decompositions 
aren't given (not repeating myself here).

Crucially, while Unicode will allow rather few exceptions, having this 
ability to investigate the identity allows it to flexibly deal with any 
unusual situations that come up, where someone's idea of an orthography 
may not play well with the principles of the script it's based on.

Even so, some cases are beyond repair: the reputed African orthography 
using '@' as a letter, for example. Re-encoding @ is just unthinkable...

>
> I'm convinced that, with your help, we can develop new rules or
> derived properties, perhaps even leading to a new normalization
> form that does what we thought NFC and NFD did and that is
> better at supporting context-free identifiers.

The case under discussion, where the sequence is used by one user 
group for one purpose and the singleton is used by another user group 
for a different purpose, is textbook for why normalization -- which 
selects one or the other alternative -- is not always a possible answer.

If one of the use cases were totally marginal and the other totally 
mainstream, perhaps you could get away with playing favorites by allowing 
one and not the other (with the difficulty that the order in which they 
show up over time in the repertoire is not, unfortunately, always with 
the mainstream use first).

At the same time, I share your reluctance to say "implementer beware" 
and leave it at that. Especially if you have identified the issue.

That's the basis of my distinction between "confusables by intent" and 
"confusables by accident/circumstance". A distinction that is not 
entirely unique to me; I see it in the Unicode data tables as well (not 
ones I contributed to).

Confusables by intent could be identified (Unicode has done some of the 
legwork, although I don't agree with all of their identifications). Once 
these are known and have a property, rather than eliminating them from 
the PVALID set, IDNA 2008 could require implementations to robustly 
handle exclusion for them.


Another thing that could be done is to create a profile of a 
"restricted" IDNA2008, where the PVALID set is limited to a "safer" or 
more conservative subset of code points, which might be useful for the 
most language-agnostic implementations of identifiers (such as the one 
we are developing for the root zone).
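
(A rough sketch of what such a profile amounts to operationally -- in 
Python, with a deliberately tiny, hypothetical ALLOWED set standing in for 
a published conservative repertoire; it is not a real table.)

    # Hypothetical, for illustration only: a "restricted" profile is just a
    # second, stricter allow-list applied on top of the ordinary IDNA2008
    # PVALID check.
    ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789-") | {"\u00e9", "\u00f8"}

    def restricted_ok(label: str) -> bool:
        """True if every code point of an (already PVALID-checked) label
        also falls within the conservative subset."""
        return all(ch in ALLOWED for ch in label)

    print(restricted_ok("b\u00f8rn"))   # True  -- Danish 'born' with o-slash
    print(restricted_ok("p\u0430x"))    # False -- Cyrillic 'a' is outside the subset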

>   But we need to
> know what the cases and rules that create them are.  I now infer
> that, in addition to the "Mathematical" characters as exceptions
> to the "no separately-coded font variations" rule (treated as
> almost a separate  script with compatibility transformations
> back to Latin), there are phonetic description characters that
> are treated as part of the Latin script, look just like base
> Latin characters with various combining markings or decorations,
> but that have ordinary "Lu" or "Ll" properties and no
> decompositions, and these Arabic cases that do not have
> decompositions because they are phonetically different from the
> composing sequence.

Yes, there are a number of them; it's not always possible to correctly 
anticipate which of these will cause "surprises".
>
>> For Latin and Cyrillic primarily, and many other scripts, but
>> for some historical reason not for Arabic, Unicode supports
>> the system of "applying marks" to base letters, by encoding
>> the marks directly. To support legacy, common combinations had
>> to be encoded as well. Canonical decomposition is in one sense
>> an assertion of which sequences of base + mark a given
>> combination is the equivalent. (In another sense,
>> decomposition asserts that the ordering of marks in a
>> combining sequence does not matter for some of these marks,
>> but matters for others).
> But then the questions become how far back "legacy" goes,
> because some of these combining sequences and precomposed
> characters seem to have been added fairly recently.  Be that as
> it may...

The interesting question would be how recently additions were made of the 
kind where the identity of XY is fully specified by X + Y, so that 
leaving out the decomposition really would result in dually encoding the 
same identity.

>
>> Arabic was excluded from this scheme for (largely) historical
>> reasons; combinations and precomposed forms are explicitly not
>> considered equal or equivalent, and one is not intended to be
>> substituted for another.
> Except, of course, where they are -- see U+0623.   I gather from
> the above that this is now considered a legacy issue.
>
>> So as to not break with the existing
>> system, additional composite forms will be encoded - always
>> without a decomposition.
> This sounds as if we have reached a point at which the rules for
> whether new characters are added, whether they are added in
> precomposed form, and whether those forms decompose, are
> actually different on a per-script (and, given the phonetic
> descriptors, perhaps a per-block or per-character (see below))
> basis.

I think that's a fair conclusion. Each script has always had a bit of 
its own logic (or should I say each script family). Think of Han: no other 
script gives a separate identity to so many rather minute variations of 
shape, while having a well understood system of ignoring other types of 
perhaps equally minute differences.


> If that is true, it would make the IDNA rule sets
> horribly complex, but it might be possible _if_ there is
> guidance as to whether the next new script to be coded will be
> handled more like Latin or more like Arabic.  The standard seems
> to me to rather clearly say "more like Latin", but maybe that is
> not true.

Most of the "new" scripts are historic, if not "archaic".

Those that are neither, are often relatives of existing scripts of the 
region (Asia) and would be treated accordingly.

However, scripts change. Note what I wrote about Cherokee.

>
>> (As an aside: Arabic is full of other, non-composite code
>> points that will look identical to other code points in some
>> context, but are not supposed to be substituted - yet it's
>> trivial to find instances where they have been).
> Understood and that is complicated by a lot of font and writing
> system variations, especially between Arabic language use of the
> script and uses by languages whose writing systems were more
> strongly influenced by Persian.  But, so far, that set of issues
> has been handled as confusables, not different ways to code
> what, by appearance and name, appear to be the same character.

The problem is that with better (more capable) support for Arabic in 
OSes and applications it would perhaps have been possible to avoid many of 
these dual encodings (like Arabic and Farsi yeh). As Tom Milo points 
out, they really encode preferences that are based on the fact that the 
preferred font styles vary by region (and therefore the letter shapes do).

What's more, the supposed rules about which shapes / letters to apply are 
readily enumerated by educated people, but at least as often violated in 
actual writing (e.g. signage) as the rules for the use of the apostrophe 
in English. Which means that any identifier system that relies on users 
using the correct letter has issues that go beyond encoding decisions.
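
(In code terms, with Python's unicodedata module: the two yeh code points 
are simply distinct, with no canonical relationship, even though their 
initial and medial shapes coincide in common font styles.)

    import unicodedata

    arabic_yeh = "\u064A"   # ARABIC LETTER YEH
    farsi_yeh  = "\u06CC"   # ARABIC LETTER FARSI YEH

    print(unicodedata.name(arabic_yeh))   # ARABIC LETTER YEH
    print(unicodedata.name(farsi_yeh))    # ARABIC LETTER FARSI YEH

    # No normalization form treats them as equivalent.
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        assert unicodedata.normalize(form, arabic_yeh) != \
               unicodedata.normalize(form, farsi_yeh)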

>
>> Latin, for example, also contains cases where, what looks like
>> a base letter with a mark (stroke, bar or slash) applied to
>> it, is not decomposed canonically. The rationale is that if I
>> apply a "stroke" to a letter form, the placement of the stroke
>> is not predictable. It may overstrike the whole letter, or
>> only a stem, or one side of a bowl. Like the aforementioned
>> case, new stroked, barred or slashed forms will be encoded in
>> the future, and none of these are (or will be) canonically
>> equivalent to sequences including the respective combining
>> marks. (This principle also holds for certain other "attached"
>> marks, like "hook", cf U+1D92, but not cedilla).
> So my "need to have different rules per script" hypothesis above
> is insufficient and the rules really have to be "per combining
> character"?

Yes, attached marks and non-attached marks are treated differently. See 
the Danish o-slash.

>
>> On the other hand, no new composite forms will be encoded of
>> those that would have been decomposed in the past.
> I don't understand what that means in practice.  It sounds like
> the rule in the standard, but it also appears that exceptions
> can be made at any point simply by saying "that isn't really the
> same abstract character" because it is phonetically,
> semantically, or historically different.

Identity. If XY really isn't X + Y then encoding it as X + Y isn't going 
to work.

The analysis of this is not something you do on the fly, obviously, and 
the UTC is rather reticent to go there, but if you can back them into a 
corner with strong arguments, facts and data, XY will get encoded.
>
>> To come to a short summary of a long story: Unicode is
>> misunderstood if its combining marks are seen as a set of lego
>> bricks for the assemblage of glyphs. Superficially, that's
>> what they indeed appear to be. However, they are always marks
>> with their own identity that happen to be applied, in writing,
>> to certain base letter forms, with the conventional appearance
>> being indistinguishable from a "decoration".
>   
>> Occasionally, because of legacy, occasionally for other
>> reasons, Unicode has encoded identical shapes using multiple
>> code points (homographs). A homograph pair can be understood
>> as something that if both partners were rendered in the same
>> font, they would (practically always) look identical. Not
>> similar, identical.
> Ok.  I can work with that definition but note, as I said at the
> top of this note, that the same term has been used to describe
> sets of code points that merely might appear similar to someone,
> in some font, on some days.

Yeah, we need some other term for the latter.

When we were debating Han unification there was the term "arms-length 
unification", based on the idea of squinting at two character shapes and 
saying "close enough". Needless to say, that was what the critics 
alleged. The reality was very different.

But for confusables, that's precisely what is being done.

>
>> The most common cases occur across scripts - as result of
>> borrowing. Scripts are a bit of an artificial distinction,
>> when operating on the level of shapes (whether in hot metal or
>> in a digital font) there's no need to distinguish whether 'e',
>> 's', 'p', 'x', 'q', 'w' and a number of other shapes are
>> "Latin" or "Cyrillic". They are the same shape. Whether they
>> are used to form English, French or Russian words happens to
>> be determined on another level.
> I understood that and have been trying to distinguish between
> same-script and inter-script differences, similarities, or
> identity all through these discussions.  I've even suggested
> that our IDNA-style identifier problems would be easier if
> Unicode had chosen to treat Latin, Cyrillic, and Greek as a
> single script.  The reasons that wasn't done are clear
> (including the legacy standard issue discussed above) and it is
> clearly too late to undo it (or to treat Arabic and Perso-Arabic
> as separate scripts) even if that were desirable more generally.

If you had robust variant definitions, you could treat LGC as a single 
script.

Insofar as you would register "pax" or "sex" only once, independent of 
whether it's Latin, Cyrillic or a mix.

As long as people pick the code points to go with the word (pax and sex 
don't make words in any language written in Cyrillic), users wouldn't 
even have trouble typing those labels they recognize.

But, to make it totally seamless, you'd have to have all flavors of 
"pax" bundled to point to a single IP. And that's not robustly done 
any more. But, if you are willing to bend on the requirement that users 
be able to predict how to type, then it could be done.
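
(A toy sketch, in Python, of what "all flavors of pax" means in practice; 
the homograph table below is hand-picked for illustration, not derived 
from any Unicode data file.)

    from itertools import product

    # Illustrative only: Latin letters with identically shaped Cyrillic partners.
    HOMOGRAPH_FLAVORS = {
        "a": ("a", "\u0430"),   # LATIN / CYRILLIC SMALL LETTER A
        "p": ("p", "\u0440"),   # LATIN SMALL LETTER P / CYRILLIC SMALL LETTER ER
        "x": ("x", "\u0445"),   # LATIN SMALL LETTER X / CYRILLIC SMALL LETTER HA
    }

    def flavors(label: str) -> set:
        """Every same-looking spelling of the label across the two scripts."""
        pools = [HOMOGRAPH_FLAVORS.get(ch, (ch,)) for ch in label]
        return {"".join(combo) for combo in product(*pools)}

    # Eight visually interchangeable spellings; "bundling" means all of them
    # resolve to (or are blocked in favor of) the one that was registered.
    print(len(flavors("pax")))   # 8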
>
>> Without the script distinction, these are no longer
>> homographs, because they would occur in the catalog only once.
>>
>> Because we do have script distinction in Unicode, they are
>> homographs, and they are usually handled by limiting script
>> mixing in an identifier - a rough type of "exclusion
>> mechanism".
>> ...
> Fortunately or unfortunately, we learned very early in IDNA
> deployment that a "no mixed-script labels" rule was impractical
> and would not be accepted by the user community due to various
> use cases in which identifiers are formed from multiple
> languages.   We can still advise that it is generally a bad idea
> (and have done so), but we cannot prohibit it.  Equally
> important for the domain name case, most users see an FQDN as a
> single identifier, not a set of isolated labels, but the
> "distributed administrative hierarchy" character of the DNS
> design (and how DNS aliases work) make it impossible to
> impose an "all of the labels in this domain name must be entirely
> in the same script" rule.
>
>> Because the reasons why these homographs were encoded are
>> still as valid as ever, any new instances that satisfy the
>> same justification, will be encoded as well. In all these
>> cases, the homographs cannot be substituted without (formally)
>> changing the meaning of the text (when interpreted by reading
>> code values, of course, not when looking at marks). Therefore,
>> they cannot have a canonical decomposition.
>>
>> Canonical decomposition, by necessity, thus cannot "solve" the
>> issue of turning Unicode into a perfect encoding for the sole
>> purpose of constructing a robust identifier syntax - like the
>> hypothetical encoding I opened this message with. If there
>> was, at any time, a misunderstanding of that, it can't be
>> helped -- we need to look for solutions elsewhere.
> That is becoming clear (or may already be clear to most people
> here).

OK.
>
>> The fundamental design limitation of IDNA 2008 is that,
>> largely, the rules that it describes pertain to a single label
>> in isolation.
>>
>> You can look at the string of code points in a putative label,
>> and compute whether it is conforming or not.
>>
>> What that kind of system handles poorly is the case where two
>> labels look identical (or are semantically identical with
>> different appearance -- where they look "identical" to the
>> mind, not the eyes, of the user).
>>
>> In these cases, it's not necessarily possible, a-priori, to
>> come to a solid preference of one over the other label (by
>> ruling out certain code points). In fact, both may be equally
>> usable - if one could guarantee that the name space did not
>> contain  a doppelganger.
>>
>> That calls for a different mechanism, what I have called
>> "exclusion mechanism".
>>
>> Having a robust, machine readable specification of which
>> labels are equivalent variants of which other labels, so that
>> from such a variant set, only one of them gets to be an actual
>> identifier. (Presumably the first to be applied for).
>>
>> This will immediately open all the labels that do not form a
>> plausible 'minimal pair' with one of their variants. For
>> example, a word in a language that uses code point X, where
>> the homograph variant Y is not on that locale's keyboard would
>> not be in contention with an entirely different word, where Y
>> appears, but in a different context, and which is part of a
>> language not using X on their keyboard. Only the occasional
>> collision (like "chat" and "chat" in French and English) would
>> test the formal exclusion mechanism.
>>
>> This less draconian system is not something that is easy to
>> retrofit on the protocol level.
> It is also associated with a requirement that should be made
> explicit.  Unless, for example, one language (or other
> criterion) is to be universally preferred to others in some sort
> of ranked hierarchy, there must be a registry, even if first
> come first served, in which the strings can be cataloged.  That
> works for IANA-maintained protocol parameter registries
> including the DNS root zone and for identifiers associated with
> some other type of registration authority.  It does not work for
> most more distributed environments, nor in situations where
> users pick their own identifiers for their own purposes.  For
> the latter case, sometimes we care about absolute uniqueness of
> the latter, sometimes statistical uniqueness is good enough, and
> sometimes we don't care at all.
>
> In addition, exclusion rules are good at preventing false
> positives but not so helpful for preventing false negatives.  We
> (and the users) often, but not always, care that the user can
> enter the same string twice, even from different systems on
> different days, and get results that are "the same" or will
> compare equal.  If that isn't reliable because different systems
> have different defaults, assumptions, or input methods, then we
> end up in an unhappy situation.

OK - I think we both understand the issue; I elaborated on it earlier, above.
>
>> But, already, outside the protocol level, issues of near (and
>> not so near) similarities have to be dealt with. Homographs in
>> particular (and "variants" in general) have the nice property
>> that they are treatable by mechanistic rules, because the
>> "similarities" whether graphical or semantic are "absolute".
>> They can be precomputed and do not require case-by-case
>> analysis.
> Actually, there are enough exceptions that I don't think a
> statement that general is appropriate.

Ah, perhaps we think of "rules" in a different way.

If you remove "arms-length" confusables (the accidental or circumstantial 
ones, that is), then the subset you are left with consists of those that 
(like true homographs) cannot be allowed to coexist, not even by 
exception (or appeal).

Therefore, you can set up a Pauli-esque exclusion principle for them 
that does not allow exceptions.

The exceptions that you are thinking of concern when such things might 
enter the repertoire - a different issue.
>
>> So, seen from the perspective of the entire eco-system around
>> the registration of labels, the perceived shortcomings of
>> Unicode are not as egregious and as devastating as they would
>> appear if one looks only at the protocol level.
>   
>> There is a whole spectrum of issues, and a whole set of layers
>> in the eco system to potentially deal with them. Just as
>> string similarity is not handled in the protocol, these types
>> of homographs should not have to be.
> Unfortunately, the view that these are just confusable
> characters that ought to be handled by subjective, per-registry
> and highly-distributed, judgments about what is acceptable
> --which I think is where this leads -- may discard a major goal
> of IDNA 2008, which was to be able to deal with the overwhelming
> majority of cases by algorithmic rules (or tables derived from
> them) that could be enforced by lookup-time checking.

What I'm trying to add to the discussion is the possibility of 
"objective" judgements that, while not enforceable at look-up time, can 
nevertheless be defined by algorithmic rules. The only reason they are 
not enforceable at look-up time is that all "exclusion" principles 
depend on the order of registration attempts (first one through the 
door, second excluded).

I see this in distinction to merely subjective "arms-length" evaluation 
on a sliding scale with override by appeal.

Because it is possible to describe these exclusions "algorithmically", 
and because we now have a language (an XML schema) to define them 
rigorously, there is the ability to identify them and provide the rules 
for them up front - with the aim that they become part of the 
established ecosystem.

>   That
> principle creates a kind of stability and predictability that
> are very important for both security and for a user sense that
> what is allowed (or disallowed) in one part of the DNS (or one
> protocol) will either be allowed elsewhere or the reasons will
> be clear.  By contrast, assuming that every zone administrator
> ("registry" -- remembering that there are hundreds of thousands
> of them, maybe more) will be cautious and conservative and will
> understand the issues and act in the best interests of the users
> and the Internet --or that they will all subscribe to the same
> judgment/ evaluation group and its conclusions-- is unrealistic.

Agreed.

I'm thinking about a list of code points / sequences in the IDNA tables 
where you give each of them membership in some EXCLUSION SET n, and then 
give them the property of being "valid only when no two labels exist that 
differ solely by two members from the same exclusion set". In other 
words, it wouldn't be a matter of research, investigation or judgement 
(or even understanding the issues).

But I appreciate the challenge in getting that implemented.
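
(A small sketch, in Python, of how such a property could be applied at 
registration time, using the case under discussion -- the precomposed 
U+08A1 versus the beh + hamza-above sequence -- as a hypothetical member 
pair of one exclusion set; the set assignments are placeholders, not 
proposed values.)

    # Hypothetical property data: listed code points / sequences carry an
    # exclusion-set id.
    EXCLUSION_SET = {
        "\u08A1": "n1",         # ARABIC LETTER BEH WITH HAMZA ABOVE
        "\u0628\u0654": "n1",   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE
    }

    def exclusion_key(label: str) -> tuple:
        """Greedily collapse listed code points / sequences to their set ids."""
        out, i = [], 0
        while i < len(label):
            if label[i:i+2] in EXCLUSION_SET:
                out.append(EXCLUSION_SET[label[i:i+2]]); i += 2
            elif label[i] in EXCLUSION_SET:
                out.append(EXCLUSION_SET[label[i]]); i += 1
            else:
                out.append(label[i]); i += 1
        return tuple(out)

    def may_register(candidate: str, registered: set) -> bool:
        """Reject a candidate when an existing label differs from it only at
        positions covered by the same exclusion set."""
        return exclusion_key(candidate) not in {exclusion_key(r) for r in registered}

    existing = {"\u0628\u0654"}              # the sequence spelling, registered first
    print(may_register("\u08A1", existing))  # False: same exclusion key
    print(may_register("\u0628", existing))  # True:  plain beh, no conflict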

Some (all??) of the true homograph code points and sequences tend to be 
specialized. It may be possible to make an IDNA2008 "restricted" 
profile that just doesn't allow either (that works in the Arabic case we 
are discussing; in some other cases, only one may be marginal).

The way some registries have thrown in all PVALID code points, you are 
about overdue for a "restricted" profile anyway - if you could couple 
that with (browser etc.) implementations that can flexibly allow access 
to unrestricted IDNA2008 only for zones that are known to have 
robust exclusion rules (and other robustness measures in their IDN 
tables), then you may have the hook to get the bad players to clean up 
their act.

Just blue-skying here; I'm sure "can't be done" is the answer to all of 
these. :)

> We tried that and found, even at the second level of the DNS,
> that several important registry operators were quite indignant
> that ICANN would presume to tell them what to do and that
> browser vendors invented different "who or what do you trust"
> rules about individual zones (see Gerv's note about that and
> note that the strategy is useless below the second level unless
> the top-level registry imposes contractual requirements on every
> delegation that apply recursively down the tree and that are
> rigorously enforced -- ideas that have proven unpopular and that
> may not even be plausible).
>
> I hope we can do better, even if it requires special properties
> to identify code points that would have had decompositions under
> a more identifier-oriented set of criteria although the current
> Unicode criteria do not define decompositions for them or even
> an IETF-defined normalization form that has better properties
> for identifiers (again, within a script and taking the script
> boundaries/ categories as given) than NFC is turning out to have.
>
>> Let's recommend handling them in more appropriate ways.
> I hope you have better suggestions than the above.  At this
> point, I don't and I find the idea of going down either of those
> paths fairly daunting.

I think your instinct of cataloging the problem first, and generating 
properties and algorithmic rules next, is spot on. I would suggest you 
consider rules that fall outside the current design point of picking a 
single-valued normalization or single-valued restriction logic, because 
those don't describe the problem domain. That means that, once you have 
your rules, you'll need to find a way to bring them to bear.

A./

PS: a "restricted" profile could nix all the historic only scripts and 
many historic only additions to existing scripts. They really aren't 
needed in a robust identifier system (except for vanity IDs ...).



