Comments on issues-07, part 1

Mark Davis mark.davis at icu-project.org
Tue Mar 11 21:51:25 CET 2008


Comments below.

http://tools.ietf.org/html/draft-klensin-idnabis-issues-07

...

>
> 1.5.2. Terminology about Characters and Character Sets
>
> A code point is an integer value associated with a character in a
> coded character set.
>
> Unicode [Unicode50] is a coded character set containing almost


The references to Unicode need to be to Unicode 5.1.


>
> 100,000 characters as of the current version. A single Unicode code
> point is denoted by "U+" followed by four to six hexadecimal digits,
> while a range of Unicode code points is denoted by two four to six
>
> digit hexadecimal numbers separated by "..", with no prefixes.
>
> ASCII means US-ASCII [ASCII], a coded character set containing 128
> characters associated with code points in the range 0000..007F.

...

>
> The prefix and string together must conform to all requirements
> for a label that can be stored in the DNS including conformance to
> the LDH ("host name") rule described in RFC 1034, RFC 1123 and
>
> elsewhere.
>
> o A "U-label" is an IDNA-valid string of Unicode characters,
> expressed in a standard Unicode Encoding Form, normally UTF-8, and

"normally UTF-8"

The encoding form will depend greatly on the environment. For example,
most browsers use UTF-16 internally, as do Java, C#, etc. So this
phrase should either be deleted or qualified, such as "normally UTF-8
in web documents".
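
To make the point concrete, here is a minimal Python sketch showing that
one and the same sequence of code points has a different byte
representation in each Unicode encoding form. The label "bücher" is only
an illustrative choice and does not come from the draft.

```python
# The same label, as a sequence of Unicode code points.
label = "b\u00fccher"  # "bücher" (illustrative example, not from the draft)

# Three standard Unicode encoding forms (big-endian, no byte order mark).
utf8 = label.encode("utf-8")
utf16 = label.encode("utf-16-be")
utf32 = label.encode("utf-32-be")

# 6 code points become 7, 12, and 24 bytes respectively.
print(len(label), len(utf8), len(utf16), len(utf32))

# Decoding any of the three forms yields the identical label.
decoded = {
    utf8.decode("utf-8"),
    utf16.decode("utf-16-be"),
    utf32.decode("utf-32-be"),
}
print(decoded == {label})
```

Whichever form an environment uses internally, the U-label itself is the
same; only the serialization differs.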


> subject to the constraint below. Conversions between valid
>
> U-labels and valid A-labels is performed according to the
> specification in [RFC3492], adding or removing the ACE prefix (see
> Section 1.5.4.3) as needed.
>
> To be valid, U-labels and A-labels must obey an important symmetry
>
> constraint. While that constraint may be tested in any of several
> ways, an A-label must be capable of being produced by conversion from
> a U-label and a U-label must be capable of being produced by
> conversion from an A-label. Among other things, this implies that
>
> both U-labels and A-labels must represent strings in normalized form.
> These strings MUST contain only characters specified elsewhere in
> this document and its companion documents, and only in the contexts

There should be no MUSTs or SHOULDs in this document -- it is
rationale and issues, not specification. All of those should be
explicitly in the specification documents: protocol, bidi, tables.
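
On the symmetry constraint quoted above, a small sketch may help readers
see what is being asked for. It uses the Punycode codec that ships with
Python, and it assumes that "normalized form" means NFC, which the quoted
text does not state. The helper names are invented for illustration, and
the sketch checks none of the other validity rules (permitted characters,
context, bidi).

```python
import unicodedata

ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Punycode-encode a U-label and add the ACE prefix."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Strip the ACE prefix and Punycode-decode the remainder."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("not an ACE-prefixed label")
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def is_symmetric(u_label: str) -> bool:
    """Round-trip test, plus an NFC check (NFC is an assumption here)."""
    if unicodedata.normalize("NFC", u_label) != u_label:
        return False
    return to_u_label(to_a_label(u_label)) == u_label


print(to_a_label("b\u00fccher"))      # precomposed u-umlaut
print(is_symmetric("b\u00fccher"))
print(is_symmetric("bu\u0308cher"))   # decomposed form fails the NFC check
```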

>
> indicated as appropriate.
>
> Any rules or conventions that apply to DNS labels in general, such as
> rules about lengths of strings, apply to whichever of the U-label or
>
>
>

...

>
> systems are substantially or completely Unicode-compatible (i.e., all
> of the code points in them have an exact and unique mapping to
> Unicode code points). It may be even more difficult when the
> character coding system in local use is based on conceptually
>
> different assumptions than those used by Unicode about, e.g., how
> different presentation or combining forms are handled, such as
> proposals now being developed for Tamil. Those differences may not

This is a bad example, being, well, untrue. All of the proposals for
Tamil encodings being discussed in Tamil Nadu do have a completely
natural mapping to Unicode, with no ambiguities.

I gave an example in earlier comments that was true, and which you
could use: the font encodings used by some Indic publishing sites.

> easily yield unambiguous conversions or interpretations even if each
>
> coding system is internally consistent and adequate to represent the
> local language and script

...

> 5.1.1. PROTOCOL-VALID
>
> Characters identified as "PROTOCOL-VALID" are, in general, permitted
> by IDNA for all uses in IDNs. Their use may be restricted by rules
>
> about the context in which they appear or by other rules that apply
> to the entire label in which they are to be embedded. For example,
> any label that contains a character in this group that has a "right
>
> to left" property must be used in context with the "Bidi" rules.
>
>
>
> Klensin Expires August 9, 2008 [Page 17]
>
> Internet-Draft IDNA200X Rationale February 2008
>
>
>
> The term "PROTOCOL-VALID", is used to stress the fact that the
> presence of a character in this category does not imply that a given
> registry need accept registrations containing any of the characters
>
> in the category. Registries are still expected to apply judgment
> about labels they will accept and to maintain rules consistent with
> those judgments (see [IDNA200X-Protocol] and Section 5.3).
>
> Characters that are placed in the "PROTOCOL-VALID" category are never
>
> removed from it unless the code points themselves are removed from
> Unicode (such removal would be inconsistent with the Unicode
> stability principles (see [Unicode50], Appendix F) and hence should
> never occur).

How strong is this promise really? An RFC could obsolete this one and
make this false, just as this RFC will make much of IDNA2003 false.

>
>
> 5.1.1.1. Contextual Rules
>
> Characters in the PROTOCOL-VALID category may actually be unsuitable
> for general use in IDNs but necessary for the plausible support of
> some scripts. The two most commonly-cited examples are the zero-
>
> width joiner and non-joiner characters (ZWNJ, U+200C, and ZWJ,
> U+200D), but provisions for unambiguous labels may require that other
> characters be restricted to particular contexts.

There has been no compelling case for having any Cf characters except
for ZWNJ and ZWJ. The Cf characters are far, far more problematic
than most punctuation characters; why does anyone think that they are
needed?

> For example, the
> ASCII hyphen is not permitted to start or end a label, whether that
>
> label contains non-ASCII characters or not.
>
> These characters must not appear in IDNs without additional
> restrictions, typically because they are invisible in most scripts
> but affect format or presentation in a few others or because they are
>
> combining characters that are safe for use only in conjunction with
> particular characters or scripts. In order to permit them to be used
> at all, they are specially identified as "CONTEXTUAL RULE REQUIRED"
>
> and, when adequately understood, associated with a rule. In
> addition, the rule will define whether it is to be applied on lookup
> as well as registration. Only rules associated with characters that
> indicate or prohibit joining are fully tested at lookup time.
>
>
> 5.1.1.2. Rules and Their Application
>
> The actual rules may be present or absent. If present, they may have
> values of "True" (character may be used in any position in any
>
> label), "False" (character may not be used in any label), or may be
> an extended regular expression that specifies the context in which
> the character is permitted.
>
> Examples of descriptions of typical rules include "Must follow a
>
> character from Script XYZ", "MUST occur only if the entire label is
> in Script ABC", "MUST occur only if the previous and subsequent
> characters have the DEF property".
>
> Because it is easier to identify these characters than to know that
>
> they are actually needed in IDNs or how to establish exactly the
> right rules for each one, a rule may have a null value in a given
> version of the tables. Characters associated with null rules MUST
> NOT appear in putative labels for either registration or lookup. Of
>
> course, a later version of the tables might contain a non-null rule.
>
> [[anchor24: Definition of regular expression language to be
> supplied]]
>
> 5.1.2. DISALLOWED
>
> Some characters are sufficiently problematic for use in IDNs that
>
> they should be excluded for both registration and lookup (i.e.,
> conforming applications performing name resolution should verify that
> these characters are absent; if they are present, the label strings
> should be rejected rather than converted to A-labels and looked up.
>
>
> Of course, this category would include code points that had been
> removed entirely from Unicode should such characters ever occur.
>
> Characters that are placed in the "DISALLOWED" category are never
>
> removed from it or reclassified. If a character is classified as
> "DISALLOWED" in error and the error is sufficiently problematic, the
> only recourse would be to introduce a new code point into Unicode and
>
> classify it as "PROTOCOL-VALID".
>
> There is provision for exception cases but, in general, characters
> are placed into "DISALLOWED" if they fall into one or more of the
> following groups:
>
>
> o The character is a compatibility equivalent for another character.
> In slightly more precise Unicode terms, application of
> normalization method NFKC to the character yields some other
> character.
>
>
> o The character is an upper-case form or some other form that is
> mapped to another character by Unicode casefolding.
>
> o The character is a symbol or punctuation form or, more generally,
> something that is not a letter or digit.

add: or mark.
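
A rough way to see which of these groups a character falls into is
sketched below in Python. This is only an approximation under stated
assumptions: it uses whatever Unicode data is bundled with the Python
interpreter, treats "digit" as general category Nd, includes marks as
suggested above, and ignores the exception cases the draft mentions. The
function name is made up for illustration; this is not the algorithm that
generates the tables.

```python
import unicodedata


def disallowed_reasons(ch: str) -> list:
    """Return which of the three quoted groups a single character falls into."""
    reasons = []
    # Group 1: NFKC maps the character to something else.
    if unicodedata.normalize("NFKC", ch) != ch:
        reasons.append("changed by NFKC")
    # Group 2: case folding maps the character to something else.
    if ch.casefold() != ch:
        reasons.append("changed by case folding")
    # Group 3: not a letter (L*), decimal digit (Nd), or mark (M*).
    category = unicodedata.category(ch)
    if not (category.startswith(("L", "M")) or category == "Nd"):
        reasons.append("not a letter, digit, or mark")
    return reasons


for ch in ["a", "A", "\ufb01", "\u2665", "\u0301"]:
    print(f"U+{ord(ch):04X}", disallowed_reasons(ch))
```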

>
>
> 5.1.3. UNASSIGNED
>
> For convenience in processing and table-building, code points that do
> not have assigned values in a given version of Unicode are treated as
> belonging to a special UNASSIGNED category. Such code points MUST
>
> NOT appear in labels to be registered or looked up. The category
> differs from DISALLOWED in that code points are moved out of it by
> the simple expedient of being assigned in a later version of Unicode
> (at which point, they are classified into one of the other categories
>
> as appropriate.
>
> 5.2. Registration Policy
>
> These recommendations do not address, but registries SHOULD develop

grammar. "address what"?

> and apply addition restrictions to reduce confusion and other
> problems. For example, it is generally believed that labels
>
> containing characters from more than one script are a bad practice
> although may be some important exceptions to that principle. Some
> registries may choose to restrict registrations to characters drawn
> from a very small number of scripts. For many scripts, the use of
>
> variant techniques such as those as described in [RFC3743] and
> [RFC4290] may be helpful in reducing problems that might be perceived
> by users.
>
> 5.3. Layered Restrictions: Tables, Context, Registration, Applications

I like this section. It should also add the recent information that
address spoofing itself is only a small part of the overall spoofing
problem; e.g., having a web page that mimics a legitimate page is a more
serious problem. It should also add that user agents have a
serious role to play.

>
>
> The essence of the character rules in IDNA200X is based on the
> realization that there is no magic bullet for any of the issues
> associated with a multiscript DNS. Instead, the specifications
> define a variety of approaches that, together, constitute multiple
>
> lines of defense against ambiguity in identifiers and loss of
> referential integrity. The actual character tables are the first
> mechanism, protocol rules about how those characters are applied or
> restricted in context are the second, and those two in combination
>
> constitute the limits of what can be done by a protocol alone.
> Registries are expected to restrict what they permit to be
> registered, devising and using rules that are designed to optimize
> the balance between confusion and risk on the one hand and maximum
>
> expressiveness in mnemonics on the other.
>
>
> 6. Issues that Constrain Possible Solutions
>
> 6.1. Display and Network Order
>
> The correct treatment of domain names requires a clear distinction
> between Network Order (the order in which the code points are sent in
>
> protocols) and Display Order (the order in which the code points are
> displayed on a screen or paper). The order of labels in a domain
> name is discussed in [IDNA200X-Bidi]. There are, however, also
> questions about the order in which labels are displayed if left-to-
>
> right and right-to-left labels are adjacent to each other, especially
> if there are also multiple consecutive appearances of one of the
> types. The decision about the display order is ultimately under the
> control of user agents --including web browsers, mail clients, and
>
> the like-- which may be highly localized. Even when formats are
> specified by protocols, the full composition of an Internationalized
> Resource Identifier (IRI) [RFC3987] or Internationalized Email
> address contains elements other than the domain name. For example,
>
> IRIs contain protocol identifiers and field delimiter syntax such as
> "http://" or "mailto:" while email addresses contain the "@" to
> separate local parts from domain names. User agents are not required
>
> to use those protocol-based forms directly but often do so. While
> display, parsing, and processing within a label is specified by the
> IDNA protocol and the associated documents, the relationship between
>
> fully-qualified domain names and internationalized labels is
> unchanged from the base DNS specifications. Comments here about such
> full domain names are explanatory or examples of what might be done
> and must not be considered normative.
>
>
> Questions remain about protocol constraints implying that the overall
> direction of these strings will always be left-to-right (or right-to-
> left) for an IRI or email address, or if they even should conform to
>
> such rules. These questions also have several possible answers.
> Should a domain name abc.def, in which both labels are represented in
> scripts that are written right-to-left, be displayed as fed.cba or
> cba.fed? An IRI for clear text web access would, in network order,
>
> begin with "http://" and the characters will appear as
> "http://abc.def" -- but what does this suggest about the display
> order? When entering a URI to many browsers, it may be possible to
>
> provide only the domain name and leave the "http://" to be filled in
> by default, assuming no tail (an approach that does not work for
> other protocols). The natural display order for the typed domain
>
> name on a right-to-left system is fed.cba. Does this change if a
> protocol identifier, tail, and the corresponding delimiters are
> specified?
>
> While logic, precedent, and reality suggest that these are questions
>
> for user interface design, not IETF protocol specifications,
> experience in the 1980s and 1990s with mixing systems in which domain
> name labels were read in network order (left-to-right) and those in
> which those labels were read right-to-left would predict a great deal
>
> of confusion, and heuristics that sometimes fail, if each
> implementation of each application makes its own decisions on these
> issues.
>
> It should be obvious that any revision of IDNA must be more clear
>
> about the distinction between network and display order for complete
> (fully-qualified) domain names, as well as simply for individual
> labels, than the original specification was. It is likely that some
>
> strong suggestions should be made about display order as well.

I didn't understand this paragraph. Is it supposed to apply to
IDNA200X, or to some future version? Either way, it needs rewriting to
be more than a vague wish. I'm guessing this was just a holdover from
previous versions.

> 6.2. Entry and Display in Applications
>
> Applications can accept domain names using any character set or sets
> desired by the application developer, and can display domain names in
> any charset. That is, the IDNA protocol does not affect the
>
...
>
> In any place where a protocol or document format allows transmission
>
> of the characters in internationalized labels, labels SHOULD be
> transmitted using whatever character encoding and escape mechanism
> the protocol or document format uses at that place.

Is there any choice? If the protocol says that text is UTF-8 and I
stick in Latin-1 bytes, I'm simply breaking that protocol. Why does
the above paragraph have to be included? What is a concrete example of
the problem?

>
> All protocols that use domain name slots already have the capacity
>
> for handling domain names in the ASCII charset. Thus, A-labels can
> inherently be handled by those protocols.
>
>
>
>
>
>
...

The language in the section below is pretty casual for a standard,
talking about "greedy registrars" (examples?). I assume that this will
be removed before this document goes final.


> 7. IDNs and the Robustness Principle
>
> The model of IDNs described in this document can be seen as a
> particular instance of the "Robustness Principle" that has been so
> important to other aspects of Internet protocol design. This
>
> principle is often stated as "Be conservative about what you send and
> liberal in what you accept" (See, e.g., RFC 1123, Section 1.2.2
> [RFC1123]). For IDNs to work well, registries must have or require
>
> sensible policies about what is registered -- conservative policies
> -- and implement and enforce them. Registries, registrars, or other
> actors who do not do so, or who get too liberal, too greedy, or too
>
> weird may deserve punishment that will primarily be meted out in the
> marketplace or by consumer protection rules and legislation. One can
> debate whether or not "punishment by browser vendor" is an effective
>
> marketplace tool, but it falls into the general category of
> approaches being discussed here. In any event, the Protocol Police
> (an important, although mythical, Internet mechanism for enforcing
> protocol conformance) are going to be worth about as much here as
>
> they usually are -- i.e., very little -- simply because, unlike the
> marketplace and legal and regulatory mechanisms, they have no
> enforcement power.
>
> Conversely, resolvers can (and SHOULD or maybe MUST) reject labels
>
> that clearly violate global (protocol) rules (no one has ever
> seriously claimed that being liberal in what is accepted requires
> being stupid). However, once one gets past such global rules and
> deals with anything sensitive to script or locale, it is necessary to
>
> assume that garbage has not been placed into the DNS, i.e., one must
> be liberal about what one is willing to look up in the DNS rather
> than guessing about whether it should have been permitted to be
> registered.
>
>
> As mentioned above, if a string doesn't resolve, it makes no
> difference whether it simply wasn't registered or was prohibited by
> some rule.
>
> If resolvers, as a user interface (UI) or other local matter, decide
>
> to warn about some strings that are valid under the global rules but
> that they perceive as dangerous, that is their prerogative and we can
> only hope that the market (and maybe regulators) will reward the good
>
> choices and punish the bad ones. In this context, a resolver that
> decides a string that is valid under the protocol is dangerous and
> refuses to look it up is in violation of the protocols; one that is
> willing to look something up, but warns against it, is exercising a
>
> local choice.
>
>
> 8. Front-end and User Interface Processing

As I've said before, locale-specific preprocessing is a nightmare for
compatibility. While there isn't anything we can do to prevent it, we
should strongly *discourage* it, and not have any language like the
following that seems to *encourage* it.

Also, "processing" => "preprocessing" for consistency.

...
>
> As discussed elsewhere in this document, the IDNA200X model is to
>
> remove all of these mappings and interpretations, including the
> equivalence of different forms of dots, from the protocol, leaving
> such mappings to local processing. This should not be taken to imply
> that local processing is optional or can be avoided entirely.
>
> Instead, unless the program context is such that it is known that any
> IDNs that appear will be either U-labels or A-labels, some local
> processing of apparent domain name strings will be required, both to
>
> maintain compatibility with IDNA2003 and to prevent user
> astonishment. Such local processing, while not specified in this
> document or the associated ones, will generally take one of two
> forms:
>
>
> o Generic Preprocessing.
> When the context in which the program or system that processes
> domain names operates is global, a reasonable balance must be
> found that is sensitive to the broad range of local needs and
>
> assumptions while, at the same time, not sacrificing the needs of
> one language, script, or user population to those of another.
>
> For this case, the best practice will usually be to apply NFKC and
>
> case-mapping (or, perhaps better yet, Stringprep itself), plus
> dot-mapping where appropriate, to the domain name string prior to
> applying IDNA. That practice will not only yield a reasonable
> compromise of user experience with protocol requirements but will
>
> be almost completely compatible with the various forms permitted
> by IDNA2003.
>
> o Highly Localized Preprocessing.
> Unlike the case above, there will be some situations in which
> software will be highly localized for a particular environment and
>
> carefully adapted to the expectations of users in that
> environment. The many recent discussions about using the Internet
> to preserve and support local cultures suggest that these cases
> may be more common in the future than they have been so far.
>
>
> In these cases, we should avoid trying to tell implementers what
> they should do, if only because they are quite likely (and for
> good reason) to ignore us. We would assume that they would map
>
> characters that the intuitions of their users would suggest be
> mapped. One can imagine switches about whether some sorts of
> mappings occur, warnings before applying them or, in a slightly
> more extreme version of the approach taken in Internet Explorer
>
> version 7 (IE7), utterly refuse to handle "strange" characters at
> all if they appear in U-label form. None of those local decisions
> are a threat to interoperability as long as (i) only U-labels and
>
> A-labels are used in interchange with systems outside the local
> environment, (ii) no character that would be valid in a U-label as
> itself is mapped to something else, (iii) any local mappings are
>
> applied as a preprocessing step (or, for conversions from U-labels
> or A-labels to presentation forms, postprocessing), not as part of
> IDNA processing proper, and (iv) appropriate consideration is
>
> given to labels that might have entered the environment in
> conformance to IDNA2003.
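
For readers who want to see what the "Generic Preprocessing" case quoted
above amounts to, here is a minimal Python sketch. It is an approximation
and not Stringprep or Nameprep: str.casefold() stands in for the case
mapping, the three extra dot characters are the ones IDNA2003 treats as
label separators (U+3002, U+FF0E, U+FF61), and the function name is
invented for illustration.

```python
import unicodedata

# Dot-like characters that IDNA2003 treats as label separators,
# in addition to U+002E FULL STOP.
IDNA2003_DOTS = "\u3002\uff0e\uff61"


def generic_preprocess(domain: str) -> str:
    """Approximate dot-mapping, case-mapping, and NFKC before IDNA proper."""
    for dot in IDNA2003_DOTS:
        domain = domain.replace(dot, ".")
    # Case folding followed by NFKC; real Nameprep uses its own tables.
    return unicodedata.normalize("NFKC", domain.casefold())


print(generic_preprocess("B\u00dcCHER\u3002Example"))
print(generic_preprocess("\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45\uff0ecom"))
```

The output of such a step would still have to pass the IDNA validity
checks; the sketch only covers the mapping that the draft moves out of the
protocol.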
>
>
> 9. Migration and Version Synchronization
>
> 9.1. Design Criteria
>
> As mentioned above and in RFC 4690, two key goals of this work are to
>
> enable applications to be agnostic about whether they are being run
> in environments supporting any Unicode version from 3.2 onward and to
> permit incrementally adding permitted scripts and other character
>
> collections without disruption. The mechanisms that support this are
> outlined above, but this section reviews them in a context that may
> be more helpful to those who need to understand the approach and make
>
> plans for it.
>
> 9.1.1. General IDNA Validity Criteria
>
> The general criteria for a putative label, and the collection of
> characters that make it up, to be considered IDNA-valid are:
>
> o The characters are "letters", numerals, or otherwise used to write
>
> words in some language. Symbols, drawing characters, and various
> notational characters are permanently excluded -- some because
> they are actively dangerous in URI, IRI, or similar contexts and
>
> others because there is no evidence that they are important enough
> to Internet operations or internationalization to justify
> inclusion and the complexities that would come with it (additional
> discussion and rationale for the symbol decision appears in
>
> Section 9.5).

The argument here is untenable, and should just be replaced by
something short but tenable, such as that the symbols are not needed
for forming words, and some of them have issues for spoofing.

Why do I say untenable? Because saying that "symbols aren't allowed
because of ambiguity when reading" is far, far too broad a stroke. It
is like saying that marriage shouldn't be allowed for gays because it
is reserved for procreation. There are simply far too many obvious
counterexamples, like people who are sterile, women past menopause,
etc.

Similarly, there are many, many words and phrases that can't be simply
conveyed as spoken without additional clarifying text. When I say my
email is markdavis at google.com, I always have to say "markdavis (one
word)", and often "markdavis (one word, with a KAY)", and even
sometimes "markdavis (one word, with a KAY, and VEE I ESS, not VEE I E
ESS)". The situation is compounded in many other languages with even
more homophones, like French or Chinese. It is much easier to
distinguish "black heart symbol" verbally, than it is to distinguish
words and letters of many, many languages.

Moreover, while there are many symbols that one does have to
distinguish verbally because of similar related ones, many are
perfectly clear without any such disambiguation, and we don't want the
rationale to imply that those *should* be allowed.

>
> If strings are read out loud, rather than seen on paper, there are
> opportunities for considerable confusion between the name of a
> symbol (and a single symbol may have multiple names) and the
>
> symbol itself.
>
> o As a simplified example of this, assume one wanted to use a
> "heart" or "star" symbol in a label. This is problematic because
> the those names are ambiguous in the Unicode system of naming (the
>
> actual Unicode names require far more qualification). A user or
> would-be registrant has no way to know --absent careful study of
> the code tables-- whether it is ambiguous (e.g., where there are
>
> multiple "heart" characters) or not. Conversely, the user seeing
> the hypothetical label doesn't know whether to read it --try to
> transmit it to a colleague by voice-- as "heart", as "love", as
>
> "black heart", or as any of the other examples below.
>
> o The actual situation is even worse than this. There is no
> possible way for a normal, casual, user to tell the difference
> between the hearts of U+2665 and U+2765 and the stars of U+2606
>
> and U+2729 or the without somehow knowing to look for a
> distinction. We have a white heart (U+2661) and few black hearts
> and describing a label containing a heart symbol is hopelessly
> ambiguous. In cities where "Square" is a popular part of a
>
> location name, one might well want to use a square symbol in a
> label as well and there are far more squares of various flavors in
> Unicode than there are hearts or stars.
>
> o Unlike font and style variations in language (and "mathematical")
>
> characters, identification of compatibility encodings and the
> application of NFKC is of no help here. All of these symbols (and
> many other pairs and triples) are treated as valid, independent,
>
> non-reducible, code points.
>
> o Other than in very exceptional cases, e.g., where they are needed
> to write substantially any word of a given language, punctuation
> characters are excluded as well. The fact that a word exists is
>
> not proof that it should be usable in a DNS label and DNS labels
> are not expected to be usable for multiple-word phrases (although
> they are not prohibited if the conventions and orthography of a
>
> particular language cause that to be possible).
>
> o Characters that are unassigned in the version of Unicode being
> used by the registry or application are not permitted, even on
> resolution (lookup). There are at least two reasons for this.
>
> First, unlike the conditions contemplated in IDNA2003 (except for
> right-to-left text), we now understand that tests involving the
> context of characters (e.g., some characters being permitted only
> adjacent to other ones of specific types) and integrity tests on
>
> complete labels will be needed. Unassigned code points cannot be
> permitted because one cannot determine the contextual rules that
> particular code points will require before characters are assigned
>
> to them and the properties of those characters fully understood.
> Second, Unicode specifies that an unassigned code point normalizes
> to itself. If the code point is later assigned to a character,
>
> and particularly if the newly-assigned code point has a combining
> class that determines its placement relative to other combining
> characters, it could normalize to some other code point or
> sequence, creating confusion and/or violating other rules listed
>
> here.

The above isn't a good rationale (see other email). I'm not contesting
the disallowing of unassigned, but it needs different grounds.

>
> o Any character that is mapped to another character by Nameprep2003
> or by a current version of NFKC is prohibited as input to IDNA
> (for either registration or resolution). Implementers of user
>
> interfaces to applications are free to make those conversions when
> they consider them suitable for their operating system
> environments, context, or users.
>
> Tables used to identify the characters that are IDNA-valid are
>
> expected to be driven by the principles above. The principles are
> not just an interpretation of the tables.
>
> 9.1.2. Labels in Registration
>
> Anyone entering a label into a DNS zone must properly validate that
>
> label -- i.e., be sure that the criteria for an A-label are met -- in
> order for Unicode version-independence to be possible. In
> particular:
>
> o Any label that contains hyphens as its third and fourth characters
>
> MUST be IDNA-valid. This implies that, (i) if the third and
> fourth characters are hyphens, the first and second ones MUST be
> "xn" until and unless this specification is updated to permit
>
> other prefixes and (ii) labels starting in "xn--" MUST be valid
> A-labels, as discussed in Section 3 above.
>
> o The Unicode tables (i.e., tables of code points, character
> classes, and properties) and IDNA tables (i.e., tables of
>
> contextual rules such as those described above), MUST be
> consistent on the systems performing or validating labels to be
> registered. Note that this does not require that tables reflect
> the latest version of Unicode, only that all tables used on a
>
> given system are consistent with each other.
>
> Systems looking up or resolving DNS labels MUST be able to assume
> that those rules were followed on registration.
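
The first registration bullet above (hyphens in the third and fourth
positions) is mechanical enough to sketch. The following Python fragment
only classifies a label by that rule; it does not check that an "xn--"
label is actually a valid A-label, and the function name and return
strings are invented for illustration.

```python
def reserved_hyphen_check(label: str) -> str:
    """Classify a label by the rule on hyphens in positions 3 and 4."""
    if len(label) >= 4 and label[2:4] == "--":
        if label[:2].lower() != "xn":
            # Hyphens in positions 3-4 with any other prefix are not permitted.
            return "invalid-prefix"
        # Starts with "xn--": must additionally be a valid A-label.
        return "must-be-a-label"
    # The rule does not apply to this label.
    return "ordinary"


for label in ["xn--bcher-kva", "ab--cd", "example", "a-b"]:
    print(label, reserved_hyphen_check(label))
```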
>
> 9.1.3. Labels in Resolution (Lookup)
>
> Anyone looking up a label in a DNS zone MUST

Change the format to have the MUST as the first word of each bullet
(you'll see why below).
>
>
> o Maintain a consistent set of tables, as discussed above. As with
> registration, the tables need not reflect the latest version of
> Unicode but they MUST be consistent.
>
> o Validate the characters in labels to be looked up only to the
>
> extent of determining that the U-label does not contain either
> code points prohibited by IDNA (categorized as "DISALLOWED") or
> code points that are unassigned in its version of Unicode.
>
>
> o Validate the label itself for conformance with a small number of
> whole-label rules, notably verifying that there are no leading
> combining marks, that the "bidi" conditions are met if right-to-
>
> left characters appear, that any required contextual rules are
> available and that, if such rules are associated with Joiner
> Controls, they are tested.

> No attempt should be made to validate
> other contextual rules about characters, including mixed-script
>
> label prohibitions, although such rules MAY be used to influence
> presentation decisions in the user interface.

Break the last bullet into two, just before "No attempt should", and
change the second to:

o  MUST NOT validate other contextual rules about characters....
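
To show how limited the lookup-time checks are meant to be, here is a
partial Python sketch of the bullets above. Assumptions: the `disallowed`
argument is a hypothetical stand-in for the DISALLOWED entries of the IDNA
tables, "unassigned" is judged by general category Cn in the Unicode
version bundled with the interpreter, and the bidi and Joiner Control
checks are left out entirely.

```python
import unicodedata


def lookup_precheck(u_label: str, disallowed=frozenset()) -> list:
    """Return a list of problems found; an empty list means no objection."""
    problems = []
    if not u_label:
        return ["empty label"]
    for ch in u_label:
        if ch in disallowed:
            problems.append(f"U+{ord(ch):04X} is DISALLOWED")
        elif unicodedata.category(ch) == "Cn":
            problems.append(f"U+{ord(ch):04X} is unassigned")
    # Whole-label rule: no leading combining mark (general category M*).
    if unicodedata.category(u_label[0]).startswith("M"):
        problems.append("leading combining mark")
    # Deliberately no mixed-script or other contextual validation here.
    return problems


print(lookup_precheck("b\u00fccher"))
print(lookup_precheck("\u0301abc"))
print(lookup_precheck("a\u2665", disallowed={"\u2665"}))
```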


>
> By avoiding applying its own interpretation of which labels are valid
> as a means of rejecting lookup attempts, the resolver application
>
> becomes less sensitive to version incompatibilities with the
> particular zone registry associated with the domain name.
>
...


>   02B9; MODIFIER LETTER PRIME; F;;;
>      Permitted only in contexts in which U+0375 is permitted.  U+0374
>      and U+0375 are indicators for numeric use of letters in older
>      Greek writing systems.  U+02B9 is relevant because normalization
>      maps U+0374 into it.;

General note. Please always use the full form U+<code> <name> with all
references to (non-ASCII) Unicode characters. Nobody has these codes
memorized!

E.g., U+0374 GREEK NUMERAL SIGN

There are quite a number of web utilities that let you get the name.

http://unicode.org/cldr/utility/character.jsp  - enter the code
0374 to get the name GREEK NUMERAL SIGN

http://rishida.net/scripts/uniview/ - enter 0374 in "custom range" to
get the name

&c.
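
The full form can also be produced locally. A minimal sketch using
Python's standard unicodedata module follows; the helper name `full_form`
is made up for illustration, and the names come from the Unicode version
bundled with the interpreter.

```python
import unicodedata


def full_form(cp: int) -> str:
    """Format a code point as U+<code> <name>."""
    name = unicodedata.name(chr(cp), "<no name>")
    return f"U+{cp:04X} {name}"


for cp in (0x02B9, 0x0374, 0x0375):
    print(full_form(cp))

# Normalization maps U+0374 to U+02B9, as the table entry says.
print(unicodedata.normalize("NFC", "\u0374") == "\u02b9")
```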


-- 
Mark

