DRAFT Status of Work on IDNA2008 + IDNAv2
vint at google.com
Fri Mar 20 14:12:05 CET 2009
DRAFT Status of work on IDNA2008
3/21/2009 0523 PDT
This brief summary is intended to provide some focus for the IDNABIS
scheduled for Monday and Tuesday, March 23 (1740-1940) and March 24
One goal is to try to assess rough consensus about the present
documentation on the presumption that we are abiding by the ground
rules set forth in the charter of the WG.
Another is to assess what the implications are for users, registries,
IDNA2008 is adopted as it presently stands. A third goal is to examine
the implications of the IDNAv2 proposal from Paul Hoffman and contrast
with adoption of
I fully recognize that consensus has to be assessed from mailing list
merely from appearances at our face to face meetings.
The material presented below is by no means intended to be more than a
discussion, and is not intended as a penultimate recommendation.
Under the IDNABIS charter, the IDNA2008 design as it now stands makes
specific assumptions or makes specific propositions to achieve a
number of goals:
0. Avoid dependence on any specific version of Unicode through the use
for determining PVALID characters based on Unicode character
1. No change to the deployed DNS server functionality (domain name
labels limited to ASCII and case-insensitive matching only)
2. Eszett, Final Sigma, ZWJ and ZWNJ, geresh and gershayim are PVALID
some of which are treated through contextual rules (there is still
ongoing discussion about the implications of these choices)
3. Unassigned Unicode characters will not be looked up
4. No mapping of characters at least within the protocol specification
5. No modification of or dependence on Nameprep (and thus no impact
on other protocols relying on Nameprep or Stringprep.)
6. Clear specification of valid "dot" form in a way that is consistent
7. Symmetry between native-character ("Unicode") and ACE ("Punycode")
forms of a label.
8. Conversion to an inclusion list of PVALID characters (as distinct
IDNA2003 posture that excluded only a few Unicode characters)
9. Improved terminology to make categories and types of labels more
10. Provide explanation for decisions and their motivations
aid implementors, registries, registrants and users in
11. Separately describe registration and lookup procedures to improve
12. Specify tests to be applied at lookup time in an attempt to limit
IDNA at all levels of registration
13. Clarify what is expected of IDNA-aware applications and domain name
"slots" with regard to invalid labels and future extensibility
Chartering and Re-Chartering
(1) A Re-charter is needed if we abandon a significant fraction of the
and methods. IDNAv2, as described by Paul Hoffman, requires a re-charter.
(2) A Re-charter is needed if the WG decides to introduce mappings into
the IDNA2008 specifications, since the basic assumption in IDNA2008 was
that mapping would not be part of the specification.
(3) It is possible that re-charter might not be needed if IDNA2008
IDNA2003 operations under a restricted set of conditions and only at
time for purposes of easing the transition to IDNA2008. This would
presumably be up to the AD and IESG to decide.
Basics for IDNA2003 and IDNA2008
Both of these specifications use the Punycode algorithm to generate what
IDNA2008 would call an A-label (i.e., "xn-- <LDH compliant string>") from
labels expressed as a string of characters drawn from a subset of
DNS matching is done in the servers by comparing the query string to the
registered string in a case-independent fashion. For IDNs, these
are done after conversion into the "xn--" prefix form. For IDNs the
matching of the DNS servers applies only to the A-label form and not
Unicode form. This means that the case-insensitive matching behavior of
in traditional ASCII labels is not conferred on IDNs in their Unicode
The case-insensitive comparison between traditional LDH domain names is
approximated under IDNA2003 by using CaseFold as a mapping guide on the
Unicode strings being looked up. In addition, IDNA2003 also maps the
"compatibility" characters of Unicode into their counterparts. The
precede the registration of new domain names under IDNA2003.
Unicode CaseFold in effect maps to upper case and then maps back to
lower case. Prior to Unicode 5.1, Eszett became "SS" because there was
no upper case form, and then became "ss" in the lower case mapping.
When upper case Eszett was introduced in Unicode 5.1, CaseFold was left
unchanged for stability reasons. Consequently CaseFold(Eszett) is "ss"
rather than lower case Eszett.
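The folding behavior described above can be observed with a short
sketch. It assumes Python 3, whose str.casefold() applies the full case
folding of whatever Unicode version the interpreter ships with; the
helper name show_folding is made up for illustration and is not part of
any IDNA specification.

```python
# Illustrative only: results reflect the Unicode tables bundled with
# the running Python interpreter, not any particular IDNA version.
ESZETT = "\u00df"          # LATIN SMALL LETTER SHARP S
CAPITAL_ESZETT = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S
FINAL_SIGMA = "\u03c2"     # GREEK SMALL LETTER FINAL SIGMA


def show_folding(ch):
    """Return the (upper, lower, casefold) forms of one character."""
    return ch.upper(), ch.lower(), ch.casefold()


if __name__ == "__main__":
    for ch in (ESZETT, CAPITAL_ESZETT, FINAL_SIGMA):
        print("U+%04X" % ord(ch), show_folding(ch))
```

Both forms of Eszett fold to "ss", and Final Sigma folds to the ordinary
small sigma, which is why a CaseFold-based mapping cannot preserve
either character.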
Under IDNA2003, both DISALLOWED and UNASSIGNED characters
are looked up. If abusive registrations are made using DISALLOWED
or UNASSIGNED characters, these registered domain names may be
found on lookup by IDNA2003-compliant clients.
Under IDNA2008, UNASSIGNED and DISALLOWED characters are not looked up.
If new characters become defined under a new version of Unicode,
an old client will not look them up until it is updated. Abusive
registrations using UNASSIGNED characters will not be looked up.
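A minimal sketch of the UNASSIGNED part of that lookup behavior,
assuming Python 3 and its bundled Unicode tables. The function names are
hypothetical, and the DISALLOWED test (which needs the IDNA2008 Tables)
is deliberately left out.

```python
import unicodedata


def is_unassigned(ch):
    """True if the local Unicode tables give ch general category Cn."""
    return unicodedata.category(ch) == "Cn"


def lookup_permitted(label):
    """Hypothetical lookup-time check: refuse any unassigned code point."""
    return not any(is_unassigned(ch) for ch in label)


if __name__ == "__main__":
    print("Unicode tables in this client:", unicodedata.unidata_version)
    print(lookup_permitted("b\u00fccher"))  # every code point is assigned
    print(lookup_permitted("ab\u0378"))     # U+0378 is unassigned at the time of writing
```

A client with older tables answers differently from one with newer
tables for a newly assigned character, which is the transition effect
described above.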
Script mixing is not banned under IDNA2003. Under IDNA2008, BiDi
bans mixing of European and Extended Arabic-Indic numbers with
Arabic numbers. That is, AN and EN characters may not be present in
the same label. Otherwise, mixing is permitted in IDNA2008.
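The AN/EN restriction can be sketched as follows, assuming Python 3 and
its unicodedata module. This covers only the digit-mixing condition
mentioned above, not the rest of the BiDi rules, and the function name
is illustrative.

```python
import unicodedata


def mixes_an_and_en(label):
    """True if the label contains both AN and EN bidi-class characters."""
    classes = {unicodedata.bidirectional(ch) for ch in label}
    return "AN" in classes and "EN" in classes


if __name__ == "__main__":
    alef = "\u0627"              # ARABIC LETTER ALEF
    european_one = "1"           # DIGIT ONE, bidi class EN
    extended_one = "\u06f1"      # EXTENDED ARABIC-INDIC DIGIT ONE, class EN
    arabic_indic_one = "\u0661"  # ARABIC-INDIC DIGIT ONE, class AN
    print(mixes_an_and_en(alef + european_one + extended_one))
    print(mixes_an_and_en(alef + european_one + arabic_indic_one))
```

European and Extended Arabic-Indic digits share class EN, so only the
second label above would be rejected by this check.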
IMPLICATIONS OF ADOPTING IDNA2008 AS CURRENTLY SPECIFIED
1. IDNA2008 is case-sensitive for labels with non-LDH characters in
them but is case-insensitive for LDH characters.
For example, "buecher" is all ASCII and could be matched with "Buecher";
however "B<u-umlaut>cher" would not be allowed because Tables (see
disallow Latin Capital letters. Some users accustomed to LDH-label
may be surprised that "B<u-umlaut>cher" and "b<u-umlaut>cher" do not
On the other hand, the symmetric relationship between the IDNA2008-
A-label and U-label has the benefit that one can use exact match for
either U-label or A-label forms since they are directly and
unambiguously transformable into each other.
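The round trip between the two forms can be illustrated with Python's
built-in "punycode" codec. This is a sketch only: it performs no
validity checks, assumes the input label contains at least one non-ASCII
character, and the helper names are made up for this example.

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label):
    """Punycode-encode a U-label and add the ACE prefix (no validation)."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label):
    """Strip the ACE prefix and Punycode-decode back to the U-label."""
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


if __name__ == "__main__":
    lower = "b\u00fccher"
    upper = "B\u00fccher"
    print(to_a_label(lower))
    print(to_u_label(to_a_label(lower)) == lower)
    # Without a mapping step the two spellings yield different strings.
    print(to_a_label(lower) == to_a_label(upper))
```

Because no mapping is applied, each U-label has exactly one A-label and
vice versa, at the cost that the upper-case spelling is a different
string rather than a match.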
2. IDNA2008 does not ban script mixing even within labels.
Attempts to fashion rules along these lines have run into problems
in which characters that may be confused for others are needed
to express strings in particular languages. The International Phonetic
Alphabet (IPA) characters are a case in point. Some are used for
certain (e.g. African) languages but some of these characters
can be confused for others in the Latin alphabet. Other examples
exist in Arabic, Cyrillic, and Greek, among others.
Even in the absence of intra-label script mixing, inter-script confusion
such as the Russian word for "restaurant" looking like "pectopah" in
Latin characters is quite possible.
Despite the apparent desirability of such a ban at protocol level, there
are simply too many combinations of confusion within scripts and between
scripts to benefit significantly from a protocol-level ban. On the
registry level constraints that may be more script-aware appear to be
the most effective tool we have.
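To see why a script-mixing ban would not catch the "pectopah" case, a
rough sketch follows. Python's standard library has no Script property,
so the first word of each character name is used as a crude stand-in;
that heuristic and the function names are assumptions for illustration
only, and it mislabels digits and hyphens.

```python
import unicodedata


def script_hints(label):
    """Crude per-character script hint: first word of the Unicode name."""
    return [unicodedata.name(ch).split()[0] for ch in label]


def looks_mixed(label):
    """True if the crude hints disagree within one label."""
    return len(set(script_hints(label))) > 1


if __name__ == "__main__":
    # The Russian word for "restaurant", written entirely in Cyrillic.
    cyrillic = "\u0440\u0435\u0441\u0442\u043e\u0440\u0430\u043d"
    latin = "pectopah"
    print(set(script_hints(cyrillic)), looks_mixed(cyrillic))
    print(set(script_hints(latin)), looks_mixed(latin))
    print(cyrillic == latin)
```

Each label is single-script, so a mixing test passes both even though
they are different strings that may look alike to a reader.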
3. Eszett is permitted and its usage appears to be geographically and
specific. Under IDNA2003, this character is mapped into "ss". To deal with
potential conflict with previously mapped registrations in which Eszett is
mapped to "ss", registries would need to appeal to Rationale 7.2 options, for
to deal with this. Note that not all collisions may be a consequence of
mapping; i.e., many occurrences of "ss" in German text are not
typographic variations of Eszett, and very few occurrences in Latin
script, without consideration of language, are variations of Eszett
either.
4. Final Sigma is permitted and raises similar issues to Eszett with
collisions and the same remedies would apply.
5. ZWJ and ZWNJ are permitted. In IDNA2003, these characters were mapped
to "nothing". It has become clear, however, that some Indic scripts need
them. Persian registries currently reject registration of labels
including ZWJ/ZWNJ although ZWNJ is used in writing Persian languages.
The Arabic language does not need ZWJ/ZWNJ. Mapping to "nothing" in
IDNA2003 has the side effect of inhibiting domain name expression in some
Indic scripts such as Tamil and Devanagari. Permitting either or both as
valid characters creates a compatibility problem similar to the Eszett
one; i.e., one cannot tell whether a DNS label, when converted back to
native character form, was intended to be written with ZWJ, ZWNJ or
neither.
Elaboration: Suppose that "ab" is a string in one of the scripts in
which we now propose to permit ZWNJ. All we have in the DNS is the
A-label equivalent of "ab". We can't tell from looking at it whether
the starting string, as seen/preferred by the
since both map to the same A-label.
Under IDNA2008, if the user enters "ab", she gets one A-label
while, if she enters "aZWJb", she gets a different A-label.
That is exactly the same as the Eszett problem -- you can't tell
from the IDNA2003 A-label what the original intention was and
use of the string under IDNA2008 gets you a different A-label
than it does under IDNA2003.
Joiner characters become invisible if inserted in strings written in
scripts that do not use them. Unicode classifies these characters
as "COMMON" so they also end up passing any plausible tests to prevent
mixing of scripts in a label. Contextual rules are needed to restrict
them to strings in scripts where they have some effect. We end up
relying on registries to adopt their use judiciously within those
scripts. See the Rationale document for further commentary.
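The compatibility problem in the elaboration above can be reproduced
with Python's built-in "idna" codec, which implements IDNA2003 (RFC
3490), and the raw "punycode" codec as a stand-in for encoding without
any mapping. The helper names are hypothetical, and the unmapped helper
applies none of the IDNA2008 validity or contextual rules.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def idna2003_to_ascii(label):
    """ToASCII as performed by Python's built-in IDNA2003 codec."""
    return label.encode("idna").decode("ascii")


def unmapped_a_label(label):
    """Punycode with no mapping step; expects a non-ASCII label."""
    return "xn--" + label.encode("punycode").decode("ascii")


if __name__ == "__main__":
    with_joiner = "a" + ZWNJ + "b"
    # IDNA2003 maps the joiner to nothing, so both spellings collapse.
    print(idna2003_to_ascii("ab"), idna2003_to_ascii(with_joiner))
    # Without mapping, the joiner survives into a distinct A-label.
    print(unmapped_a_label(with_joiner))
    # The Eszett case behaves the same way under IDNA2003.
    print(idna2003_to_ascii("stra\u00dfe"))
```

Here "ab" is only the placeholder used in the text; a real label would
be in a script where the joiner changes rendering.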
6. Symbols and punctuation are NOT PVALID under IDNA2008 but are valid
under IDNA2003 leading to a variety of potential confusions with
symbols or other symbols used in URIs for example. IDNA2008 rules reduce
confusion potential by making all characters with these Unicode
invalid for use with Domain labels.
It is not clear that such symbols are critically needed for domain
Another reason for banning these characters is that they complicate
references, discussions and databases (such as WHOIS) because it is
not clear how to describe them in common, informal usage.
What is the correct way to refer to "-"? Is it "hyphen", "minus
minus" or "short middle horizontal bar"? And is "." "period", "dot",
or something else? What about "#" - is it "pound", "hash", "number
"tic-tac-toe"? "Heart" is another example: which one is it?
To be fair, one could refer to the Unicode long name for the character
or the "U+" form, although this sounds pretty awkward in practical terms.
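For reference, the Unicode long name, the "U+" form and the general
category of a character can be obtained as sketched below. This assumes
Python 3's unicodedata module; the helper name describe is made up for
this example.

```python
import unicodedata


def describe(ch):
    """Return 'U+XXXX NAME (general category)' for one character."""
    return "U+%04X %s (%s)" % (
        ord(ch), unicodedata.name(ch), unicodedata.category(ch))


if __name__ == "__main__":
    # Two different "hearts" are included to show the naming ambiguity.
    for ch in ("-", ".", "#", "\u2665", "\u2764"):
        print(describe(ch))
```

The categories reported for these characters are punctuation (P*) or
symbol (S*) categories, which is the kind of property the text says
IDNA2008 uses to exclude them.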
7. JAMO characters in Korean have been made Protocol Invalid
for reasons similar to (6) above. They introduce a combinatorial
explosion of different string representations built from JAMO
primitive characters. They are
8. Under IDNA2008, when a new version of Unicode is released, the
following steps can be taken:
a. Review of changes that might require new rules in the IDNA2008
Such a conclusion would assuredly require formation of a WG to
facilitate new RFC production. This is thought to be extremely
unlikely to happen.
b. A review of changes might only require exception rules to preserve
compatibility. It is possible that the required changes might be
to an IANA action possibly in consultation with an expert committee
to generate new tables.
c. Generate new tables for IANA registry (suitable for downloading as
During the transition some clients will have the older tables and
some registries the newer ones. Lookups of Domain Names containing
new PVALID characters by older clients will fail under IDNA2008 because
the client will reject UNASSIGNED characters until the clients are
updated with the new PVALID characters.
"IDNAV2" - The Hoffman proposal
In this proposal, IDNA2003 would be "extended" by adding new characters.
Under IDNA2003 and its variants, mapping based on CaseFold and
mapping of compatibility characters is carried out prior to
All the properties of IDNA2003 apply including the Nameprep profile of
1. To pursue this proposal formally, a charter change would have to be
generated, shown to have community consensus and then approved by
the AD and IESG because it diverges from assumptions in the IDNA2008
2. New Unicode versions require RFC-level apparatus to adopt new
specifications because the tables in IDNAV2 make references to specific
3. As a practical matter, the proposal places IDN for DNS on a
RFC update path for each revision of Unicode that affects domain name
4. A sequence of changes/additions to PVALID characters would
require examination of Nameprep and Stringprep (which are
currently defined in terms of Unicode 3.2). Since many other
protocols (including security) rely on Stringprep and possibly on
Nameprep, changes could have significant ripple effects.
5. Multiple characters are allowed as "dots" in domain names
under IDNA2003 and presumably under IDNAV2. This can lead
to a wide range of problems associated with identifying domain
names in running text or other situations in which it is not obvious
that the string is intended to be a domain name. This is a general
problem for all versions of IDNA but is evidently exacerbated by
the variants for "dots" that are permitted under IDNA2003 and IDNAv2.
6. JAMO are PVALID under IDNA2003 leading to combinatoric
registration or bundling effects.
7. There are few if any restrictions on the lookup phase of IDNAv2
(and IDNA2003). The consequences are that lookup will match
domain names injected into DNS by registries that are non-conformant
with registration restrictions intended by the protocol specification.
This condition arises from permitting the looking up of DISALLOWED
or UNASSIGNED characters.
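As an illustration of item 5 above, the following sketch splits a name
on the four label separators that IDNA2003 (RFC 3490) recognizes as
"dots". Whether IDNAv2 would keep exactly this set is an assumption
taken from the "presumably" in the text, and the function name is made
up for this example.

```python
import re

# U+002E full stop, U+3002 ideographic full stop,
# U+FF0E fullwidth full stop, U+FF61 halfwidth ideographic full stop.
IDNA2003_DOTS = re.compile("[\u002e\u3002\uff0e\uff61]")


def split_labels(name):
    """Split a domain name into labels on any IDNA2003 dot character."""
    return IDNA2003_DOTS.split(name)


if __name__ == "__main__":
    print(split_labels("www.example.com"))
    print(split_labels("www\u3002example\uff0ecom"))
```

Software scanning running text would have to treat all four characters
as possible label separators, which is the identification problem the
item describes.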
1818 Library Street, Suite 400
Reston, VA 20190
vint at google.com