DRAFT Status of Work on IDNA2008 + IDNAv2
Vint Cerf
vint at google.com
Sun Mar 22 21:31:29 CET 2009
thanks Mark - I re-issued version 5 of the material so some of your
comments have crossed in the mail. I will try to update once more to
take into account your comments or at least annotate as reminders for
discussion.
v
Vint Cerf
Google
1818 Library Street, Suite 400
Reston, VA 20190
202-370-5637
vint at google.com
On Mar 22, 2009, at 3:51 PM, Mark Davis wrote:
> Here are comments on the status. (I tried to update to the later
> doc, but because it was only distributed in pdf, I had to do it
> manually, so I may have missed something.)
>
>
> Mark
>
>
> On Fri, Mar 20, 2009 at 06:12, Vint Cerf <vint at google.com> wrote:
> DRAFT Status of work on IDNA2008
>
> 3/21/2009 0523 PDT
>
>
> Vint Cerf
>
>
> This brief summary is intended to provide some focus for the IDNABIS
> WG meetings
> scheduled for Monday and Tuesday, March 23 (1740-1940) and March 24
> (0900-1130).
>
> One goal is to try to assess rough consensus about the present
> documentation on the
> presumption that we are abiding by the ground-rules set forth in the
> charter of the WG.
> Another is to assess what the implications are for users,
> registries, registrars if
> IDNA2008 is adopted as it presently stands. A third goal is to
> examine the implications
> of the IDNAV2 proposal from Paul Hoffman and contrast with adoption
> of IDNA2008.
>
> I fully recognize that consensus has to be assessed from mailing
> list exchanges, not
> merely from appearances at our face to face meetings.
>
> The material presented below is by no means intended to be more than
> a basis for
> discussion, and is not intended as a penultimate recommendation.
>
> I think it would also be useful to mark where each is different from
> IDNA2003.
>
>
> Background
>
>
>
> Under the IDNABIS charter, the IDNA2008 design as it now stands
> makes several
> specific assumptions or makes specific propositions to achieve a
> number of goals:
>
> 0. Avoid dependence on any specific version of Unicode through the
> use of rules
> for determining PVALID characters based on Unicode character
> properties
>
> add: "as much as possible". Exceptions may be necessary in some
> cases (and are included in the draft tables).
>
>
> 1. No change to the deployed DNS server functionality (domain name
> labels limited to
> ASCII and case-insensitive matching only)
> [no change from IDNA2003]
>
> 2. Esszet, Final Sigma, ZWJ and ZWNJ, geresh and gershayim are
> PVALID characters
> some of which are treated through contextual rules (there is
> still ongoing discussion
> about the implications of these choices)
>
> This is also a current feature of the drafts, but not required by
> the charter. It is unclear whether this is actually consistent with
> the charter or not. "This work is intended to specify an improved
> means to produce and use stable and unambiguous IDN identifiers."
> Effectively, any IDN with the first four characters is ambiguous
> between versions of IDNA in that it will lead to different addresses.
>
>
> 3. Unassigned Unicode characters will not be looked up
>
> Just a comment (no change): IDNA2003 had the slightly different
> goal: unassigned Unicode characters will not be returned from the DNS.
>
>
> 4. No mapping of characters at least within the protocol specification
>
> 5. No modification of or dependence on Nameprep (and thus no impact
> on other protocols relying on Nameprep or Stringprep.)
>
> 6. Clear specification of valid "dot" form in a way that is
> consistent with DNS
> protocol requirements.
>
> IDNA2003 specified the dot form in a way that is consistent with
> DNS; that is, it required no change of the DNS protocol, so this is
> no change. That is, once in the ACE form, dots are dots.
>
>
> 7. Symmetry between native-character ("Unicode") and ACE ("Punycode")
> forms of a label.
>
> This may be a goal, but it is not achieved by the current drafts.
> There is a strong asymmetry between them in that in lookup, an
> implementation need not check that what appears to be an A-Label is
> one, but it must check that a U-Label is one (mostly). (Comment: I
> believe that this should be a goal: if it is important to check
> those requirements, then it is important to test both A and U
> Labels; if it is not important to test them, then it should not be a
> requirement for either one.)
>
>
> 8. Conversion to an inclusion list of PVALID characters (as distinct
> from the
> IDNA2003 posture that excluded only a few Unicode characters)
>
> 9. Improved terminology to make categories and types of labels more
> clear.
> (Definitions)
>
> 10. Provide explanation for decisions and their motivations
> (Rationale) to
> aid implementors, registries, registrants and users in
> understanding IDNA.
>
> Rationale doesn't really provide explanation for motivations in
> enough detail to be useful. I'd recast this as: "Provide informative
> background material (Rationale) to aid ..."
>
>
> 11. Separately describe registration and lookup procedures to
> improve clarity
>
> The goal is good, but the current drafts don't meet the goal.
> Whether it increases clarity or not is unclear, since doing so
> makes it difficult to determine what the similarities and
> differences are between the two processes. So drop "to improve
> clarity". (A relatively small recasting of the text to make it
> precisely parallel between them (including numbering), and point out
> precisely where the differences are, would meet this goal.)
>
>
> 12. Specify tests to be applied at lookup time in an attempt to
> limit abuse of
> IDNA at all levels of registration
>
> That is not a change from IDNA2003. The tests are different, and are
> expanded, but it is a quantitative difference, not qualitative. For
> example, IDNA2003 did test bidi; we just think the IDNA2008 tests
> are better. And the "in an attempt to limit abuse" is not true; the
> changes in IDNA2008 will have a trifling effect on abuse at the very
> best, and introduce significant opportunities for spoofing because
> of the 4 ambiguous characters. And affecting the "phishing" problem
> is not a requirement of the charter. So this item should be removed.
>
>
> 13. Clarify what is expected of IDNA-aware applications and domain
> name
> "slots" with regard to invalid labels and future extensibility
>
> These are still not nailed down in the current drafts. My
> expectation is that once a domain name is valid, it remains valid
> for all time -- that is, we are doing a one-time massive
> compatibility change, but there will be no more changes that would
> affect compatibility. However, that is not captured in the text,
> despite the charter requirement "This work is intended to specify an
> improved means to produce and use stable and unambiguous IDN
> identifiers."
>
> Another major change is the introduction of a mechanism for changing
> IDNAs on the fly via the context mechanism, with an associated
> process.
>
>
>
>
> Chartering and Re-Chartering
>
> (1) A Re-charter is needed if we abandon a significant fraction of
> the IDNA2008 goals
> and methods. IDNAv2, as described by Paul Hoffman, requires a
> re-charter.
>
> (2) A Re-charter is needed if the WG decides to introduce mappings
> into the IDNA2008
> specifications since the basic assumption in IDNA2008 was that
> mapping would not
> be part of the specification.
>
> (3) It is possible that re-charter might not be needed if IDNA2008
> adopts some
> IDNA2003 operations under a restricted set of conditions and only at
> lookup
> time for purposes of easing the transition to IDNA2008. This would
> be up to the
> AD and IESG presumably to decide.
>
> Basics for IDNA2003 and IDNA2008
>
> Both of these specifications use the Punycode algorithm to generate
> what
> IDNA2008 would call an A-label (i.e., "xn-- <LDH compliant string>")
> from
>
> Better expressed as an XN label. That terminology can be applied to
> both, while A-Label only makes sense for IDNA2008.
>
> labels expressed as a string of characters drawn from a subset of
> Unicode
> defined characters.
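
As a rough illustration of the "xn--" form discussed above, here is a minimal Python sketch using the standard library's "punycode" codec. The helper names to_xn_label and from_xn_label are made up for this example, and the code performs no mapping or validity checks at all; it only shows the Punycode step that both specifications share.

```python
def to_xn_label(label: str) -> str:
    """Punycode-encode one label and add the "xn--" prefix.

    Hypothetical helper: no mapping, no PVALID check, no length check.
    """
    if label.isascii():
        return label  # plain LDH-style labels are left alone
    return "xn--" + label.encode("punycode").decode("ascii")


def from_xn_label(label: str) -> str:
    """Reverse of to_xn_label for labels carrying the "xn--" prefix."""
    if not label.lower().startswith("xn--"):
        return label
    return label[4:].encode("ascii").decode("punycode")


print(to_xn_label("bücher"))           # xn--bcher-kva
print(from_xn_label("xn--bcher-kva"))  # bücher
```

The round trip only demonstrates the encoding; whether a given label is acceptable is a separate question under either specification.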
>
> DNS matching is done in the servers by comparing the query string to
> the
> registered string in a case-independent fashion. For IDNs, these
> comparisons
> are done after conversion into the "xn--" prefix form. For IDNs the
> case insensitive
> matching of the DNS servers applies only to the A-label form and not
> to the
> Unicode form. This means that the case-insensitive matching behavior
> of traditional ASCII labels is not conferred on IDNs in their
> Unicode form.
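
The point about case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: dns_equal is a stand-in for the server-side ASCII case-insensitive comparison, xn is a made-up helper that applies raw Punycode with no mapping, and neither is taken from any of the drafts.

```python
def xn(label: str) -> str:
    """Raw Punycode with an "xn--" prefix; no mapping is applied (assumption)."""
    return "xn--" + label.encode("punycode").decode("ascii")


def dns_equal(a: str, b: str) -> bool:
    """Stand-in for DNS matching: ASCII labels compared without regard to case."""
    return a.lower() == b.lower()


# Traditional all-ASCII labels match regardless of case.
print(dns_equal("Buecher", "bUecher"))

# The encoded forms of u-umlaut and U-umlaut are different ASCII strings,
# so the server's case-insensitive comparison does not make them match.
print(dns_equal(xn("b\u00fccher"), xn("b\u00dccher")))
```

Without some mapping step before encoding, the two spellings are simply different names as far as the DNS is concerned.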
>
> The case-insensitive comparison between traditional LDH domain
> names is
> approximated under IDNA2003 by using CaseFold as a mapping guide on
> the
> Unicode strings being looked up. In addition, IDNA2003 also maps the
> so-called
> "compatibility" characters of Unicode into their counterparts. The
> same actions
>
> => "compatibility decomposable" characters of Unicode into their
> counterparts
> [Not all compatibility characters are decomposable and vice versa.]
>
>
> precede the registration of new domain names under IDNA2003.
>
> Unicode CaseFold maps to upper case and then maps back to lower case.
>
> This is not quite accurate; better would be to say "Unicode CaseFold
> maps characters to lowercase values based on an equivalence class
> formed by including lowercase, uppercase, and titlecase mappings."
>
>
> Prior to Unicode 5.0, Esszett became "SS" because there was no upper
> case, then became "ss" in the lower case mapping. Under Unicode 5.0
> CaseFold was unchanged for stability reasons. Consequently
> CaseFold (ESSZETT) is "ss" rather than lower case esszett even after
> the introduction of upper case ESSZETT in Unicode 5.0.
>
> =>
> The uppercase of eszett in Unicode is "SS", following national
> standards and practices. As of Unicode 5.1, an uppercase version of
> eszett became available. Under the Unicode case folding, both map to
> "ss".
>
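
The case behavior described here can be checked with Python's built-in string methods, which follow the Unicode case mappings of whatever Unicode version the interpreter ships with (an assumption of this sketch, not something the drafts rely on). The variable names are illustrative only.

```python
sharp_s = "\u00df"          # LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S

# Full uppercase mapping of the small letter is the two-letter string "SS".
print(sharp_s.upper())

# Case folding sends both the small and the capital letter to "ss".
print(sharp_s.casefold())
print(capital_sharp_s.casefold())

# Simple lowercasing of the capital letter does yield the small sharp s.
print(capital_sharp_s.lower() == sharp_s)
```

So a mapping based on case folding cannot preserve the character, whichever form the user typed.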
>
>
> Under IDNA2003, both DISALLOWED and UNASSIGNED characters
> are looked up. If abusive registrations are made using DISALLOWED
> or UNASSIGNED characters, these registered domain names may be
> found on lookup by IDNA2003-compliant clients.
>
> This is not correct, as Erik points out.
>
>
>
> Under IDNA2008, UNASSIGNED and DISALLOWED characters are not looked
> up.
> If new characters become defined under a new version of Unicode
> an old client will not look them up until it is updated. Abusive
> registrations
> using UNASSIGNED characters will not be looked up.
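
A client-side test for unassigned code points might look roughly like the sketch below. It assumes the client's notion of "assigned" is whatever Unicode version its own library carries (here, Python's unicodedata module), and has_unassigned is a hypothetical name. Note that general category Cn also covers noncharacters such as U+FFFF, which is what the example uses because its status does not change between Unicode versions.

```python
import unicodedata


def has_unassigned(label: str) -> bool:
    """True if any code point is unassigned (category Cn) in the local tables."""
    return any(unicodedata.category(ch) == "Cn" for ch in label)


def lookup_allowed(label: str) -> bool:
    """Hypothetical pre-lookup gate: refuse labels with unassigned code points."""
    return not has_unassigned(label)


print(lookup_allowed("b\u00fccher"))  # every code point is assigned
print(lookup_allowed("a\uffff"))      # U+FFFF reports category Cn
```

A client with older tables will give a different answer from a newer one for characters added in between, which is the transition effect described above.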
>
> Script mixing is not banned under IDNA2003. Under IDNA2008, BiDi
> bans mixing of European and Extended Arabic-Indic numbers with
> Arabic numbers. That is AN and EN characters may not be present in
> the same label. Otherwise, mixing is permitted in IDNA2008.
>
> IMPLICATIONS OF ADOPTING IDNA2008 AS CURRENTLY SPECIFIED
>
>
> 1. IDNA2008 is case sensitive for labels with non-LDH characters in
> them but is
> ... with at least one non-LDH character...
>
> case-insensitive for LDH characters
>
> for example, "buecher" is all ASCII and could be matched with
> "Buecher" or "bUecher"
> under IDNA2008
>
> however "B<u-umlaut>cher" would not be allowed because Tables (see
> 4.2.2) would
> disallow Latin Capital letters. Some users accustomed to LDH-label
> behavior
> may be surprised that "B<u-umlaut>cher" and "b<u-umlaut>cher" do not
> match.
>
> On the other hand, the symmetric relationship between the IDNA2008-
> defined
> A-Label and U-Label has the benefit one can use exact match for either
> U-label form or A-label forms since they are directly and
> unambiguously
> transformable into each other.
>
> However, this symmetry will not exist for
> cases where the IDNA2003 A-Label and IDNA2008 A-label for the same
> U-Label differ. [Query: will this be a material problem only for
> actual
> registrations under IDNA2003 that differ in A-label form from
> IDNA2008?]
>
> For registries, this is an advantage (equivalent to disallowing
> mapping), but it is not so clear that it is a "benefit" for lookup.
>
>
>
>
> 2. IDNA2008 does not ban script mixing even within labels.
>
> Attempts to fashion rules along these lines have run into problems
> in which characters that may be confused for others are needed
> to express strings in particular languages. The International Phonetic
> Alphabet (IPA) characters are a case in point. Some are used for
> certain (e.g. African) languages but some of these characters
> can be confused for others in the Latin alphabet. Other examples
> exist in Arabic, Cyrillic, Greek among others.
>
> Even in the absence of intra-label script mixing, inter-script
> confusion
> such as the Russian word for "restaurant" looking like "pectopah" in
> Latin characters is quite possible.
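
The "pectopah" case can be made concrete with a small Python sketch. Python's standard library does not expose the Unicode Script property, so this example falls back on the first word of each character name as a rough stand-in for the script; that shortcut and the helper name rough_scripts are assumptions of the sketch, not a recommended test.

```python
import unicodedata


def rough_scripts(label: str) -> set:
    """Approximate the scripts in a label from the first word of each character name."""
    return {unicodedata.name(ch).split()[0] for ch in label}


cyrillic = "\u0440\u0435\u0441\u0442\u043e\u0440\u0430\u043d"  # Russian "restaurant"
latin = "pectopah"

print(rough_scripts(cyrillic))  # a single script
print(rough_scripts(latin))     # a single, different script
print(cyrillic == latin)        # different strings despite similar appearance
```

Each label is single-script, so a rule against mixing scripts within a label would accept both.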
>
> Despite the apparent desirability of such a ban at protocol level,
> there
> are simply too many combinations of confusion within-scripts and
> between
> scripts to benefit significantly from a protocol-level ban. On the
> other hand,
> registry level constraints that may be more script-aware appear to be
> the most effective tool we have.
>
> I think client-level warnings are the most effective constraint.
> After all, if we could always trust the registries, we would need
> *no* constraints on the client side in the protocol.
>
>
>
> 3. Esszet is permitted and its usage appears to be geographically
> and language
> specific. Under IDNA2003, this character is mapped into "ss". To
> deal with the
> potential conflict with previously mapped registrations in which
> Esszet is mapped
> to "ss" registries would need to appeal to Rationale 7.2 options,
> for example,
> to deal with this. Note that not all collisions may be a consequence
> of mapping, i.e.,
> many occurrences of "ss" in German text are not typographic
> variations of
> Esszett and very few occurrences in Latin script, without
> consideration of language,
> are variations of Esszett either.
>
> 4. Final Sigma is permitted and raises similar issues to Esszet with
> regard to
> collisions and the same remedies would apply.
>
> 5. ZWJ/ZWNJ
>
> In IDNA2003, these characters were mapped to "nothing". It has
> become apparent
> however that some Indic scripts need them. Persian registries
> currently
> reject registration of labels including ZWJ/ZWNJ although ZWNJ is
> used in
> writing Persian languages. Arabic language does not need ZWJ/ZWNJ.
> Mapping to "nothing" in IDNA2003 has the side-effect
> of inhibiting domain name expression in some Indic scripts
> including
> Tamil and Devanagari. Permitting either or both as valid characters
> creates
> a compatibility problem similar to the Esszett one; i.e., one cannot
> tell
> whether a DNS label, when converted back to native character form, was
> intended to be written with ZWJ, ZWNJ or neither.
>
> Elaboration: Suppose that "ab" is a string in one of the scripts in
> which we now
> propose to permit ZWNJ. All we have in the DNS is the A-label
> equivalent of "ab".
> We can't tell from looking at it whether the starting string, as
> seen/preferred by the
> registrant, was
> ab or
> aZWJb
> since both map to the same A-label.
>
> Under IDNA2008, if the user enters "ab", she gets one A-label
> while, if she enters "aZWJb", she gets a different A-label.
> That is exactly the same as the Eszett problem -- you can't tell
> from the IDNA2003 A-label what the original intention was and
> use of the string under IDNA2008 gets you a different A-label
> than it does under IDNA2003.
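
The "ab" versus "aZWJb" elaboration can be reproduced with Python's standard codecs. The built-in "idna" codec implements the IDNA2003 processing (Nameprep, then Punycode), so it drops the joiner. The no_mapping helper is a made-up stand-in for a no-mapping approach: it applies raw Punycode only and deliberately skips the contextual rules that IDNA2008 would apply to joiners, so it illustrates the divergence and nothing more.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

plain = "ab"
joined = "a" + ZWJ + "b"


def no_mapping(label: str) -> str:
    """Raw Punycode with "xn--" prefix; no mapping and no contextual checks (assumption)."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


# IDNA2003: the joiner is mapped to nothing, so both inputs give the same label.
print(plain.encode("idna"), joined.encode("idna"))

# Without mapping, the joiner survives and the encoded labels differ.
print(no_mapping(plain), no_mapping(joined))
```

From the IDNA2003 result alone there is no way to recover which of the two inputs the registrant had in mind.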
>
> Joiner characters become invisible if inserted in strings written in
> scripts
> that do not use them.
> => in strings where they make no visual difference. This includes
> scripts
> that do not use them, and many positions in scripts that do use them.
> Unicode classifies these characters
> as "COMMON" so they also end up passing any plausible tests to prevent
> mixing of scripts in a label. Contextual rules are needed to
> restrict their use
> to strings in scripts where they have some effect.
>
> where they could have some effect (they won't always, and even when
> they commonly have an effect, it depends on the font).
>
> We end up relying on
> registries to adopt their use judiciously within those scripts. See
> also the
> Rationale document for further commentary.
>
> 6. Symbols and punctuation are NOT PVALID under IDNA2008 but are valid
> Most symbols...
>
> under IDNA2003 leading to a variety of potential confusions with
> "slash-like"
> symbols or other symbols used in URIs for example. IDNA2008 rules
> reduce
> confusion potential by making all characters with these Unicode
> properties
> invalid for use with Domain labels.
> either most, or add at the end "with certain exceptions"
>
>
> It is not clear that such symbols are critically needed for domain
> names.
>
> Another reason for banning these characters is that they complicate
> references, discussions and databases (such as WHOIS) because it is
> not clear how to describe them in common, informal usage.
> What is the correct way to refer to "-" ? Is it "hyphen", "minus
> sign", "hyphen-
> minus" or "short middle horizontal bar?" And is "." "period", "dot",
> "full stop",
> or something else? What about "#" - is it "pound", "hash", "number
> sign" or
> "tic-tac-toe"? "Heart" is another example: which one is it?
>
> This is just not an issue; there are thousands of letters that have
> ambiguous or multiple names. This paragraph just can't be fixed; it
> needs removal.
>
>
>
> To be fair, one could refer to the Unicode long name for the
> character or even
> the "U+" form although this sounds pretty awkward in practical terms.
>
> 7. JAMO characters in Korean have been made Protocol Invalid
> (DISALLOWED)
> for reasons similar to (6) above. They introduce a combinatorial
> explosion of different
> string representations built from JAMO primitive characters. They
> are valid
> under IDNA2003.
>
> This is debatable. The only reasonable rationale we can include is
> that they are only used in historic Korean.
>
>
> 8. Under IDNA2008, when a new version of Unicode is released the
> following
> steps can be taken:
>
> a. review of changes that might require new rules in the IDNA2008
> framework.
> Such a conclusion would assuredly require formation of a WG to
> facilitate new RFC
> production. This is thought to be extremely unlikely to happen.
>
> b. A review of changes might only require exception rules to preserve
> compatibility. It is possible that the required changes might be
> delegated
> to an IANA action possibly in consultation with an expert committee
> to generate new tables.
>
> The current drafts require new RFCs in order to change the exception
> tables, I believe. It would be better to change that to have the
> exception table governed by the same process as the context tables
> (under stability provisions).
>
>
>
> c. Generate new tables for IANA registry (suitable for downloading
> as needed)
>
> During the transition some clients will have the older tables and
> some registries the newer ones. Lookups of Domain Names containing
> new PVALID characters by older clients will fail under IDNA2008 because
> the client will reject UNASSIGNED characters until the clients are
> updated
> with the new PVALID characters.
>
> That is not the bad part of the transition. The bad part is that old
> characters may transform from DISALLOWED to PVALID only during the
> transition, then corrected, or transform from PVALID to DISALLOWED
> only during the transition, then corrected. And the correction
> period may be long, depending on when software is updated. That is,
> if a program ships every two years, and is updated during the
> correction, it will be wrong for 2 years.
>
>
> ...