DRAFT Status of Work on IDNA2008 + IDNAv2

Vint Cerf vint at google.com
Sun Mar 22 21:31:29 CET 2009


Thanks Mark - I re-issued version 5 of the material, so some of your
comments have crossed in the mail. I will try to update once more to
take your comments into account, or at least annotate them as
reminders for discussion.

v


Vint Cerf
Google
1818 Library Street, Suite 400
Reston, VA 20190
202-370-5637
vint at google.com




On Mar 22, 2009, at 3:51 PM, Mark Davis wrote:

> Here are comments on the status. (I tried to update to the later  
> doc, but because it was only distributed in pdf, I had to do it  
> manually, so I may have missed something.)
>
>
> Mark
>
>
> On Fri, Mar 20, 2009 at 06:12, Vint Cerf <vint at google.com> wrote:
> DRAFT Status of work on IDNA2008
>
> 3/21/2009 0523 PDT
>
>
> Vint Cerf
>
>
> This brief summary is intended to provide some focus for the IDNABIS  
> WG meetings
> scheduled for Monday and Tuesday, March 23 (1740-1940) and March 24  
> (0900-1130).
>
> One goal is to try to assess rough consensus about the present  
> documentation on the
> presumption that we are abiding by the ground-rules set forth in the  
> charter of the WG.
> Another is to assess what the implications are for users,  
> registries, registrars if
> IDNA2008 is adopted as it presently stands.  A third goal is to  
> examine the implications
> of the IDNAV2 proposal from Paul Hoffman and contrast with adoption  
> of IDNA2008.
>
> I fully recognize that consensus has to be assessed from mailing  
> list exchanges, not
> merely from appearances at our face to face meetings.
>
> The material presented below is by no means intended to be more than  
> a basis for
> discussion, and is not intended as a penultimate recommendation.
>
> I think it would also be useful to mark where each is different from  
> IDNA2003.
>
>
> Background
>
>
>
> Under the IDNABIS charter, the IDNA2008 design as it now stands  
> makes several
> specific assumptions or makes specific propositions to achieve a  
> number of goals:
>
> 0. Avoid dependence on any specific version of Unicode through the  
> use of rules
>    for determining PVALID characters based on Unicode character  
> properties
>
> add: "as much as possible". Exceptions may be necessary in some  
> cases (and are included in the draft tables).
>
>
> 1. No change to the deployed DNS server functionality (domain name  
> labels limited to
>    ASCII and case-insensitive matching only)
> [no change from IDNA2003]
>
> 2. Eszett, Final Sigma, ZWJ and ZWNJ, geresh and gershayim are PVALID
>    characters, some of which are treated through contextual rules
>    (there is still ongoing discussion about the implications of
>    these choices)
>
> This is also a current feature of the drafts, but not required by  
> the charter. It is unclear whether this is actually consistent with  
> the charter or not. "This work is intended to specify an improved  
> means to produce and use stable and unambiguous IDN identifiers."  
> Effectively, any IDN with the first four characters is ambiguous  
> between versions of IDNA in that it will lead to different addresses.
>
>
> 3. Unassigned Unicode characters will not be looked up
>
> Just a comment (no change): IDNA2003 had the slightly different  
> goal: unassigned Unicode characters will not be returned from the DNS.
>
>
> 4. No mapping of characters at least within the protocol specification
>
> 5. No modification of or dependence on Nameprep  (and thus no impact
>   on other protocols relying on Nameprep or Stringprep.)
>
> 6. Clear specification of valid "dot" form in a way that is  
> consistent with DNS
>     protocol requirements.
>
> IDNA2003 specified the dot form in a way that is consistent with  
> DNS; that is, it required no change of the DNS protocol, so this is  
> no change. That is, once in the ACE form, dots are dots.
>
>
> 7. Symmetry between native-character ("Unicode") and ACE ("Punycode")
>     forms of a label.
>
> This may be a goal, but it is not achieved by the current drafts.
> There is a strong asymmetry between them: in lookup, an
> implementation need not check that what appears to be an A-Label is
> one, but it must check that a U-Label is one (mostly). (Comment: I
> believe that this should be a goal: if it is important to check
> those requirements, then it is important to test both A-Labels and
> U-Labels; if it is not important to test them, then it should not be
> a requirement for either one.)
>
>
> 8. Conversion to an inclusion list of PVALID characters (as distinct  
> from the
>    IDNA2003 posture that excluded only a few Unicode characters)
>
> 9. Improved terminology to make categories and types of labels more  
> clear.
>    (Definitions)
>
> 10. Provide explanation for decisions and their motivations  
> (Rationale) to
>     aid implementors, registries, registrants and users in  
> understanding IDNA.
>
> Rationale doesn't really provide explanation for motivations in  
> enough detail to be useful. I'd recast this as: "Provide informative  
> background material (Rationale) to aid ..."
>
>
> 11. Separately describe registration and lookup procedures to  
> improve clarity
>
> The goal is good, but the current drafts don't meet it. Whether the
> separation increases clarity is unclear, since it makes it difficult
> to determine what the similarities and differences are between the
> two processes. So drop "to improve clarity". (A relatively small
> recasting of the text to make it precisely parallel between them
> (including numbering), and to point out precisely where the
> differences are, would meet this goal.)
>
>
> 12. Specify tests to be applied at lookup time in an attempt to  
> limit abuse of
>       IDNA at all levels of registration
>
> That is not a change from IDNA2003. The tests are different, and are  
> expanded, but it is a quantitative difference, not qualitative. For  
> example, IDNA2003 did test bidi; we just think the IDNA2008 tests  
> are better. And the "in an attempt to limit abuse" is not true; the  
> changes in IDNA2008 will have a trifling effect on abuse at the very  
> best, and introduce significant opportunities for spoofing because  
> of the 4 ambiguous characters. And affecting the "phishing" problem  
> is not a requirement of the charter. So this item should be removed.
>
>
> 13. Clarify what is expected of IDNA-aware applications and domain  
> name
>       "slots" with regard to invalid labels and future extensibility
>
> These are still not nailed down in the current drafts. My
> expectation is that once a domain name is valid, it remains valid
> for all time -- that is, we are doing a one-time massive
> compatibility change, but there will be no more changes that would
> affect compatibility. However, that is not captured in the text,
> despite the charter requirement "This work is intended to specify an
> improved means to produce and use stable and unambiguous IDN
> identifiers."
>
> Another major change is the introduction of a mechanism for changing
> IDNAs on the fly via the context mechanism, with an associated
> process.
>
>
>
>
> Chartering and Re-Chartering
>
> (1) A Re-charter is needed if we abandon a significant fraction of
> the IDNA2008 goals and methods. IDNAv2, as described by Paul Hoffman,
> requires a re-charter.
>
> (2) A Re-charter is needed if the WG decides to introduce mappings  
> into the IDNA2008
> specifications since the basic assumption in IDNA2008 was that  
> mapping would not
> be part of the specification.
>
> (3) It is possible that re-charter might not be needed if IDNA2008  
> adopts some
> IDNA2003 operations under a restricted set of conditions and only at  
> lookup
> time for purposes of easing the transition to IDNA2008. This would  
> be up to the
> AD and IESG presumably to decide.
>
> Basics for IDNA2003 and IDNA2008
>
> Both of these specifications use the Punycode algorithm to generate
> what
> IDNA2008 would call an A-label (i.e., "xn-- <LDH compliant string>")
> from
>
> Better expressed as an XN label. That terminology can be applied to  
> both, while A-Label only makes sense for IDNA2008.
>
> labels expressed as a string of characters drawn from a subset of  
> Unicode
> defined characters.
>
> DNS matching is done in the servers by comparing the query string to
> the registered string in a case-independent fashion. For IDNs, these
> comparisons are done after conversion into the "xn--" prefix form,
> so the case-insensitive matching of the DNS servers applies only to
> the A-label form and not to the Unicode form. This means that the
> case-insensitive matching behavior of traditional ASCII labels is
> not conferred on IDNs in their Unicode form.
>
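A minimal sketch of the matching behavior described above, in Python. The helper names to_a_label and dns_match are hypothetical, and the encoder applies no mapping or validity checks; it is only meant to show that the server's ASCII case-insensitive comparison does not carry over to the Unicode form.

```python
def to_a_label(u_label: str) -> str:
    """Encode a label as "xn--" + Punycode, with no mapping or validation (illustrative only)."""
    if u_label.isascii():
        return u_label
    return "xn--" + u_label.encode("punycode").decode("ascii")


def dns_match(a: str, b: str) -> bool:
    """ASCII case-insensitive comparison, roughly what a DNS server applies to labels."""
    return a.lower() == b.lower()


print(to_a_label("bücher"))             # xn--bcher-kva
print(dns_match("Buecher", "buecher"))  # True: plain LDH labels match regardless of case

# The two Unicode forms differ only in the case of a non-ASCII letter,
# so their encoded forms differ and the server-side comparison fails.
print(dns_match(to_a_label("Übung"), to_a_label("übung")))  # False
```

Any case-insensitivity for the Unicode form therefore has to come from a mapping step applied before encoding, which is what the next paragraph describes for IDNA2003.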
> The case-insensitive comparison of traditional LDH domain names is
> approximated under IDNA2003 by using CaseFold as a mapping guide on
> the Unicode strings being looked up. In addition, IDNA2003 also maps
> the so-called "compatibility" characters of Unicode into their
> counterparts. The same actions
>
> => "compatibility decomposable" characters of Unicode into their  
> counterparts
> [Not all compatibility characters are decomposable and vice versa.]
>
>
> precede the registration of new domain names under IDNA2003.
>
> Unicode CaseFold maps to upper case and then maps back to lower case.
>
> This is not quite accurate; better would be to say "Unicode CaseFold
> maps characters to lowercase values based on an equivalence class
> formed by including lowercase, uppercase, and titlecase mappings."
>
>
> Prior to Unicode 5.0, Eszett became "SS" because there was no upper
> case, then became "ss" in the lower case mapping.  Under Unicode 5.0
> CaseFold was unchanged for stability reasons. Consequently
> CaseFold (ESZETT) is "ss" rather than lower case Eszett even after
> the introduction of upper case ESZETT in Unicode 5.0.
>
> =>
> The uppercase of Eszett in Unicode is "SS", following national
> standards and practices. As of Unicode 5.1, an uppercase version of
> Eszett became available. Under the Unicode case folding, both map to
> "ss".
>
>
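The case mappings just described can be checked with Python's built-in string methods, which follow the Unicode case mapping and case folding data. This is only an illustration of the Unicode behavior, not of any IDNA processing step; the constant names are made up for the example.

```python
SHARP_S = "\u00df"          # LATIN SMALL LETTER SHARP S
CAPITAL_SHARP_S = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S (added in Unicode 5.1)

print(SHARP_S.upper())             # SS
print(SHARP_S.casefold())          # ss
print(CAPITAL_SHARP_S.casefold())  # ss
```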
>
> Under IDNA2003, both DISALLOWED and UNASSIGNED characters
> are looked up. If abusive registrations are made using DISALLOWED
> or UNASSIGNED characters, these registered domain names may
> be found on lookup by IDNA2003-compliant clients.
>
> This is not correct, as Erik points out.
>
>
>
> Under IDNA2008, UNASSIGNED and DISALLOWED characters are not looked  
> up.
> If new characters become defined under a new version of Unicode
> an old client will not look them up until it is updated. Abusive  
> registrations
> using UNASSIGNED characters will not be looked up.
>
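A rough sketch of what "not looking up UNASSIGNED characters" could mean for a client, assuming the client consults the Unicode data it was built with. The function name has_unassigned is hypothetical. General category "Cn" also covers noncharacters, which the IDNA2008 tables classify separately, so this is an approximation of the real rule.

```python
import unicodedata


def has_unassigned(label: str) -> bool:
    """True if any code point has general category Cn in this build's Unicode data."""
    return any(unicodedata.category(ch) == "Cn" for ch in label)


# The answer depends on the Unicode version compiled into the client,
# which is exactly the old-client situation described above.
print(unicodedata.unidata_version)
print(has_unassigned("b\u00fccher"))  # False
print(has_unassigned("ab\u0378"))     # True while U+0378 remains unassigned
```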
> Script mixing is not banned under IDNA2003. Under IDNA2008, BiDi
> bans mixing of European and Extended Arabic-Indic numbers with
> Arabic numbers.  That is, AN and EN characters may not be present in
> the same label. Otherwise, mixing is permitted in IDNA2008.
>
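The AN/EN restriction can be sketched with the bidirectional classes exposed by Python's unicodedata module. The function name mixes_an_and_en is hypothetical, and the check covers only the single condition stated above, not the full set of BiDi conditions in the drafts.

```python
import unicodedata


def mixes_an_and_en(label: str) -> bool:
    """True if the label contains both AN (Arabic Number) and EN (European Number) characters."""
    classes = {unicodedata.bidirectional(ch) for ch in label}
    return "AN" in classes and "EN" in classes


print(unicodedata.bidirectional("1"))       # EN (European digit)
print(unicodedata.bidirectional("\u06f1"))  # EN (Extended Arabic-Indic digit one)
print(unicodedata.bidirectional("\u0661"))  # AN (Arabic-Indic digit one)
print(mixes_an_and_en("1\u0661"))           # True
```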
> IMPLICATIONS OF ADOPTING IDNA2008 AS CURRENTLY SPECIFIED
>
>
> 1. IDNA2008 is case sensitive for labels with non-LDH characters in  
> them but  is
> ... with at least one non-LDH character...
>
> case-insensitive for LDH characters
>
> For example, "buecher" is all ASCII and could be matched with
> "Buecher" or "bUecher" under IDNA2008.
>
> however "B<u-umlaut>cher" would not be allowed because Tables (see  
> 4.2.2) would
> disallow Latin Capital letters. Some users accustomed to LDH-label  
> behavior
> may be surprised that "B<u-umlaut>cher" and "b<u-umlaut>cher" do not  
> match.
>
> On the other hand, the symmetric relationship between the
> IDNA2008-defined A-Label and U-Label has the benefit that one can use
> exact match for either U-label or A-label forms, since they are
> directly and unambiguously transformable into each other.
>
> However, this symmetry will not exist for
> cases where the IDNA2003 A-Label and IDNA2008 A-label for the same
> U-Label differ. [Query: will this be a material problem only for  
> actual
> registrations under IDNA2003 that differ in A-label form from  
> IDNA2008?]
>
> For registries, this is an advantage (equivalent to disallowing  
> mapping), but it is not so clear that it is a "benefit" for lookup.
>
>
>
>
> 2. IDNA2008 does not ban script mixing even within labels.
>
> Attempts to fashion rules along these lines have run into problems
> in which characters that may be confused for others are needed
> to express strings in particular languages. The International Phonetic
> Alphabet (IPA) characters are a case in point. Some are used for
> certain (e.g. African) languages but some of these characters
> can be confused for others in the Latin alphabet. Other examples
> exist in Arabic, Cyrillic, Greek among others.
>
> Even in the absence of intra-label script mixing, inter-script  
> confusion
> such as the Russian word for "restaurant" looking like  "pectopah" in
> Latin characters is quite possible.
>
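To make the "pectopah" example concrete, the sketch below compares the Cyrillic word with its Latin look-alike. Python's standard library has no script property, so the hypothetical helper scripts_by_name uses the first word of each character name as a crude stand-in for the script; that is an assumption of this example, not how a real implementation would determine scripts.

```python
import unicodedata

cyrillic = "\u0440\u0435\u0441\u0442\u043e\u0440\u0430\u043d"  # Russian word for "restaurant"
latin = "pectopah"


def scripts_by_name(label: str) -> set:
    """Crude script guess: the first word of each character's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in label}


print(cyrillic == latin)          # False: entirely different code points
print(scripts_by_name(cyrillic))  # {'CYRILLIC'}
print(scripts_by_name(latin))     # {'LATIN'}
```

Neither label mixes scripts, so a ban on intra-label script mixing would accept both.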
> Despite the apparent desirability of such a ban at protocol level,  
> there
> are simply too many combinations of confusion within-scripts and  
> between
> scripts to benefit significantly from a protocol-level ban. On the  
> other hand,
> registry level constraints that may be more script-aware appear to be
> the most effective tool we have.
>
> I think client-level warnings are the most effective constraint.  
> After all, if we could always trust the registries, we would need  
> *no* constraints on the client side in the protocol.
>
>
>
> 3. Eszett is permitted and its usage appears to be geographically
> and language specific. Under IDNA2003, this character is mapped into
> "ss". To deal with the potential conflict with previously mapped
> registrations in which Eszett is mapped to "ss", registries would
> need to appeal to Rationale 7.2 options, for example. Note that not
> all collisions may be a consequence of mapping, i.e., many
> occurrences of "ss" in German text are not typographic variations of
> Eszett, and very few occurrences in Latin script, without
> consideration of language, are variations of Eszett either.
>
> 4. Final Sigma is permitted and raises similar issues to Eszett with
> regard to collisions, and the same remedies would apply.
>
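The IDNA2003 side of these collisions can be observed with Python's built-in "idna" codec, which implements the IDNA2003 ToASCII operation (RFC 3490). The wrapper name idna2003 is hypothetical and the labels are arbitrary examples chosen for illustration.

```python
def idna2003(label: str) -> bytes:
    """Apply IDNA2003 ToASCII via the standard library's "idna" codec."""
    return label.encode("idna")


# Sharp s is mapped to "ss", so the result is a plain ASCII label.
print(idna2003("stra\u00dfe"))  # b'strasse'

# Final sigma (U+03C2) is mapped to ordinary sigma (U+03C3),
# so both spellings produce the same encoded label.
print(idna2003("\u03c3\u03c2") == idna2003("\u03c3\u03c3"))  # True
```

Under a no-mapping scheme, each of these inputs would instead yield its own distinct encoded label, which is the source of the conflict with existing registrations.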
> 5. ZWJ/ZWNJ
>
> In IDNA2003, these characters were mapped to "nothing". It has
> become apparent, however, that some Indic scripts need them. Persian
> registries currently reject registration of labels including
> ZWJ/ZWNJ although ZWNJ is used in writing Persian languages. Arabic
> language does not need ZWJ/ZWNJ. Mapping to "nothing" in IDNA2003 has
> the side-effect of inhibiting domain name expression in some Indic
> scripts including Tamil and Devanagari. Permitting either or both as
> valid characters creates a compatibility problem similar to the
> Eszett one; i.e., one cannot tell whether a DNS label, when
> converted back to native character form, was intended to be written
> with ZWJ, ZWNJ or neither.
>
> Elaboration: Suppose that "ab" is a string in one of the scripts in  
> which we now
> propose to permit ZWNJ.  All we have in the DNS is the A-label  
> equivalent of "ab".
> We can't tell from looking at it whether the starting string, as  
> seen/preferred by the
> registrant, was
>  ab    or
>  aZWJb
> since both map to the same A-label.
>
> Under IDNA2008, if the user enters "ab", she gets one A-label
> while, if she enters "aZWJb", she gets a different A-label.
> That is exactly the same as the Eszett problem -- you can't tell
> from the IDNA2003 A-label what the original intention was and
> use of the string under IDNA2008 gets you a different A-label
> than it does under IDNA2003.
>
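A small sketch of the joiner problem, using Latin "a" and "b" purely as stand-ins as in the elaboration above. The first result uses Python's built-in "idna" codec (IDNA2003 ToASCII); the second is a bare Punycode encoding with no mapping, which only approximates the no-mapping behavior and skips the contextual rules that would in practice reject a joiner between Latin letters.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER
with_joiner = "a" + ZWJ + "b"

# IDNA2003: the joiner is mapped to nothing, so the result is the same as for "ab".
idna2003_form = with_joiner.encode("idna")
print(idna2003_form)  # b'ab'

# No mapping: the joiner survives into the encoded label.
unmapped_form = "xn--" + with_joiner.encode("punycode").decode("ascii")
print(unmapped_form != "ab")  # True
```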
> Joiner characters become invisible if inserted in strings written in  
> scripts
> that do not use them.
> => in strings where they make no visual difference. This includes
> scripts
> that do not use them, and many positions in scripts that do use them.
> Unicode classifies these characters
> as "COMMON" so they also end up passing any plausible tests to prevent
> mixing of scripts in a label. Contextual rules are needed to  
> restrict their use
> to strings in scripts where they have some effect.
>
> where they could have some effect (they won't always, and even when  
> they commonly have an effect, it depends on the font).
>
> We end up relying on
> registries to adopt their use judiciously within those scripts. See  
> also the
> Rationale document for further commentary.
>
> 6. Symbols and punctuation are NOT PVALID under IDNA2008 but are valid
> Most symbols...
>
> under IDNA2003 leading to a variety of potential confusions with  
> "slash-like"
> symbols or other symbols used in URIs for example. IDNA2008 rules  
> reduce
> confusion potential by making all characters with these Unicode  
> properties
> invalid for use with Domain labels.
> either most, or add at the end "with certain exceptions"
>
>
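As an approximation of "characters with these Unicode properties", the sketch below tests the general category of a character. The function name is_symbol_or_punctuation is hypothetical, and the actual table rules are more detailed and include exceptions; the hyphen is an obvious one, since it is punctuation by category yet required in LDH labels.

```python
import unicodedata


def is_symbol_or_punctuation(ch: str) -> bool:
    """True if the general category is a Symbol (S*) or Punctuation (P*) category."""
    return unicodedata.category(ch)[0] in ("S", "P")


for ch in ["/", "\u2044", "\u2665", "-", "a"]:
    # Prints the code point, its general category, and the result of the check.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_symbol_or_punctuation(ch))
```

U+2044 FRACTION SLASH is one of the "slash-like" characters at issue: it falls in a symbol category, as does U+2665 BLACK HEART SUIT.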
> It is not clear that such symbols are critically needed for domain  
> names.
>
> Another reason for banning these characters is that they complicate
> references, discussions and databases (such as WHOIS) because it is
> not clear how to describe them in common, informal usage.
> What is the correct way to refer to "-" ? Is it "hyphen", "minus  
> sign", "hyphen-
> minus" or "short middle horizontal bar?" And is "." "period", "dot",  
> "full stop",
> or something else? What about "#" - is it "pound", "hash", "number  
> sign" or
> "tic-tac-toe"? "Heart" is another example: which one is it?
>
> This is just not an issue; there are thousands of letters that have
> ambiguous or multiple names. This paragraph just can't be fixed; it
> needs removal.
>
>
>
> To be fair, one could refer to the Unicode long name for the  
> character or even
> the "U+" form although this sounds pretty awkward in practical terms.
>
> 7. JAMO characters in Korean have been made Protocol Invalid  
> (DISALLOWED)
> for reasons similar to (6) above. They introduce a combinatorial  
> explosion of different
> string representations built from JAMO primitive characters. They  
> are valid
> under IDNA2003.
>
> This is debatable. The only reasonable rationale we can include is  
> that they are only used in historic Korean.
>
>
> 8. Under IDNA2008, when a new version of Unicode is released the
> following steps can be taken:
>
> a. Review of changes that might require new rules in the IDNA2008
> framework. Such a conclusion would assuredly require formation of a
> WG to facilitate new RFC production. This is thought to be extremely
> unlikely to happen.
>
> b. A review of changes might only require exception rules to preserve
> compatibility. It is possible that the required changes might be  
> delegated
> to an IANA action possibly in consultation with an expert committee
> to generate new tables.
>
> The current drafts require new RFCs in order to change the exception  
> tables, I believe. It would be better to change that to have the  
> exception table governed by the same process as the context tables  
> (under stability provisions).
>
>
>
> c. Generate new tables for IANA registry (suitable for downloading
> as needed)
>
> During the transition some clients will have the older tables and
> some registries the newer ones. Lookups of Domain Names containing
> new PVALID characters by older clients will fail under IDNA2008
> because the client will reject UNASSIGNED characters until the
> clients are updated with the new PVALID characters.
>
> That is not the bad part of the transition. The bad part is that old  
> characters may transform from DISALLOWED to PVALID only during the  
> transition, then corrected, or transform from PVALID to DISALLOWED  
> only during the transition, then corrected. And the correction  
> period may be long, depending on when software is updated. That is,  
> if a program ships every two years, and is updated during the  
> correction, it will be wrong for 2 years.
>
>
> ...
