DRAFT Status of Work on IDNA2008 + IDNAv2

Vint Cerf vint at google.com
Wed Mar 25 12:44:40 CET 2009


Thank you, YungJin Suh. This recommendation was accepted in IDNA2008  
discussions. That suggests to me that for registration purposes, your  
recommendation would still apply. The new exploration of a mapping
function, prior to looking up a domain name, might end up mapping out  
any Jamo characters appearing in a query.
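
For registration purposes, the recommendation amounts to a simple range
test. A minimal sketch in Python, assuming the Hangul Syllables range
given in the quoted message below (U+AC00 ~ U+D7A3) plus LDH. The
function name is hypothetical and none of this is text from a draft:

```python
import unicodedata

HANGUL_SYLLABLES = range(0xAC00, 0xD7A3 + 1)  # precomposed syllables only
LDH = set("abcdefghijklmnopqrstuvwxyz0123456789-")


def is_korean_idn_label(label: str) -> bool:
    """Return True if every character is a Hangul syllable or LDH.

    Hypothetical helper: conjoining Jamo (U+1100 block) and compatibility
    Jamo (U+3130 block) fall outside both sets and are therefore rejected.
    """
    return bool(label) and all(
        ord(ch) in HANGUL_SYLLABLES or ch in LDH for ch in label.lower()
    )


# The same syllable typed as three conjoining Jamo is rejected as typed,
# although NFC composes it into one precomposed syllable.
jamo_form = "\u1112\u1161\u11ab"
composed = unicodedata.normalize("NFC", jamo_form)
```

Whether a lookup-side mapping should compose or drop such Jamo
sequences before this test is exactly the open question.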

The WG now needs to define more precisely what characters are mapped
and into what other characters (or into "nothing").
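
To make "mapped into other characters or into nothing" concrete, here
is a small sketch in Python. The table entries are illustrative
assumptions only (dot-like separators and the joiners that IDNA2003
mapped to nothing); the actual table is what the WG still has to
define:

```python
# Hypothetical mapping table: each key maps to a replacement string,
# and the empty string means "mapped to nothing".
EXAMPLE_MAP = str.maketrans({
    "\u3002": ".",   # IDEOGRAPHIC FULL STOP -> ASCII dot (label separator)
    "\uff0e": ".",   # FULLWIDTH FULL STOP -> ASCII dot
    "\u200c": "",    # ZWNJ -> nothing (as under IDNA2003)
    "\u200d": "",    # ZWJ -> nothing (as under IDNA2003)
})


def map_query(name: str) -> str:
    """Apply the example table to a query string before lookup."""
    return name.translate(EXAMPLE_MAP)
```

Characters absent from the table pass through unchanged, so the table
only needs to list the exceptions.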

There is also a question of when to apply such mappings prior to a  
query. One suggestion that is contained in the draft protocol document  
of IDNA2008 would perform an IDNA2008-style lookup and if that failed,
would then map the query under IDNA2003-like rules and do the lookup  
again.  I use the term "IDNA2003-like" above only because of the  
possibility that the WG will conclude that the IDNA2008 mapping  
function is similar to but possibly excludes some of the characters  
mapped under IDNA2003 rules.
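
The lookup-then-map-then-retry suggestion can be sketched as follows.
This is only an illustration under stated assumptions: the resolver is
a caller-supplied stand-in, label validity checks are omitted, and the
"IDNA2003-like" mapping is approximated here by case folding plus
NFKC, which may be broader than whatever the WG finally adopts:

```python
import unicodedata
from typing import Callable, Optional


def map_2003_like(name: str) -> str:
    """Rough stand-in for an IDNA2003-like mapping (an assumption):
    case folding followed by NFKC normalization."""
    return unicodedata.normalize("NFKC", name.casefold())


def lookup_with_fallback(
    name: str, resolve: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Try the query as given; on failure, map it and try once more.

    `resolve` is a hypothetical caller-supplied function that returns
    an answer, or None when the lookup fails.
    """
    answer = resolve(name)            # first attempt: no mapping applied
    if answer is not None:
        return answer
    mapped = map_2003_like(name)
    if mapped == name:                # mapping changed nothing; no retry
        return None
    return resolve(mapped)            # second lookup on the mapped form
```

One consequence visible in the sketch: every miss on an unmapped name
that the mapping would change costs a second query.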

Vint

Vint Cerf
Google
1818 Library Street, Suite 400
Reston, VA 20190
202-370-5637
vint at google.com




On Mar 25, 2009, at 5:59 AM, YungJin Suh wrote:

> Dear all WG members,
>
> About the JAMO characters in Korean, we still strongly recommend
> disallowing these characters.
>
> We want to allow ONLY Hangul Syllables(U+AC00 ~ U+D7A3) in revised  
> IDNA.
>
> I attached the letter from the Korean government. (I think some of you
> may have already read it.)
>
> I hope this document helps you to understand our situation.
>
> With regard to the local mapping draft,
>
> About 'dots' as label separators, actually Korean doesn't have  
> mapping problems. But Chinese and Japanese do.
> So I hope this problem will be solved in some way or other.
>
> And about 'Compatibility characters', we defined Korean IDN in the
> draft as follows:
>       The term "Korean IDN" stands for "IDN consists from CJK
> scripts marked with 'Y' in 'K' column, which is Hangul Syllables
> (U+AC00 ~ U+D7A3), and LDH".   Permitted characters in Korean IDN are
> listed in [IANA-IDN-Language-ko-KR].
> Even though IDNA2003 allowed JAMO in Korean, we defined it like this,
> because if we restrict the range of allowed characters to only
> Hangul Syllables (U+AC00 ~ U+D7A3), normalization is not an issue for
> Korean IDN anymore.
>
> Again, allowing only  Hangul Syllables in IDNA is recommended.
>
> Thank you.
>
> With regards,
>
> YungJin Suh
> =======================
> YungJin Suh
> Head of DNS section, KRNIC, NIDA
> yjsuh at nida.kr
> +82-2-2186-4562(O)
> +82-10-4820-8291(M)
> ========================
>
>
> From: idna-update-bounces at alvestrand.no [mailto:idna-update-bounces at alvestrand.no]
> On Behalf Of Vint Cerf
> Sent: Monday, March 23, 2009 5:31 AM
> To: Mark Davis
> Cc: idna-update at alvestrand.no
> Subject: Re: DRAFT Status of Work on IDNA2008 + IDNAv2
>
> thanks Mark - I re-issued version 5 of the material so some of your  
> comments have crossed in the mail. I will try to update once more to  
> take into account your comments or at least annotate as reminders  
> for discussion.
>
> v
>
>
>
> On Mar 22, 2009, at 3:51 PM, Mark Davis wrote:
>
>> Here are comments on the status. (I tried to update to the later  
>> doc, but because it was only distributed in pdf, I had to do it  
>> manually, so I may have missed something.)
>>
>>
>> Mark
>>
>>
>> On Fri, Mar 20, 2009 at 06:12, Vint Cerf <vint at google.com> wrote:
>> DRAFT Status of work on IDNA2008
>>
>> 3/21/2009 0523 PDT
>>
>>
>> Vint Cerf
>>
>>
>> This brief summary is intended to provide some focus for the  
>> IDNABIS WG meetings
>> scheduled for Monday and Tuesday, March 23 (1740-1940) and March 24  
>> (0900-1130).
>>
>> One goal is to try to assess rough consensus about the present  
>> documentation on the
>> presumption that we are abiding by the ground-rules set forth in  
>> the charter of the WG.
>> Another is to assess what the implications are for users,  
>> registries, registrars if
>> IDNA2008 is adopted as it presently stands.  A third goal is to  
>> examine the implications
>> of the IDNAV2 proposal from Paul Hoffman and contrast with adoption  
>> of IDNA2008.
>>
>> I fully recognize that consensus has to be assessed from mailing  
>> list exchanges, not
>> merely from appearances at our face to face meetings.
>>
>> The material presented below is by no means intended to be more  
>> than a basis for
>> discussion, and is not intended as a penultimate recommendation.
>>
>> I think it would also be useful to mark where each is different  
>> from IDNA2003.
>>
>>
>> Background
>>
>>
>>
>> Under the IDNABIS charter, the IDNA2008 design as it now stands  
>> makes several
>> specific assumptions or makes specific propositions to achieve a  
>> number of goals:
>>
>> 0. Avoid dependence on any specific version of Unicode through the  
>> use of rules
>>    for determining PVALID characters based on Unicode character  
>> properties
>>
>> add: "as much as possible". Exceptions may be necessary in some  
>> cases (and are included in the draft tables).
>>
>>
>> 1. No change to the deployed DNS server functionality (domain name  
>> labels limited to
>>    ASCII and case-insensitive matching only)
>> [no change from IDNA2003]
>>
>> 2. Esszet, Final Sigma, ZWJ and ZWNJ, geresh and gershayim are  
>> PVALID characters
>>     some of which are treated through contextual rules (there is  
>> still ongoing discussion
>>     about the implications of these choices)
>>
>> This is also a current feature of the drafts, but not required by  
>> the charter. It is unclear whether this is actually consistent with  
>> the charter or not. "This work is intended to specify an improved  
>> means to produce and use stable and unambiguous IDN identifiers."  
>> Effectively, any IDN with the first four characters is ambiguous  
>> between versions of IDNA in that it will lead to different addresses.
>>
>>
>> 3. Unassigned Unicode characters will not be looked up
>>
>> Just a comment (no change): IDNA2003 had the slightly different  
>> goal: unassigned Unicode characters will not be returned from the  
>> DNS.
>>
>>
>> 4. No mapping of characters at least within the protocol  
>> specification
>>
>> 5. No modification of or dependence on Nameprep  (and thus no impact
>>   on other protocols relying on Nameprep or Stringprep.)
>>
>> 6. Clear specification of valid "dot" form in a way that is  
>> consistent with DNS
>>     protocol requirements.
>>
>> IDNA2003 specified the dot form in a way that is consistent with  
>> DNS; that is, it required no change of the DNS protocol, so this is  
>> no change. That is, once in the ACE form, dots are dots.
>>
>>
>> 7. Symmetry between native-character ("Unicode") and ACE ("Punycode")
>>     forms of a label.
>>
>> This may be a goal, but it is not achieved by the current drafts.  
>> There is a strong asymmetry between them in that in lookup, an  
>> implementation need not check that what appears to be an A-Label is  
>> one, but it must check that a U-Label is one (mostly). (Comment: I
>> believe that this should be a goal: if it is important to check  
>> those requirements, then it is important to test both A and U  
>> Labels; if it is not important to test them, then it should not be  
>> a requirement for either one.)
>>
>>
>> 8. Conversion to an inclusion list of PVALID characters (as  
>> distinct from the
>>    IDNA2003 posture that excluded only a few Unicode characters)
>>
>> 9. Improved terminology to make categories and types of labels more  
>> clear.
>>    (Definitions)
>>
>> 10. Provide explanation for decisions and their motivations  
>> (Rationale) to
>>     aid implementors, registries, registrants and users in  
>> understanding IDNA.
>>
>> Rationale doesn't really provide explanation for motivations in  
>> enough detail to be useful. I'd recast this as: "Provide  
>> informative background material (Rationale) to aid ..."
>>
>>
>> 11. Separately describe registration and lookup procedures to  
>> improve clarity
>>
>> The goal is good, but the current drafts don't meet the goal.
>> Whether it increases clarity or not is unclear, since doing so
>> makes it difficult to determine what the similarities and
>> differences are between the two processes. So drop "to improve
>> clarity". (A relatively small recasting of the text to make it
>> precisely parallel between them (including numbering), and point
>> out precisely where the differences are, would meet this goal.)
>>
>>
>> 12. Specify tests to be applied at lookup time in an attempt to  
>> limit abuse of
>>       IDNA at all levels of registration
>>
>> That is not a change from IDNA2003. The tests are different, and  
>> are expanded, but it is a quantitative difference, not qualitative.  
>> For example, IDNA2003 did test bidi; we just think the IDNA2008  
>> tests are better. And the "in an attempt to limit abuse" is not  
>> true; the changes in IDNA2008 will have a trifling effect on abuse  
>> at the very best, and introduce significant opportunities for  
>> spoofing because of the 4 ambiguous characters. And affecting the  
>> "phishing" problem is not a requirement of the charter. So this  
>> item should be removed.
>>
>>
>> 13. Clarify what is expected of IDNA-aware applications and domain  
>> name
>>       "slots" with regard to invalid labels and future extensibility
>>
>> These are still not nailed down in the current drafts. My  
>> expectations are that once a domain name is valid, it remain valid  
>> for all time -- that is, we are doing a one-time massive  
>> compatibility change, but there will be no more changes that would  
>> affect compatibility. However, that is not captured in the text,  
>> despite the charter requirement "This work is intended to specify  
>> an improved means to produce and use stable and unambiguous IDN  
>> identifiers."
>>
>> Another major change is the introduction of a mechanism for
>> changing IDNAs on the fly via the context mechanism, with an
>> associated process.
>>
>>
>>
>>
>> Chartering and Re-Chartering
>>
>> (1) A Re-charter is needed if we abandon a significant fraction of  
>> the IDNA2008 goals
>> and methods. IDNAv2, as described by Paul Hoffman requires a re- 
>> charter.
>>
>> (2) A Re-charter is needed if the WG decides to introduce mappings  
>> into the IDNA2008
>> specifications since the basic assumption in IDNA2008 was that  
>> mapping would not
>> be part of the specification.
>>
>> (3) It is possible that re-charter might not be needed if IDNA2008  
>> adopts some
>> IDNA2003 operations under a restricted set of conditions and only  
>> at lookup
>> time for purposes of easing the transition to IDNA2008. This would  
>> be up to the
>> AD and IESG presumably to decide.
>>
>> Basics for IDNA2003 and IDNA2008
>>
>> Both of these specifications use the Punycode algorithm to generate  
>> what
>> IDNA2008 would call an A-label (i.e., "xn-- <LDH compliant string>")
>> from
>>
>> Better expressed as an XN label. That terminology can be applied to  
>> both, while A-Label only makes sense for IDNA2008.
>>
>> labels expressed as a string of characters drawn from a subset of  
>> Unicode
>> defined characters.
>>
>> DNS matching is done in the servers by comparing the query string  
>> to the
>> registered string in a case-independent fashion.  For IDNs, these  
>> comparisons
>> are done after conversion into the "xn--" prefix form. For IDNs the  
>> case insensitive
>> matching of the DNS servers applies only to the A-label form and  
>> not to the
>> Unicode form. This means that the case-insensitive matching
>> behavior of traditional ASCII labels is not conferred on IDNs in their
>> Unicode form.
>>
>> The case-insensitive comparison between traditional LDH domain
>> names is
>> approximated under IDNA2003 by using CaseFold as a mapping guide on  
>> the
>> Unicode strings being looked up. In addition, IDNA2003 also maps  
>> the so-called
>> "compatibility" characters of Unicode into their counterparts. The  
>> same actions
>>
>> => "compatibility decomposable" characters of Unicode into their  
>> counterparts
>> [Not all compatibility characters are decomposable and vice versa.]
>>
>>
>> precede the registration of new domain names under IDNA2003.
>>
>> Unicode CaseFold maps to upper case and then maps back to lower case.
>>
>> This is not quite accurate; better would be to say "Unicode  
>> CaseFold maps characters to lowercase values based on an
>> equivalence class formed by including lowercase, uppercase, and  
>> titlecase mappings."
>>
>>
>> Prior to Unicode 5.0, Eszett became "SS" because there was no upper
>> case, then became "ss" in the lower case mapping.  Under Unicode 5.0
>> CaseFold was unchanged for  stability reasons. Consequently
>> CaseFold (ESSZETT) is "ss" rather than lower case esszett even after
>> the introduction of upper case ESSZETT in Unicode 5.0.
>>
>> =>
>> The uppercase of eszett in Unicode is "SS", following national
>> standards and practices. As of Unicode 5.1, an uppercase version of  
>> eszett became available. Under the Unicode case folding, both map  
>> to "ss".
>>
>>
>>
>> Under IDNA2003, both DISALLOWED and UNASSIGNED characters
>> are looked up. If abusive registrations are made using DISALLOWED
>> or UNASSIGNED characters, these registered domain names may
>> be found on lookup by IDNA2003-compliant clients.
>>
>> This is not correct, as Erik points out.
>>
>>
>>
>> Under IDNA2008, UNASSIGNED and DISALLOWED characters are not looked  
>> up.
>> If new characters become defined under a new version of Unicode
>> an old client will not look them up until it is updated. Abusive  
>> registrations
>> using UNASSIGNED characters will not be looked up.
>>
>> Script mixing is not banned under IDNA2003. Under IDNA2008, BiDi
>> bans mixing of European and Extended Arabic-Indic numbers with
>> Arabic numbers.  That is AN and EN characters may not be present in
>> the same label. Otherwise, mixing is permitted in IDNA2008.
>>
>> IMPLICATIONS OF ADOPTING IDNA2008 AS CURRENTLY SPECIFIED
>>
>>
>> 1. IDNA2008 is case sensitive for labels with non-LDH characters in  
>> them but  is
>> ... with at least one non-LDH character...
>>
>> case-insensitive for LDH characters
>>
>> For example, "buecher" is all ASCII and could be matched with
>> "Buecher" or "bUecher" under IDNA2008.
>>
>> however "B<u-umlaut>cher" would not be allowed because Tables (see  
>> 4.2.2) would
>> disallow Latin Capital letters. Some users accustomed to LDH-label  
>> behavior
>> may be surprised that "B<u-umlaut>cher" and "b<u-umlaut>cher" do  
>> not match.
>>
>> On the other hand, the symmetric relationship between the IDNA2008- 
>> defined
>> A-Label and U-Label has the benefit that one can use exact match for
>> either
>> U-label form or A-label forms since they are directly and  
>> unambiguously
>> transformable into each other.
>>
>> However, this symmetry will not exist for
>> cases where the IDNA2003 A-Label and IDNA2008 A-label for the same
>> U-Label differ. [Query: will this be a material problem only for  
>> actual
>> registrations under IDNA2003 that differ in A-label form from  
>> IDNA2008?]
>>
>> For registries, this is an advantage (equivalent to disallowing  
>> mapping), but it is not so clear that it is a "benefit" for lookup.
>>
>>
>>
>>
>> 2. IDNA2008 does not ban script mixing even within labels.
>>
>> Attempts to fashion rules along these lines have run into problems
>> in which characters that may be confused for others are needed
>> to express strings in particular languages. The International  
>> Phonetic
>> Alphabet (IPA) characters are a case in point. Some are used for
>> certain (e.g. African) languages but some of these characters
>> can be confused for others in the Latin alphabet. Other examples
>> exist in Arabic, Cyrillic, Greek among others.
>>
>> Even in the absence of intra-label script mixing, inter-script  
>> confusion
>> such as the Russian word for "restaurant" looking like  "pectopah" in
>> Latin characters is quite possible.
>>
>> Despite the apparent desirability of such a ban at protocol level,  
>> there
>> are simply too many combinations of confusion within-scripts and  
>> between
>> scripts to benefit significantly from a protocol-level ban. On the  
>> other hand,
>> registry level constraints that may be more script-aware appear to be
>> the most effective tool we have.
>>
>> I think client-level warnings are the most effective constraint.  
>> After all, if we could always trust the registries, we would need  
>> *no* constraints on the client side in the protocol.
>>
>>
>>
>> 3. Esszet is permitted and its usage appears to be geographically  
>> and language
>> specific. Under IDNA2003, this character is mapped into "ss". To  
>> deal with the
>> potential conflict with previously mapped registrations in which  
>> Esszet is mapped
>> to "ss" registries would need to appeal to Rationale 7.2 options,  
>> for example,
>> to deal with this. Note that not all collisions may be a  
>> consequence of mapping, i.e.,
>> many occurrences of "ss" in German text are not typographic  
>> variations of
>> Esszett and very few occurrences in Latin script, without  
>> consideration of language,
>> are variations of Esszett either.
>>
>> 4. Final Sigma is permitted and raises similar issues to Esszet  
>> with regard to
>> collisions and the same remedies would apply.
>>
>> 5. ZWJ/ZWNJ
>>
>> In IDNA2003, these characters were mapped to "nothing". It has  
>> become apparent
>> however that some Indic scripts need them. Persian registries  
>> currently
>> reject registration of labels including ZWJ/ZWNJ although ZWNJ is  
>> used in
>> writing Persian languages. Arabic language does not need ZWJ/ZWNJ.
>> Mapping to "nothing" in IDNA2003 has the side effect
>> of inhibiting domain name expression in some Indic scripts
>> including
>> Tamil and Devanagari. Permitting either or both as valid characters  
>> creates
>> a compatibility problem similar to the Esszett one; i.e., one  
>> cannot tell
>> whether a DNS label, when converted back to native character form,  
>> was
>> intended to be written with ZWJ, ZWNJ or neither.
>>
>> Elaboration: Suppose that "ab" is a string in one of the scripts in  
>> which we now
>> propose to permit ZWNJ.  All we have in the DNS is the A-label  
>> equivalent of "ab".
>> We can't tell from looking at it whether the starting string, as  
>> seen/preferred by the
>> registrant, was
>>  ab    or
>>  aZWJb
>> since both map to the same A-label.
>>
>> Under IDNA2008, if the user enters "ab", she gets one A-label
>> while, if she enters "aZWJb", she gets a different A-label.
>> That is exactly the same as the Eszett problem -- you can't tell
>> from the IDNA2003 A-label what the original intention was and
>> use of the string under IDNA2008 gets you a different A-label
>> than it does under IDNA2003.
>>
>> Joiner characters become invisible if inserted in strings written  
>> in scripts
>> that do not use them.
>> => in strings where they make no visual difference. This includes
>> scripts
>> that do not use them, and many positions in scripts that do use them.
>> Unicode classifies these characters
>> as "COMMON" so they also end up passing any plausible tests to  
>> prevent
>> mixing of scripts in a label. Contextual rules are needed to  
>> restrict their use
>> to strings in scripts where they have some effect.
>>
>> where they could have some effect (they won't always, and even when  
>> they commonly have an effect, it depends on the font).
>>
>> We end up relying on
>> registries to adopt their use judiciously within those scripts. See  
>> also the
>> Rationale document for further commentary.
>>
>> 6. Symbols and punctuation are NOT PVALID under IDNA2008 but are  
>> valid
>> Most symbols...
>>
>> under IDNA2003 leading to a variety of potential confusions with  
>> "slash-like"
>> symbols or other symbols used in URIs for example. IDNA2008 rules  
>> reduce
>> confusion potential by making all characters with these Unicode  
>> properties
>> invalid for use with Domain labels.
>> either most, or add at the end "with certain exceptions"
>>
>>
>> It is not clear that such symbols are critically needed for domain  
>> names.
>>
>> Another reason for banning these characters is that they complicate
>> references, discussions and databases (such as WHOIS) because it is
>> not clear how to describe them in common, informal usage.
>> What is the correct way to refer to "-" ? Is it "hyphen", "minus  
>> sign", "hyphen-
>> minus" or "short middle horizontal bar?" And is "." "period",  
>> "dot", "full stop",
>> or something else? What about "#" - is it "pound", "hash", "number  
>> sign" or
>> "tic-tac-toe"? "Heart" is another example: which one is it?
>>
>> This is just not an issue; there are thousands of letters that have
>> ambiguous or multiple names. This paragraph just can't be fixed; it  
>> needs removal.
>>
>>
>>
>> To be fair, one could refer to the Unicode long name for the  
>> character or even
>> the "U+" form although this sounds pretty awkward in practical terms.
>>
>> 7. JAMO characters in Korean have been made Protocol Invalid  
>> (DISALLOWED)
>> for reasons similar to (6) above. They introduce a combinatorial  
>> explosion of different
>> string representations built from JAMO primitive characters. They  
>> are valid
>> under IDNA2003.
>>
>> This is debatable. The only reasonable rationale we can include is  
>> that they are only used in historic Korean.
>>
>>
>> 8. Under IDNA2008, when a new version of Unicode is released the
>> following
>> steps can be taken:
>>
>> a. review of changes that might require new rules in the IDNA2008  
>> framework.
>> Such a conclusion would assuredly require formation of a WG to
>> facilitate new RFC
>> production. This is thought to be extremely unlikely to happen.
>>
>> b. A review of changes might only require exception rules to preserve
>> compatibility. It is possible that the required changes might be  
>> delegated
>> to an IANA action possibly in consultation with an expert committee
>> to generate new tables.
>>
>> The current drafts require new RFCs in order to change the  
>> exception tables, I believe. It would be better to change that to  
>> have the exception table governed by the same process as the  
>> context tables (under stability provisions).
>>
>>
>>
>> c. Generate new tables for IANA registry (suitable for downloading
>> as needed)
>>
>> During the transition some clients will have the older tables and
>> some registries the newer ones. Lookups of Domain Names containing
>> new PVALID characters by older clients will fail under IDNA2008
>> because
>> the client will reject UNASSIGNED characters until the clients are  
>> updated
>> with the new PVALID characters.
>>
>> That is not the bad part of the transition. The bad part is that  
>> old characters may transform from DISALLOWED to PVALID only during  
>> the transition, then corrected, or transform from PVALID to  
>> DISALLOWED only during the transition, then corrected. And the  
>> correction period may be long,   depending on when software is  
>> updated. That is, if a program ships every two years, and is  
>> updated during the correction, it will be wrong for 2 years.
>>
>>
>> ...
>
> <Comments for Korean IDN.pdf>
