[mostly OT] Re: Re-sending TXT form of Proposed IDNA2008 Transition Idea

Steve Crocker steve at shinkuro.com
Wed Dec 16 12:51:06 CET 2009

Thanks.  See comments in line below.


On Dec 16, 2009, at 5:20 AM, Martin J. Dürst wrote:

> On 2009/12/15 5:20, Steve Crocker wrote:
>> I would be interested in understanding this combinatorial explosion
>> more clearly. I was focused just on the sharp-s situation, and the
>> expansion there is very slight, I believe.
> In practice, yes. There are German words with three consecutive
> 's'es, and only the first two can be combined into a 'ß'. I.e. the
> combination ßs is possible in German (when two words are connected,
> which is very frequent), but sß is not. The number of double 's' or
> triple 's' in reasonable German words is also limited: a long word
> with two or three of them can easily be made, but with more than
> four of them it gets weirder and weirder.
> In theory, no. It's easy to write down a recursion for the number of
> combinations of 's' and 'ß' one can create that correspond to a
> certain number N of 's'es. Simply divide N into two parts of length
> A and B (i.e. N = A+B), and assume that either there is an 'ß'
> straddling the substrings of length A and B, or there is no such
> 'ß'. The number of combinations C(N) then is C(A)*C(B) (for the
> latter alternative) plus C(A-1) * C(B-1) (for the former
> alternative). I have attached a little Ruby program (sz.rb) and its
> output (count_sz.txt).

This works out to be the Fibonacci series, i.e. C(N+2) = C(N+1) + C(N).

C(1) = 1
C(2) = 2
C(3) = 3
C(4) = 5
C(5) = 8
C(57) = 591,286,729,879
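
A minimal sketch of that count, in Python rather than the Ruby of the attached sz.rb (which is not reproduced here). It assumes C(0) = 1 for the empty run and takes the A = 1 split of the recursion quoted above, which reduces it to the Fibonacci recurrence. The name count_variants is hypothetical.

```python
def count_variants(n: int) -> int:
    """Count the strings of 's' and 'ß' that correspond to a run of n 's'.

    Assumption: C(0) = 1 (empty run) and C(1) = 1. A longer run either
    starts with a plain 's' (n - 1 left) or with an 'ß' that absorbs two
    's' (n - 2 left), so C(n) = C(n - 1) + C(n - 2).
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    prev, curr = 1, 1  # C(0), C(1)
    for _ in range(n - 1):
        prev, curr = curr, prev + curr
    return curr


for n in (1, 2, 3, 4, 5, 57):
    print(n, f"{count_variants(n):,}")
```

The last line printed is "57 591,286,729,879", matching the figure above. The general split C(N) = C(A)*C(B) + C(A-1)*C(B-1) gives the same values for any A + B = N.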

> For N = 57, we get 591,286,729,879 combinations. This is the actual
> upper length where the simple recursion still works. With anything
> above that, you get into problems if you convert just one "ss" pair
> into a 'ß'. You end up with 56 or more 's', 4 characters for the
> "xn--" prefix, and 4 characters (a hyphen and three characters
> payload) for encoding the 'ß' and its position, i.e. at least 64
> characters, which is over the 63-octet limit for a DNS label. Please
> note that when you convert more than one "ss" pair to a 'ß', your
> string gets shorter and shorter, because the overall number of
> characters is getting smaller, and punycode is very efficient
> (essentially only needs one character) to express "one more of
> these".

When I suggested that registries and registrars proactively register  
the variants to smooth the transition, I had in mind the existing  
corpus of names.  I can see that if registries and registrars  
instituted this plan, it would be possible for someone to register a  
deliberately perverse string that includes a large number of s's.  As  
you point out, the cases that arise naturally only have 'ß' mapped for  
the initial pair of ss in a sequence of s's, and there are not more  
than three sets of s's in any string.  If the proactive rule were  
restricted to these combinations, there would be no more than seven  
additional variants.  Moreover, to prevent gaming, the proactive rule  
could be applied to existing strings but perhaps not to newly  
registered strings.
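
To make the "no more than seven" figure concrete, here is a rough sketch of the restricted rule: only the initial "ss" of each run of s's may become 'ß', so a label with k such runs has 2^k forms, i.e. 2^3 - 1 = 7 additional variants for three runs. The function name and the sample labels are hypothetical, and the sketch looks at a single label only.

```python
import itertools
import re


def restricted_variants(label: str) -> list[str]:
    """Return label plus every variant in which the leading "ss" of a run
    of two or more 's' is replaced by 'ß' (at most one 'ß' per run)."""
    # The capture group makes re.split keep the runs as separate pieces.
    pieces = re.split(r"(s{2,})", label)
    options = []
    for piece in pieces:
        if re.fullmatch(r"s{2,}", piece):
            # Either leave the run alone or fold its first two 's' into 'ß'.
            options.append([piece, "ß" + piece[2:]])
        else:
            options.append([piece])
    return ["".join(combo) for combo in itertools.product(*options)]


print(restricted_variants("strasse"))  # ['strasse', 'straße']
print(restricted_variants("hisssss"))  # ['hisssss', 'hißsss']
print(len(restricted_variants("passgasskuss")) - 1)  # 7 additional variants
```

Under this rule a deliberately perverse label such as hisssss yields only one extra variant, rather than the eight forms the unrestricted count gives for a run of five s's.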

Pat Kane and John Klensin point out that proactive registration of  
variants of other characters would likely result in many, many more  
variants, so the strategy I am suggesting is limited to special cases  
like 'ß' for ss but not a broader set.

> Of course, the situation for other bundling cases is different and  
> has to be analyzed separately.
> Regards,   Martin.
>> Steve
>> On Dec 14, 2009, at 3:18 PM, Kane, Pat wrote:
>>> Steve,
>>> There could be billions of variants for a single registration. We
>>> used to have at least one IDN in .com that would have had 16M
>>> variants. We keep a separate variant table as opposed to registering
>>> the variants themselves as domains. Some Chinese characters have as
>>> many as eight variants, and because of the way that punycode
>>> compresses repeating characters you could end up with more than 20
>>> Chinese characters represented by the entire ASCII encoded string.
>>> If you repeated one of the characters with eight variants for seven
>>> positions in a string, you would generate over 2 million variants.
>>> Pat
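
The arithmetic behind the "over 2 million" figure can be checked in a couple of lines. This is only a sketch of the multiplication, assuming every one of the seven positions independently takes any of the eight variants. The variable names are hypothetical.

```python
variants_per_character = 8  # "as many as eight variants" for one character
repeated_positions = 7      # the same character repeated in seven positions

# Each position can independently take any of the eight forms.
total = variants_per_character ** repeated_positions
print(f"{total:,}")  # 2,097,152
```
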
>>> From: idna-update-bounces at alvestrand.no
>>> [mailto:idna-update-bounces at alvestrand.no] On Behalf Of Steve Crocker
>>> Sent: Monday, December 14, 2009 2:37 PM
>>> To: Vint Cerf
>>> Cc: Steve Crocker; idna-update at alvestrand.no
>>> Subject: Re: Re-sending TXT form of Proposed IDNA2008 Transition Idea
>>> Vint, et al,
>>> This seems reasonable to me. I would offer two refinements.
>>> First, each registry, in cooperation with its registrars, could use
>>> the sunrise period to register all of the variants that are
>>> automatically mapped together under IDNA2003 but will become
>>> separate under IDNA2008. The variants would all point to the same
>>> address(es), so the result should be the same for anyone looking up
>>> a name under either the IDNA2003 or IDNA2008 rules. When the sunrise
>>> period is over, the variants could become unregistered or could be
>>> transferred to others. The existing registrant would have first say,
>>> of course.
>>> I'm implicitly suggesting a business strategy for the registries and
>>> registrars, and that may or may not appeal to them. For ICANN
>>> accredited registries and registrars, there might need to be some
>>> coordination with ICANN too, particularly if the variant  
>>> registrations
>>> are provided at no charge during the sunrise period. I haven't given
>>> this extensive thought, and I haven't talked to others in ICANN,  
>>> so I
>>> can't speak authoritatively, but it seems to me a plausible strategy
>>> for smoothing the transition. In essence, this is the mirror of the
>>> strategy you're proposing in the sense that all the variants are
>>> registered and then the undesired ones trickle away.
>>> One might ask if the number of variants will be unwieldy, and one
>>> can point to examples like Mississippi or hisssss... as stressful
>>> cases. My intuition is that even if a few cases explode, the overall
>>> impact will be small. I'll go on record here and suggest the impact
>>> will be less than 10% for any existing domain.
>>> Second, you've proposed the timing of the transitions would be up to
>>> each registry. That's a good suggestion in terms of providing  
>>> maximum
>>> flexibility, but it seems to me that some of the timing is  
>>> governed by
>>> the browsers. I would expect there will be a date when IDNA2008 is
>>> phased in and a separate, later date when IDNA2003 is declared dead.
>>> In between these dates, I would expect the registries will have to
>>> phase over.
>>> These details aside, I am very glad there is attention to a  
>>> transition
>>> plan. That's something that has been a difficult area for both IPv6
>>> and DNSSEC, and I think IDNAbis will be much better off with this
>>> attention to transition.
>>> Thanks,
>>> Steve
>>> On Dec 14, 2009, at 2:12 PM, Vint Cerf wrote:
>>> It is recommended to use a fixed width font to display this message
>>> Introduction of Eszett (sharp-S) and Final Sigma
>>> See http://typefoundry.blogspot.com/2008/01/esszett-or.html for an
>>> interesting perspective on 'Sharp-S'
>>> Introduction
>>> The IDNABIS working group has spent two years evolving documents
>>> describing the use of Unicode in Internet domain name labels. We  
>>> have
>>> ended the IETF Last Call with a lengthy discussion on the manner in
>>> which the Unicode characters Latin Small Letter Sharp-S (U+00DF) and
>>> Greek Small Letter Final Sigma (U+03C2) are to be introduced into  
>>> use.
>>> The so-called Zero-Width Joiner and Zero-Width Non-Joiner (ZWJ and
>>> ZWNJ respectively) have been included as CONTEXT-Joiner (or
>>> CONTEXTJ) in the IDNA2008 documentation and the general consensus is
>>> that these two may be registered at the discretion of registries.
>>> IDNA2008 specifically permits their use, in context.
>>> The primary debates surrounding Sharp-S and Final Sigma relate to  
>>> the
>>> method of their introduction into use as PVALID characters under
>>> IDNA2008. This note represents an attempt to synthesize a
>>> philosophical basis for achieving the goal of making these two
>>> characters usable in domain name labels.
>>> It is useful to recall that the Domain Name System is a hierarchical
>>> system of registries. The root zone is the place where top level
>>> domain labels are registered. The Top Level domain name registries
>>> (e.g. .com, .coop, .ca, .uk) are 'pointed to' using 'delegation
>>> records' in the root zone file. Each 'dot' in a domain name is a  
>>> point
>>> where 'delegation' (in DNS-speak, a zone cut) for further  
>>> registration
>>> handling MAY be implemented.
>>> So, for example, suppose that it is desired to create a Second Level
>>> label, 'foo' under the Top Level Label 'com'. Typically, the party
>>> wishing to register domain names with the suffix 'foo.com' would
>>> request to register 'foo' as a second level label under 'com' and a
>>> delegation record would be created pointing to the name server that
>>> will respond to all domain names with the suffix 'foo.com'.
>>> At any point, a registration may either be an address record for,
>>> e.g., abc.foo.com, or a set of delegation records pointing to the
>>> servers for the Third Level label 'abc'.
>>> The notion of delegation is important to keep in mind when
>>> considering how to introduce new PVALID characters into labels, since
>>> each label in a multi-label domain name can be managed by a different
>>> entity (i.e., through delegated authority). A decision by a higher
>>> level authority to treat two different labels as equivalent is a
>>> non-trivial exercise in delegation mechanics. This fact is often lost
>>> in discussions about domain names as if they were flat identifiers.
>>> They are not. They really represent delegated hierarchies and their
>>> creation is often achieved through a series of assignments of
>>> delegated authority.
>>> 1. It is desirable that new PVALID characters can be introduced as
>>> soon as any registry in the hierarchy wishes to do so without having
>>> to coordinate with other registries.
>>> 2. It is desirable that IDNA2003 compliant and IDNA2008 compliant
>>> entities (programs, applications, etc.) co-exist without introducing
>>> ambiguous resolution of domain names (i.e., the same domain name
>>> resolves to different IP addresses under IDNA2003 and IDNA2008
>>> interpretation).
>>> 3. In the proposal that follows, a relaxation of the constraint
>>> in (2) is that it is acceptable that IDNA2008 interpretation leads
>>> to NXDOMAIN even if IDNA2003 leads to a valid IP address (or
>>> vice-versa). Under this provision, the introduction of a new
>>> PVALID character does not lead to distinct IP addresses (and
>>> therefore hazardous ambiguity) even if it produces (temporary?)
>>> non-resolution for some cases.
>>> It should be recognized that the millions of registries/zones in the
>>> DNS are largely independent entities. We can produce a "suggested
>>> good practice", but registries will make local determinations as to
>>> what to do based on local considerations. To discourage a particular
>>> practice, it seems best to explain what bad consequences will result
>>> from following it but as a practical matter leave the decisions up  
>>> to
>>> the registry. In many ways we have already adopted this position in
>>> IDNA2008 by leaving a great many decisions about which characters to
>>> permit for registration (even if they are PVALID in protocol) for
>>> reasons of local significance or practice.
>>> There are many side-effects associated with introducing as PVALID
>>> characters that were formerly mapped under IDNA2003. An unknown  
>>> number
>>> of URLs (or other domain-name-referencing constructs) may become
>>> unreachable upon adoption of IDNA2008, if the unmapped versions of  
>>> the
>>> associated domain names have not been constructively registered and
>>> made to resolve to the same IP address as the mapped version.
>>> Under IDNA2003, any reference to a domain name label containing
>>> Sharp-S is converted to a label containing 'ss' in place of Sharp-S,
>>> wherever Sharp-S appears. This revised label is then used either
>>> for registration or look up in the Domain Name System.
>>> Under IDNA2008, Sharp-S is treated as PVALID and not converted to
>>> 'ss'.
>>> Many of the suggested transition tactics have attempted a kind of
>>> "perfection" in which there is either a deadline by which everything
>>> works under IDNA2008 or new mechanisms to somehow distinguish  
>>> between
>>> IDNA2003 and IDNA2008 or urge strenuous efforts to make everything
>>> backward compatible with IDNA2003 mappings - especially for the two
>>> problem characters Sharp-S and Final Sigma. I am ignoring everything
>>> else but these in this contribution since my sense is that this
>>> working group may go along with anything that "solves" the problem
>>> with them. Joiners I think we can assume have been accepted in the
>>> CONTEXTJ form.
>>> I would like to try out on you an idea that isn't "perfect" but that
>>> avoids the worst hazard, I think.
>>> My definition of worst hazard is that different entities (browsers,
>>> applications) do resolution and get conflicting results.
>>> An example of this would be a case where under IDNA2003, a domain  
>>> name
>>> containing Sharp-S would be vectored to a domain name and associated
>>> IP address that referenced a domain name registered with "ss" in  
>>> lieu
>>> of Sharp-S and under IDNA2008 would be vectored to an IP address
>>> associated with a Sharp-S registration that leads to a different IP
>>> address and a distinct registrant. I would distinguish this from the
>>> case where the same registered domain name is associated with two or
>>> more IP addresses on purpose (e.g. two A records that the registrant
>>> considers equivalent).
>>> IDNA2003 Case
>>> registered     looked up
>>> domain name    domain name    IP address       Registrant
>>> masse.com      maße.com       mapped           Mr. Foo
>>>                               to masse.com
>>> IDNA2008 Case
>>> registered     looked up
>>> domain name    domain name    IP address       Registrant
>>> maße.com       maße.com                        Mr. Bar
>>> The hazard is that under IDNA2003, a look up for maße.com gets the
>>> address of masse.com while under IDNA2008, the look up for maße.com
>>> gets the address of maße.com.
>>> What we would like is to prevent this unexpected ambiguity.
>>> I would like to introduce a failsafe practice that prevents this
>>> particular ambiguity but allows for an NXDOMAIN result that may not
>>> be considered hazardous even if it is annoying.
>>> Let us imagine that the .com registry wishes to introduce IDNA2008
>>> capability into its second level domain registrations (that's all it
>>> controls).
>>> We assume that it has been registering under IDNA2003 rules in the
>>> past, so that any label containing "ß" will have been mapped to "ss"
>>> prior to registration. There is a collection of registrants in the
>>> equivalence class "registered a label containing 'ss'". Let us call
>>> the set of such registrants R.
>>> The .com registry introduces a sunrise period in which all members  
>>> of
>>> R are advised that they may register domains equivalent to the ones
>>> they did register but with the mapped "ss" form changed to the
>>> unmapped "ß" form. I am pretty sure there cannot be collisions here
>>> because all the final registrations have to have been mapped to  
>>> "ss" -
>>> so if there were going to be a collision it would already have been
>>> detected at the time of original IDNA2003-compliant registration:
>>> "sorry, someone else has already registered the 'ss' form you would
>>> have gotten, can't register that."
>>> After time T (determined by the registry, not by IETF or ICANN
>>> fiat), the .com registry then advises that it will accept
>>> registration of SLDs containing "ß". However, it abides by the
>>> following rules at registration time:
>>> (Failsafe Rule 1): If registration of an SLD containing "ß" would
>>> collide under IDNA2003 mapping rules with an existing registered
>>> domain name, the registration is allowed if the holder of the
>>> requested domain is the same (*) as the holder of the
>>> already-registered domain, otherwise the registration is not
>>> allowed.
>>> (Failsafe Rule 2): If registration of an SLD containing "ss" would
>>> collide under IDNA2003 mapping rules with an existing registered
>>> domain name containing "ß", it is allowed if the holder of the
>>> requested domain is the same (*) as the holder of the
>>> already-registered domain, otherwise the registration is not
>>> allowed. Note that Failsafe Rule 2 only applies once a registry is
>>> operating under IDNA2008 rules.
>>> (*) Which registrants are "the same" is to be defined by the
>>> registry, and match the definitions the registry applies.
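
The two failsafe rules quoted above can be sketched as a single registration check. This is an illustration under stated assumptions, not a registry implementation: the registry is modelled as a plain dict from SLD label to holder, "the same holder" is simple string equality (the rules leave that definition to the registry), and the IDNA2003 mapping is reduced to lower-casing plus Sharp-S folding. All names are hypothetical.

```python
def idna2003_fold(label: str) -> str:
    """Rough stand-in for the IDNA2003 mapping, covering Sharp-S only."""
    return label.lower().replace("ß", "ss")


def may_register(label: str, holder: str, registry: dict[str, str]) -> bool:
    """Apply Failsafe Rules 1 and 2 to a requested SLD registration.

    The request is refused if it collides, under the IDNA2003 mapping,
    with a label already registered to a different holder.
    """
    wanted = idna2003_fold(label)
    for existing, existing_holder in registry.items():
        if idna2003_fold(existing) == wanted and existing_holder != holder:
            return False  # collision with another holder's registration
    return True


# Rule 1: "strasse" is held by Mr. Baz, so only he may add "straße".
registry = {"strasse": "Mr. Baz"}
print(may_register("straße", "Mr. Frotz", registry))  # False
print(may_register("straße", "Mr. Baz", registry))    # True

# Rule 2: "straße" is held by Mr. Frotz, so Mr. FUBAR cannot take "strasse".
registry = {"straße": "Mr. Frotz"}
print(may_register("strasse", "Mr. FUBAR", registry))  # False
```

Because both rules compare labels after the same folding, one check covers both directions.
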
>>> As a slightly less safe alternative, but at the option of the  
>>> registry
>>> (perhaps after even more time has gone by), "not allowed" in the  
>>> above
>>> two rules could be replaced by notification of the existing domain
>>> holder with an offer to again let that registrant preemptively
>>> register the name, thereby blocking its registration by someone  
>>> else.
>>> If that offer were not accepted, the new registration would be
>>> permitted, of course still subject to whatever dispute resolution
>>> policies are in effect for .com or other relevant zone.
>>> This latter suggestion opens the door for achieving independence of
>>> formerly-mapped pairs of now PVALID characters.
>>> There are some nuances to the scenarios offered above. With possible
>>> exceptions for some "bundling" practices, most registrations will be
>>> sequential (i.e., not "at the same time"). One typically registers
>>> one domain name and then registers others. Because of this, we will
>>> usually end up in a situation where, at the time of the second (or
>>> Nth) registration, someone has to check, for example, whether the
>>> requested holder of the next domain name is the same as the holder of
>>> earlier but colliding registered domain names.
>>> There may be different registrars involved in sequential
>>> registrations. There may be different contact representatives for
>>> respective registrations. There might be transfers being made in
>>> between related registrations.
>>> Because of this, the important things are the failsafe rules, and  
>>> that
>>> they (in an ICANN context) are formulated by the registries so that
>>> details like "same" actually have some specific meaning in the
>>> specific registry context.
>>> If we go back to the example given above and assume that Mr. Foo has
>>> registered masse.com before Mr. Bar has entered the picture, Mr. Foo
>>> will get to register maße.com during the sunrise period. Mr. Bar  
>>> will
>>> not be allowed to register either maße.com or masse.com because both
>>> of these collide with previously registered domain names.
>>> Let us now suppose that after the sunrise period, the registry is
>>> operating under IDNA2008 rules. Let us suppose that someone, Mr.  
>>> Baz,
>>> has registered "strasse.com" prior to the adoption of the IDNA2008
>>> rules. Let us also assume that he did not bother to register
>>> "straße.com" during the sunrise period (if he had, he would  
>>> presumably
>>> have that registration too).
>>> Now let Mr. Frotz try to register "straße.com" - under Failsafe Rule
>>> 1, he would be denied this registration. Mr. Baz still has the
>>> possibility of registering it.
>>> If someone looks up "straße.com" under IDNA2003-compliant rules, he
>>> will get "strasse.com" unambiguously.
>>> If someone looks up "straße.com" under IDNA2008-compliant rules, he
>>> will get NXDOMAIN. This is a kind of brokenness but perhaps this is
>>> tolerable if it does not steer the party to the "wrong" site - and  
>>> it
>>> potentially allows Mr. Baz to recover from his earlier choice not to
>>> register the "ß" version of his SLD earlier.
>>> Now let us suppose that "strasse.com" has NOT been registered at  
>>> all,
>>> the sunrise happens, and we are now operating under IDNA2008 rules.
>>> Mr. Frotz registers "straße.com". Since there is no collision with a
>>> previously registered "strasse.com" there is no problem. Let us
>>> suppose that Mr. Frotz does not bother to register "strasse.com".
>>> If someone looks up "straße.com" under IDNA2003-compliant rules, he
>>> will get NXDOMAIN because "strasse.com" does not exist.
>>> If someone looks up "straße.com" under IDNA2008-compliant rules, he
>>> will get the corresponding IP address.
>>> If someone looks up "strasse.com" under IDNA2008-compliant rules, he
>>> will get NXDOMAIN because it has not been registered.
>>> Because the registry is operating under IDNA2008-rules, "ß" and "ss"
>>> are considered distinct and the party using IDNA2003-rules to look  
>>> up
>>> a domain name registered under IDNA2008 rules is getting a "correct"
>>> response in some sense (in this case, NXDOMAIN). At least the lookup
>>> does not lead to the "wrong IP address".
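
The three lookups in this scenario can be traced with a small sketch. It assumes the zone is a dict from label to an address placeholder, that None stands in for NXDOMAIN, and that the only difference between the two rule sets is whether Sharp-S is mapped to "ss" before the lookup. The function and the placeholder string are hypothetical.

```python
from typing import Optional


def lookup(label: str, zone: dict[str, str], rules: str) -> Optional[str]:
    """Return the zone entry for label, or None to stand in for NXDOMAIN."""
    if rules == "IDNA2003":
        label = label.replace("ß", "ss")  # Sharp-S is mapped before lookup
    elif rules != "IDNA2008":
        raise ValueError(f"unknown rule set: {rules}")
    return zone.get(label)


# Only "straße" has been registered (by Mr. Frotz); "strasse" has not.
zone = {"straße": "address of straße.com"}

print(lookup("straße", zone, "IDNA2003"))   # None (NXDOMAIN)
print(lookup("straße", zone, "IDNA2008"))   # address of straße.com
print(lookup("strasse", zone, "IDNA2008"))  # None (NXDOMAIN)
```

None of the three lookups returns an address belonging to a different registrant.
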
>>> If Mr. Frotz registers both "strasse.com" and "straße.com" (assuming
>>> neither of these violates Failsafe Rules (1) and (2) at registration
>>> time), his registrations will work for both IDNA2003-compliant and
>>> IDNA2008-compliant lookups. Whether queries using the two strings
>>> will produce the same results or not will still be up to him and not
>>> the registry: there is no practical way to avoid that.
>>> Let us suppose, again, that Mr. Frotz successfully registers
>>> "straße.com" under IDNA2008 rules but does not bother to register
>>> "strasse.com".
>>> Now let us suppose that Mr. FUBAR tries to register "strasse.com"
>>> subsequent to Mr. Frotz's registration of "straße.com". When he  
>>> tries
>>> to do this, he would be blocked from that registration under  
>>> Failsafe
>>> Rule (2). Or, under the more permissive variation, Mr. Frotz would
>>> have an additional opportunity to block Mr. FUBAR's registration by
>>> registering "strasse.com" himself.
>>> I believe that adoption of Failsafe Rules (1) and (2) would permit
>>> each registry (in the general sense - all levels) to introduce
>>> IDNA2008 rules whenever they wish, and to provide for sunrise time
>>> periods of their choosing. The failures that occur (NXDOMAIN) are  
>>> not
>>> harmful in the same way that "wrong IP address" would be harmful and
>>> perhaps this form of "failure" would be an acceptable price to pay  
>>> for
>>> some period of time when IDNA2003-compliant and IDNA2008-compliant
>>> systems were in concurrent operation.
>>> I hope this isn't completely nuts.
>>> vint
>>> from John Klensin:
>>> The suggested process could be used to create a five-stage process:
>>> (1) No registrations that actually involve Sharp-S (the status quo)
>>> (2) Sunrise -- priority registrations of Sharp-S labels for those
>>> who already have labels containing "ss".
>>> (3) No possibly-conflicting registrations, using Failsafe Rules 1
>>> and 2 as written; starting time to be determined by registry
>>> (4) Possibly-conflicting registrations permitted only after the
>>> original registrant gets notification and an additional
>>> opportunity to register the name herself; starting date again
>>> determined by the registry
>>> (5) Sharp-S is just another character with no special treatment;
>>> starting date again determined by the registry.
>>> _______________________________________________
>>> Idna-update mailing list
>>> Idna-update at alvestrand.no
>>> http://www.alvestrand.no/mailman/listinfo/idna-update
> -- 
> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
> #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp
> <sz.rb><count_sz.txt>