The real issue: interopability, and a proposal (Was: Consensus Call on Latin Sharp S and Greek Final Sigma)

Debbie Garside debbie at ictmarketing.co.uk
Thu Dec 3 14:29:01 CET 2009


I think adding a TRANSITIONAL status may go some way towards alleviating this problem, although, as Georg pointed out, this would mean that the WG would need to reconvene.  However, if you add a TRANSITION DATE field, everyone knows where they stand (and when), and there is no need to reconvene.  You could also add a “Transitional relationship” field, which would include ss for ß, and add text to the document stating that registries should bundle transitional characters until the TRANSITION DATE, when ß et al. would become PVALID.
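
As a rough illustration of the two extra fields, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than anything taken from the drafts: the record layout, the field names, the helper functions, and the 1 January 2016 date (which only echoes the "bundling until 2016" idea below).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ExceptionEntry:
    """Hypothetical exceptions-table row carrying the two proposed fields."""
    char: str
    status: str                                       # "PVALID", "TRANSITIONAL", ...
    transition_date: Optional[date] = None            # proposed TRANSITION DATE field
    transitional_relationship: Optional[str] = None   # e.g. "ss" for "ß"


def effective_status(entry: ExceptionEntry, today: date) -> str:
    """TRANSITIONAL becomes PVALID on the transition date, with no new WG decision."""
    if (entry.status == "TRANSITIONAL"
            and entry.transition_date is not None
            and today >= entry.transition_date):
        return "PVALID"
    return entry.status


def bundle_partner(label: str, entry: ExceptionEntry, today: date) -> Optional[str]:
    """Label a registry would bundle with `label` while the transition lasts."""
    if effective_status(entry, today) != "TRANSITIONAL":
        return None
    if entry.char not in label or entry.transitional_relationship is None:
        return None
    return label.replace(entry.char, entry.transitional_relationship)


# The date below is an assumption for illustration only.
SHARP_S = ExceptionEntry("ß", "TRANSITIONAL", date(2016, 1, 1), "ss")

print(bundle_partner("straße", SHARP_S, date(2009, 12, 3)))  # strasse
print(effective_status(SHARP_S, date(2016, 1, 1)))           # PVALID
```

The point of the date field is visible in `effective_status`: the change to PVALID is driven by the calendar alone, so nobody has to publish anything new when the transition ends.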

 

Mark wrote:

 

>> That will cause currently valid URLs to fail, but that is far better than having them have ambiguous targets. This way we get to the long-term goal of having these characters be PVALID, without having the disruption during the interim.

 

I don’t like the idea of currently valid URLs failing.  This would be addressed (I think) by bundling until 2016?

 

Best regards

 

Debbie

 

   _____  

From: idna-update-bounces at alvestrand.no [mailto:idna-update-bounces at alvestrand.no] On Behalf Of Mark Davis
Sent: 01 December 2009 17:49
To: Alexander Mayrhofer
Cc: Shawn Steele; Patrik Fältström; Harald Alvestrand; idna-update at alvestrand.no; lisa Dusseault; "Martin J. Dürst"; Vint Cerf
Subject: Re: The real issue: interopability, and a proposal (Was: Consensus Call on Latin Sharp S and Greek Final Sigma)

 

I don't think that anyone at this point would really stand in the way of these characters being PVALID, if it weren't for compatibility problems. To that end, I think the key issue is the transition strategy: how to deal with the 5 or so years where the browser implementations are transitioning to IDNA2008. If we had an adequate strategy, I don't think anyone would really stand in the way of having the 4 problem characters be valid.

These 4 characters are unlike symbols in two ways: (a) with symbols you don't go to two different places with two different browsers, and (b) symbols are far less frequent than these characters. So even though the prohibition on symbols was based on no particular evidence, the prohibition doesn't cause a severe compatibility issue.

When reading some of the transition proposals, one approach occurred to me. What if we have a new status for the 4 characters: TRANSITIONAL?

We set it up this way: in IDNA2008, TRANSITIONAL characters are invalid for registration and lookup, AND cannot be mapped. After a period of some years, once the percentage of IDNA2003 browsers and emailers has dropped to a small proportion, the stated plan is to issue a new version of IDNA that changes them to PVALID.

That will cause currently valid URLs to fail, but that is far better than having them have ambiguous targets. This way we get to the long-term goal of having these characters be PVALID, without having the disruption during the interim.

===

As far as Harald's back-of-the-envelope calculations go, they present a very inaccurate picture of the scale. Here are some more exact figures for that data.

1.	819,600,672    = sample size of documents
2.	5,000    = links with eszett in the sample
3.	1,000,000,000,000    = total documents in index (2008)
4.	1,220    = scaling factor (= total docs / sample size)
5.	6,100,532    = estimated total links with eszett (= scaling * sample eszett links)
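
The arithmetic behind items 4 and 5 can be reproduced in a few lines of Python. This is only a check of the figures listed above, with variable names chosen for the example. Item 5 comes out as listed only when the unrounded ratio is used; multiplying by the rounded 1,220 would give 6,100,000 instead.

```python
sample_docs = 819_600_672           # item 1
sample_sharp_s_links = 5_000        # item 2
total_docs = 1_000_000_000_000      # item 3

# Item 4: scaling factor, kept unrounded for the next step (about 1220.1).
scaling = total_docs / sample_docs

# Item 5: estimated total links containing the sharp s.
estimated_links = round(scaling * sample_sharp_s_links)

print(f"{scaling:,.0f}")        # 1,220
print(f"{estimated_links:,}")   # 6,100,532
```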

Even this has to be taken with a certain grain of salt, since (a) it is assuming that the sample is representative (although we have reasonable confidence in that), and (b) it doesn't weight the "importance" of the links (in terms of the number of times they are followed), and (c) this data was collected back in Nov 2008, so we've had another year of growth since then.

Mark



On Tue, Dec 1, 2009 at 01:59, Alexander Mayrhofer <alexander.mayrhofer at nic.at> wrote:


(I've spent quite some time on re-thinking the issue last night. It's a bit longish, and the promised proposal is at the end).

I think I didn't make it clear enough in my previous messages that I'm not an opponent of the character Latin Sharp S itself. I'm opposed to changes that have a high risk of introducing interoperability problems, particularly in the long run.

My *only* major concern is that the introduction of the Latin Sharp S is exactly such a case, and a particularly nasty one. I understand that the majority of WG participants think that "ß" should be PVALID (I'm carefully avoiding the word "consensus" here, because it's obviously up to the WG chair to declare that).

If I look at the issue in an isolated way, not considering any compatibility/interoperability issues, then it makes perfect sense to declare "ß" PVALID, because (this is sort of convincing myself here ;) :

- There seems to be little existing deployment of ß-labels out there, at least on the web. The client side is a different issue: there's nearly 100% deployment. We can also, err, guesstimate that "ß" has only about 1% of the deployment of other German "umlauts", according to Erik's numbers (as Eric pointed out, those numbers have no indication of confidence, though). We don't know how many people type "ß" into their browser address bar, though, which is at least "unsatisfying" from an engineering perspective.

- The character is undoubtedly part of German grammar, at least in two of the three countries where German is an official language - I don't know about the minorities in other countries. The upper case variant as well as the Unicode casing and folding is... well, extravagant - but the lowercase "ß" is definitely part of the grammar.

- Georg's argument that this would be "the last chance" to introduce "ß" got me thinking. If the "Exceptions" were implemented as an IANA registry, it would be much easier to add (and probably remove) characters. But given that changes to the Exceptions now require an update to the base specification, we should probably take this opportunity, rather than waiting for IDNA2015.

So, as I said multiple times, the problem is changing the semantics of a part of the namespace, definitely from the user's perspective - one could argue whether or not that means the "protocol semantics" change, since the mapping step is part of the protocol of IDNA2003.

Regarding interoperability, I'm not so much concerned about the transition period between IDNA2003 and IDNAbis. This will be painful, but it will (hopefully) be temporary.

What I am more concerned about is that the legacy of the "ß-ss" mapping would introduce incompatibility for an indefinite period of time, *after* all clients have switched over to IDNAbis. This could happen because some vendors would implement mappings to be fully IDNA2003 backwards compatible, and others would implement the informative idnabis-mappings only.
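
The divergence described here can be made concrete with a small Python sketch. It uses `str.casefold()` as a stand-in for the IDNA2003 mapping step (an assumption for illustration; real IDNA2003 uses Nameprep, which likewise maps "ß" to "ss") and the standard `punycode` codec for the unmapped label. The helper names are made up for the example.

```python
def to_wire(label: str) -> str:
    """ASCII labels go out as-is; others get the ACE prefix plus Punycode."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def lookup_with_legacy_mapping(label: str) -> str:
    """Client that keeps an IDNA2003-style mapping: "ß" is folded to "ss" first."""
    mapped = label.casefold()   # stand-in for Nameprep; folds "ß" to "ss"
    return to_wire(mapped)


def lookup_without_mapping(label: str) -> str:
    """Client that treats "ß" as PVALID and looks it up unchanged."""
    return to_wire(label)


legacy = lookup_with_legacy_mapping("straße")
strict = lookup_without_mapping("straße")

print(legacy)            # strasse
print(strict != legacy)  # True: the two clients query different names
```

The same typed input produces two different names on the wire, so as long as both kinds of client exist, the user's destination depends on which vendor's software is in use.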


