The real issue: interopability, and a proposal (Was: Consensus Call on Latin Sharp S and Greek Final Sigma)

Shawn Steele Shawn.Steele at microsoft.com
Wed Dec 2 02:56:42 CET 2009


The list complained about too many recipients?  Trying again with fewer of you ☺

From: Shawn Steele
Sent: ,  01,  2009 17:55
To: 'Tina Dam'; Mark Davis ☕
Cc: Erik van der Poel; Patrik Fältström; Harald Alvestrand; idna-update at alvestrand.no; lisa Dusseault; Alexander Mayrhofer; Martin J. Dürst; Vint Cerf
Subject: RE: The real issue: interopability, and a proposal (Was: Consensus Call on Latin Sharp S and Greek Final Sigma)

I interpreted Mark’s statement (perhaps incorrectly) as meaning that though the % of URLs with final sigma is smaller, so is the amount of Greek content.  And if you do the math, final sigma appears to be as important to Greek as Eszett is to German, despite the smaller percentage of links.  In other words, the problems in Greek and German are comparable.  (I could be completely wrong, but that’s what I read ☺)

-Shawn
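
A rough sketch of "the math" referred to above, using the 2009-11-28 link counts from Erik's message and Mark's 35:1 content ratio, both quoted further down in this thread. It assumes that link counts scale roughly with the amount of content per language, which nobody in the thread has verified; the variable names are illustrative only.

```python
# Counts from the 2009-11-28 sample reported in Erik's message.
eszett_links = 8_981
final_sigma_links = 305

# Mark's figure: over 35 times as much German content as Greek.
# Assumption: host-name links scale roughly in proportion to content.
german_to_greek_content = 35

# Raw ratio of Eszett links to final-sigma links in the sample.
raw_ratio = eszett_links / final_sigma_links

# Ratio after dividing out the difference in content volume.
normalized_ratio = raw_ratio / german_to_greek_content

print(f"raw Eszett : final sigma ratio   = {raw_ratio:.1f}")
print(f"ratio normalized by content size = {normalized_ratio:.2f}")
```

A normalized ratio near 1 would mean the two characters turn up about equally often relative to the amount of content in each language, which is the reading given above.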

From: Tina Dam [mailto:tina.dam at icann.org]
Sent: ,  01,  2009 17:51
To: Mark Davis ☕
Cc: Erik van der Poel; Shawn Steele; Patrik Fältström; Harald Alvestrand; idna-update at alvestrand.no; lisa Dusseault; Alexander Mayrhofer; Martin J. Dürst; Vint Cerf
Subject: RE: The real issue: interopability, and a proposal (Was: Consensus Call on Latin Sharp S and Greek Final Sigma)

I disagreed with your assessment “that the relative proportion per language is important.” I think all languages should be able to be represented regardless of the size of the population (I know you did not say that) or the amount of content. After all, that is what IDNs are all about: getting Internet addresses out to the regions where using the Internet is otherwise difficult.
The explanation I added was “in addition” to yours. But to comment on your note below: one reason why Chinese, for example, has a larger uptake is the way they implemented it (with no need inside China to type the .cn). So I really don’t think it is fair to compare Greek with German and so forth. If you want to make the comparison, then I think we need to look more broadly than just the amount of content. Other than that, I think content goes hand in hand with population size, Internet usage in the population, and whether addresses are available that match the content language or script, if you like. Hopefully getting the IDNs out the door and in a more useful manner will make the other areas grow as well.
Tina



From: mark.edward.davis at gmail.com [mailto:mark.edward.davis at gmail.com] On Behalf Of Mark Davis ☕
Sent: Tuesday, December 01, 2009 5:35 PM
To: Tina Dam
Cc: Erik van der Poel; Shawn Steele; Patrik Fältström; Harald Alvestrand; idna-update at alvestrand.no; lisa Dusseault; Alexander Mayrhofer; Martin J. Dürst; Vint Cerf
Subject: Re: The real issue: interopability, and a proposal (Was: Consensus Call on Latin Sharp S and Greek Final Sigma)

I'm not exactly sure what you were disagreeing with. I wrote: "One addition: there is over 35 times as much German content as Greek, so that explains part of the difference in final-sigma vs eszed proportion. (The relative proportion per language is important.)"

So which of these three are you disagreeing with?

  *   the relative proportion of content, or
  *   that it explains part of the difference, or
  *   that the relative proportion per language is important.
As for your point, it is a valid one: the uptake of IDNs has varied by script. However, your figures and conclusion are not accurate; by our measurements at Google, the uptake in each of Korea, Taiwan, and Hong Kong is higher than in Germany, and Japan is near Germany. And all of those are non-Latin. Uptake in Greece is about half that of Germany, but still respectable. So while I agree that part of the difference is the uptake, I still believe that -- as I said -- part (actually a large part) of the difference is due to the 35 : 1 ratio of web content.

Mark
2009/12/1 Tina Dam <tina.dam at icann.org>
Hi Mark, I respectfully disagree with your assessment. Also, the correspondence I have received over the last several years suggests another reason that Greek and other non-Latin-based addresses are not used as much: it is simply quite inconvenient to have IDNs at the second level in non-Latin-based scripts and then to switch script when typing the top-level portion of the address.

For that reason, IDNs at the second level in certain scripts have not been introduced or, if introduced, have not been considered useful.

This is anticipated to change as we get the IDN TLDs launched.

Tina

From: idna-update-bounces at alvestrand.no [mailto:idna-update-bounces at alvestrand.no] On Behalf Of Mark Davis ☕
Sent: Tuesday, December 01, 2009 3:23 PM
To: Erik van der Poel
Cc: Shawn Steele; Patrik Fältström; Harald Alvestrand; idna-update at alvestrand.no; lisa Dusseault; Alexander Mayrhofer; Martin J. Dürst; Vint Cerf
Subject: Re: The real issue: interopability, and a proposal (Was: Consensus Call on Latin Sharp S and Greek Final Sigma)

One addition: there is over 35 times as much German content as Greek, so that explains part of the difference in final-sigma vs eszed proportion. (The relative proportion per language is important.)

Mark
2009/12/1 Erik van der Poel <erikv at google.com>
I ran the program again today, and Eszett is being used a bit more now
than it was last year.

2009-11-28
1,253,099,703 documents
88,712,912,831 links
8,981 Eszett in host name in link 0.00001%
furz-großerfurz.de<http://furz-grosserfurz.de>
www.bußgeldexperten.de<http://www.bussgeldexperten.de>
www.metzgerei-gaßner.de<http://www.metzgerei-gassner.de>

2008-11-19
819,600,672 documents
49,904,513,188 links
2,739 Eszett in host name in link 0.0000055%
www.rtc-großefehn.de<http://www.rtc-grossefehn.de>
www.mein-fußballclub.de<http://www.mein-fussballclub.de>
www.dermaßanzug.com<http://www.dermassanzug.com>

2006-11-27
889,759,121 documents
1,973 Eszett in host name in document URL 0.00022%
www.uni-gießen.de<http://www.uni-giessen.de>
www.uni-gießen.de<http://www.uni-giessen.de>
www.uni-gießen.de<http://www.uni-giessen.de>

All 3 of the samples were "high value" documents in our index. The
2006 sample looked for Eszett in the host name in the URL of the
document (rather than links inside the document). It is no longer
possible to find Eszett in the URLs of our documents because they are
now all mapped to "ss". So the 2006 sample cannot really be compared
with the others because the URL of a document always contains a host
name, while a link inside a document might be a relative URL (without
a host name).

The Final Sigma has not grown as much:

2009-11-28
305 final sigma in host name in link
0.00000034%
www.γυναικολόγος.gr<http://www.xn--mxadbxfgktc4bn4g.gr>
www.γυναικολόγος.gr<http://www.xn--mxadbxfgktc4bn4g.gr>
www.γυναικολόγος.gr<http://www.xn--mxadbxfgktc4bn4g.gr>

2008-11-19
138 final sigma in host name in link
0.00000028%
www.ταβερνες.gr<http://www.xn--mxacja3bxaqb.gr>
www.ελληναΐς.gr<http://www.xn--owa9dlitap4c.gr>
www.γυναικολόγος.gr<http://www.xn--mxadbxfgktc4bn4g.gr>

Erik
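
For readers who want to re-derive the percentages above, here is a small sketch that recomputes them from the raw counts in the 2008 and 2009 link samples. The dictionary layout and function name are assumptions made for illustration; only the numbers come from the message. The 2006 sample is left out because, as noted above, it counted document URLs rather than links.

```python
# Raw counts copied from the two link-based samples in the message above.
samples = {
    "2009-11-28": {"links": 88_712_912_831, "eszett": 8_981, "final_sigma": 305},
    "2008-11-19": {"links": 49_904_513_188, "eszett": 2_739, "final_sigma": 138},
}


def percent_of_links(count: int, links: int) -> float:
    """Return count as a percentage of all links in the sample."""
    return 100.0 * count / links


for date, s in samples.items():
    eszett_pct = percent_of_links(s["eszett"], s["links"])
    sigma_pct = percent_of_links(s["final_sigma"], s["links"])
    print(f"{date}: Eszett {eszett_pct:.7f}%  final sigma {sigma_pct:.8f}%")


def share_growth(key: str) -> float:
    """Year-over-year growth in the share of links, 2008 to 2009."""
    new, old = samples["2009-11-28"], samples["2008-11-19"]
    return (new[key] / new["links"]) / (old[key] / old["links"])


eszett_growth = share_growth("eszett")
final_sigma_growth = share_growth("final_sigma")
print(f"Eszett share grew by a factor of {eszett_growth:.2f}")
print(f"Final sigma share grew by a factor of {final_sigma_growth:.2f}")
```

Comparing shares of links rather than raw counts matters here because the 2009 sample contains far more links than the 2008 one.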

On Tue, Dec 1, 2009 at 11:49 AM, Mark Davis ☕ <mark at macchiato.com> wrote:
> It is approximately 60, as you computed. The trillion figure was in a public
> posting from July 2008, which is why we can quote it.
>
> Mark
>
>
> 2009/12/1 Harald Alvestrand <harald at alvestrand.no>
>>
>> Mark Davis ☕ wrote:
>>>
>>> As far as Harald's back-of-the-envelope calculations go, they present a
>>> very inaccurate picture of the scale. Here are some more exact figures for
>>> that data.
>>>
>>>   1. 819,600,672    = sample size of documents
>>>   2. 5,000    = links with eszed in the sample
>>>   3. 1,000,000,000,000    = total documents in index (2008)
>>>   4. 1,220    = scaling factor (= total docs / sample size)
>>>   5. 6,100,532    = estimated total links with eszed (= scaling *
>>>      sample eszed links)
>>>
>>> Even this has to be taken with a certain grain of salt, since (a) it is
>>> assuming that the sample is representative (although we have reasonable
>>> confidence in that), and (b) it doesn't weight the "importance" of the links
>>> (in terms of the number of times they are followed), and (c) this data was
>>> collected back in Nov 2008, so we've had another year of growth since then.
>>
>> I obviously need a bigger envelope :-) - I didn't think we had one
>> trillion documents in the 2008 index.
>>
>> One missing number: how many links per document?
>>
>> Obviously #eszed links / #documents can't be the basis of the 0.00001%
>> figure that Erik quoted, because 5000/819600672 = 0.00061005%, not 0.00001%,
>> which is a factor of 60 larger, but if we estimate 60 links per document,
>> the 0.00001% fits nicely as the percentage of links that contain eszed.
>>
>>              Harald
>>
>>
>>
>
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>
>

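The scaling estimate and the links-per-document question in the quoted exchange above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, the 5,000 figure is the rounded count used in the quoted list, and the last step assumes the 49,904,513,188 links reported for the 2008-11-19 sample belong to the same 819,600,672 documents.

```python
# Figures from the quoted list (2008 data).
sample_docs = 819_600_672
sample_eszett_links = 5_000          # rounded count used in the quoted list
total_docs = 1_000_000_000_000       # publicly quoted index size, July 2008

# Steps 4 and 5 of the list: scale the sample up to the whole index.
scaling_factor = total_docs / sample_docs
estimated_total_links = scaling_factor * sample_eszett_links
print(f"scaling factor           = {scaling_factor:,.0f}")
print(f"estimated Eszett links   = {estimated_total_links:,.0f}")

# Harald's check: Eszett links per document, as a percentage.
per_document_pct = 100.0 * sample_eszett_links / sample_docs
quoted_per_link_pct = 0.00001        # the percentage Erik quoted per link
implied_links_per_doc = per_document_pct / quoted_per_link_pct
print(f"Eszett links / documents = {per_document_pct:.8f}%")
print(f"implied links per doc    = {implied_links_per_doc:.0f}")

# Cross-check against the link total reported for the 2008-11-19 sample
# (assumed to cover the same documents).
sample_links_2008 = 49_904_513_188
measured_links_per_doc = sample_links_2008 / sample_docs
print(f"measured links per doc   = {measured_links_per_doc:.1f}")
```

Both routes land close to 60 links per document, consistent with the "approximately 60" confirmed in the reply.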
