The real issue: interopability, and a proposal (Was: Consensus Call on Latin Sharp S and Greek Final Sigma)

Mark Davis ☕ mark at macchiato.com
Wed Dec 2 02:58:20 CET 2009


Tina, that's all I was saying. I wasn't saying anything about the relative
importance of languages per se; I was talking about the importance *in
explaining the difference in percentages of IDNs found in the Google
analysis*.

Mark


On Tue, Dec 1, 2009 at 17:54, Shawn Steele <Shawn.Steele at microsoft.com> wrote:

> I interpreted Mark’s statement (perhaps incorrectly) as meaning that though
> the % of URLs with final sigma is smaller, so is the amount of Greek
> content.  And if you do the math, final sigma appears to be as important to
> Greek as Eszett is to German, despite the smaller percentage of links.  In
> other words, the problems in Greek and German are comparable.  (I could be
> completely wrong, but that’s what I read :-) )
>
>
>
> -Shawn
>
>
>
> *From:* Tina Dam [mailto:tina.dam at icann.org]
> *Sent:* ,  01,  2009 17:51
> *To:* Mark Davis ☕
>
> *Cc:* Erik van der Poel; Shawn Steele; Patrik Fältström; Harald
> Alvestrand; idna-update at alvestrand.no; lisa Dusseault; Alexander
> Mayrhofer; Martin J. Dürst; Vint Cerf
> *Subject:* RE: The real issue: interopability, and a proposal (Was:
> Consensus Call on Latin Sharp S and Greek Final Sigma)
>
>
>
> I disagreed with your assessment: “that the relative proportion per
> language is important.” I think all languages should be able to be
> represented no matter the size of population (I know you did not say that)
> or the amount of content. After all, that is what IDNs are all about:
> getting the Internet addresses out to the regions where usability of the
> Internet otherwise is difficult.
>
> The explanation I added was “in addition” to yours. But to comment on
> your note below: one reason why Chinese, for example, has a larger uptake is
> the way they implemented it (with no need inside China to type the .cn). So
> I really don’t think it is fair to compare Greek with German and so forth.
> If you want to make the comparison, then I think we need to look more
> broadly than just the amount of content. Other than that, I think content
> goes hand in hand with population size, Internet usage in the population,
> and whether addresses are available that match the content language (or
> script, if you like). Hopefully getting the IDNs out the door and in a more
> useful manner will make the other areas grow as well.
>
> Tina
>
>
>
>
>
>
>
> *From:* mark.edward.davis at gmail.com [mailto:mark.edward.davis at gmail.com]
> *On Behalf Of* Mark Davis ☕
> *Sent:* Tuesday, December 01, 2009 5:35 PM
> *To:* Tina Dam
> *Cc:* Erik van der Poel; Shawn Steele; Patrik Fältström; Harald
> Alvestrand; idna-update at alvestrand.no; lisa Dusseault; Alexander
> Mayrhofer; Martin J. Dürst; Vint Cerf
> *Subject:* Re: The real issue: interopability, and a proposal (Was:
> Consensus Call on Latin Sharp S and Greek Final Sigma)
>
>
>
> I'm not exactly sure what you were disagreeing with. I wrote: "One
> addition: there is over 35 times as much German content as Greek, so that
> explains part of the difference in final-sigma vs eszed proportion. (The
> relative proportion per language is important.)"
>
> So which of these three are you disagreeing with?
>
>    - the relative proportion of content, or
>    - that it explains part of the difference, or
>    - that the relative proportion per language is important.
>
> As for your point, it is a valid one: the uptake of IDNs has varied by
> script. However, your figures and conclusion are not accurate; by our
> measurements at Google, the uptake in each of Korea, Taiwan, and Hong Kong
> is higher than in Germany, and Japan is near Germany. And all of those are
> non-Latin. Uptake in Greece is about half that of Germany, but still
> respectable. So while I agree that part of the difference is the uptake, I
> still believe that -- as I said -- part (actually a large part) of the
> difference is due to the 35 : 1 ratio of web content.
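
A rough sketch of the arithmetic behind this point, assuming the 2009-11-28 link counts that Erik posts in the message quoted further down (8,981 Eszett links and 305 final-sigma links) and taking the 35 : 1 content ratio at face value. The variable names are illustrative only, and uptake differences are deliberately ignored.

```python
# Link counts from Erik's 2009-11-28 sample, quoted later in this thread.
eszett_links = 8_981
final_sigma_links = 305

# Mark's estimate: roughly 35 times as much German content as Greek.
german_to_greek_content = 35

# Raw ratio of Eszett links to final-sigma links in the sample.
raw_ratio = eszett_links / final_sigma_links

# Ratio after dividing out the difference in content volume.
# A value near 1 means the two characters are about equally common
# per unit of content in their respective languages.
normalized_ratio = raw_ratio / german_to_greek_content

print(f"raw ratio:        {raw_ratio:.1f}")
print(f"normalized ratio: {normalized_ratio:.2f}")
```

On these numbers the raw gap of roughly 29 : 1 shrinks to a little under 1 : 1 once content volume is factored in, which is the sense in which the content ratio accounts for a large part of the difference.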
>
> Mark
>
> 2009/12/1 Tina Dam <tina.dam at icann.org>
>
> Hi Mark, I respectfully disagree with your assessment. The correspondence
> I have received over the last several years also suggests another reason
> that Greek and other non-Latin based addresses are not used as much.
> It is simply quite inconvenient to have IDNs at the second level in
> non-Latin based scripts, then to switch script when typing the top level
> portion of the address.
>
>
>
> For that reason IDNs at the second level in certain scripts have not been
> introduced or, if introduced, are not considered useful.
>
>
>
> This is anticipated to change as we get the IDN TLDs launched.
>
>
>
> Tina
>
>
>
> *From:* idna-update-bounces at alvestrand.no [mailto:
> idna-update-bounces at alvestrand.no] *On Behalf Of* Mark Davis ☕
> *Sent:* Tuesday, December 01, 2009 3:23 PM
> *To:* Erik van der Poel
> *Cc:* Shawn Steele; Patrik Fältström; Harald Alvestrand;
> idna-update at alvestrand.no; lisa Dusseault; Alexander Mayrhofer; Martin J.
> Dürst; Vint Cerf
> *Subject:* Re: The real issue: interopability, and a proposal (Was:
> Consensus Call on Latin Sharp S and Greek Final Sigma)
>
>
>
> One addition: there is over 35 times as much German content as Greek, so
> that explains part of the difference in final-sigma vs eszed proportion.
> (The relative proportion per language is important.)
>
> Mark
>
> 2009/12/1 Erik van der Poel <erikv at google.com>
>
> I ran the program again today, and Eszett is being used a bit more now
> than it was last year.
>
> 2009-11-28
> 1,253,099,703 documents
> 88,712,912,831 links
> 8,981 Eszett in host name in link 0.00001%
> furz-großerfurz.de <http://furz-grosserfurz.de>
> www.bußgeldexperten.de <http://www.bussgeldexperten.de>
> www.metzgerei-gaßner.de <http://www.metzgerei-gassner.de>
>
> 2008-11-19
> 819,600,672 documents
> 49,904,513,188 links
> 2,739 Eszett in host name in link 0.0000055%
> www.rtc-großefehn.de <http://www.rtc-grossefehn.de>
> www.mein-fußballclub.de <http://www.mein-fussballclub.de>
> www.dermaßanzug.com <http://www.dermassanzug.com>
>
> 2006-11-27
> 889,759,121 documents
> 1,973 Eszett in host name in document URL 0.00022%
> www.uni-gießen.de <http://www.uni-giessen.de>
> www.uni-gießen.de <http://www.uni-giessen.de>
> www.uni-gießen.de <http://www.uni-giessen.de>
>
> All 3 of the samples were "high value" documents in our index. The
> 2006 sample looked for Eszett in the host name in the URL of the
> document (rather than links inside the document). It is no longer
> possible to find Eszett in the URLs of our documents because they are
> now all mapped to "ss". So the 2006 sample cannot really be compared
> with the others because the URL of a document always contains a host
> name, while a link inside a document might be a relative URL (without
> a host name).
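
The mapping described here can be reproduced with Python's built-in `idna` codec, which implements the IDNA2003 rules (not IDNA2008). This is only a sketch of the behaviour under discussion, not the code used for the analysis; the helper name `to_ascii_2003` is made up for illustration, and the host names are taken from the samples in this thread.

```python
def to_ascii_2003(host: str) -> str:
    """Convert a host name to its ASCII form using Python's IDNA2003 codec."""
    return host.encode("idna").decode("ascii")

# Eszett is mapped to "ss" before encoding, so the result is plain ASCII.
eszett = to_ascii_2003("www.bußgeldexperten.de")

# Final sigma (U+03C2) is mapped to ordinary sigma (U+03C3), so both
# spellings produce the same Punycode label.
final_sigma = to_ascii_2003("www.γυναικολόγος.gr")
medial_sigma = to_ascii_2003("www.γυναικολόγοσ.gr")

print(eszett)
print(final_sigma)
print(final_sigma == medial_sigma)
```

Because both characters are mapped away before the name reaches the DNS, a name registered with "ss" or "σ" is indistinguishable from one typed with "ß" or "ς" under these rules.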
>
> The Final Sigma has not grown as much:
>
> 2009-11-28
> 305 final sigma in host name in link
> 0.00000034%
> www.γυναικολόγος.gr <http://www.xn--mxadbxfgktc4bn4g.gr>
> www.γυναικολόγος.gr <http://www.xn--mxadbxfgktc4bn4g.gr>
> www.γυναικολόγος.gr <http://www.xn--mxadbxfgktc4bn4g.gr>
>
> 2008-11-19
> 138 final sigma in host name in link
> 0.00000028%
> www.ταβερνες.gr <http://www.xn--mxacja3bxaqb.gr>
> www.ελληναΐς.gr <http://www.xn--owa9dlitap4c.gr>
> www.γυναικολόγος.gr <http://www.xn--mxadbxfgktc4bn4g.gr>
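
The percentages above can be re-derived from the raw counts. The sketch below uses only the 2008 and 2009 figures quoted in this message (the 2006 sample is left out because it counted document URLs rather than links); the dictionary layout and names are illustrative.

```python
# Link totals and per-character counts from the two comparable samples.
samples = {
    "2008-11-19": {"links": 49_904_513_188, "eszett": 2_739, "final_sigma": 138},
    "2009-11-28": {"links": 88_712_912_831, "eszett": 8_981, "final_sigma": 305},
}

def share_percent(count: int, links: int) -> float:
    """Share of all links, in percent, that contain the character."""
    return 100.0 * count / links

shares = {
    date: {
        ch: share_percent(s[ch], s["links"])
        for ch in ("eszett", "final_sigma")
    }
    for date, s in samples.items()
}

# Year-over-year growth of each character's share of links.
eszett_growth = shares["2009-11-28"]["eszett"] / shares["2008-11-19"]["eszett"]
sigma_growth = shares["2009-11-28"]["final_sigma"] / shares["2008-11-19"]["final_sigma"]

for date, by_char in shares.items():
    print(date, f"eszett {by_char['eszett']:.7f}%",
          f"final sigma {by_char['final_sigma']:.8f}%")
print(f"eszett share grew x{eszett_growth:.2f}, final sigma x{sigma_growth:.2f}")
```

Measured as a share of all links, Eszett grew by a factor of roughly 1.8 between the two samples and final sigma by roughly 1.2, which matches the remark that final sigma has not grown as much.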
>
> Erik
>
>
> On Tue, Dec 1, 2009 at 11:49 AM, Mark Davis ☕ <mark at macchiato.com> wrote:
> > It is approximately 60, as you computed. The trillion figure was in a
> > public posting from July 2008, which is why we can quote it.
> >
> > Mark
> >
> >
> > 2009/12/1 Harald Alvestrand <harald at alvestrand.no>
> >>
> >> Mark Davis ☕ wrote:
> >>>
> >>> As far as Harald's back-of-the-envelope calculations go, they present a
> >>> very inaccurate picture of the scale. Here are some more exact figures
> >>> for that data.
> >>>
> >>>   1. 819,600,672    = sample size of documents
> >>>   2. 5,000    = links with eszed in the sample
> >>>   3. 1,000,000,000,000    = total documents in index (2008)
> >>>   4. 1,220    = scaling factor (= total docs / sample size)
> >>>   5. 6,100,532    = estimated total links with eszed (= scaling *
> >>>      sample eszed links)
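
The extrapolation in the list above is a simple linear scaling, sketched here with the figures exactly as listed. It assumes, as the message itself notes, that the sample is representative of the whole index; the variable names are illustrative.

```python
sample_documents = 819_600_672          # item 1: documents in the sample
sample_eszett_links = 5_000             # item 2: links with Eszett in the sample
total_documents = 1_000_000_000_000     # item 3: documents in the 2008 index

# Item 4: how many times larger the full index is than the sample.
scaling_factor = total_documents / sample_documents

# Item 5: scale the sample count up to the full index.
estimated_total_links = scaling_factor * sample_eszett_links

print(f"scaling factor:        {scaling_factor:,.0f}")
print(f"estimated total links: {estimated_total_links:,.0f}")
```

The unrounded scaling factor is about 1,220.1, which is why the estimate comes out as 6,100,532 rather than exactly 1,220 × 5,000.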
> >>>
> >>> Even this has to be taken with a certain grain of salt, since (a) it is
> >>> assuming that the sample is representative (although we have reasonable
> >>> confidence in that), and (b) it doesn't weight the "importance" of the
> >>> links (in terms of the number of times they are followed), and (c) this
> >>> data was collected back in Nov 2008, so we've had another year of growth
> >>> since then.
> >>
> >> I obviously need a bigger envelope :-) - I didn't think we had one
> >> trillion documents in the 2008 index.
> >>
> >> One missing number: how many links per document?
> >>
> >> Obviously #eszed links / #documents can't be the basis of the 0.00001%
> >> figure that Erik quoted, because 5000/819600672 = 0.00061005%, which is
> >> a factor of 60 larger than 0.00001%. But if we estimate 60 links per
> >> document, the 0.00001% fits nicely as the percentage of links that
> >> contain eszed.
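
The "60 links per document" estimate can be checked against the link total Erik reports for the 2008-11-19 sample (49,904,513,188 links over the same 819,600,672 documents). This sketch assumes the figures in the two messages refer to the same sample, which the matching document count suggests but the thread does not state outright.

```python
sample_documents = 819_600_672
sample_links = 49_904_513_188     # from Erik's 2008-11-19 sample
eszett_links_quoted = 5_000       # the figure used in Mark's list

# Eszett links per document, in percent (Harald's 0.00061005%).
per_document_percent = 100.0 * eszett_links_quoted / sample_documents

# Average number of links per document in the sample.
links_per_document = sample_links / sample_documents

# Dividing by links per document converts to a percentage of links.
per_link_percent = per_document_percent / links_per_document

print(f"per document:       {per_document_percent:.8f}%")
print(f"links per document: {links_per_document:.1f}")
print(f"per link:           {per_link_percent:.5f}%")
```

The sample averages just under 61 links per document, so dividing the per-document figure by that value lands on the 0.00001% of links that Erik quoted.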
> >>
> >>              Harald
> >>
> >>
> >>
> >
> >
>
> > _______________________________________________
> > Idna-update mailing list
> > Idna-update at alvestrand.no
> > http://www.alvestrand.no/mailman/listinfo/idna-update