The real issue: interoperability, and a proposal (Was: Consensus Call on Latin Sharp S and Greek Final Sigma)

Mark Davis ☕ mark at macchiato.com
Wed Dec 2 00:23:27 CET 2009


One addition: there is over 35 times as much German content as Greek, which
explains part of the difference between the final-sigma and Eszett proportions.
(The relative proportion per language is important.)
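
To make these proportions concrete, here is a minimal Python sketch that
recomputes the figures from the samples quoted below. The counts are copied
from the quoted messages; the names (percent, summarize, scaled_estimate) are
only illustrative, and the sketch says nothing about how the original
measurements were produced.

```python
# Counts copied from Erik's two link-based samples quoted in this thread.
SAMPLES = {
    "2009-11-28": {"docs": 1_253_099_703, "links": 88_712_912_831,
                   "eszett": 8_981, "final_sigma": 305},
    "2008-11-19": {"docs": 819_600_672, "links": 49_904_513_188,
                   "eszett": 2_739, "final_sigma": 138},
}


def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


def summarize(sample):
    """Derive the per-link percentages and ratios for one sample."""
    return {
        "eszett_pct": percent(sample["eszett"], sample["links"]),
        "final_sigma_pct": percent(sample["final_sigma"], sample["links"]),
        "eszett_per_final_sigma": sample["eszett"] / sample["final_sigma"],
        "links_per_doc": sample["links"] / sample["docs"],
    }


def scaled_estimate(total_docs, sample_docs, sample_hits):
    """Scale a count from a sample up to the whole index.

    Assumes the sample is representative, as the quoted message notes.
    """
    return total_docs / sample_docs * sample_hits


if __name__ == "__main__":
    for date, sample in SAMPLES.items():
        s = summarize(sample)
        print(date)
        print(f"  Eszett links:       {s['eszett_pct']:.7f}%")
        print(f"  final-sigma links:  {s['final_sigma_pct']:.8f}%")
        print(f"  Eszett/final sigma: {s['eszett_per_final_sigma']:.1f}")
        print(f"  links per document: {s['links_per_doc']:.1f}")
    # Figures from the quoted 2008 estimate: one trillion documents in the
    # index, 819,600,672 in the sample, 5,000 Eszett links in the sample.
    print(f"{scaled_estimate(1_000_000_000_000, 819_600_672, 5_000):,.0f}")
```

On these numbers, Eszett links outnumber final-sigma links by roughly 29 to 1
in the 2009 sample and roughly 20 to 1 in the 2008 sample, both below the
35-to-1 content ratio.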

Mark


2009/12/1 Erik van der Poel <erikv at google.com>

> I ran the program again today, and Eszett is being used a bit more now
> than it was last year.
>
> 2009-11-28
> 1,253,099,703 documents
> 88,712,912,831 links
> 8,981 Eszett in host name in link 0.00001%
> furz-großerfurz.de <http://furz-grosserfurz.de>
> www.bußgeldexperten.de <http://www.bussgeldexperten.de>
> www.metzgerei-gaßner.de <http://www.metzgerei-gassner.de>
>
> 2008-11-19
> 819,600,672 documents
> 49,904,513,188 links
> 2,739 Eszett in host name in link 0.0000055%
> www.rtc-großefehn.de <http://www.rtc-grossefehn.de>
> www.mein-fußballclub.de <http://www.mein-fussballclub.de>
> www.dermaßanzug.com <http://www.dermassanzug.com>
>
> 2006-11-27
> 889,759,121 documents
> 1,973 Eszett in host name in document URL 0.00022%
> www.uni-gießen.de <http://www.uni-giessen.de>
> www.uni-gießen.de <http://www.uni-giessen.de>
> www.uni-gießen.de <http://www.uni-giessen.de>
>
> All 3 of the samples were "high value" documents in our index. The
> 2006 sample looked for Eszett in the host name in the URL of the
> document (rather than links inside the document). It is no longer
> possible to find Eszett in the URLs of our documents because they are
> now all mapped to "ss". So the 2006 sample cannot really be compared
> with the others because the URL of a document always contains a host
> name, while a link inside a document might be a relative URL (without
> a host name).
>
> The Final Sigma has not grown as much:
>
> 2009-11-28
> 305 final sigma in host name in link
> 0.00000034%
> www.γυναικολόγος.gr <http://www.xn--mxadbxfgktc4bn4g.gr>
> www.γυναικολόγος.gr <http://www.xn--mxadbxfgktc4bn4g.gr>
> www.γυναικολόγος.gr <http://www.xn--mxadbxfgktc4bn4g.gr>
>
> 2008-11-19
> 138 final sigma in host name in link
> 0.00000028%
> www.ταβερνες.gr <http://www.xn--mxacja3bxaqb.gr>
> www.ελληναΐς.gr <http://www.xn--owa9dlitap4c.gr>
> www.γυναικολόγος.gr <http://www.xn--mxadbxfgktc4bn4g.gr>
>
> Erik
>
> On Tue, Dec 1, 2009 at 11:49 AM, Mark Davis ☕ <mark at macchiato.com> wrote:
> > It is approximately 60, as you computed. The trillion figure was in a public
> > posting from July 2008, which is why we can quote it.
> >
> > Mark
> >
> >
> > 2009/12/1 Harald Alvestrand <harald at alvestrand.no>
> >>
> >> Mark Davis ☕ wrote:
> >>>
> >>> As far as Harald's back-of-the-envelope calculations go, they present a
> >>> very inaccurate picture of the scale. Here are some more exact figures for
> >>> that data.
> >>>
> >>>   1. 819,600,672    = sample size of documents
> >>>   2. 5,000    = links with Eszett in the sample
> >>>   3. 1,000,000,000,000    = total documents in index (2008)
> >>>   4. 1,220    = scaling factor (= total docs / sample size)
> >>>   5. 6,100,532    = estimated total links with Eszett (= scaling factor *
> >>>      sample Eszett links)
> >>>
> >>> Even this has to be taken with a grain of salt, since (a) it assumes
> >>> that the sample is representative (although we have reasonable
> >>> confidence in that), (b) it doesn't weight the "importance" of the links
> >>> (in terms of the number of times they are followed), and (c) this data was
> >>> collected back in Nov 2008, so we've had another year of growth since then.
> >>
> >> I obviously need a bigger envelope :-) - I didn't think we had one
> >> trillion documents in the 2008 index.
> >>
> >> One missing number: how many links per document?
> >>
> >> Obviously #Eszett links / #documents can't be the basis of the 0.00001%
> >> figure that Erik quoted, because 5000/819600672 = 0.00061005%, which is a
> >> factor of 60 larger than 0.00001%. But if we estimate 60 links per document,
> >> the 0.00001% fits nicely as the percentage of links that contain Eszett.
> >>
> >>              Harald
> >>
> >>
> >>
> >
> >
> > _______________________________________________
> > Idna-update mailing list
> > Idna-update at alvestrand.no
> > http://www.alvestrand.no/mailman/listinfo/idna-update
> >
> >
>
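
The remark quoted above that document URLs are now all mapped to "ss" matches
what IDNA2003 (RFC 3490) does to Eszett, and the same rules fold final sigma
into ordinary sigma. A minimal sketch using Python's built-in "idna" codec,
which implements IDNA2003, shows the effect on host names from the samples.
The helper names are hypothetical, and this is only an illustration of the
mapping, not the code used to gather the data.

```python
def to_ascii_idna2003(hostname: str) -> str:
    """Convert a host name to its ASCII form using Python's IDNA2003 codec."""
    return hostname.encode("idna").decode("ascii")


def to_unicode_idna2003(ascii_hostname: str) -> str:
    """Convert an ASCII (possibly xn--) host name back to Unicode."""
    return ascii_hostname.encode("ascii").decode("idna")


if __name__ == "__main__":
    # Eszett is mapped to "ss", so the result is a plain ASCII name.
    print(to_ascii_idna2003("www.uni-gießen.de"))

    # Final sigma (ς) and ordinary sigma (σ) end up as the same ACE label.
    with_final_sigma = to_ascii_idna2003("www.γυναικολόγος.gr")
    with_plain_sigma = to_ascii_idna2003("www.γυναικολόγοσ.gr")
    print(with_final_sigma, with_final_sigma == with_plain_sigma)

    # The round trip therefore cannot recover the final sigma.
    print(to_unicode_idna2003(with_final_sigma))
```

Because both mappings discard information, a name registered or linked with
Eszett or final sigma cannot be told apart from its "ss" or sigma spelling
once it has passed through this conversion.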