One addition: there is over 35 times as much German content as Greek, which explains part of the difference between the final-sigma and Eszett proportions. (The relative proportion per language is what matters.)<br><br clear="all">Mark<br>
<br><br><div class="gmail_quote">2009/12/1 Erik van der Poel <span dir="ltr"><<a href="mailto:erikv@google.com">erikv@google.com</a>></span><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
I ran the program again today, and Eszett is being used a bit more now<br>
than it was last year.<br>
<br>
2009-11-28<br>
1,253,099,703 documents<br>
88,712,912,831 links<br>
8,981 Eszett in host name in link 0.00001%<br>
<a href="http://furz-grosserfurz.de" target="_blank">furz-großerfurz.de</a><br>
<a href="http://www.bussgeldexperten.de" target="_blank">www.bußgeldexperten.de</a><br>
<a href="http://www.metzgerei-gassner.de" target="_blank">www.metzgerei-gaßner.de</a><br>
<br>
2008-11-19<br>
819,600,672 documents<br>
49,904,513,188 links<br>
2,739 Eszett in host name in link 0.0000055%<br>
<a href="http://www.rtc-grossefehn.de" target="_blank">www.rtc-großefehn.de</a><br>
<a href="http://www.mein-fussballclub.de" target="_blank">www.mein-fußballclub.de</a><br>
<a href="http://www.dermassanzug.com" target="_blank">www.dermaßanzug.com</a><br>
<br>
2006-11-27<br>
889,759,121 documents<br>
1,973 Eszett in host name in document URL 0.00022%<br>
<a href="http://www.uni-giessen.de" target="_blank">www.uni-gießen.de</a><br>
<a href="http://www.uni-giessen.de" target="_blank">www.uni-gießen.de</a><br>
<a href="http://www.uni-giessen.de" target="_blank">www.uni-gießen.de</a><br>
<br>
All 3 of the samples were "high value" documents in our index. The<br>
2006 sample looked for Eszett in the host name in the URL of the<br>
document (rather than links inside the document). It is no longer<br>
possible to find Eszett in the URLs of our documents because they are<br>
now all mapped to "ss". So the 2006 sample cannot really be compared<br>
with the others because the URL of a document always contains a host<br>
name, while a link inside a document might be a relative URL (without<br>
a host name).<br>
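<br>
As a rough sketch of how the percentages above follow from the raw counts, the Python below divides each Eszett count by the size of its sample. The names `samples` and `share_percent` are illustrative assumptions, not part of the original program. The 2006 row is a share of documents rather than links, so it is listed only for completeness and is left out of the growth comparison.<br>

```python
# Counts copied from the three samples above.
# Assumption: "total" is the number of links for 2008/2009 and the number
# of documents for 2006, matching how each sample was described.
samples = {
    "2009-11-28": {"hits": 8_981, "total": 88_712_912_831},  # links
    "2008-11-19": {"hits": 2_739, "total": 49_904_513_188},  # links
    "2006-11-27": {"hits": 1_973, "total": 889_759_121},     # documents
}


def share_percent(hits, total):
    """Return hits as a percentage of total."""
    return 100.0 * hits / total


shares = {
    date: share_percent(s["hits"], s["total"]) for date, s in samples.items()
}

for date, pct in shares.items():
    print(f"{date}: {pct:.7f}%")

# Only the two link-based samples are comparable with each other.
growth = shares["2009-11-28"] / shares["2008-11-19"]
print(f"2008 to 2009 growth in link share: {growth:.2f}x")
```

Under these figures the share of links with Eszett in the host name grew by roughly 1.8 times between the 2008 and 2009 samples.<br>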
<br>
The Final Sigma has not grown as much:<br>
<br>
2009-11-28<br>
305 final sigma in host name in link<br>
0.00000034%<br>
<a href="http://www.xn--mxadbxfgktc4bn4g.gr" target="_blank">www.γυναικολόγος.gr</a><br>
<a href="http://www.xn--mxadbxfgktc4bn4g.gr" target="_blank">www.γυναικολόγος.gr</a><br>
<a href="http://www.xn--mxadbxfgktc4bn4g.gr" target="_blank">www.γυναικολόγος.gr</a><br>
<br>
2008-11-19<br>
138 final sigma in host name in link<br>
0.00000028%<br>
<a href="http://www.xn--mxacja3bxaqb.gr" target="_blank">www.ταβερνες.gr</a><br>
<a href="http://www.xn--owa9dlitap4c.gr" target="_blank">www.ελληναΐς.gr</a><br>
<a href="http://www.xn--mxadbxfgktc4bn4g.gr" target="_blank">www.γυναικολόγος.gr</a><br>
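<br>
To make "has not grown as much" concrete, the sketch below compares the year-over-year change in link share for Eszett and final sigma. It assumes the final-sigma counts were taken from the same link totals as the Eszett samples of the same dates, which is consistent with the percentages quoted. The helper name `share_growth` is hypothetical. The last step divides by the "over 35 times as much German content as Greek" figure from the top of this message, assuming link counts scale with content volume, so the result is only a rough upper bound.<br>

```python
# Link totals and counts copied from the 2008 and 2009 samples.
# Assumption: final-sigma counts share the link totals of the Eszett samples.
links = {"2009-11-28": 88_712_912_831, "2008-11-19": 49_904_513_188}
eszett = {"2009-11-28": 8_981, "2008-11-19": 2_739}
final_sigma = {"2009-11-28": 305, "2008-11-19": 138}


def share_growth(counts):
    """Ratio of the 2009 link share to the 2008 link share."""
    old = counts["2008-11-19"] / links["2008-11-19"]
    new = counts["2009-11-28"] / links["2009-11-28"]
    return new / old


eszett_growth = share_growth(eszett)
sigma_growth = share_growth(final_sigma)
print(f"Eszett share growth:      {eszett_growth:.2f}x")
print(f"Final sigma share growth: {sigma_growth:.2f}x")

# Raw 2009 ratio of Eszett links to final-sigma links.
raw_ratio = eszett["2009-11-28"] / final_sigma["2009-11-28"]

# Assumption: 35 is a lower bound on the German/Greek content ratio,
# so this normalized value is an upper bound.
german_to_greek_content = 35
normalized_ratio = raw_ratio / german_to_greek_content
print(f"Raw ratio: {raw_ratio:.1f}, normalized by content: {normalized_ratio:.2f}")
```

With these inputs the Eszett share grows by roughly 1.8 times and the final-sigma share by roughly 1.2 times.<br>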
<font color="#888888"><br>
Erik<br>
</font><div><div></div><div class="h5"><br>
On Tue, Dec 1, 2009 at 11:49 AM, Mark Davis ☕ <<a href="mailto:mark@macchiato.com">mark@macchiato.com</a>> wrote:<br>
> It is approximately 60, as you computed. The trillion figure was in a public<br>
> posting from July 2008, which is why we can quote it.<br>
><br>
> Mark<br>
><br>
><br>
> 2009/12/1 Harald Alvestrand <<a href="mailto:harald@alvestrand.no">harald@alvestrand.no</a>><br>
>><br>
>> Mark Davis ☕ wrote:<br>
>>><br>
>>> As far as Harald's back-of-the-envelope calculations go, they present a<br>
>>> very inaccurate picture of the scale. Here are some more exact figures for<br>
>>> that data.<br>
>>><br>
>>> 1. 819,600,672 = sample size of documents<br>
>>> 2. 5,000 = links with Eszett in the sample<br>
>>> 3. 1,000,000,000,000 = total documents in index (2008)<br>
>>> 4. 1,220 = scaling factor (= total docs / sample size)<br>
>>> 5. 6,100,532 = estimated total links with Eszett (= scaling *<br>
>>> sample Eszett links)<br>
>>><br>
>>> Even this has to be taken with a certain grain of salt, since (a) it is<br>
>>> assuming that the sample is representative (although we have reasonable<br>
>>> confidence in that), and (b) it doesn't weight the "importance" of the links<br>
>>> (in terms of the number of times they are followed), and (c) this data was<br>
>>> collected back in Nov 2008, so we've had another year of growth since then.<br>
>><br>
>> I obviously need a bigger envelope :-) - I didn't think we had one<br>
>> trillion documents in the 2008 index.<br>
>><br>
>> One missing number: how many links per document?<br>
>><br>
>> Obviously #Eszett links / #documents can't be the basis of the 0.00001%<br>
>> figure that Erik quoted, because 5000/819600672 = 0.00061005%, which is a<br>
>> factor of about 60 larger than 0.00001%. But if we estimate 60 links per<br>
>> document, 0.00001% fits nicely as the percentage of links that contain Eszett.<br>
>><br>
>> Harald<br>
>><br>
>><br>
>><br>
><br>
><br>
</div></div><div><div></div><div class="h5">> _______________________________________________<br>
> Idna-update mailing list<br>
> <a href="mailto:Idna-update@alvestrand.no">Idna-update@alvestrand.no</a><br>
> <a href="http://www.alvestrand.no/mailman/listinfo/idna-update" target="_blank">http://www.alvestrand.no/mailman/listinfo/idna-update</a><br>
><br>
><br>
</div></div></blockquote></div><br>
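<br>
The scaling estimate and the links-per-document inference in the quoted thread can be reproduced with a few lines of Python. This is only a sketch of the arithmetic under the figures given in the thread; the variable names are illustrative, and the 5,000 count is the rounded number quoted there rather than the exact sample count.<br>

```python
# Figures quoted in the thread for the November 2008 sample.
sample_docs = 819_600_672
sample_eszett_links = 5_000          # rounded count quoted in the thread
total_docs = 1_000_000_000_000       # public July 2008 figure
sample_links = 49_904_513_188        # links in the 2008 sample

# Scale the sample count up to the whole index.
# Assumption: the sample is representative of the full index.
scaling = total_docs / sample_docs
estimated_total = scaling * sample_eszett_links
print(f"scaling factor: {scaling:.1f}")
print(f"estimated Eszett links in index: {estimated_total:,.0f}")

# Eszett links per document, as a percentage.
per_doc_pct = 100 * sample_eszett_links / sample_docs

# Links per document implied by a per-link share of 0.00001%.
per_link_pct = 0.00001
implied_links_per_doc = per_doc_pct / per_link_pct

# Links per document measured directly from the sample totals.
measured_links_per_doc = sample_links / sample_docs
print(f"implied links per document:  {implied_links_per_doc:.1f}")
print(f"measured links per document: {measured_links_per_doc:.1f}")
```

Both the implied and the measured value come out close to 60 links per document, in line with the "approximately 60" confirmed in the thread.<br>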