One addition: there is over 35 times as much German content as Greek, which explains part of the difference between the final-sigma and Eszett proportions. (The relative proportion per language is what matters.)<br><br clear="all">Mark<br>


<br><br><div class="gmail_quote">2009/12/1 Erik van der Poel <span dir="ltr">&lt;<a href="mailto:erikv@google.com">erikv@google.com</a>&gt;</span><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

I ran the program again today, and Eszett is being used a bit more now<br>

than it was last year.<br>

<br>

2009-11-28<br>

1,253,099,703 documents<br>

88,712,912,831 links<br>

8,981 Eszett in host name in link 0.00001%<br>

<a href="http://furz-grosserfurz.de" target="_blank">furz-großerfurz.de</a><br>

<a href="http://www.bussgeldexperten.de" target="_blank">www.bußgeldexperten.de</a><br>

<a href="http://www.metzgerei-gassner.de" target="_blank">www.metzgerei-gaßner.de</a><br>

<br>

2008-11-19<br>

819,600,672 documents<br>

49,904,513,188 links<br>

2,739 Eszett in host name in link 0.0000055%<br>

<a href="http://www.rtc-grossefehn.de" target="_blank">www.rtc-großefehn.de</a><br>

<a href="http://www.mein-fussballclub.de" target="_blank">www.mein-fußballclub.de</a><br>

<a href="http://www.dermassanzug.com" target="_blank">www.dermaßanzug.com</a><br>

<br>

2006-11-27<br>

889,759,121 documents<br>

1,973 Eszett in host name in document URL 0.00022%<br>

<a href="http://www.uni-giessen.de" target="_blank">www.uni-gießen.de</a><br>

<a href="http://www.uni-giessen.de" target="_blank">www.uni-gießen.de</a><br>

<a href="http://www.uni-giessen.de" target="_blank">www.uni-gießen.de</a><br>

<br>

All 3 of the samples were &quot;high value&quot; documents in our index. The<br>

2006 sample looked for Eszett in the host name in the URL of the<br>

document (rather than links inside the document). It is no longer<br>

possible to find Eszett in the URLs of our documents because they are<br>

now all mapped to &quot;ss&quot;. So the 2006 sample cannot really be compared<br>

with the others because the URL of a document always contains a host<br>

name, while a link inside a document might be a relative URL (without<br>

a host name).<br>

<br>

Use of the final sigma has not grown as much:<br>

<br>

2009-11-28<br>

305 final sigma in host name in link<br>

0.00000034%<br>

<a href="http://www.xn--mxadbxfgktc4bn4g.gr" target="_blank">www.γυναικολόγος.gr</a><br>

<a href="http://www.xn--mxadbxfgktc4bn4g.gr" target="_blank">www.γυναικολόγος.gr</a><br>

<a href="http://www.xn--mxadbxfgktc4bn4g.gr" target="_blank">www.γυναικολόγος.gr</a><br>

<br>

2008-11-19<br>

138 final sigma in host name in link<br>

0.00000028%<br>

<a href="http://www.xn--mxacja3bxaqb.gr" target="_blank">www.ταβερνες.gr</a><br>

<a href="http://www.xn--owa9dlitap4c.gr" target="_blank">www.ελληναΐς.gr</a><br>

<a href="http://www.xn--mxadbxfgktc4bn4g.gr" target="_blank">www.γυναικολόγος.gr</a><br>

<font color="#888888"><br>

Erik<br>

</font><div><div></div><div class="h5"><br>

On Tue, Dec 1, 2009 at 11:49 AM, Mark Davis ☕ &lt;<a href="mailto:mark@macchiato.com">mark@macchiato.com</a>&gt; wrote:<br>

&gt; It is approximately 60, as you computed. The trillion figure was in a public<br>

&gt; posting from July 2008, which is why we can quote it.<br>

&gt;<br>

&gt; Mark<br>

&gt;<br>

&gt;<br>

&gt; 2009/12/1 Harald Alvestrand &lt;<a href="mailto:harald@alvestrand.no">harald@alvestrand.no</a>&gt;<br>

&gt;&gt;<br>

&gt;&gt; Mark Davis ☕ wrote:<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; As far as Harald&#39;s back-of-the-envelope calculations go, they present a<br>

&gt;&gt;&gt; very inaccurate picture of the scale. Here are some more exact figures for<br>

&gt;&gt;&gt; that data.<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt;   1. 819,600,672    = sample size of documents<br>

&gt;&gt;&gt;   2. 5,000    = links with eszed in the sample<br>

&gt;&gt;&gt;   3. 1,000,000,000,000    = total documents in index (2008)<br>

&gt;&gt;&gt;   4. 1,220    = scaling factor (= total docs / sample size)<br>

&gt;&gt;&gt;   5. 6,100,532    = estimated total links with eszed (= scaling *<br>

&gt;&gt;&gt;      sample eszed links)<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; Even this has to be taken with a certain grain of salt, since (a) it is<br>

&gt;&gt;&gt; assuming that the sample is representative (although we have reasonable<br>

&gt;&gt;&gt; confidence in that), and (b) it doesn&#39;t weight the &quot;importance&quot; of the links<br>

&gt;&gt;&gt; (in terms of the number of times they are followed), and (c) this data was<br>

&gt;&gt;&gt; collected back in Nov 2008, so we&#39;ve had another year of growth since then.<br>
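<br>
The scaling in items 1 to 5 above can be reproduced with a short Python sketch. The function name estimate_total_links is hypothetical, and the calculation rests on the same assumption stated in (a), namely that the sample is representative of the whole index.<br>
<pre>
```python
def estimate_total_links(sample_docs, sample_hits, total_docs):
    """Scale a count seen in a sample up to the whole index.

    Assumes the sample is representative; the name and signature are
    illustrative, not part of any tool mentioned in the thread.
    """
    scaling = total_docs / sample_docs
    return scaling, scaling * sample_hits


# Figures quoted in the list above.
scaling, total = estimate_total_links(
    sample_docs=819_600_672,
    sample_hits=5_000,
    total_docs=1_000_000_000_000,
)

if __name__ == "__main__":
    print("scaling factor:", round(scaling))
    print("estimated links:", round(total))
```
</pre>
Rounded, this gives the scaling factor of 1,220 and the estimate of 6,100,532 links quoted in the list.<br>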

&gt;&gt;<br>

&gt;&gt; I obviously need a bigger envelope :-) - I didn&#39;t think we had one<br>

&gt;&gt; trillion documents in the 2008 index.<br>

&gt;&gt;<br>

&gt;&gt; One missing number: how many links per document?<br>

&gt;&gt;<br>

&gt;&gt; Obviously #eszed links / #documents can&#39;t be the basis of the 0.00001%<br>

&gt;&gt; figure that Erik quoted, because 5000/819600672 = 0.00061005%, which is a<br>

&gt;&gt; factor of 60 larger than 0.00001%. But if we estimate 60 links per document,<br>

&gt;&gt; the 0.00001% fits nicely as the percentage of links that contain eszed.<br>
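<br>
The estimate of 60 links per document can be checked against the link count of the 2008-11-19 sample reported elsewhere in this thread (49,904,513,188 links in 819,600,672 documents). A minimal Python sketch follows; the variable names are illustrative, and 5,000 is the round figure used in the discussion rather than an exact count.<br>
<pre>
```python
# Figures from the 2008-11-19 sample as reported in this thread.
SAMPLE_DOCS = 819_600_672
SAMPLE_LINKS = 49_904_513_188
ESZETT_LINKS = 5_000  # round figure used in the discussion, not an exact count

# Average number of links found in one sampled document.
links_per_doc = SAMPLE_LINKS / SAMPLE_DOCS

# The same count expressed per document and per link, both as percentages.
per_doc_pct = 100 * ESZETT_LINKS / SAMPLE_DOCS
per_link_pct = 100 * ESZETT_LINKS / SAMPLE_LINKS

if __name__ == "__main__":
    print(f"links per document: {links_per_doc:.1f}")
    print(f"per-document share: {per_doc_pct:.8f}%")
    print(f"per-link share:     {per_link_pct:.8f}%")
```
</pre>
The per-document and per-link shares differ by exactly the links-per-document ratio, which on these figures comes out close to the estimated 60.<br>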

&gt;&gt;<br>

&gt;&gt;              Harald<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;<br>

&gt;<br>

</div></div><div><div></div><div class="h5">&gt; _______________________________________________<br>

&gt; Idna-update mailing list<br>

&gt; <a href="mailto:Idna-update@alvestrand.no">Idna-update@alvestrand.no</a><br>

&gt; <a href="http://www.alvestrand.no/mailman/listinfo/idna-update" target="_blank">http://www.alvestrand.no/mailman/listinfo/idna-update</a><br>

&gt;<br>

&gt;<br>

</div></div></blockquote></div><br>