Re: The real issue: interoperability, and a proposal (Was: Consensus Call on Latin Sharp S and Greek Final Sigma)

Georg Ochsner g.ochsner at revolistic.com
Wed Dec 2 05:23:28 CET 2009


Hello Erik,

Would it be possible for you to query how often each German letter is used in German documents in Germany? The following characters would be especially interesting to compare: ß, ä, ö, ü, plus e and q as baselines, if not all 30 letters. I think this would give a good idea of their actual usage.
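
A minimal sketch of the kind of per-letter count meant here, assuming plain text input. The function name `letter_shares` and the sample sentence are made up for illustration; this is not the program Erik ran, and a real query would have to run over the document index.

```python
from collections import Counter

# Letters named in the request; "e" and "q" act as high- and low-frequency baselines.
LETTERS = ["ß", "ä", "ö", "ü", "e", "q"]


def letter_shares(text, letters=LETTERS):
    """Return each letter's share of all alphabetic characters in `text`."""
    # Case-fold so that "Ä" and "ä" are counted together.
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return {letter: 0.0 for letter in letters}
    return {letter: counts[letter] / total for letter in letters}


# Hypothetical sample text, not data from the index discussed in this thread.
sample = "Die Straße führt über den Fluß zur großen Quelle."
for letter, share in letter_shares(sample).items():
    print(f"{letter}: {share:.2%}")
```

Dividing by the number of alphabetic characters, rather than reporting raw counts, keeps samples of different sizes comparable.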

Best regards
Georg


> -----Original Message-----
> From: idna-update-bounces at alvestrand.no
> [mailto:idna-update-bounces at alvestrand.no] On Behalf Of Erik van der Poel
> Sent: Tuesday, December 1, 2009 23:45
> To: Mark Davis ☕
> Cc: Shawn Steele; Patrik Fältström; Harald Alvestrand;
> idna-update at alvestrand.no; Lisa Dusseault; Alexander Mayrhofer;
> Martin J. Dürst; Vint Cerf
> Subject: Re: The real issue: interoperability, and a proposal (Was:
> Consensus Call on Latin Sharp S and Greek Final Sigma)
> 
> I ran the program again today, and Eszett is being used a bit more now
> than it was last year.
> 
> 2009-11-28
> 1,253,099,703 documents
> 88,712,912,831 links
> 8,981 Eszett in host name in link 0.00001%
> furz-großerfurz.de
> www.bußgeldexperten.de
> www.metzgerei-gaßner.de
> 
> 2008-11-19
> 819,600,672 documents
> 49,904,513,188 links
> 2,739 Eszett in host name in link 0.0000055%
> www.rtc-großefehn.de
> www.mein-fußballclub.de
> www.dermaßanzug.com
> 
> 2006-11-27
> 889,759,121 documents
> 1,973 Eszett in host name in document URL 0.00022%
> www.uni-gießen.de
> www.uni-gießen.de
> www.uni-gießen.de
> 
> All 3 of the samples were "high value" documents in our index. The
> 2006 sample looked for Eszett in the host name in the URL of the
> document (rather than links inside the document). It is no longer
> possible to find Eszett in the URLs of our documents because they are
> now all mapped to "ss". So the 2006 sample cannot really be compared
> with the others because the URL of a document always contains a host
> name, while a link inside a document might be a relative URL (without
> a host name).
> 
> The Final Sigma has not grown as much:
> 
> 2009-11-28
> 305 final sigma in host name in link
> 0.00000034%
> www.γυναικολόγος.gr
> www.γυναικολόγος.gr
> www.γυναικολόγος.gr
> 
> 2008-11-19
> 138 final sigma in host name in link
> 0.00000028%
> www.ταβερνες.gr
> www.ελληναΐς.gr
> www.γυναικολόγος.gr
> 
> Erik
> 
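
The percentages in Erik's message can be re-derived from the counts he gives. The sketch below does only that arithmetic; the names `SAMPLES` and `share_percent` are made up for illustration, and the 2006 sample is left out because, as Erik notes, it measured document URLs rather than links.

```python
# (label, matching links, total links) as reported in the message above.
SAMPLES = [
    ("Eszett 2009-11-28", 8_981, 88_712_912_831),
    ("Eszett 2008-11-19", 2_739, 49_904_513_188),
    ("Final sigma 2009-11-28", 305, 88_712_912_831),
    ("Final sigma 2008-11-19", 138, 49_904_513_188),
]


def share_percent(matches, total):
    """Share of `matches` in `total`, expressed as a percentage."""
    return 100.0 * matches / total


for label, matches, total in SAMPLES:
    print(f"{label}: {share_percent(matches, total):.8f}%")
```

Comparing shares rather than raw counts matters because the 2009 sample is larger: by share of links, Eszett grew by a factor of roughly 1.8 between the two samples and final sigma by roughly 1.2.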
> On Tue, Dec 1, 2009 at 11:49 AM, Mark Davis ☕ <mark at macchiato.com> wrote:
> > It is approximately 60, as you computed. The trillion figure was in a
> > public posting from July 2008, which is why we can quote it.
> >
> > Mark
> >
> >
> > 2009/12/1 Harald Alvestrand <harald at alvestrand.no>
> >>
> >> Mark Davis ☕ wrote:
> >>>
> >>> As far as Harald's back-of-the-envelope calculations go, they present
> >>> a very inaccurate picture of the scale. Here are some more exact
> >>> figures for that data.
> >>>
> >>>   1. 819,600,672    = sample size of documents
> >>>   2. 5,000    = links with Eszett in the sample
> >>>   3. 1,000,000,000,000    = total documents in index (2008)
> >>>   4. 1,220    = scaling factor (= total docs / sample size)
> >>>   5. 6,100,532    = estimated total links with Eszett (= scaling
> >>>      factor * sample Eszett links)
> >>>
> >>> Even this has to be taken with a grain of salt, since (a) it
> >>> assumes that the sample is representative (although we have
> >>> reasonable confidence in that), (b) it doesn't weight the
> >>> "importance" of the links (in terms of the number of times they are
> >>> followed), and (c) this data was collected back in Nov 2008, so
> >>> we've had another year of growth since then.
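
The extrapolation in Mark's list can be written out as a short calculation. This is a sketch of that arithmetic only, using the figures from the list; the function name `extrapolate` is made up for illustration, and caveats (a) to (c) above apply to the result unchanged.

```python
SAMPLE_DOCS = 819_600_672          # item 1: documents in the sample
SAMPLE_ESZETT_LINKS = 5_000        # item 2: links with Eszett in the sample
TOTAL_DOCS = 1_000_000_000_000     # item 3: documents in the 2008 index


def extrapolate(sample_hits, sample_size, population_size):
    """Scale a sample count up to the population, assuming a representative sample."""
    scaling = population_size / sample_size
    return scaling, sample_hits * scaling


scaling, estimate = extrapolate(SAMPLE_ESZETT_LINKS, SAMPLE_DOCS, TOTAL_DOCS)
print(f"scaling factor: {scaling:,.0f}")
print(f"estimated links with Eszett: {estimate:,.0f}")
```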
> >>
> >> I obviously need a bigger envelope :-) - I didn't think we had one
> >> trillion documents in the 2008 index.
> >>
> >> One missing number: how many links per document?
> >>
> >> Obviously #Eszett links / #documents can't be the basis of the 0.00001%
> >> figure that Erik quoted, because 5000/819600672 = 0.00061005%, which is
> >> a factor of 60 larger than 0.00001%. But if we estimate 60 links per
> >> document, the 0.00001% fits nicely as the percentage of links that
> >> contain Eszett.
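
Harald's estimate of about 60 links per document can be checked in two ways: from the ratio he computes, and from the document and link totals Erik reported for the 2008 sample. The sketch below does both; all variable names are made up for illustration.

```python
SAMPLE_DOCS = 819_600_672            # documents in the 2008 sample
SAMPLE_LINKS = 49_904_513_188        # links in the 2008 sample, as reported by Erik
ESZETT_LINKS = 5_000                 # figure used in Mark's list
QUOTED_LINK_SHARE_PERCENT = 0.00001  # the share Harald is trying to explain

# Eszett links per document, as a percentage (Harald's 0.00061005%).
per_document_percent = 100.0 * ESZETT_LINKS / SAMPLE_DOCS

# Links per document implied if the quoted share is per link rather than per document.
implied_links_per_doc = per_document_percent / QUOTED_LINK_SHARE_PERCENT

# Links per document computed directly from the reported totals.
measured_links_per_doc = SAMPLE_LINKS / SAMPLE_DOCS

print(f"Eszett links per document: {per_document_percent:.8f}%")
print(f"implied links per document: {implied_links_per_doc:.1f}")
print(f"measured links per document: {measured_links_per_doc:.1f}")
```

Both routes land close to 60, which is consistent with reading the 0.00001% figure as a share of links.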
> >>
> >>              Harald
> >>
> >>
> >>
> >
> >
> > _______________________________________________
> > Idna-update mailing list
> > Idna-update at alvestrand.no
> > http://www.alvestrand.no/mailman/listinfo/idna-update
> >
> >
