The real issue: interoperability, and a proposal (Was: Consensus Call on Latin Sharp S and Greek Final Sigma)
Erik van der Poel
erikv at google.com
Wed Dec 2 08:02:08 CET 2009
Resending to fewer recipients, with correction:
These percentages are for links that contain host names.
2009/12/1 Erik van der Poel <erikv at google.com>:
> Hello Georg,
>
> I'd rather not spend time writing and running a program that only
> covers German in Germany since this working group is supposed to be
> addressing all languages and countries. The following numbers are from
> today's run (across all languages and all countries):
>
> 000041 A 0.05257%
> 000042 B 0.05273%
> 000043 C 0.04667%
> 000044 D 0.03968%
> 000045 E 0.03433%
> 000046 F 0.03479%
> 000047 G 0.03765%
> 000048 H 0.02878%
> 000049 I 0.02559%
> 00004A J 0.01087%
> 00004B K 0.01261%
> 00004C L 0.03129%
> 00004D M 0.05843%
> 00004E N 0.03023%
> 00004F O 0.03110%
> 000050 P 0.04785%
> 000051 Q 0.00352%
> 000052 R 0.03524%
> 000053 S 0.06189%
> 000054 T 0.04769%
> 000055 U 0.01398%
> 000056 V 0.01801%
> 000057 W 0.02288%
> 000058 X 0.01048%
> 000059 Y 0.00646%
> 00005A Z 0.00631%
> 000061 a 65.13017%
> 000062 b 34.37698%
> 000063 c 75.88070%
> 000064 d 33.11157%
> 000065 e 71.04350%
> 000066 f 19.63188%
> 000067 g 35.18146%
> 000068 h 21.27433%
> 000069 i 56.23800%
> 00006A j 7.64336%
> 00006B k 19.40692%
> 00006C l 46.67894%
> 00006D m 72.34482%
> 00006E n 49.77034%
> 00006F o 87.07448%
> 000070 p 32.55542%
> 000071 q 2.40478%
> 000072 r 55.26164%
> 000073 s 51.89721%
> 000074 t 52.76240%
> 000075 u 31.00335%
> 000076 v 15.58301%
> 000077 w 61.28691%
> 000078 x 5.37177%
> 000079 y 18.39022%
> 00007A z 7.33118%
> 0000C4 Ä 0.00000%
> 0000D6 Ö 0.00000%
> 0000DC Ü 0.00000%
> 0000DF ß 0.00002%
> 0000E4 ä 0.00108%
> 0000F6 ö 0.00092%
> 0000FC ü 0.00163%
>
> Erik
>
> 2009/12/1 Georg Ochsner <g.ochsner at revolistic.com>:
>> Hello Erik,
>>
>> would it maybe be possible for you to query the usage of each German letter within German documents in Germany? Especially the following characters would be very interesting to compare: ß, ä, ö, ü - as well as e and q for comparison reasons, if not all 30. I think this would give a good idea of their actual usage.
>>
>> Best regards
>> Georg
>>
>>
>>> -----Original Message-----
>>> From: idna-update-bounces at alvestrand.no
>>> [mailto:idna-update-bounces at alvestrand.no] On Behalf Of Erik van der Poel
>>> Sent: Tuesday, 1 December 2009 23:45
>>> To: Mark Davis ☕
>>> Cc: Shawn Steele; Patrik Fältström; Harald Alvestrand;
>>> idna-update at alvestrand.no; lisa Dusseault; Alexander Mayrhofer;
>>> Martin J. Dürst; Vint Cerf
>>> Subject: Re: The real issue: interoperability, and a proposal (Was:
>>> Consensus Call on Latin Sharp S and Greek Final Sigma)
>>>
>>> I ran the program again today, and Eszett is being used a bit more now
>>> than it was last year.
>>>
>>> 2009-11-28
>>> 1,253,099,703 documents
>>> 88,712,912,831 links
>>> 8,981 Eszett in host name in link 0.00001%
>>> furz-großerfurz.de
>>> www.bußgeldexperten.de
>>> www.metzgerei-gaßner.de
>>>
>>> 2008-11-19
>>> 819,600,672 documents
>>> 49,904,513,188 links
>>> 2,739 Eszett in host name in link 0.0000055%
>>> www.rtc-großefehn.de
>>> www.mein-fußballclub.de
>>> www.dermaßanzug.com
>>>
>>> 2006-11-27
>>> 889,759,121 documents
>>> 1,973 Eszett in host name in document URL 0.00022%
>>> www.uni-gießen.de
>>> www.uni-gießen.de
>>> www.uni-gießen.de
>>>
>>> All 3 of the samples were "high value" documents in our index. The
>>> 2006 sample looked for Eszett in the host name in the URL of the
>>> document (rather than links inside the document). It is no longer
>>> possible to find Eszett in the URLs of our documents because they are
>>> now all mapped to "ss". So the 2006 sample cannot really be compared
>>> with the others because the URL of a document always contains a host
>>> name, while a link inside a document might be a relative URL (without
>>> a host name).
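
The "mapped to ss" behaviour described above can be reproduced with a minimal sketch. It assumes Python's built-in "idna" codec, which implements the IDNA2003 rules (RFC 3490 with Nameprep); it is not the code used to build the index discussed in this thread, and the helper name to_ascii_2003 is hypothetical. Under those rules Eszett is folded to "ss" and Final Sigma to ordinary sigma before the name is converted to ASCII.

```python
# Sketch only: relies on Python's standard "idna" codec (IDNA2003 / Nameprep).
# The function name is an illustrative assumption, not part of any library.

def to_ascii_2003(host: str) -> str:
    """Return the IDNA2003 ASCII form of a host name."""
    return host.encode("idna").decode("ascii")

# Eszett (U+00DF) is mapped to "ss", so the original spelling is lost.
eszett_host = to_ascii_2003("www.uni-gießen.de")

# Final sigma (U+03C2) is mapped to sigma (U+03C3), so both spellings
# of a label end up with the same ASCII form.
final_sigma_label = to_ascii_2003("ς")
plain_sigma_label = to_ascii_2003("σ")

print(eszett_host)
print(final_sigma_label == plain_sigma_label)
```

Because the mapping is one-way, a crawler that stores only the mapped form can no longer tell whether a link was originally written with Eszett or with "ss".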
>>>
>>> The Final Sigma has not grown as much:
>>>
>>> 2009-11-28
>>> 305 final sigma in host name in link
>>> 0.00000034%
>>> www.γυναικολόγος.gr
>>> www.γυναικολόγος.gr
>>> www.γυναικολόγος.gr
>>>
>>> 2008-11-19
>>> 138 final sigma in host name in link
>>> 0.00000028%
>>> www.ταβερνες.gr
>>> www.ελληναΐς.gr
>>> www.γυναικολόγος.gr
>>>
>>> Erik
>>>
>>> On Tue, Dec 1, 2009 at 11:49 AM, Mark Davis ☕ <mark at macchiato.com>
>>> wrote:
>>> > It is approximately 60, as you computed. The trillion figure was in a
>>> > public posting from July 2008, which is why we can quote it.
>>> >
>>> > Mark
>>> >
>>> >
>>> > 2009/12/1 Harald Alvestrand <harald at alvestrand.no>
>>> >>
>>> >> Mark Davis ☕ wrote:
>>> >>>
>>> >>> As far as Harald's back-of-the-envelope calculations go, they
>>> >>> present a very inaccurate picture of the scale. Here are some more
>>> >>> exact figures for that data.
>>> >>>
>>> >>> 1. 819,600,672 = sample size of documents
>>> >>> 2. 5,000 = links with Eszett in the sample
>>> >>> 3. 1,000,000,000,000 = total documents in index (2008)
>>> >>> 4. 1,220 = scaling factor (= total docs / sample size)
>>> >>> 5. 6,100,532 = estimated total links with Eszett (= scaling *
>>> >>> sample Eszett links)
>>> >>>
>>> >>> Even this has to be taken with a certain grain of salt, since (a) it
>>> >>> assumes that the sample is representative (although we have
>>> >>> reasonable confidence in that), (b) it doesn't weight the
>>> >>> "importance" of the links (in terms of the number of times they are
>>> >>> followed), and (c) this data was collected back in Nov 2008, so we've
>>> >>> had another year of growth since then.
>>> >>
>>> >> I obviously need a bigger envelope :-) - I didn't think we had one
>>> >> trillion documents in the 2008 index.
>>> >>
>>> >> One missing number: how many links per document?
>>> >>
>>> >> Obviously #Eszett links / #documents can't be the basis of the
>>> >> 0.00001% figure that Erik quoted, because 5000/819600672 =
>>> >> 0.00061005%, which is a factor of 60 larger than 0.00001%. But if we
>>> >> estimate 60 links per document, the 0.00001% fits nicely as the
>>> >> percentage of links that contain Eszett.
>>> >>
>>> >> Harald
>>> >>
>>> >>
>>> >>
>>> >
>>> >
>>> > _______________________________________________
>>> > Idna-update mailing list
>>> > Idna-update at alvestrand.no
>>> > http://www.alvestrand.no/mailman/listinfo/idna-update
>>> >
>>> >
>>
>>
>
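
The figures traded in the quoted messages can be cross-checked with a few lines of arithmetic. This is a minimal sketch that uses only counts quoted in the thread; the variable and function names are illustrative assumptions. It recomputes the links-per-document ratio, the Eszett percentages for the 2008 and 2009 samples, and the scaling estimate built on the rounded figure of 5,000 Eszett links.

```python
# Sketch only: every count below is copied from the messages in this thread.

def percent(part: int, whole: int) -> float:
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole

# 2008-11-19 sample
docs_2008 = 819_600_672
links_2008 = 49_904_513_188
eszett_links_2008 = 2_739

# 2009-11-28 sample
links_2009 = 88_712_912_831
eszett_links_2009 = 8_981

# The "missing number": average links per document in the 2008 sample.
links_per_doc = links_2008 / docs_2008

# Share of links whose host name contains Eszett.
pct_2008 = percent(eszett_links_2008, links_2008)
pct_2009 = percent(eszett_links_2009, links_2009)

# Scaling estimate from the thread: total documents in the index divided
# by the sample size, applied to the rounded figure of 5,000 links.
total_docs = 1_000_000_000_000
scaling = total_docs / docs_2008
estimated_total = scaling * 5_000

print(f"links per document: {links_per_doc:.2f}")
print(f"Eszett share 2008:  {pct_2008:.7f}%")
print(f"Eszett share 2009:  {pct_2009:.7f}%")
print(f"scaling factor:     {scaling:.0f}")
print(f"estimated total:    {estimated_total:,.0f}")
```

The ratio comes out at roughly 60 links per document, which is why dividing the Eszett link count by the number of documents rather than by the number of links overstates the percentage by about that factor.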