The real issue: interoperability, and a proposal (Was: Consensus Call on Latin Sharp S and Greek Final Sigma)

Erik van der Poel erikv at google.com
Wed Dec 2 08:02:08 CET 2009


Resending to fewer recipients, with correction:

These percentages are for links that contain host names.

2009/12/1 Erik van der Poel <erikv at google.com>:
> Hello Georg,
>
> I'd rather not spend time writing and running a program that only
> covers German in Germany since this working group is supposed to be
> addressing all languages and countries. The following numbers are from
> today's run (across all languages and all countries):
>
> 000041  A       0.05257%
> 000042  B       0.05273%
> 000043  C       0.04667%
> 000044  D       0.03968%
> 000045  E       0.03433%
> 000046  F       0.03479%
> 000047  G       0.03765%
> 000048  H       0.02878%
> 000049  I       0.02559%
> 00004A  J       0.01087%
> 00004B  K       0.01261%
> 00004C  L       0.03129%
> 00004D  M       0.05843%
> 00004E  N       0.03023%
> 00004F  O       0.03110%
> 000050  P       0.04785%
> 000051  Q       0.00352%
> 000052  R       0.03524%
> 000053  S       0.06189%
> 000054  T       0.04769%
> 000055  U       0.01398%
> 000056  V       0.01801%
> 000057  W       0.02288%
> 000058  X       0.01048%
> 000059  Y       0.00646%
> 00005A  Z       0.00631%
> 000061  a       65.13017%
> 000062  b       34.37698%
> 000063  c       75.88070%
> 000064  d       33.11157%
> 000065  e       71.04350%
> 000066  f       19.63188%
> 000067  g       35.18146%
> 000068  h       21.27433%
> 000069  i       56.23800%
> 00006A  j       7.64336%
> 00006B  k       19.40692%
> 00006C  l       46.67894%
> 00006D  m       72.34482%
> 00006E  n       49.77034%
> 00006F  o       87.07448%
> 000070  p       32.55542%
> 000071  q       2.40478%
> 000072  r       55.26164%
> 000073  s       51.89721%
> 000074  t       52.76240%
> 000075  u       31.00335%
> 000076  v       15.58301%
> 000077  w       61.28691%
> 000078  x       5.37177%
> 000079  y       18.39022%
> 00007A  z       7.33118%
> 0000C4  Ä       0.00000%
> 0000D6  Ö       0.00000%
> 0000DC  Ü       0.00000%
> 0000DF  ß       0.00002%
> 0000E4  ä       0.00108%
> 0000F6  ö       0.00092%
> 0000FC  ü       0.00163%
>
> Erik
>
> 2009/12/1 Georg Ochsner <g.ochsner at revolistic.com>:
>> Hello Erik,
>>
>> would it maybe be possible for you to query the usage of each German letter within German documents in Germany? Especially the following characters would be very interesting to compare: ß, ä, ö, ü - as well as e and q for comparison reasons, if not all 30. I think this would give a good idea of their actual usage.
>>
>> Best regards
>> Georg
>>
>>
>>> -----Original Message-----
>>> From: idna-update-bounces at alvestrand.no
>>> [mailto:idna-update-bounces at alvestrand.no] On Behalf Of Erik van der Poel
>>> Sent: Tuesday, December 1, 2009 23:45
>>> To: Mark Davis ☕
>>> Cc: Shawn Steele; Patrik Fältström; Harald Alvestrand;
>>> idna-update at alvestrand.no; Lisa Dusseault; Alexander Mayrhofer;
>>> Martin J. Dürst; Vint Cerf
>>> Subject: Re: The real issue: interoperability, and a proposal (Was:
>>> Consensus Call on Latin Sharp S and Greek Final Sigma)
>>>
>>> I ran the program again today, and Eszett is being used a bit more now
>>> than it was last year.
>>>
>>> 2009-11-28
>>> 1,253,099,703 documents
>>> 88,712,912,831 links
>>> 8,981 Eszett in host name in link 0.00001%
>>> furz-großerfurz.de
>>> www.bußgeldexperten.de
>>> www.metzgerei-gaßner.de
>>>
>>> 2008-11-19
>>> 819,600,672 documents
>>> 49,904,513,188 links
>>> 2,739 Eszett in host name in link 0.0000055%
>>> www.rtc-großefehn.de
>>> www.mein-fußballclub.de
>>> www.dermaßanzug.com
>>>
>>> 2006-11-27
>>> 889,759,121 documents
>>> 1,973 Eszett in host name in document URL 0.00022%
>>> www.uni-gießen.de
>>> www.uni-gießen.de
>>> www.uni-gießen.de
>>>
>>> All 3 of the samples were "high value" documents in our index. The
>>> 2006 sample looked for Eszett in the host name in the URL of the
>>> document (rather than links inside the document). It is no longer
>>> possible to find Eszett in the URLs of our documents because they are
>>> now all mapped to "ss". So the 2006 sample cannot really be compared
>>> with the others because the URL of a document always contains a host
>>> name, while a link inside a document might be a relative URL (without
>>> a host name).
>>>
>>> The Final Sigma has not grown as much:
>>>
>>> 2009-11-28
>>> 305 final sigma in host name in link
>>> 0.00000034%
>>> www.γυναικολόγος.gr
>>> www.γυναικολόγος.gr
>>> www.γυναικολόγος.gr
>>>
>>> 2008-11-19
>>> 138 final sigma in host name in link
>>> 0.00000028%
>>> www.ταβερνες.gr
>>> www.ελληναΐς.gr
>>> www.γυναικολόγος.gr
>>>
>>> Erik
>>>
>>> On Tue, Dec 1, 2009 at 11:49 AM, Mark Davis ☕ <mark at macchiato.com>
>>> wrote:
>>> > It is approximately 60, as you computed. The trillion figure was in a
>>> > public posting from July 2008, which is why we can quote it.
>>> >
>>> > Mark
>>> >
>>> >
>>> > 2009/12/1 Harald Alvestrand <harald at alvestrand.no>
>>> >>
>>> >> Mark Davis ☕ wrote:
>>> >>>
>>> >>> As far as Harald's back-of-the-envelope calculations go, they
>>> >>> present a very inaccurate picture of the scale. Here are some more
>>> >>> exact figures for that data.
>>> >>>
>>> >>>   1. 819,600,672    = sample size of documents
>>> >>>   2. 5,000    = links with Eszett in the sample
>>> >>>   3. 1,000,000,000,000    = total documents in index (2008)
>>> >>>   4. 1,220    = scaling factor (= total docs / sample size)
>>> >>>   5. 6,100,532    = estimated total links with Eszett (= scaling
>>> >>>      factor * sample Eszett links)
>>> >>>
>>> >>> Even this has to be taken with a grain of salt, since (a) it
>>> >>> assumes that the sample is representative (although we have
>>> >>> reasonable confidence in that), (b) it doesn't weight the
>>> >>> "importance" of the links (in terms of the number of times they are
>>> >>> followed), and (c) this data was collected back in Nov 2008, so
>>> >>> we've had another year of growth since then.
>>> >>
>>> >> I obviously need a bigger envelope :-) - I didn't think we had one
>>> >> trillion documents in the 2008 index.
>>> >>
>>> >> One missing number: how many links per document?
>>> >>
>>> >> Obviously #Eszett links / #documents can't be the basis of the
>>> >> 0.00001% figure that Erik quoted, because 5000/819600672 =
>>> >> 0.00061005%, which is a factor of about 60 larger than 0.00001%. But
>>> >> if we estimate 60 links per document, the 0.00001% fits nicely as the
>>> >> percentage of links that contain Eszett.
>>> >>
>>> >>              Harald
>>> >>
>>> >>
>>> >>
>>> >
>>> >
>>> > _______________________________________________
>>> > Idna-update mailing list
>>> > Idna-update at alvestrand.no
>>> > http://www.alvestrand.no/mailman/listinfo/idna-update
>>> >
>>> >
>>
>>
>
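
The Eszett and final sigma percentages quoted in the thread can be re-derived from the raw counts. The following is a minimal Python sketch, not Erik's program: the names SAMPLES and percent are illustrative, and it assumes each percentage is simply the matching count divided by the total listed for the same run (links for 2008 and 2009, documents for 2006).

```python
# Counts copied from the thread: (label, matching items, total items).
# Assumption: each quoted percentage is matching / total for the same run.
SAMPLES = [
    ("2009-11-28 Eszett in link host names", 8_981, 88_712_912_831),
    ("2008-11-19 Eszett in link host names", 2_739, 49_904_513_188),
    ("2006-11-27 Eszett in document URLs", 1_973, 889_759_121),
    ("2009-11-28 final sigma in link host names", 305, 88_712_912_831),
    ("2008-11-19 final sigma in link host names", 138, 49_904_513_188),
]


def percent(part: int, total: int) -> float:
    """Return part / total expressed as a percentage."""
    return 100.0 * part / total


if __name__ == "__main__":
    for label, hits, total in SAMPLES:
        # Eight decimal places are enough to show the smallest figure.
        print(f"{label}: {percent(hits, total):.8f}%")
```

The 2006 row uses documents rather than links as the denominator, which is why the thread says it cannot be compared directly with the later runs.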

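The scaling estimate and the "about 60 links per document" figure discussed by Mark and Harald can be checked the same way. This is a sketch using only numbers that appear in the thread; the variable names are illustrative, and the extrapolation carries the same caveats Mark lists (representative sample, unweighted links, 2008 data).

```python
# Figures from Mark's list and Erik's 2008-11-19 run.
sample_docs = 819_600_672            # documents in the 2008 sample
sample_links = 49_904_513_188        # links in the 2008 sample
sample_eszett_links = 5_000          # figure used in Mark's calculation
total_docs = 1_000_000_000_000       # documents in the 2008 index

# Scale the sample up to the whole index.
scaling_factor = total_docs / sample_docs
estimated_total_eszett_links = scaling_factor * sample_eszett_links

# Harald's check: Eszett links per document versus per link.
pct_per_document = 100.0 * sample_eszett_links / sample_docs
links_per_document = sample_links / sample_docs

if __name__ == "__main__":
    print(f"scaling factor: {scaling_factor:.1f}")
    print(f"estimated Eszett links in index: {estimated_total_eszett_links:,.0f}")
    print(f"Eszett links per document: {pct_per_document:.8f}%")
    print(f"links per document in sample: {links_per_document:.1f}")
```

Dividing the sample's link count by its document count gives the links-per-document ratio directly, so the factor of about 60 does not need to be estimated.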

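Erik's remark that document URLs are "now all mapped to ss" matches how IDNA2003 (RFC 3490) treats these characters: Eszett is mapped to "ss" and final sigma to regular sigma before Punycode encoding. The sketch below uses Python's built-in "idna" codec, which implements IDNA2003; the helper name to_ascii_2003 is illustrative, and whether the index discussed in the thread used exactly this mapping is an assumption.

```python
def to_ascii_2003(host: str) -> str:
    """Convert a host name to its ASCII form with Python's built-in
    "idna" codec, which applies the IDNA2003 (RFC 3490) mappings."""
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    # Host names taken from the samples quoted in the thread.
    for host in ("www.bußgeldexperten.de", "www.γυναικολόγος.gr"):
        print(host, "->", to_ascii_2003(host))
```

Under this mapping a name written with Eszett and the same name written with "ss" resolve to one ASCII form, as do a name with final sigma and the same name with regular sigma, so the original spelling cannot be recovered from the mapped form.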