Mapping Accents (was Final Sigma (was: RE: Esszett, Final Sigma, ZWJ and ZWNJ))

Mark Davis mark at macchiato.com
Wed Feb 25 18:45:01 CET 2009


Re the deaccenting of uppercase letters.

On the registrar side it can be handled with bundling (annoying as that is),
but on the client side it is quite tricky. The problem is that deaccenting
of uppercase letters loses information, so when mapping back to lowercase,
there is no indication of what the accents would have been.

So take the example of ΧΡΗΣΗΣ.gr. There is no algorithmic way to know
whether that would have come from χρήσης.gr <http://xn--jxas2ajbt.gr> or
χρησής.gr <http://xn--jxar3ajbt.gr> or even
χρησης.gr <http://xn--sxaa2ajbt.gr> without having language-specific
knowledge in the client implementation. The only even possibly feasible way
to handle that, as far as I can tell, would be to special-case Greek
characters by mapping so as to always remove their accents. In that case,
ΧΡΗΣΗΣ.gr, ΧΡΉΣΗΣ.gr, χρήσης.gr <http://xn--jxas2ajbt.gr>, and
χρησής.gr <http://xn--jxar3ajbt.gr> all map to
χρησησ.gr <http://xn--sxaa2ajbt.gr>.
Unlike the final sigma, however, there is no way for the client side to map
from the unaccented version back to an accented version (except in the
special case of a single vowel), so Greeks would always see
χρησης.gr <http://xn--sxaa2ajbt.gr>. I suspect you'd find that
sub-optimal.
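
To make the information loss concrete, here is a rough Python sketch. The
fold() helper is hypothetical (it is not part of IDNA2003 or any draft); it
only assumes a mapping that lowercases, unifies the two sigmas, and strips
combining marks via Unicode decomposition.

```python
import unicodedata


def strip_accents(label: str) -> str:
    """Drop combining marks (such as the Greek tonos) after decomposition."""
    decomposed = unicodedata.normalize("NFD", label)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def fold(label: str) -> str:
    """Hypothetical mapping: lowercase, strip accents, unify final sigma."""
    lowered = strip_accents(label.lower())
    # U+03C2 (final sigma) becomes U+03C3 (sigma), as in the IDNA2003 mapping.
    return lowered.replace("\u03c2", "\u03c3")


names = ["ΧΡΗΣΗΣ", "ΧΡΉΣΗΣ", "χρήσης", "χρησής"]
folded = {fold(name) for name in names}
print(folded)
```

All four spellings collapse to a single string, and nothing in that string
says which vowel, if any, carried the accent.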

Mark


On Wed, Feb 25, 2009 at 03:02, Vaggelis Segredakis <segred at ics.forth.gr> wrote:

>  Dear Mark and Tina,
>
>
>
> The original IDNA2003 mapping has made life easier for us on the final
> sigma -> sigma issue but the example Mark presented brings forth another
> very big problem we have faced with that version: In Greek you never put an
> accent mark (tonos) on a word consisting only of capital letters. The correct
> uppercase for χρήσης.gr (xn--jxas2ajbt.gr) is ΧΡΗΣΗΣ.gr (xn--sxaa2ajbt.gr)
> and not ΧΡΉΣΗΣ.gr, which was accepted by IDNA2003 as the only equivalent.
>
>
>
> We started then and there to use bundling options to bundle DNS tags so
> that they work the way our language is normally used, when it should have
> been the other way round. IDNA tags should be able to represent languages as
> they are used, as already happens for languages written in Latin characters.
>
>
>
> I would welcome a solution that takes this second issue into account as
> well and further simplifies life for Greek users who get a poor experience
> of the IDNs. We had already a meeting with our Telecommunications regulator,
> our Government and the .CY registry and we tried to raise a common position
> on this new solution of the final sigma representation as a separate
> character. The results of this meeting are pending but from my understanding
> a more global solution on these issues that haunt the Greek IDNs would be
> more welcome than patches on a problematic protocol.
>
>
>
> My belief is that if a broader solution would be welcomed by this working
> group, our LIC would be interested to participate in a broad public
> discussion for a consensus on how we wish our IDNs to operate. The question
> is if this WG is ready to bend some rules and change some former decisions
> because it looks like xn-- might be a thing of the past soon.
>
>
>
> Vaggelis
>
>
>  ------------------------------
>
> *From:* mark.edward.davis at gmail.com [mailto:mark.edward.davis at gmail.com] *On
> Behalf Of *Mark Davis
> *Sent:* Wednesday, February 25, 2009 1:18 AM
> *To:* Tina Dam
> *Cc:* Vaggelis Segredakis; idna-update at alvestrand.no; Vint Cerf; Sotiris
> Panaretou; Panagiotis Papaspiliopoulos; Euripides Zervanos
> *Subject:* Re: Final Sigma (was: RE: Esszett, Final Sigma, ZWJ and ZWNJ)
>
>
>
> The original IDNA2003 mapping was chosen for a purpose: it allows
> χρήσης.gr <http://xn--jxas2ajbt.gr> and ΧΡΉΣΗΣ.gr to both go to the same
> page, without requiring bundling. (Note the two different kinds of lowercase
> sigmas.)
>
> I still think a better approach would be to retain the mapping for
> compatibility, but specify that when converting back from punycode, trailing
> sigmas be transformed into final sigmas. For example, in the address bar you
> could type ΧΡΉΣΗΣ.gr, and when you went to the page you'd see
> χρήσης.gr <http://xn--jxas2ajbt.gr> in the address bar.
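
A minimal sketch of that display-side step, in Python. The function name
display_form() is made up for illustration, and treating a sigma before a
hyphen as "trailing" is an assumption drawn from the hyphenated example in
this message, not something any draft specifies.

```python
import re

SIGMA = "\u03c3"        # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA

# A sigma counts as trailing when it ends the string or is directly
# followed by a label separator (".") or a hyphen.
_TRAILING_SIGMA = re.compile(SIGMA + r"(?=[-.]|$)")


def display_form(domain: str) -> str:
    """Rewrite trailing sigmas as final sigmas, for display only."""
    return _TRAILING_SIGMA.sub(FINAL_SIGMA, domain)


print(display_form("χρήσησ.gr"))
print(display_form("ευρείασ-χρήσησ.gr"))
print(display_form("ευρείασχρήσησ.gr"))
```

The third call shows the limitation: without a hyphen, the sigma in the
middle of the run-together name cannot be recognized as word-final.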
>
> The only downside I can see is that it would encourage Greek domain names
> to use interior hyphens where necessary to get the sigma right. So you would
> want to register
>
> ευρείας-χρήσης.gr <http://xn----tlbbisas8eesdbp8a.gr>
>   instead of
> ευρείασχρήσης.gr <http://xn--jxas2ajbt.gr>
>
> But that's not a big downside compared with the alternatives.
>
> Mark
>
>  On Tue, Feb 24, 2009 at 14:34, Tina Dam <tina.dam at icann.org> wrote:
>
> Vaggelis,
>
> I totally understand the frustration and concern that you are expressing. I
> am wondering though if it is not better to get this corrected now, so that
> the Greek script/language is functioning correctly in the Internet/with
> domain names, than it is to have this half solution that really makes things
> worse the larger the volume of domain names that are registered? That
> applies both under .GR and under other TLDs that might introduce the Greek
> characters (.CY is the most natural existing TLD that comes to mind in
> addition to .GR, but of course also gTLDs, and even more importantly as we
> move to the IDN TLDs).
>
>
>
> As far as I see things this is not a matter of mapping or no mappings, but
> in the case of the final sigma it is a matter of a wrong decision being
> made in 2003, making
>
>
>
> U+03A3 GREEK CAPITAL LETTER SIGMA - always map into:
>
>
>
> U+03C3 GREEK SMALL LETTER SIGMA - when in fact (as you and your colleagues
> are well aware of and as you express below) it often should be mapped into:
>
>
>
> U+03C2 GREEK SMALL LETTER FINAL SIGMA
>
>
>
> In other words, the mapping of the Capital Sigma is not a one-to-one nor a
> global solution like for example the mapping of Capital “A” to lower-case
> “a” is, and hence this sigma-mapping should never have been introduced in
> the protocol in the first place.
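
The context dependence of the capital sigma can be seen with Python's
built-in str.lower(), which applies the Unicode Final_Sigma rule to whole
strings. The per_char_lower() helper is a made-up name for the naive
letter-by-letter approach; this is only an illustration, not a statement
about what any IDNA implementation does.

```python
def per_char_lower(text: str) -> str:
    """Naive lowercasing: each character is mapped with no context."""
    return "".join(ch.lower() for ch in text)


word = "ΧΡΗΣΗΣ"
contextual = word.lower()      # whole-string lowercasing, context aware
naive = per_char_lower(word)   # letter-by-letter lowercasing

print(contextual, [hex(ord(ch)) for ch in contextual])
print(naive, [hex(ord(ch)) for ch in naive])
```

The two results differ only in the last code point: U+03C2 from the
context-aware path, U+03C3 from the letter-by-letter path.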
>
>
>
> About solutions... I am wondering if you are going to be at the Mexico
> meeting this following week and if so, perhaps we can find a good time to
> chat further about it? (That would be with my IDN hat on and ICANN hat off,
> since ICANN of course has nothing to do with your policies).
>
>
>
> Tina
>
>
>
>
>
>
>
> *From:* idna-update-bounces at alvestrand.no [mailto:
> idna-update-bounces at alvestrand.no] *On Behalf Of *Vaggelis Segredakis
> *Sent:* Tuesday, February 24, 2009 2:41 AM
> *To:* idna-update at alvestrand.no; 'Vint Cerf'
> *Cc:* 'Euripides Zervanos'; 'Panagiotis Papaspiliopoulos'; 'Sotiris
> Panaretou'
> *Subject:* Re: Esszett, Final Sigma, ZWJ and ZWNJ
>
>
>
> Dear Vint,
>
>
>
> I would love to say that we as the .gr Registry are enthusiastic about the
> proposed solution (PVALID Final Sigma) but in reality we are quite
> skeptical. I can clearly see the advantages of the use of a distinct final
> sigma. The reality however is that the change is significant and the
> registry will have to take measures to reduce the impact.
>
>
>
> It will be necessary for us (and I believe anyone who uses Esszett as well)
> to “map” the two versions of the domain names ourselves to overcome the fact
> that browsers and software do not change overnight and IDNA2003 and IDNA2008
> are incompatible.
>
>
>
> In Greek, a word that finishes with a final sigma in small characters when
> typed in capital letters gets a normal capital sigma in the place of that
> final sigma. Although you have prohibited Capital letters in IDNA2008 any
> browser programmer will try to translate letter by letter a URL typed in
> capitals. Most probably he will then translate a capital Sigma to sigma and
> not final sigma, regardless of its position in the word. Why would a
> programmer try to learn Greek grammar?
>
>
>
> For each final sigma in a domain name, the registrant will have to register
> a variant with a lower sigma in that position as well and each variant that
> occurs if you put more than one final sigma in a domain name. For 2 final
> sigmas you will have 4 variants. If you add to this the tonos punctuation
> point issue (in capital letters it is not used and this gives us two
> variants for each domain name), you end up with sixteen variants for a
> single domain name with two final sigmas (two words)!
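
One way to read that count is sketched below in Python. The variants()
helper is hypothetical and is not how any registry actually bundles names;
it assumes each final sigma and each accented vowel is an independent
two-way choice.

```python
import unicodedata
from itertools import product


def variants(label: str) -> set[str]:
    """Enumerate spellings that differ only in final sigma and accents."""
    choices = []
    for ch in unicodedata.normalize("NFC", label):
        base = unicodedata.normalize("NFD", ch)[0]  # letter without marks
        if ch == "\u03c2":              # final sigma: keep it or use sigma
            choices.append((ch, "\u03c3"))
        elif base != ch:                # accented letter: keep or strip
            choices.append((ch, base))
        else:                           # everything else is fixed
            choices.append((ch,))
    return {"".join(combo) for combo in product(*choices)}


bundle = variants("ευρείας-χρήσης")
print(len(bundle))
```

For a two-word name with two final sigmas and two accented vowels, that is
four independent choices, so 2 * 2 * 2 * 2 = 16 spellings.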
>
>
>
> We already do bundling of the domain names. We will probably do it in the
> future, especially if this proposed solution moves forward. If you have any
> other alternatives though that could shed some new light on these issues,
> this might be a good time to start discussing them. Even if this means a
> best practice document or IDNAv2_2009, anything should be open to
> discussion.
>
>
>
> Best Regards,
>
>
>
> Vaggelis Segredakis
>
> Administrator of the .GR Top Level Domain
>
> Institute of Computer Science
>
> Foundation for Research and Technology - Hellas
>
> Tel. +30-281-0391450
>
> Fax +30-281-0391451
>
> Email segred at ics.forth.gr
>
> Message: 3
>
> Date: Mon, 23 Feb 2009 20:14:04 -0500
>
> From: Vint Cerf <vint at google.com>
>
> Subject: Re: Esszett, Final Sigma, ZWJ and ZWNJ
>
> To: Mark Davis <mark at macchiato.com>
>
> Cc: Paul Hoffman <phoffman at imc.org>, Andrew Sullivan <ajs at shinkuro.com>,
>     idna-update at alvestrand.no, John C Klensin <klensin at jck.com>
>
> Message-ID: <2C4BC1C5-3B45-46FA-AA6D-9A60D3C72B35 at google.com>
>
> Content-Type: text/plain; charset="utf-8"
>
>
>
> Mark,
>
>
>
> thanks - I think what left me in an ambiguous state was the term "bits on
> the wire".  In your example, under the IDNA2003 mapping process, the final
> sigma is mapped into ordinary sigma and THEN the resulting string is looked
> up (after conversion to xn-- format using the punycode algorithm). The two
> forms become identical prior to lookup.
>
> Under the proposed IDNA2008 rules, the two strings remain distinct in both
> the U-label and A-label format and thus look "different" on the wire and
> unless other measures are taken (bundling, restricted registration, etc) it
> is possible for the two domains to yield distinct results on lookup.
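
The IDNA2003 half of that picture can be checked with Python's built-in
"idna" codec, which implements the IDNA2003 rules (nameprep followed by
punycode). The wrapper name to_ascii_2003() is made up for this sketch, and
no IDNA2008 behavior is shown here.

```python
def to_ascii_2003(domain: str) -> bytes:
    """Convert a domain to its ASCII form using IDNA2003 rules."""
    return domain.encode("idna")


lower = to_ascii_2003("χρήσης.gr")       # accented, ends in final sigma
upper = to_ascii_2003("ΧΡΉΣΗΣ.gr")       # accented capitals
unaccented = to_ascii_2003("ΧΡΗΣΗΣ.gr")  # capitals as normally written

print(lower, upper, unaccented)
```

The first two inputs should come out identical, since nameprep folds case
and maps the final sigma to sigma before lookup, while the unaccented
capitals produce a different xn-- label.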
>
>
>
> Paul - is that the picture you wanted to paint?
>
>
>
> sorry to be slow to see which bits you were comparing.
>
>
>
> v
>
>
>
>
>
> Vint Cerf
>
> Google
>
> 1818 Library Street, Suite 400
>
> Reston, VA 20190
>
> 202-370-5637
>
> vint at google.com
>
>
>
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>