support of metadata

jean-michel bernier de portzamparc jmabdp at gmail.com
Mon Sep 14 12:18:52 CEST 2009


Dear Colleagues,
I will respond to Martin and John in a single message.

At 04:39 14/09/2009, John C Klensin wrote:
--On Monday, September 14, 2009 02:11 +0200 jean-michel bernier
de portzamparc <jmabdp at gmail.com> wrote:

> Dear colleagues,
> among the points we introduced during the WG/LC that have not
> been addressed yet is the end-to-end support of script
> oriented metadata (one example being the French majuscules).
> Metadata can be supported either:
>
> - explicitly through a specific new code from "unassigned" -
> since Language Tag and Private Use control are disallowed

As is any use of an UNASSIGNED code.  The use of such codes is a
protocol violation; conforming implementations will not look up
labels containing them.

Correct. This is why an extended punyplus algorithm would not violate
IDNA.
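
To make the point about unassigned code points concrete, here is a minimal
sketch of how an implementation might detect them in a label. It is only an
approximation: it relies on the Unicode database bundled with the Python
interpreter and on general category Cn, whereas the IDNA UNASSIGNED class is
derived for a specific Unicode version and treats noncharacters separately.
The function name is hypothetical.

```python
import unicodedata

def unassigned_code_points(label: str) -> list[str]:
    """List code points that this Python's Unicode database has no
    assignment for (general category Cn). Approximation only, see above."""
    return [f"U+{ord(ch):04X}" for ch in label
            if unicodedata.category(ch) == "Cn"]

# A plain ASCII label contains no unassigned code points.
print(unassigned_code_points("etat"))
# U+0378 is a reserved, unassigned position in the Greek block.
print(unassigned_code_points("e\u0378tat"))
```

A conforming lookup would refuse any label for which such a list is not
empty, which is why the explicit approach only works between cooperating
ends.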

> - implicitly through an unlikely sequence of PVALID codes (ex:
> FE73-0061 ... 007A)

Since there is no prohibition on such strings, nothing prevents
you from using them and interpreting them in a special way,
assuming that FE73 is not problematic from a Bidi standpoint
(while it is identified as an "Arabic" character, the code point
does not appear in Arabic-Shaping.txt, which drives Bidi).
However, most applications globally will interpret them as valid
labels, many or most applications will warn against them as
mixing scripts, and attempts to use specific characters as
metadata indicators will not work satisfactorily except in your
particular applications.

This is why I propose that code point. It makes the security more robust:
such a label should pass through seamlessly only in punyplus-aware
applications.

> If I call "punyplus" the extended algorithm that will provide
> this support: - in the first case there is no risk of
> confusion since it will only work if both ends are punyplus
> enabled.

In the first case, anyone supporting "punyplus" will be in
violation of IDNA.

They will simply support IDNA, and more besides.

> - in the second case there is no risk of confusion either but
> the sending end has to be punyplus enabled.

And the receiving system has to know to apply "punyplus" rules
rather than IDNA rules.

No. It will receive an A-label. I assume that every host will soon accept
U-labels and A-labels as aliases, whether or not its hosting service, ISP,
or operating system supports punycode. The only issue is that ecole.fra
and Ecole.fra may target different hosts, and may be filtered out in case
of error.

> If the WG documents remain unchanged in terms of French
> majuscules support, the support of the two will be offered as
> a response to the "+" entry. Ex. http://+Etat.fr.

That would be an interoperability problem.  I would be quite
surprised if FRNIC went along and more surprised if ICANN
permitted any gTLD to do this.

At this stage we are concerned with neither AFNIC nor ICANN. We are
interested in a better, standard use of IDNA and in the Project.FRA and
Multilinc test beds. We are confident that by the time the two test beds
have been carried out and have published their reports, punyplus will have
been adopted as an Internet standard by the IESG, and majuscule metadata
will be supported off the shelf by ISO 10646 and Unicode. However, before
introducing an NWIP on the matter, it is better to investigate and test
different solutions and get them validated.

At 09:42 14/09/2009, Martin J. Dürst wrote:
Hello John, Jean-Michel, others,

On 2009/09/14 11:39, John C Klensin wrote:
>
> --On Monday, September 14, 2009 02:11 +0200 jean-michel bernier
> de portzamparc <jmabdp at gmail.com> wrote:
>
>> Dear colleagues,
>> among the points we introduced during the WG/LC that have not
>> been addressed yet is the end-to-end support of script
>> oriented metadata (one example being the French majuscules).
>> Metadata can be supported either:

>> - implicitly through an unlikely sequence of PVALID codes (ex:
>> FE73-0061 ... 007A)
>
> Since there is no prohibition on such strings, nothing prevents
> you from using them and interpreting them in a special way,
> assuming that FE73 is not problematic from a Bidi standpoint
> (while it is identified as an "Arabic" character, the code point
> does not appear in Arabic-Shaping.txt, which drives Bidi).

Where in Bidi does it say so? The Bidi document refers to Bidi
properties, and these are defined in UnicodeData.txt
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt). There, U+FE73
is AL (Arabic Letter), which means that the above won't work exactly as
proposed. Of course, there are ample other characters in Unicode which
may be suited for misuse for the above mentioned purpose.
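
The Bidi property mentioned above can be checked directly. The following is
a minimal sketch using Python's standard unicodedata module, which exposes
the Bidi class recorded in UnicodeData.txt for the Unicode version bundled
with the interpreter. The helper name is hypothetical, and the sample label
is just the FE73-0061 ... 007A sequence reduced to three code points.

```python
import unicodedata

def bidi_classes(label: str) -> list[str]:
    """Return the Bidi class of every code point in the label."""
    return [unicodedata.bidirectional(ch) for ch in label]

marker = "\uFE73"
# Character name and Bidi class as recorded in the Unicode database.
print(unicodedata.name(marker), unicodedata.bidirectional(marker))

# The marker followed by Latin letters mixes an AL character with L ones.
print(bidi_classes(marker + "az"))
```

Because the marker is of class AL while the Latin letters are of class L,
a label built this way mixes right-to-left and left-to-right characters,
which is the situation the Bidi document is concerned with.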

In selecting that code point, I simply looked for what would cause the
most errors if it were not replaced by punyplus. I am not a Unicode
expert, and any advice is welcome.

> However, most applications globally will interpret them as valid
> labels, many or most applications will warn against them as
> mixing scripts, and attempts to use specific characters as
> metadata indicators will not work satisfactorily except in your
> particular applications.

Yes indeed.

This is exactly what I want to achieve. It will result in an error. Is
this not the usual protection adopted against IDNA misuse?

>> If the WG documents remain unchanged in terms of French
>> majuscules support, the support of the two will be offered as
>> a response to the "+" entry. Ex. http://+Etat.fr.

While I'm writing this mail, some comments on majuscules that I have
been thinking about for quite a while.

On careful reading, the French article
http://fr.wikipedia.org/wiki/Majuscule and the English counterpart at
http://en.wikipedia.org/wiki/Majuscule aren't too different at all. Not
only French, but a wide range of (if not all) European languages know a
difference between 'majuscules' and 'capitales', and good orthography
and typography is impossible without these concepts, even if they may be
less explicitly distinguished in other languages than in French.

Correct.

The reason why this distinction hasn't made it into character encoding
is in part historical (fewer computers than typewriters), but a big part
of it, in my opinion, has to be attributed to the fact that a large
majority of the population everywhere around the world thinks primarily
visually. I.e. most people everywhere around the world want an upper
case letter when they want an upper case letter and a lower case letter
when they want a lower case letter, and on first approximation, they
don't care whether something is a 'majuscule' or a 'capitale' because
they both look the same. Trying to teach everybody to always be aware of
the difference and press the right shift key would simply be impossible.
That's not only the case for this specific difference, but is also a
widely reported phenomenon on other levels, such as document appearance
vs. document structure (think nicely structured, valid (X)HTML vs. "it
has to look the same on every browser").

This may be difficult to understand for people who think mainly
logically rather than visually. I suggest they take a Myers-Briggs test
and compare their result with the percentages for each type.

Unicode and punycode strive to respect the case as entered. IDNA does not,
for a reason which is now gone (the correction of the confusion between
what I would call Unicode real case folding and DNS virtual case folding).
As long as Unicode/punycode remained transparent to what users enter,
nobody cared why a given case was entered. Now that an IDNA-internal
reason conflicts with the users' reasons, a solution has to be found.

The new pre-punycode lowercasing is a change in the punycode algorithm
which, when combined with the DNS "virtual" lowercasing, permits robust
support of everything except uppercase. This therefore has to be
corrected: the Unicode code point that is entered MUST be received at the
other end; otherwise IDNA is in breach of the basic Internet end-to-end
principle.
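
The difference between the two behaviours can be illustrated with a minimal
sketch based on Python's built-in codecs. Two caveats apply: the "idna"
codec in the standard library implements IDNA 2003 (nameprep), not the
documents under discussion here, and it is used only to show the effect of
folding case before the punycode step. The helper names are hypothetical.

```python
def roundtrip_punycode(label: str) -> str:
    """Encode with bare punycode and decode again: no case folding."""
    return label.encode("punycode").decode("punycode")

def roundtrip_idna2003(label: str) -> str:
    """Encode and decode with Python's IDNA 2003 codec, which
    lowercases (nameprep) before applying punycode."""
    return label.encode("idna").decode("idna")

upper = "\u00C9cole"   # "École", entered with a majuscule
lower = "\u00E9cole"   # "école"

# Bare punycode returns exactly the code points that were entered.
print(roundtrip_punycode(upper) == upper)
# With case folding first, both spellings yield one and the same A-label,
# so the receiving end can no longer tell which one was entered.
print(upper.encode("idna") == lower.encode("idna"))
print(roundtrip_idna2003(upper) == lower)
```

In other words, the loss of the entered case happens in the step added
before punycode, not in punycode itself.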

There are two ways of considering this: as a punycode change (which the
Charter prohibits) or as a pre/post-punycode change. If punycode must
receive lowercased entries (an IDNA addition), then IDNA must provide for
the required uppercasing of the output when needed.

What "+" indicates in http://+Etat.fra is not that E must be uppercase,
but that Etat (State) is a case-sensitive entry. http://+état.fra (status)
will result in lowercase output.
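
A minimal sketch of how an application might read the "+" entry follows.
It is purely illustrative: the function name, the return shape, and the
choice to lowercase unmarked labels are assumptions made for the example
and are not part of any specification.

```python
def split_case_marker(label: str) -> tuple[bool, str]:
    """Hypothetical reading of the "+" entry.

    A leading "+" flags the label as case sensitive, and the case is kept
    exactly as entered. Without it, the label is lowercased (assumed here
    to mimic the usual IDNA behaviour)."""
    if label.startswith("+"):
        return True, label[1:]
    return False, label.lower()

print(split_case_marker("+Etat"))       # flagged, majuscule kept
print(split_case_marker("+\u00E9tat"))  # flagged, stays lowercase
print(split_case_marker("Etat"))        # not flagged, lowercased
```

The point of the flag is that it records that case matters for the entry,
not that any particular letter must be uppercase.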

Now, it is true that Unicode supports majuscules imperfectly. It supports
their orthotypography but not their grammar, i.e. the reason why they are
entered as uppercase (namely that they are majuscules, which is
metalanguage information). Introducing in ISO 10646 a way to support that
kind of metalanguage information can only be a plus, because some
typographic rendering styles might use it. For example, accented
majuscules should be rendered as accented uppercase letters in the best
mechanical printing, but as unaccented uppercase letters in handwriting.
These are only "should"s, and a font may offer more precision than Unicode
could support, if required.

Portzamparc