Combining mark vs combining character?

jean-michel bernier de portzamparc jmabdp at gmail.com
Sat Jan 8 10:30:10 CET 2011


Jefsey went to the hospital emergency room. He has limited Internet access and
may be hampered for a few weeks. He says that we should stop using "Unicode
strings" when we talk of "U-labels". This is extremely confusing, because it
mixes end-to-end and fringe-to-fringe terminology.

Users may enter Unicode strings or whatever else they want. It is up to the
application, to their UI, or to the ML-DNS to transform their entry into
U-labels. Also, ML-DNS is not limited to the Internet naming system, and we
will meet many applications with different naming rules and hence different
conversion preparations to U-labels. Many will be based upon ISO 10646 or
entirely different standards/technologies. We will therefore adapt from a user
point of view, as exemplified in RFC 5895. However, we will strive to respect
the forthcoming IAB positions on IDNA as far as the inside of the Internet is
concerned (end to end).

This is why we want to stick to the term "U-label", for "User-label", as what
is received by the Internet from the user, and "Unicode/ISO 10646 string" or
"User-entry" for what the user enters. At this stage this does not create a
problem "if RFC 5892 has no bug", as previously discussed.

Please remember that ML-DNS is fringe to fringe and will accept semiotic
entries. Working examples today: Kinect entries, or audio entries for finger
snaps or audionames. The way this works is asymmetric, based upon IDNA2008.
An audio entry is converted into a pseudo-U-label, which is used by the local
fringe as per IDNA2008: conversion to A-label, resolution, conversion back to
the pseudo-U-label on the other end for usual processing on the other fringe.
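
To make the A-label step concrete, here is a minimal sketch in Python. It only
shows the Punycode encoding with the "xn--" prefix and its inverse, using the
standard "punycode" codec. It deliberately skips the IDNA2008 validation rules
of RFC 5891 and RFC 5892, and it says nothing about how a pseudo-U-label is
derived from an audio or Kinect entry. The function names to_a_label and
to_u_label are hypothetical, chosen for illustration only.

```python
ACE_PREFIX = "xn--"  # ASCII-compatible encoding prefix used by A-labels


def to_a_label(u_label: str) -> str:
    """Encode one label with Punycode and add the ACE prefix.

    Sketch only: no IDNA2008 validity checks are performed here.
    """
    if u_label.isascii():
        # Pure ASCII labels are passed through unchanged.
        return u_label
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Reverse of to_a_label: strip the ACE prefix and decode Punycode."""
    if not a_label.lower().startswith(ACE_PREFIX):
        return a_label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
```

With this sketch, the label "bücher" becomes "xn--bcher-kva" on the sending
fringe, and to_u_label gives back "bücher" on the other end.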

The working position of our emerging IUse community is expressed in the
introductory note at http://incsa.org. Comments welcome.

I hope my rendition of Jefsey's input is understandable.

Portzamparc



2011/1/5 Kenneth Whistler <kenw at sybase.com>

> Simon asked:
>
> > I need a clarification regarding this paragraph in section 4.2.3.2 of
> > RFC 5891:
> >
> >    The Unicode string MUST NOT begin with a combining mark or combining
> >    character (see The Unicode Standard, Section 2.11 [Unicode] for an
> >    exact definition).
>
> Mark Davis suggested that this would better read:
>
>     The Unicode string MUST NOT begin with a character having a General
>     Category property value of Mark (M).
>
> and I concur that that would be more precise.
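
The wording suggested above maps directly onto a property lookup. A minimal
sketch in Python, assuming the standard unicodedata module (whose data follows
the Unicode version bundled with the interpreter, not necessarily the one an
IDNA implementation targets); the function name begins_with_mark is
hypothetical:

```python
import unicodedata


def begins_with_mark(label: str) -> bool:
    """Return True if the first code point has General Category M.

    The two-letter categories Mn, Mc and Me all start with "M", so a
    prefix test covers the whole Mark group.
    """
    if not label:
        # An empty string has no first character to test.
        return False
    return unicodedata.category(label[0]).startswith("M")
```

A label such as U+0301 followed by "a" would be rejected by this test, while
"a" followed by U+0301 would pass it.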
>
> And to add to Mark Davis' clarification and respond further
> to one of Simon's questions:
>
> > There is one section 3.6 on "Combination" that gives the precise
> > definition of a "Combining character":
> >
> >    Combining character: A character with the General Category of
> >    Combining Mark (M).
>
> > 3) What is the precise definition of a "combining mark"?
>
> In the Unicode Standard, "combining character" is the term
> of art. That is the general term which is used throughout
> the standard for referring to characters which combine.
> Hence the normative definition D52 for "Combining character"
> which Simon quotes from Section 3.6 of Unicode 5.0.
>
> "Mark" is a property value alias which refers to any of the
> three possible General Category values for a combining character
> in the standard. This is defined by the following entries in
> the UCD data file, PropertyValueAliases.txt:
>
> gc ; M         ; Mark                             # Mc | Me | Mn
> gc ; Mc        ; Spacing_Mark
> gc ; Me        ; Enclosing_Mark
> gc ; Mn        ; Nonspacing_Mark
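
As an illustration of the three values grouped under Mark, here is a small
Python sketch, again assuming the standard unicodedata module. The table
MARK_SUBCATEGORIES and the function mark_kind are hypothetical names; the long
names are the aliases from the entries quoted above.

```python
import unicodedata

# Short General Category value -> long alias, as in PropertyValueAliases.txt
MARK_SUBCATEGORIES = {
    "Mc": "Spacing_Mark",
    "Me": "Enclosing_Mark",
    "Mn": "Nonspacing_Mark",
}


def mark_kind(ch: str):
    """Return the long alias of a mark's category, or None for non-marks."""
    return MARK_SUBCATEGORIES.get(unicodedata.category(ch))


# One sample character per subcategory:
#   U+0301 COMBINING ACUTE ACCENT      (Mn)
#   U+0903 DEVANAGARI SIGN VISARGA     (Mc)
#   U+20DD COMBINING ENCLOSING CIRCLE  (Me)
for sample in ("\u0301", "\u0903", "\u20dd", "a"):
    print(f"U+{ord(sample):04X}", mark_kind(sample))
```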
>
> "Combining mark" is an unofficial synonym for "combining character".
> It occurs occasionally in Unicode-related documents, including
> the text of the standard itself, because Unicode implementers
> often talk about "spacing marks" and "nonspacing marks" and
> "enclosing marks" and then treat the union of all those as
> "combining marks" by force of habit in talking about "marks".
>
> My advice for external standards referring to the Unicode
> Standard would be to stick to "combining character", which is
> and will remain the term with the normative definition in
> the Unicode Standard. And the best point of reference is
> to Section 3.6, "Combination", which is where this term (and
> related terms) have their normative definitions in the standard.
>
> --Ken