emoji (was Re: I-D Action: draft-klensin-idna-rfc5891bis-00.txt)

John C Klensin klensin at jck.com
Sat Mar 18 15:42:17 CET 2017

--On Friday, March 17, 2017 03:14 -0700 Asmus Freytag
<asmusf at ix.netcom.com> wrote:

> On 3/16/2017 1:15 PM, John C Klensin wrote:
>> Just to respond to this one issue...
>> --On Monday, March 13, 2017 06:57 +0000 Shawn Steele
>> <Shawn.Steele at microsoft.com> wrote:
>>> ...
>>> I did not ignore the Unicode categories of Emoji.  I
>>> indicated that despite their classification, Unicode (not
>>> me) has been including new emoji characters in their updated
>>> tables.  The 2003 emoji did not surprise me, but I was
>>> surprised that they were extending it.
>> Shawn,
>> First, while I'm not sure how much difference it makes in
>> general, the smiling faces, etc., of earlier years were
>> "emoticons" (or just typographic symbols) and not emoji, which
>> are a newer invention and addition to Unicode.  While there
>> might have been other issues (and probably were), had the
>> Unicode Consortium been convinced that emoji were a new script
>> and kind of letter, they could easily have defined such a
>> script and, I believe, even a new "Letter" property (or
>> assigned them to "Lo") without any damage to stability rules
>> or other important principles.   Worst case, they could have
>> coded new forms of the emoticons into the emoji script, done
>> something appropriate with NFKC if they thought that was
>> necessary, and moved on.
>> They didn't.  Which brings me to...
> Actually, .... no.
> The emoji are clearly not "letters". The word emoji means "picture
> character"; they share some similarities in classification with
> logographs (signs for words), but the picture is in the foreground,
> not the word.

Of course.  But that is the point that several of us have tried
to make -- IDNA is designed around analogies to letters and
digits, just as the DNS preferred syntax is, and the rules for
allowable hostnames were before it.  While the logograph or
pictograph arguments might be interesting, from what I
understand of Unicode, I personally think that the
classification as Symbols is exactly correct, as is their
decision to exclude emoji from identifiers.  In that regard, the
only thing that confuses me is prohibiting emoji from
identifiers but recommending them for domain names.
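
To make the classification point concrete, here is a minimal sketch in
Python.  It assumes only the standard unicodedata module (whose
Unicode version depends on the Python release in use) and uses
Python's own identifier rule, str.isidentifier(), as a stand-in for
"allowed in identifiers".  That rule is not the IDNA2008 rule set; the
describe() helper is a hypothetical name for illustration only.

```python
import unicodedata

def describe(ch):
    """Return (General_Category, allowed-as-Python-identifier) for one character."""
    return unicodedata.category(ch), ch.isidentifier()

# Two letters, an older typographic smiley, an emoji, and a currency symbol.
samples = ["a", "\u4e2d", "\u263a", "\U0001F600", "$"]

for ch in samples:
    gc, ok = describe(ch)
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):04X} {name}: GC={gc} identifier={ok}")
```

In this sketch the letters come back with "L*" categories and are
accepted, while the smiley, the emoji, and the currency symbol come
back with "S*" categories and are rejected by that identifier rule.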

However, if they are symbols and not letters, then a suggestion
that they be allowed in domain names is not an IDN issue (IDNA
or otherwise).   It is a new sort of beast that requires either
some rethinking of our basic (and 40+ year old) assumptions of
how we name network objects on the Internet or some approach
that separates branding and other resource names from names of
network objects.  If people think they (and that sort of change)
belong in the DNS -- I don't, as should probably be obvious from
my comments about two-level solutions -- I suggest that the right
place to take up such a proposal is in DNSOP.

> The use of emoji in text is somewhat different. They can be
> used entirely on their own, but it remains common to use them
> with text, or alternating with text statements. The
> surrounding text can be of any script, making the emoji less
> something that embodies its own script, but more akin to
> characters that have the "Common" script (are to be used with
> any script).

Right - complete agreement.   It may be worth adding that
non-punctuation symbols of various flavors appear in running
text all the time, with various currency symbols being the most
obvious examples.   Still doesn't make them appropriate for
identifiers in spite of the observation that some people "want"
to do that.

> Despite the precursors (smile) they represent something that's
> effectively novel, with novel and evolving usage conventions.

>  I'm not surprised to see that they manage, unlike many other
> additions to Unicode, to create problems in extending existing
> classifications and usage rules.

And that takes us back to part of Patrik's point.  Even if they
were otherwise a good idea (which I obviously think they are
not), trying to figure out the right rules for their use in
identifiers, situations where unambiguous matching rules are
essential, is just premature at this stage.  Even if none of the
other issues existed, the presence of special [combining]
modifiers suggests that, sooner or later, some operations
roughly equivalent to normalization will almost certainly be
needed.   That is definitely something Unicode will need to
address (or not), rather than something the IETF should try to
take on.
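
The matching concern can be illustrated with a small sketch, again
assuming only Python's standard unicodedata module.  The example pair
(THUMBS UP SIGN with and without a skin-tone modifier) and the helper
names are chosen here for illustration and are not taken from the
thread.  The sketch checks whether any of the four existing Unicode
normalization forms makes the two strings compare equal.

```python
import unicodedata

THUMBS_UP = "\U0001F44D"         # THUMBS UP SIGN
MEDIUM_SKIN_TONE = "\U0001F3FD"  # EMOJI MODIFIER FITZPATRICK TYPE-4

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalized_forms(s):
    """Map each standard normalization form to the normalized string."""
    return {form: unicodedata.normalize(form, s) for form in FORMS}

def matches_under_any_form(a, b):
    """True if a and b compare equal under at least one normalization form."""
    fa, fb = normalized_forms(a), normalized_forms(b)
    return any(fa[form] == fb[form] for form in FORMS)

# Control: precomposed e-acute vs. "e" + COMBINING ACUTE ACCENT do match.
print(matches_under_any_form("\u00e9", "e\u0301"))

# The emoji with and without its modifier do not match under any form.
print(matches_under_any_form(THUMBS_UP, THUMBS_UP + MEDIUM_SKIN_TONE))

# The modifier is a Modifier Symbol (Sk) with combining class 0.
print(unicodedata.category(MEDIUM_SKIN_TONE),
      unicodedata.combining(MEDIUM_SKIN_TONE))
```

Whether such pairs ought to match at all is exactly the open question;
the sketch shows only that the existing normalization forms do not
settle it.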

> While I argue here that Unicode had little choice in the way GC and
> script values were assigned, I would equally argue that those GC and
> script values do not do a good job of capturing the essence of these
> beasts in the best way.

Again, no disagreement from me although, if they had perfect
foresight (something I don't have and don't expect of them), it
might have made sense to consider assigning a different GC code
or, more likely, subcode of "Symbol".  As you say, they are
something of a new creature.

> I'm fully cognizant of many of the issues that these
> represent, but I think that I like the more nuanced reply by
> Andrew better than any argument that is based on the results
> of implementing such a blunt instrument as the GC.

Agree there too. I was merely pointing out that they didn't make
a special GC category and that IDNA2008 is very dependent on
those categories.

More for others than for Asmus...

As I've said many times, if I had been in charge of Unicode
decisions, there are a number of them I would have made
differently.  But I have never believed that my preferences were
universally "more correct" or even "better" than the ones that
were made.  The difficulty, which was understood even before
Unicode first appeared, is that a universal character set has to
serve many functions and that, inevitably, optimizing for some
functions (or even trying to strike a balance) will make some
things easier and others more difficult, perhaps even requiring
kludges to work even passably well.  

In particular, I come partially out of backgrounds in
programming languages and classification, for which identifiers
and precision are really important.  That leads me to
instinctively try to optimize for those things, even while
realizing that the overwhelming uses for a coded character set
-- the ones that affect the most users most of the time -- are
for running text in which inconsistencies are normal and we all
just take things in stride.  It also leads me to believe that
infallibility and perfect foresight are rare among humans and
that creating stability rules that make it essentially
impossible to undo decisions that turn out to be wrong,
especially those based on incorrect assumptions, is a bad
idea... but I can recognize the attractiveness of such rules in
many situations.  

IMO, we will make progress only if we can all listen more and
try to understand each other's perspectives better, rather than
letting our discussions be about "you are wrong; no _you_ are
wrong" or "we need to do X" with an implicit "no matter what other
damage it causes".

