emoji (was Re: I-D Action: draft-klensin-idna-rfc5891bis-00.txt)

John C Klensin klensin at jck.com
Sun Mar 12 21:58:51 CET 2017

I agree with Andrew, but let me give an additional reason or
two.  First, it is really important to understand that
restrictions on the characters that can be used in domain names
are not only not a new idea with IDNA, they predate the DNS
itself.  The so-called Letter-Digit-Hyphen rule, which excluded
all symbols and special characters other than hyphen, goes back
to the earliest days of the ARPANET host table.  Symbols were
prohibited for three reasons.  Those were, IIR (in declining
order of importance): because they (and different ones of them) were
often used as delimiters or special indicators in the wide range
of operating systems of the day, there were ambiguities about
what they were called and hence how to tell someone what a host
name was over the phone or equivalent, and they didn't appear in
ordinary words.   Hyphen (or "minus sign", or "dash" --
"hyphen-minus" didn't come along as a term until later and I
suspect many ordinary users would still give one a strange look
if it were used in conversation) was allowed because it _did_
appear in ordinary words and because some sort of intra-name
break was needed.  "Confusion" was considered in only one
context, which was the relationship between "-" and "_" with a
clear conclusion that only one of the two should be allowed
because of distinctions in handwriting (if nothing else).  And
hyphen was chosen from the two because of that ordinary word
issue and because there was a type style ambiguity about how the
character coded as "_" was rendered, with a history of using a
back arrow and considering the two equivalent.

There were also some rules that we would describe as "contextual
restrictions" today: no leading or trailing hyphens (delimiter
problems), no doubled hyphens (too hard to distinguish in
handwritten text and some type styles), and no leading digits
(potential for confusion with addresses: IP addresses and their
predecessors).  The leading digit restriction was removed in RFC
1123 and the special rules about hyphens never got written down,
which was a good thing because, if they had been, we wouldn't have
been able to pull the "xn--" trick and other ACE variations.
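
For anyone who wants the rules above in concrete form, here is a
minimal sketch in Python.  It is an illustration only, not text from
any RFC: the names LDH_LABEL and is_ldh_label are hypothetical, the
63-octet label limit comes from the DNS specification rather than from
the host table history, and doubled hyphens are deliberately accepted
because, as noted, that rule was never written down.

```python
import re

# Hypothetical names for illustration; this encodes only the rules
# described above and is not a complete host name validator.
# Letters, digits, and hyphens, with no leading or trailing hyphen.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_ldh_label(label: str, allow_leading_digit: bool = True) -> bool:
    """Return True if the label follows the Letter-Digit-Hyphen rule.

    allow_leading_digit=True reflects the relaxation in RFC 1123;
    pass False to model the older host table restriction.
    """
    if not 1 <= len(label) <= 63:  # DNS label length limit (assumption added here)
        return False
    if LDH_LABEL.fullmatch(label) is None:
        return False
    if not allow_leading_digit and label[0].isdigit():
        return False
    return True
```

With this sketch, "xn--caf-dma" and "3com" are accepted, while
"-bad", "bad-", and "under_score" are rejected; "3com" is also
rejected when allow_leading_digit is False.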

Those "host name" rules were incorporated into the DNS as the
"preferred syntax", not only because they were used for names of
hosts but because the restriction had been incorporated into
many application protocols, notably SMTP and its predecessors.   

While not documented, again IIR, until RFC 1123 and then 1591,
and with some pieces not being documented at all because Jon
knew what he would allow, there were special rules for the root
-- no hyphens, no digits (again, concern about IP addresses),
and a 2-3-4/5 rule about lengths.  ICANN discarded the latter
because "new TLD" applicants around 2000-2002 really wanted
longer strings (perhaps the first instance of "can't say 'no'");
if it had remained in place, we wouldn't be having issues about
"special names" and who gets to allocate them today.

So this story is very old news.  Perhaps the decisions were
wrong, but it is a little late to attack them now (see
forthcoming note). As Andrew points out, the guiding principle
in IDNA2008 was to extend the principles underlying the LDH rule
to IDNs, not to impose new restrictions out of arrogance or
anything else.  And, again, none of that was about "confusion".

Now, to accomplish that "letters and digits" principle for
phonetic scripts and make sure that CJK worked out, the IDNA
rules used the available General Categories.  The emoji, when
they came along, got the same General Category that the earlier
emoticons did -- "So".  Not an IETF decision or some special way
to discriminate against them, but Unicode assignment of that
category, which results in their exclusion from IDNA
(independent of any issues about compatibility of Unicode 7, 8,
or 9 with IDNA).  GeneralCategory=So implies DISALLOWED.
There are some other issues with emoji, issues that relate to
non-standardization of representations; the implications of
assorted combining forms which, by the way, don't normalize;
questions about what to do about screen readers and voice input so
as to have things be even mostly unambiguous in a rapidly
changing environment; and the other issues Andrew mentions.
Those issues might have been sufficient to argue for banning
emoji from IDNs even if, e.g., they had been assigned some
"Letter"-like General Category on the theory that they represent
the words of a new language, but the "So" decision --again, a
Unicode decision, not an IETF one-- preempted those discussions.
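
The General Category point can be checked directly.  The following is
a rough sketch in Python, assuming the standard unicodedata module; it
models only the "letters and digits" category test (the LetterDigits
rule in RFC 5892) and ignores the exception lists, the stability
checks, and the contextual rules that the real IDNA2008 derivation
also applies.  The names LETTER_DIGIT_CATEGORIES and
passes_letter_digit_test are hypothetical.

```python
import unicodedata

# General Categories covered by the LetterDigits rule in RFC 5892.
# Everything else about IDNA2008 is left out of this sketch, so a True
# result here does NOT by itself mean a code point is PVALID.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def passes_letter_digit_test(ch: str) -> bool:
    """Return True if ch has a letter-or-digit-like General Category."""
    return unicodedata.category(ch) in LETTER_DIGIT_CATEGORIES


# U+1F431 CAT FACE is an emoji; U+732B is the Han character for "cat".
print(unicodedata.category("\U0001F431"), passes_letter_digit_test("\U0001F431"))
print(unicodedata.category("\u732B"), passes_letter_digit_test("\u732B"))
```

The emoji reports "So" and fails the test; the Han character reports
"Lo" and passes it.  A code point that fails this test and is not
rescued by one of the explicit exception rules ends up DISALLOWED.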


--On Sunday, March 12, 2017 9:36 AM -0400 Andrew Sullivan
<ajs at anvilwalrusden.com> wrote:

> On Sun, Mar 12, 2017 at 03:23:41AM +0000, Shawn Steele wrote:
>> And then when we get to emoji, well it's pretty hard to mix
>> up a cat with a human with the word "cat", so, whether they
>> seem "serious" or not, it seems to me to be pretty harsh to
>> just get rid of the whole set when people clearly want to use
>> them.  Sure, they're silly, but sometimes people aren't in
>> businesses that deal entirely in life-and-death matters.
> The reason that IDNA2008 doesn't permit emoji is not because
> the WG thought that they were more or less serious.  The
> reason that emoji are DISALLOWED is at bottom roughly the same
> reason that the conjoining Hangul Jamo are DISALLOWED.  The
> approach that IDNA2008 took was that the DNS did not need to
> permit any label anyone might want, but instead needed to
> permit effective mnemonics that people could use reliably.  In
> effect, the goal was to "internationalize LDH".  LDH does not
> permit lots of labels that would be useful to people.  It
> doesn't permit apostrophes or spaces, for instance -- both
> things which have turned out to be important for DNS-SD and
> which also cause trouble for LDH-restricted zones.  Hangul
> Jamo was like that: the precomposed characters were better for
> these purposes.
> Emojis are poor choices for mnemonics because they tend to
> ambiguity. There is in fact a writing system for ideographic
> scripts such as Han, and the writing system is well-developed
> even if there are locale-dependent variations in display.  But
> emoji ideographs are nowise so systematized yet, and the
> semantic value of a given ideograph is often in flux even
> within a given user population. They're a useful and fun tool
> for lots of purposes, but they're not especially good for
> Internet-scale identifiers.  Reliable interoperability often
> requires that particular features cannot be relied upon, and
> so far emojis appear to fall into that category for stable,
> Internet-scale identifiers.  That's what IDNA is designed to
> support.
> Best regards,
> A
