emoji (was Re: I-D Action: draft-klensin-idna-rfc5891bis-00.txt)
Shawn.Steele at microsoft.com
Sun Mar 12 23:53:03 CET 2017
I'm not certain how half-century old design restrictions necessarily apply to modern Unicode in IDN. I totally get how certain ASCII characters already had special meanings and would be problematic in host names, and how adding ASCII symbols even now would cause some trouble.
Of course, none of Unicode was allowed either, because it wasn't invented yet. A concern for backwards compatibility led to the hack for converting Unicode into Punycode to tunnel into the ASCII space that was available for the legacy systems. That mitigates some of the problems for some of the obsolete components still in service.
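To make the tunnelling concrete, here is a minimal sketch using Python's standard codecs. The label "bücher" is an illustrative choice, not something from this thread. Python's built-in "idna" codec follows the older IDNA2003 rules, so it only approximates what a current implementation would do.

```python
# Sketch: how a Unicode label is tunnelled into the ASCII space.
label = "bücher"  # illustrative label (an assumption, not from the thread)

# Raw Punycode (RFC 3492): ASCII characters first, then an encoded suffix.
raw = label.encode("punycode")

# The "idna" codec adds the ACE prefix "xn--" (IDNA2003 rules in Python's stdlib).
ace = label.encode("idna")

print(raw)                 # b'bcher-kva'
print(ace)                 # b'xn--bcher-kva'
print(ace.decode("idna"))  # bücher
```

A legacy resolver only ever sees the "xn--" form, which is an ordinary letter-digit-hyphen string.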
There was no technical reason that emoji didn't work in IDNA2003. Removing them in IDNA2008 because of their Unicode categories did not help the users interested in those characters. Indeed (much to my surprise) Unicode has been able to extend this set of characters when publishing the newer IDN tables, despite their categorization.
The fact that pretty much all of the browsers enable stuff that IDNA2008 explicitly forbids would tend to indicate that some decisions were indeed wrong. And apparently they need to be corrected.
To be clear, the space I "own," Microsoft Windows, Edge & IE IDN behavior, was late to adopt the newer emoji code points - those post IDNA2008, that our competitors adopted. When alerted to that (by customer bug reports), we did treat it as a bug and I had no qualms updating to snap to the newer tables, but I was completely taken by surprise and did not drive the effort to extend support to new emoji.
I would much rather see the IETF's IDN effort work to support the behavior that users apparently want, rather than attempting to push restrictions that the industry at large has already chosen to ignore.
And, as you noted, the permissibility of things like Emoji is completely orthogonal to your other point, which is that registrars should be taking steps to prevent registration of characters that can be disruptive or abused.
From: Idna-update [mailto:idna-update-bounces at alvestrand.no] On Behalf Of John C Klensin
Sent: Sunday, March 12, 2017 1:59 PM
To: Andrew Sullivan <ajs at anvilwalrusden.com>
Cc: idna-update at alvestrand.no
Subject: Re: emoji (was Re: I-D Action: draft-klensin-idna-rfc5891bis-00.txt)
I agree with Andrew, but let me give an additional reason or two. First, it is really important to understand that restrictions on the characters that can be used in domain names are not only not a new idea with IDNA, they predate the DNS itself. The so-called Letter-Digit-Hyphen rule, which excluded all symbols and special characters other than hyphen, goes back to the earliest days of the ARPANET host table. Symbols were prohibited for three reasons. Those were, IIR, (in declining order of importance), because they (and different of them) were often used as delimiters or special indicators in the wide range of operating systems of the day, there were ambiguities about what they were called and hence how to tell someone what a host name was over the phone or equivalent, and they didn't appear in ordinary words.
Hyphen (or "minus sign", or "dash" -- "hyphen-minus" didn't come along as a term until later and I suspect many ordinary users would still give one a strange look if it were used in conversation) was allowed because it _did_ appear in ordinary words and because some sort of intra-name break was needed. "Confusion" was considered in only one context, which was the relationship between "-" and "_" with a clear conclusion that only one of the two should be allowed because of distinctions in handwriting (if nothing else). And hyphen was chosen from the two because of that ordinary word issue and because there was a type style ambiguity about how the character coded as "_" was rendered, with a history of using a back arrow and considering the two equivalent.
There were also some rules that we would describe as "contextual restrictions" today: no leading or trailing hyphens (delimiter problems), no doubled hyphens (too hard to distinguish in handwritten text and some type styles), and no leading digits (potential for confusion with addresses, i.e., IP addresses and their predecessors). The leading digit restriction was removed in RFC 1123 and the special rules about hyphens never got written down, which was a good thing because, if they had been, we wouldn't have been able to pull the "xn--" trick and other ACE variations.
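A rough sketch of the LDH rule as it stands after RFC 1123 may help here: letters, digits, and hyphen only, no leading or trailing hyphen, leading digits allowed. The function names are hypothetical, and the 63-octet limit is the general DNS label limit rather than something stated above. Doubled hyphens are deliberately accepted, which is what leaves room for the "xn--" prefix.

```python
import re

# One LDH label: ASCII letters, digits, hyphen; no hyphen at either end;
# 1 to 63 characters (the DNS label length limit). Leading digits are
# allowed, per the RFC 1123 relaxation. Doubled hyphens are NOT rejected.
_LDH_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_ldh_label(label: str) -> bool:
    """Return True if `label` satisfies the (simplified) LDH rule."""
    return _LDH_LABEL.fullmatch(label) is not None


def is_ace_label(label: str) -> bool:
    """Return True if `label` is LDH and carries the "xn--" ACE prefix."""
    return is_ldh_label(label) and label.lower().startswith("xn--")


for candidate in ["example", "3com", "-foo", "foo-", "foo_bar", "xn--bcher-kva"]:
    print(candidate, is_ldh_label(candidate), is_ace_label(candidate))
```

Had a "no doubled hyphens" rule been written into the syntax, `is_ldh_label("xn--bcher-kva")` would have to be false and the ACE prefix could not have been carved out of the existing name space.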
Those "host name" rules were incorporated into the DNS as the "preferred syntax", not only because they were used for names of hosts but because the restriction had been incorporated into many application protocols, notably SMTP and its predecessors.
While not documented, again IIR, until RFC 1123 and then 1591, and with some pieces not being documented at all because Jon knew what he would allow, there were special rules for the root -- no hyphens, no digits (again, concern about IP addresses), and a 2-3-4/5 rule about lengths. ICANN discarded the latter because "new TLD" applicants around 2000-2002 really wanted longer strings (perhaps the first instance of "can't say 'no'"); if it had remained in place, we wouldn't be having issues about "special names" and who gets to allocate them today.
So this story is very old news. Perhaps the decisions were wrong, but it is a little late to attack them now (see forthcoming note). As Andrew points out, the guiding principle in IDNA2008 was to extend the principles underlying the LDH rule to IDNs, not to impose new restrictions out of arrogance or anything else. And, again, none of that was about "confusion".
Now, to accomplish that "letters and digits" principle for phonetic scripts and make sure that CJK worked out, the IDNA rules used the available General Categories. The emoji, when they came along, got the same General Category that the earlier emoticons did -- "So". Not an IETF decision or some special way to discriminate against them, but Unicode assignment of that category, which results in their exclusion from IDNA (independent of any issues about compatibility of Unicode 7, 8, or 9 with IDNA). GeneralCategory=So implies DISALLOWED.
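The category argument can be sketched with Python's unicodedata module. This is a deliberate simplification: the real IDNA2008 derivation in RFC 5892 also has exception lists, contextual rules, and stability checks (for example, uppercase letters and the hyphen are handled by other rules), so the function name and the "candidate" label below are illustrative assumptions only.

```python
import unicodedata

# General Categories that RFC 5892 groups as "LetterDigits". Passing this
# test only makes a code point a *candidate*; other rules still apply.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Lm", "Mn", "Mc", "Nd"}


def rough_idna2008_status(ch: str) -> str:
    """Very rough, category-only approximation of the IDNA2008 outcome."""
    category = unicodedata.category(ch)
    return "candidate" if category in LETTER_DIGIT_CATEGORIES else "DISALLOWED"


for ch in ["a", "é", "7", "\u2603", "\U0001F600"]:
    # Print the code point, its General Category, and the rough status.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), rough_idna2008_status(ch))
```

The snowman (U+2603) and the grinning face (U+1F600) both report "So" and fall out on category alone, with no emoji-specific rule involved.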
There are some other issues with emoji, issues that relate to non-standardization of representations; the implications of assorted combining forms which, by the way, don't normalize; questions about what to do about screen readers and voice input so as to have things be even mostly unambiguous in a rapidly changing environment; and the other issues Andrew mentions.
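The "don't normalize" point can be seen in a small sketch. The sample strings are illustrative choices: a decomposed "é" and the conjoining Hangul Jamo that Andrew mentions both compose under NFC, while an emoji with a skin-tone modifier and a zero-width-joiner sequence pass through unchanged.

```python
import unicodedata


def nfc(s: str) -> str:
    """Normalize to NFC using the standard library."""
    return unicodedata.normalize("NFC", s)


samples = {
    "decomposed e-acute": "e\u0301",                  # e + combining acute
    "conjoining Hangul Jamo": "\u1112\u1161\u11AB",   # three jamo for one syllable
    "thumbs up + skin tone": "\U0001F44D\U0001F3FD",  # base emoji + modifier
    "family ZWJ sequence": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
}

for name, text in samples.items():
    composed = nfc(text)
    # Compare code point counts before and after normalization.
    print(f"{name}: {len(text)} -> {len(composed)} code points")
```

The first two samples collapse to a single code point each; the emoji sequences keep every code point, so normalization does nothing to unify the different ways they might be entered.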
Those issues might have been sufficient to argue for banning emoji from IDNs even if, e.g., they had been assigned some "Letter"-like General Category on the theory that they represent the words of a new language, but the "So" decision --again, a Unicode decision, not an IETF one-- preempted those discussions.
--On Sunday, March 12, 2017 9:36 AM -0400 Andrew Sullivan <ajs at anvilwalrusden.com> wrote:
> On Sun, Mar 12, 2017 at 03:23:41AM +0000, Shawn Steele wrote:
>> And then when we get to emoji, well it's pretty hard to mix up a cat
>> with a human with the word "cat", so, whether they seem "serious" or
>> not, it seems to me to be pretty harsh to just get rid of the whole
>> set when people clearly want to use them. Sure, they're silly, but
>> sometimes people aren't in businesses that deal entirely in
>> life-and-death matters.
> The reason that IDNA2008 doesn't permit emoji is not because the WG
> thought that they were more or less serious. The reason that emoji
> are DISALLOWED is at bottom roughly the same reason that the
> conjoining Hangul Jamo are DISALLOWED. The approach that IDNA2008
> took was that the DNS did not need to permit any label anyone might
> want, but instead needed to permit effective mnemonics that people
> could use reliably. In effect, the goal was to "internationalize
> LDH". LDH does not permit lots of labels that would be useful to
> people. It doesn't permit apostrophes or spaces, for instance -- both
> things which have turned out to be important for DNS-SD and which also
> cause trouble for LDH-restricted zones. Hangul Jamo was like that:
> the precomposed characters were better for these purposes.
> Emojis are poor choices for mnemonics because they tend to ambiguity.
> There is in fact a writing system for ideographic scripts such as Han,
> and the writing system is well-developed even if there are
> locale-dependent variations in display. But emoji ideographs are
> nowise so systematized yet, and the semantic value of a given
> ideograph is often in flux even within a given user population.
> They're a useful and fun tool for lots of purposes, but they're not
> especially good for Internet-scale identifiers. Reliable
> interoperability often requires that particular features cannot be
> relied upon, and so far emojis appear to fall into that category for
> stable, Internet-scale identifiers. That's what IDNA is designed to
> Best regards,