emoji (was Re: I-D Action: draft-klensin-idna-rfc5891bis-00.txt)
Shawn.Steele at microsoft.com
Thu Mar 16 22:05:50 CET 2017
A few of us were thinking maybe identifiers almost need (might be too strong a word) two layers... One to provide a fairly strict version of providing "identifiers" in a way that tries to reduce confusion, and another layer that provides "friendly labels" that help people get to those identifiers in a way that makes them feel good.
Most (AFAICT) of the more interesting uses of IDN already resolve to a name that would fit in that "stricter identifier" bucket, so, in practice, we kinda already have two layers: the "marketing thought it would be good if this linked to us" label, which goes to the "this is the label that doesn't scare the IT department" one.
Regardless of where you draw the line on which characters are appropriate, this is already happening somewhat naturally, especially when it's "easy". E.g., an umlauted domain resolving to a pure-ASCII variant spelling. Of course that's tougher for other languages, but the same idea seems to happen a lot.
From: John C Klensin [mailto:klensin at jck.com]
Sent: Thursday, March 16, 2017 1:16 PM
To: Shawn Steele <Shawn.Steele at microsoft.com>; Patrik Fältström <paf at frobbit.se>
Cc: idna-update at alvestrand.no; Andrew Sullivan <ajs at anvilwalrusden.com>
Subject: RE: emoji (was Re: I-D Action: draft-klensin-idna-rfc5891bis-00.txt)
Just to respond to this one issue...
--On Monday, March 13, 2017 06:57 +0000 Shawn Steele <Shawn.Steele at microsoft.com> wrote:
> I did not ignore the Unicode categories of Emoji. I indicated that
>despite their classification, Unicode (not me) has been including new
>emoji characters in their updated tables. The
> 2003 emoji did not surprise me, but I was surprised that they were
First, while I'm not sure how much difference it makes in general, the smiling faces, etc., of earlier years were "emoticons" (or just typographic symbols) and not emoji, which are a newer invention and addition to Unicode. While there might have been other issues (and probably were), had the Unicode Consortium been convinced that emoji were a new script and kind of letter, they could easily have defined such a script and, I believe, even a new "Letter" property (or assigned them to "Lo") without any damage to stability rules or other important principles. Worst case, they could have coded new forms of the emoticons into the emoji script, done something appropriate with NFKC if they thought that was necessary, and moved on.
They didn't. Which brings me to...
Second, while UTR#46 allows emoji in Unicode Domain Names, UAX#31 (Identifier and Pattern Syntax) does not allow them in Unicode-recommended identifiers. That creates an interesting situation. Certainly we looked at UAX#31 in designing IDNA2008.
While the results, in terms of what was and was not considered acceptable for an identifier, were not identical -- the DNS has some special needs and constraints, which are the reason the rules of IDNA2008 are not identical to the PRECIS recommendations about more general-purpose identifiers either -- major areas of difference (i.e., beyond some special considerations and edge cases) between IDNA2008 and UAX#31 should be surprising to all concerned. As far as I know, there are, those cases and a difference in style aside, no significant differences between UAX#31 and IDNA2008. (The difference in style is that UAX#31 defines equivalence rules for, e.g., case and normalization, while IDNA2008, in part in order to assure that labels could be converted from U-label to A-label form and back without loss of information, avoids that by imposing restrictions on its inputs.)
However, the introduction of emoji in UTR#46, if it is used instead of IDNA2008, creates the interesting situation in which strings that are valid for use in domain name labels are not valid as Unicode-recommended identifiers. I trust you can see why I think that is a problem even if we need to agree to disagree about everything else.
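[Illustration, not part of the original message: a minimal sketch of the mismatch described above. It assumes Python's str.isidentifier(), whose identifier grammar is based on the UAX#31 XID_Start/XID_Continue properties, as a rough stand-in for "Unicode-recommended identifier"; it does not implement UTR#46 or IDNA2008. The helper name describe() is hypothetical.]

```python
import unicodedata


def describe(label):
    """Return (General_Category of each code point, identifier verdict).

    The verdict uses Python's identifier rules, which are derived from
    UAX#31 (XID_Start / XID_Continue). This is an approximation chosen
    for illustration, not a UTR#46 or IDNA2008 validity check.
    """
    categories = [unicodedata.category(ch) for ch in label]
    return categories, label.isidentifier()


if __name__ == "__main__":
    # An umlauted letter label versus a single emoji (U+1F600).
    for label in ("b\u00fccher", "\U0001F600"):
        cats, ok = describe(label)
        print(ascii(label), cats, ok)
```

[The emoji is reported with General_Category "So" (Symbol, other) rather than a Letter category, and it fails the identifier check, while the umlauted label passes.]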