IAB Statement on Identifiers and Unicode 7.0.0

Wed Jan 28 20:08:11 CET 2015

> The problem _is_ about whether two ways to code the same abstract character, within the same script, can be reliably compared equal with the existing technology and, i

For at least some of these cases, that is exactly the problem: they aren't the "same abstract character" unless you apply reasoning that is different from what Unicode defines (e.g., "they sure look the same to me").
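To make that concrete, here is a rough Python sketch. It assumes the pair under discussion is U+08A1 versus U+0628 U+0654 (this message doesn't name the code points, so that is my guess), and the helper name same_after_nfc is made up for illustration. It only shows what normalization does and does not treat as equal.

```python
import unicodedata

def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Unicode defines these two spellings of e-acute as canonically equivalent,
# so normalization makes them compare equal.
precomposed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"      # e + COMBINING ACUTE ACCENT
print(same_after_nfc(precomposed, decomposed))

# Assumed example pair: Unicode defines no decomposition for U+08A1, so
# normalization leaves the two sequences distinct even if they render alike.
single = "\u08a1"           # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE
print(same_after_nfc(single, sequence))
```

If the second comparison is supposed to come out equal, that takes a rule from outside Unicode normalization.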

> ...for pure identifier ones (think "IETF identifier", but the Historical Note at the end

That's where I get lost.  What does the "IETF" think a "pure identifier" is?  IMO IDN is seen as an easy way to get a "pure identifier," but it isn't really appropriate for that task.

IDN works OK for web sites because it allows a pretty broad set of words/letters/whatever to be used as mnemonics to find something.  Even IDNA2003, with smiley faces and all that, was fine for that purpose; who cares if a website calls itself ☃?  Sure, there are funny ways to spell things, and L3-G0.blogspot.com and L3-GO.blogspot.com go to different places, but it doesn't really matter.

However, I think that IDN is a terrible thing for a "pure identifier".  There are too many "confusables".  If you want an accurate "identifier", then I'm not sure how you can rely on the registrars coming up with the "right" rules to ensure confusion doesn't happen.  An identifier needs to be something that a machine can parse and put into a canonical, un-confusable form.  IMO identifiers that do things like this would need to ensure that all 4 Turkish I's map to i.  It's not gonna round-trip to a "pretty" form, though.
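A minimal sketch of what I mean by folding the Turkish I's, in Python. The table and the function name fold_identifier are hypothetical, not taken from any spec, and a real confusables table would be far larger than four entries.

```python
import unicodedata

# Hypothetical folding table: all four Turkish I's collapse to plain "i".
TURKISH_I_FOLD = {
    "\u0049": "i",  # I  LATIN CAPITAL LETTER I
    "\u0069": "i",  # i  LATIN SMALL LETTER I
    "\u0130": "i",  # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE
    "\u0131": "i",  # ı  LATIN SMALL LETTER DOTLESS I
}

def fold_identifier(s: str) -> str:
    """Reduce an identifier to a machine-comparable form (lossy on purpose)."""
    # Normalize first so that "I" + COMBINING DOT ABOVE becomes U+0130.
    s = unicodedata.normalize("NFKC", s)
    # Collapse the I variants before case folding, because case folding
    # U+0130 on its own would yield "i" followed by a combining dot.
    s = "".join(TURKISH_I_FOLD.get(ch, ch) for ch in s)
    return s.casefold()

print(fold_identifier("DİYARBAKIR"))
print(fold_identifier("Diyarbakır") == fold_identifier("DIYARBAKIR"))
```

Once folded, there is no way to tell which I the original had, which is the round-trip problem.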

Do identifiers have to be "pretty"?
Do they have to be reasonably unambiguous on the back of a napkin, or is it OK if they're only unambiguous on a machine?
Do they have to be unambiguous when printed?  In any font?

Sorry, I haven't been following any of the discussions, so I really don't know what's expected of these identifiers.

-Shawn