Consensus Call Tranche 8 Results

Wed Nov 5 02:51:12 CET 2008

It is still premature to add eszett and final sigma until we have some
accompanying text that addresses the security exploit.
The two possibilities I could think of are:

   1. Change the xn-- prefix.
   2. Recommend or require that, if the name contains either eszett or final
   sigma, client software perform every DNS lookup twice: once with the
   original string, and once with a second string in which these characters
   are remapped to "ss" and sigma respectively and which is then NFC'ed.
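
To make option 2 concrete, here is a minimal sketch, assuming Python 3 and
its standard unicodedata module. The name lookup_candidates is hypothetical,
and normalizing the original string to NFC as well is an assumption of the
sketch. A real client would still have to convert each candidate to an
A-label and decide what to do when the two lookups disagree.

```python
import unicodedata

# Remappings named in option 2: eszett -> "ss", final sigma -> sigma.
REMAP = {"\u00df": "ss", "\u03c2": "\u03c3"}


def lookup_candidates(label):
    """Return the strings a client would look up under option 2 (a sketch)."""
    # Assumption: the original string is put into NFC before lookup too.
    original = unicodedata.normalize("NFC", label)
    remapped = "".join(REMAP.get(ch, ch) for ch in original)
    remapped = unicodedata.normalize("NFC", remapped)
    if remapped == original:
        # Neither character is present, so one lookup is enough.
        return [original]
    return [original, remapped]


for candidate in lookup_candidates("gro\u00dfe"):
    print(candidate)
```

Labels that contain neither character yield a single candidate, so the extra
lookup cost would fall only on names that actually use eszett or final sigma.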

Mark

On Tue, Oct 21, 2008 at 11:38 PM, Mark Davis <mark at macchiato.com> wrote:

> About eszett, I think we are still waiting on proposed text on the security
> issue. If the security and interoperability issues can be addressed, I'm
> happy with allowing eszett, but we haven't seen any text yet, and thus
> cannot make a firm assessment.
> Mark
>
>
> On Sun, Oct 19, 2008 at 5:02 PM, Vint Cerf <vint at google.com> wrote:
>
>>  Consensus Call Tranche 8 (Character Adjustments)
>>
>> Polling results: 7 YES 8 NO (see below)
>>
>> It was hard to score the polling because many members wanted to split
>> their responses (e.g. YES for 8a, 8b and NO for 8c)
>>
>> In rough terms, the polls were equal on the YES and NO sides (occasionally
>> counting some votes YES AND NO). I am sure that one could end up with
>> different tallies depending on how one interprets the comments but I think
>> the basic point is that there was not a consensus on YES for all three of
>> the proposals or NO against all three.
>>
>> Trying to summarize, I think I detect a possible willingness to accept or
>> interest in the following:
>>
>> 1. Allow eszett (can be excluded by registry)
>> 2. map final sigma into lower case sigma per IDNA2003 (if mapping can be
>> done as an exception rule in IDNA2008???)
>> 3. allow JAMO but rely on registries to exclude if desired. The Korean
>> experts are still consulting on this.
>>
>> A key question for all three of these cases is whether there are clear and
>> likely cases of ambiguity in which, absent mandatory mapping or protocol
>> exclusion, two users might enter what they THINK are equivalent U-labels
>> and end up with DIFFERENT punycode and thus different destinations.
>>
>> I tried to capture the back and forth below by extracting comments from
>> the emails on this polling cycle and organizing them as comments about 8a
>> and 8b in one group and 8c in another. You'll have to be the judge whether
>> this effort helps or confuses the discussion even more.
>>
>>
>> Vint
>>
>>
>> (8) Specific character adjustments for IDNA2003 -> IDNA2008
>> differences.
>>
>> (8.a) Make Eszett Protocol-Valid per list discussion.
>>
>> (8.b) Make Greek final sigma Protocol-Valid per list
>> discussion.
>>
>> (8.c) Disallow conjoining Hangul jamo per recommendation from
>> KRNIC and others, permitting only precomposed syllables.
>>
>>
>> COMMENTS:
>>
>> GENERAL COMMENTS:
>>
>> What concerns me is the current discussion.
>>
>> People are arguing strongly to listen to a ccTLD registry that uses
>> Eszett regarding approving it.
>>
>> People are arguing strongly to NOT listen to a ccTLD registry that
>> uses Jamo regarding disallowing it.
>>
>> Sure, there is a difference between approving and disallowing, but,
>> why should we (as in the wg) listen more to one registry than another?
>>
>> Are we trusting one more than the other?
>>
>> Do we listen more to more active people on this mailing list than
>> parties not as active?
>>
>> --------------
>>
>> The only reason I have for responding the way I do is that I
>> understand these responses to be the most consistent with the initial
>> principles from which we start: use the Unicode properties, do
>> everything as much as possible by tables, and introduce as few
>> exceptions as are possible and practical.  The constraint on "possible
>> and practical" is "internationalize LDH" rather than other goals (such
>> as "write novels in DNS labels" or "make DNS safe for everyone, given
>> that there are a lot of visually-confusable characters" or even
>> "ensure that zone operators can't permit bad things").
>>
>> As far as I am able to tell -- but I'm not an expert in these matters,
>> and I don't really have the time to become one -- the inclusion and
>> exclusion in the respective cases are the most consistent with those
>> principles.  If we want to adopt other principles, that's ok with me
>> too.  It might change my opinion on these cases.  I don't have any
>> opinion what the outcome should be overall; I only have an opinion
>> given the overall design principles we're trying to follow.  (This is
>> the same reason I thought that, even though it's very unlikely anyone
>> will want domain names in, say, Phoenician, the design principle
>> didn't really permit us to exclude the archaic characters.)
>> -----------------------
>> Of necessity, any specific language expert(s) are going to be a (small)
>> minority of an IETF WG, which is where "rough consensus" can fail to
>> produce an outcome which incorporates the contributions of the specific
>> language expert(s).
>>
>> It happened in 2002/3, different chair(s), but it's not just the chair(s)
>> who hum. I'm still concerned that the Arabic Script meeting at ICANN
>> Paris yielded information from the Jawi user community which was
>> dismissed out of hand here. I don't think distance from Minneapolis or
>> Dublin really makes a poorer technical argument than proximity.
>>
>> It is something of an inherent defect that our "consensus" can mean
>> meeting-centricism, where "meeting" takes on values somewhere along the
>> Goonhilly-ISI arc. Our problem space is slightly larger.
>>
>> Ironically, in 200X, X << 8, it was Koreans who wanted Cherokee banned
>> (similarity to ASCII). Fortunately the more popular position did not
>> then prevail.
>>
>> ------------
>>
>>
>> ===================SPECIFIC PROPOSALS==========
>>
>> (8.a) Make Eszett Protocol-Valid per list discussion.
>>
>> (8.b) Make Greek final sigma Protocol-Valid per list
>> discussion.
>> --------------------------------
>> Note that if we compare the eszett proposal with the one from Korea,
>> the eszett is an exception, while the Korean proposal uses
>> the Unicode properties.
>> ---------------
>> While the desire for ß and ς characters is understandable, there are
>> problems with compatibility. Until they are upgraded, which will require
>> some period of time, implementations will be supporting IDNA2003 and not
>> IDNA2008. And for compatibility, for the foreseeable future, even
>> implementations that support IDNA2008 will need to also support IDNA2003.
>>
>> In most cases the differences between these are tractable, for companies
>> like my own. URL X may be valid in IDNA2003 and not IDNA2008 or vice versa,
>> but it never goes to two different locations. These two characters would
>> break that. URL X could go to two *different* locations, depending which
>> standard is being supported.
>>
>> If I send someone große.com in an email, then depending on what tools the
>> user uses to read that email, it could end up at grosse.com (a legitimate
>> site) or große.com (a spoof site). (Or, of course, große.com could be the
>> legitimate site and grosse.com the spoof site.) This represents a
>> significant security problem.
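
The divergence described here can be reproduced with a short sketch. It
assumes Python 3, whose built-in "idna" codec implements the IDNA2003
(RFC 3490) mapping, and it approximates a conversion that does no mapping by
applying the raw "punycode" codec to each label. The helper names are
hypothetical, and the second function skips every validity check that a real
IDNA2008 implementation would perform.

```python
def to_ascii_idna2003(name):
    # Python's built-in "idna" codec applies the IDNA2003 nameprep mapping.
    return name.encode("idna")


def to_ascii_unmapped(name):
    # Assumption: Punycode applied per label, with no mapping or validation.
    labels = []
    for label in name.split("."):
        if label.isascii():
            labels.append(label.encode("ascii"))
        else:
            labels.append(b"xn--" + label.encode("punycode"))
    return b".".join(labels)


old = to_ascii_idna2003("gro\u00dfe.com")
new = to_ascii_unmapped("gro\u00dfe.com")
print(old)         # the IDNA2003 result, with eszett mapped to "ss"
print(new)         # an xn-- label that keeps the eszett
print(old != new)  # the two conversions name different DNS entries
```

The same user-visible string therefore produces two different names on the
wire, depending only on which conversion the reading tool applies.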
>>
>> Sigma is fundamentally a presentation issue: it should be displayed as ς
>> if it is final. An alternative approach would be to add a SHOULD that it be
>> so displayed.
>>
>> Eszett is slightly trickier. Yet its use in German orthography is not
>> fundamentally required, as evidenced by the fact that it is not used in High
>> German within Switzerland, with no apparent ill effects on the population
>> (see, for example, http://www.nzz.ch/). And the
>> recommended usage of ss vs ß changed substantially in the latest, not-wholly
>> successful, German spelling reforms. As a percentage of words in use,
>> especially when weighted by usage, the number that are distinguished by ss
>> vs ß is vanishingly small.
>>
>> As stated in rationale-03:
>>    They [DNS 'names'] are typically derived from, or rooted in, some
>>    language because most people think in language-based ways.  But,
>>    because they are mnemonics, they need not obey the orthographic
>>    conventions of any language: it is not a requirement that it be
>>    possible for them to be "words".
>>
>>    This distinction is important because the reasonable goal of an IDN
>>    effort is not to be able to write the great Klingon (or language of
>>    one's choice) novel in DNS labels but to be able to form a usefully
>>    broad range of mnemonics in ways that are as natural as possible in a
>>    very broad range of scripts.
>>
>> Thus while recognizing the legitimate desire of people to use ß and
>> ς characters, the cost in terms of compatibility and security does not
>> appear to be worth the gain. It is thus too early for consensus on these.
>>
>> Instead, those wanting to make this change should propose some mechanisms
>> for avoiding the security problems -- only if those can be overcome in a
>> reasonable fashion could we incorporate this change, allowing ß and ς.
>> ----------
>> YES for 8.a and 8.b. Despite the transition issues
>> mentioned by Mark, the long discussion on this list has
>> shown that these are the right things to do in the long term.
>> While I'm not aware of any concrete examples of similar
>> cases, I think it would be worthwhile to check with other
>> potentially affected script/language communities.
>> What, for example, about the few final letters in Hebrew?
>>
>> -------------------
>>
>> Or the many initial and final letters in Arabic?  The answer in
>> both cases is that these are individual characters and are
>> PROTOCOL-VALID.
>>
>> [Note by another WG member:
>> I have to apologize for picking the Hebrew finals example.
>> I was on a train, guessing. The answer is that the Hebrew
>> finals are PROTOCOL-VALID. But that's not the case for
>> Arabic. In Hebrew, there are just a few final variants,
>> and they got encoded as first-class letters, and because
>> Hebrew doesn't have case, they didn't get excluded by
>> special case folding the way the Greek final sigma has.
>>
>> However, Arabic has a lot of initial/final/medial/isolated
>> glyph variants, and therefore these are context-dependent
>> and created by rendering engines, not encoded as such.
>> There are encodings of these variants in the compatibility
>> area, but they should be excluded (DISALLOW) by the fact
>> that there are compatibility mappings from them to the
>> base letters.]
>>
>>
>> What I believe got us into difficulty with
>> Eszett and Final Sigma wasn't the positioning issue or an
>> alternate shaping one but the intersection between them and the
>> case-folding rules.  Since, at least as of Unicode 3.2, neither
>> of them had upper-case forms and IDNA2003 violated the Unicode
>> Standard's advice against using case-folding to actually map
>> characters (rather than using it only in comparison but
>> retaining the original forms), the only result consistent with
>> the general IDNA2003 model was Eszett -> "ss" and Final Sigma ->
>> Medial Lower Case Sigma.
>>
>> Since neither Hebrew nor Arabic (nor any of the other scripts
>> that have position-sensitive characters) have case, they cannot
>> get into the same problem.
>>
>> Since we don't do case mapping in IDNA2008, the case folding
>> issue does not apply, regardless of what one thinks of that
>> operation and its applicability.  Without it, the only issue is
>> whether it is worth banning the characters to preserve part of
>> the IDNA2003 behavior (or making a major exception and
>> preserving the IDNA2003 mapping behavior) for the long term even
>> though it is clear that, were the decision being made for the
>> first time with the IDNA2008 rules, we would not even be asking
>> the question.
>>
>> [Note by another WG member:
>> Yes indeed. But eszett and final sigma are not the only ones
>> affected by casing. The data that deals with cases where casing
>> isn't one-to-one is http://unicode.org/Public/UNIDATA/SpecialCasing.txt.
>>
>> That includes a lot of data that may be irrelevant for us,
>> but I think it would be worthwhile to carefully examine it
>> so that we can fix everything that we need to fix.
>> The first character that comes to my mind is the lower
>> dotless I, used for Turkish and Turkic languages.]
>>
>>
>> --------------
>> If eszett and final-sigma are permitted, there must be discussions on
>> backwards compatibility and security consequences.  This brings back the
>> discussion that it may be lower over-all cost to change the xn-- prefix
>> for IDNA2008.
>>
>> I'm not yet ready to decide on 8.a and 8.b until we have discussed and
>> reviewed the backwards compatibility and security issues.  Pending such
>> text to review, I'd say NO because we do not know what the consequences
>> of making this change are today.
>> ----------------
>> On these two, I have no opinion; I don't feel sufficiently qualified
>> to say whether these individual characters should be altered.  My
>> understanding is that, because they are consistent with the tables
>> approach that we are taking, the only reason to exclude them would be
>> historical.  Since the unhappiness with some of those historical
>> decisions is part of the justification for the current work, it seems
>> to me that these ought to be allowed (although I wonder whether 8.b
>> ought to have a context rule).
>>
>> ------------
>> Could you explain why you would require a context rule for Final
>> Sigma without requiring one for Eszett?  Certainly it would be
>> easier to specify a rule for the former ("Script=Greek") while
>> the latter would presumably require either "Script=Latin"
>> (which wouldn't do much good) or an enumerated list of
>> characters.  One can't require that the character actually
>> appear in the last position in a label without preventing people
>> from constructing labels by cramming words together... any
>> prohibition along _those_ lines should certainly be a registry
>> decision, IMO.
>>
>> For the record (and context when that discussion re-emerges on
>> the list), at least some of the Greek IDN community would prefer
>> that we preserve the IDNA2003 mapping / case-folding behavior
>> for final sigma even if that is the only required mapping in
>> IDNA2008.
>> -----------------
>>
>> These should only be made PVALID *if* sufficient information
>> is added to the protocol document about the particular
>> transition and security issues involved in making them
>> so. It is far safer to leave them DISALLOWED -- and in
>> the particular case of the final sigma, to make recommendations
>> for *display* of domain names.
>> -------------------
>> FWIW I volunteer for providing text to the protocol doc with respect to
>> 8.a
>>
>> Having this text available now would influence my position on this.  I
>> think it is very complicated text to write.  To fully evaluate all
>> alternatives, I believe it also will need to compare the costs of making
>> the change against changing the xn-- prefix.
>>
>> Wrt permanence, if we change the prefix, at least all wire-encoded IDN
>> URLs would remain permanent between IDNA2003 and IDNA2008.  I'm also
>> concerned with stable URIs.  Possibly IDNA could take the position that
>> it will never make backwards incompatible changes without changing the
>> prefix: that means wire-encoded IDN URLs are permanence-safe.
>> --------------
>>
>> Thanks for the offer.  Text eagerly awaited -- I've got some
>> notes, but I'm sure what you have will be complementary and
>> better.
>>
>> More generally, those two characters have been extensively
>> discussed, both on and off-list.  In the case of Eszett (8.a)
>> the German orthographic situation is clear and the top-level
>> registries who are likely to be most affected understand the
>> transition issues (either way) and are willing to deal with
>> them.  In the Sigma case, the current registry preference is to
>> preserve the IDNA2003 mapping as part of the protocol (see
>> forthcoming note).
>>
>> [Note by another WG member:
>> Actually, [this] would be a 180 degree turn from the much earlier
>> and broader decision to eliminate all mappings from the protocol
>> and to make transformations between U-labels and A-labels fully
>> reversible without loss of information.  It is, however,
>> consistent with the position Vaggelis suggests and requests in
>> the note he posted today; I was just trying to identify that
>> preference in my note, not advocate for it.]
>>
>> -------------
>> Characterizing the eszett as an exception is correct on one level,
>> but in my view, it's only an exception because we took the wrong
>> rules for IDNA2003. And these rules are even more wrong for IDNA 2008.
>>
>> [Note by another WG member:
>> With "exception", I mean "exception" as defined in the tables document
>> of IDNA200x.]
>>
>>
>> What IDNA 2003 needed was some kind of case mapping. Unicode provided
>> two levels of case mapping: a) the simple one-to-one case mappings,
>> and b) special-casing for cases such as eszett (on top of a).
>>
>> At the time of IDNA 2003, the mood was: 1) We have to take some
>> existing tables, we can't construct our own or we'll never finish.
>> 2) Take special-casing, because that's what you would do for search,
>> and domain name lookup is essentially search.
>>
>> The problem with this is that 2) isn't exactly true. In search,
>> you get back original documents, so there are no misspelling
>> issues. For IDNs, you get back whatever you put in after case
>> folding, and so you end up with misspellings.
>>
>> So in my view, we should look at what we get when we remove
>> special-casing from our rules.
>>
>> -----------
>> I could live with 8.a and 8.b
>>
>> -------------
>> YES to 8.a and 8.b
>>
>> ------------
>>
>> I would like the final sigma to continue working as today so that
>> registrants can use lowercase domain names as they usually do in the
>> Greek language, typing the final sigma at the end of the word.
>>
>> Please accept an example for clarification for the members of our list:
>>
>> It would be best if "κύπρος" and "κύπροσ" were represented with different
>> punycode translations, since they would then be correctly represented in
>> the address bar.
>>
>> However, although upper case characters are invalid in IDNA2008, I am
>> sure that they will be accepted in the browser and translated to lower
>> case characters. In this translation, there is no upper case character
>> equivalent to the final sigma. Both final sigma and medial sigma have the
>> same uppercase (Σ).
>>
>> This brings us to the case where, if you have registered "κύπρος", you will
>> have no way to write this domain in upper case other than to misspell it as
>> "ΚΥΠΡΟς", while somebody else could have registered "κύπροσ"
>> (xn--vxakcel0d), which is "ΚΥΠΡΟΣ" in uppercase, and purposely phish your
>> clients, who rightfully think that "ΚΥΠΡΟΣ" is the correct uppercase
>> equivalent of "κύπρος".
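
The case problem in this example can be checked directly. This is a minimal
sketch, assuming Python 3 with its built-in str.upper and its "idna" codec
(which implements IDNA2003); the variable names are illustrative only.

```python
# The two spellings from the example, written with escapes to avoid ambiguity.
final_form = "\u03ba\u03cd\u03c0\u03c1\u03bf\u03c2"   # ends in final sigma
medial_form = "\u03ba\u03cd\u03c0\u03c1\u03bf\u03c3"  # ends in medial sigma

# Both sigma forms share the single capital letter U+03A3, so upper-casing
# erases the difference between the two spellings.
same_upper = final_form.upper() == medial_form.upper()

# Under the IDNA2003 mapping, both spellings also yield one A-label.
same_a_label = final_form.encode("idna") == medial_form.encode("idna")

print(same_upper, same_a_label)
```

If the two forms became distinct labels with no mapping, the second
comparison would no longer hold while the first still would, which is the
gap the phishing scenario relies on.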
>>
>> If in IDNA2008 you make final sigma and medial sigma different characters
>> but you accept both, in the Greek registry we will try to make a DNAME of
>> the two domain names and protect our registrants. I do not expect this to
>> be the case with the gTLDs or anyone else allowing registrations in Greek
>> characters.
>>
>> At present the protocol as proposed excludes the final sigma from the
>> table of characters that are valid for registration. The certain thing for
>> me, however, is that the use of the final sigma in an address bar is
>> mandatory for the representation of the Greek language and it should
>> somehow be in the protocol.
>>
>> Since we have two possible solutions, I could discuss the pros and cons
>> of each of them. My preference is for the one where the protocol
>> proactively prohibits phishing and allows for the correct translation from
>> upper case to lower case for a good user experience of the IDNs. Thus I
>> propose to maintain the IDNA2003 solution, the character mapping, in
>> IDNA2008.
>>
>>
>>
>> ================================
>>
>> (8.c) Disallow conjoining Hangul jamo per recommendation from
>> KRNIC and others, permitting only precomposed syllables.
>>
>> ===============================================
>>
>> From our Korean colleagues:
>>
>> Dear Dr. Cerf and other WG members,
>>
>> First of all, I would like to thank WG members for their comments on this
>> matter.
>>
>> I have been discussing this issue again with my government since our last
>> IETF meeting.
>>
>> Among several government bodies (Ministry of Knowledge Economy, Korea
>> Communications Commission, etc.) and government agencies (Korean Agency
>> for Technology and Standards and the National Institute of the Korean
>> Language, Korean Standards Association), there was a lively discussion on
>> the feedback from the IDNAbis IETF WG.
>>
>> The position of the Korean government is the same as before since we made
>> a decision very carefully to prevent potential harm to IDN users.
>>
>> I will try to provide a clearer explanation of this Hangul Jamo issue
>> next week. Please understand that the government process is slow.
>>
>> Thank you.
>>
>> Regards,
>> Jaeyoun Kim
>> National Internet Development Agency of Korea (NIDA)
>>
>> =================
>>
>> A YES vote would represent a significant security problem, and slow the
>> development of IDNA2008 significantly. There are two distinct issues wrapped
>> up in this tranche.
>>
>>
>> As for the conjoining Hangul characters, these are used in representing
>> non-modern Hangul characters. The committee has had a long-standing
>> consensus for *not* going character by character through each script to
>> determine which are the modern-use characters and which are not. We do not
>> need to reopen this issue.
>>
>> If this change is made, then that would force us to rethink that policy,
>> potentially bogging us down in protracted analyses of the different scripts
>> to exclude non-modern use characters, such as
>> U+01BF ( ƿ ) LATIN LETTER WYNN
>> <http://unicode.org/cldr/utility/character.jsp?a=01BF>
>> U+16B9 ( ᚹ ) RUNIC LETTER WUNJO WYNN W
>> <http://unicode.org/cldr/utility/character.jsp?a=16B9>
>> and many, many others.
>>
>> ----------------
>>
>> NO for 8.c, for the reasons explained by Mark.
>> KRNIC is free (or better, strongly recommended) to exclude
>> conjoining Hangul from what they allow to register,
>> but that should not influence our discussion too much.
>>
>> [Just as a hopefully far-fetched example, assume that
>> one day in North Korea, a few Hangul syllables containing some
>> historic Jamos gain crucial importance.]
>>
>> ----------------
>>
>> This appears to open the character-by-character decision making that
>> we already ruled out.  As Mark Davis argues, if we accept this
>> restriction then we probably need to re-open the discussions about
>> obsolete scripts, &c.  It sounds to me very like a registry policy.
>> The argument that some people will get that registry policy wrong has
>> already been floated, and we rejected it.  Indeed, if we don't reject
>> that premise, then all of the local mapping approach that we've taken
>> should be tossed out, and we should go back to strict mapping in the
>> protocol.
>> -------------
>>
>> Let me try to explain the other point of view, to the extent to
>> which I understand the issues as they have been explained to me
>> by the group associated with the Korean registry (if I have it
>> wrong, I hope they will step in directly).  I am going to try to
>> write this so as to not be inflammatory.  If I fail, I want to
>> stress that being inflammatory is not my intent and ask
>> forgiveness in advance.
>>
>> Unicode classifies characters in various ways using a collection
>> of categories and properties.  Those categories and properties
>> (or at least the vast majority of them) were designed long
>> before the IETF started thinking about IDNs; they were certainly
>> not optimized for IDNA requirements.  Given that, we should be
>> grateful and pleasantly surprised that the properties work as
>> well as they do for our purposes.  On the other hand, we should
>> not be surprised when, for some group of characters, they do
>> not... and that has nothing to do with character by character
>> decisions, at least as I understand that term.
>>
>> Before addressing the Hangul question, let me invent an example
>> that is counterfactual, i.e., barring something unforeseen, we
>> are unlikely to ever have to deal with it directly.   There is a
>> proposal pending for ISO/IEC JTC1/SC2/WG2 to add a number of
>> annotation marks for Arabic.  These marks are, according to the
>> proposal (with confirmation from independent experts) used
>> strictly for pedagogical purposes.   Obviously, if one were
>> going to transmit the instructional texts electronically in
>> other than page image form, they have to have code points.  They
>> are identified in the proposal with General Category "Sk"
>> (modifier symbols).  With that classification, the rules in
>> "Tables" would automatically place them in DISALLOWED.  But
>> suppose the proposal had identified them as modifier letters
>> instead (I'm told there is a case to be made for that, even
>> though the relevant Unicode folks have --wisely from our point
>> of view but perhaps not others-- decided otherwise).  Then we
>> would need to exclude them (the whole group, not
>> character-by-character) as a backward-compatibility issue
>> because otherwise, to quote a colleague, we would have a huge
>> mess on our hands, with all sorts of equivalences failing.
>> Again, this is _not_ an issue, but it may help in thinking about
>> the Hangul problem.
>>
>> ...
>>
>> [Message clipped]