Consensus Call Tranche 8 Results

Wed Oct 22 09:38:31 CEST 2008

About eszett, I think we are still waiting on proposed text on the security
issue. If the security and interoperability issues can be addressed, I'm
happy with allowing eszett, but we haven't seen any text yet, and thus
cannot make a firm assessment.
Mark

On Sun, Oct 19, 2008 at 5:02 PM, Vint Cerf <vint at google.com> wrote:

> Consensus Call Tranche 8 (Character Adjustments)
>
> Polling results: 7 YES 8 NO (see below)
>
> It was hard to score the polling because many members wanted to split their
> responses (e.g. YES for 8a, 8b and NO for 8c)
>
> In rough terms, the polls were equal on the YES and NO sides (occasionally
> counting some votes YES AND NO). I am sure that one could end up with
> different tallies depending on how one interprets the comments but I think
> the basic point is that there was not a consensus on YES for all three of
> the proposals or NO against all three.
>
> Trying to summarize, I think I detect a possible willingness to accept or
> interest in the following:
>
> 1. Allow eszett (can be excluded by registry)
> 2. map final sigma into lower case sigma per IDNA2003 (if mapping can be
> done as an exception rule in IDNA2008???)
> 3. allow JAMO but rely on registries to exclude if desired. The Korean
> experts are still consulting on this.
>
> A key question for all three of these cases is whether there are clear and
> likely cases of ambiguity in which, absent mandatory mapping or protocol
> exclusion, two users might enter what they THINK are equivalent U-labels
> and end up with DIFFERENT punycode and thus different destinations.
>
> I tried to capture the back and forth below by extracting comments from the
> emails on this polling cycle and organizing them as comments about 8a and 8b
> in one group and 8c in another. You'll have to be the judge whether this
> effort helps or confuses the discussion even more.
>
>
> Vint
>
>
> (8) Specific character adjustments for IDNA2003 -> IDNA2008
> differences.
>
> (8.a) Make Eszett Protocol-Valid per list discussion.
>
> (8.b) Make Greek final sigma Protocol-Valid per list
> discussion.
>
> (8.c) Disallow conjoining Hangul jamo per recommendation from
> KRNIC and others, permitting only precomposed syllables.
>
>
> COMMENTS:
>
> GENERAL COMMENTS:
>
> What concerns me is the current discussion.
>
> People are arguing strongly to listen to a ccTLD registry that uses
> Eszett regarding approving it.
>
> People are arguing strongly to NOT listen to a ccTLD registry that
> uses Jamo regarding disallowing it.
>
> Sure, there is a difference between approving and disallowing, but,
> why should we (as in the wg) listen more to one registry than another?
>
> Are we trusting one more than the other?
>
> Do we listen more to more active people on this mailing list than
> parties not as active?
>
> --------------
>
> The only reason I have for responding the way I do is that I
> understand these responses to be the most consistent with the initial
> principles from which we start: use the Unicode properties, do
> everything as much as possible by tables, and introduce as few
> exceptions as are possible and practical.  The constraint on "possible
> and practical" is "internationalize LDH" rather than other goals (such
> as "write novels in DNS labels" or "make DNS safe for everyone, given
> that there are a lot of visually-confusable characters" or even
> "ensure that zone operators can't permit bad things").
>
> As far as I am able to tell -- but I'm not an expert in these matters,
> and I don't really have the time to become one -- the inclusion and
> exclusion in the respective cases are the most consistent with those
> principles.  If we want to adopt other principles, that's ok with me
> too.  It might change my opinion on these cases.  I don't have any
> opinion what the outcome should be overall; I only have an opinion
> given the overall design principles we're trying to follow.  (This is
> the same reason I thought that, even though it's very unlikely anyone
> will want domain names in, say, Phoenician, the design principle
> didn't really permit us to exclude the archaic characters.)
> -----------------------
> Of necessity, any specific language expert(s) are going to be a (small)
> minority of an IETF WG, which is where "rough consensus" can fail to
> produce an outcome which incorporates the contributions of the specific
> language expert(s).
>
> It happened in 2002/3, different chair(s), but it's not just the chair(s)
> who hum. I'm still concerned that the Arabic Script meeting at ICANN
> Paris yielded information from the Jawi user community which was
> dismissed out of hand here. I don't think distance from Minneapolis or
> Dublin really makes a poorer technical argument than proximity.
>
> It is something of an inherent defect that our "consensus" can mean
> meeting-centricism, where "meeting" takes on values somewhere along the
> Goonhilly-ISI arc. Our problem space is slightly larger.
>
> Ironically, in 200X, X << 8, it was Koreans who wanted Cherokee banned
> (similarity to ASCII). Fortunately the more popular position did not
> then prevail.
>
> ------------
>
>
> ===================SPECIFIC PROPOSALS==========
>
> (8.a) Make Eszett Protocol-Valid per list discussion.
>
> (8.b) Make Greek final sigma Protocol-Valid per list
> discussion.
> --------------------------------
> Note that if we look at the eszett proposal and the one from Korea,
> the eszett is an exception, while the Korean proposal uses
> the Unicode properties.
> ---------------
> While the desire for ß and ς characters is understandable, there are
> problems with compatibility. Until they are upgraded, which will require
> some period of time, implementations will be supporting IDNA2003 and not
> IDNA2008. And for compatibility, for the foreseeable future, even
> implementations that support IDNA2008 will need to also support IDNA2003.
>
> In most cases the differences between these are tractable, for companies
> like my own. URL X may be valid in IDNA2003 and not IDNA2008 or vice versa,
> but it never goes to two different locations. These two characters would
> break that. URL X could go to two *different* locations, depending which
> standard is being supported.
>
> If I send someone große.com in an email, then depending on what tools the
> user uses to read that email, it could end up at grosse.com (a legitimate
> site) or große.com (a spoof site). (Or, of course, große.com could be the
> legitimate site and grosse.com the spoof site.) This represents a
> significant security problem.
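
A minimal Python sketch of the divergence described above. It assumes that Python's built-in "idna" codec (which follows the IDNA2003 ToASCII operation, including the Nameprep mapping) stands in for an IDNA2003 client, and that a client treating ß as Protocol-Valid would Punycode-encode the label exactly as typed. The helper names are hypothetical and chosen only for illustration.

```python
# Sketch only: compares an IDNA2003-style lookup with an unmapped lookup.

def to_alabel_idna2003(label: str) -> bytes:
    # Python's built-in "idna" codec applies Nameprep, which maps ß to "ss".
    return label.encode("idna")

def to_alabel_unmapped(label: str) -> bytes:
    # Hypothetical stand-in for a client that keeps ß as a valid character:
    # Punycode-encode the label as typed and prepend the ACE prefix.
    return b"xn--" + label.encode("punycode")

label = "gro\u00dfe"                     # "große"
old_style = to_alabel_idna2003(label)    # plain ASCII "grosse"
new_style = to_alabel_unmapped(label)    # an xn-- label that still encodes ß

# The same typed label yields two different names on the wire.
print(old_style, new_style, old_style != new_style)
```

The point of the sketch is only that the two procedures disagree for this one input; it does not model any registry policy or bundling that might mitigate the difference.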
>
> Sigma is fundamentally a presentation issue: it should be displayed as ς
> if it is final. An alternative approach would be to add a SHOULD that it be
> so displayed.
>
> Eszett is slightly trickier. Yet its use in German orthography is not
> fundamentally required, as evidenced by the fact that it is not used in High
> German within Switzerland, with no apparent ill effects on the population
> (see, for example, http://www.nzz.ch/). And the recommended usage of ss vs ß
> changed substantially in the latest, not-wholly successful, German spelling
> reforms. As a percentage of words in use, especially when weighted by usage,
> the number that are distinguished by ss vs ß is vanishingly small.
>
> As stated in rationale-03:
>    They [DNS 'names'] are typically derived from, or rooted in, some
>    language because most people think in language-based ways.  But,
>    because they are mnemonics, they need not obey the orthographic
>    conventions of any language: it is not a requirement that it be
>    possible for them to be "words".
>
>    This distinction is important because the reasonable goal of an IDN
>    effort is not to be able to write the great Klingon (or language of
>    one's choice) novel in DNS labels but to be able to form a usefully
>    broad range of mnemonics in ways that are as natural as possible in a
>    very broad range of scripts.
>
> Thus while recognizing the legitimate desire of people to use ß and
> ς characters, the cost in terms of compatibility and security does not
> appear to be worth the gain. It is thus too early for consensus on these.
>
> Instead, those wanting to make this change should propose some mechanisms
> for avoiding the security problems -- only if those can be overcome in a
> reasonable fashion could we incorporate this change, allowing ß and ς.
> ----------
> YES for 8.a and 8.b. Despite the transition issues
> mentioned by Mark, the long discussion on this list has
> shown that these are the right things to do in the long term.
> While I'm not aware of any concrete examples of similar
> cases, I think it would be worthwhile to check with other
> potentially affected script/language communities.
> What, for example, about the few final letters in Hebrew?
>
> -------------------
>
> Or the many initial and final letters in Arabic?  The answer in
> both cases is that these are individual characters and are
> PROTOCOL-VALID.
>
> [Note by another WG member:
> I have to apologize for picking the Hebrew finals example.
> I was on a train, guessing. The answer is that the Hebrew
> finals are PROTOCOL-VALID. But that's not the case for
> Arabic. In Hebrew, there are just a few final variants,
> and they got encoded as first-class letters, and because
> Hebrew doesn't have case, they didn't get excluded by
> special case folding the way the Greek final sigma has.
>
> However, Arabic has a lot of initial/final/medial/isolated
> glyph variants, and therefore these are context-dependent
> and created by rendering engines, not encoded as such.
> There are encodings of these variants in the compatibility
> area, but they should be excluded (DISALLOW) by the fact
> that there are compatibility mappings from them to the
> base letters.]
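
The distinction drawn in this note can be checked with Python's unicodedata module. This is a sketch under the assumption that "has a compatibility mapping" can be approximated by "is changed by NFKC normalization"; the function name is hypothetical, and only one sample character per script is shown.

```python
import unicodedata

HEBREW_FINAL_KAF = "\u05da"    # encoded as a letter in its own right
ARABIC_ALEF_FINAL = "\ufe8e"   # presentation form from the compatibility area
ARABIC_ALEF = "\u0627"         # the base letter

def has_compat_mapping(ch: str) -> bool:
    # True when NFKC normalization replaces the character with something else.
    return unicodedata.normalize("NFKC", ch) != ch

print(has_compat_mapping(HEBREW_FINAL_KAF))    # False: stays as encoded
print(has_compat_mapping(ARABIC_ALEF_FINAL))   # True: maps to the base letter
```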
>
>
> What I believe got us into difficulty with
> Eszett and Final Sigma wasn't the positioning issue or an
> alternate shaping one but the intersection between them and the
> case-folding rules.  Since, at least as of Unicode 3.2, neither
> of them had upper-case forms and IDNA2003 violated the Unicode
> Standard's advice against using case-folding to actually map
> characters (rather than using it only in comparison but
> retaining the original forms), the only result consistent with
> the general IDNA2003 model was Eszett -> "ss" and Final Sigma ->
> Medial Lower Case Sigma.
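
The mapping described here can be observed with Python's standard library, offered as an illustration rather than as the normative IDNA2003 tables. The sketch assumes that str.casefold() (Unicode full case folding) and stringprep.map_table_b2 (the RFC 3454 table B.2 mapping used by Nameprep) are acceptable stand-ins for the behavior under discussion.

```python
import stringprep

ESZETT = "\u00df"        # ß LATIN SMALL LETTER SHARP S
FINAL_SIGMA = "\u03c2"   # ς GREEK SMALL LETTER FINAL SIGMA
SIGMA = "\u03c3"         # σ GREEK SMALL LETTER SIGMA

# Unicode full case folding, which is intended for comparison.
folded_eszett = ESZETT.casefold()
folded_sigma = FINAL_SIGMA.casefold()

# RFC 3454 table B.2, which IDNA2003 applied as an actual mapping.
mapped_eszett = stringprep.map_table_b2(ESZETT)
mapped_sigma = stringprep.map_table_b2(FINAL_SIGMA)

# Plain lowercasing leaves both characters as they are.
unchanged = ESZETT.lower() == ESZETT and FINAL_SIGMA.lower() == FINAL_SIGMA

print(folded_eszett, folded_sigma, mapped_eszett, mapped_sigma, unchanged)
```

Both characters are already lower case, so it is the folding step, not lowercasing, that replaces them.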
>
> Since neither Hebrew nor Arabic (nor any of the other scripts
> that have position-sensitive characters) have case, they cannot
> get into the same problem.
>
> Since we don't do case mapping in IDNA2008, the case folding
> issue does not apply, regardless of what one thinks of that
> operation and its applicability.  Without it, the only issue is
> whether it is worth banning the characters to preserve part of
> the IDNA2003 behavior (or making a major exception and
> preserving the IDNA2003 mapping behavior) for the long term even
> though it is clear that, were the decision being made for the
> first time with the IDNA2008 rules, we would not even be asking
> the question.
>
> [Note by another WG member:
> Yes indeed. But eszett and final sigma are not the only ones
> affected by casing. The data that deals with cases where casing
> isn't one-to-one is http://unicode.org/Public/UNIDATA/SpecialCasing.txt.
>
> That includes a lot of data that may be irrelevant for us,
> but I think it would be worthwhile to carefully examine it
> so that we can fix everything that we need to fix.
> The first character that comes to my mind is the lower
> dotless I, used for Turkish and Turkic languages.]
>
>
> --------------
> If eszett and final-sigma are permitted, there must be discussions on
> backwards compatibility and security consequences.  This brings back the
> discussion that it may be lower over-all cost to change the xn-- prefix
> for IDNA2008.
>
> I'm not yet ready to decide on 8.a and 8.b until we have discussed and
> reviewed the backwards compatibility and security issues.  Pending such
> text to review, I'd say NO because we do not know what the consequences
> of making this change are today.
> ----------------
>  On these
> two, I have no opinion; I don't feel sufficiently qualified to say
> whether these individual characters should be altered.  My
> understanding is that, because they are consistent with the tables
> approach that we are taking, the only reason to exclude them would be
> historical.  Since the unhappiness with some of those historical
> decisions is part of the justification for the current work, it seems
> to me that these ought to be allowed (although I wonder whether 8.b
> ought to have a context rule).
>
> ------------
> Could you explain why you would require a context rule for Final
> Sigma without requiring one for Eszett?  Certainly it would be
> easier to specify a rule for the former ("Script=Greek") while
> the latter would presumably require either "Script=Latin"
> (which wouldn't do much good) or an enumerated list of
> characters.  One can't require that the character actually
> appear in the last position in a label without preventing people
> from constructing labels by cramming words together... any
> prohibition along _those_ lines should certainly be a registry
> decision, IMO.
>
> For the record (and context when that discussion re-emerges on
> the list), at least some of the Greek IDN community would prefer
> that we preserve the IDNA2003 mapping / case-folding behavior
> for final sigma even if that is the only required mapping in
> IDNA2008.
> -----------------
>
> These should only be made PVALID *if* sufficient information
> is added to the protocol document about the particular
> transition and security issues involved in making them
> so. It is far safer to leave them DISALLOWED -- and in
> the particular case of the final sigma, to make recommendations
> for *display* of domain names.
> -------------------
> FWIW I volunteer for providing text to the protocol doc with respect to
> 8.a
>
> Having this text available now would influence my position on this.  I
> think it is very complicated text to write.  To fully evaluate all
> alternatives, I believe it also will need to compare the costs of making
> the change against changing the xn-- prefix.
>
> Wrt permanence, if we change the prefix, at least all wire-encoded IDN
> URLs would remain permanent between IDNA2003 and IDNA2008.  I'm also
> concerned with stable URIs.  Possibly IDNA could take the position that
> it will never make backwards incompatible changes without changing the
> prefix: that means wire-encoded IDN URLs are permanence-safe.
> --------------
>
> Thanks for the offer.  Text eagerly awaited -- I've got some
> notes, but I'm sure what you have will be complementary and
> better.
>
> More generally, those two characters have been extensively
> discussed, both on and off-list.  In the case of Eszett (8.a)
> the German orthographic situation is clear and the top-level
> registries who are likely to be most affected understand the
> transition issues (either way) and are willing to deal with
> them.  In the Sigma case, the current registry preference is to
> preserve the IDNA2003 mapping as part of the protocol (see
> forthcoming note).
>
> [Note by another WG member:
> Actually, [this] would be a 180 degree turn from the much earlier
> and broader decision to eliminate all mappings from the protocol
> and to make transformations between U-labels and A-labels fully
> reversible without loss of information.  It is, however,
> consistent with the position Vaggelis suggests and requests in
> the note he posted today; I was just trying to identify that
> preference in my note, not advocate for it.]
>
> -------------
> Characterizing the eszett as an exception is correct on one level,
> but in my view, it's only an exception because we took the wrong
> rules for IDNA2003. And these rules are even more wrong for IDNA 2008.
>
> [Note by another WG member:
> With "exception", I mean "exception" as defined in the tables document
> of IDNA200x.]
>
>
> What IDNA 2003 needed was some kind of case mapping. Unicode provided
> two levels of case mapping: a) the simple one-to-one case mappings,
> and b) special-casing for cases such as eszett (on top of a).
>
> At the time of IDNA 2003, the mood was: 1) We have to take some
> existing tables, we can't construct our own or we'll never finish.
> 2) Take special-casing, because that's what you would do for search,
> and domain name lookup is essentially search.
>
> The problem with this is that 2) isn't exactly true. In search,
> you get back original documents, so there are no misspelling
> issues. For IDNs, you get back whatever you put in after case
> folding, and so you end up with misspellings.
>
> So in my view, we should look at what we get when we remove
> special-casing from our rules.
>
> -----------
> I could live with 8.a and 8.b
>
> -------------
> YES to 8.a and 8.b
>
> ------------
>
> I would like the final sigma to continue working as today so that
> registrants can use lowercase domain names as they usually do in the Greek
> language, typing the final sigma at the end of the word.
>
> Please accept an example for clarification reasons for the members of our
> list:
>
> It would be best if "κύπρος" and "κύπροσ" were represented with different
> punycode translation since they would be correctly represented in the
> address bar.
>
> However, although in IDNA2008 the upper case characters are invalid I am
> sure that they will be accepted in the browser and translated to small case
> characters. In this translation case, there is no upper case character
> equivalent to the final sigma. Both final sigma and medial sigma have the
> same uppercase (Σ).
>
> This brings us to the case where if you have registered "κύπρος", you will
> have no way to write this domain in upper case, other than misspell it to
> "ΚΥΠΡΟς" while somebody else could have registered "κύπροσ" (xn--vxakcel0d)
> - "ΚΥΠΡΟΣ" in uppercase and on purpose phish for your clients who
> rightfully think that "ΚΥΠΡΟΣ" is the correct uppercase equivalent for
> "κύπρος".
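
The collision described in this paragraph can be sketched in Python. The sketch assumes that Python's str.upper() and its built-in "idna" codec (IDNA2003 behavior) are reasonable stand-ins for what a browser would do; the variable names are hypothetical.

```python
STEM = "\u03ba\u03cd\u03c0\u03c1\u03bf"   # "κύπρο"
registered = STEM + "\u03c2"              # ends in final sigma (ς)
lookalike = STEM + "\u03c3"               # ends in medial sigma (σ)

# Without a mapping these are two distinct labels with distinct encodings...
distinct_raw = registered.encode("punycode") != lookalike.encode("punycode")

# ...yet they are indistinguishable once written in upper case.
same_upper = registered.upper() == lookalike.upper()

# Under the IDNA2003 mapping both spellings resolve to the same A-label.
same_idna2003 = registered.encode("idna") == lookalike.encode("idna")

print(distinct_raw, same_upper, same_idna2003)
```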
>
> If in IDNA2008 you make final sigma and medial sigma different characters
> but you accept both, in the Greek registry we will try to make a DNAME of
> the two domain names and protect our registrants. I do not expect this to
> be the case with the gTLDs or anyone else allowing registrations in Greek
> characters.
>
> At present the protocol as proposed excludes the final sigma from the table
> of characters that are valid for registration. The certain thing for me,
> however, is that the use of the final sigma in an address bar is mandatory
> for the representation of the Greek language and it should somehow be in
> the protocol.
>
> Since we have two possible solutions, I could discuss the pros and cons
> of either of them. My preference is with the one where the protocol
> proactively prohibits phishing and allows for the correct translation from
> upper case to lower case for a good user experience of the IDNs. Thus I
> propose to maintain the IDNA2003 solution, the character mapping, in
> IDNA2008.
>
>
>
> ================================
>
> (8.c) Disallow conjoining Hangul jamo per recommendation from
> KRNIC and others, permitting only precomposed syllables.
>
> ===============================================
>
> From our Korean colleagues:
>
> Dear Dr. Cerf and other WG members,
>
> First of all, I would like to thank WG members for their comments on this
> matter.
>
> I have been discussing this issue again with my government since our last
> IETF meeting.
>
> Among several government bodies (Ministry of Knowledge Economy, Korea
> Communications Commission, etc.) and government agencies (Korean Agency
> for Technology and Standards, the National Institute of the Korean
> Language, Korean Standards Association), there was a lively discussion on
> the feedback from the IDNAbis IETF WG.
>
> The position of the Korean government is the same as before since we made a
> decision very carefully to prevent a potential harm for IDN users.
>
> I will try to provide a clearer explanation of this Hangul Jamo issue next
> week. Please understand that the government process is slow.
>
> Thank you.
>
> Regards,
> Jaeyoun Kim
> National Internet Development Agency of Korea (NIDA)
>
> =================
>
> A YES vote would represent a significant security problem, and slow the
> development of IDNA2008 significantly. There are two distinct issues wrapped
> up in this tranche.
>
>
> As for the conjoining Hangul characters, these are used in representing
> non-modern Hangul characters. The committee has had a long-standing
> consensus for *not* going character by character through each script to
> determine which are the modern-use characters and which are not. We do not
> need to reopen this issue.
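
To make the jamo distinction concrete, here is a small Python sketch. NFC normalization is used only to illustrate which jamo sequences have a precomposed syllable; it is not the rule the WG was being polled on, and the two sample sequences are chosen purely as examples.

```python
import unicodedata

# Conjoining jamo that spell a modern syllable (U+D55C when precomposed).
modern = "\u1112\u1161\u11ab"
# Starts with U+1140, an old-Hangul consonant with no precomposed syllables.
archaic = "\u1140\u1161"

composed_modern = unicodedata.normalize("NFC", modern)
composed_archaic = unicodedata.normalize("NFC", archaic)

# The modern sequence collapses to one code point; the archaic one cannot.
print(len(composed_modern), len(composed_archaic))
```

A label using only precomposed syllables would therefore be unable to express the second sequence at all.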
>
> If this change is made, then that would force us to rethink that policy,
> potentially bogging us down in protracted analyses of the different scripts
> to exclude non-modern use characters, such as
> U+01BF <http://unicode.org/cldr/utility/character.jsp?a=01BF> ( ƿ ) LATIN
> LETTER WYNN
> U+16B9 <http://unicode.org/cldr/utility/character.jsp?a=16B9> ( ᚹ ) RUNIC
> LETTER WUNJO WYNN W
> and many, many others.
>
> ----------------
>
> NO for 8.c, for the reasons explained by Mark.
> KRNIC is free (or better, strongly recommended) to exclude
> conjoining Hangul from what they allow to register,
> but that should not influence our discussion too much.
>
> [Just as a hopefully far-fetched example, assume that
> one day in North Korea, a few Hangul syllables containing some
> historic Jamos gain crucial importance.]
>
> ----------------
>
> This appears to open the character-by-character decision making that
> we already ruled out.  As Mark Davis argues, if we accept this
> restriction then we probably need to re-open the discussions about
> obsolete scripts, &c.  It sounds to me very like a registry policy.
> The argument that some people will get that registry policy wrong has
> already been floated, and we rejected it.  Indeed, if we don't reject
> that premise, then all of the local mapping approach that we've taken
> should be tossed out, and we should go back to strict mapping in the
> protocol.
> -------------
>
> Let me try to explain the other point of view, to the extent to
> which I understand the issues as they have been explained to me
> by the group associated with the Korean registry (if I have it
> wrong, I hope they will step in directly).  I am going to try to
> write this so as to not be inflammatory.  If I fail, I want to
> stress that being inflammatory is not my intent and ask
> forgiveness in advance.
>
> Unicode classifies characters in various ways using a collection
> of categories and properties.  Those categories and properties
> (or at least the vast majority of them) were designed long
> before the IETF started thinking about IDNs; they were certainly
> not optimized for IDNA requirements.  Given that, we should be
> grateful and pleasantly surprised that the properties work as
> well as they do for our purposes.  On the other hand, we should
> not be surprised when, for some group of characters, they do
> not... and that has nothing to do with character by character
> decisions, at least as I understand that term.
>
> Before addressing the Hangul question, let me invent an example
> that is counterfactual, i.e., barring something unforeseen, we
> are unlikely to ever have to deal with it directly.   There is a
> proposal pending for ISO/IEC JTC1/SC2/WG2 to add a number of
> annotation marks for Arabic.  These marks are, according to the
> proposal (with confirmation from independent experts) used
> strictly for pedagogical purposes.   Obviously, if one were
> going to transmit the instructional texts electronically in
> other than page image form, they have to have code points.  They
> are identified in the proposal with General Category "Sk"
> (modifier symbols).  With that classification, the rules in
> "Tables" would automatically place them in DISALLOWED.  But
> suppose the proposal had identified them as modifier letters
> instead (I'm told there is a case to be made for that, even
> though the relevant Unicode folks have --wisely from our point
> of view but perhaps not others-- decided otherwise).  Then we
> would need to exclude them (the whole group, not
> character-by-character) as a backward-compatibility issue
> because otherwise, to quote a colleague, we would have a huge
> mess on our hands, with all sorts of equivalences failing.
> Again, this is _not_ an issue, but it may help in thinking about
> the Hangul problem.
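
The category-driven classification sketched in this example can be approximated in Python. This is a deliberate simplification: it assumes the letter/digit General_Category set later published in RFC 5892 (Ll, Lu, Lo, Nd, Lm, Mn, Mc) and ignores every other rule in "Tables" (exceptions, context rules, stability checks). The function name is hypothetical, and the two sample characters are existing modifier characters, not the proposed Arabic marks.

```python
import unicodedata

# Assumption: the General_Category values treated as letters or digits.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

def passes_category_rule(ch: str) -> bool:
    # A code point outside these categories would fall through to DISALLOWED
    # unless some other rule picked it up (not modeled here).
    return unicodedata.category(ch) in LETTER_DIGIT_CATEGORIES

modifier_symbol = "\u02c2"   # MODIFIER LETTER LEFT ARROWHEAD, category Sk
modifier_letter = "\u02b0"   # MODIFIER LETTER SMALL H, category Lm

print(passes_category_rule(modifier_symbol))   # False
print(passes_category_rule(modifier_letter))   # True
```

The same group of marks would thus land on opposite sides of the line depending solely on whether it is classified Sk or Lm, which is the situation the example describes.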
>
> ...
>
> [Message clipped]
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>
>