Consensus Call Tranche 8 Results

Vint Cerf vint at google.com
Sun Oct 19 17:02:15 CEST 2008


Consensus Call Tranche 8 (Character Adjustments)

Polling results: 7 YES 8 NO (see below)

It was hard to score the polling because many members wanted to split
their responses (e.g., YES for 8a and 8b but NO for 8c).

In rough terms, the votes were roughly equal on the YES and NO sides
(occasionally counting some votes as both YES and NO). I am sure that
one could end up with different tallies depending on how one interprets
the comments, but I think the basic point is that there was no
consensus for YES on all three of the proposals or for NO against all three.

Trying to summarize, I think I detect a possible willingness to
accept, or interest in, the following:

1. Allow eszett (can be excluded by registry).
2. Map final sigma to lower-case medial sigma per IDNA2003 (if the
mapping can be done as an exception rule in IDNA2008???).
3. Allow jamo but rely on registries to exclude them if desired. The
Korean experts are still consulting on this.

A key question for all three of these cases is whether there are
clear and likely cases of ambiguity in which, absent mandatory
mapping or protocol exclusion, two users might enter what they THINK
are equivalent U-labels and end up with DIFFERENT punycode, and thus
different destinations.
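
To make the question concrete, here is a minimal sketch (Python 3,
standard library only; note that the punycode codec does none of the
IDNA processing and omits the "xn--" prefix) using the two character
pairs discussed below:

    pairs = [
        ("gro\u00dfe", "grosse"),                  # eszett vs. "ss"
        ("\u03ba\u03cd\u03c0\u03c1\u03bf\u03c2",   # kypros, final sigma
         "\u03ba\u03cd\u03c0\u03c1\u03bf\u03c3"),  # kypros, medial sigma
    ]
    for a, b in pairs:
        # Absent a mandatory mapping, each spelling encodes on its own,
        # so the two lookalike labels resolve to different names:
        print(a, a.encode("punycode"), "|", b, b.encode("punycode"))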

I tried to capture the back and forth below by extracting comments  
from the emails on this polling cycle and organizing them as comments  
about 8a and 8b in one group and 8c in another. You'll have to be the  
judge whether this effort helps or confuses the discussion even more.


Vint


(8) Specific character adjustments for IDNA2003 -> IDNA2008
differences.

(8.a) Make Eszett Protocol-Valid per list discussion.

(8.b) Make Greek final sigma Protocol-Valid per list
discussion.

(8.c) Disallow conjoining Hangul jamo per recommendation from
KRNIC and others, permitting only precomposed syllables.


COMMENTS:

GENERAL COMMENTS:

What concerns me is the current discussion.

People are arguing strongly to listen to a ccTLD registry that uses
Eszett regarding approving it.

People are arguing strongly to NOT listen to a ccTLD registry that
uses Jamo regarding disallowing it.

Sure, there is a difference between approving and disallowing, but
why should we (as in the WG) listen more to one registry than another?

Are we trusting one more than the other?

Do we listen more to more active people on this mailing list than
parties not as active?

--------------

The only reason I have for responding the way I do is that I
understand these responses to be the most consistent with the initial
principles from which we started: use the Unicode properties, do
everything as much as possible by tables, and introduce as few
exceptions as possible and practical.  The constraint on "possible
and practical" is "internationalize LDH" rather than other goals (such
as "write novels in DNS labels" or "make DNS safe for everyone, given
that there are a lot of visually-confusable characters" or even
"ensure that zone operators can't permit bad things").

As far as I am able to tell -- but I'm not an expert in these matters,
and I don't really have the time to become one -- the inclusion and
exclusion in the respective cases are the most consistent with those
principles.  If we want to adopt other principles, that's ok with me
too.  It might change my opinion on these cases.  I don't have any
opinion what the outcome should be overall; I only have an opinion
given the overall design principles we're trying to follow.  (This is
the same reason I thought that, even though it's very unlikely anyone
will want domain names in, say, Phoenician, the design principle
didn't really permit us to exclude the archaic characters.)
-----------------------
Of necessity, any specific language expert(s) are going to be a (small)
minority of an IETF WG, which is where "rough consensus" can fail to
produce an outcome which incorporates the contributions of the specific
language expert(s).

It happened in 2002/3, with different chair(s), but it's not just the
chair(s) who hum. I'm still concerned that the Arabic Script meeting at
ICANN Paris yielded information from the Jawi user community which was
dismissed out of hand here. I don't think distance from Minneapolis or
Dublin really makes a technical argument poorer than proximity makes
it better.

It is something of an inherent defect that our "consensus" can mean
meeting-centrism, where "meeting" takes on values somewhere along the
Goonhilly-ISI arc. Our problem space is slightly larger.

Ironically, in 200X, X << 8, it was Koreans who wanted Cherokee banned
(similarity to ASCII). Fortunately the more popular position did not
then prevail.

------------

===================SPECIFIC PROPOSALS==========

(8.a) Make Eszett Protocol-Valid per list discussion.

(8.b) Make Greek final sigma Protocol-Valid per list
discussion.
--------------------------------
Note that if we look at the two proposals, eszett and the one from
Korea, the eszett one is handled as an exception, while the Korean
proposal uses the Unicode properties.
---------------
While the desire for ß and ς characters is understandable, there are  
problems with compatibility. Until they are upgraded, which will  
require some period of time, implementations will be supporting  
IDNA2003 and not IDNA2008. And for compatibility, for the foreseeable  
future, even implementations that support IDNA2008 will need to also  
support IDNA2003.

In most cases the differences between these are tractable for
companies like my own. URL X may be valid in IDNA2003 and not in
IDNA2008, or vice versa, but it never goes to two different locations.
These two characters would break that: URL X could go to two
*different* locations, depending on which standard is being supported.

If I send someone große.com in an email, then depending on what tools  
the user uses to read that email, it could end up at grosse.com (a  
legitimate site) or große.com (a spoof site). (Or, of course,  
große.com could be the legitimate site and grosse.com the spoof  
site.) This represents a significant security problem.
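
The IDNA2003 half of that scenario can be reproduced directly; a
minimal sketch, assuming Python 3, whose built-in "idna" codec
implements the IDNA2003 (RFC 3490) nameprep rules:

    # nameprep case-folds the eszett to "ss" before encoding, so the
    # label comes out pure ASCII:
    print("gro\u00dfe.com".encode("idna"))    # -> b'grosse.com'
    # An IDNA2008 implementation would instead keep U+00DF and emit a
    # distinct xn-- A-label, hence the two possible destinations.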

Sigma is fundamentally a presentation issue: it should be displayed  
as ς if it is final. An alternative approach would be to add a SHOULD  
that it be so displayed.

Eszett is slightly trickier. Yet its use in German orthography is not
fundamentally required, as evidenced by the fact that it is not used
in High German in Switzerland, with no apparent ill effects on the
population (see, for example, http://www.nzz.ch/). And the recommended
usage of ss vs. ß changed substantially in the latest, not wholly
successful, German spelling reforms. As a percentage of words in use,
especially when weighted by usage, the number that are distinguished
by ss vs. ß is vanishingly small.

As stated in rationale-03:
    They [DNS 'names'] are typically derived from, or rooted in, some
    language because most people think in language-based ways.  But,
    because they are mnemonics, they need not obey the orthographic
    conventions of any language: it is not a requirement that it be
    possible for them to be "words".

    This distinction is important because the reasonable goal of an IDN
    effort is not to be able to write the great Klingon (or language of
    one's choice) novel in DNS labels but to be able to form a usefully
    broad range of mnemonics in ways that are as natural as possible in
    a very broad range of scripts.

Thus, while recognizing the legitimate desire of people to use the ß
and ς characters, the cost in terms of compatibility and security does
not appear to be worth the gain. It is therefore too early for
consensus on these.

Instead, those wanting to make this change should propose some  
mechanisms for avoiding the security problems -- only if those can be  
overcome in a reasonable fashion could we incorporate this change,  
allowing ß and ς.
----------
YES for 8.a and 8.b. Despite the transition issues
mentioned by Mark, the long discussion on this list has
shown that these are the right things to do in the long term.
While I'm not aware of any concrete examples of similar
cases, I think it would be worthwhile to check with other
potentially affected script/language communities.
What, for example, about the few final letters in Hebrew?

-------------------

Or the many initial and final letters in Arabic?  The answer in
both cases is that these are individual characters and are
PROTOCOL-VALID.

[Note by another WG member:
I have to apologize for picking the Hebrew finals example.
I was on a train, guessing. The answer is that the Hebrew
finals are PROTOCOL-VALID. But that's not the case for
Arabic. In Hebrew, there are just a few final variants,
and they got encoded as first-class letters, and because
Hebrew doesn't have case, they didn't get excluded by
special case folding the way the Greek final sigma has.

However, Arabic has a lot of initial/final/medial/isolated
glyph variants, and therefore these are context-dependent
and created by rendering engines, not encoded as such.
There are encodings of these variants in the compatibility
area, but they should be excluded (DISALLOW) by the fact
that there are compatibility mappings from them to the
base letters.]


What I believe got us into difficulty with
Eszett and Final Sigma wasn't the positioning issue or an
alternate shaping one but the intersection between them and the
case-folding rules.  Since, at least as of Unicode 3.2, neither
of them had upper-case forms and IDNA2003 violated the Unicode
Standard's advice against using case-folding to actually map
characters (rather than using it only in comparison but
retaining the original forms), the only result consistent with
the general IDNA2003 model was Eszett -> "ss" and Final Sigma ->
Medial Lower Case Sigma.
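
That folding is easy to reproduce; a minimal sketch, assuming Python 3,
whose str.casefold() implements Unicode full case folding:

    print("\u00df".casefold())   # eszett U+00DF      -> 'ss'
    print("\u03c2".casefold())   # final sigma U+03C2 -> 'σ' (medial)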

Since neither Hebrew nor Arabic (nor any of the other scripts
that have position-sensitive characters) have case, they cannot
get into the same problem.

Since we don't do case mapping in IDNA2008, the case folding
issue does not apply, regardless of what one thinks of that
operation and its applicability.  Without it, the only issue is
whether it is worth banning the characters to preserve part of
the IDNA2003 behavior (or making a major exception and
preserving the IDNA2003 mapping behavior) for the long term even
though it is clear that, were the decision being made for the
first time with the IDNA2008 rules, we would not even be asking
the question.

[Note by another WG member:
Yes indeed. But eszett and final sigma are not the only ones
affected by casing. The data that deals with cases where casing
isn't one-to-one is http://unicode.org/Public/UNIDATA/SpecialCasing.txt.

That includes a lot of data that may be irrelevant for us,
but I think it would be worthwhile to carefully examine it
so that we can fix everything that we need to fix.
The first character that comes to my mind is the lower-case
dotless i, used for Turkish and Turkic languages.]
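
As a small illustration of the dotless-i concern, a sketch assuming
Python 3's default (untailored, non-Turkish) Unicode case mappings:

    # Under the default mappings, dotless i does not round-trip:
    print("\u0131".upper())          # U+0131 -> 'I'
    print("\u0131".upper().lower())  # -> 'i', not the original U+0131
    # The Turkish tailorings in SpecialCasing.txt (I <-> dotless i,
    # dotted capital I <-> i) are language-specific and not applied
    # by default.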


--------------
If eszett and final sigma are permitted, there must be discussion of the
backwards-compatibility and security consequences.  This brings back the
suggestion that it may be lower overall cost to change the xn-- prefix
for IDNA2008.

I'm not yet ready to decide on 8.a and 8.b until we have discussed and
reviewed the backwards-compatibility and security issues.  Pending such
text to review, I'd say NO, because we do not know today what the
consequences of making this change are.
----------------
  On these
two, I have no opinion; I don't feel sufficiently qualified to say
whether these individual characters should be altered.  My
understanding is that, because they are consistent with the tables
approach that we are taking, the only reason to exclude them would be
historical.  Since the unhappiness with some of those historical
decisions is part of the justification for the current work, it seems
to me that these ought to be allowed (although I wonder whether 8.b
ought to have a context rule).

------------
Could you explain why you would require a context rule for Final
Sigma without requiring one for Eszett?  Certainly it would be
easier to specify a rule for the former ("Script=Greek") while
the latter would presumably require either "Script=Latin"
(which wouldn't do much good) or an enumerated list of
characters.  One can't require that the character actually
appear in the last position in a label without preventing people
from constructing labels by cramming words together... any
prohibition along _those_ lines should certainly be a registry
decision, IMO.

For the record (and context when that discussion re-emerges on
the list), at least some of the Greek IDN community would prefer
that we preserve the IDNA2003 mapping / case-folding behavior
for final sigma even if that is the only required mapping in
IDNA2008.
-----------------

These should only be made PVALID *if* sufficient information
is added to the protocol document about the particular
transition and security issues involved in making them
so. It is far safer to leave them DISALLOWED -- and in
the particular case of the final sigma, to make recommendations
for *display* of domain names.
-------------------
FWIW, I volunteer to provide text for the protocol document with
respect to 8.a.

Having this text available now would influence my position on this.  I
think it is very complicated text to write.  To fully evaluate all
alternatives, I believe it will also need to compare the costs of making
the change against those of changing the xn-- prefix.

Wrt permanence, if we change the prefix, at least all wire-encoded IDN
URLs would remain permanent between IDNA2003 and IDNA2008.  I'm also
concerned with stable URIs.  Possibly IDNA could take the position that
it will never make backwards-incompatible changes without changing the
prefix; that would mean wire-encoded IDN URLs are permanence-safe.
--------------

Thanks for the offer.  Text eagerly awaited -- I've got some
notes, but I'm sure what you have will be complementary and
better.

More generally, those two characters have been extensively
discussed, both on and off-list.  In the case of Eszett (8.a)
the German orthographic situation is clear and the top-level
registries who are likely to be most affected understand the
transition issues (either way) and are willing to deal with
them.  In the Sigma case, the current registry preference is to
preserve the IDNA2003 mapping as part of the protocol (see
forthcoming note).

[Note by another WG member:
Actually, [this] would be a 180 degree turn from the much earlier
and broader decision to eliminate all mappings from the protocol
and to make transformations between U-labels and A-labels fully
reversible without loss of information.  It is, however,
consistent with the position Vaggelis suggests and requests in
the note he posted today; I was just trying to identify that
preference in my note, not advocate for it.]

-------------
Characterizing the eszett as an exception is correct on one level,
but in my view it's only an exception because we took the wrong
rules for IDNA2003. And those rules are even more wrong for IDNA2008.

[Note by another WG member:
With "exception", I mean "exception" as defined in the tables document
of IDNA200x.]


What IDNA 2003 needed was some kind of case mapping. Unicode provided
two levels of case mapping: a) the simple one-to-one case mappings,
and b) special-casing for cases such as eszett (on top of a).

At the time of IDNA 2003, the mood was: 1) We have to take some
existing tables, we can't construct our own or we'll never finish.
2) Take special-casing, because that's what you would do for search,
and domain name lookup is essentially search.

The problem with this is that 2) isn't exactly true. In search,
you get back original documents, so there are no misspelling
issues. For IDNs, you get back whatever you put in after case
folding, and so you end up with misspellings.

So in my view, we should look at what we get when we remove
special-casing from our rules.

-----------
I could live with 8.a and 8.b

-------------
YES to 8.a and 8.b

------------

I would like the final sigma to continue working as it does today, so
that registrants can use lower-case domain names as they usually do in
the Greek language, typing the final sigma at the end of the word.

Here is an example, for clarification, for the members of our list:

It would be best if "κύπρος" and "κύπροσ" were represented with
different punycode translations, since then they would be correctly
represented in the address bar.

However, although in IDNA2008 the upper-case characters are invalid, I
am sure that they will be accepted in the browser and translated to
lower-case characters. In this translation there is no upper-case
character equivalent to the final sigma; both final sigma and medial
sigma have the same uppercase (Σ).
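
A sketch of this many-to-one casing, assuming Python 3.3+, whose
default case mappings include the Final_Sigma context rule:

    # Both sigma forms share one uppercase letter:
    print("\u03c2".upper(), "\u03c3".upper())   # Σ Σ
    # Lowercasing chooses the sigma from context, so the originally
    # registered spelling cannot be recovered from the uppercase form:
    print("\u039a\u03a5\u03a0\u03a1\u039f\u03a3".lower())   # κυπρος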

This brings us to the case where, if you have registered "κύπρος", you
will have no way to write this domain in upper case other than to
misspell it as "ΚΥΠΡΟς", while somebody else could have registered
"κύπροσ" (xn--vxakcel0d), which is "ΚΥΠΡΟΣ" in uppercase, and on
purpose phish for your clients, who rightfully think that "ΚΥΠΡΟΣ" is
the correct uppercase equivalent of "κύπρος".

If in IDNA2008 you make final sigma and medial sigma different
characters but accept both, in the Greek registry we will try to make a
DNAME of the two domain names and protect our registrants. I do not
expect this to be the case with the gTLDs or anyone else allowing
registrations in Greek characters.

At present the protocol as proposed excludes the final sigma from the
table of characters that are valid for registration. The certain thing
for me, however, is that the use of the final sigma in an address bar
is mandatory for the representation of the Greek language, and it
should somehow be in the protocol.

Since we have two possible solutions, I could discuss the pros and
cons of either of them. My preference is for the one where the protocol
proactively prevents phishing and allows for the correct translation
from upper case to lower case, for a good user experience of the IDNs.
Thus I propose to maintain the IDNA2003 solution, the character
mapping, in IDNA2008.



================================

(8.c) Disallow conjoining Hangul jamo per recommendation from
KRNIC and others, permitting only precomposed syllables.

===============================================

From our Korean colleagues:

Dear Dr. Cerf and other WG members,

First of all, I would like to thank WG members for their comments on
this matter.

I have been discussing this issue again with my government since our
last IETF meeting.

Among several government bodies (the Ministry of Knowledge Economy, the
Korea Communications Commission, etc.) and government agencies (the
Korean Agency for Technology and Standards, the National Institute of
the Korean Language, and the Korean Standards Association), there was a
lively discussion of the feedback from the IDNAbis IETF WG.

The position of the Korean government is the same as before, since we
made the decision very carefully to prevent potential harm to IDN
users.

I will try to provide a clearer explanation of this Hangul Jamo issue
next week. Please understand that the government process is slow.

Thank you.

Regards,
Jaeyoun Kim
National Internet Development Agency of Korea (NIDA)

=================

A YES vote would represent a significant security problem, and slow  
the development of IDNA2008 significantly. There are two distinct  
issues wrapped up in this tranche.


As for the conjoining Hangul characters, these are used in
representing non-modern Hangul syllables. The committee has had a
long-standing consensus for *not* going character by character
through each script to determine which are the modern-use characters
and which are not. We do not need to reopen this issue.

If this change is made, it would force us to rethink that policy,
potentially bogging us down in protracted analyses of the different
scripts to exclude non-modern-use characters, such as
U+01BF ( ƿ ) LATIN LETTER WYNN
U+16B9 ( ᚹ ) RUNIC LETTER WUNJO WYNN W
and many, many others.

----------------

NO for 8.c, for the reasons explained by Mark.
KRNIC is free (or better, strongly recommended) to exclude
conjoining Hangul from what they allow to register,
but that should not influence our discussion too much.

[Just as a hopefully far-fetched example, assume that
one day in North Korea, a few Hangul syllables containing some
historic Jamos gain crucial importance.]

----------------

This appears to reopen the character-by-character decision making that
we already ruled out.  As Mark Davis argues, if we accept this
restriction then we probably need to re-open the discussions about
obsolete scripts, &c.  It sounds to me very like a registry policy.
The argument that some people will get that registry policy wrong has
already been floated, and we rejected it.  Indeed, if we don't reject
that premise, then all of the local mapping approach that we've taken
should be tossed out, and we should go back to strict mapping in the
protocol.
-------------

Let me try to explain the other point of view, to the extent to
which I understand the issues as they have been explained to me
by the group associated with the Korean registry (if I have it
wrong, I hope they will step in directly).  I am going to try to
write this so as to not be inflammatory.  If I fail, I want to
stress that being inflammatory is not my intent and ask
forgiveness in advance.

Unicode classifies characters in various ways using a collection
of categories and properties.  Those categories and properties
(or at least the vast majority of them) were designed long
before the IETF started thinking about IDNs; they were certainly
not optimized for IDNA requirements.  Given that, we should be
grateful and pleasantly surprised that the properties work as
well as they do for our purposes.  On the other hand, we should
not be surprised when, for some group of characters, they do
not... and that has nothing to do with character by character
decisions, at least as I understand that term.

Before addressing the Hangul question, let me invent an example
that is counterfactual, i.e., barring something unforeseen, we
are unlikely to ever have to deal with it directly.   There is a
proposal pending for ISO/IEC JTC1/SC2/WG2 to add a number of
annotation marks for Arabic.  These marks are, according to the
proposal (with confirmation from independent experts) used
strictly for pedagogical purposes.   Obviously, if one were
going to transmit the instructional texts electronically in
other than page image form, they have to have code points.  They
are identified in the proposal with General Category "Sk"
(modifier symbols).  With that classification, the rules in
"Tables" would automatically place them in DISALLOWED.  But
suppose the proposal had identified them as modifier letters
instead (I'm told there is a case to be made for that, even
though the relevant Unicode folks have --wisely from our point
of view but perhaps not others-- decided otherwise).  Then we
would need to exclude them (the whole group, not
character-by-character) as a backward-compatibility issue
because otherwise, to quote a colleague, we would have a huge
mess on our hands, with all sorts of equivalences failing.
Again, this is _not_ an issue, but it may help in thinking about
the Hangul problem.

For Hangul, the individual Jamo (again, a clearly-identified
group of characters, not a character-by-character decision) are
used to construct conventional (and precomposed) characters
("Hangul syllables").  To the extent to which there is an
analogy in Latin-based script, they would be combining
characters that combine without a base character.  For
Latin-based scripts, we don't need to worry about conflicts
between precomposed characters and composing (base+combining
character) forms of the same characters because the NFC
requirement deals with the problem.   For Korean, there is no
equivalent because NFC doesn't produce the relevant precomposed
forms.   And, because it doesn't, our problem is not one of
confusing similarity (a registry problem) but one of having
comparisons work correctly (a much deeper issue which we have
generally dealt with in the protocol, in the analogous case by
the requirement for NFC.  If Unicode had assigned properties
that treated the Syllables differently from the Jamo, we would
simply build a rule using those categories and we would not be
having a discussion about, e.g., "character by character
decisions".  But there is apparently no such property --both the
Jamo and the Syllables are in General Category "Lo" and the rest
of the properties appear to match as well.

I think the situation --and the comparison failures that would
result if we don't deal with it-- makes a strong case for our
disallowing either the Jamo or the Syllables.  The ccTLD
registry and local experts strongly prefer that we disallow the
Jamo, even though it means that some archaic Syllables and
fanciful forms are disallowed as a consequence.  I think we should
just defer to them.

Just my opinion, of course.

The argument that some people will get
that registry policy wrong has already been floated, and we
rejected it.  Indeed, if we don't reject that premise, then
all of the local mapping approach that we've taken should be
tossed out, and we should go back to strict mapping in the
protocol.

Again, the issue here is one of comparison failures, not of
confusability or other registry policy questions.
------- ----
The combining Jamo *do* form composed characters under NFC. Here is  
an example:

U+1100 ( ᄀ ) HANGUL CHOSEONG KIYEOK
U+1161 ( ᅡ ) HANGUL JUNGSEONG A
U+11A8 ( ᆨ ) HANGUL JONGSEONG KIYEOK
=>
U+AC01 ( 각 ) HANGUL SYLLABLE GAG

That is, each of the Hangul precomposed syllables decomposes into two
or three combining jamo under NFD, and under NFC that sequence of
combining jamo composes back into that syllable. The comparisons *do*
work correctly, since IDNA labels have to be in NFC.

For non-modern use characters, the NFC form may not combine all of  
the characters, simply because there may not be a corresponding  
precomposed form to combine them into. That is not a problem. It is  
similar to cases with accents; the NFC form composes as much as it  
can, but where it can't compose it leaves the code points separate.

The key point is that the result is still unique and does not cause a  
problem for comparison.
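
Both behaviors can be checked directly; a sketch assuming Python 3's
unicodedata module, with the Old Hangul lead consonant U+1113 chosen
as an example of a jamo that has no precomposed syllables:

    import unicodedata as ud

    # Modern syllable: three conjoining jamo compose under NFC ...
    s = "\u1100\u1161\u11a8"                   # KIYEOK + A + KIYEOK
    assert ud.normalize("NFC", s) == "\uac01"  # HANGUL SYLLABLE GAG
    # ... and the precomposed syllable decomposes back under NFD:
    assert ud.normalize("NFD", "\uac01") == s

    # Old Hangul: U+1113 has no precomposed syllables, so NFC leaves
    # the sequence decomposed -- still a unique form for comparison:
    t = "\u1113\u1161"
    assert ud.normalize("NFC", t) == t
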
----------
Korean is a very hierarchically designed script,
and it depends which level you look at.

On the level of Jamos (the level involved in the current consensus
call), everything is as described by Mark, and as we are used to for
other kinds of characters (except much more regular, and therefore
done by formulae rather than tables).

On the lower level, usually called "featural", John would be
correct that there is no defined normalization mechanism.
As an example, take the Jamo
     U+1101 HANGUL CHOSEONG SSANGKIYEOK
If you look at it at http://www.unicode.org/charts/PDF/U1100.pdf,
you might agree that it looks like a sequence of
     U+1100 HANGUL CHOSEONG KIYEOK
     U+1100 HANGUL CHOSEONG KIYEOK
However, there is no decomposition along these lines in Unicode.
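
The absence of a featural-level decomposition can be confirmed
directly; a sketch assuming Python 3's unicodedata module:

    import unicodedata as ud
    # U+1101 SSANGKIYEOK has no canonical decomposition into two
    # U+1100 KIYEOK, so normalization never splits or merges it:
    print(repr(ud.decomposition("\u1101")))            # '' (none)
    print(ud.normalize("NFD", "\u1101") == "\u1101")   # True
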
---------
I didn't see that happening when I ran a few tests of my own, but I
certainly have to defer to your example and experience.  My instinct
is still to defer to the national experts and the registry, but, if
the [pre]composed characters are consistently formed by NFC, I agree
that consistency with decisions made elsewhere would disallow the
problematic comparison cases and dictate that we leave this to
registry restrictions.

[NOTE from another WG member:
They are. (That is the 11,172 Wanseong Hangul Syllable Blocks
encoded at U+AC00..U+D7A3 behave that way.)]


I am a bit concerned about the hypothetical case that Martin
raised and my reaction to it, at least if I correctly understand
Unicode's stability rules.  If a few syllables that are now
considered archaic (or, if such cases exist, ones that have
never been used) abruptly become, to use Martin's term, of
crucial importance, would the syllable forms be allocated code
points?  If so, am I correct in assuming that the stability rules
would require NFC to actually decompose the newly-added
syllables (presumably, composing the individual Jamo to the new
syllables would result in an incompatible change to
normalization)?

[remark from another WG member:
Counterfactual. But yes, if it *were* the case (which it isn't),
then addition of a new precomposed Hangul syllable would then
require addressing normalization stability. The exact details
of how that would be done are unclear, because any new
precomposed Hangul syllable would, by definition, be outside
the context of the Hangul Syllable Composition and Hangul
Syllable Decomposition algorithms (TUS 5.0, pp. 122 - 124)
which *define* the normalization relationship between
conjoining jamos and the 11,172 precomposed Hangul syllables.]

[remark from another WG member:
You can call this an incompatible change to normalization, but it's
actually designed the other way round: any data that already exists
(and is decomposed, because there was no precomposed form) is already
normalized according to these rules. In this very important(*) sense,
the change to normalization is backwards-compatible. This then
creates a strong argument for not encoding anything new
in the first place.

(*) While in the IETF there is a high awareness of how difficult it
is to change an installed base of software, data isn't usually much
discussed, but it should be obvious that changing an existing base
of data is far tougher. That's why the stability rules for
normalization are the way they are.]



That isn't an attractive answer because it
makes the behavior dependent on when a particular character code
point is added to Unicode.

[NOTE from another WG member:
Also counterfactual, because such characters will not be added
to the Unicode Standard. Nobody in the UTC *or* in Korea
(South or North) is asking for them.

In fact, if you read the new Korean standard, KS X 1026-1:2007,
"Part 1, Hangul processing guide for information interchange",
that standard *mandates* that for Old Hangul syllable blocks
a sequence of three Jamos be used:

"5.2 A representation format of Modern Hangul syllable blocks

"For representing Modern Hangul syllable blocks, we must use code
positions of 11,172 Hangul syllables U+AC00 ~ U+D7A3. ...

"5.3 A representation format of Old Hangul syllable blocks

"For representing Old Hangul syllable blocks, we must use
code positions of Johab Hangul letters in Hangul Jamo U+1100 ~
U+11FF, Hangul Jamo Extended-A U+A960 ~ U+A97F, and Hangul Jamo
Extended-B U+D7B0 ~ U+D7FF, ..."

That isn't something that the UTC wrote in the Unicode Standard --
it is what the Korean Agency for Technology & Standards wrote
in a *Korean* standard.]

[Note by another WG member:
This may be getting somewhat OT, but it should be noted that the
above recommendation for Old Hangul is in conflict with NFC. From
http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3422.pdf:

2) A Wanseong syllable block cannot be recomposed with Johab Hangul
letter(s) to represent another Hangul syllable block.
- for example: U+AC00 U+11EB (incorrect) vs. U+1100 U+1161 U+11EB
(correct)

NFC would result in U+AC00 U+11EB, not U+1100 U+1161 U+11EB.]
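
The conflict is reproducible; a sketch assuming Python 3's unicodedata
module (U+11EB is an Old Hangul final, outside the range of trailing
consonants that compose algorithmically):

    import unicodedata as ud
    # KS X 1026-1 mandates the all-jamo form, but NFC composes the
    # leading jamo pair into the modern syllable U+AC00 and leaves
    # the Old Hangul final U+11EB trailing:
    s = "\u1100\u1161\u11eb"
    print(["%04X" % ord(c) for c in ud.normalize("NFC", s)])
    # -> ['AC00', '11EB']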



The other alternatives are certainly worse for general applications
of Unicode.  However, I note that prohibiting the Jamo in IDNA would
prevent the problem, at the cost of requiring anyone who wants to use,
in a domain name, a syllable that is not now assigned a code point to
persuade the UTC and SC2 to add that code point.

[Note by another WG member:
That would be exactly NOT what we wanted.]

[Note by another WG member:
Which will never happen.

See above. Prohibiting the Jamos in IDNA would prevent the
usage of Old Hangul syllable blocks in domain names, period.
And frankly, I consider that well within the purview of
registry policies in Korea, if that is what they want to do.]


Unless the national experts and registry can make a much
stronger case than I can make on their behalf (that ought to be
easy for them, but they have not yet been heard from), I think
the NFC relationships still shift the balance toward making this
a registry restriction.  However, I don't think the answer is
quite as obvious and one-sided as your note seems to imply.

[Note by another WG member:
Noted. But I disagree and consider this one obvious. I will
return to Andrew Sullivan's point. If you think it is within
the competence and purview of this particular working group
to decide that the *protocol* should prohibit a certain
subset of historical Old Hangul syllables from representation
in domain names, then we may as well reopen the discussion
about the appropriateness of the protocol allowing Sumero-Akkadian
cuneiform, Linear B syllables, or other historic scripts in
domain names.]

[Note by another WG member:
I am not arguing for the competence and purview of the WG to
start making up selective prohibitions.  I am suggesting that,
when there are cognizant registries (especially at the top
level) and governments for which a living language is primary,
we should be very careful about ignoring their advice (although
we should also be sure we and they understand the advice and its
implications in the same way).    As I believe you pointed out
in a prior discussion, we are unlikely to encounter such
registries or governments where Sumero-Akkadian cuneiform,
Linear B syllables, etc., are concerned, so the situations are
somewhat different.]

-------------
I think we have held quite a bit of discussion.

We understand that the Korean registry doesn't need Jamos, and
doesn't plan to allow them. We understand that they can do that
by excluding them at the registration level.

We have heard that there are supposed to be some normalization
problems between Jamos and Syllable blocks, but we know that
this is not the case. If this information is referring to
something different and specific, then it should be clarified
by the Korean side or whatever intermediaries there are.

We have a policy of allowing characters that are used historically,
to not close doors unnecessarily. By that policy, we allow
Korean Jamos so that they can be used if the need arises.

If the above isn't a good summary of the discussion, and/or
if there is more material, please add your points or send
pointers.

-----------------
These should not be disallowed by the table. The issues
are as Mark explained them, and the Korean NIC can simply
disallow them for their registrations if that is their
preference.
-------------------
I have attached the Korean proposal, which in short is:

1. Add Hangul Jamo to the blocks to disallow (i.e., "2.1.4
IgnorableBlocks (D)").
2. Add two code points (which are "Inherited", but not DISALLOWED by
other means) to DISALLOWED.

I.e., I must correct myself when I said that the proposal only uses
Unicode properties. I cannot (but I am tired...) see how to catch the
two Bangjeom code points, U+302E and U+302F, without using exceptions.

People interested in this discussion should also re-read the messages
from Ken, where he explains his view that this is something that
should be expressed by registry policy.

Message-Id: <200807281913.m6SJDpL01810 at birdie.sybase.com>
Date: Mon, 28 Jul 2008 12:13:51 -0700 (PDT)

[Note by another WG member:
I had a look at the document again. For those points of the
proposal where it disagrees with what we currently have,
the words "not needed" are used. Nothing that even comes
close to words such as "harmful", "confusing", or the like
appears for points 1 and 2. The word "confusing" appears for
point 3, Hangul Compatibility Jamo, which we already disallow.

Of course, writing and reading such documents is always fraught
with difficulties, but I don't think that the hypothesis that
the authors understand the difference between "we don't need
them" and "these are dangerous" is far-fetched.]

[Note by another WG member:

This is because I think it is serious for an IETF WG to make a
consensus decision that is _against_ a proposal from a formal
organisation like NIDA, which says it has been running a
consensus-driven process in Korea with participants from the Korean
Agency for Technology and Standards (the national body for ISO and
IEC), the National Institute of the Korean Language, etc.

If we had a similar situation in Sweden, where the IETF ruled against
what a similar consensus-driven process in Sweden had concluded about
Swedish... well, I would start asking serious questions about how
consensus in the IETF was reached.

So, as editor of the tables document I am neutral on the issue. I just
envision that for 8.c we will get questions, given what the consensus
seems to be at the moment.]


----------------
but 8.c is not acceptable for reasons
better expressed by Ken. Jamo inclusion/exclusion policy is a matter of
registry policy. And I totally understand that at this point Korean
registration constituencies would want to exclude Jamos, but that is a
different matter.

----------------
reluctant NO to 8.c


NOTE NEW BUSINESS ADDRESS AND PHONE
Vint Cerf
Google
1818 Library Street, Suite 400
Reston, VA 20190
202-370-5637
vint at google.com



