IDNA 2008 security

Mon Dec 22 14:36:30 CET 2008

On 2 dec 2008, at 01.21, Dick Sites wrote:

> First, I very much support the inclusion-based approach. It should be
> clearly stated on page 4 that Unassigned codes are Disallowed until
> explicitly assigned and included.

I do not want to include text from the other documents on how to treat  
the different categories.

> It would be helpful to state explicitly on page 4 or 5 that upper-case
> letters are disallowed. For emphasis and to avoid confusion of the
> casual reader, Lu could be removed from LetterDigits on page 5.

This is not a correct statement: it is not the case that upper-case
letters are disallowed as such. We disallow characters that have a
mapping to lower case. See "2.2. Unstable (B)".

That said, today (as of Unicode 5.1) I cannot find any codepoints
that are both PVALID and Lu.
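
As an illustration only, here is a minimal Python sketch of that kind of
stability test. It assumes the Unstable rule has the form
cp != NFKC(CaseFold(NFKC(cp))); the function name is_unstable is
hypothetical, and Python's unicodedata module follows a later Unicode
version than 5.1, so results for individual codepoints may differ from
the tables in the draft.

```python
import unicodedata


def is_unstable(ch: str) -> bool:
    """Return True if ch changes under NFKC + case folding + NFKC.

    Assumption: this mirrors a rule of the form
    cp != NFKC(CaseFold(NFKC(cp))). It is a sketch, not the draft's text.
    """
    nfkc = unicodedata.normalize("NFKC", ch)
    folded = nfkc.casefold()
    return unicodedata.normalize("NFKC", folded) != ch


# An upper-case letter maps to lower case, so it is unstable.
print(is_unstable("A"))
# A lower-case letter maps to itself, so it is stable.
print(is_unstable("a"))
```

The point is that "A" is caught because it has a lower-case mapping, not
because its General_Category is Lu.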

> On page 9, rule 11 should be restated as The value is DISALLOWED. This
> avoids the need for a rule 12 that specifies unconditional Disallowed,
> and it makes this section exactly match the recasting in Appendix A.1

This will be rewritten in -05.txt.

> Appendix A is unclear in several places. Before(cp) is not properly
> defined for cp the first character of a label. After(cp) is not
> defined at all; when added, it should address the case that cp is the
> last character of a label.

I hope I have managed to fix this.
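
To make the boundary cases concrete, here is a small Python sketch of
one possible reading. It is not the wording used in the draft: the
functions take a position in the label, and returning None at the
boundaries is an assumption made for this example.

```python
from typing import Optional


def before(label: str, pos: int) -> Optional[str]:
    """Codepoint immediately before position pos, or None if pos is first."""
    return label[pos - 1] if pos > 0 else None


def after(label: str, pos: int) -> Optional[str]:
    """Codepoint immediately after position pos, or None if pos is last."""
    return label[pos + 1] if pos + 1 < len(label) else None


label = "abc"
print(before(label, 0))  # undefined at the first position
print(after(label, 2))   # undefined at the last position
```

A rule that uses Before(cp) or After(cp) then has to say what happens
when the result is undefined, for example by treating the rule as not
matching.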

> The current definition of FirstChar appears flawed. Does cp .eq.
> FirstChar return true if cp is the third character but is the
> identical code point as the first character? Same comment for
> LastChar.

The text talks about "codepoint", and not "codepoint value", which for
me is clear. If you want other text, please send a suggestion. So for
me, if cp is the third character, it is different from what
FirstChar returns.
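
A short Python sketch of that reading, where the comparison is on the
position in the label and not on the codepoint value. The helper names
is_first_char and is_last_char are hypothetical and only serve this
example.

```python
def is_first_char(label: str, pos: int) -> bool:
    """True only for the codepoint at the first position of the label."""
    return pos == 0


def is_last_char(label: str, pos: int) -> bool:
    """True only for the codepoint at the last position of the label."""
    return pos == len(label) - 1


label = "aba"
# Position 2 holds the same value as position 0 ...
print(label[2] == label[0])
# ... but it is not the first character of the label.
print(is_first_char(label, 2))
```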

> The meaning of Lookup: true or false escapes me.

It says whether the rule should be tested at Lookup time or not.

I have now added some text that explains this.

> The expression in A.2 with constant U+002D should be redone in the
> style of A.11 with the variable cp.

Fixed.

> A.3 is unclear on what is intended with Script(cp) .eq. Arabic when cp
> is U+200C. On its face, the first term appears to be always false,
> since Script(cp) for cp = U+200C is Inherited. I suspect there is a
> missing For all Characters or somesuch. The Script(before(cp)) part is
> undefined if cp is the first character, but no term states the
> constraint that cp must not be the first character.

I removed the Script(cp) test. It does not make any sense.

> It would be helpful to combine identical rules A.6 and A.7 and to
> combine A.9 with A.10 and A.11 with A.12.

At this point in time, I do not want to do so, as the codepoints the
rules are valid for are not in sequence. And A.6 and A.7 are sort of
different rules, although the specs end up being the same.

> There is nothing in this draft that addresses known real-world
> phishing exploits and disallows them. That seems like a truly
> unfortunate oversight. Specifically, "paypal" spelled with one or more
> Cyrillic lookalike-a characters is allowed. Yet all the mechanism is
> in place to require U+0430, etc. only to be used in a Cyrillic script
> label.

See the other documents.

> Even better would be an inclusion-based  approach that only allows
> change of script at a hyphen. Legitimate domain owners could then
> prevent an entire class of phishing by not using hyphen in their
> actual labels, while domain owners who want foo-бар or Фу-bar can do
> so. Hyphen would be enough of a clue for some users in that something
> unusual might be going on, and would allow only
>  p-а-yp-а-l
> for use of two Cyrillic letters intermixed with Latin letters. This
> enforced simple rule could perhaps replace several of the current
> more-specialized context rules.
>
> The oversight suggests that this draft is just a collection of
> rules and not a serious effort to improve security on the web.
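
For readers who want to see what the proposed "change of script only at
a hyphen" check would amount to, here is a rough Python sketch. It is
an illustration of the quoted proposal, not something taken from the
drafts. Python's standard library has no Script property, so the script
is approximated by the first word of the Unicode character name; that
shortcut and both function names are assumptions made for this example.

```python
import unicodedata
from typing import Optional


def rough_script(ch: str) -> Optional[str]:
    """Approximate the script by the first word of the character name.

    Assumption: good enough to tell LATIN from CYRILLIC in this example;
    it is not the real Unicode Script property. Non-letters return None.
    """
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def script_changes_only_at_hyphen(label: str) -> bool:
    """True if every hyphen-separated segment uses at most one script."""
    for segment in label.split("-"):
        scripts = {rough_script(ch) for ch in segment} - {None}
        if len(scripts) > 1:
            return False
    return True


# "paypal" with a Cyrillic U+0430 in place of the first Latin "a".
print(script_changes_only_at_hyphen("p\u0430ypal"))
# The hyphenated form from the quoted text.
print(script_changes_only_at_hyphen("p-\u0430-yp-\u0430-l"))
```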

    Patrik