Katakana Middle Dot again (Was: tables-06b.txt: A.5, A.6, A.9)

Sun Jul 26 13:13:16 CEST 2009

The following is a rule I proposed a few months ago in
<http://www.alvestrand.no/pipermail/idna-update/2009-April/004355.html>.

  Rule Set:
      False;
      For All Characters:
        If Script(cp) .eq. ( Han | Hiragana | Katakana ) Then True;
        If cp .in. U+3005..U+3007 Then True;
      End For;

This is quite similar to Mark's rule in
<http://www.alvestrand.no/pipermail/idna-update/2009-July/005000.html>.
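
For illustration only, here is a rough Python sketch of the rule set above
and of the stricter variant Patrik spells out in the quoted message below.
It rests on several assumptions that are not part of either proposal:
Python's standard library does not expose the Unicode Script property, so
is_cjk() approximates Han/Hiragana/Katakana with a few hard-coded block
ranges; the function names are hypothetical; the first rule is read as
"at least one character in the label matches"; and the second rule is
applied to every katakana middle dot in the label, with the middle dot
itself exempt from the per-character check.

```python
# Approximate stand-in for Script(cp) .in. (Han|Hiragana|Katakana).
# Assumption: only the main blocks are listed; this is not the full
# Unicode Script property.
CJK_RANGES = (
    (0x3041, 0x3096),  # Hiragana letters
    (0x30A1, 0x30FA),  # Katakana letters
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
)
EXTRA_RANGE = (0x3005, 0x3007)  # U+3005..U+3007 from the proposal
MIDDLE_DOT = "\u30fb"           # KATAKANA MIDDLE DOT


def is_cjk(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def is_extra(ch):
    return EXTRA_RANGE[0] <= ord(ch) <= EXTRA_RANGE[1]


def is_ldh(ch):
    # U+002D, U+0030..U+0039, U+0061..U+007A
    return ch == "-" or "0" <= ch <= "9" or "a" <= ch <= "z"


def loose_rule(label):
    """False by default; True if any character is Han/Hiragana/Katakana
    or falls in U+3005..U+3007."""
    return any(is_cjk(ch) or is_extra(ch) for ch in label)


def strict_rule(label):
    """True by default; False if a middle dot is not preceded by a
    Han/Hiragana/Katakana character, or if any other character is
    outside Han/Hiragana/Katakana, LDH, and U+3005..U+3007."""
    for i, ch in enumerate(label):
        if ch == MIDDLE_DOT:
            if i == 0 or not is_cjk(label[i - 1]):
                return False
        elif not (is_cjk(ch) or is_ldh(ch) or is_extra(ch)):
            return False
    return True
```

Under these assumptions, an all-ASCII label containing the middle dot fails
both checks, while a label that mixes ASCII and Han passes the loose rule
and passes the strict one only if the middle dot directly follows a
Han/Hiragana/Katakana character.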

-- 
Yoshiro YONEYA <yone at jprs.co.jp>

On Sat, 25 Jul 2009 14:54:21 +0200 Patrik Fältström <patrik at frobbit.se> wrote:

> On 25 jul 2009, at 14.34, Wil Tan wrote:
> 
> > I accidentally left out the U+3005..U+3007 that Yoneya-san proposed.
> > Therefore, #3 should be:
> >
> >  3. That the label contains only
> > (Han|Hiragana|Katakana|LDH|U+3005..U+3007) + katakana middle dot.
> >
> > It's important to note that having these constraints would rule out:
> 
> What you say is that you want the following rules:
> 
>    True;
>    if .not. Script(BeforeChar(cp)) .in. (Han|Hiragana|Katakana) then False;
>    For each cp:
>      if .not. (Script(cp) .in. (Han|Hiragana|Katakana) .or.
>          cp .in. {U+002D,U+0030..U+0039,U+0061..U+007A,U+3005..U+3007}) then False;
> 
>     Patrik
> 
> > a. planting the katakana middle dot in an all-ASCII label (a
> > legitimate use-case, but one that Yoneya-san was willing to live
> > without)
> >
> > b. katakana middle dot used to concatenate two strings, the first of
> > which is all-ASCII (arguably more common than above, so disallowing
> > this may well be unnecessarily restrictive, but is meant to mitigate
> > phishing concerns)
> >
> > c. katakana middle dot at the beginning of a label (I don't know how
> > common this is.)
> >
> > I don't pretend to know this well, so it'd be great if Yoneya-san and
> > others who are familiar with this could weigh in.
> >
> >
> >>> However, it makes the rule considerably more complex and
> >>> because of this I was thinking more of leaving this to the
> >>> application, which may have more contextual information (such
> >>> as user's locale, the TLD, etc.) to take appropriate steps to
> >>> protect the user.
> >>
> >> It is a question, again, of how to draw the line.  In some
> >> sense, the way that IDNA works makes "in the IDNA protocol" and
> >> "in the application" different versions of the same thing -- all
> >> of IDNA occurs "in the application" although API design, etc.,
> >> may change perceptions of that.  The argument about how
> >> normative mapping should be has its mirror image here.
> >> Personally, I can live with "general prohibition on
> >> registration; recommend that lookup-side applications be
> >> extra-careful with this stuff".  That is more or less what
> >> CONTEXTO is about and would be consistent with what I think you
> >> are suggesting above (the text in Protocol may need tuning;
> >> suggested text welcome).  Somewhat more would also work for
> >> me if we can fairly clearly justify it.
> >>
> >> If applications start drawing lines differently on what should
> >> have been registered, we get the most inconsistent behavior
> >> possible.  That is why Protocol contains language requiring that
> >> anything that is Lookup-Valid must be looked up, even if one
> >> decides to warn the user first.
> >>
> >
> > Thanks, this is for me a useful answer to the meta question of "where
> > to draw the line". Still, it is a difficult balancing act to juggle
> > between having rules that are simple enough to implement and yet tight
> > enough to prevent confusion with a view to allow legitimate usage.
> >
> > To make sense of the recent discussions of contextual rules, I'm
> > trying to picture the "guidelines" for deciding how to craft the
> > rules. Is the following reasonable?
> >
> > 1. Overview: describe the expected contexts in which the subject
> > character is allowed to appear.
> > 2. Rule set: if the contexts are simple and narrow enough to capture,
> > they can be expressed in the pseudocode. For other more complex ones,
> > a simple check may suffice leaving the registry to place additional
> > constraints on the subject character. It may not have to capture
> > everything in the overview.
> >
> > This seems to be consistent with recommendations given elsewhere by
> > Mark and others.
> >
> > Applying those "guidelines" to the katakana middle dot, the overview
> > would be:
> >
> >  This character is used in Japanese orthography to concatenate strings
> >  containing characters in the Hiragana, Katakana and Han scripts,
> >  or in any of the following sets: [a-z0-9\-], U+3005..U+3007.
> >
> > and the rule set would probably just be "True", because the rule won't
> > be simple, and certainly will not be narrow so trying to capture it in
> > pseudocode would be pointless. However, having the rule is better than
> > not having any at all because it does flag the character to the
> > registry that it needs policy around it, as well as tell the
> > application to be extra careful about it.
> >
> >> Be careful, however, about any assumptions of actions based on
> >> TLDs.  The presence of DNAME and the lack of any "give me back
> >> the canonical/primary tree" function in the DNS makes that one
> >> very fragile even if there were no other issues.
> >>
> >
> > Point taken.
> >
> > Thanks.
> >
> > =wil
> >
>