Document: draft-ietf-speechsc-reqts-05
Reviewer: Spencer Dawkins
Date: March 13, 2004

"This draft is on the right track but has open issues, described in the
review."

Specifically, it looks like it's mostly been reviewed by people who are
familiar with the topic area. Some of my comments are more than nits, but
most are requests to provide more detail and justification for a more
general audience. 3.5 and 3.6 seem most likely to be problematic.

I'm also fascinated by the Acknowledgements section, but that's another
kettle of fish entirely.

It's a well-written and mostly well-explained requirements draft, that
still needs a little work.

Spencer

---------------------------------------

    Requirements for Distributed Control of ASR, SI/SV and TTS Resources
                       draft-ietf-speechsc-reqts-05

...

Abstract

This document outlines the needs and requirements for a protocol to control
distributed speech processing of audio streams. By speech processing, this
document specifically means automatic speech recognition (ASR), speaker
recognition - which includes both speaker identification (SI) and speaker
verification (SV) - and text-to-speech (TTS). Other IETF protocols, such as
SIP and RTSP, address rendezvous and control for generalized media streams.
However, speech processing presents additional requirements that none of
the extant IETF protocols address.

Spencer: OK, I didn't see this assertion explained in any greater detail in
either the Introduction or in the body of the specification. It seems
pretty important to SPEECHSC's direction - probably worth at least a couple
of sentences somewhere to make things explicit.

1. Introduction

There are multiple IETF protocols for establishment and termination of
media sessions (SIP [5]), low-level media control (MGCP [6] and MEGACO
[7]), and media record and playback (RTSP [8]). This document focuses on
requirements for one or more protocols to support the control of network
elements that perform Automated Speech Recognition (ASR), speaker
identification or verification (SI/SV), and rendering text into audio, also
known as Text-to-Speech (TTS). Many multimedia applications can benefit
from having automatic speech recognition (ASR) and text-to-speech (TTS)
processing available as a distributed, network resource. This requirements
document limits its focus to the distributed control of ASR, SI/SV and TTS
servers. There are a broad range of systems which can benefit from a
unified approach to control of TTS, ASR, and SI/SV. These include
environments such as VoIP gateways to the PSTN, IP Telephones, media
servers, and wireless mobile devices who obtain speech services via servers
on the network.

To date, there are a number of proprietary ASR and TTS API's, as well as
two IETF drafts that address this problem [12], [13]. However,

Spencer: huh? These two references are to RFCs on SLP and SRV - is this a
blown pointer, or am I just confused?

there are serious deficiencies to the existing drafts. In particular, they
mix the semantics of existing protocols yet are close enough to other
protocols as to be confusing to the implementer.

Spencer: I would probably want one or two specifics here, but can't tell
for sure because I don't think the draft references are correct.

This document sets forth requirements for protocols to support distributed
speech processing of audio streams.
For simplicity, and to remove confusion with existing protocol proposals,
this document presents the requirements as being for a "framework" that
addresses the distributed control of speech resources It refers to such a

Spencer: missing period after "resources" (no extra charge for proofing)

framework as "SPEECHSC", for Speech Services Control.

Discussion of this and related documents is on the speechsc mailing list.
To subscribe, send the message "subscribe speechsc" to
speechsc-request@ietf.org. The public archive is at
http://www.ietf.org/mail-archive/workinggroups/speechsc/current/maillist.html

Spencer: "working-groups" is hyphenated in the actual URL (per WG home page)

2. SPEECHSC Framework

Figure 1 below shows the SPEECHSC framework for speech processing.

                  +-------------+
                  | Application |
                  |   Server    |
                  +-------------+
                   /           \
   SIP, VoiceXML, /             \ SPEECHSC
        etc.     /               \
    +------------+    SPEECHSC    +-------------+
    |   Media    |----------------| ASR, SI/SV  |
    | Processing |                | and/or TTS  |
 RTP|   Entity   |      RTP       |   Server    |
====|            |================|             |
    +------------+                +-------------+

                Figure 1: SPEECHSC Framework

The "Media Processing Entity" is a network element that processes media.
It may be either a pure media handler, or also have an associated SIP user
agent, VoiceXML browser or other control entity. The "ASR, SI/SV and/or TTS
Server" is a network element which performs the back-end speech processing.
It may generate an RTP stream as output based on text input (TTS) or return
recognition results in response to an RTP stream as input (ASR, SI/SV). The
"Application Server" is a network element that instructs the Media
Processing Entity on what transformations to make to the media stream.
Those instructions may be established via a session protocol such as SIP,
or provided via a client/server exchange such as VoiceXML. The framework
allows either the Media Processing Entity or the Application Server to
control the ASR or TTS Server using SPEECHSC as a control protocol, which
accounts for the speechsc

Spencer: not sure why speechsc isn't capitalized in this sentence?

protocol appearing twice in the diagram.

Physical embodiments of the entities can reside in one physical instance
per entity, or some combination of entities. For example, a VoiceXML [10]
Gateway may combine the ASR and TTS functions on the same platform as the
Media Processing Entity. Note that VoiceXML Gateways themselves are outside
the scope of this protocol. Likewise, one can combine the Application
Server and Media Processing Entity, as would be the case in an interactive
voice response (IVR) platform.

One can also decompose the Media Processing Entity into an entity that
controls media endpoints and entities that process media directly. Such
would be the case with a decomposed gateway using MGCP or megaco. However,
this decomposition is again orthogonal to

Spencer: "Megaco" (capitalized)

the scope of SPEECHSC.

The following subsections provide a number of example use cases of the
SPEECHSC, one each for TTS, ASR and SI/SV. They are intended to be
illustrative only, and not to imply any restriction on the scope of the
framework or to limit the decomposition or configuration to that shown in
the example.

...

2.2 Automatic speech recognition example

This example illustrates a VXML-enabled media processing entity and
associated application server using the SPEECHSC framework to supply an
ASR-based user interface through an Interactive Voice Response (IVR)
system. The example scenario is shown below in figure 3.
The VXML-client corresponds to the "media processing entity", while the IVR
application server corresponds to the "application server" of the SPEECHSC
framework of figure 1.

                               +------------+
                               |    IVR     |
                              _|Application |
                        VXML_/ +------------+
        +-----------+   ___/
        |           |__/          +------------+
  PSTN  |   VoIP    |   SPEECHSC  |            |
========|  Gateway  |-------------|  SPEECHSC  |
 Trunk  |(VXML voice|             |    ASR     |
        |  browser) |=============|   Server   |
        +-----------+     RTP     +------------+

         Figure 3: Automatic speech recognition example

In this example, users call into the service in order to obtain stock
quotes. The VoIP gateway answers their PSTN call. An IVR

Spencer: "their calls"

application feeds VXML scripts to the gateway to drive the user
interaction. The VXML interpreter on the gateway directs the user's media
stream to the SPEECHSC ASR server and uses SPEECHSC to control the ASR
server. When, for example, the user speaks the name of a stock in response
to an IVR prompt, the SPEECHSC ASR server attempts recognition of the name,
and returns the results to the VXML gateway. The VXML gateway, following
standard VXML mechanisms, informs the IVR Application of the recognized
result. The IVR Application can then do the appropriate information lookup.
The answer, of course, can be sent back to the user using text-to-speech.
This example does not show this scenario, but it would work analogously to
the scenario shown in Section 2.1.

2.3 Speaker Identification example

This example illustrates using speaker identification to allow
voice-actuated login to an IP phone. The example scenario is shown below in
figure 4. In the figure, the IP Phone acts as both the "Media Processing
Entity" and the "Application Server" of the SPEECHSC framework in figure 1.

   +-----------+            +---------+
   |           |    RTP     |         |
   |    IP     |============| SPEECHSC|
   |   Phone   |            |   TTS   |
   |           |____________|  Server |
   |           |  SPEECHSC  |         |
   +-----------+            +---------+

        Figure 4: Speaker identification example

In this example, a user speaks into a SIP phone in order to get "logged in"
to that phone to make and receive phone calls using his identity and
preferences. The IP phone uses the SPEECHSC framework to set up an RTP
stream between the phone and the SPEECHSC SI/SV server and to request
verification. The SV server verifies the user's identity and returns the
result, including the necessary login credentials, to the phone via
SPEECHSC. The IP Phone may either use the identity directly to identify the
user in outgoing calls, to fetch the user's preferences from a
configuration server, request authorization from a AAA server, in any
combination.

Since this example uses SPEECHSC to perform a security-related function, be
sure to note the associated material in Section 9

Spencer: missing period here.

3. General Requirements

3.1 Reuse Existing Protocols

To the extent feasible, the SPEECHSC framework SHOULD use existing
protocols.

Spencer: is it possible to provide a list of candidate protocols to be
considered?

...

3.4 Efficiency

The SPEECHSC framework SHOULD employ protocol elements known to result in
efficient operation. Techniques to be considered include:

o  Re-use of transport connections across sessions
o  Piggybacking of responses on requests in the reverse direction
o  Caching of state across requests

Spencer: I'm stepping out on a limb here, but I'd like to see any
clarification of "efficient" that can be provided. Even a statement like
"efficient at the scale of a media gateway" would help me.
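For readers less familiar with the first two techniques above, here is a
minimal, purely illustrative sketch of connection re-use and pipelining.
The server name, port, and "SPEAK ..." request syntax are invented for this
example; SPEECHSC defines no such syntax.

   # Illustrative only: one TCP connection carries several logical
   # requests, and the second request is written before the first
   # response is read (pipelining).  Host, port, and request format
   # are hypothetical.
   import socket

   HOST, PORT = "speech.example.com", 5555

   def run() -> None:
       with socket.create_connection((HOST, PORT)) as conn:
           conn.sendall(b"SPEAK session=1 text=hello\r\n")
           conn.sendall(b"SPEAK session=2 text=world\r\n")  # pipelined
           print(conn.recv(4096).decode("utf-8", "replace"))

   if __name__ == "__main__":
       run()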
3.5 Invocation of services

The SPEECHSC framework MUST be compliant with the IAB OPES [3] framework.
The applicability of the SPEECHSC protocol will therefore be specified as
occurring between clients and servers at least one of which is operating
directly on behalf of the user requesting the service.

Spencer: I would love to see any explanation here. Why OPES and not some
other framework (working at a different protocol layer, for instance)? Is
this a checkoff item, or are there things SPEECHSC expects to use OPES for?
Are there specific things about OPES that are particularly important?

3.6 Location and Load Balancing

To the extent feasible, the SPEECHSC framework SHOULD exploit existing
schemes for supporting service location and load balancing, such as the
Service Location Protocol [12] or DNS SRV records [13]. Where such
facilities are not deemed adequate, the SPEECHSC framework MAY define
additional load balancing techniques.

Spencer: OK, I'm not the AD, but I'm kinda having a sense of "feature
creep" here - there's no indication that these capabilities would be unique
for SPEECHSC, so I'm not sure why SPEECHSC would have better luck defining
them than other WGs, and I'm not sure why TSV is the right place to define
better capabilities, either.
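As a concrete illustration of the DNS SRV option, the sketch below shows
how SRV priority and weight fields already carry the location and
load-balancing information discussed in 3.6. It assumes the third-party
dnspython package and a purely hypothetical "_speechsc._tcp" service label,
which the draft does not define.

   # Sketch only: locate candidate servers via DNS SRV records.
   # Requires the third-party dnspython package; the "_speechsc._tcp"
   # service label and example.com domain are hypothetical.
   import dns.resolver

   def locate_servers(domain: str = "example.com"):
       answers = dns.resolver.resolve(f"_speechsc._tcp.{domain}", "SRV")
       # Lower priority is preferred; weight load-balances within a
       # priority group.
       return sorted((r.priority, r.weight, str(r.target).rstrip("."), r.port)
                     for r in answers)

   if __name__ == "__main__":
       for priority, weight, host, port in locate_servers():
           print(f"{host}:{port}  priority={priority} weight={weight}")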
3.9 Users with disabilities

The SPEECHSC framework must have sufficient capabilities to address the
critical needs of people with disabilities. In particular, the set of
requirements set forth in RFC3351 [4] MUST be taken into account by the
framework. It is also important that implementers of SPEECHSC clients and
servers be cognizant that some interaction modalities of SPEECHSC may be
inconvenient, or simply inappropriate for disabled users. Hearing-impaired
individuals may find TTS of limited utility. Speech-impaired users may be
unable to make use of ASR or SI/SV capabilities. Therefore, systems
employing SPEECHSC MUST provide alternative interaction modes or avoid the
use of speech processing entirely.

Spencer: It might be worth mentioning that these alternative interaction
modes are likely lower-bandwidth and more appropriate for users at the end
of some slow ("wireless") connections as well...

...

4.2.2 SSML

Spencer: "Speech Synthesis Markup Language (SSML)" (had not been expanded
previously). And while we're on the subject, there's no justification given
for this requirement...

The SPEECHSC framework MUST support SSML[3] basics, and SHOULD support
other SSML tags. The framework assumes all TTS servers are capable of
reading SSML formatted text. Internationalization of TTS in the SPEECHSC
framework, including multi-lingual output within a single utterance, is
accomplished via SSML xml:lang tags.

4.2.3 Text in Control Channel

The Speechsc framework assumes all TTS servers accept text over the

Spencer: it's a nit, but capitalization isn't consistent with elsewhere

SPEECHSC connection for reading over the RTP connection. The framework
assumes the server can accept text either "by value" (embedded in the
protocol), or "by reference" (e.g. by de-referencing a URI embedded in the
protocol).

...

4.5 Playback Controls

The Speechsc framework MUST support "VCR controls" for controlling the
playout of streaming media output from SPEECHSC processing, and MUST allow
for servers with varying capabilities to accommodate such controls. The
protocol SHOULD allow clients to state what controls they wish to use, and
for servers to report which ones they honor. These capabilities include:

Spencer: good discussion here - is there a canonical list that can be
referred to, or is this the SPEECHSC take on a canonical list?

o  The ability to jump in time to the location of a specific marker.
o  The ability to jump in time, forwards or backwards, by a specified
   amount of time. Valid time units MUST include seconds, words,
   paragraphs, sentences, and markers.
o  The ability to increase and decrease playout speed.
o  The ability to fast-forward and fast-rewind the audio, where snippets of
   audio are played as the server moves forwards or backwards in time.
o  The ability to pause and resume playout.
o  The ability to increase and decrease playout volume.

These controls SHOULD be made easily available to users through the client
user interface and through per-user customization capabilities of the
client. This is particularly important for hearing-impaired users, who will
likely desire settings and control regimes different from those that would
be acceptable for non-impaired users.

4.6 Session Parameters

The SPEECHSC framework MUST support the specification of session
parameters, such as language, prosody and voicing.

Spencer: is there a reference you can give for "session parameters"?
What's in, what's not in, etc.

...

5.2 XML

Spencer: Is this a requirement for general support of XML, or a specific
requirement for VoiceXML? I am confused here...

The Speechsc framework assumes that all ASR servers support the VoiceXML
speech recognition grammar specification (SRGS) for speech recognition [2].

5.3.3 Grammar Sharing

The SPEECHSC framework SHOULD exploit sharing grammars across sessions for
servers which are capable of doing so. This supports applications with
large grammars for which it is unrealistic to dynamically load. An example
is a city-country grammar for a weather service.

Spencer: is there an associated security requirement here? Most of the
security considerations I saw were about isolating sessions...

...

7. Duplexing and Parallel Operation Requirements

One very important requirement for an interactive speech-driven system is
that user perception of the quality of the interaction depends strongly on
the ability of the user to interrupt a prompt or rendered TTS with speech.
Interrupting, or barging, the speech

Spencer: I think I understand what the actual requirement is, but it was a
struggle. Could you say "X is required because one..."?

output requires more than energy detection from the user's direction. Many
advanced systems halt the media towards the user by employing the ASR
engine to decide if an utterance is likely to be real speech, as opposed to
a cough, for example.
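To make the barge-in pattern described above more concrete, here is a
small, purely illustrative sketch: the prompt is halted only when the
recognizer reports that the incoming audio is likely real speech, not on
raw energy alone. The event format and the stop_playback hook are invented
for this example and are not part of any SPEECHSC definition.

   # Illustrative sketch of ASR-driven barge-in (not SPEECHSC syntax).
   from queue import Queue

   def barge_in_loop(asr_events: Queue, stop_playback) -> None:
       """Stop the prompt as soon as the recognizer reports likely speech."""
       while True:
           event = asr_events.get()
           if event["type"] == "speech-detected" and event["confidence"] > 0.5:
               stop_playback()          # halt media towards the user
               return
           if event["type"] == "end-of-prompt":
               return

   if __name__ == "__main__":
       q = Queue()
       q.put({"type": "speech-detected", "confidence": 0.2})   # e.g. a cough
       q.put({"type": "speech-detected", "confidence": 0.9})   # real speech
       barge_in_loop(q, lambda: print("prompt stopped"))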
...

8. Additional Considerations (non-normative)

The framework assumes that SDP will be used to describe media sessions and
streams. The framework further assumes RTP carriage of media, however since
SDP can be used to describe other media transport schemes (e.g. ATM) these
could be used if they provide the necessary elements (e.g. explicit
timestamps).

The working group will not be defining distributed speech recognition
methods (DSR), as exemplified by the ETSI Aurora project. The working group
will not be recreating functionality available in other protocols, such as
SIP or SDP.

TTS looks very much like playing back a file. Extending RTSP looks
promising for when one requires VCR controls or markers in the text

Spencer: Nit - "promising when"

to be spoken. When one does not require VCR controls, SIP in a framework
such as Network Announcements [11] works directly without modification.

ASR has an entirely different set of characteristics. For barge-in support,
ASR requires real-time return of intermediate results. Barring the
discovery of a good reuse model for an existing protocol, this will most
likely become the focus of SPEECHSC.

...

10. Acknowledgements

Eric Burger wrote the original draft of these requirements and has
continued to contribute actively throughout their development. He is a
co-author in all but formal authorship, and is instead acknowledged here as
it is preferable that working group co-chairs have non-conflicting roles
with respect to the progression of documents.

Spencer: Okay, I'm not sure how "wrote the original draft" maps to
"non-conflicting roles" just because you take the WG chair's name off a
draft. Maybe this is an internal WG issue, but it seems especially weird to
me if you think about someone appealing a WG consensus call to the WG chair
on text that the WG chair wrote in stealth mode? At the very least, this
seems like a question for MPOWR...

_______________________________________________
Gen-ART mailing list
Gen-ART@alvestrand.no
http://eikenes.alvestrand.no/mailman/listinfo/gen-art