[RTW] Some useful thoughts about requirements for audio processing in browsers
Harald Alvestrand
harald at alvestrand.no
Mon Jan 17 11:04:54 CET 2011
Sharing some info from other W3C groups that are working on handling of
audio in browsers.
Harald
-------- Original Message --------
Subject: Feedback to the DAP group on the topic of audio/media capture
needed for HTML+Speech
Resent-Date: Sat, 15 Jan 2011 05:47:45 +0000
Resent-From: public-device-apis at w3.org
Date: Sat, 15 Jan 2011 04:54:56 +0000
From: Michael Bodell <mbodell at microsoft.com>
To: public-device-apis at w3.org <public-device-apis at w3.org>
CC: public-xg-htmlspeech at w3.org <public-xg-htmlspeech at w3.org>
On today's Hypertext Coordination Group Teleconference the issue of
"Audio on the Web" was discussed (see minutes:
http://www.w3.org/2011/01/14-hcg-minutes.html) and I was given the
action item of contacting the DAP group to provide feedback about audio
capture. We in the HTML Speech XG
(http://www.w3.org/2005/Incubator/htmlspeech/) have been discussing use
cases, requirements, and some proposals around speech-enabled HTML pages
and the need for audio to be captured and recognized in real time
(i.e., in a streaming fashion, not via file upload). We recognize
that there are interesting security and privacy concerns with
supporting this necessary functionality.
The HTML Speech XG has now finished requirements gathering and is in
the process of prioritizing those requirements. Our requirements
document is at
http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html.
A large number (almost half) of our requirements are of particular
note for the audio capture process. I've tried to pull out and
organize the requirements most relevant to DAP audio capture:
· Requirements about where the audio is streamed:
  o FPR12. Speech services that can be specified by web apps must
    include network speech services.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr12]
  o FPR32. Speech services that can be specified by web apps must
    include local speech services.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr32]
· Requirements about the audio streams and the fact that the audio
  needs to be streamed:
  o FPR33. There should be at least one mandatory-to-support codec that
    isn't encumbered with IP issues and has sufficient fidelity & low
    bandwidth requirements.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr33]
  o FPR25. Implementations should be allowed to start processing
    captured audio before the capture completes.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr25]
  o FPR26. The API to do recognition should not introduce unneeded
    latency.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr26]
  o FPR56. Web applications must be able to request NL interpretation
    based only on text input (no audio sent).
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr56]
· Requirements about what must be possible while streaming (i.e.,
  getting midstream events in a timely fashion without cutting off the
  stream; being able to decide to cut off the stream mid-request; being
  able to reuse the stream):
  o FPR40. Web applications must be able to use barge-in (interrupting
    audio and TTS output when the user starts speaking).
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr40]
  o FPR21. The web app should be notified that capture starts.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr21]
  o FPR22. The web app should be notified that speech is considered to
    have started for the purposes of recognition.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr22]
  o FPR23. The web app should be notified that speech is considered to
    have ended for the purposes of recognition.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr23]
  o FPR24. The web app should be notified when recognition results are
    available.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr24]
  o FPR57. Web applications must be able to request recognition based
    on previously sent audio.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr57]
  o FPR59. While capture is happening, there must be a way for the web
    application to abort the capture and recognition process.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr59]
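As one way to picture the streaming group above, here is a minimal sketch of an event-driven session that delivers midstream notifications (FPR21-24) and supports mid-capture abort (FPR59). All names here (SpeechSession, SessionEvent) are invented for illustration; they are not from any W3C proposal.

```typescript
// Hypothetical event vocabulary mirroring the XG requirements.
type SessionEvent =
  | { kind: "capturestart" }                           // FPR21: capture began
  | { kind: "speechstart" }                            // FPR22: speech detected
  | { kind: "speechend" }                              // FPR23: speech ended
  | { kind: "result"; text: string; final: boolean };  // FPR24: results ready

class SpeechSession {
  private listeners: Array<(e: SessionEvent) => void> = [];
  private aborted = false;

  on(listener: (e: SessionEvent) => void): void {
    this.listeners.push(listener);
  }

  // Deliver an event unless the app has already aborted the session.
  emit(e: SessionEvent): void {
    if (this.aborted) return;
    for (const l of this.listeners) l(e);
  }

  // FPR59: the web app can cut off capture and recognition midstream.
  abort(): void {
    this.aborted = true;
  }
}

// Simulated lifecycle: events arrive while audio is still streaming,
// so the app can react (e.g. barge-in, FPR40) before capture completes.
const session = new SpeechSession();
const log: string[] = [];
session.on((e) => log.push(e.kind));

session.emit({ kind: "capturestart" });
session.emit({ kind: "speechstart" });
session.emit({ kind: "result", text: "partial", final: false });
session.abort(); // app decides it has heard enough
session.emit({ kind: "result", text: "late", final: true }); // dropped

console.log(log.join(",")); // → capturestart,speechstart,result
```

The point of the sketch is only that events and abort must interleave with an ongoing stream, rather than arriving after a completed file upload.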
· Requirements around the UI/API/usability of speech/audio capture:
  o FPR42. It should be possible for user agents to allow hands-free
    speech input.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr42]
  o FPR54. Web apps should be able to customize all aspects of the user
    interface for speech recognition, except where such customizations
    conflict with security and privacy requirements in this document, or
    where they cause other security or privacy problems.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr54]
  o FPR13. It should be easy to assign recognition results to a single
    input field.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr13]
  o FPR14. It should not be required to fill an input field every time
    there is a recognition result.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr14]
  o FPR15. It should be possible to use recognition results in multiple
    input fields.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr15]
· Requirements around privacy and security concerns:
  o FPR16. User consent should be informed consent.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr16]
  o FPR20. The spec should not unnecessarily restrict the UA's choice
    in privacy policy.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr20]
  o FPR55. Web applications must be able to encrypt communications to a
    remote speech service.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr55]
  o FPR1. Web applications must not capture audio without the user's
    consent.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr1]
  o FPR17. While capture is happening, there must be an obvious way for
    the user to abort the capture and recognition process.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr17]
  o FPR18. It must be possible for the user to revoke consent.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr18]
  o FPR37. Web applications should be given access to captured audio
    only after explicit consent from the user.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr37]
  o FPR49. End users need a clear indication whenever the microphone is
    listening to the user.
    [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr49]
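The consent requirements above (FPR1, FPR18, FPR37) amount to a gate in front of the captured audio. A minimal sketch, with the class and method names invented purely for illustration:

```typescript
// Hypothetical consent gate: audio reaches the web application only
// while consent holds. Not an API from any W3C specification.
class ConsentGate {
  private granted = false;

  grant(): void { this.granted = true; }   // informed consent given (FPR16)
  revoke(): void { this.granted = false; } // user revokes consent (FPR18)

  // FPR37: deliver a captured chunk to the app only after explicit
  // consent; FPR1: without consent, the chunk is simply dropped.
  deliver(chunk: Float32Array, toApp: (c: Float32Array) => void): boolean {
    if (!this.granted) return false;
    toApp(chunk);
    return true;
  }
}

const gate = new ConsentGate();
const received: number[] = [];
const app = (c: Float32Array) => received.push(c.length);

gate.deliver(new Float32Array(160), app); // dropped: no consent yet
gate.grant();
gate.deliver(new Float32Array(160), app); // delivered
gate.revoke();
gate.deliver(new Float32Array(160), app); // dropped again after revocation

console.log(received.length); // → 1
```

Note that revocation takes effect mid-stream, which is the same property FPR17 demands of the user-facing abort control.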
We would be happy to discuss the details and context behind any of
these requirements, and we'd also appreciate any feedback on our use
cases and requirements. I'm sure many of these are requirements the DAP
group is already considering, but the speech use cases may well add
some requirements that have not yet been considered as part of the
capture work.
The HTML Speech XG is also in the process of collecting proposals for
our Speech API which we are planning to finish by the end of February.
In our discussions to date, we have reviewed and discussed some of the
DAP capture API as well as some of the work that has gone on around the
<device> tag proposals. (We reviewed and discussed at least
http://www.w3.org/TR/html-media-capture/ and
http://www.w3.org/TR/media-capture-api/, and Robin provided the
following links to more in-progress work on the HCG call:
http://dev.w3.org/2009/dap/camera/ and
http://dev.w3.org/2009/dap/camera/Overview-API.html.) In general, I'd
characterize our position this way: we would be extremely happy to
reuse the DAP work, and would be glad to work with you on proposals
that meet this need. The main issue in our review to date has been
streaming: the capture API is of little use to us if it doesn't
support streaming. Happily, from today's HCG call it sounds like DAP
is actively working on streaming; we strongly support that direction,
think it is extremely important, and will be interested to see any and
all work toward it.
I'm not sure what the most productive next steps are for us to take
(email discussion back and forth, some HTML Speech XG members joining
a DAP audio capture conference call, some DAP members joining a Speech
XG teleconference, or something else). In general, the HTML Speech XG
tries to do most of its work over the public email alias, and we also
have a schedule-as-needed Thursday teleconference slot of 90 minutes
starting at noon New York time.
Thanks, and we look forward to working on this important topic with you!
Michael Bodell (Microsoft)
Co-chair HTML Speech XG