[RTW] Some useful thoughts about requirements for audio processing in browsers

Harald Alvestrand harald at alvestrand.no
Mon Jan 17 11:04:54 CET 2011


Sharing some information from other W3C groups that are working on the 
handling of audio in browsers.

                     Harald

-------- Original Message --------
Subject: 	Feedback to the DAP group on the topic of audio/media capture 
needed for HTML+Speech
Resent-Date: 	Sat, 15 Jan 2011 05:47:45 +0000
Resent-From: 	public-device-apis at w3.org
Date: 	Sat, 15 Jan 2011 04:54:56 +0000
From: 	Michael Bodell <mbodell at microsoft.com>
To: 	public-device-apis at w3.org <public-device-apis at w3.org>
CC: 	public-xg-htmlspeech at w3.org <public-xg-htmlspeech at w3.org>



On today's Hypertext Coordination Group teleconference the issue of 
"Audio on the Web" was discussed (see minutes: 
http://www.w3.org/2011/01/14-hcg-minutes.html), and I was given the 
action item of contacting the DAP group to provide feedback about audio 
capture.  In the HTML Speech XG 
(http://www.w3.org/2005/Incubator/htmlspeech/) we have been discussing 
use cases, requirements, and some proposals around speech-enabled HTML 
pages and the need for audio to be captured and recognized in real time 
(i.e., in a streaming fashion, not via file upload).  We recognize that 
supporting this necessary functionality raises interesting security and 
privacy concerns.
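
To make the streaming vs. file-upload distinction concrete, here is a 
minimal sketch (in TypeScript, with hypothetical function and parameter 
names invented purely for illustration; this is not part of any 
proposal): in the upload model recognition cannot begin until capture 
has finished, while in the streaming model each chunk of audio reaches 
the recognizer as soon as it is captured.

  // Hypothetical illustration only; these names are invented for this
  // email and are not part of any DAP or HTML Speech XG proposal.

  // File-upload model: recognition can only start once capture is done.
  async function recognizeAfterUpload(
      stopCapture: () => Promise<Blob>,
      recognize: (audio: Blob) => Promise<string>): Promise<string> {
    const completeRecording = await stopCapture();  // wait for the whole clip
    return recognize(completeRecording);            // work starts only now
  }

  // Streaming model: chunks are forwarded while the user is still
  // speaking, so recognition can proceed with low latency.
  async function recognizeWhileCapturing(
      chunks: AsyncIterable<ArrayBuffer>,
      sendChunk: (chunk: ArrayBuffer) => void,
      finish: () => Promise<string>): Promise<string> {
    for await (const chunk of chunks) {
      sendChunk(chunk);                             // incremental processing
    }
    return finish();                                // final result after end of speech
  }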

The HTML Speech XG has now finished gathering requirements and is in 
the process of prioritizing them.  Our requirements document is at 
http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html.  
A large number of our requirements (almost half) are of particular 
relevance to audio capture.  I've tried to pull out and organize the 
requirements most relevant to the DAP audio capture work:

· Requirements about where the audio is streamed to:

o FPR12. Speech services that can be specified by web apps must include 
network speech services. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr12]

o FPR32. Speech services that can be specified by web apps must include 
local speech services. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr32]

· Requirements about the audio streams and the fact that the audio needs 
to be streamed:

o FPR33. There should be at least one mandatory-to-support codec that 
isn't encumbered with IP issues and has sufficient fidelity & low 
bandwidth requirements. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr33]

o FPR25. Implementations should be allowed to start processing captured 
audio before the capture completes. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr25]

o FPR26. The API to do recognition should not introduce unneeded latency. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr26]

o FPR56. Web applications must be able to request NL interpretation based 
only on text input (no audio sent). 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr56]

· Requirements about what must be possible while streaming (i.e., getting 
mid-stream events in a timely fashion without cutting off the stream, 
being able to decide to cut off the stream mid-request, and being able to 
reuse the stream; a hypothetical sketch of this event flow follows the 
full list below):

o FPR40. Web applications must be able to use barge-in (interrupting 
audio and TTS output when the user starts speaking). 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr40]

o FPR21. The web app should be notified that capture starts. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr21]

o FPR22. The web app should be notified that speech is considered to have 
started for the purposes of recognition. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr22]

o FPR23. The web app should be notified that speech is considered to have 
ended for the purposes of recognition. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr23]

o FPR24. The web app should be notified when recognition results are 
available. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr24]

o FPR57. Web applications must be able to request recognition based on 
previously sent audio. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr57]

o FPR59. While capture is happening, there must be a way for the web 
application to abort the capture and recognition process. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr59]

· Requirements around the UI/API/usability of speech/audio capture:

o FPR42. It should be possible for user agents to allow hands-free speech 
input. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr42]

o FPR54. Web apps should be able to customize all aspects of the user 
interface for speech recognition, except where such customizations 
conflict with security and privacy requirements in this document, or 
where they cause other security or privacy problems. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr54]

o FPR13. It should be easy to assign recognition results to a single 
input field. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr13]

o FPR14. It should not be required to fill an input field every time 
there is a recognition result. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr14]

o FPR15. It should be possible to use recognition results to fill multiple 
input fields. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr15]

· Requirements around privacy and security concerns:

o FPR16. User consent should be informed consent. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr16]

o FPR20. The spec should not unnecessarily restrict the UA's choice in 
privacy policy. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr20]

o FPR55. Web applications must be able to encrypt communications to a 
remote speech service. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr55]

o FPR1. Web applications must not capture audio without the user's 
consent. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr1]

o FPR17. While capture is happening, there must be an obvious way for the 
user to abort the capture and recognition process. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr17]

o FPR18. It must be possible for the user to revoke consent. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr18]

o FPR37. Web applications should be given access to captured audio only 
after explicit consent from the user. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr37]

o FPR49. End users need a clear indication whenever the microphone is 
listening to the user. 
[http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr49]
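
As promised above, here is a minimal sketch of the event flow that the 
streaming-related notification requirements (FPR21-24, FPR40, and 
FPR59/FPR17) seem to imply.  All interface, event, and function names 
below are hypothetical and invented only to illustrate the requirements; 
this is not a proposal from either group.

  // Hypothetical event-driven session object; names are illustrative only.
  interface RecognitionSession {
    addEventListener(
        type: "capturestart" | "speechstart" | "speechend" | "result",
        listener: (event: { results?: string[] }) => void): void;
    abort(): void;  // FPR59 / FPR17: capture can be stopped mid-stream
  }

  function wireUpSession(session: RecognitionSession,
                         stopTtsPrompt: () => void) {
    session.addEventListener("capturestart", () => {
      console.log("microphone capture has started");      // FPR21
    });
    session.addEventListener("speechstart", () => {
      stopTtsPrompt();          // FPR22 notification, used for FPR40 barge-in
    });
    session.addEventListener("speechend", () => {
      console.log("end of speech detected");               // FPR23
    });
    session.addEventListener("result", (event) => {
      console.log("recognition results:", event.results);  // FPR24
    });
  }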

We would be happy to discuss the details and context behind any of these 
requirements, and we would also appreciate any feedback on our use cases 
and requirements.  I'm sure many of these are requirements the DAP group 
is already considering, but the speech use cases may well add some 
requirements that have not yet been considered as part of the capture 
work.

The HTML Speech XG is also in the process of collecting proposals for 
our Speech API, which we are planning to finish by the end of February. 
In our discussions to date we have reviewed some of the DAP capture API 
as well as some of the work that has gone on around the <device> tag 
proposals (we reviewed and discussed at least 
http://www.w3.org/TR/html-media-capture/ and 
http://www.w3.org/TR/media-capture-api/, and Robin provided the 
following links to more in-progress work on the HCG call: 
http://dev.w3.org/2009/dap/camera/ and 
http://dev.w3.org/2009/dap/camera/Overview-API.html).  In general I'd 
characterize our discussions this way: we would be extremely happy if we 
could reuse the DAP work, and would be glad to work with you on 
proposals that meet this need.  To date the largest issue in our review 
has been streaming: the capture API is of little use to us if it doesn't 
support streaming.  Happily, from today's HCG call it sounds like DAP is 
actively working on streaming, so we strongly support that direction, 
think it is extremely important, and will be interested to see any and 
all work toward it.
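
To illustrate why streaming matters so much to us, here is a rough 
sketch of how captured audio might be forwarded to a network speech 
service while capture is still in progress (FPR12, FPR25) over an 
encrypted connection (FPR55).  It uses the standard WebSocket API, but 
the endpoint URL, the chunk source, and the end-of-stream marker are 
hypothetical; this is only an illustration, not a proposed design.

  // Sketch only: stream audio chunks to a hypothetical remote speech service.
  function streamToSpeechService(
      chunks: AsyncIterable<ArrayBuffer>,
      onResult: (text: string) => void): () => void {
    // wss:// gives an encrypted transport (FPR55); the URL is hypothetical.
    const socket = new WebSocket("wss://speech.example.org/recognize");
    socket.binaryType = "arraybuffer";

    socket.onopen = async () => {
      for await (const chunk of chunks) {
        socket.send(chunk);           // forward audio as it is captured (FPR25)
      }
      socket.send("end-of-audio");    // hypothetical end-of-stream marker
    };

    socket.onmessage = (event) => {
      onResult(String(event.data));   // partial or final recognition results
    };

    return () => socket.close();      // lets the web app abort mid-stream (FPR59)
  }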

I'm not sure what the most productive next steps would be (email 
discussion back and forth, some HTML Speech XG members joining a DAP 
audio capture conference call, some DAP members joining a Speech XG 
teleconference, or something else).  In general, the HTML Speech XG 
tries to do most of its work over the public email alias, and we also 
have a schedule-as-needed 90-minute Thursday teleconference starting at 
noon New York time.

Thanks, and we look forward to working on this important topic with you!

Michael Bodell (Microsoft)

Co-chair HTML Speech XG
