IBM Wants You to Talk to Your Devices
Voice recognition technology is no longer science fiction. It's been a reality for decades, though it remains immature. Now voice technologies, spurred by standards like VoiceXML, are increasingly finding their way into the market with applications in telephony, PDAs and automobiles.
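With that in mind, IBM Corp., a pioneer in voice recognition, is pouring the resources of IBM Research -- the world's largest information technology research organization, with more than 3,000 scientists and engineers at eight labs in six countries -- into an eight-year project to revolutionize voice technologies. The company has assigned about 100 speech researchers to the task.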
Dubbed the "Super Human Speech Recognition Initiative," IBM's push aims to create new technology that supports what IBM Voice Systems
Director Nigel Beck calls "conversational computing."
Beck said that in the past 50 years, "it was always you learning how to use the machine." With conversational computing, Beck said,
the goal is "making the machine learn how to interact with the end user."
The Super Human Speech Recognition Initiative's ultimate goal is to create technology that performs better than humans for any
transcription task, without the need for customization. It seeks accurate transcription of everything from voice mail to meetings
and customer service calls -- with full audio (and possibly video) searching capabilities. Along the way, the company plans a number
of milestones that it expects will have wide-ranging applications in everything from data mining in call centers to interpersonal
communication to biometrics.
While IBM's decision to devote such resources to an idea that seems "out there" may be surprising at first, Big Blue thinks of it
more as an effort to capture a lucrative market opportunity. Market research firm the Kelsey Group has projected that worldwide
spending on voice recognition will reach $41 billion by 2005.
Voice drivers
Current solutions
Also, on March 26th, the company added WebSphere Translation Server 2.0 to its server-based offerings. Translation Server now
supports both Chinese-to-English and Japanese-to-English translations, meaning it now supports 16 language pairs. It delivers both
on-the-fly translation of static and dynamically generated Web pages and translation using a servlet or JSP. Additionally,
Translation Server can be integrated with Lotus Domino and Sametime servers via Lotus Translation Components and Lotus Translation
Services for Sametime. This allows users to engage in multilingual e-mail and chat. For instance, two co-workers -- one an
English speaker and the other a Japanese speaker -- could have their chat client dynamically translate messages. IBM is looking to
take this further with a prototype for PDAs that would allow a user to dictate into the PDA, which would translate the speech and
read it back in another language.
On the dictation end, IBM offers the consumer-grade ViaVoice software, available for Windows, Macintosh and Linux platforms. The
software has been on the market since 1996. It also offers WebSphere Voice Server for Transcription, which enables large vocabulary
dictation over a network from a variety of devices, including PCs, digital recorders and telephones. Big Blue's goal in this area is
to perfect the dictation technology to the point where it could be used in courtrooms, for medical transcription, to transcribe call
center calls, even in radio and broadcast journalism. This in turn would make multimedia data mining a possibility -- allowing rapid
searches of audiovisual data.
Finally, on the embedded solutions front, IBM offers Embedded ViaVoice, which focuses on voice-enabling mobile and pervasive
computing devices, from smart phones and PDAs to car dashboards.
Hurdles
"The state of the speech world is roughly where the state of the Web world was six years ago," Beck said.
To take voice technology to the next step, a number of large obstacles must be overcome, according to electrical engineer David
Nahamoo, manager of Human Language Technologies at IBM Research, responsible for setting IBM's worldwide speech recognition software
research strategy and leader of the Super Human Speech Recognition Initiative.
Nahamoo said noise, which can prevent a machine from interpreting speech, may be the most pressing problem. Another problem is
grammar and punctuation (which poses more trouble for transcription technologies). "There's no good model for spontaneous
conversation," Nahamoo said. Modeling accents is another problem.
The biggest obstacle of all, though, may be the need for human users to mold their speech interactions with computers in such a way
that the computers can understand them. "Today's technology for speech recognition asks the user to be cooperative," Nahamoo said.
IBM's milestones in the Super Human Speech Recognition Initiative are designed to overcome these hurdles on the way toward the grail
of automated transcription that outstrips the performance of human efforts. Currently, IBM Research data suggests that today's
speech recognition technology, depending on the task, is anywhere from a factor of three to a factor of 10 worse than human
performance.
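The article does not say which metric lies behind the "factor of three to a factor of 10" comparison. One common way to score a transcript is word error rate (WER), so the sketch below assumes WER purely for illustration. The function names and the sample error rates are hypothetical and do not come from IBM.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_gap(machine_wer: float, human_wer: float) -> float:
    """How many times worse the machine is than the human transcriber."""
    if human_wer <= 0:
        raise ValueError("human_wer must be positive")
    return machine_wer / human_wer


# Illustrative numbers only (assumed, not IBM data).
print(word_error_rate("a b c d", "a x c"))  # one substitution, one deletion
print(round(relative_gap(0.30, 0.10), 2))
```

Under this assumed metric, a machine at 30 percent WER against a human at 10 percent would sit at the "factor of three" end of the range Nahamoo's group describes.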
Combating noise is one of the first milestones. To do it, Big Blue is expanding into the sphere of audiovisual speech recognition,
which uses computer vision technology to "read lips." Nahamoo said that by using a camera to pick up the movements of the lips and
jaw, the technology should improve speech recognition by about 10 dB. As an added bonus, audiovisual speech recognition
should allow the software to determine when a user is actively attempting to use it.
"With a camera, you can make the environment attentive," Nahamoo said, explaining that this would allow the technology to know when
it should respond to the user's speech and when it shouldn't.
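For readers unfamiliar with decibels, the "about 10 dB" figure can be put in linear terms. The article does not specify what quantity the gain is measured against; the sketch below assumes it behaves like a power ratio such as signal-to-noise ratio, which is an assumption made here for illustration. The helper names are hypothetical.

```python
import math


def db_to_power_ratio(db: float) -> float:
    """Convert a decibel value to a linear power ratio."""
    return 10 ** (db / 10)


def power_ratio_to_db(ratio: float) -> float:
    """Convert a linear power ratio back to decibels."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 10 * math.log10(ratio)


# Assumption: the 10 dB gain is treated as a power ratio.
print(db_to_power_ratio(10))   # 10.0, i.e. a tenfold ratio
print(power_ratio_to_db(10))   # 10.0
```

If the assumption holds, a 10 dB gain is equivalent to the recognizer tolerating roughly ten times more noise power for the same result.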
Multimodal access
To that end, working with partners Motorola and Opera, IBM has submitted a specification for Multimodal Access to the World Wide Web
Consortium (W3C). The specification, XHTML+VoiceXML, would allow users to access data on devices through multiple modes of
interaction.
"Multimodal is the mixing of voice and data," Beck said. "People operate in multiple modes at once."
The technology allows users to use multiple input and output methods simultaneously, including stylus, touch screen, keypad,
keyboard and voice. For instance, the technology would allow a user to request stock information from a wireless PDA by voice, and
receive the information as a chart.
The company has also put together a number of prototypes to display its ideas.
One, Meeting Miner, is an agent used during meetings to passively capture and analyze meeting discussion. It also has the capability
of becoming an active participant in the meeting when it finds information it determines to be pertinent to the discussion. Meeting
Miner uses the audio streams from one or more microphones to capture the speech during the meeting and converts it to a text
transcript.
Another prototype has been dubbed ePIM, and is intended to give users access to their Personal Information Management tools and data
through unified, anytime/anywhere access. The ePIM prototype provides voice mail for the Lotus Notes inbox; notification of e-mail,
voice mail and calendaring to a cell phone or pager; voice-enabled natural language interface to Notes messages and calendar through
a phone call; and WAP/HDML access to Notes inbox, calendar and address book.
Finally, to show off voice technology's utility in security applications, IBM has created a speech biometrics prototype that uses
voice print matching together with knowledge-based verification via a conversational interface to determine the identity of a user or
to authenticate a claimed identity.
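The article gives no detail on how the prototype combines its two checks. As a rough sketch only, the decision could be modeled as requiring both a sufficiently strong voice print match and enough correct answers to knowledge questions. Every name, threshold and score scale below is a hypothetical chosen for illustration, not IBM's design.

```python
from dataclasses import dataclass


@dataclass
class VerificationResult:
    accepted: bool
    reason: str


def verify_identity(
    voice_score: float,          # hypothetical match score in [0, 1]
    answers_correct: int,        # knowledge questions answered correctly
    answers_asked: int,          # knowledge questions asked
    voice_threshold: float = 0.8,        # assumed cutoff, not from the article
    min_correct_fraction: float = 1.0,   # assumed: all answers must be right
) -> VerificationResult:
    """Accept a claimed identity only if both checks pass."""
    if answers_asked <= 0:
        raise ValueError("at least one knowledge question is required")
    if voice_score < voice_threshold:
        return VerificationResult(False, "voice print did not match")
    if answers_correct / answers_asked < min_correct_fraction:
        return VerificationResult(False, "knowledge check failed")
    return VerificationResult(True, "voice print and knowledge check passed")


print(verify_identity(0.92, 2, 2))
print(verify_identity(0.55, 2, 2))
```

Requiring both factors is the strictest way to combine them; a real system might instead weigh the two scores or ask follow-up questions when the voice match is borderline.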
, a pioneer in voice recognition, is pouring the resources of IBM research -- the
world's largest information technology research organization, with more than 3,000 scientists and engineers at eight labs in six
countries -- into an eight-year project to revolutionize voice technologies. The company has assigned about 100 speech researchers to the task.
IBM said it sees a number of key forces that will drive that growth:
Currently, IBM has a number of offerings, based on VoiceXML and Java, which tap these areas. The solution it crafted for T. Rowe
Price utilizes WebSphere Voice Server with Natural Language Understanding (NLU). Voice Server allows companies to voice-enable their
Web sites and intranets, databases and business applications.
But while these things are already happening, the technology must improve before it can gain widespread acceptance.
Largely, however, Beck said IBM's first steps in the Super Human Speech Recognition Initiative will be establishing standards that
will ensure that devices utilizing voice are interoperable and will run in heterogeneous environments.