IBM Wants You to Talk to Your Devices

Voice recognition technology is no longer science fiction. It’s been a reality for decades, though it remains immature. Now voice technologies, spurred by standards like VoiceXML, are increasingly finding their way into the market with applications in telephony, PDAs and automobiles.

With that in mind, IBM Corp., a pioneer in voice recognition, is pouring the resources of IBM Research (the
world’s largest information technology research organization, with more than 3,000 scientists and engineers at eight labs in six
countries) into an eight-year project to revolutionize voice technologies. The company has assigned about 100 speech researchers to the task.

Dubbed the “Super Human Speech Recognition Initiative,” IBM’s push aims to create new technology that supports what IBM Voice Systems
Director Nigel Beck calls “conversational computing.”

Beck said that in the past 50 years, “it was always you learning how to use the machine.” With conversational computing, Beck said
the goal is “making the machine learn how to interact with the end user.”

The Super Human Speech Recognition Initiative’s ultimate goal is to create technology that performs better than humans for any
transcription task, without the need for customization. It seeks accurate transcription of everything from voice mail to meetings
and customer service calls, with full audio (and possibly video) searching capabilities. Along the way, the company plans a number
of milestones that it expects will have wide-ranging applications in everything from data mining in call centers to interpersonal
communication to biometrics.

While IBM’s decision to devote such resources to an idea that seems “out there” may seem surprising at first, Big Blue thinks of it
more as an effort to capture a lucrative market opportunity. Market research firm the Kelsey Group has projected that worldwide spending
on voice recognition will reach $41 billion by 2005.

Voice drivers
IBM said it sees a number of key forces that will drive that growth:

  • Voice can be used to improve services from customer call centers while reducing costs. By utilizing voice recognition, companies
    can automate customer service over the phone, without subjecting customers to hold times or older systems that require people to
    respond to rigidly structured menus. Such automation can dramatically reduce expenses; a typical customer service call costs $5 to
    $10 to support while an automated voice recognition system can lower that to 10 cents to 30 cents per call. Such systems are already
    coming online with current technology. Silicon Valley-based start-up TuVox Inc. has automated the after-hours technical support lines for
    both Handspring and Activision, while IBM itself has created a system for investment management firm T. Rowe Price that allows
    customers to access and manage their accounts through natural conversations.
  • The use of telematics, or the combination of computers and wireless telecommunications with motor vehicles. Telematics can provide
    customized services like driving directions, emergency roadside assistance, personalized news, sports and weather information, and
    access to e-mail and other productivity tools. The Kelsey Group predicts U.S. and European spending on telematics alone will exceed
    $6.4 billion by 2006. IBM is providing speech software to automotive supplier Johnson Controls, which has created a voice-enabled
    mobile communications system for the Chrysler Group. The system consists of a receiver module behind the dashboard, an embedded
    microphone in the rearview mirror, and the driver’s own mobile phone, which synchronizes with the receiver module via Bluetooth
    technology in the car’s audio system. When a call is placed, audio is suspended and the call comes through the car’s speakers. IBM’s
    software allows drivers to use spoken commands in English, French or Spanish to place calls or access the system’s audio address
  • Businesses that want to voice-enable the Internet and their IT establishments to provide information to consumers through “voice
    portals” or allow employees to access corporate databases through spoken commands over the phone.
  • The ability to squeeze speech recognition into smaller and smaller devices like phones, PDAs and other mobile devices.
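
The call center figures in the first driver above lend themselves to a quick back-of-the-envelope check. The following is a minimal sketch in Python that uses only the per-call cost ranges quoted in the article; the function and variable names are illustrative and do not come from IBM or the Kelsey Group, and the calculation ignores deployment and maintenance costs, which the article does not quantify.

```python
def savings_per_call(live_cost, automated_cost):
    """Return (dollars saved, fraction saved) for a single call."""
    saved = live_cost - automated_cost
    return saved, saved / live_cost


# Per-call cost ranges quoted in the article, in US dollars.
LIVE_RANGE = (5.00, 10.00)       # typical agent-handled call
AUTOMATED_RANGE = (0.10, 0.30)   # automated voice recognition system

# Least favorable pairing: cheapest live call, costliest automated call.
worst = savings_per_call(LIVE_RANGE[0], AUTOMATED_RANGE[1])
# Most favorable pairing: costliest live call, cheapest automated call.
best = savings_per_call(LIVE_RANGE[1], AUTOMATED_RANGE[0])

print(f"Savings per call: ${worst[0]:.2f} to ${best[0]:.2f}")
print(f"Relative savings: {worst[1]:.0%} to {best[1]:.0%}")
```

On the quoted figures, automation would save between $4.70 and $9.90 per call, or roughly 94 to 99 percent of the cost of an agent-handled call.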

Current solutions
Currently, IBM has a number of offerings, based on VoiceXML and Java, which tap these areas. The solution it crafted for T. Rowe
Price utilizes WebSphere Voice Server with Natural Language Understanding (NLU). Voice Server allows companies to voice-enable their
Web sites and intranets, databases and business applications.

Also, on March 26th, the company added WebSphere Translation Server 2.0 to its server-based offerings. Translation Server now
supports both Chinese-to-English and Japanese-to-English translations, meaning it now supports 16 language pairs. It delivers both
on-the-fly translation of static and dynamically generated Web pages and translation using a servlet or JSP. Additionally,
Translation Server can be integrated with Lotus Domino and Sametime servers via Lotus Translation Components and Lotus Translation
Services for Sametime. This allows users to engage in multilingual e-mail and chat. For instance, two co-workers — one an
English-speaker and the other a Japanese-speaker — could have their chat client dynamically translate messages. IBM is looking to
take this further with a prototype for PDAs that would allow a user to dictate into the PDA, which would translate the speech and
read it back in another language.

On the dictation end, IBM offers the consumer-grade ViaVoice software, available for Windows, Macintosh and Linux platforms. The
software has been on the market since 1996. It also offers WebSphere Voice Server for Transcription, which enables large vocabulary
dictation over a network from a variety of devices, including PCs, digital recorders and telephones. Big Blue’s goal in this area is
to perfect the dictation technology to the point where it could be used in courtrooms, for medical transcription, to transcribe call
center calls, even in radio and broadcast journalism. This in turn would make multimedia data mining a possibility — allowing rapid
searches of audiovisual data.

Finally, on the embedded solutions front, IBM offers Embedded ViaVoice, which focuses on voice-enabling mobile and pervasive
computing devices, from smart phones and PDAs to car dashboards.

But while these things are already happening, the technology must improve before it can gain widespread acceptance.

“The state of the speech world is roughly where the state of the Web world was six years ago,” Beck said.

To take voice technology to the next step, a number of large obstacles must be overcome, according to electrical engineer David
Nahamoo, manager of Human Language Technologies at IBM Research, responsible for setting IBM’s worldwide speech recognition software
research strategy and leader of the Super Human Speech Recognition Initiative.

Nahamoo said noise, which can prevent a machine from interpreting speech, may be the most pressing problem. Another problem is
grammar and punctuation (which poses more trouble for transcription technologies). “There’s no good model for spontaneous
conversation,” Nahamoo said. Modeling accents is another problem.

The biggest obstacle of all, though, may be the need for human users to mold their speech interactions with computers in such a way
that the computers can understand them. “Today’s technology for speech recognition asks the user to be cooperative,” Nahamoo said.

IBM’s milestones in the Super Human Speech Recognition Initiative are designed to overcome these hurdles on the way toward the grail
of automated transcription that outstrips the performance of human efforts. Currently, IBM Research data suggests that today’s
speech recognition technology, depending on the task, is anywhere from a factor of three to a factor of 10 worse than human

Combating noise is one of the first milestones. To do it, Big Blue is expanding into the sphere of audiovisual speech recognition,
which uses computer vision technology to “read lips.” Nahamoo said that by using a camera to pick up the movements of the lips and
jaw, the technology should improve speech recognition by about 10 dB. As an added bonus, audiovisual speech recognition
should allow the software to determine when a user is actively attempting to utilize it.

“With a camera, you can make the environment attentive,” Nahamoo said, explaining that this would allow the technology to know when
it should respond to the user’s speech and when it shouldn’t.

Multimodal access
Largely, however, Beck said IBM’s first steps in the Super Human Speech Recognition Initiative will be establishing standards that
will ensure that devices utilizing voice are interoperable and will run in heterogeneous environments.

To that end, working with partners Motorola and Opera, IBM has submitted a specification for Multimodal Access to the World Wide Web
Consortium (W3C). The specification, XHTML+VoiceXML, would allow users to access data on devices through multiple modes of

“Multimodal is the mixing of voice and data,” Beck said. “People operate in multiple modes at once.”

The technology allows users to use multiple input and output methods simultaneously, including stylus, touch screen, keypad,
keyboard and voice. For instance, the technology would allow a user to request stock information from a wireless PDA by voice, and
receive the information as a chart.

The company has also put together a number of prototypes to display its ideas.

One, Meeting Miner, is an agent used during meetings to passively capture and analyze meeting discussion. It also has the capability
of becoming an active participant in the meeting when it finds information it determines to be pertinent to the discussion. Meeting
Miner uses the audio streams from one or more microphones to capture the speech during the meeting and converts it to a text

Another prototype has been dubbed ePIM, and is intended to give users access to their Personal Information Management tools and data
through unified, anytime/anywhere access. The ePIM prototype provides voice mail for the Lotus Notes inbox; notification of e-mail,
voice mail and calendaring to a cell phone or pager; voice-enabled natural language interface to Notes messages and calendar through
a phone call; and WAP/HDML access to Notes inbox, calendar and address book.

Finally, to show off voice technology’s utility in security applications, IBM has created a speech biometrics prototype that uses
voice print match together with knowledge-based verification via a conversational interface to determine the identity of a user or
to authenticate a claimed identity.
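
The article does not say how the prototype weighs the voice print match against the knowledge-based check. As a rough illustration only, the Python sketch below assumes the simplest possible rule: accept a claimed identity only when both checks pass. The threshold value, the similarity score scale and every name in the code are hypothetical and are not drawn from IBM's system.

```python
from dataclasses import dataclass

# Hypothetical cutoff on a 0-to-1 similarity scale; a real system
# would tune this on its own enrollment and impostor data.
VOICEPRINT_THRESHOLD = 0.80


@dataclass
class VerificationResult:
    accepted: bool
    reason: str


def verify_claimed_identity(voiceprint_score, answers, expected_answers,
                            threshold=VOICEPRINT_THRESHOLD):
    """Accept only if the voice matches AND every answer is correct.

    voiceprint_score: similarity between the caller's voice and the
        enrolled voice print (assumed to be supplied by a speech engine).
    answers: what the caller said in reply to each question.
    expected_answers: the answers on file for the claimed identity.
    """
    if voiceprint_score < threshold:
        return VerificationResult(False, "voice print mismatch")
    for question, expected in expected_answers.items():
        given = answers.get(question, "")
        # Compare case-insensitively, ignoring surrounding whitespace.
        if given.strip().lower() != expected.strip().lower():
            return VerificationResult(False, f"wrong answer: {question}")
    return VerificationResult(True, "voice print and knowledge checks passed")


on_file = {"mother's maiden name": "Smith"}
result = verify_claimed_identity(0.91, {"mother's maiden name": "smith "}, on_file)
print(result.accepted, "-", result.reason)
```

Requiring both factors means a recorded voice alone, or a stolen answer alone, is not enough to pass. A production system would more likely combine the two scores probabilistically than apply a hard rule like this one.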
