Synthesized speech: More human by the day

Voice synthesis has come a long way since HAL 9000

When film critic Roger Ebert, who lost his voice following complications from thyroid cancer in 2006, unveiled a prototype of his newly synthesized voice on "The Oprah Winfrey Show" a year ago, and again this March at the TED conference in Long Beach, reports described it as "miraculous," "experimental" and "amazing."

But ever since the robot HAL 9000 spoke its first words in the movie "2001: A Space Odyssey" four decades ago, most people have known computers can speak. And when Stephen Hawking started using a voice synthesizer in the 1980s, we saw that humans could use computers to voice their thoughts. So it's not so much the technology behind Ebert's new voice, dubbed "Roger Jr.," that amazes us -- it's the finesse with which it's being used. Although the voice is synthetic, it sounds human.

"In the past 30 or 40 years, people have been trying to make computers speak so that they are understandable," said Alan Black, associate professor at Carnegie Mellon University's Language Technologies Institute. "Over the past 20 years, researchers have refined technology so a synthesized voice is colored by things like age, gender or tone of voice. Within the past 15 years, we've wanted to create the voice of a particular person."

The latter is what the Scottish speech synthesis research company CereProc has been working on for Ebert. Before this, the company had already experimented extensively with synthesizing George W. Bush's voice.

"Voice clones are a natural progression of our technology," Matthew Aylett, CereProc's chief technical officer, told TechNewsDaily, referring to speech synthesis.

Also, voice banking has been around for more than a decade, said Carnegie Mellon's Black, and this, too, enables synthesis of a particular -- rather than generic -- voice.

"Someone who's about to lose their voice for medical reasons can record their speech in advance," Black said.

For this, patients receive a transcript of about 10,000 prompts to read aloud. The prompts need to contain all the speech sounds used in present-day English -- including rarer sounds like "oy," a soft "j" or sounds that appear in words of foreign origin (nasal vowels, for example). Each sound also needs to appear in multiple phonetic environments -- for example, the t-sound in "cat," "stop" and "button" -- and in both function and content words. The software records and labels these sounds based on the original transcript and combines them into new words as needed.
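
The coverage requirement described above can be made concrete with a small sketch. This is only an illustration, not CereProc's or Carnegie Mellon's software: the tiny pronunciation table, the sound labels and the function names are all assumptions made up for the example. It records which neighboring sounds each sound has been captured with, then reports the sounds that still lack enough distinct environments.

```python
from collections import defaultdict

# Hypothetical toy pronunciations (an assumption for illustration only).
# A real system would use a full pronouncing dictionary and aligned audio.
PRONUNCIATIONS = {
    "cat": ["k", "ae", "t"],
    "stop": ["s", "t", "aa", "p"],
    "button": ["b", "ah", "t", "ah", "n"],
    "toy": ["t", "oy"],
}


def contexts_by_sound(words):
    """Map each sound to the set of (previous, next) neighbors it occurs with."""
    contexts = defaultdict(set)
    for word in words:
        # "#" marks a word boundary on either side of the word.
        sounds = ["#"] + PRONUNCIATIONS[word] + ["#"]
        for prev, cur, nxt in zip(sounds, sounds[1:], sounds[2:]):
            contexts[cur].add((prev, nxt))
    return contexts


def undercovered(words, inventory, min_contexts=2):
    """Return the sounds in `inventory` seen in fewer than `min_contexts` environments."""
    contexts = contexts_by_sound(words)
    return sorted(s for s in inventory if len(contexts[s]) < min_contexts)


if __name__ == "__main__":
    recorded = ["cat", "stop", "button"]
    # The t-sound is covered in three environments; "oy" has not been recorded at all.
    print(sorted(contexts_by_sound(recorded)["t"]))
    print(undercovered(recorded, ["t", "oy", "k"]))
```

In this toy run, adding the prompt "toy" would supply the missing "oy" sound, which is the kind of gap a prompt list for voice banking is designed to close in advance.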

Ebert, however, didn't have the luxury of foresight. His voice disappeared unexpectedly. For CereProc, this meant using found -- rather than targeted -- data, and sifting through enough material to find all necessary speech sounds. Although in Ebert's case, recorded material existed abundantly in the form of movie reviews and DVD commentaries, most of it wasn't ideal for voice synthesis, said Aylett.

"A lot of the recordings contain multiple speakers, background music, audiences, different recording studios -- all these things are a real problem," Aylett said.

CereProc worked mainly with original versions of Ebert's DVD commentaries from movies such as "Casablanca," "Citizen Kane" and "Beyond the Valley of the Dolls," devoid of soundtrack or film audio. But even then, a lot of the material was unusable, said Aylett. "Roger knows these films so well he can more or less just riff on them and a lot of this material is too spontaneous to use for synthesis."

Although using this found data posed a challenge, the concept wasn't new.

"What's being done for Ebert is not completely novel," Carnegie Mellon's Black said. In the late 1990s, for example, Black worked with a Japanese company to synthesize Bill Clinton's voice from found data ("I think we had him speaking Japanese," Black recalled), and when CereProc synthesized Bush's voice several years ago, they used found data from his presidential speeches.

But one big difference is that Bush didn't have a need to use the technology, whereas Ebert clearly does. "Bush's voice was different because the objective was different," said Aylett. "Bush didn't have to be that good."

In contrast, the synthesized voice of "Roger Jr.," which will be heard on segments of Ebert's weekly movie review program "At the Movies," must be engaging to viewers.

"A lot of what makes us who we are rests in our voices," said Aylett. So "if your synthetic voice has character, it will be more engaging."

And it's this aspect of "Roger Jr." that CereProc continues to finesse. After typing in text, Ebert can fine-tune the synthesized speech for things like pronunciation, stress, pitch, reduction and more. He can thereby make the voice more "Roger Ebert" -- and thus more human.
