comp.dsp | Speech Synthesis

In the bad old days of LPC Speech synthesis, the best we could hope
for was a robotic sounding voice. Now however it would often be hard
to tell the difference between real speech and synthesised. I am
guessing they use real speech samples - would this be right? The
voices use huge amounts of disk space.

H.

Reply by Tim Wescott ●January 1, 2009

On Wed, 31 Dec 2008 12:04:51 -0800, HardySpicer wrote:

> In the bad old days of LPC Speech synthesis, the best we could hope for
> was a robotic sounding voice. Now however it would often be hard to tell
> the difference between real speech and synthesised. I am guessing they
> use real speech samples - would this be right? The voices use huge
> amounts of disk space.
> 
> H.

I'm not sure what you're including under the rubric of "synthesis", but 
quite a while ago (nearly 20 years!) I worked with a company doing voice 
response stuff (you know "press 1 to enter a big loop that will get you 
back here, press 2 to enter a small loop that will get you back here, to 
hear a busy signal please press 0").

All of that was phrases, recorded from a script (and took a huge amount 
of disk space), but only numbers were "synthesized", and even that 
sounded clunky without careful editing and a lot of work on the part of 
the recording engineer and the voice talent.

When you're speaking naturally the sound of the end of one word is 
colored by the sound of the word that comes next.  Just putting together 
random words sounds "chopped up"; putting together words that are spoken 
individually sounds angry, robotic, or both.  While there have certainly 
been advances in this since I was doing it, I doubt that you could do it 
without either storing several different versions of each word, or 
without some algorithm that did the "coloring".
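
For illustration only (this is not from the voice-response system described above): the crudest way to soften the "chopped up" joins is a short linear crossfade between adjacent recordings. The function name and the list-of-floats sample representation are assumptions of this sketch. It only removes clicks at the boundary; it does nothing about the "coloring" of one word by the next.

```python
def crossfade_concat(first, second, overlap):
    """Join two sample lists, linearly crossfading over `overlap` samples."""
    if overlap < 0 or overlap > min(len(first), len(second)):
        raise ValueError("overlap must fit inside both clips")
    if overlap == 0:
        # Plain butt-splice: this is the join that tends to sound chopped up.
        return list(first) + list(second)

    head = list(first[:-overlap])   # untouched part of the first clip
    tail = list(second[overlap:])   # untouched part of the second clip
    blended = []
    for i in range(overlap):
        # Weight ramps from the first clip towards the second clip.
        w = (i + 1) / (overlap + 1)
        a = first[len(first) - overlap + i]
        b = second[i]
        blended.append((1.0 - w) * a + w * b)
    return head + blended + tail
```

The result is `overlap` samples shorter than the two clips laid end to end, because the overlapping region is shared.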

Are you sure you aren't just listening to recordings of scripted phrases?

-- 

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Do you need to implement control loops in software?
"Applied Control Theory for Embedded Systems" gives you just what it says.
See details at http://www.wescottdesign.com/actfes/actfes.html

Reply by Vladimir Vassilevsky ●January 1, 2009


Tim Wescott wrote:
> On Wed, 31 Dec 2008 12:04:51 -0800, HardySpicer wrote:
> 
>>In the bad old days of LPC Speech synthesis, the best we could hope for
>>was a robotic sounding voice. Now however it would often be hard to tell
>>the difference between real speech and synthesised. I am guessing they
>>use real speech samples - would this be right? The voices use huge
>>amounts of disk space.
>>
>>H.
> 
> 
> I'm not sure what you're including under the rubric of "synthesis", but 
> quite a while ago (nearly 20 years!) I worked with a company doing voice 
> response stuff (you know "press 1 to enter a big loop that will get you 
> back here, press 2 to enter a small loop that will get you back here, to 
> hear a busy signal please press 0").
> 
> All of that was phrases, recorded from a script (and took a huge amount 
> of disk space), but only numbers were "synthesized", and even that 
> sounded clunky without careful editing and a lot of work on the part of 
> the recording engineer and the voice talent.
> 
> When you're speaking naturally the sound of the end of one word is 
> colored by the sound of the word that comes next.  Just putting together 
> random words sounds "chopped up"; putting together words that are spoken 
> individually sounds angry, robotic, or both.  While there have certainly 
> been advances in this since I was doing it, I doubt that you could do it 
> without either storing several different versions of each word, or 
> without some algorithm that did the "coloring".
> 
> Are you sure you aren't just listening to recordings of scripted phrases?

The MS Windows built-in text to speech sounds very unnatural. However 
the modern GPS navigators have a pretty good voice. Since they can spell 
the street names, that should be a combination of the text to speech, 
scripting and the pre-recorded phrases. The total storage is only a few 
gigs; that includes the software and the maps as well. The CPU power is 
very limited, too.


Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com

Reply by Rune Allnor ●January 1, 2009

On 1 Jan, 17:43, Tim Wescott <t...@justseemywebsite.com> wrote:
> On Wed, 31 Dec 2008 12:04:51 -0800, HardySpicer wrote:
> > In the bad old days of LPC Speech synthesis, the best we could hope for
> > was a robotic sounding voice.
...
> All of that was phrases, recorded from a script (and took a huge amount
> of disk space), but only numbers were "synthesized", and even that
> sounded clunky without careful editing and a lot of work on the part of
> the recording engineer and the voice talent.
>
> When you're speaking naturally the sound of the end of one word is
> colored by the sound of the word that comes next.  Just putting together
> random words sounds "chopped up"; putting together words that are spoken
> individually sounds angry, robotic, or both.

When I chose my MSc subject in the mid '90s, the acoustics lab had a
speech processing group who did text-to-speech synthesis. While the demo
worked (with a couple of quirks, one of which related to the main
scientist's name, which was just about the only Norwegian word in which a
particular segment was pronounced differently from the other words with the
same segment), the technicians told us - as I remember it - that they had
spent years recording thousands of sounds from each of hundreds of subjects
in the anechoic chamber, to get a reference library for the kinds of sounds
you are talking about.

Rune

Reply by Stefan Reuther ●January 1, 2009

Vladimir Vassilevsky wrote:
> Tim Wescott wrote:
>> When you're speaking naturally the sound of the end of one word is
>> colored by the sound of the word that comes next.  Just putting
>> together random words sounds "chopped up"; putting together words that
>> are spoken individually sounds angry, robotic, or both.  While there
>> have certainly been advances in this since I was doing it, I doubt
>> that you could do it without either storing several different versions
>> of each word, or without some algorithm that did the "coloring".
> 
> The MS Windows built-in text to speech sounds very unnatural. However
> the modern GPS navigators have a pretty good voice. Since they can spell
> the street names, that should be a combination of the text to speech,
> scripting and the pre-recorded phrases. The total storage is only a few
> gigs; that includes the software and the maps as well. The CPU power is
> very limited, too.

Street names aside, GPS nav devices have a very limited set of sentences
they can say. "Turn left/right in 200 metres", and that's about 90% of
what it says. Those I have looked at or coded for have a repertoire of a
few hundred phrases, yielding a megabyte or two when compressed. Most
phrases are only needed in one intonation variant. For example, numbers
may always appear before a unit, never in an end position.
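
A back-of-the-envelope check of the "megabyte or two" figure, as a sketch only. Every number here (300 phrases, 2 seconds each, a 16 kbit/s speech codec, 8 kHz 16-bit PCM as the uncompressed reference) is an assumption chosen for illustration, not a value taken from any particular device.

```python
def compressed_size_mb(num_phrases, avg_seconds, bitrate_kbps):
    """Approximate storage in MB (10**6 bytes) at a constant codec bitrate."""
    total_bits = num_phrases * avg_seconds * bitrate_kbps * 1000
    return total_bits / 8 / 1e6


def raw_pcm_size_mb(num_phrases, avg_seconds, sample_rate_hz, bits_per_sample):
    """Approximate storage in MB (10**6 bytes) for uncompressed mono PCM."""
    total_bits = num_phrases * avg_seconds * sample_rate_hz * bits_per_sample
    return total_bits / 8 / 1e6


# Assumed figures: 300 phrases of 2 s each.
print(compressed_size_mb(300, 2, 16))     # 16 kbit/s codec (assumed)
print(raw_pcm_size_mb(300, 2, 8000, 16))  # 8 kHz, 16-bit PCM (assumed)
```

With these assumed figures the compressed set comes to about 1.2 MB against about 9.6 MB uncompressed, which is in line with "a megabyte or two".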

Regarding street names, they appear in fixed grammar positions as well
("turn right into the <street name>"), so you don't need too much
intonation variation for a phoneme-based synthesis (IIRC, you can get
street names not just spelled with latin letters from a nav database,
but also as phonemes). In an older system, I've even seen pre-recorded
city names. This probably took an enormous amount of storage, hence they
had a dozen speakers for regular instructions, but just one for city
names, and happily mixed these two within sentences :-)
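
A minimal sketch of the fixed-position idea, under stated assumptions: the carrier phrase is a pre-recorded clip, and the street name arrives from the nav database as a phoneme string that is handed to a phoneme-based synthesiser. The clip identifiers, the segment tuples and the phoneme notation are all hypothetical; no real nav database format is implied.

```python
# Hypothetical clip identifiers for the pre-recorded carrier phrases.
CARRIER_CLIPS = {
    "left": "turn_left_into_the",
    "right": "turn_right_into_the",
}


def build_prompt(direction, street_phonemes):
    """Return playback segments for 'turn <direction> into the <street>'."""
    if direction not in CARRIER_CLIPS:
        raise ValueError(f"no recording for direction {direction!r}")
    return [
        # Pre-recorded phrase, needed in only one intonation variant.
        ("clip", CARRIER_CLIPS[direction]),
        # Street name always lands in the same slot of the sentence,
        # so the synthesiser can use a single intonation pattern for it.
        ("phonemes", street_phonemes),
    ]


# Phoneme string in an assumed SAMPA-like notation.
print(build_prompt("right", "m eI n s t r i: t"))
```

A playback layer would then walk the list, playing clips as recorded and passing phoneme segments to the synthesiser.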

  Stefan

Reply by Vladimir Vassilevsky ●January 1, 2009


Stefan Reuther wrote:

> Vladimir Vassilevsky wrote:
> 

>>The MS Windows built-in text to speech sounds very unnatural. However
>>the modern GPS navigators have a pretty good voice. Since they can spell
>>the street names, that should be a combination of the text to speech,
>>scripting and the pre-recorded phrases. The total storage is only a few
>>gigs; that includes the software and the maps as well. The CPU power is
>>very limited, too.
> 
> 
> Street names aside, GPS nav devices have a very limited set of sentences
> they can say. "Turn left/right in 200 metres", and that's about 90% of
> what it says. Those I have looked at or coded for have a repertoire of a
> few hundred phrases, yielding a megabyte or two when compressed. Most
> phrases are only needed in one intonation variant. For example, numbers
> may always appear before a unit, never in an end position.
> 
> Regarding street names, they appear in fixed grammar positions as well
> ("turn right into the <street name>"), so you don't need too much
> intonation variation for a phoneme-based synthesis (IIRC, you can get
> street names not just spelled with latin letters from a nav database,
> but also as phonemes). In an older system, I've even seen pre-recorded
> city names. This probably took an enormous amount of storage, hence they
> had a dozen speakers for regular instructions, but just one for city
> names, and happily mixed these two within sentences :-)

I guess 99% of the streets can be covered by a fixed dictionary of a 
few hundred common names. For the US, that could be something like: 
First, Second, Memorial, Broadway, Washington, etc. I wonder how it would 
pronounce street names of historical or foreign origin. Having all of 
the names pronounced properly would take an enormous amount of work and 
a lot of storage space.
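
Whether a few hundred names really cover 99% is an empirical question. As a sketch of how one might check the guess against a list of street names pulled from a map database (the function name and the toy data below are invented for illustration):

```python
from collections import Counter


def coverage_of_top_names(street_names, dictionary_size):
    """Fraction of street occurrences covered by the most common names."""
    if not street_names:
        return 0.0
    counts = Counter(street_names)
    # Sum the occurrence counts of the `dictionary_size` most frequent names.
    covered = sum(n for _, n in counts.most_common(dictionary_size))
    return covered / len(street_names)


# Toy data only; a real check would use names from an actual map database.
names = ["First"] * 5 + ["Second"] * 3 + ["Memorial", "Zzyzx"]
print(coverage_of_top_names(names, 2))
```

Names that fall outside the dictionary would still need some fallback, such as phoneme-based synthesis.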


Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com

Reply by HardySpicer ●January 1, 2009

On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> In the bad old days of LPC Speech synthesis, the best we could hope
> for was a robotic sounding voice. Now however it would often be hard
> to tell the difference between real speech and synthesised. I am
> guessing they use real speech samples - would this be right? The
> voices use huge amounts of disk space.
>
> H.

Well, AT & T are the best along with Cepstral voices. But this one is
maybe the best of all

http://www.cereproc.com/demo.html

H

Reply by VelociChicken ●January 1, 2009

"HardySpicer" <gyansorova@gmail.com> wrote in message 
news:00e5a02a-c75a-47fe-b247-497a8bf39057@r37g2000prr.googlegroups.com...
On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> In the bad old days of LPC Speech synthesis, the best we could hope
> for was a robotic sounding voice. Now however it would often be hard
> to tell the difference between real speech and synthesised. I am
> guessing they use real speech samples - would this be right? The
> voices use huge amounts of disk space.
>
> H.

>Well, AT & T are the best along with Cepstral voices. But this one is
>maybe the best of all
>
>http://www.cereproc.com/demo.html

        ^        ^        ^        ^        ^        ^
Be aware that the first sample on that page is the 'studio recording' of the 
woman used for the voice; the second sample down is the actual 
text-to-speech thingy.
They all sound like they've got a mouth full of bread and nails to me. Why 
can't they move properly between speech sections?
I guess it's a bit of a lost art these days, as all the time and money's 
gone into recognition.
See also 'comp.speech.research' - it seems really quiet, but most questions 
are replied to.

Dave

Reply by steveu ●January 2, 2009

>On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
>> In the bad old days of LPC Speech synthesis, the best we could hope
>> for was a robotic sounding voice. Now however it would often be hard
>> to tell the difference between real speech and synthesised. I am
>> guessing they use real speech samples - would this be right? The
>> voices use huge amounts of disk space.
>>
>> H.
>
>Well, AT & T are the best along with Cepstral voices. But this one is
>maybe the best of all
>
>http://www.cereproc.com/demo.html

I wonder which AT&T synthesiser you mean. There is more than one. The
version that went through various ownerships, and ended up as a Nuance
product seems the most widely used, and can sound pretty good. The cereproc
demo sounds rather unpleasant, although it has reasonable clarity. Did you
listen to the studio recording by mistake? They seem to include that to
fool the unwary. :-\

If you have a TTS voice which is 200MB to 300MB in size, it will probably be
a concatenative synthesiser. These select "best fit" units of recorded
speech, apply pitch shifting to get better emphasis, and blend the units
together. The result can sound very natural, but the clarity can be poor.
If the voice says something like an address, where context provides no help
in discriminating words, the effectiveness of these synthesisers can be
poor.
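
To make the "best fit" selection concrete, here is a toy sketch of the selection step only (no pitch shifting or blending). It assumes each recorded unit is described by nothing more than a start and end pitch, and it picks one candidate per target by dynamic programming over a target cost plus a join cost. The field names and both cost functions are assumptions made for illustration; real systems use far richer features.

```python
def select_units(targets, inventory, join_weight=1.0):
    """Pick one candidate per target minimising target cost + join cost.

    targets:   list of (unit_name, desired_pitch_hz)
    inventory: dict unit_name -> non-empty list of candidate dicts with
               'id', 'start_pitch', 'end_pitch' (hypothetical fields)
    """
    def target_cost(cand, desired):
        mean_pitch = 0.5 * (cand["start_pitch"] + cand["end_pitch"])
        return abs(mean_pitch - desired)

    def join_cost(prev, cand):
        # Penalise a pitch jump across the boundary between two units.
        return abs(prev["end_pitch"] - cand["start_pitch"])

    if not targets:
        return []

    # best[i][j] = (cheapest total cost ending in candidate j, backpointer)
    best = []
    for i, (name, desired) in enumerate(targets):
        row = []
        for cand in inventory[name]:
            tc = target_cost(cand, desired)
            if i == 0:
                row.append((tc, None))
            else:
                prev_cands = inventory[targets[i - 1][0]]
                cost, back = min(
                    (best[i - 1][k][0] + join_weight * join_cost(p, cand), k)
                    for k, p in enumerate(prev_cands)
                )
                row.append((cost + tc, back))
        best.append(row)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(inventory[targets[i][0]][j]["id"])
        j = best[i][j][1]
    path.reverse()
    return path
```

The join cost is what discourages picking two individually good units that do not fit together, which is one reason larger inventories tend to give smoother output.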

If you have a TTS voice which is less than 1MB, it will probably be a true
synthesiser, based on something like the old Klatt synthesiser. These all
seem to sound rather robotic, but can achieve great clarity. If the voice
says something like an address, where context provides no help in
discriminating words, these are generally the best.
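
For a feel of why such a synthesiser needs so little storage, here is a minimal sketch of its basic building block: a two-pole digital resonator of the kind used for each formant in Klatt-style designs, driven by an impulse train. This is nowhere near a complete synthesiser (no glottal source model, no noise source, no parallel branch), and the formant values below are rough assumed numbers for an "ah"-like vowel.

```python
import math


def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients of a two-pole resonator with unity gain at DC."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c
    return a, b, c


def resonate(samples, freq_hz, bandwidth_hz, sample_rate_hz):
    """Filter samples through y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    a, b, c = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synth_vowel(formants, f0_hz=100.0, duration_s=0.5, sample_rate_hz=10000):
    """Impulse-train source passed through cascaded formant resonators."""
    n = int(duration_s * sample_rate_hz)
    period = int(round(sample_rate_hz / f0_hz))
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate_hz)
    return signal


# Assumed (frequency, bandwidth) pairs in Hz for an "ah"-like vowel.
AH_FORMANTS = [(700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)]
samples = synth_vowel(AH_FORMANTS)
```

The whole "voice" here is three pairs of numbers plus a pitch, which is why this family of synthesisers fits in so little space and also why it sounds mechanical.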

The latest synthesisers, from people like Cepstral, seem to have voices in
the tens of MB range. They appear to require far less studio recording than
the purely concatenative synthesisers. They seem to use some hybrid
approaches.

Most of the commercial synthesisers can be traced back to the Speech
centre at Edinburgh University, and the Festival speech synthesizer they
produced. Cepstral and AT&T are amongst those. It looks like Cereproc may
be too.

Regards,
Steve

Reply by HardySpicer ●January 2, 2009

On Jan 2, 7:12 pm, "steveu" <ste...@coppice.org> wrote:
> >On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> >> In the bad old days of LPC Speech synthesis, the best we could hope
> >> for was a robotic sounding voice. Now however it would often be hard
> >> to tell the difference between real speech and synthesised. I am
> >> guessing they use real speech samples - would this be right? The
> >> voices use huge amounts of disk space.
>
> >> H.
>
> >Well, AT & T are the best along with Cepstral voices. But this one is
> >maybe the best of all
>
> >http://www.cereproc.com/demo.html
>
> I wonder which AT&T synthesiser you mean. There is more than one. The
> version that went through various ownerships, and ended up as a Nuance
> product seems the most widely used, and can sound pretty good. The cereproc
> demo sounds rather unpleasant, although it has reasonable clarity. Did you
> listen to the studio recording by mistake? They seem to include that to
> fool the unwary. :-\
>
> If you have a TTS voice which is 200MB to 300MB in size, it will probably be
> a concatenative synthesiser. These select "best fit" units of recorded
> speech, apply pitch shifting to get better emphasis, and blend the units
> together. The result can sound very natural, but the clarity can be poor.
> If the voice says something like an address, where context provides no help
> in discriminating words, the effectiveness of these synthesisers can be
> poor.
>
> If you have a TTS voice which is less than 1MB, it will probably be a true
> synthesiser, based on something like the old Klatt synthesiser. These all
> seem to sound rather robotic, but can achieve great clarity. If the voice
> says something like an address, where context provides no help in
> discriminating words, these are generally the best.
>
> The latest synthesisers, from people like Cepstral, seem to have voices in
> the tens of MB range. They appear to require far less studio recording than
> the purely concatenative synthesisers. They seem to use some hybrid
> approaches.
>
> Most of the commercial synthesisers can be traced back to the Speech
> centre at Edinburgh University, and the Festival speech synthesizer they
> produced. Cepstral and AT&T are amongst those. It looks like Cereproc may
> be too.
>
> Regards,
> Steve

I've tried them all and they are all pretty good compared with the
old-fashioned robotic LPC. Agreed that none are perfect yet, but we live
in hope! I thought they used recorded speech - explains a lot. Size
matters less than it did 20-30 years ago or more, when all
this stuff got going. I can imagine that personalities will be the
next thing, from actors etc., and voices with attitude.

H.
