DSPRelated.com
Forums

Speech Synthesis

Started by HardySpicer December 31, 2008
In the bad old days of LPC Speech synthesis, the best we could hope
for was a robotic sounding voice. Now however it would often be hard
to tell the difference between real speech and synthesised. I am
guessing they use real speech samples - would this be right? The
voices use huge amounts of disk space.

H.
On Wed, 31 Dec 2008 12:04:51 -0800, HardySpicer wrote:

> In the bad old days of LPC Speech synthesis, the best we could hope for
> was a robotic sounding voice. Now however it would often be hard to tell
> the difference between real speech and synthesised. I am guessing they
> use real speech samples - would this be right? The voices use huge
> amounts of disk space.
>
> H.

I'm not sure what you're including under the rubric of "synthesis", but quite a while ago (nearly 20 years!) I worked with a company doing voice response stuff (you know "press 1 to enter a big loop that will get you back here, press 2 to enter a small loop that will get you back here, to hear a busy signal please press 0").

All of that was phrases, recorded from a script (and took a huge amount of disk space), but only numbers were "synthesized", and even that sounded clunky without careful editing and a lot of work on the part of the recording engineer and the voice talent.

When you're speaking naturally the sound of the end of one word is colored by the sound of the word that comes next. Just putting together random words sounds "chopped up"; putting together words that are spoken individually sounds angry, robotic, or both. While there have certainly been advances in this since I was doing it, I doubt that you could do it without either storing several different versions of each word, or without some algorithm that did the "coloring".

Are you sure you aren't just listening to recordings of scripted phrases?

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Do you need to implement control loops in software?
"Applied Control Theory for Embedded Systems" gives you just what it says.
See details at http://www.wescottdesign.com/actfes/actfes.html

Tim Wescott wrote:
> On Wed, 31 Dec 2008 12:04:51 -0800, HardySpicer wrote:
>
>>In the bad old days of LPC Speech synthesis, the best we could hope for
>>was a robotic sounding voice. Now however it would often be hard to tell
>>the difference between real speech and synthesised. I am guessing they
>>use real speech samples - would this be right? The voices use huge
>>amounts of disk space.
>>
>>H.
>
> I'm not sure what you're including under the rubric of "synthesis", but
> quite a while ago (nearly 20 years!) I worked with a company doing voice
> response stuff (you know "press 1 to enter a big loop that will get you
> back here, press 2 to enter a small loop that will get you back here, to
> hear a busy signal please press 0").
>
> All of that was phrases, recorded from a script (and took a huge amount
> of disk space), but only numbers were "synthesized", and even that
> sounded clunky without careful editing and a lot of work on the part of
> the recording engineer and the voice talent.
>
> When you're speaking naturally the sound of the end of one word is
> colored by the sound of the word that comes next. Just putting together
> random words sounds "chopped up"; putting together words that are spoken
> individually sounds angry, robotic, or both. While there have certainly
> been advances in this since I was doing it, I doubt that you could do it
> without either storing several different versions of each word, or
> without some algorithm that did the "coloring".
>
> Are you sure you aren't just listening to recordings of scripted phrases?

The MS Windows built-in text to speech sounds very unnatural. However the modern GPS navigators have a pretty good voice. Since they can spell the street names, that should be a combination of the text to speech, scripting and the pre-recorded phrases. The total storage is only a few gigs; that includes the software and the maps as well. The CPU power is very limited, too.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com

On 1 Jan, 17:43, Tim Wescott <t...@justseemywebsite.com> wrote:
> On Wed, 31 Dec 2008 12:04:51 -0800, HardySpicer wrote:
> > In the bad old days of LPC Speech synthesis, the best we could hope for
> > was a robotic sounding voice.
...
> All of that was phrases, recorded from a script (and took a huge amount
> of disk space), but only numbers were "synthesized", and even that
> sounded clunky without careful editing and a lot of work on the part of
> the recording engineer and the voice talent.
>
> When you're speaking naturally the sound of the end of one word is
> colored by the sound of the word that comes next. Just putting together
> random words sounds "chopped up"; putting together words that are spoken
> individually sounds angry, robotic, or both.

When I chose my MSc subject in the mid '90s, the acoustics lab had a speech processing group who did text-to-speech synthesis. The demo worked, with a couple of quirks; one of them related to the main scientist's name, which was just about the only Norwegian word in which a particular segment was pronounced differently than in the other words with the same segment. The technicians told us - as I remember it - that they had spent years recording thousands of sounds from each of hundreds of subjects in the anechoic chamber, to get a reference library for the kinds of sounds you are talking about.

Rune

Vladimir Vassilevsky wrote:
> Tim Wescott wrote:
>> When you're speaking naturally the sound of the end of one word is
>> colored by the sound of the word that comes next. Just putting
>> together random words sounds "chopped up"; putting together words that
>> are spoken individually sounds angry, robotic, or both. While there
>> have certainly been advances in this since I was doing it, I doubt
>> that you could do it without either storing several different versions
>> of each word, or without some algorithm that did the "coloring".
>
> The MS Windows built-in text to speech sounds very unnatural. However
> the modern GPS navigators have a pretty good voice. Since they can spell
> the street names, that should be a combination of the text to speech,
> scripting and the pre-recorded phrases. The total storage is only a few
> gigs; that includes the software and the maps as well. The CPU power is
> very limited, too.

Street names aside, GPS nav devices have a very limited set of sentences they can say. "Turn left/right in 200 metres", and that's about 90% of what it says. Those I have looked at or coded for have a repertoire of a few hundred phrases, yielding a megabyte or two when compressed. Most phrases are only needed in one intonation variant. For example, numbers may always appear before a unit, never in an end position.

Regarding street names, they appear in fixed grammar positions as well ("turn right into the <street name>"), so you don't need too much intonation variation for a phoneme-based synthesis (IIRC, you can get street names not just spelled with latin letters from a nav database, but also as phonemes). In an older system, I've even seen pre-recorded city names. This probably took an enormous amount of storage, hence they had a dozen speakers for regular instructions, but just one for city names, and happily mixed these two within sentences :-)

Stefan
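
The fixed-grammar approach Stefan describes can be sketched roughly as follows. This is only an illustration under assumptions: the clip inventory, the clip ids, the phoneme string and the `build_prompt` helper are all hypothetical and not taken from any real navigation product. Fixed phrases map to pre-recorded clips, and only the street-name slot is handed to a phoneme-based synthesiser.

```python
from dataclasses import dataclass

# Hypothetical inventory: fixed phrase text -> id of a pre-recorded clip.
RECORDED = {
    "turn left in": "clip_017",
    "turn right in": "clip_018",
    "turn right into the": "clip_031",
    "200": "clip_num_200",
    "metres": "clip_unit_m",
}


@dataclass
class Segment:
    kind: str     # "clip" for a pre-recorded phrase, "phonemes" for synthesis
    payload: str  # clip id, or a phoneme string from the map database


def build_prompt(parts):
    """Turn a fixed-grammar sentence into a playback plan.

    `parts` is a list whose items are either plain strings (looked up in
    RECORDED) or ("street", phonemes) tuples for the street-name slot.
    """
    plan = []
    for part in parts:
        if isinstance(part, tuple):
            _, phonemes = part
            plan.append(Segment("phonemes", phonemes))
        elif part in RECORDED:
            plan.append(Segment("clip", RECORDED[part]))
        else:
            raise KeyError(f"no recording for phrase: {part!r}")
    return plan


# "turn right into the <street name>", with a made-up phoneme string.
plan = build_prompt(["turn right into the", ("street", "m ey n s t r iy t")])
for seg in plan:
    print(seg.kind, seg.payload)
```

Because each phrase only ever occurs in one grammatical position, one recording per phrase is enough, which is consistent with the small repertoire sizes mentioned above.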

Stefan Reuther wrote:

> Vladimir Vassilevsky wrote:
>
>>The MS Windows built-in text to speech sounds very unnatural. However
>>the modern GPS navigators have a pretty good voice. Since they can spell
>>the street names, that should be a combination of the text to speech,
>>scripting and the pre-recorded phrases. The total storage is only a few
>>gigs; that includes the software and the maps as well. The CPU power is
>>very limited, too.
>
> Street names aside, GPS nav devices have a very limited set of sentences
> they can say. "Turn left/right in 200 metres", and that's about 90% of
> what it says. Those I have looked at or coded for have a repertoire of a
> few hundred phrases, yielding a megabyte or two when compressed. Most
> phrases are only needed in one intonation variant. For example, numbers
> may always appear before a unit, never in an end position.
>
> Regarding street names, they appear in fixed grammar positions as well
> ("turn right into the <street name>"), so you don't need too much
> intonation variation for a phoneme-based synthesis (IIRC, you can get
> street names not just spelled with latin letters from a nav database,
> but also as phonemes). In an older system, I've even seen pre-recorded
> city names. This probably took an enormous amount of storage, hence they
> had a dozen speakers for regular instructions, but just one for city
> names, and happily mixed these two within sentences :-)

I guess 99% of the streets can be covered by a fixed dictionary with a few hundred trivial names. For the US, that could be something like: First, Second, Memorial, Broadway, Washington, etc. I wonder how it would spell street names of historical or foreign origin. Having all of the names spelled in the proper way would take an enormous amount of work and a lot of storage space.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com

On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> In the bad old days of LPC Speech synthesis, the best we could hope
> for was a robotic sounding voice. Now however it would often be hard
> to tell the difference between real speech and synthesised. I am
> guessing they use real speech samples - would this be right? The
> voices use huge amounts of disk space.
>
> H.

Well, AT & T are the best along with Cepstral voices. But this one is maybe the best of all

http://www.cereproc.com/demo.html

H

"HardySpicer" <gyansorova@gmail.com> wrote in message 
news:00e5a02a-c75a-47fe-b247-497a8bf39057@r37g2000prr.googlegroups.com...
On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> In the bad old days of LPC Speech synthesis, the best we could hope
> for was a robotic sounding voice. Now however it would often be hard
> to tell the difference between real speech and synthesised. I am
> guessing they use real speech samples - would this be right? The
> voices use huge amounts of disk space.
>
> H.

>Well, AT & T are the best along with Cepstral voices. But this one is
>maybe the best of all
>
>http://www.cereproc.com/demo.html
^ ^ ^ ^ ^ ^

Be aware that the first sample on that page is the 'studio recording' of the woman used for the voice; the second sample down is the actual text-to-speech thingy. They all sound like they've got a mouth full of bread and nails to me. Why can't they move properly between speech sections? I guess it's a bit of a lost art these days, as all the time and money's gone into recognition.

See also 'comp.speech.research' - it seems really quiet, but most questions are replied to.

Dave

>On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
>> In the bad old days of LPC Speech synthesis, the best we could hope
>> for was a robotic sounding voice. Now however it would often be hard
>> to tell the difference between real speech and synthesised. I am
>> guessing they use real speech samples - would this be right? The
>> voices use huge amounts of disk space.
>>
>> H.
>
>Well, AT & T are the best along with Cepstral voices. But this one is
>maybe the best of all
>
>http://www.cereproc.com/demo.html

I wonder which AT&T synthesiser you mean. There is more than one. The version that went through various ownerships, and ended up as a Nuance product, seems the most widely used, and can sound pretty good. The cereproc demo sounds rather unpleasant, although it has reasonable clarity. Did you listen to the studio recording by mistake? They seem to include that to fool the unwary. :-\

If you have a TTS voice which is 200MB to 300MB in size, it will probably be a concatenative synthesiser. These select "best fit" units of recorded speech, apply pitch shifting to get better emphasis, and blend the units together. The result can sound very natural, but the clarity can be poor. If the voice says something like an address, where context provides no help in discriminating words, the effectiveness of these synthesisers can be poor.

If you have a TTS voice which is less than 1MB, it will probably be a true synthesiser, based on something like the old Klatt synthesiser. These all seem to sound rather robotic, but can achieve great clarity. If the voice says something like an address, where context provides no help in discriminating words, these are generally the best.

The latest synthesisers, from people like Cepstral, seem to have voices in the tens of MB range. They appear to require far less studio recording than the purely concatenative synthesisers. They seem to use some hybrid approach.

Most of the commercial synthesisers can be traced back to the speech centre at Edinburgh University, and the Festival speech synthesizer they produced. Cepstral and AT&T are amongst those. It looks like Cereproc may be too.

Regards,
Steve
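
The "best fit" selection step Steve mentions is often framed as a shortest-path search: each candidate unit has a cost for how well it matches the wanted sound and a cost for how smoothly it joins its neighbour. The sketch below is a generic dynamic-programming version of that idea, not the method of any product named in this thread. The unit tuples, the pitch-only cost functions and the numbers are all made-up assumptions, and the pitch shifting and blending stages are left out.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target, minimising target cost + join cost.

    Dynamic programming (Viterbi-style) over the lattice of candidates.
    """
    # best[i][j] = (cheapest total cost ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, c), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], c), back))
        best.append(row)

    # Trace the cheapest path back from the last position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]


# Hypothetical units: (name, pitch at start in Hz, pitch at end in Hz).
candidates = [
    [("a1", 100, 100), ("a2", 105, 118)],
    [("b1", 120, 120), ("b2", 119, 125)],
]
targets = [100, 120]  # wanted average pitch for each position


def target_cost(wanted, unit):
    return abs(wanted - (unit[1] + unit[2]) / 2)


def join_cost(prev, cur):
    return abs(prev[2] - cur[1])


chosen = select_units(targets, candidates, target_cost, join_cost)
print([u[0] for u in chosen])
```

In this toy data "a1" matches its target exactly but ends 20 Hz away from where "b1" starts, so the search prefers "a2", which joins more smoothly. A larger inventory gives the search more chances of finding units that need little modification, which is one plausible reason the natural-sounding voices are the big ones.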
On Jan 2, 7:12 pm, "steveu" <ste...@coppice.org> wrote:
> >On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> >> In the bad old days of LPC Speech synthesis, the best we could hope
> >> for was a robotic sounding voice. Now however it would often be hard
> >> to tell the difference between real speech and synthesised. I am
> >> guessing they use real speech samples - would this be right? The
> >> voices use huge amounts of disk space.
> >>
> >> H.
> >
> >Well, AT & T are the best along with Cepstral voices. But this one is
> >maybe the best of all
> >
> >http://www.cereproc.com/demo.html
>
> I wonder which AT&T synthesiser you mean. There is more than one. The
> version that went through various ownerships, and ended up as a Nuance
> product, seems the most widely used, and can sound pretty good. The cereproc
> demo sounds rather unpleasant, although it has reasonable clarity. Did you
> listen to the studio recording by mistake? They seem to include that to
> fool the unwary. :-\
>
> If you have a TTS voice which is 200MB to 300MB in size, it will probably be
> a concatenative synthesiser. These select "best fit" units of recorded
> speech, apply pitch shifting to get better emphasis, and blend the units
> together. The result can sound very natural, but the clarity can be poor.
> If the voice says something like an address, where context provides no help
> in discriminating words, the effectiveness of these synthesisers can be
> poor.
>
> If you have a TTS voice which is less than 1MB, it will probably be a true
> synthesiser, based on something like the old Klatt synthesiser. These all
> seem to sound rather robotic, but can achieve great clarity. If the voice
> says something like an address, where context provides no help in
> discriminating words, these are generally the best.
>
> The latest synthesisers, from people like Cepstral, seem to have voices in
> the tens of MB range. They appear to require far less studio recording than
> the purely concatenative synthesisers. They seem to use some hybrid
> approach.
>
> Most of the commercial synthesisers can be traced back to the speech
> centre at Edinburgh University, and the Festival speech synthesizer they
> produced. Cepstral and AT&T are amongst those. It looks like Cereproc may
> be too.
>
> Regards,
> Steve

I've tried them all and they are all pretty good compared with the old-fashioned robotic LPC voices. Agreed that none are perfect yet, but we live in hope! I thought they used recorded speech - explains a lot. Size matters less than it did 20-30 years ago or more, when all this stuff got going. I can imagine that personalities, from actors etc., and voices with attitude will be the next thing.

H.
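
For a feel of what the voice sizes quoted earlier in the thread mean in terms of recorded speech, here is a rough back-of-envelope calculation. The thread does not say how the voices are stored, so the sketch assumes 16 kHz, 16-bit, mono, uncompressed PCM. A real voice database may well be compressed and will also hold labels and indexes, so treat the results as order-of-magnitude only.

```python
def hours_of_speech(size_bytes, sample_rate_hz=16_000, bytes_per_sample=2):
    """Hours of mono, uncompressed PCM audio that fit in size_bytes."""
    seconds = size_bytes / (sample_rate_hz * bytes_per_sample)
    return seconds / 3600


# Voice sizes mentioned in the thread, taking 1 MB as 1,000,000 bytes.
for mb in (1, 30, 200, 300):
    print(f"{mb:>4} MB ~ {hours_of_speech(mb * 1_000_000):.2f} h")
```

Under these assumptions a 200MB voice holds roughly 1.7 hours of raw audio, while 1MB holds only about half a minute, far too little to be a library of recorded units.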