DSPRelated.com
Forums

Speech Synthesis

Started by HardySpicer December 31, 2008
On Jan 2, 5:00 pm, HardySpicer <gyansor...@gmail.com> wrote:
> On Jan 2, 7:12 pm, "steveu" <ste...@coppice.org> wrote:
> > > On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> > > > In the bad old days of LPC speech synthesis, the best we could hope
> > > > for was a robotic sounding voice. Now however it would often be hard
> > > > to tell the difference between real speech and synthesised. I am
> > > > guessing they use real speech samples - would this be right? The
> > > > voices use huge amounts of disk space.
> > > >
> > > > H.
> > >
> > > Well, AT & T are the best along with Cepstral voices. But this one is
> > > maybe the best of all
> > >
> > > http://www.cereproc.com/demo.html
> >
> > I wonder which AT&T synthesiser you mean. There is more than one. The
> > version that went through various ownerships, and ended up as a Nuance
> > product, seems the most widely used, and can sound pretty good. The
> > cereproc demo sounds rather unpleasant, although it has reasonable
> > clarity. Did you listen to the studio recording by mistake? They seem
> > to include that to fool the unwary. :-\
> >
> > If you have a TTS voice which is 200MB to 300MB in size, it will
> > probably be a concatenative synthesiser. These select "best fit" units
> > of recorded speech, apply pitch shifting to get better emphasis, and
> > blend the units together. The result can sound very natural, but the
> > clarity can be poor. If the voice says something like an address, where
> > context provides no help in discriminating words, the effectiveness of
> > these synthesisers can be poor.
> >
> > If you have a TTS voice which is less than 1MB, it will probably be a
> > true synthesiser, based on something like the old Klatt synthesiser.
> > These all seem to sound rather robotic, but can achieve great clarity.
> > If the voice says something like an address, where context provides no
> > help in discriminating words, these are generally the best.
> >
> > The latest synthesisers, from people like Cepstral, seem to have voices
> > in the 10's of MB range. They appear to require far less studio
> > recording than the purely concatenative synthesisers. They seem to use
> > some hybrid approaches.
> >
> > Most of the commercial synthesisers can be traced back to the Speech
> > centre at Edinburgh University, and the Festival speech synthesiser
> > they produced. Cepstral and AT&T are amongst those. It looks like
> > Cereproc may be too.
> >
> > Regards,
> > Steve
>
> I've tried them all and they are all pretty good compared with the old
> fashioned LPC robotic. Agreed that none are perfect yet, but we live
> in hope! I thought they used recorded speech - explains a lot. Size is
> not as much a matter as it once was 20-30 years ago or more when all
> this stuff got going. I can imagine that personalities will be the
> next thing from actors etc and voices with attitude.
>
> H.
If you Google, you'll find quite a lot on adding emotion to TTS. I don't know of a commercial product that adds such a feature, though. Singing TTS is another fun research area.

Steve
On Jan 3, 3:48 am, ste...@coppice.org wrote:
> If you Google, you'll find quite a lot on adding emotion to TTS. I
> don't know of a commercial product that adds such a feature, though.
> Singing TTS is another fun research area.
>
> Steve
I tried the singing with no success so far!
For the speech synthesis backend "mbrola", there exist
several tts frontends including singing ones and frontends that
attempt to include emotions.  The quality of the results varies.
It's definitely not a wonder-weapon, but the system is really
fun to play with since you can enter the phonemes and the prosodic
information yourself.  Even without a frontend it's fairly easy
to make it sing. 

http://tcts.fpms.ac.be/synthesis/mbrola.html
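
To give an idea of what "entering the phonemes and the prosodic information yourself" can look like, here is a rough sketch that writes out mbrola-style .pho lines for a little sung phrase. As far as I know each .pho line is a phoneme name, a duration in ms, and then optional (position %, pitch Hz) pairs, with "_" for silence. The helper names (midi_to_hz, pho_lines), the melody, and the phoneme symbols "l" and "a" are made up for illustration; the symbols that actually work depend on the voice database you use.

```python
def midi_to_hz(note: int) -> float:
    """Equal-tempered frequency in Hz for a MIDI note number (69 = A4 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def pho_lines(phones):
    """Build .pho lines from (phoneme, duration_ms, midi_note_or_None) tuples.

    A phone with a note gets two pitch targets, at 0% and 100% of its
    duration, so the pitch is held flat like a sung note. A phone with
    None gets no pitch targets at all.
    """
    lines = ["_ 100"]  # short leading silence
    for phoneme, dur_ms, note in phones:
        if note is None:
            lines.append(f"{phoneme} {dur_ms}")
        else:
            hz = round(midi_to_hz(note))
            lines.append(f"{phoneme} {dur_ms} 0 {hz} 100 {hz}")
    lines.append("_ 100")  # short trailing silence
    return lines


# Hypothetical "la la la" on C4, D4, E4; phoneme names are assumptions.
melody = [
    ("l", 80, None), ("a", 420, 60),
    ("l", 80, None), ("a", 420, 62),
    ("l", 80, None), ("a", 420, 64),
]

for line in pho_lines(melody):
    print(line)
```

Save the printed lines to a file such as song.pho and hand it to mbrola together with a voice database; check the mbrola documentation for the exact command line. Holding the pitch flat over each vowel is what makes it sound sung rather than spoken.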


HardySpicer wrote:
> On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
>> In the bad old days of LPC Speech synthesis, the best we could hope
>> for was a robotic sounding voice. Now however it would often be hard
>> to tell the difference between real speech and synthesised. I am
>> guessing they use real speech samples - would this be right? The
>> voices use huge amounts of disk space.
>>
>> H.
>
> Well, AT & T are the best along with Cepstral voices. But this one is
> maybe the best of all
>
> http://www.cereproc.com/demo.html
>
> H
Although I recognized many words, I did not find that intelligible enough to get the gist of the paragraph. Compare it to Microsoft's text-to-speech or better yet, http://www.thescottishvoice.org.uk/Home/index.php

Jerry
--
Engineering is the art of making what you want from things you can get.
> Although I recognized many words, I did not find that intelligible enough
> to get the gist of the paragraph. Compare it to Microsoft's text-to-speech
> or better yet, http://www.thescottishvoice.org.uk/Home/index.php
>
> Jerry
That's a good Scottish voice indeed - but why can't they get rid of those sharp glitches? Is it just down to the hours spent on chopping the voice up and the expertise of the analyst? I can't help feeling that it can be made glitch free somehow...
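
For what it's worth, here is a toy illustration of where a join glitch can come from and why "blending the units together" helps. It is only a sketch under my own assumptions: two pure 200 Hz tones stand in for recorded speech units, the second deliberately out of phase with the first, and the blend is a plain linear crossfade. The function names and numbers are invented for the example; real concatenative synthesisers presumably do something more refined, such as joining pitch-synchronously.

```python
import math

RATE = 16000  # sample rate in Hz, chosen arbitrarily for the sketch


def tone(freq_hz, n, phase=0.0):
    """n samples of a sine wave, standing in for a recorded speech unit."""
    return [math.sin(2 * math.pi * freq_hz * i / RATE + phase) for i in range(n)]


def hard_join(a, b):
    """Butt the two units together with no blending."""
    return a + b


def crossfade_join(a, b, overlap):
    """Overlap the last `overlap` samples of a with the first of b,
    fading a out and b in linearly."""
    head, tail = a[:-overlap], a[-overlap:]
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b, rising from near 0 to near 1
        mixed.append(tail[i] * (1 - w) + b[i] * w)
    return head + mixed + b[overlap:]


def max_jump(x):
    """Largest sample-to-sample step; a big value is heard as a click."""
    return max(abs(x[i + 1] - x[i]) for i in range(len(x) - 1))


unit_a = tone(200, 400)
unit_b = tone(200, 400, phase=math.pi / 2)  # same pitch, mismatched phase

hard = hard_join(unit_a, unit_b)
smooth = crossfade_join(unit_a, unit_b, overlap=160)

print("hard join max step:     ", max_jump(hard))
print("crossfade join max step:", max_jump(smooth))
```

The hard join leaves a step of about one full amplitude at the boundary, which is the sort of thing that sounds like a sharp click, while the crossfade keeps every step close to what the tones have on their own. With real speech the units also differ in pitch, spectrum and level, so I would guess the residual glitches are mostly down to how well the units were picked and cut.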