Reply by VelociChicken January 6, 2009
> HardySpicer wrote:
>> On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
>>> In the bad old days of LPC Speech synthesis, the best we could hope
>>> for was a robotic sounding voice. Now however it would often be hard
>>> to tell the difference between real speech and synthesised. I am
>>> guessing they use real speech samples - would this be right? The
>>> voices use huge amounts of disk space.
>>>
>>> H.
>>
>> Well, AT & T are the best along with Cepstral voices. But this one is
>> maybe the best of all
>>
>> http://www.cereproc.com/demo.html
>>
>> H
>
> Although I recognized many words, I did not find that intelligible enough
> to get the gist of the paragraph. Compare it to Microsoft's text-to-speech
> or better yet, http://www.thescottishvoice.org.uk/Home/index.php
>
> Jerry
That's a good Scottish voice indeed - but why can't they get rid of those sharp glitches? Is it just down to the hours spent on chopping the voice up and the expertise of the analyst? I can't help feeling that it can be made glitch free somehow...
Reply by Jerry Avins January 5, 2009
HardySpicer wrote:
> On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
>> In the bad old days of LPC Speech synthesis, the best we could hope
>> for was a robotic sounding voice. Now however it would often be hard
>> to tell the difference between real speech and synthesised. I am
>> guessing they use real speech samples - would this be right? The
>> voices use huge amounts of disk space.
>>
>> H.
>
> Well, AT & T are the best along with Cepstral voices. But this one is
> maybe the best of all
>
> http://www.cereproc.com/demo.html
>
> H

Although I recognized many words, I did not find that intelligible enough to get the gist of the paragraph. Compare it to Microsoft's text-to-speech or better yet, http://www.thescottishvoice.org.uk/Home/index.php

Jerry
--
Engineering is the art of making what you want from things you can get.
Reply by banton January 4, 2009
For the speech synthesis backend "mbrola", there exist
several TTS frontends, including singing ones and frontends that
attempt to include emotions.  The quality of the results varies.
It's definitely not a wonder-weapon, but the system is really
fun to play with since you can enter the phonemes and the prosodic
information yourself.  Even without a frontend it's fairly easy
to make it sing.

http://tcts.fpms.ac.be/synthesis/mbrola.html
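
To give an idea of what "entering the phonemes and the prosodic information yourself" can look like, here is a rough sketch that writes text in the .pho style mbrola reads: one phoneme per line, a duration in milliseconds, then optional (position %, pitch Hz) pairs. The helper names and the "l"/"a" phoneme symbols are my own assumptions for illustration; the real phoneme set depends on the voice database you load, so check its documentation first.

```python
# Sketch only: builds .pho-style text for a sung phrase.
# Phoneme symbols ("l", "a") are placeholders; each voice database
# defines its own phoneme set. "_" is used here as the silence symbol.

def midi_to_hz(note):
    """Convert a MIDI note number to a frequency in Hz (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def pho_line(phoneme, duration_ms, pitch_hz=None):
    """One line: phoneme, duration in ms, optional (position %, pitch Hz) pairs."""
    if pitch_hz is None:
        return f"{phoneme} {duration_ms}"
    # Pin the same pitch at 0% and 100% of the phoneme to hold the note flat.
    return f"{phoneme} {duration_ms} 0 {pitch_hz:.0f} 100 {pitch_hz:.0f}"

def sing(syllables):
    """syllables: list of (consonant or None, vowel, MIDI note, vowel duration in ms)."""
    lines = ["_ 100"]  # leading silence
    for consonant, vowel, note, duration_ms in syllables:
        if consonant:
            lines.append(pho_line(consonant, 60))  # short unpitched consonant
        lines.append(pho_line(vowel, duration_ms, midi_to_hz(note)))
    lines.append("_ 100")  # trailing silence
    return "\n".join(lines) + "\n"

# "la la la" on C4, D4, E4; the text would be saved to a file and fed to mbrola.
melody = sing([("l", "a", 60, 400), ("l", "a", 62, 400), ("l", "a", 64, 800)])
```

Holding the pitch flat over each vowel is what makes it sound sung rather than spoken; a speech frontend would instead give each phoneme a moving pitch contour.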


Reply by HardySpicer January 2, 2009
On Jan 3, 3:48 am, ste...@coppice.org wrote:
> On Jan 2, 5:00 pm, HardySpicer <gyansor...@gmail.com> wrote:
>
> > On Jan 2, 7:12 pm, "steveu" <ste...@coppice.org> wrote:
> >
> > > >On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> > > >> In the bad old days of LPC Speech synthesis, the best we could hope
> > > >> for was a robotic sounding voice. Now however it would often be hard
> > > >> to tell the difference between real speech and synthesised. I am
> > > >> guessing they use real speech samples - would this be right? The
> > > >> voices use huge amounts of disk space.
> > > >>
> > > >> H.
> > > >
> > > >Well, AT & T are the best along with Cepstral voices. But this one is
> > > >maybe the best of all
> > > >
> > > >http://www.cereproc.com/demo.html
> > >
> > > I wonder which AT&T synthesiser you mean. There is more than one. The
> > > version that went through various ownerships, and ended up as a Nuance
> > > product seems the most widely used, and can sound pretty good. The cereproc
> > > demo sounds rather unpleasant, although it has reasonable clarity. Did you
> > > listen to the studio recording by mistake? They seems to include that to
> > > fool the unwary. :-\
> > >
> > > If you have a TTS voice which is 200MB to 300MB long, it will probably be
> > > a concatenative synthesiser. These select "best fit" units of recorded
> > > speech, apply pitch shifting, to get better emphasis, and blends the units
> > > together. The result can sound very natural, but the clarity can be poor.
> > > If the voice says something like an address, where context provides no help
> > > in discriminating words, the effectiveness of these synthesisers can be
> > > poor.
> > >
> > > If you have a TTS voice which is less than 1M is will probably be a true
> > > synthesiser, based on something like the old Klatt synthesiser. These all
> > > seem to sound rather robotic, but can achieve great clarity. If the voice
> > > says something like an address, where context provides no help in
> > > discriminating words, these are generally the best.
> > >
> > > The latest synthesisers, from people like Cepstral, seem to have voices in
> > > the 10's of MB range. They appear to require far less studio recording that
> > > the purely concatenative synthesizers. They seem to be use some hybrid
> > > approaches.
> > >
> > > Most of the commercial synthesisers can be traced back to the Speech
> > > centre at Edinburgh University, and the Festival speech synthesizer they
> > > produced. Cepstral and AT&T are amongst those. It looks like Cereproc may
> > > be too.
> > >
> > > Regards,
> > > Steve
> >
> > I've tried them all and they are all pretty good compared with the old
> > fashioned LPC robotic. Agreed that none are perfect yet, but we live
> > in hope! I thought they used recorded speech - explains a lot. Size is
> > not as much a matter as it once was 20 -30 years ago or more when all
> > this stuff got going. I can imagine that personalities will be the
> > next thing from actors etc and voices with attitude.
> >
> > H.
>
> If you Google, you'll find quite a lot on adding emotion to TTS. I
> don't know of a commercial product that adds such a feature, though.
> Singing TTS is another fun research area.
>
> Steve
I tried the singing with no success so far!
Reply by January 2, 2009
On Jan 2, 5:00 pm, HardySpicer <gyansor...@gmail.com> wrote:
> On Jan 2, 7:12 pm, "steveu" <ste...@coppice.org> wrote:
>
> > >On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> > >> In the bad old days of LPC Speech synthesis, the best we could hope
> > >> for was a robotic sounding voice. Now however it would often be hard
> > >> to tell the difference between real speech and synthesised. I am
> > >> guessing they use real speech samples - would this be right? The
> > >> voices use huge amounts of disk space.
> > >>
> > >> H.
> > >
> > >Well, AT & T are the best along with Cepstral voices. But this one is
> > >maybe the best of all
> > >
> > >http://www.cereproc.com/demo.html
> >
> > I wonder which AT&T synthesiser you mean. There is more than one. The
> > version that went through various ownerships, and ended up as a Nuance
> > product seems the most widely used, and can sound pretty good. The cereproc
> > demo sounds rather unpleasant, although it has reasonable clarity. Did you
> > listen to the studio recording by mistake? They seems to include that to
> > fool the unwary. :-\
> >
> > If you have a TTS voice which is 200MB to 300MB long, it will probably be
> > a concatenative synthesiser. These select "best fit" units of recorded
> > speech, apply pitch shifting, to get better emphasis, and blends the units
> > together. The result can sound very natural, but the clarity can be poor.
> > If the voice says something like an address, where context provides no help
> > in discriminating words, the effectiveness of these synthesisers can be
> > poor.
> >
> > If you have a TTS voice which is less than 1M is will probably be a true
> > synthesiser, based on something like the old Klatt synthesiser. These all
> > seem to sound rather robotic, but can achieve great clarity. If the voice
> > says something like an address, where context provides no help in
> > discriminating words, these are generally the best.
> >
> > The latest synthesisers, from people like Cepstral, seem to have voices in
> > the 10's of MB range. They appear to require far less studio recording that
> > the purely concatenative synthesizers. They seem to be use some hybrid
> > approaches.
> >
> > Most of the commercial synthesisers can be traced back to the Speech
> > centre at Edinburgh University, and the Festival speech synthesizer they
> > produced. Cepstral and AT&T are amongst those. It looks like Cereproc may
> > be too.
> >
> > Regards,
> > Steve
>
> I've tried them all and they are all pretty good compared with the old
> fashioned LPC robotic. Agreed that none are perfect yet, but we live
> in hope! I thought they used recorded speech - explains a lot. Size is
> not as much a matter as it once was 20 -30 years ago or more when all
> this stuff got going. I can imagine that personalities will be the
> next thing from actors etc and voices with attitude.
>
> H.

If you Google, you'll find quite a lot on adding emotion to TTS. I don't know of a commercial product that adds such a feature, though. Singing TTS is another fun research area.

Steve
Reply by HardySpicer January 2, 2009
On Jan 2, 7:12 pm, "steveu" <ste...@coppice.org> wrote:
> >On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> >> In the bad old days of LPC Speech synthesis, the best we could hope
> >> for was a robotic sounding voice. Now however it would often be hard
> >> to tell the difference between real speech and synthesised. I am
> >> guessing they use real speech samples - would this be right? The
> >> voices use huge amounts of disk space.
> >>
> >> H.
> >
> >Well, AT & T are the best along with Cepstral voices. But this one is
> >maybe the best of all
> >
> >http://www.cereproc.com/demo.html
>
> I wonder which AT&T synthesiser you mean. There is more than one. The
> version that went through various ownerships, and ended up as a Nuance
> product seems the most widely used, and can sound pretty good. The cereproc
> demo sounds rather unpleasant, although it has reasonable clarity. Did you
> listen to the studio recording by mistake? They seems to include that to
> fool the unwary. :-\
>
> If you have a TTS voice which is 200MB to 300MB long, it will probably be
> a concatenative synthesiser. These select "best fit" units of recorded
> speech, apply pitch shifting, to get better emphasis, and blends the units
> together. The result can sound very natural, but the clarity can be poor.
> If the voice says something like an address, where context provides no help
> in discriminating words, the effectiveness of these synthesisers can be
> poor.
>
> If you have a TTS voice which is less than 1M is will probably be a true
> synthesiser, based on something like the old Klatt synthesiser. These all
> seem to sound rather robotic, but can achieve great clarity. If the voice
> says something like an address, where context provides no help in
> discriminating words, these are generally the best.
>
> The latest synthesisers, from people like Cepstral, seem to have voices in
> the 10's of MB range. They appear to require far less studio recording that
> the purely concatenative synthesizers. They seem to be use some hybrid
> approaches.
>
> Most of the commercial synthesisers can be traced back to the Speech
> centre at Edinburgh University, and the Festival speech synthesizer they
> produced. Cepstral and AT&T are amongst those. It looks like Cereproc may
> be too.
>
> Regards,
> Steve
I've tried them all and they are all pretty good compared with the old-fashioned robotic LPC voices. Agreed that none are perfect yet, but we live in hope! I thought they used recorded speech - that explains a lot. Size matters far less than it did 20-30 years ago or more, when all this stuff got going. I can imagine that personalities will be the next thing, from actors etc., and voices with attitude.

H.
Reply by steveu January 2, 2009
>On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
>> In the bad old days of LPC Speech synthesis, the best we could hope
>> for was a robotic sounding voice. Now however it would often be hard
>> to tell the difference between real speech and synthesised. I am
>> guessing they use real speech samples - would this be right? The
>> voices use huge amounts of disk space.
>>
>> H.
>
>Well, AT & T are the best along with Cepstral voices. But this one is
>maybe the best of all
>
>http://www.cereproc.com/demo.html
I wonder which AT&T synthesiser you mean. There is more than one. The version that went through various ownerships and ended up as a Nuance product seems the most widely used, and can sound pretty good. The cereproc demo sounds rather unpleasant, although it has reasonable clarity. Did you listen to the studio recording by mistake? They seem to include that to fool the unwary. :-\

If you have a TTS voice which is 200MB to 300MB in size, it will probably be a concatenative synthesiser. These select "best fit" units of recorded speech, apply pitch shifting to get better emphasis, and blend the units together. The result can sound very natural, but the clarity can be poor. If the voice says something like an address, where context provides no help in discriminating words, the effectiveness of these synthesisers can be poor.

If you have a TTS voice which is less than 1MB, it will probably be a true synthesiser, based on something like the old Klatt synthesiser. These all seem to sound rather robotic, but can achieve great clarity. If the voice says something like an address, where context provides no help in discriminating words, these are generally the best.

The latest synthesisers, from people like Cepstral, seem to have voices in the tens of MB range. They appear to require far less studio recording than the purely concatenative synthesisers. They seem to use some hybrid approach.

Most of the commercial synthesisers can be traced back to the Speech centre at Edinburgh University, and the Festival speech synthesiser they produced. Cepstral and AT&T are amongst those. It looks like Cereproc may be too.

Regards,
Steve
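
As a toy illustration of the "blend the units together" step, here is a minimal sketch of joining two recorded units with a linear crossfade instead of butting them end to end. It is not how any particular commercial synthesiser does it: the function name and the plain-list sample format are my assumptions, and real systems would normally also align the join to pitch periods, which this does not attempt.

```python
# Sketch only: linear crossfade between two units held as lists of samples.
# A hard join (a + b) can leave a step discontinuity that is heard as a click;
# overlapping the ends and ramping the weights smooths that step out.

def crossfade_join(a, b, overlap):
    """Join a and b, blending the last `overlap` samples of a with the first of b."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return list(a) + list(b)  # nothing to blend: plain concatenation
    head = list(a[:len(a) - overlap])   # part of a left untouched
    tail = list(b[overlap:])            # part of b left untouched
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)     # weight of b ramps up across the overlap
        blended.append((1.0 - w) * a[len(a) - overlap + i] + w * b[i])
    return head + blended + tail

def largest_step(samples):
    """Largest jump between neighbouring samples, a crude measure of a click."""
    return max(abs(y - x) for x, y in zip(samples, samples[1:]))

unit_a = [1.0] * 8   # stand-in for the end of one recorded unit
unit_b = [0.0] * 8   # stand-in for the start of the next
hard = unit_a + unit_b
smooth = crossfade_join(unit_a, unit_b, 4)
```

With these stand-in units the hard join has a full-scale step between neighbouring samples, while the crossfaded version spreads it over the overlap. The price is that the joined signal is shorter by the overlap length, so unit durations need adjusting to compensate.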
Reply by VelociChicken January 1, 2009
"HardySpicer" <gyansorova@gmail.com> wrote in message 
news:00e5a02a-c75a-47fe-b247-497a8bf39057@r37g2000prr.googlegroups.com...
On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> In the bad old days of LPC Speech synthesis, the best we could hope
> for was a robotic sounding voice. Now however it would often be hard
> to tell the difference between real speech and synthesised. I am
> guessing they use real speech samples - would this be right? The
> voices use huge amounts of disk space.
>
> H.

>Well, AT & T are the best along with Cepstral voices. But this one is
>maybe the best of all
>
>http://www.cereproc.com/demo.html
^ ^ ^ ^ ^ ^
Be aware that the first sample on that page is the 'studio recording' of the woman used for the voice; the second sample down is the actual text-to-speech thingy.

They all sound like they've got a mouthful of bread and nails to me. Why can't they move properly between speech sections? I guess it's a bit of a lost art these days, as all the time and money's gone into recognition.

See also 'comp.speech.research' - it seems really quiet, but most questions are replied to.

Dave
Reply by HardySpicer January 1, 2009
On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> In the bad old days of LPC Speech synthesis, the best we could hope
> for was a robotic sounding voice. Now however it would often be hard
> to tell the difference between real speech and synthesised. I am
> guessing they use real speech samples - would this be right? The
> voices use huge amounts of disk space.
>
> H.

Well, AT & T are the best along with Cepstral voices. But this one is maybe the best of all

http://www.cereproc.com/demo.html

H
Reply by Vladimir Vassilevsky January 1, 2009

Stefan Reuther wrote:

> Vladimir Vassilevsky wrote:
>
>>The MS Windows built-in text to speech sounds very unnatural. However
>>the modern GPS navigators have a pretty good voice. Since they can spell
>>the street names, that should be a combination of the text to speech,
>>scripting and the pre-recorded phrases. The total storage is only a few
>>gigs; that includes the software and the maps as well. The CPU power is
>>very limited, too.
>
>
> Street names aside, GPS nav devices have a very limited set of sentences
> they can say. "Turn left/right in 200 metres", and that's about 90% of
> what it says. Those I have looked at or coded for have a repertoire of a
> few hundred phrases, yielding a megabyte or two when compressed. Most
> phrases are only needed in one intonation variant. For example, numbers
> may always appear before a unit, never in an end position.
>
> Regarding street names, they appear in fixed grammar positions as well
> ("turn right into the <street name>"), so you don't need too much
> intonation variation for a phoneme-based synthesis (IIRC, you can get
> street names not just spelled with latin letters from a nav database,
> but also as phonemes). In an older system, I've even seen pre-recorded
> city names. This probably took an enormous amount of storage, hence they
> had a dozen speakers for regular instructions, but just one for city
> names, and happily mixed these two within sentences :-)
I guess 99% of the streets can be covered by a fixed dictionary of a few hundred common names. For the US, that could be something like: First, Second, Memorial, Broadway, Washington, etc. I wonder how it would pronounce street names of historical or foreign origin. Having all of the names pronounced properly would take an enormous amount of work and a lot of storage space.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
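
The fixed-dictionary idea, combined with the fixed grammar slots mentioned in the quoted post, could be sketched roughly as below: a pre-recorded carrier phrase, then a street name that comes from a recording if it is in the dictionary and from synthesis otherwise. Everything here (the clip identifiers, the tiny name list, the function names) is hypothetical and only meant to show the lookup-with-fallback shape, not how any real navigator is built.

```python
# Sketch only: plan a spoken prompt from recorded clips plus a synthesis fallback.
# RECORDED_NAMES stands in for the "few hundred common names" dictionary.

RECORDED_NAMES = {"first", "second", "memorial", "broadway", "washington"}

def plan_prompt(street_name, recorded=RECORDED_NAMES):
    """Return the (source, item) steps for 'turn right into <street name>'."""
    steps = [("recorded", "turn_right_into")]  # hypothetical carrier-phrase clip
    key = street_name.strip().lower()
    if key in recorded:
        steps.append(("recorded", key))        # play the stored clip
    else:
        # Unknown name: hand the text (or phonemes from the map data) to a synthesiser.
        steps.append(("synthesised", street_name.strip()))
    return steps

def coverage(street_names, recorded=RECORDED_NAMES):
    """Fraction of the given names that can be served from recordings."""
    if not street_names:
        return 0.0
    hits = sum(1 for name in street_names if name.strip().lower() in recorded)
    return hits / len(street_names)
```

Running `coverage` over the street names in a real map database would be one way to check the 99% guess; the names that miss the dictionary are exactly the historical and foreign ones that need either phoneme data or a general synthesiser.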