> HardySpicer wrote:
>> On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
>>> In the bad old days of LPC Speech synthesis, the best we could hope
>>> for was a robotic sounding voice. Now however it would often be hard
>>> to tell the difference between real speech and synthesised. I am
>>> guessing they use real speech samples - would this be right? The
>>> voices use huge amounts of disk space.
>>>
>>> H.
>>
>> Well, AT & T are the best along with Cepstral voices. But this one is
>> maybe the best of all
>>
>> http://www.cereproc.com/demo.html
>>
>> H
>
> Although I recognized many words, I did not find that intelligible enough
> to get the gist of the paragraph. Compare it to Microsoft's text-to-speech
> or better yet, http://www.thescottishvoice.org.uk/Home/index.php
>
> Jerry
That's a good Scottish voice indeed - but why can't they get rid of those
sharp glitches? Is it just down to the hours spent on chopping the voice up
and the expertise of the analyst? I can't help feeling that it can be made
glitch free somehow...
Reply by Jerry Avins●January 5, 2009
HardySpicer wrote:
> On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
>> In the bad old days of LPC Speech synthesis, the best we could hope
>> for was a robotic sounding voice. Now however it would often be hard
>> to tell the difference between real speech and synthesised. I am
>> guessing they use real speech samples - would this be right? The
>> voices use huge amounts of disk space.
>>
>> H.
>
> Well, AT & T are the best along with Cepstral voices. But this one is
> maybe the best of all
>
> http://www.cereproc.com/demo.html
>
> H
Although I recognized many words, I did not find that intelligible
enough to get the gist of the paragraph. Compare it to Microsoft's
text-to-speech or better yet,
http://www.thescottishvoice.org.uk/Home/index.php
Jerry
--
Engineering is the art of making what you want from things you can get.
Reply by banton●January 4, 2009
For the speech synthesis backend "mbrola", there are
several TTS frontends, including singing ones and frontends that
attempt to add emotion. The quality of the results varies.
It's definitely not a wonder weapon, but the system is really
fun to play with, since you can enter the phonemes and the prosodic
information yourself. Even without a frontend it's fairly easy
to make it sing.
http://tcts.fpms.ac.be/synthesis/mbrola.html
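
To give an idea of what "entering the phonemes and prosody yourself" looks like, here is a minimal sketch in Python that writes mbrola-style .pho input: one phoneme per line, a duration in ms, then optional (position %, pitch Hz) pairs. The phoneme symbols and the little melody are placeholders of my own; the symbols that actually work depend on the voice database you load.

```python
# Sketch: build .pho text where each phoneme is held at a sung pitch.
# Phoneme symbols are assumed SAMPA-style placeholders, not taken from
# any particular voice database.

def midi_to_hz(note):
    """Convert a MIDI note number to frequency in Hz (note 69 = A4 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def to_pho(events):
    """events: list of (phoneme, duration_ms, midi_note or None).

    Each output line is 'phoneme duration [position% pitchHz]...'.
    A flat (sung) pitch is requested by giving the same target at 0% and 100%.
    """
    lines = ["; generated sketch, not tied to a specific voice"]
    for phoneme, duration_ms, note in events:
        if note is None:
            lines.append(f"{phoneme} {duration_ms}")
        else:
            hz = round(midi_to_hz(note))
            lines.append(f"{phoneme} {duration_ms} 0 {hz} 100 {hz}")
    return "\n".join(lines) + "\n"

melody = [
    ("_", 100, None),  # leading silence
    ("l", 80, 60),
    ("A", 400, 60),    # "la" held on C4
    ("l", 80, 64),
    ("A", 400, 64),    # "la" held on E4
    ("_", 100, None),  # trailing silence
]
pho_text = to_pho(melody)
```

The resulting text is saved to a file and handed to mbrola together with a voice database. Stretching the vowel durations and pinning the pitch targets to note frequencies is all the "singing" amounts to here.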
Reply by HardySpicer●January 2, 2009
On Jan 3, 3:48 am, ste...@coppice.org wrote:
> On Jan 2, 5:00 pm, HardySpicer <gyansor...@gmail.com> wrote:
>
>
>
> > On Jan 2, 7:12 pm, "steveu" <ste...@coppice.org> wrote:
>
> > > >On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> > > >> In the bad old days of LPC Speech synthesis, the best we could hope
> > > >> for was a robotic sounding voice. Now however it would often be hard
> > > >> to tell the difference between real speech and synthesised. I am
> > > >> guessing they use real speech samples - would this be right? The
> > > >> voices use huge amounts of disk space.
>
> > > >> H.
>
> > > >Well, AT & T are the best along with Cepstral voices. But this one is
> > > >maybe the best of all
>
> > > >http://www.cereproc.com/demo.html
>
> > > I wonder which AT&T synthesiser you mean. There is more than one. The
> > > version that went through various ownerships, and ended up as a Nuance
> > > product seems the most widely used, and can sound pretty good. The cereproc
> > > demo sounds rather unpleasant, although it has reasonable clarity. Did you
> > > listen to the studio recording by mistake? They seems to include that to
> > > fool the unwary. :-\
>
> > > If you have a TTS voice which is 200MB to 300MB long, it will probably be
> > > a concatenative synthesiser. These select "best fit" units of recorded
> > > speech, apply pitch shifting, to get better emphasis, and blends the units
> > > together. The result can sound very natural, but the clarity can be poor.
> > > If the voice says something like an address, where context provides no help
> > > in discriminating words, the effectiveness of these synthesisers can be
> > > poor.
>
> > > If you have a TTS voice which is less than 1M is will probably be a true
> > > synthesiser, based on something like the old Klatt synthesiser. These all
> > > seem to sound rather robotic, but can achieve great clarity. If the voice
> > > says something like an address, where context provides no help in
> > > discriminating words, these are generally the best.
>
> > > The latest synthesisers, from people like Cepstral, seem to have voices in
> > > the 10's of MB range. They appear to require far less studio recording that
> > > the purely concatenative synthesizers. They seem to be use some hybrid
> > > approaches.
>
> > > Most of the commercial synthesisers can be traced back to the Speech
> > > centre at Edinburgh University, and the Festival speech synthesizer they
> > > produced. Cepstral and AT&T are amongst those. It looks like Cereproc may
> > > be too.
>
> > > Regards,
> > > Steve
>
> > I've tried them all and they are all pretty good compared with the old
> > fashioned LPC robotic. Agreed that none are perfect yet, but we live
> > in hope! I thought they used recorded speech - explains a lot. Size is
> > not as much a matter as it once was 20 -30 years ago or more when all
> > this stuff got going. I can imagine that personalities will be the
> > next thing from actors etc and voices with attitude.
>
> > H.
>
> If you Google, you'll find quite a lot on adding emotion to TTS. I
> don't know of a commercial product that adds such a feature, though.
> Singing TTS is another fun research area.
>
> Steve
I tried the singing with no success so far!
Reply by ●January 2, 2009
On Jan 2, 5:00 pm, HardySpicer <gyansor...@gmail.com> wrote:
> On Jan 2, 7:12 pm, "steveu" <ste...@coppice.org> wrote:
>
>
>
> > >On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> > >> In the bad old days of LPC Speech synthesis, the best we could hope
> > >> for was a robotic sounding voice. Now however it would often be hard
> > >> to tell the difference between real speech and synthesised. I am
> > >> guessing they use real speech samples - would this be right? The
> > >> voices use huge amounts of disk space.
>
> > >> H.
>
> > >Well, AT & T are the best along with Cepstral voices. But this one is
> > >maybe the best of all
>
> > >http://www.cereproc.com/demo.html
>
> > I wonder which AT&T synthesiser you mean. There is more than one. The
> > version that went through various ownerships, and ended up as a Nuance
> > product seems the most widely used, and can sound pretty good. The cereproc
> > demo sounds rather unpleasant, although it has reasonable clarity. Did you
> > listen to the studio recording by mistake? They seems to include that to
> > fool the unwary. :-\
>
> > If you have a TTS voice which is 200MB to 300MB long, it will probably be
> > a concatenative synthesiser. These select "best fit" units of recorded
> > speech, apply pitch shifting, to get better emphasis, and blends the units
> > together. The result can sound very natural, but the clarity can be poor.
> > If the voice says something like an address, where context provides no help
> > in discriminating words, the effectiveness of these synthesisers can be
> > poor.
>
> > If you have a TTS voice which is less than 1M is will probably be a true
> > synthesiser, based on something like the old Klatt synthesiser. These all
> > seem to sound rather robotic, but can achieve great clarity. If the voice
> > says something like an address, where context provides no help in
> > discriminating words, these are generally the best.
>
> > The latest synthesisers, from people like Cepstral, seem to have voices in
> > the 10's of MB range. They appear to require far less studio recording that
> > the purely concatenative synthesizers. They seem to be use some hybrid
> > approaches.
>
> > Most of the commercial synthesisers can be traced back to the Speech
> > centre at Edinburgh University, and the Festival speech synthesizer they
> > produced. Cepstral and AT&T are amongst those. It looks like Cereproc may
> > be too.
>
> > Regards,
> > Steve
>
> I've tried them all and they are all pretty good compared with the old
> fashioned LPC robotic. Agreed that none are perfect yet, but we live
> in hope! I thought they used recorded speech - explains a lot. Size is
> not as much a matter as it once was 20 -30 years ago or more when all
> this stuff got going. I can imagine that personalities will be the
> next thing from actors etc and voices with attitude.
>
> H.
If you Google, you'll find quite a lot on adding emotion to TTS. I
don't know of a commercial product that adds such a feature, though.
Singing TTS is another fun research area.
Steve
Reply by HardySpicer●January 2, 2009
On Jan 2, 7:12 pm, "steveu" <ste...@coppice.org> wrote:
> >On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> >> In the bad old days of LPC Speech synthesis, the best we could hope
> >> for was a robotic sounding voice. Now however it would often be hard
> >> to tell the difference between real speech and synthesised. I am
> >> guessing they use real speech samples - would this be right? The
> >> voices use huge amounts of disk space.
>
> >> H.
>
> >Well, AT & T are the best along with Cepstral voices. But this one is
> >maybe the best of all
>
> >http://www.cereproc.com/demo.html
>
> I wonder which AT&T synthesiser you mean. There is more than one. The
> version that went through various ownerships, and ended up as a Nuance
> product seems the most widely used, and can sound pretty good. The cereproc
> demo sounds rather unpleasant, although it has reasonable clarity. Did you
> listen to the studio recording by mistake? They seems to include that to
> fool the unwary. :-\
>
> If you have a TTS voice which is 200MB to 300MB long, it will probably be
> a concatenative synthesiser. These select "best fit" units of recorded
> speech, apply pitch shifting, to get better emphasis, and blends the units
> together. The result can sound very natural, but the clarity can be poor.
> If the voice says something like an address, where context provides no help
> in discriminating words, the effectiveness of these synthesisers can be
> poor.
>
> If you have a TTS voice which is less than 1M is will probably be a true
> synthesiser, based on something like the old Klatt synthesiser. These all
> seem to sound rather robotic, but can achieve great clarity. If the voice
> says something like an address, where context provides no help in
> discriminating words, these are generally the best.
>
> The latest synthesisers, from people like Cepstral, seem to have voices in
> the 10's of MB range. They appear to require far less studio recording that
> the purely concatenative synthesizers. They seem to be use some hybrid
> approaches.
>
> Most of the commercial synthesisers can be traced back to the Speech
> centre at Edinburgh University, and the Festival speech synthesizer they
> produced. Cepstral and AT&T are amongst those. It looks like Cereproc may
> be too.
>
> Regards,
> Steve
I've tried them all and they are all pretty good compared with the
old-fashioned robotic LPC voices. Agreed that none are perfect yet, but we
live in hope! I thought they used recorded speech - explains a lot. Size
doesn't matter as much as it did 20-30 years ago or more, when all
this stuff got going. I can imagine that personalities will be the
next thing, from actors etc., and voices with attitude.
H.
Reply by steveu●January 2, 2009
>On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
>> In the bad old days of LPC Speech synthesis, the best we could hope
>> for was a robotic sounding voice. Now however it would often be hard
>> to tell the difference between real speech and synthesised. I am
>> guessing they use real speech samples - would this be right? The
>> voices use huge amounts of disk space.
>>
>> H.
>
>Well, AT & T are the best along with Cepstral voices. But this one is
>maybe the best of all
>
>http://www.cereproc.com/demo.html
I wonder which AT&T synthesiser you mean. There is more than one. The
version that went through various ownerships, and ended up as a Nuance
product seems the most widely used, and can sound pretty good. The cereproc
demo sounds rather unpleasant, although it has reasonable clarity. Did you
listen to the studio recording by mistake? They seem to include that to
fool the unwary. :-\
If you have a TTS voice which is 200MB to 300MB in size, it will probably be
a concatenative synthesiser. These select "best fit" units of recorded
speech, apply pitch shifting to get better emphasis, and blend the units
together. The result can sound very natural, but the clarity can be poor.
If the voice says something like an address, where context provides no help
in discriminating words, the effectiveness of these synthesisers can be
poor.
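
To make the "blends the units together" step concrete, here is a minimal Python sketch of joining two recorded units with a short crossfade instead of a butt splice. It is only an illustration under my own assumptions: the units are toy sine segments, the 10 ms overlap is arbitrary, and real systems also pick units by a matching cost and usually work pitch-synchronously (PSOLA-style), none of which is shown.

```python
import math

def crossfade_join(a, b, overlap):
    """Join two sample lists, blending the last `overlap` samples of `a`
    with the first `overlap` samples of `b` using a raised-cosine fade."""
    if overlap <= 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must be between 1 and the shorter unit length")
    head = a[:len(a) - overlap]
    tail = b[overlap:]
    blended = []
    for i in range(overlap):
        # w rises smoothly from near 0 to near 1 across the overlap
        w = 0.5 - 0.5 * math.cos(math.pi * (i + 0.5) / overlap)
        blended.append((1.0 - w) * a[len(a) - overlap + i] + w * b[i])
    return head + blended + tail

def max_step(x):
    """Largest sample-to-sample jump; a crude proxy for an audible click."""
    return max(abs(x[i + 1] - x[i]) for i in range(len(x) - 1))

# Two toy "units": 100 Hz sine segments at 8 kHz with mismatched phase
rate = 8000
unit_a = [math.sin(2 * math.pi * 100 * n / rate) for n in range(400)]
unit_b = [math.sin(2 * math.pi * 100 * n / rate + 2.0) for n in range(400)]

hard = unit_a + unit_b                     # butt splice: jump at the join
soft = crossfade_join(unit_a, unit_b, 80)  # 10 ms crossfade at 8 kHz
```

Comparing max_step(hard) with max_step(soft) shows the splice discontinuity disappearing. The crossfade removes the click, but it cannot fix a mismatch in pitch or spectrum between the two units, which is presumably where much of the remaining roughness comes from.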
If you have a TTS voice which is less than 1MB, it will probably be a true
synthesiser, based on something like the old Klatt synthesiser. These all
seem to sound rather robotic, but can achieve great clarity. If the voice
says something like an address, where context provides no help in
discriminating words, these are generally the best.
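
For a feel of why these voices are so small, here is a rough Python sketch of the core of a Klatt-style formant synthesiser: a pulse train passed through a cascade of second-order digital resonators, one per formant. The formant frequencies and bandwidths below are illustrative values for a vaguely "ah"-like vowel that I picked for the example, not figures from any product, and a real synthesiser adds a proper glottal source, noise sources and time-varying parameters.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, rate):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalises the gain at 0 Hz to 1
    return a, b, c

def resonate(x, freq_hz, bw_hz, rate):
    """Run the samples in x through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, rate)
    y1 = y2 = 0.0
    out = []
    for sample in x:
        y = a * sample + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synth_vowel(formants, f0=110.0, rate=16000, seconds=0.5):
    """Impulse train at roughly f0, shaped by a cascade of resonators."""
    period = int(rate / f0)
    n = int(rate * seconds)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, rate)
    return signal

# (centre frequency Hz, bandwidth Hz) per formant; illustrative values only
AH = [(700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)]
vowel = synth_vowel(AH)
```

The whole "voice" for this vowel is six numbers plus a pitch, which is why a rule-driven synthesiser fits in well under a megabyte, and also why it sounds buzzy and robotic.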
The latest synthesisers, from people like Cepstral, seem to have voices in
the tens of MB range. They appear to require far less studio recording than
the purely concatenative synthesizers. They seem to use some hybrid
approaches.
Most of the commercial synthesisers can be traced back to the Speech
centre at Edinburgh University, and the Festival speech synthesizer they
produced. Cepstral and AT&T are amongst those. It looks like Cereproc may
be too.
Regards,
Steve
Reply by VelociChicken●January 1, 2009
"HardySpicer" <gyansorova@gmail.com> wrote in message
news:00e5a02a-c75a-47fe-b247-497a8bf39057@r37g2000prr.googlegroups.com...
On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> In the bad old days of LPC Speech synthesis, the best we could hope
> for was a robotic sounding voice. Now however it would often be hard
> to tell the difference between real speech and synthesised. I am
> guessing they use real speech samples - would this be right? The
> voices use huge amounts of disk space.
>
> H.
^ ^ ^ ^ ^ ^
Be aware that the first sample on that page is the 'studio recording' of the
woman used for the voice, the second sample down is the actual
text-to-speech thingy.
They all sound like they've got a mouth full of bread and nails to me. Why
can't they move properly between speech sections?
I guess it's a bit of a lost art these days, as all the time and money's
gone into recognition.
See also 'comp.speech.research' - it seems really quiet, but most questions
are replied to.
Dave
Reply by HardySpicer●January 1, 2009
On Jan 1, 9:04 am, HardySpicer <gyansor...@gmail.com> wrote:
> In the bad old days of LPC Speech synthesis, the best we could hope
> for was a robotic sounding voice. Now however it would often be hard
> to tell the difference between real speech and synthesised. I am
> guessing they use real speech samples - would this be right? The
> voices use huge amounts of disk space.
>
> H.
Reply by Vladimir Vassilevsky●January 1, 2009
Stefan Reuther wrote:
> Vladimir Vassilevsky wrote:
>
>>The MS Windows built-in text to speech sounds very unnatural. However
>>the modern GPS navigators have a pretty good voice. Since they can spell
>>the street names, that should be a combination of the text to speech,
>>scripting and the pre-recorded phrases. The total storage is only a few
>>gigs; that includes the software and the maps as well. The CPU power is
>>very limited, too.
>
>
> Street names aside, GPS nav devices have a very limited set of sentences
> they can say. "Turn left/right in 200 metres", and that's about 90% of
> what it says. Those I have looked at or coded for have a repertoire of a
> few hundred phrases, yielding a megabyte or two when compressed. Most
> phrases are only needed in one intonation variant. For example, numbers
> may always appear before a unit, never in an end position.
>
> Regarding street names, they appear in fixed grammar positions as well
> ("turn right into the <street name>"), so you don't need too much
> intonation variation for a phoneme-based synthesis (IIRC, you can get
> street names not just spelled with latin letters from a nav database,
> but also as phonemes). In an older system, I've even seen pre-recorded
> city names. This probably took an enormous amount of storage, hence they
> had a dozen speakers for regular instructions, but just one for city
> names, and happily mixed these two within sentences :-)
I guess 99% of streets can be covered by a fixed dictionary of a
few hundred common names. For the US, that could be something like:
First, Second, Memorial, Broadway, Washington, etc. I wonder how it
would pronounce street names of historical or foreign origin. Having
all of the names pronounced properly would take an enormous
amount of work and a lot of storage space.
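
The scheme I have in mind could be sketched roughly like this in Python: a fixed carrier phrase, a small dictionary of pre-recorded common names, and a fallback to phoneme synthesis (or plain letters) for everything else. All the names, file paths and the phoneme string are hypothetical, made up for the illustration; I have no idea how any actual navigator organises this.

```python
# Hypothetical tables: a real device would load these from its voice pack.
RECORDED_PHRASES = {
    "turn_right_into": "phrases/turn_right_into.wav",
    "turn_left_into": "phrases/turn_left_into.wav",
}
RECORDED_NAMES = {
    "first": "names/first.wav",
    "second": "names/second.wav",
    "memorial": "names/memorial.wav",
    "broadway": "names/broadway.wav",
    "washington": "names/washington.wav",
}

def plan_prompt(direction, street_name, phonemes=None):
    """Return a list of (kind, payload) segments for one spoken instruction."""
    segments = [("recording", RECORDED_PHRASES[f"turn_{direction}_into"])]
    key = street_name.strip().lower()
    if key in RECORDED_NAMES:
        segments.append(("recording", RECORDED_NAMES[key]))  # common name
    elif phonemes:
        segments.append(("phonemes", phonemes))  # phonemes from the map database
    else:
        segments.append(("letters", street_name))  # last resort: synthesise from text
    return segments

def coverage(street_names):
    """Fraction of the given names that the fixed dictionary covers."""
    if not street_names:
        return 0.0
    hits = sum(1 for n in street_names if n.strip().lower() in RECORDED_NAMES)
    return hits / len(street_names)
```

For example, plan_prompt("right", "Broadway") yields two recordings, while a name outside the dictionary falls through to the phoneme or letter path. Running coverage() over a real street list would show whether my 99% guess is anywhere near right.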
Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com